🧠 All Things AI
Advanced

Mechanistic Interpretability — What & Why

Mechanistic interpretability is the scientific program of reverse-engineering neural networks at the algorithmic level — understanding not just what a model outputs but how it arrives there. Rather than treating a model as a black box and measuring its behavior from the outside, mechanistic interpretability opens the box and attempts to read the circuit diagrams inside. As of 2025 it is one of the most active research frontiers in AI safety.

Core Concepts

Three foundational ideas underpin the field. Each one is both a conceptual claim about how neural networks work and a target for empirical investigation.

Features

Directions in a model's activation space that represent human-interpretable concepts. "Banana," "Paris," "negation" — each may correspond to a specific direction in the high-dimensional space of neuron activations. When the model's activation has a large component along that direction, the concept is present in the model's computation.
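
The feature-as-direction idea can be sketched in a few lines of numpy: a "feature" is just a unit vector, and the concept's presence is read off as the projection of an activation onto it. The "Paris" direction here is random and purely illustrative, not taken from any real model:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# A hypothetical "Paris" feature: one unit-norm direction in activation space.
paris = rng.standard_normal(d_model)
paris /= np.linalg.norm(paris)

# An activation in which the concept is strongly present, plus small noise.
activation = 3.0 * paris + 0.1 * rng.standard_normal(d_model)

# Reading the feature: project the activation onto the direction.
strength = activation @ paris
print(strength)  # roughly 3.0, up to small noise
```

A neuron is one basis direction of this space; a feature can be any direction, which is why the two concepts come apart.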

Circuits

Computational subgraphs — specific neurons and attention heads connected by the weight matrix — that implement a particular algorithm. A circuit is to a neural network what a subroutine is to software: a reusable, identifiable computational unit that performs one job.

Universality

The hypothesis that the same circuits emerge independently in different models trained on different data. If true, this would mean there are convergent solutions to common computational problems — and that findings in one model transfer to others.

The Superposition Hypothesis

Neural networks have far more concepts to represent than they have dimensions. A model with 4,096 hidden dimensions might need to track millions of distinct features. The superposition hypothesis (Elhage et al., 2022) explains how this is possible: models store features as non-orthogonal directions in activation space, accepting interference between features that rarely co-occur.

The core insight:

  • If two features are never active at the same time, their directions can be nearly parallel — the interference is never "paid"
  • Models exploit sparsity in the real world: most concepts are irrelevant in any given context
  • This allows exponentially more features than dimensions — but at a cost: individual neurons become polysemantic
  • A polysemantic neuron fires for multiple unrelated concepts simultaneously, making it impossible to interpret neuron-by-neuron

Superposition explains a longstanding puzzle: why are individual neurons in large models so hard to interpret? They are not monosemantic (dedicated to one concept) — they are simultaneously encoding many concepts in superposition. This is why reading neuron activations directly is insufficient for mechanistic understanding.
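
A toy numpy experiment (dimensions chosen arbitrarily for illustration) shows the trade superposition makes: far more feature directions than dimensions, yet a sparse set of active features can still be recovered because the interference between nearly-orthogonal random directions stays small:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, d_model = 512, 128  # 4x more features than dimensions

# Random unit vectors are nearly (but not exactly) orthogonal in high dimensions.
W = rng.standard_normal((n_features, d_model))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Sparse world: only 2 of the 512 features are active in this context.
active = [3, 41]
x = np.zeros(n_features)
x[active] = 1.0

h = x @ W        # superposed activation: the sum of the active directions
readout = W @ h  # dot every feature direction against the activation

# Active features read out near 1; inactive ones pick up only small interference.
top2 = set(np.argsort(readout)[-2:])
print(top2 == set(active))  # True
```

If the two active features were frequently co-active with many others, the interference terms would stack up and the readout would degrade, which is exactly why sparsity is the condition that makes superposition viable.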

Circuits Research — Key Findings

The circuits research program began with convolutional vision models and was extended to transformers. Elhage et al. (2021) "A Mathematical Framework for Transformer Circuits" established the formal vocabulary for analyzing transformers at the circuit level. Several circuits have been discovered and verified:

  • Induction heads: attend to the token that followed the previous occurrence of the current token, implementing pattern completion. Significance: mechanistically responsible for in-context learning; found in virtually all transformers.
  • Previous token heads: attend to the immediately preceding token, creating a shifted copy of the sequence. Significance: supply the input to induction heads as part of a two-layer circuit for few-shot copying.
  • IOI circuit (Wang et al., 2022): implements indirect object identification. In "When Mary and John went to the store, John gave a drink to ___", the circuit identifies Mary. Significance: the first complete reverse-engineering of a circuit for a nontrivial natural-language task in GPT-2 small.
  • Copy suppression: reduces the probability of directly copying a recent token when context calls for a different output. Significance: explains how models avoid over-copying in in-context completion tasks.
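
The induction-head algorithm is simple enough to state as plain code. This sketch implements the pattern-completion rule in Python over token strings; it is the function the circuit computes, not a model of the attention mechanism itself:

```python
def induction_complete(tokens):
    """Predict the next token the way an induction head does:
    find the most recent earlier occurrence of the current (last)
    token and return the token that followed it."""
    current = tokens[-1]
    # The previous-token head supplies, at each position, the token one step
    # back; the induction head matches on it and copies the successor forward.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None  # no earlier occurrence: no induction-based prediction

# [A][B] ... [A] -> predict [B]
print(induction_complete(["The", "cat", "sat", "The"]))  # cat
```

This is why induction heads are credited with in-context learning: the rule generalizes any repeated prefix in the prompt into a prediction, with no dependence on training-set statistics for the specific tokens involved.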

Sparse Autoencoders — The Key Tool

If superposition means features are mixed across neurons, how do you unmix them? The leading method as of 2024–2025 is the sparse autoencoder (SAE). An SAE is a shallow neural network trained on frozen model activations with two objectives: reconstruct the activations faithfully, and use a sparse hidden layer to do so.

How SAEs work

  • Encoder maps a residual stream activation (e.g., 4,096-d) to a much larger hidden layer (e.g., 16,384-d)
  • An L1 sparsity penalty forces most hidden units to be zero at any given time
  • Decoder maps the sparse hidden layer back to reconstruct the original activation
  • The few hidden units that are non-zero on a given input correspond to monosemantic features
  • Each feature direction in the SAE corresponds to one interpretable concept
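
The steps above can be sketched as a minimal numpy forward pass and loss, with toy sizes standing in for the 4,096 → 16,384 example; the weights here are random placeholders, whereas a real SAE trains them against millions of frozen activations:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 8, 32  # toy stand-ins for 4,096 and 16,384

W_enc = rng.standard_normal((d_model, d_sae)) * 0.1
b_enc = np.zeros(d_sae)
W_dec = rng.standard_normal((d_sae, d_model)) * 0.1
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode to a wide, ReLU-sparse feature layer, then decode."""
    f = np.maximum(0.0, x @ W_enc + b_enc)  # feature activations (>= 0)
    x_hat = f @ W_dec + b_dec               # reconstruction of the input
    return f, x_hat

def sae_loss(x, l1_coeff=1e-3):
    f, x_hat = sae_forward(x)
    recon = np.mean((x - x_hat) ** 2)  # objective 1: reconstruct faithfully
    sparsity = np.sum(np.abs(f))       # objective 2: L1 keeps most features at 0
    return recon + l1_coeff * sparsity

x = rng.standard_normal(d_model)  # a frozen residual stream activation
f, x_hat = sae_forward(x)
```

The `l1_coeff` knob expresses the core tension: too low and features stay dense and polysemantic; too high and reconstruction suffers because too few features are allowed to fire.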

What Anthropic found

  • Applied SAEs to a one-layer transformer (2023, "Towards Monosemanticity") and extracted thousands of monosemantic features
  • Scaled to Claude 3 Sonnet (2024) with SAEs of up to 34M features — found features for specific people, cities, scientific concepts
  • Found features corresponding to safety-relevant concepts: deception, harm, manipulation, concealment
  • Clamping these features changes model behavior in predictable, causal ways
  • Confirmed features are causally active — not just correlated with outputs

The monosemanticity findings are significant because they confirm that models do learn structured, human-interpretable representations — they are just entangled via superposition. SAEs decompose the superposition, revealing a high-dimensional but legible feature space.

Why It Matters for Safety

Mechanistic interpretability is motivated heavily by AI safety concerns. The core problem is that behavioral evaluations — testing a model on benchmarks — cannot guarantee safe behavior in novel situations. A model could appear aligned on all tested distributions while harboring deceptive patterns that activate under other conditions.

Safety applications

  • Alignment verification: inspect whether the model's internal representation of its goal matches its stated goal
  • Deception detection: look for features representing "user believes X but model believes Y" type patterns
  • Capability auditing: identify what circuits exist before a model is deployed
  • Targeted intervention: clamp or ablate harmful features without full retraining
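
Targeted intervention can be sketched with an SAE decoder: clamp one entry of the sparse feature code before decoding, and the activation moves only along that feature's decoder direction. All names and the feature index here are illustrative, and a real steering setup would operate on a trained SAE inside a running model:

```python
import numpy as np

rng = np.random.default_rng(0)
d_sae, d_model = 32, 8
W_dec = rng.standard_normal((d_sae, d_model))  # placeholder decoder weights

f = np.maximum(0.0, rng.standard_normal(d_sae))  # a sparse feature code

def clamp_feature(f, idx, value):
    """Force one feature to a fixed value before decoding (steering)."""
    f = f.copy()
    f[idx] = value
    return f

FEATURE = 7  # hypothetical index of a harmful feature
steered = clamp_feature(f, FEATURE, 0.0)  # ablate it entirely

delta = (steered - f) @ W_dec  # change in the reconstructed activation
# The edit moves the activation only along that feature's decoder row.
print(np.allclose(delta, (0.0 - f[FEATURE]) * W_dec[FEATURE]))  # True
```

The appeal over retraining is precision: the intervention is a rank-one edit to one interpretable direction, leaving the rest of the computation untouched.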

Current limitations

  • Full mechanistic understanding of even a small transformer is intractable today
  • Most SAE-extracted features are inscrutable — only a small fraction have clear human interpretations
  • Circuit analysis doesn't yet scale to production-size models (70B+ parameters)
  • Progress is slow relative to the rate at which model capabilities are growing
  • Adversarial inputs can activate unexpected circuits not captured by standard analysis

Tools of the Trade

  • TransformerLens: hook into any layer or attention head of GPT-2, Llama, Mistral, etc.; inspect and patch activations. Maintained by Neel Nanda and the open-source community.
  • SAELens: training and analysis library for sparse autoencoders; integrates with TransformerLens. Maintained by Joseph Bloom and the community.
  • Neuronpedia: web interface for browsing SAE features, running steering experiments, and crowd-sourcing feature labels. Maintained by Johnny Lin and the community.
  • Anthropic Interpretability API: access to Claude model internals for research, including activation patching via API. Maintained by Anthropic (research access).

Checklist: Do You Understand This?

  • Can you explain what a "feature" means in the context of mechanistic interpretability — how does it differ from a neuron?
  • What is superposition, and why does it make individual neurons hard to interpret?
  • How does a sparse autoencoder (SAE) decompose superposed features? What is the role of the L1 sparsity penalty?
  • What are induction heads, and what computational function do they implement?
  • Why is behavioral evaluation alone insufficient to verify model alignment? What does mechanistic interpretability add?
  • What are two concrete safety applications of mechanistic interpretability, and what are two current limits of the approach?