Mechanistic Interpretability — What & Why
Mechanistic interpretability is the scientific program of reverse-engineering neural networks at the algorithmic level — understanding not just what a model outputs but how it arrives there. Rather than treating a model as a black box and measuring its behavior from the outside, mechanistic interpretability opens the box and attempts to read the circuit diagrams inside. As of 2025 it is one of the most active research frontiers in AI safety.
Core Concepts
Three foundational ideas underpin the field. Each one is both a conceptual claim about how neural networks work and a target for empirical investigation.
Features
Directions in a model's activation space that represent human-interpretable concepts. "Banana," "Paris," "negation" — each may correspond to a specific direction in the high-dimensional space of neuron activations. When that direction is active, the concept is present in the model's computation.
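As a toy illustration (the "Paris" direction, dimensions, and activations here are all hypothetical, not taken from any real model), a feature-as-direction can be read off an activation vector by projection:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # hypothetical hidden size

# Hypothetical "Paris" feature: a fixed unit direction in activation space.
paris_dir = rng.normal(size=d_model)
paris_dir /= np.linalg.norm(paris_dir)

# An activation where the concept is present vs. one where it is absent.
act_with = 3.0 * paris_dir + 0.1 * rng.normal(size=d_model)
act_without = 0.1 * rng.normal(size=d_model)

def feature_activation(act, direction):
    """Strength of a feature = projection of the activation onto its direction."""
    return float(act @ direction)

print(feature_activation(act_with, paris_dir))     # large positive
print(feature_activation(act_without, paris_dir))  # near zero
```

The same dot-product reading is what makes "direction in activation space" an operational definition rather than a metaphor.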
Circuits
Computational subgraphs — specific neurons and attention heads connected by the weight matrix — that implement a particular algorithm. A circuit is to a neural network what a subroutine is to software: a reusable, identifiable computational unit that performs one job.
Universality
The hypothesis that the same circuits emerge independently in different models trained on different data. If true, this would mean there are convergent solutions to common computational problems — and that findings in one model transfer to others.
The Superposition Hypothesis
Neural networks have far more concepts to represent than they have dimensions. A model with 4,096 hidden dimensions might need to track millions of distinct features. The superposition hypothesis (Elhage et al., 2022) explains how this is possible: models store features as non-orthogonal directions in activation space, accepting interference between features that rarely co-occur.
The core insight:
- If two features are never active at the same time, their directions can be nearly parallel — the interference is never "paid"
- Models exploit sparsity in the real world: most concepts are irrelevant in any given context
- This allows exponentially more features than dimensions — but at a cost: individual neurons become polysemantic
- A polysemantic neuron fires for multiple unrelated concepts simultaneously, making it impossible to interpret neuron-by-neuron
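The trade-off described above can be seen numerically: random directions in a modest-dimensional space are nearly, but not exactly, orthogonal, so many more features than dimensions can coexist at the cost of small interference (a toy sketch with arbitrary sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 64, 1024  # far more features than dimensions

# Random unit directions: an over-complete, non-orthogonal feature basis.
W = rng.normal(size=(n_features, d_model))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Pairwise interference = |cosine similarity| between distinct feature directions.
gram = W @ W.T
interference = np.abs(gram - np.eye(n_features))

print(interference.max())   # worst-case overlap: small but non-zero
print(interference.mean())  # typical overlap, roughly 1/sqrt(d_model)
```

If two features with overlapping directions are never active together, that residual overlap never corrupts the computation, which is exactly the sparsity bet the hypothesis describes.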
Superposition explains a longstanding puzzle: why are individual neurons in large models so hard to interpret? They are not monosemantic (dedicated to one concept) — they are simultaneously encoding many concepts in superposition. This is why reading neuron activations directly is insufficient for mechanistic understanding.
Circuits Research — Key Findings
The circuits research program began with convolutional vision models and was extended to transformers. Elhage et al. (2021) "A Mathematical Framework for Transformer Circuits" established the formal vocabulary for analyzing transformers at the circuit level. Several circuits have been discovered and verified:
| Circuit | What it does | Significance |
|---|---|---|
| Induction heads | Attend to the token following the previous occurrence of the current token — implements pattern completion | Mechanistically responsible for in-context learning; found in virtually all transformers with two or more layers |
| Previous token heads | Attend to the immediately previous token; create a shifted copy of the sequence | Input to induction heads; part of a two-layer circuit for few-shot copying |
| IOI circuit (Wang et al. 2022) | Implements indirect object identification: in "When Mary and John went to the store, John gave a drink to ___" — identifies Mary | First end-to-end reverse-engineering of a complete circuit for a natural-language task in GPT-2 small |
| Copy suppression | Reduces the probability of directly copying a recent token when context calls for a different output | Explains how models avoid over-copying in in-context completion tasks |
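The induction-head algorithm from the table can be written out directly. This toy function (plain Python, not model code) performs the same [A][B] … [A] → [B] pattern completion on a token list:

```python
def induction_prediction(tokens):
    """Toy version of the induction-head algorithm: to predict the next token,
    find the most recent previous occurrence of the current token and copy
    the token that followed it ([A][B] ... [A] -> predict [B])."""
    current = tokens[-1]
    # Scan backward for an earlier occurrence of the current token.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None  # no previous occurrence: induction has nothing to copy

print(induction_prediction(["the", "cat", "sat", "on", "the"]))  # "cat"
```

In a real transformer this is split across two heads: a previous-token head builds the shifted copy, and the induction head matches against it and copies forward.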
Sparse Autoencoders — The Key Tool
If superposition means features are mixed across neurons, how do you unmix them? The leading method as of 2024–2025 is the sparse autoencoder (SAE). An SAE is a shallow neural network trained on frozen model activations with two objectives: reconstruct the activations faithfully, and use a sparse hidden layer to do so.
How SAEs work
- Encoder maps a residual stream activation (e.g., 4,096-d) to a much larger hidden layer (e.g., 16,384-d)
- An L1 sparsity penalty forces most hidden units to be zero at any given time
- Decoder maps the sparse hidden layer back to reconstruct the original activation
- The few hidden units that remain non-zero correspond to monosemantic features
- Each feature direction in the SAE corresponds to one interpretable concept
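A minimal forward-pass sketch of this encode/decode pipeline, using randomly initialized (untrained) weights and scaled-down hypothetical dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 64, 256  # scaled-down stand-ins for e.g. 4,096 / 16,384

# Randomly initialized SAE weights (in practice, trained on frozen activations).
W_enc = 0.05 * rng.normal(size=(d_model, d_hidden))
b_enc = np.zeros(d_hidden)
W_dec = 0.05 * rng.normal(size=(d_hidden, d_model))
b_dec = np.zeros(d_model)

def sae_forward(act):
    """Encode into a wide, nonnegative code, then reconstruct the activation."""
    hidden = np.maximum(act @ W_enc + b_enc, 0.0)  # ReLU keeps the code nonnegative
    return hidden, hidden @ W_dec + b_dec

def sae_loss(act, hidden, recon, l1_coeff=1e-3):
    """Training objective: reconstruction error plus the L1 penalty that buys sparsity."""
    return float(np.sum((act - recon) ** 2) + l1_coeff * np.sum(np.abs(hidden)))

act = rng.normal(size=d_model)    # stand-in for a residual-stream activation
hidden, recon = sae_forward(act)
print(hidden.shape, recon.shape)  # wide code, same-shape reconstruction
```

Note that with random weights the code is not actually sparse; sparsity emerges only from training against the L1 term, which is what pressures each hidden unit toward a single concept.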
What Anthropic found
- Applied SAEs to a one-layer transformer ("Towards Monosemanticity," 2023) and extracted thousands of largely monosemantic features
- Scaled to Claude 3 Sonnet ("Scaling Monosemanticity," 2024), training SAEs with up to 34M features — found features for specific people, cities, and scientific concepts
- Found features corresponding to safety-relevant concepts: deception, harm, manipulation, concealment
- Clamping these features changes model behavior in predictable, causal ways
- Confirmed features are causally active — not just correlated with outputs
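Clamping can be sketched as editing an activation along a feature's direction (the directions here are hypothetical and tied for simplicity; real interventions use trained SAE encoder/decoder weights inside a live forward pass):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Hypothetical SAE feature: the encoder direction reads its value,
# the decoder direction writes it back into the residual stream.
enc_dir = rng.normal(size=d_model)
enc_dir /= np.linalg.norm(enc_dir)
dec_dir = enc_dir.copy()  # simplification: tied encoder/decoder directions

def clamp_feature(act, target):
    """Steer: overwrite the feature's current value along its decoder direction."""
    current = float(act @ enc_dir)  # how active the feature is right now
    return act + (target - current) * dec_dir

act = rng.normal(size=d_model)
steered = clamp_feature(act, target=5.0)
print(float(steered @ enc_dir))  # ≈ 5.0
```

Because the edit is localized to one direction, the rest of the activation, and hence most other features, is left (approximately) untouched; that locality is what makes the intervention a causal test of the feature's role.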
The monosemanticity findings are significant because they confirm that models do learn structured, human-interpretable representations — they are just entangled via superposition. SAEs decompose the superposition, revealing a high-dimensional but legible feature space.
Why It Matters for Safety
Mechanistic interpretability is motivated heavily by AI safety concerns. The core problem is that behavioral evaluations — testing a model on benchmarks — cannot guarantee safe behavior in novel situations. A model could appear aligned on all tested distributions while harboring deceptive patterns that activate under other conditions.
Safety applications
- Alignment verification: inspect whether the model's internal representation of its goal matches its stated goal
- Deception detection: look for features representing "user believes X but model believes Y" type patterns
- Capability auditing: identify what circuits exist before a model is deployed
- Targeted intervention: clamp or ablate harmful features without full retraining
Current limitations
- Full mechanistic understanding of even a small transformer is intractable today
- Most SAE-extracted features are inscrutable — only a small fraction have clear human interpretations
- Circuit analysis doesn't yet scale to production-size models (70B+ parameters)
- Progress is slow relative to the rate at which model capabilities are growing
- Adversarial inputs can activate unexpected circuits not captured by standard analysis
Tools of the Trade
| Tool | Purpose | Maintained by |
|---|---|---|
| TransformerLens | Hook into any layer/head of GPT-2, Llama, Mistral etc.; inspect and patch activations | Neel Nanda / open-source community |
| SAELens | Training and analysis library for sparse autoencoders; works with TransformerLens | Joseph Bloom / community |
| Neuronpedia | Web interface for browsing SAE features; steering experiments; crowd-sourced feature labeling | Johnny Lin / community |
| Anthropic Interpretability API | Access to Claude model internals for research; activation patching via API | Anthropic (research access) |
Checklist: Do You Understand This?
- Can you explain what a "feature" means in the context of mechanistic interpretability — how does it differ from a neuron?
- What is superposition, and why does it make individual neurons hard to interpret?
- How does a sparse autoencoder (SAE) decompose superposed features? What is the role of the L1 sparsity penalty?
- What are induction heads, and what computational function do they implement?
- Why is behavioral evaluation alone insufficient to verify model alignment? What does mechanistic interpretability add?
- What are two concrete safety applications of mechanistic interpretability, and what are two current limits of the approach?