Mechanistic Interpretability — What & Why
Mechanistic interpretability is the scientific program of reverse-engineering neural networks at the algorithmic level — understanding not just what a model outputs but how it arrives there. Rather than treating a model as a black box and measuring its behavior from the outside, mechanistic interpretability opens the box and attempts to read the circuit diagrams inside. As of 2025 it is one of the most active research frontiers in AI safety.
Core Concepts
Three foundational ideas underpin the field. Each one is both a conceptual claim about how neural networks work and a target for empirical investigation.
Features
Directions in a model's activation space that represent human-interpretable concepts. "Banana," "Paris," "negation" — each may correspond to a specific direction in the high-dimensional space of neuron activations. When that direction is active, the concept is present in the model's computation.
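As a toy illustration (the "Paris" direction, dimensions, and activations here are all hypothetical, not taken from any real model), a feature-as-direction can be read off an activation vector by projection:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # hypothetical hidden size

# Hypothetical "Paris" feature: a fixed unit direction in activation space.
paris_dir = rng.normal(size=d_model)
paris_dir /= np.linalg.norm(paris_dir)

# An activation where the concept is present vs. one where it is absent.
act_with = 3.0 * paris_dir + 0.1 * rng.normal(size=d_model)
act_without = 0.1 * rng.normal(size=d_model)

def feature_activation(act, direction):
    """Strength of a feature = projection of the activation onto its direction."""
    return float(act @ direction)

print(feature_activation(act_with, paris_dir))     # large positive
print(feature_activation(act_without, paris_dir))  # near zero
```

The same dot-product reading is what makes "direction in activation space" an operational definition rather than a metaphor.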
Circuits
Computational subgraphs — specific neurons and attention heads connected by the weight matrix — that implement a particular algorithm. A circuit is to a neural network what a subroutine is to software: a reusable, identifiable computational unit that performs one job.
Universality
The hypothesis that the same circuits emerge independently in different models trained on different data. If true, this would mean there are convergent solutions to common computational problems — and that findings in one model transfer to others.
The Superposition Hypothesis
Neural networks have far more concepts to represent than they have dimensions. A model with 4,096 hidden dimensions might need to track millions of distinct features. The superposition hypothesis (Elhage et al., 2022) explains how this is possible: models store features as non-orthogonal directions in activation space, accepting interference between features that rarely co-occur.
The core insight:
- If two features are never active at the same time, their directions can be nearly parallel — the interference is never "paid"
- Models exploit sparsity in the real world: most concepts are irrelevant in any given context
- This allows exponentially more features than dimensions — but at a cost: individual neurons become polysemantic
- A polysemantic neuron fires for multiple unrelated concepts simultaneously, making it impossible to interpret neuron-by-neuron
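The trade-off described above can be seen numerically: random directions in a modest-dimensional space are nearly, but not exactly, orthogonal, so many more features than dimensions can coexist at the cost of small interference (a toy sketch with arbitrary sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 64, 1024  # far more features than dimensions

# Random unit directions: an over-complete, non-orthogonal feature basis.
W = rng.normal(size=(n_features, d_model))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Pairwise interference = |cosine similarity| between distinct feature directions.
gram = W @ W.T
interference = np.abs(gram - np.eye(n_features))

print(interference.max())   # worst-case overlap: small but non-zero
print(interference.mean())  # typical overlap, roughly 1/sqrt(d_model)
```

If two features with overlapping directions are never active together, that residual overlap never corrupts the computation, which is exactly the sparsity bet the hypothesis describes.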
Superposition explains a longstanding puzzle: why are individual neurons in large models so hard to interpret? They are not monosemantic (dedicated to one concept) — they are simultaneously encoding many concepts in superposition. This is why reading neuron activations directly is insufficient for mechanistic understanding.
Circuits Research — Key Findings
The circuits research program began with convolutional vision models and was extended to transformers. Elhage et al. (2021) "A Mathematical Framework for Transformer Circuits" established the formal vocabulary for analyzing transformers at the circuit level. Several circuits have been discovered and verified:
| Circuit | What it does | Significance |
|---|---|---|
| Induction heads | Attend to the token following the previous occurrence of the current token — implements pattern completion | Mechanistically responsible for in-context learning; found in virtually all transformers with two or more layers |
| Previous token heads | Attend to the immediately previous token; create a shifted copy of the sequence | Input to induction heads; part of a two-layer circuit for few-shot copying |
| IOI circuit (Wang et al. 2022) | Implements indirect object identification: in "When Mary and John went to the store, John gave a drink to ___" — identifies Mary | First end-to-end reverse-engineering of a complete circuit for a natural-language task in GPT-2 small |
| Copy suppression | Reduces the probability of directly copying a recent token when context calls for a different output | Explains how models avoid over-copying in in-context completion tasks |
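The induction-head algorithm from the table can be written out directly. This toy function (plain Python, not model code) performs the same [A][B] … [A] → [B] pattern completion on a token list:

```python
def induction_prediction(tokens):
    """Toy version of the induction-head algorithm: to predict the next token,
    find the most recent previous occurrence of the current token and copy
    the token that followed it ([A][B] ... [A] -> predict [B])."""
    current = tokens[-1]
    # Scan backward for an earlier occurrence of the current token.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None  # no previous occurrence: induction has nothing to copy

print(induction_prediction(["the", "cat", "sat", "on", "the"]))  # "cat"
```

In a real transformer this is split across two heads: a previous-token head builds the shifted copy, and the induction head matches against it and copies forward.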
Sparse Autoencoders — The Key Tool
If superposition means features are mixed across neurons, how do you unmix them? The leading method as of 2024–2025 is the sparse autoencoder (SAE). An SAE is a shallow neural network trained on frozen model activations with two objectives: reconstruct the activations faithfully, and use a sparse hidden layer to do so.
How SAEs work
- Encoder maps a residual stream activation (e.g., 4,096-d) to a much larger hidden layer (e.g., 16,384-d)
- An L1 sparsity penalty forces most hidden units to be zero at any given time
- Decoder maps the sparse hidden layer back to reconstruct the original activation
- The few hidden units that remain non-zero correspond to monosemantic features
- Each feature direction in the SAE corresponds to one interpretable concept
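A minimal forward-pass sketch of this encode/decode pipeline, using randomly initialized (untrained) weights and scaled-down hypothetical dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 64, 256  # scaled-down stand-ins for e.g. 4,096 / 16,384

# Randomly initialized SAE weights (in practice, trained on frozen activations).
W_enc = 0.05 * rng.normal(size=(d_model, d_hidden))
b_enc = np.zeros(d_hidden)
W_dec = 0.05 * rng.normal(size=(d_hidden, d_model))
b_dec = np.zeros(d_model)

def sae_forward(act):
    """Encode into a wide, nonnegative code, then reconstruct the activation."""
    hidden = np.maximum(act @ W_enc + b_enc, 0.0)  # ReLU keeps the code nonnegative
    return hidden, hidden @ W_dec + b_dec

def sae_loss(act, hidden, recon, l1_coeff=1e-3):
    """Training objective: reconstruction error plus the L1 penalty that buys sparsity."""
    return float(np.sum((act - recon) ** 2) + l1_coeff * np.sum(np.abs(hidden)))

act = rng.normal(size=d_model)    # stand-in for a residual-stream activation
hidden, recon = sae_forward(act)
print(hidden.shape, recon.shape)  # wide code, same-shape reconstruction
```

Note that with random weights the code is not actually sparse; sparsity emerges only from training against the L1 term, which is what pressures each hidden unit toward a single concept.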
What Anthropic found
- Applied SAEs to a one-layer transformer ("Towards Monosemanticity," 2023) and extracted thousands of largely monosemantic features
- Scaled to Claude 3 Sonnet ("Scaling Monosemanticity," 2024), training SAEs with up to 34M features — found features for specific people, cities, and scientific concepts
- Found features corresponding to safety-relevant concepts: deception, harm, manipulation, concealment
- Clamping these features changes model behavior in predictable, causal ways
- Confirmed features are causally active — not just correlated with outputs
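Clamping can be sketched as editing an activation along a feature's direction (the directions here are hypothetical and tied for simplicity; real interventions use trained SAE encoder/decoder weights inside a live forward pass):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Hypothetical SAE feature: the encoder direction reads its value,
# the decoder direction writes it back into the residual stream.
enc_dir = rng.normal(size=d_model)
enc_dir /= np.linalg.norm(enc_dir)
dec_dir = enc_dir.copy()  # simplification: tied encoder/decoder directions

def clamp_feature(act, target):
    """Steer: overwrite the feature's current value along its decoder direction."""
    current = float(act @ enc_dir)  # how active the feature is right now
    return act + (target - current) * dec_dir

act = rng.normal(size=d_model)
steered = clamp_feature(act, target=5.0)
print(float(steered @ enc_dir))  # ≈ 5.0
```

Because the edit is localized to one direction, the rest of the activation, and hence most other features, is left (approximately) untouched; that locality is what makes the intervention a causal test of the feature's role.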
The monosemanticity findings are significant because they confirm that models do learn structured, human-interpretable representations — they are just entangled via superposition. SAEs decompose the superposition, revealing a high-dimensional but legible feature space.
Why It Matters for Safety
Mechanistic interpretability is motivated heavily by AI safety concerns. The core problem is that behavioral evaluations — testing a model on benchmarks — cannot guarantee safe behavior in novel situations. A model could appear aligned on all tested distributions while harboring deceptive patterns that activate under other conditions.
Safety applications
- Alignment verification: inspect whether the model's internal representation of its goal matches its stated goal
- Deception detection: look for features representing "user believes X but model believes Y" type patterns
- Capability auditing: identify what circuits exist before a model is deployed
- Targeted intervention: clamp or ablate harmful features without full retraining
Current limitations
- Full mechanistic understanding of even a small transformer is intractable today
- Most SAE-extracted features are inscrutable — only a small fraction have clear human interpretations
- Circuit analysis doesn't yet scale to production-size models (70B+ parameters)
- Progress is slow relative to the rate at which model capabilities are growing
- Adversarial inputs can activate unexpected circuits not captured by standard analysis
Tools of the Trade
| Tool | Purpose | Maintained by |
|---|---|---|
| TransformerLens | Hook into any layer/head of GPT-2, Llama, Mistral etc.; inspect and patch activations | Neel Nanda / open-source community |
| SAELens | Training and analysis library for sparse autoencoders; works with TransformerLens | Joseph Bloom / community |
| Neuronpedia | Web interface for browsing SAE features; steering experiments; crowd-sourced feature labeling | Johnny Lin / community |
| Anthropic Interpretability API | Access to Claude model internals for research; activation patching via API | Anthropic (research access) |
Checklist: Do You Understand This?
- Can you explain what a "feature" means in the context of mechanistic interpretability — how does it differ from a neuron?
- What is superposition, and why does it make individual neurons hard to interpret?
- How does a sparse autoencoder (SAE) decompose superposed features? What is the role of the L1 sparsity penalty?
- What are induction heads, and what computational function do they implement?
- Why is behavioral evaluation alone insufficient to verify model alignment? What does mechanistic interpretability add?
- What are two concrete safety applications of mechanistic interpretability, and what are two current limits of the approach?