
Circuits & Features in LLMs

The circuits research program treats neural networks as reverse-engineering targets. Instead of asking "what does this model do on this benchmark," it asks: "what algorithm is this specific group of weights implementing?" Starting with convolutional vision models and extending to transformers, researchers have produced a small but genuine body of mechanistic discoveries — verifiable, causally confirmed findings about how specific capabilities are implemented inside large language models.

The Circuits Approach

Olah et al. (2020) "Zoom In: An Introduction to Circuits" articulated the foundational research agenda. It proposed three claims as working hypotheses:

1. Features

Neural networks learn features — representations of meaningful concepts — as directions in activation space. Features are the atoms of model computation.

2. Circuits

Features are connected by weights into circuits — subgraphs that implement specific algorithms. A circuit is a complete computational unit that can be read and understood.

3. Universality

The same circuits emerge convergently across different models and modalities — curve detectors in CNNs, induction heads in transformers, regardless of random seed or training data.

Each claim is still a hypothesis under active investigation, but the evidence base has grown substantially since 2020. Universality in particular has been supported by finding induction heads in nearly every transformer architecture studied.

The Induction Circuit

The induction circuit is the most thoroughly understood circuit in large language models. Elhage et al. (2021) "A Mathematical Framework for Transformer Circuits" provided the algebraic tools to analyze two-layer transformers, and the induction circuit was the first concrete result.

Circuit components and composition:

Layer 1: Previous-token head

  • Attends strongly to the immediately preceding token
  • Copies the preceding token's representation into the residual stream at the current position
  • Creates a shifted representation: at position i, the residual stream now contains information about token i−1

Layer 2: Induction head

  • Its keys are computed from the layer-1 output (K-composition)
  • Effectively searches for positions where the previous token matched the current query token
  • Attends to the position after the last match and copies that token's value forward

Net result:

In sequence [A][B]...[A], the induction head predicts [B] with high probability. This implements in-context few-shot pattern completion — the mechanism underlying much of in-context learning in GPT-style models.

This was verified causally: ablating the previous-token head destroys in-context learning; ablating the induction head similarly destroys pattern completion. The circuit is not just correlated with the capability — it implements it.
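The two-step algorithm can be sketched in plain Python, abstracting away attention entirely. This is a toy illustration of the circuit's behavior, not the weight-level mechanism:

```python
def induction_predict(tokens):
    """Toy version of the induction circuit's algorithm:
    find the most recent earlier occurrence of the current
    token and predict the token that followed it."""
    current = tokens[-1]
    # Previous-token head: makes each position's predecessor available.
    # Induction head: scan for a position whose *previous* token matches
    # the current token, then copy the token at that position forward.
    for i in range(len(tokens) - 2, 0, -1):
        if tokens[i - 1] == current:
            return tokens[i]
    return None  # no earlier match: the circuit has nothing to copy

# In sequence [A][B]...[A], the prediction is [B]:
print(induction_predict(["A", "B", "C", "A"]))  # -> "B"
```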

The IOI Circuit — Full Reverse Engineering

Wang et al. (2022) achieved the most complete circuit reverse engineering of a real capability in a production-scale model (GPT-2 Small). The task was indirect object identification (IOI): given "When Mary and John went to the store, John gave a drink to", predict "Mary."

| Component | Role in circuit | Layer range |
| --- | --- | --- |
| Duplicate token heads | Identify which names appear more than once in the sentence | Early layers |
| S-inhibition heads | Suppress the subject name (John, the duplicated one) from being the output | Mid layers |
| Name mover heads | Move the non-duplicated name (Mary) to the output position | Late layers |
| Backup name movers | Redundant copies of name movers — circuit has redundancy built in | Late layers |

The IOI paper demonstrated that a real linguistic capability in a production model could be completely attributed to a specific circuit. It also showed that circuits include redundancy: when name mover heads are ablated, backup name movers take over much of their function, so ablation studies must account for all redundant paths before concluding a component is unimportant.
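The circuit's high-level logic — detect the duplicate, suppress it, move the remaining name — can be caricatured in a few lines of Python. This is only the algorithmic skeleton; the real circuit implements it through attention-head composition:

```python
from collections import Counter

def ioi_predict(names_in_sentence):
    """Toy sketch of the IOI circuit's logic on the names in a sentence."""
    counts = Counter(names_in_sentence)
    # Duplicate token heads: identify names appearing more than once.
    duplicated = {name for name, c in counts.items() if c > 1}
    # S-inhibition heads: suppress the duplicated (subject) name.
    candidates = [name for name in counts if name not in duplicated]
    # Name mover heads: move the surviving name to the output position.
    return candidates[0] if candidates else None

# "When Mary and John went to the store, John gave a drink to" -> "Mary"
print(ioi_predict(["Mary", "John", "John"]))  # -> "Mary"
```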

Anthropic's Monosemanticity Work

The key challenge for circuits research in large models is superposition: individual neurons are polysemantic, encoding multiple unrelated concepts simultaneously. Anthropic's monosemanticity research (2023–2024) addressed this directly using sparse autoencoders (SAEs).

One-layer transformer results (2023)

  • Trained SAEs on a one-layer transformer with a 512-neuron MLP
  • Recovered thousands of monosemantic features — many more features than the 512 neurons they were extracted from
  • Each feature was interpretable: "DNA", "base pairs", "genetic sequence" — conceptually related clusters
  • Features formed geometric relationships: antonyms were opposite directions; related concepts were nearby
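The core idea can be sketched as a forward pass through a sparse autoencoder: an overcomplete ReLU encoder expands an activation vector into many non-negative features, and a linear decoder reconstructs the original activation. The sizes and random weights below are purely illustrative; a real SAE is trained on model activations:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_features = 512, 4096  # overcomplete: far more features than dimensions

# Randomly initialised SAE parameters (in practice these are trained).
W_enc = rng.normal(scale=0.02, size=(d_model, d_features))
b_enc = np.zeros(d_features)
W_dec = rng.normal(scale=0.02, size=(d_features, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode an activation vector into sparse features, then reconstruct it."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU keeps features non-negative
    x_hat = f @ W_dec + b_dec               # linear reconstruction from the dictionary
    return f, x_hat

def sae_loss(x, f, x_hat, l1_coeff=1e-3):
    """Training objective: reconstruction error plus an L1 sparsity penalty."""
    return np.sum((x - x_hat) ** 2) + l1_coeff * np.sum(np.abs(f))

x = rng.normal(size=d_model)  # stand-in for an MLP activation vector
f, x_hat = sae_forward(x)
```

The L1 penalty is what pushes each feature to fire rarely, which is why the learned features tend to be monosemantic where raw neurons are not.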

Scaling to Claude 3 Sonnet (2024)

  • Applied SAEs to a frontier production model, extracting dictionaries of up to ~34 million features
  • Found features for specific named entities: individual people, countries, cities, organizations
  • Found multilingual and multimodal features: the same feature firing across languages and on relevant images
  • Features for famous people activated on both name mentions and indirect descriptions
  • Confirmed features are causally active: steering via feature clamping changes model behavior predictably

Safety-Relevant Features

Among the most significant findings from the monosemanticity scaling work: the discovery of features that directly correspond to safety-relevant concepts in Claude's representations.

Safety-relevant features discovered in Claude (2024):

  • Features for concepts including deception, hidden intentions, manipulation, and concealment of information
  • A feature for constraint and servitude that activates on the Assistant token — named the "Assistant" role feature
  • Features for potentially harmful content categories that are distinct from benign adjacent concepts

Causal verification:

Clamping these features to extreme values produces predictable behavior changes — activating deception-related feature clusters causes the model to behave more deceptively; suppressing them reduces deceptive responses. This confirms causal relevance, not mere correlation.
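The clamping intervention described above amounts to overwriting one feature's activation before decoding back into the model. A minimal sketch, with a hypothetical feature index and clamp value (real steering operates on SAE feature activations inside the model's forward pass):

```python
import numpy as np

def clamp_feature(features, idx, value):
    """Steering by feature clamping: fix one SAE feature's activation
    to a chosen value, leaving all other features untouched."""
    steered = features.copy()
    steered[idx] = value  # large positive = amplify; 0 or negative = suppress
    return steered

features = np.array([0.0, 1.2, 0.0, 0.3])
# Clamp hypothetical feature 2 to a large positive value:
steered = clamp_feature(features, idx=2, value=10.0)
```

Decoding `steered` back through the SAE's decoder and substituting the result into the residual stream is what makes the intervention causal rather than merely observational.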

Interpretive caution required

Finding a feature labeled "deception" does not mean the model has a unified intentional representation of deception. The feature activates on text involving deception — whether the model "intends" anything in a meaningful sense is a separate philosophical and empirical question that the feature finding alone does not settle.

Universality — Current Evidence

| Finding | Models tested | Evidence strength |
| --- | --- | --- |
| Induction heads | GPT-2, LLaMA, Mistral, BERT variants | Strong — found in virtually all transformers studied |
| Curve detectors in CNNs | AlexNet, VGG, ResNet, InceptionNet | Strong — same features emerge across architectures |
| First-token attention sinks | GPT-2, LLaMA-2, Falcon, Mistral | Strong — near-universal across decoder models |
| IOI-style name mover heads | GPT-2 variants | Moderate — found at GPT-2 scale; unclear at larger scale |

Current State and Open Problems (2025)

Genuine progress

  • Multiple verified circuits in transformer models with full causal confirmation
  • SAEs successfully decompose superposition in both small and frontier models
  • Safety-relevant features identified and causally verified in Claude
  • Active open-source tooling ecosystem: TransformerLens, SAELens, Neuronpedia
  • Interpretability findings beginning to inform training decisions at Anthropic

Unsolved problems

  • No complete mechanistic understanding of any capability in a model larger than GPT-2 Small
  • Most SAE features remain inscrutable — interpretable features are a small fraction
  • Circuit-level understanding does not compose to model-level understanding
  • Model editing based on localized circuits does not yet generalize reliably
  • The field is growing slower than model capabilities — the interpretability gap is widening

Checklist: Do You Understand This?

  • What are the three working hypotheses of the circuits research program (Olah et al. 2020)?
  • Walk through the induction circuit step by step. What is K-composition and how does it enable the circuit to function?
  • Describe what S-inhibition heads do in the IOI circuit. Why is redundancy in the circuit a complication for ablation studies?
  • How did Anthropic's 2023 monosemanticity work use sparse autoencoders to overcome the superposition problem?
  • What evidence supports the universality hypothesis, and where does it remain uncertain?
  • Why does finding a "deception feature" in Claude not necessarily mean the model is intentionally deceptive? What does the feature finding actually establish?