
Circuits & Features in LLMs

The circuits research program treats neural networks as reverse-engineering targets. Instead of asking "what does this model do on this benchmark," it asks: "what algorithm is this specific group of weights implementing?" Starting with convolutional vision models and extending to transformers, researchers have produced a small but genuine body of mechanistic discoveries — verifiable, causally confirmed findings about how specific capabilities are implemented inside large language models.

The Circuits Approach

Olah et al. (2020) "Zoom In: An Introduction to Circuits" articulated the foundational research agenda. It proposed three claims as working hypotheses:

1. Features

Neural networks learn features — representations of meaningful concepts — as directions in activation space. Features are the atoms of model computation.

2. Circuits

Features are connected by weights into circuits — subgraphs that implement specific algorithms. A circuit is a complete computational unit that can be read and understood.

3. Universality

The same circuits emerge convergently across different models and modalities — curve detectors in CNNs, induction heads in transformers, regardless of random seed or training data.

Each claim is still a hypothesis under active investigation, but the evidence base has grown substantially since 2020. Universality in particular has been supported by finding induction heads in nearly every transformer architecture studied.

The Induction Circuit

The induction circuit is the most thoroughly understood circuit in large language models. Elhage et al. (2021) "A Mathematical Framework for Transformer Circuits" provided the algebraic tools to analyze two-layer transformers, and the induction circuit was the first concrete result.

Circuit components and composition:

Layer 1: Previous-token head

  • Attends strongly to the immediately preceding token
  • Copies the preceding token's representation into the residual stream at the current position
  • Creates a shifted representation: at position i, the residual stream now contains information about token i−1

Layer 2: Induction head

  • Its keys are computed from the layer-1 output (K-composition)
  • Effectively searches for positions where the previous token matched the current query token
  • Attends to the position after the last match and copies that token's value forward

Net result:

In sequence [A][B]...[A], the induction head predicts [B] with high probability. This implements in-context few-shot pattern completion — the mechanism underlying much of in-context learning in GPT-style models.

This was verified causally: ablating the previous-token head destroys in-context learning; ablating the induction head similarly destroys pattern completion. The circuit is not just correlated with the capability — it implements it.
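The two-step algorithm can be sketched in plain Python, abstracting away attention entirely. This is a toy illustration of the circuit's behavior, not the weight-level mechanism:

```python
def induction_predict(tokens):
    """Toy version of the induction circuit's algorithm:
    find the most recent earlier occurrence of the current
    token and predict the token that followed it."""
    current = tokens[-1]
    # Previous-token head: makes each position's predecessor available.
    # Induction head: scan for a position whose *previous* token matches
    # the current token, then copy the token at that position forward.
    for i in range(len(tokens) - 2, 0, -1):
        if tokens[i - 1] == current:
            return tokens[i]
    return None  # no earlier match: the circuit has nothing to copy

# In sequence [A][B]...[A], the prediction is [B]:
print(induction_predict(["A", "B", "C", "A"]))  # -> "B"
```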

The IOI Circuit — Full Reverse Engineering

Wang et al. (2022) achieved the most complete circuit reverse engineering of a real capability in a production-scale model (GPT-2 Small). The task was indirect object identification (IOI): given "When Mary and John went to the store, John gave a drink to", predict "Mary."

| Component | Role in circuit | Layer range |
| --- | --- | --- |
| Duplicate token heads | Identify which names appear more than once in the sentence | Early layers |
| S-inhibition heads | Suppress the subject name (John, the duplicated one) from being the output | Mid layers |
| Name mover heads | Move the non-duplicated name (Mary) to the output position | Late layers |
| Backup name movers | Redundant copies of name movers — circuit has redundancy built in | Late layers |

The IOI paper demonstrated that a real linguistic capability in a production model could be completely attributed to a specific circuit. It also showed that circuits include redundancy: when name mover heads are ablated, backup name movers take over much of their function, so ablation studies must account for all redundant paths before concluding a component is unimportant.
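The circuit's high-level logic — detect the duplicate, suppress it, move the remaining name — can be caricatured in a few lines of Python. This is only the algorithmic skeleton; the real circuit implements it through attention-head composition:

```python
from collections import Counter

def ioi_predict(names_in_sentence):
    """Toy sketch of the IOI circuit's logic on the names in a sentence."""
    counts = Counter(names_in_sentence)
    # Duplicate token heads: identify names appearing more than once.
    duplicated = {name for name, c in counts.items() if c > 1}
    # S-inhibition heads: suppress the duplicated (subject) name.
    candidates = [name for name in counts if name not in duplicated]
    # Name mover heads: move the surviving name to the output position.
    return candidates[0] if candidates else None

# "When Mary and John went to the store, John gave a drink to" -> "Mary"
print(ioi_predict(["Mary", "John", "John"]))  # -> "Mary"
```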

Anthropic's Monosemanticity Work

The key challenge for circuits research in large models is superposition: individual neurons are polysemantic, encoding multiple unrelated concepts simultaneously. Anthropic's monosemanticity research (2023–2024) addressed this directly using sparse autoencoders (SAEs).

One-layer transformer results (2023)

  • Trained SAEs on a one-layer transformer with a 512-neuron MLP
  • Recovered thousands of monosemantic features — many more features than the 512 neurons they were extracted from
  • Each feature was interpretable: "DNA", "base pairs", "genetic sequence" — conceptually related clusters
  • Features formed geometric relationships: antonyms were opposite directions; related concepts were nearby
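The core idea can be sketched as a forward pass through a sparse autoencoder: an overcomplete ReLU encoder expands an activation vector into many non-negative features, and a linear decoder reconstructs the original activation. The sizes and random weights below are purely illustrative; a real SAE is trained on model activations:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_features = 512, 4096  # overcomplete: far more features than dimensions

# Randomly initialised SAE parameters (in practice these are trained).
W_enc = rng.normal(scale=0.02, size=(d_model, d_features))
b_enc = np.zeros(d_features)
W_dec = rng.normal(scale=0.02, size=(d_features, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode an activation vector into sparse features, then reconstruct it."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU keeps features non-negative
    x_hat = f @ W_dec + b_dec               # linear reconstruction from the dictionary
    return f, x_hat

def sae_loss(x, f, x_hat, l1_coeff=1e-3):
    """Training objective: reconstruction error plus an L1 sparsity penalty."""
    return np.sum((x - x_hat) ** 2) + l1_coeff * np.sum(np.abs(f))

x = rng.normal(size=d_model)  # stand-in for an MLP activation vector
f, x_hat = sae_forward(x)
```

The L1 penalty is what pushes each feature to fire rarely, which is why the learned features tend to be monosemantic where raw neurons are not.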

Scaling to Claude 3 Sonnet (2024)

  • Applied SAEs to a frontier production model, extracting dictionaries of up to ~34 million features
  • Found features for specific named entities: individual people, countries, cities, organizations
  • Found multilingual and multimodal features: the same feature firing across languages and on relevant images
  • Features for famous people activated on both name mentions and indirect descriptions
  • Confirmed features are causally active: steering via feature clamping changes model behavior predictably

Safety-Relevant Features

Among the most significant findings from the monosemanticity scaling work: the discovery of features that directly correspond to safety-relevant concepts in Claude's representations.

Safety-relevant features discovered in Claude (2024):

  • Features for concepts including deception, hidden intentions, manipulation, and concealment of information
  • A feature for constraint and servitude that activates on the Assistant token — named the "Assistant" role feature
  • Features for potentially harmful content categories that are distinct from benign adjacent concepts

Causal verification:

Clamping these features to extreme values produces predictable behavior changes — activating deception-related feature clusters causes the model to behave more deceptively; suppressing them reduces deceptive responses. This confirms causal relevance, not mere correlation.
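The clamping intervention described above amounts to overwriting one feature's activation before decoding back into the model. A minimal sketch, with a hypothetical feature index and clamp value (real steering operates on SAE feature activations inside the model's forward pass):

```python
import numpy as np

def clamp_feature(features, idx, value):
    """Steering by feature clamping: fix one SAE feature's activation
    to a chosen value, leaving all other features untouched."""
    steered = features.copy()
    steered[idx] = value  # large positive = amplify; 0 or negative = suppress
    return steered

features = np.array([0.0, 1.2, 0.0, 0.3])
# Clamp hypothetical feature 2 to a large positive value:
steered = clamp_feature(features, idx=2, value=10.0)
```

Decoding `steered` back through the SAE's decoder and substituting the result into the residual stream is what makes the intervention causal rather than merely observational.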

Interpretive caution required

Finding a feature labeled "deception" does not mean the model has a unified intentional representation of deception. The feature activates on text involving deception — whether the model "intends" anything in a meaningful sense is a separate philosophical and empirical question that the feature finding alone does not settle.

Universality — Current Evidence

| Finding | Models tested | Evidence strength |
| --- | --- | --- |
| Induction heads | GPT-2, LLaMA, Mistral, BERT variants | Strong — found in virtually all transformers studied |
| Curve detectors in CNNs | AlexNet, VGG, ResNet, InceptionNet | Strong — same features emerge across architectures |
| First-token attention sinks | GPT-2, LLaMA-2, Falcon, Mistral | Strong — near-universal across decoder models |
| IOI-style name mover heads | GPT-2 variants | Moderate — found at GPT-2 scale; unclear at larger scale |

Current State and Open Problems (2025)

Genuine progress

  • Multiple verified circuits in transformer models with full causal confirmation
  • SAEs successfully decompose superposition in both small and frontier models
  • Safety-relevant features identified and causally verified in Claude
  • Active open-source tooling ecosystem: TransformerLens, SAELens, Neuronpedia
  • Interpretability findings beginning to inform training decisions at Anthropic

Unsolved problems

  • No complete mechanistic understanding of any capability in a model larger than GPT-2 Small
  • Most SAE features remain inscrutable — interpretable features are a small fraction
  • Circuit-level understanding does not compose to model-level understanding
  • Model editing based on localized circuits does not yet generalize reliably
  • The field is growing slower than model capabilities — the interpretability gap is widening

Checklist: Do You Understand This?

  • What are the three working hypotheses of the circuits research program (Olah et al. 2020)?
  • Walk through the induction circuit step by step. What is K-composition and how does it enable the circuit to function?
  • Describe what S-inhibition heads do in the IOI circuit. Why is redundancy in the circuit a complication for ablation studies?
  • How did Anthropic's 2023 monosemanticity work use sparse autoencoders to overcome the superposition problem?
  • What evidence supports the universality hypothesis, and where does it remain uncertain?
  • Why does finding a "deception feature" in Claude not necessarily mean the model is intentionally deceptive? What does the feature finding actually establish?