Attention Visualization & Probing
Attention weights are the most directly accessible internal signal in a transformer model — they are produced as part of the forward pass and can be read without any additional tooling. This accessibility made attention visualization the first widespread interpretability technique. But accessibility turned out to be misleading: attention weights alone do not tell you why a model made a prediction. This page covers what visualization and probing actually reveal, where they fall short, and how causal tracing fills some of the gaps.
What Attention Patterns Look Like
An attention heatmap plots the attention weight matrix for a single head at a single layer. Rows are query positions (output tokens), columns are key positions (input tokens), and the cell value is the proportion of attention each query allocates to each key. Several qualitatively distinct patterns appear across heads in real models:
| Pattern type | Visual signature | Likely role |
|---|---|---|
| Diagonal | Weight concentrated on adjacent tokens (previous or next) | Local context; positional smoothing |
| First-token sink | Most positions attend strongly to position 0 (often [BOS]) | Attention sink — "null" allocation when no useful position exists |
| Induction pattern | Each token attends to the token that followed the last occurrence of the current token | In-context pattern completion; few-shot learning mechanism |
| Syntactic heads | Verbs attend to their subjects; nouns attend to modifiers | Grammatical agreement and dependency tracking |
| Copy heads | Exact diagonal — each token attends only to itself | Straight-through token copying for verbatim repetition |
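As a concrete reference for what these heatmaps plot, here is a minimal NumPy sketch that computes causal scaled dot-product attention weights for one head and two of the pattern statistics one might scan for. All shapes and values are illustrative, not taken from a real model:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 8, 16  # sequence length and head dimension (illustrative sizes)

# Toy query/key matrices for one head; in practice these come from the model.
Q = rng.normal(size=(T, d))
K = rng.normal(size=(T, d))

# Scaled dot-product attention weights with a causal mask:
scores = Q @ K.T / np.sqrt(d)
causal = np.tril(np.ones((T, T), dtype=bool))
scores = np.where(causal, scores, -np.inf)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Two simple statistics used when scanning heads for patterns:
sink_score = weights[1:, 0].mean()            # mass on position 0 (first-token sink)
prev_score = np.diagonal(weights, -1).mean()  # mass on the previous token (diagonal head)
print(f"sink: {sink_score:.2f}  previous-token: {prev_score:.2f}")
```

Each row of `weights` is a probability distribution over earlier positions; heatmap tools such as BertViz render exactly this matrix, one head at a time.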
Tools such as BertViz (for BERT-style models) and TransformerLens (for GPT-style models) render these heatmaps interactively. They are genuinely useful for qualitative exploration and hypothesis formation. The failure mode is treating the visualization as an explanation.
Attention Is Not Explanation
Jain & Wallace (2019) published the landmark critique. They tested two things: whether attention weights agree with other measures of token importance, and whether very different attention distributions would change the model's predictions. The results were negative on both counts:
What the experiments showed
- Attention weights and gradient-based attribution (itself an approximation of influence, not a perfect causal measure) frequently disagreed — same prediction, different "important" tokens
- Permuted and adversarially constructed attention distributions, very different from the learned weights, left the output nearly unchanged
- High attention to a token doesn't mean the model's prediction changes if that token is removed
- Different randomly-seeded trained models show different attention patterns for identical predictions
Why this matters
- Attention weights aggregate over value vectors — what matters is the weighted combination of values, not the weights alone
- A token with low attention weight can still inject critical information through a high-magnitude value vector
- Multiple heads operate in parallel — one head's attention pattern is only a fraction of the computation at any layer
- Attention is an intermediate representation, not a direct window onto model reasoning
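A two-position toy example makes the first two points concrete: the head's output is the attention-weighted sum of value vectors, so a token holding only 10% of the attention can still dominate the output if its value vector is large. All numbers below are invented for illustration:

```python
import numpy as np

# Attention heavily favors position 0...
weights = np.array([0.9, 0.1])

# ...but position 1 carries a much larger value vector.
values = np.array([
    [0.1, 0.1],    # value at position 0 (small magnitude)
    [20.0, 20.0],  # value at position 1 (large magnitude)
])

head_output = weights @ values       # 0.9*0.1 + 0.1*20.0 = 2.09 per coordinate
contrib = weights[:, None] * values  # per-position contribution to the output

print(head_output)  # position 1 supplies ~2.0 of the 2.09 despite 10% attention
```

Reading the heatmap alone, position 0 looks ten times more important; reading the weighted contributions, it is the other way around.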
Induction Heads — A Verified Example
Induction heads are the clearest example of a circuit that was identified through attention analysis and then mechanistically verified. They appear reliably in transformers with at least two layers, from tiny attention-only research models up through GPT-2 scale and beyond. The circuit is a two-layer composition:
The induction circuit step by step:
- A previous-token head in layer 1 attends to the immediately preceding token and writes information about it into the residual stream at the current position
- An induction head in layer 2 uses K-composition: its keys are computed from the layer-1 output, so it effectively attends to positions where the previous token matched the current query token
- Result: the induction head attends to the token that follows the most recent previous occurrence of the current token, and copies that token's value forward as the prediction
- This implements in-context pattern completion: if the sequence contains [A][B]...[A], the induction head predicts [B]
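The attention pattern an idealized induction head produces can be computed directly from the token sequence. The sketch below (plain Python; the helper name is ours, not a library function) returns, for each position, the position such a head would attend to:

```python
def induction_targets(tokens):
    """For each position t, return the position an idealized induction head
    would attend to: one past the most recent earlier occurrence of tokens[t].
    None means the token has not appeared before, so the head has no target."""
    last_seen = {}
    targets = []
    for t, tok in enumerate(tokens):
        targets.append(last_seen[tok] + 1 if tok in last_seen else None)
        last_seen[tok] = t
    return targets

# [A][B]...[A]: at the second A, the head attends to the B that followed
# the first A, so the predicted next token is B.
print(induction_targets(["A", "B", "C", "D", "A"]))  # [None, None, None, None, 1]
```

Comparing a real head's attention matrix against these predicted targets is one standard way to score how "induction-like" that head is.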
This circuit was not merely identified by looking at attention maps — it was verified by ablating each component and measuring the effect on in-context learning performance. The causal verification is what makes it a circuit rather than a correlation.
Probing Classifiers
Probing is a complementary method for investigating what information is encoded in hidden representations. A probing classifier is a simple model, typically linear (often logistic regression), trained to predict some property from frozen hidden states. If the probe achieves high accuracy on held-out examples, the property is linearly decodable from that layer.
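A minimal probing experiment can be sketched with synthetic data standing in for frozen hidden states (in practice `X` would be activations cached from a real model). The property is injected along a random direction, and a logistic-regression probe trained by gradient descent recovers it:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 64  # number of examples and hidden size (illustrative)

# Synthetic "frozen hidden states": a binary property written along one
# direction, plus noise. In practice X would be cached model activations.
labels = rng.integers(0, 2, size=n)
direction = rng.normal(size=d)
X = rng.normal(size=(n, d)) + np.outer(labels * 2 - 1, direction)

# Linear probe: logistic regression trained with plain gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(200):
    p = 1 / (1 + np.exp(-np.clip(X @ w + b, -30, 30)))
    w -= 0.5 * X.T @ (p - labels) / n
    b -= 0.5 * (p - labels).mean()

acc = ((X @ w + b > 0) > 0) == labels if False else ((X @ w + b > 0) == labels)
acc = acc.mean()
print(f"probe accuracy: {acc:.3f}")  # near 1.0 here: the property is linearly decodable
```

High accuracy here tells you the direction exists in the representation; it says nothing about whether any downstream computation reads it.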
What probing has revealed
- Grammatical properties (POS tags, dependency relations) are linearly encoded and peak in early-to-mid layers
- Semantic features and world-knowledge associations appear in later layers
- Coreference information is encoded and accessible linearly across many layers
- Named entity types (person, location, organization) are distinctly encoded
Probing limitations
- High probe accuracy shows the property is present, not that the model uses it for predictions
- A powerful enough probe can extract information the model doesn't use by overfitting to irrelevant correlations
- Probing is correlational, not causal — it does not tell you whether removing the encoded property changes the output
- Probing results vary with probe architecture and training set size; comparisons across papers are unreliable without controlling these
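One standard guard against the overfitting concern is a control task in the spirit of Hewitt & Liang (2019): train the same probe on randomized labels and report the accuracy gap ("selectivity"). The sketch below simplifies their setup to plain label shuffling, on synthetic data with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(1)

def probe_accuracy(X, y, steps=200, lr=0.5):
    """Fit a logistic-regression probe on the first half; report held-out accuracy."""
    split = len(y) // 2
    Xtr, ytr = X[:split], y[:split]
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1 / (1 + np.exp(-np.clip(Xtr @ w + b, -30, 30)))
        w -= lr * Xtr.T @ (p - ytr) / split
        b -= lr * (p - ytr).mean()
    return ((X[split:] @ w + b > 0) == y[split:]).mean()

# Synthetic hidden states with a genuinely encoded binary property:
n, d = 2000, 64
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, d)) + np.outer(y * 2 - 1, rng.normal(size=d))

real_acc = probe_accuracy(X, y)
control_acc = probe_accuracy(X, rng.permutation(y))  # control task: shuffled labels
selectivity = real_acc - control_acc
print(f"real: {real_acc:.2f}  control: {control_acc:.2f}  selectivity: {selectivity:.2f}")
```

A probe that scores well on shuffled labels is memorizing rather than decoding; high selectivity is evidence the property is genuinely encoded, though it still does not establish the model uses it.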
Causal Tracing — Locating Factual Memories
Meng et al. (2022) introduced causal tracing (used in the ROME paper — Rank-One Model Editing) as a method to locate where factual associations are stored in a GPT-style model. Unlike probing, causal tracing is explicitly designed to measure causal influence.
The causal tracing procedure:
- Clean run: run the model on a factual prompt (e.g., "The Eiffel Tower is in") and record all activations
- Corrupted run: corrupt the subject token ("Eiffel Tower" → random noise) and record the now-wrong activations
- Restoration run: run the corrupted model but restore activations from the clean run at one specific (layer, position) at a time
- Measurement: if restoring that one activation recovers the correct output, that (layer, position) is causally responsible for storing the fact
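The four steps above can be illustrated end to end on a toy model in which we control where the "subject" information flows: positions evolve independently except that, at one layer, the last position reads from the subject position. This is a deliberately artificial stand-in for a transformer (no attention, no real facts), but restoring activations one (layer, position) at a time reproduces the logic of causal tracing:

```python
import numpy as np

rng = np.random.default_rng(0)
LAYERS, POS, d = 3, 3, 8  # layers, positions, hidden size (toy scale)
Ws = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(LAYERS)]

def forward(x, restore=None, clean_acts=None):
    """Toy forward pass over (position, hidden) states. Positions evolve
    independently except at layer 1, where the last position reads from
    position 1 (the 'subject'), mimicking attention moving subject info.
    If restore=(layer, pos) is given, that single activation is patched in
    from the clean run's cache before the mixing step."""
    h = x.copy()
    acts = []
    for layer in range(LAYERS):
        h = np.tanh(h @ Ws[layer])
        if restore is not None and restore[0] == layer:
            h[restore[1]] = clean_acts[layer][restore[1]]
        acts.append(h.copy())
        if layer == 1:
            h[2] = h[2] + h[1]  # subject information flows to the last position
    return h[2], acts  # the "output" is read at the last position

clean_x = rng.normal(size=(POS, d))
corrupt_x = clean_x.copy()
corrupt_x[1] = rng.normal(size=d)  # corrupt the subject position

clean_out, clean_acts = forward(clean_x)
corrupt_out, _ = forward(corrupt_x)
gap = np.linalg.norm(corrupt_out - clean_out)

# Restoration runs: patch one (layer, position) at a time, measure recovery.
recovery = {}
for layer in range(LAYERS):
    for pos in range(POS):
        out, _ = forward(corrupt_x, restore=(layer, pos), clean_acts=clean_acts)
        recovery[(layer, pos)] = 1 - np.linalg.norm(out - clean_out) / gap
        print(layer, pos, round(recovery[(layer, pos)], 2))
```

The high-recovery sites are the subject position in the layers before the information moves, and the last position afterward, which mirrors the shape of the heatmaps reported in the ROME paper.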
The finding: factual associations are stored in the MLP modules at mid layers, specifically at the last token of the subject span. Attention layers redistribute information but do not appear to be the primary storage location. This motivated the ROME and MEMIT model-editing methods, which directly rewrite MLP weight matrices to update stored facts.
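The editing step this localization motivates can be sketched as a rank-one weight update. Real ROME solves a least-squares problem weighted by an estimated key covariance; the simplified version below (all matrices synthetic) just enforces that the edited weight maps a chosen key `k_star` to a chosen value `v_star`:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
W = rng.normal(size=(d, d))  # stand-in for an MLP projection matrix

k_star = rng.normal(size=d)  # "key": representation of the subject
v_star = rng.normal(size=d)  # "value": representation encoding the new fact

# Simplified rank-one edit: choose delta so that (W + delta) @ k_star == v_star.
# Real ROME additionally weights the update by an estimated key covariance,
# which limits how much the edit disturbs other keys; that term is omitted here.
delta = np.outer(v_star - W @ k_star, k_star) / (k_star @ k_star)
W_edited = W + delta

print(np.allclose(W_edited @ k_star, v_star))  # True: the stored association is rewritten
```

Because `delta` is an outer product, the edit has rank one, which is what keeps the intervention surgical: only inputs correlated with `k_star` are meaningfully affected.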
What causal tracing adds over probing
- Explicitly tests causal influence, not just correlation
- Identifies the specific layer and token position responsible — not just "somewhere in the model"
- Supports downstream interventions (model editing) based on the localization finding
Known limitations
- Corruption via random noise is a strong intervention — results may not generalize to natural input variation
- Later work (e.g., Hase et al., 2023) showed that factual recall is distributed across layers and that the site causal tracing highlights is not always where editing works best; causal tracing locates a primary site, not an exclusive one
- Different facts may have different storage patterns
Method Comparison
| Method | Causal? | What it reveals | Main limitation |
|---|---|---|---|
| Attention visualization | No | Which tokens are attended to; qualitative patterns | Does not indicate causal influence on output |
| Probing classifiers | No | What properties are linearly decodable from representations | Presence ≠ usage; can overfit to artifacts |
| Activation patching | Yes | Which components causally mediate a specific behavior | Computationally expensive; intervention strength affects results |
| Causal tracing (ROME) | Yes | Which layer and token position store a given fact | Strong corruption assumption; underestimates distributed storage |
Checklist: Do You Understand This?
- Can you describe what a first-token attention sink is, and why it appears even when position 0 is irrelevant to the prediction?
- What is the core argument of Jain & Wallace (2019), and why does it matter for using attention as an explanation?
- Explain the two-layer induction circuit in your own words. What does K-composition mean in this context?
- What does a probing classifier test? Why does high probe accuracy not mean the model uses that information?
- Describe the causal tracing procedure from the ROME paper. What did it find about where factual associations are stored?
- If you had to pick one method to investigate whether a specific token causally influenced a model's output, which would you use and why?