Attention Visualization & Probing
Attention weights are the most directly accessible internal signal in a transformer model — they are produced as part of the forward pass and can be read without any additional tooling. This accessibility made attention visualization the first widespread interpretability technique. But accessibility turned out to be misleading: attention weights alone do not tell you why a model made a prediction. This page covers what visualization and probing actually reveal, where they fall short, and how causal tracing fills some of the gaps.
What Attention Patterns Look Like
An attention heatmap plots the attention weight matrix for a single head at a single layer. Rows are query positions (output tokens), columns are key positions (input tokens), and the cell value is the proportion of attention each query allocates to each key. Several qualitatively distinct patterns appear across heads in real models:
| Pattern type | Visual signature | Likely role |
|---|---|---|
| Diagonal | Weight concentrated on adjacent tokens (previous or next) | Local context; positional smoothing |
| First-token sink | Most positions attend strongly to position 0 (often [BOS]) | Attention sink — "null" allocation when no useful position exists |
| Induction pattern | Each token attends to the token that followed the last occurrence of the current token | In-context pattern completion; few-shot learning mechanism |
| Syntactic heads | Verbs attend to their subjects; nouns attend to modifiers | Grammatical agreement and dependency tracking |
| Copy heads | Exact diagonal — each token attends only to itself | Straight-through token copying for verbatim repetition |
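As a concrete reference for what these heatmaps plot, here is a minimal NumPy sketch that computes causal scaled dot-product attention weights for one head and two of the pattern statistics one might scan for. All shapes and values are illustrative, not taken from a real model:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 8, 16  # sequence length and head dimension (illustrative sizes)

# Toy query/key matrices for one head; in practice these come from the model.
Q = rng.normal(size=(T, d))
K = rng.normal(size=(T, d))

# Scaled dot-product attention weights with a causal mask:
scores = Q @ K.T / np.sqrt(d)
causal = np.tril(np.ones((T, T), dtype=bool))
scores = np.where(causal, scores, -np.inf)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Two simple statistics used when scanning heads for patterns:
sink_score = weights[1:, 0].mean()            # mass on position 0 (first-token sink)
prev_score = np.diagonal(weights, -1).mean()  # mass on the previous token (diagonal head)
print(f"sink: {sink_score:.2f}  previous-token: {prev_score:.2f}")
```

Each row of `weights` is a probability distribution over earlier positions; heatmap tools such as BertViz render exactly this matrix, one head at a time.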
Tools such as BertViz (for BERT-style models) and TransformerLens (for GPT-style models) render these heatmaps interactively. They are genuinely useful for qualitative exploration and hypothesis formation. The failure mode is treating the visualization as an explanation.
Attention Is Not Explanation
Jain & Wallace (2019) published the landmark critique. They tested two things: whether attention weights agree with other measures of token importance, and whether very different attention distributions would change the model's predictions. The results were negative on both counts:
What the experiments showed
- Attention weights and gradient-based attribution (itself an approximation of influence, not a perfect causal measure) frequently disagreed — same prediction, different "important" tokens
- Permuted and adversarially constructed attention distributions, very different from the learned weights, left the output nearly unchanged
- High attention to a token doesn't mean the model's prediction changes if that token is removed
- Different randomly-seeded trained models show different attention patterns for identical predictions
Why this matters
- Attention weights aggregate over value vectors — what matters is the weighted combination of values, not the weights alone
- A token with low attention weight can still inject critical information through a high-magnitude value vector
- Multiple heads operate in parallel — one head's attention pattern is only a fraction of the computation at any layer
- Attention is an intermediate representation, not a direct window onto model reasoning
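A two-position toy example makes the first two points concrete: the head's output is the attention-weighted sum of value vectors, so a token holding only 10% of the attention can still dominate the output if its value vector is large. All numbers below are invented for illustration:

```python
import numpy as np

# Attention heavily favors position 0...
weights = np.array([0.9, 0.1])

# ...but position 1 carries a much larger value vector.
values = np.array([
    [0.1, 0.1],    # value at position 0 (small magnitude)
    [20.0, 20.0],  # value at position 1 (large magnitude)
])

head_output = weights @ values       # 0.9*0.1 + 0.1*20.0 = 2.09 per coordinate
contrib = weights[:, None] * values  # per-position contribution to the output

print(head_output)  # position 1 supplies ~2.0 of the 2.09 despite 10% attention
```

Reading the heatmap alone, position 0 looks ten times more important; reading the weighted contributions, it is the other way around.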
Induction Heads — A Verified Example
Induction heads are the clearest example of a circuit that was identified through attention analysis and then mechanistically verified. They appear reliably in transformers with at least two layers, from tiny attention-only research models up through GPT-2 scale and beyond. The circuit is a two-layer composition:
The induction circuit step by step:
- A previous-token head in layer 1 attends to the immediately preceding token and writes information about it into the residual stream at the current position
- An induction head in layer 2 uses K-composition: its keys are computed from the layer-1 output, so it effectively attends to positions where the previous token matched the current query token
- Result: the induction head attends to the token that follows the most recent previous occurrence of the current token, and copies that token's value forward as the prediction
- This implements in-context pattern completion: if the sequence contains [A][B]...[A], the induction head predicts [B]
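The attention pattern an idealized induction head produces can be computed directly from the token sequence. The sketch below (plain Python; the helper name is ours, not a library function) returns, for each position, the position such a head would attend to:

```python
def induction_targets(tokens):
    """For each position t, return the position an idealized induction head
    would attend to: one past the most recent earlier occurrence of tokens[t].
    None means the token has not appeared before, so the head has no target."""
    last_seen = {}
    targets = []
    for t, tok in enumerate(tokens):
        targets.append(last_seen[tok] + 1 if tok in last_seen else None)
        last_seen[tok] = t
    return targets

# [A][B]...[A]: at the second A, the head attends to the B that followed
# the first A, so the predicted next token is B.
print(induction_targets(["A", "B", "C", "D", "A"]))  # [None, None, None, None, 1]
```

Comparing a real head's attention matrix against these predicted targets is one standard way to score how "induction-like" that head is.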
This circuit was not merely identified by looking at attention maps — it was verified by ablating each component and measuring the effect on in-context learning performance. The causal verification is what makes it a circuit rather than a correlation.
Probing Classifiers
Probing is a complementary method for investigating what information is encoded in hidden representations. A probing classifier is a simple model, typically linear (often logistic regression), trained to predict some property from frozen hidden states. If the probe achieves high accuracy on held-out examples, the property is linearly decodable from that layer.
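A minimal probing experiment can be sketched with synthetic data standing in for frozen hidden states (in practice `X` would be activations cached from a real model). The property is injected along a random direction, and a logistic-regression probe trained by gradient descent recovers it:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 64  # number of examples and hidden size (illustrative)

# Synthetic "frozen hidden states": a binary property written along one
# direction, plus noise. In practice X would be cached model activations.
labels = rng.integers(0, 2, size=n)
direction = rng.normal(size=d)
X = rng.normal(size=(n, d)) + np.outer(labels * 2 - 1, direction)

# Linear probe: logistic regression trained with plain gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(200):
    p = 1 / (1 + np.exp(-np.clip(X @ w + b, -30, 30)))
    w -= 0.5 * X.T @ (p - labels) / n
    b -= 0.5 * (p - labels).mean()

acc = ((X @ w + b > 0) > 0) == labels if False else ((X @ w + b > 0) == labels)
acc = acc.mean()
print(f"probe accuracy: {acc:.3f}")  # near 1.0 here: the property is linearly decodable
```

High accuracy here tells you the direction exists in the representation; it says nothing about whether any downstream computation reads it.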
What probing has revealed
- Grammatical properties (POS tags, dependency relations) are linearly encoded and peak in early-to-mid layers
- Semantic features and world-knowledge associations appear in later layers
- Coreference information is encoded and accessible linearly across many layers
- Named entity types (person, location, organization) are distinctly encoded
Probing limitations
- High probe accuracy shows the property is present, not that the model uses it for predictions
- A powerful enough probe can extract information the model doesn't use by overfitting to irrelevant correlations
- Probing is correlational, not causal — it does not tell you whether removing the encoded property changes the output
- Probing results vary with probe architecture and training set size; comparisons across papers are unreliable without controlling these
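One standard guard against the overfitting concern is a control task in the spirit of Hewitt & Liang (2019): train the same probe on randomized labels and report the accuracy gap ("selectivity"). The sketch below simplifies their setup to plain label shuffling, on synthetic data with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(1)

def probe_accuracy(X, y, steps=200, lr=0.5):
    """Fit a logistic-regression probe on the first half; report held-out accuracy."""
    split = len(y) // 2
    Xtr, ytr = X[:split], y[:split]
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1 / (1 + np.exp(-np.clip(Xtr @ w + b, -30, 30)))
        w -= lr * Xtr.T @ (p - ytr) / split
        b -= lr * (p - ytr).mean()
    return ((X[split:] @ w + b > 0) == y[split:]).mean()

# Synthetic hidden states with a genuinely encoded binary property:
n, d = 2000, 64
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, d)) + np.outer(y * 2 - 1, rng.normal(size=d))

real_acc = probe_accuracy(X, y)
control_acc = probe_accuracy(X, rng.permutation(y))  # control task: shuffled labels
selectivity = real_acc - control_acc
print(f"real: {real_acc:.2f}  control: {control_acc:.2f}  selectivity: {selectivity:.2f}")
```

A probe that scores well on shuffled labels is memorizing rather than decoding; high selectivity is evidence the property is genuinely encoded, though it still does not establish the model uses it.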
Causal Tracing — Locating Factual Memories
Meng et al. (2022) introduced causal tracing (used in the ROME paper — Rank-One Model Editing) as a method to locate where factual associations are stored in a GPT-style model. Unlike probing, causal tracing is explicitly designed to measure causal influence.
The causal tracing procedure:
- Clean run: run the model on a factual prompt (e.g., "The Eiffel Tower is in") and record all activations
- Corrupted run: corrupt the subject token ("Eiffel Tower" → random noise) and record the now-wrong activations
- Restoration run: run the corrupted model but restore activations from the clean run at one specific (layer, position) at a time
- Measurement: if restoring that one activation recovers the correct output, that (layer, position) is causally responsible for storing the fact
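The four steps above can be illustrated end to end on a toy model in which we control where the "subject" information flows: positions evolve independently except that, at one layer, the last position reads from the subject position. This is a deliberately artificial stand-in for a transformer (no attention, no real facts), but restoring activations one (layer, position) at a time reproduces the logic of causal tracing:

```python
import numpy as np

rng = np.random.default_rng(0)
LAYERS, POS, d = 3, 3, 8  # layers, positions, hidden size (toy scale)
Ws = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(LAYERS)]

def forward(x, restore=None, clean_acts=None):
    """Toy forward pass over (position, hidden) states. Positions evolve
    independently except at layer 1, where the last position reads from
    position 1 (the 'subject'), mimicking attention moving subject info.
    If restore=(layer, pos) is given, that single activation is patched in
    from the clean run's cache before the mixing step."""
    h = x.copy()
    acts = []
    for layer in range(LAYERS):
        h = np.tanh(h @ Ws[layer])
        if restore is not None and restore[0] == layer:
            h[restore[1]] = clean_acts[layer][restore[1]]
        acts.append(h.copy())
        if layer == 1:
            h[2] = h[2] + h[1]  # subject information flows to the last position
    return h[2], acts  # the "output" is read at the last position

clean_x = rng.normal(size=(POS, d))
corrupt_x = clean_x.copy()
corrupt_x[1] = rng.normal(size=d)  # corrupt the subject position

clean_out, clean_acts = forward(clean_x)
corrupt_out, _ = forward(corrupt_x)
gap = np.linalg.norm(corrupt_out - clean_out)

# Restoration runs: patch one (layer, position) at a time, measure recovery.
recovery = {}
for layer in range(LAYERS):
    for pos in range(POS):
        out, _ = forward(corrupt_x, restore=(layer, pos), clean_acts=clean_acts)
        recovery[(layer, pos)] = 1 - np.linalg.norm(out - clean_out) / gap
        print(layer, pos, round(recovery[(layer, pos)], 2))
```

The high-recovery sites are the subject position in the layers before the information moves, and the last position afterward, which mirrors the shape of the heatmaps reported in the ROME paper.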
The finding: factual associations are stored in the MLP modules at mid layers, specifically at the last token of the subject span. Attention layers redistribute information but do not appear to be the primary storage location. This motivated the ROME and MEMIT model-editing methods, which directly rewrite MLP weight matrices to update stored facts.
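The editing step this localization motivates can be sketched as a rank-one weight update. Real ROME solves a least-squares problem weighted by an estimated key covariance; the simplified version below (all matrices synthetic) just enforces that the edited weight maps a chosen key `k_star` to a chosen value `v_star`:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
W = rng.normal(size=(d, d))  # stand-in for an MLP projection matrix

k_star = rng.normal(size=d)  # "key": representation of the subject
v_star = rng.normal(size=d)  # "value": representation encoding the new fact

# Simplified rank-one edit: choose delta so that (W + delta) @ k_star == v_star.
# Real ROME additionally weights the update by an estimated key covariance,
# which limits how much the edit disturbs other keys; that term is omitted here.
delta = np.outer(v_star - W @ k_star, k_star) / (k_star @ k_star)
W_edited = W + delta

print(np.allclose(W_edited @ k_star, v_star))  # True: the stored association is rewritten
```

Because `delta` is an outer product, the edit has rank one, which is what keeps the intervention surgical: only inputs correlated with `k_star` are meaningfully affected.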
What causal tracing adds over probing
- Explicitly tests causal influence, not just correlation
- Identifies the specific layer and token position responsible — not just "somewhere in the model"
- Supports downstream interventions (model editing) based on the localization finding
Known limitations
- Corruption via random noise is a strong intervention — results may not generalize to natural input variation
- Later work (e.g., Hase et al., 2023) showed that factual recall is distributed across layers and that the site causal tracing highlights is not always where editing works best; causal tracing locates a primary site, not an exclusive one
- Different facts may have different storage patterns
Method Comparison
| Method | Causal? | What it reveals | Main limitation |
|---|---|---|---|
| Attention visualization | No | Which tokens are attended to; qualitative patterns | Does not indicate causal influence on output |
| Probing classifiers | No | What properties are linearly decodable from representations | Presence ≠ usage; can overfit to artifacts |
| Activation patching | Yes | Which components causally mediate a specific behavior | Computationally expensive; intervention strength affects results |
| Causal tracing (ROME) | Yes | Which layer and token position store a given fact | Strong corruption assumption; underestimates distributed storage |
Checklist: Do You Understand This?
- Can you describe what a first-token attention sink is, and why it appears even when position 0 is irrelevant to the prediction?
- What is the core argument of Jain & Wallace (2019), and why does it matter for using attention as an explanation?
- Explain the two-layer induction circuit in your own words. What does K-composition mean in this context?
- What does a probing classifier test? Why does high probe accuracy not mean the model uses that information?
- Describe the causal tracing procedure from the ROME paper. What did it find about where factual associations are stored?
- If you had to pick one method to investigate whether a specific token causally influenced a model's output, which would you use and why?