Interpretability
Understanding what neural networks are computing — from mechanistic circuit analysis to sparse autoencoders, probing classifiers, and what Anthropic's research has found.
In This Section
Mechanistic Interpretability — What & Why
Features, circuits, superposition, and the goal of reverse-engineering model algorithms.
Attention Visualization & Probing
Induction heads, probing classifiers, causal tracing, and the limits of attention weights as explanations.
Circuits & Features in LLMs
Induction circuits, GPT-2 IOI task, monosemanticity, and safety-relevant features.