Interpretability
Understanding what neural networks are computing — from mechanistic circuit analysis to sparse autoencoders, probing classifiers, and what Anthropic's research has found.
In This Section
Mechanistic Interpretability — What & Why
Features, circuits, superposition, and the goal of reverse-engineering model algorithms.
Attention Visualization & Probing
Induction heads, probing classifiers, causal tracing, and the limits of attention weights as explanations.
Circuits & Features in LLMs
Induction circuits, GPT-2 IOI task, monosemanticity, and safety-relevant features.