Transformer Architecture
The transformer architecture underlies every modern LLM. This section covers the core attention mechanism, positional encodings, the major architectural variants, and the full transformer block.
In This Section
Attention Mechanism — The Core Idea
Queries, keys, values, scaled dot-product attention, and Flash Attention.
Multi-Head Self-Attention
Why multiple heads, what each learns, GQA, and MQA.
The Full Transformer Block
LayerNorm, FFN, residuals, SwiGLU, and pre-norm vs post-norm.
Positional Encoding & Variants
Sinusoidal, learned, RoPE, ALiBi, and how context length extension works.
Transformer Variants — BERT, GPT, T5
Encoder-only, decoder-only, encoder-decoder — the three paradigms.
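As a preview of the core mechanism the first subsection covers, here is a minimal NumPy sketch of single-head scaled dot-product attention. Names and shapes are illustrative only, not taken from any particular library:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (seq_q, seq_k) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the key axis
    return weights @ V                            # attention-weighted sum of values

# Toy example: 4 query/key positions, head dimension 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Each output row is a convex combination of the value vectors, with the mixing weights determined by query-key similarity; the attention mechanism subsection develops this in detail.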