Transformer Architecture
The transformer architecture underlies every modern LLM. This section covers the core attention mechanism, positional encodings, the major architectural variants, and the full transformer block.
In This Section
Attention Mechanism — The Core Idea
Queries, keys, values, scaled dot-product attention, and Flash Attention.
Multi-Head Self-Attention
Why multiple heads, what each learns, GQA, and MQA.
The Full Transformer Block
LayerNorm, FFN, residuals, SwiGLU, and pre-norm vs post-norm.
Positional Encoding & Variants
Sinusoidal, learned, RoPE, ALiBi, and how context length extension works.
Transformer Variants — BERT, GPT, T5
Encoder-only, decoder-only, encoder-decoder — the three paradigms.
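As a preview of the core mechanism the first subsection covers, here is a minimal NumPy sketch of single-head scaled dot-product attention. Names and shapes are illustrative only, not taken from any particular library:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (seq_q, seq_k) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the key axis
    return weights @ V                            # attention-weighted sum of values

# Toy example: 4 query/key positions, head dimension 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Each output row is a convex combination of the value vectors, with the mixing weights determined by query-key similarity; the attention mechanism subsection develops this in detail.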