Llama 3 — Architecture & Design Choices
Meta's Llama series is arguably the most consequential open-weight model family released to date. By releasing model weights publicly, Meta enabled a global ecosystem of researchers, fine-tuners, and product builders who otherwise could not afford to train foundation models. Understanding Llama's architectural choices reveals the design decisions shared by most serious open-weight models of 2024–2025.
Llama 1 (2023) — Democratizing Open Research
Llama 1 came in four sizes: 7B, 13B, 33B, and 65B parameters. All were trained exclusively on publicly available data — Common Crawl, C4, GitHub, Wikipedia, Books, ArXiv, and StackExchange — avoiding the proprietary or licensed datasets used by closed-model labs.
The training followed the Chinchilla scaling insight: rather than training a very large model on relatively few tokens (as GPT-3 did with 175B parameters on roughly 300B tokens), Llama 1 trained smaller models on far more data (1–1.4 trillion tokens). DeepMind's Chinchilla paper showed that for a fixed compute budget, you get better performance by training a smaller model on more tokens — their 70B Chinchilla model outperformed the 280B Gopher trained with comparable compute. Llama 1 in fact trained *past* the Chinchilla-optimal point, spending extra training compute to get smaller models that are cheaper to serve at inference time.
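The trade-off can be sketched with two rules of thumb from the Chinchilla paper — training compute C ≈ 6·N·D FLOPs for N parameters and D tokens, and a compute-optimal ratio of roughly 20 tokens per parameter. The constants are approximations, not exact values from the paper's fitted scaling laws:

```python
def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    """Approximate compute-optimal (params, tokens) for a FLOP budget.

    Rules of thumb (approximations):
      - training compute C ~= 6 * N * D   (N params, D tokens)
      - compute-optimal data  D ~= 20 * N
    """
    n_params = (compute_flops / (6 * 20)) ** 0.5
    return n_params, 20 * n_params

# GPT-3's approximate training budget: 175B params on ~300B tokens.
gpt3_flops = 6 * 175e9 * 300e9            # ~3.15e23 FLOPs
n_opt, d_opt = chinchilla_optimal(gpt3_flops)
print(f"{n_opt / 1e9:.0f}B params on {d_opt / 1e12:.1f}T tokens")
# → 51B params on 1.0T tokens
```

For GPT-3's budget, the rule of thumb prescribes roughly a 51B model trained on about 1T tokens — a much smaller model seeing far more data, which is exactly the direction Llama 1 took.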
Meta released Llama 1 weights for research use. Within days the model was running on consumer laptops via llama.cpp, and the open-source fine-tuning community exploded — Alpaca, Vicuna, WizardLM, and dozens of derivatives followed.
Llama 2 (2023) — RLHF and Broader Commercial Use
Llama 2 extended the range to 7B, 13B, and 70B (a 34B variant was trained but never publicly released). Two major changes:
1. **RLHF alignment.** Llama-2-chat variants were trained with Reinforcement Learning from Human Feedback — the same pipeline as InstructGPT. This produced aligned, instruction-following models competitive with proprietary assistants for many tasks.
2. **Broader license.** Llama 2's license permitted commercial use for companies with under 700M monthly active users. This was a significant expansion from Llama 1's research-only restriction, enabling product companies to build on the weights directly.
Architecturally, Llama 2 70B introduced Grouped Query Attention (GQA) for inference efficiency. Smaller Llama 2 models still used standard multi-head attention. The 70B model's GQA was a preview of what Llama 3 would apply universally.
Llama 3 (2024) — Scale, Vocabulary, and Universal GQA
Llama 3 came in 8B and 70B (later 405B) sizes. Every architectural decision was upgraded from Llama 2:
| Feature | Llama 2 | Llama 3 |
|---|---|---|
| Vocabulary size | 32,000 tokens | 128,000 tokens (4× larger) |
| Context length | 4,096 tokens | 8,192 → 128K (Llama 3.1) |
| GQA | 70B only | All sizes |
| Positional embeddings | RoPE | RoPE (extended scaling) |
| Training tokens | 2T tokens | 15T+ tokens |
| Training hardware (405B) | — | 16,000 H100 GPUs |
The 128K vocabulary is a notable change. More tokens in the vocabulary means common words and subwords tokenize to fewer tokens — better token efficiency, especially for code, math, and non-English languages. Llama 2's 32K vocabulary was on the small side even for its era; 128K matches the scale used by recent frontier models.
The 405B model was trained on 15.6 trillion tokens — an unprecedentedly large training run for an open-weight model. The data mix was approximately 50% general web content, 25% code, and 25% math and science documents, with extensive deduplication and quality filtering.
Grouped Query Attention (GQA) — Why It Matters
Standard Multi-Head Attention (MHA) maintains a separate key (K) and value (V) matrix per attention head. During inference, these K/V pairs must be stored in the KV cache for each token in the context window. Memory consumption grows with the number of heads and the sequence length.
For a 70B-class model with 80 layers, 64 attention heads per layer, and a 128K context window, the MHA KV cache can exceed the memory occupied by the model weights themselves — a major practical bottleneck for long-context inference.
Grouped Query Attention (GQA) solves this by sharing K/V heads across groups of query heads. If you have 32 query heads split into 8 groups, each group shares 1 K/V pair — instead of 32 K/V pairs per layer, you need only 8. Memory for the KV cache drops proportionally, enabling much longer contexts on the same hardware with minimal quality degradation.
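A back-of-the-envelope KV cache calculation makes the savings concrete. The sketch below assumes a Llama-3-70B-like configuration (80 layers, 64 query heads, head dimension 128, 8 KV heads under GQA, fp16 cache); treat the exact per-model numbers as approximations:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    # Leading factor of 2 accounts for storing both K and V per layer;
    # fp16 is 2 bytes per element.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

SEQ = 128 * 1024  # 128K-token context

mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, seq_len=SEQ)
gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=SEQ)

print(f"MHA: {mha / 2**30:.0f} GiB, GQA: {gqa / 2**30:.0f} GiB "
      f"({mha // gqa}x smaller)")
# → MHA: 320 GiB, GQA: 40 GiB (8x smaller)
```

One 128K-token sequence would need roughly 320 GiB of fp16 KV cache under full MHA — more than the ~140 GB the fp16 weights of a 70B model occupy — while GQA with 8 KV heads cuts that to about 40 GiB, an 8× reduction at the same context length.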
- **MHA (Multi-Head Attention):** 1 K/V head per query head. Maximum expressiveness, maximum KV cache memory. GPT-1/2/3, early Llama.
- **GQA (Grouped Query Attention):** 1 K/V head per group of query heads. Balance between expressiveness and memory. Llama 3, Mistral, Qwen, Gemma — the current standard.
- **MQA (Multi-Query Attention):** 1 shared K/V head for all query heads. Maximum memory savings, some quality degradation. Used by early Falcon, PaLM.
RoPE — Rotary Position Embeddings
Llama uses RoPE (Rotary Position Embeddings) instead of the absolute learned positional embeddings used by GPT-1/2/3. RoPE encodes position by rotating the query and key vectors in the complex plane by an angle proportional to the token's position index.
The key advantage: the dot product of a RoPE-rotated query and key depends only on the *relative offset* between their positions, not on the absolute positions themselves. This makes RoPE a form of relative positional encoding, which generalizes better to sequence lengths longer than those seen during training (with appropriate scaling adjustments like YaRN or RoPE frequency scaling).
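The relative-offset property can be verified numerically. Below is a minimal NumPy sketch of RoPE — rotating consecutive 2D pairs of a vector by angles proportional to position, with the conventional base of 10000 (the pairing of dimensions here is one common convention, not the only one):

```python
import numpy as np

def rope(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Rotate each consecutive 2D pair of x by pos * theta_i (RoPE)."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # per-pair rotation frequencies
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(64), rng.standard_normal(64)

# The Q·K attention score depends only on the relative offset (7 here),
# not on where the pair sits in the sequence:
a = rope(q, 3) @ rope(k, 10)        # positions 3 and 10
b = rope(q, 500) @ rope(k, 507)     # positions 500 and 507
print(np.allclose(a, b))  # → True
```

Because each pair is rotated by a pure rotation matrix, the score between a query at position m and a key at position n reduces to a function of n − m, which is exactly the relative-encoding behavior described above.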
Virtually every open-weight model released in 2024–2025 uses RoPE: Llama, Mistral, Qwen, Gemma, DeepSeek, Phi. It has become the de facto standard for transformer language models.
Llama 3.1, 3.2, 3.3 — Multimodal and Edge Variants
**Llama 3.1:** Extended context to 128K tokens across 8B, 70B, and 405B, and added tool use (function calling). The 405B model became a genuine frontier-class open model, competitive with GPT-4 on many benchmarks.

**Llama 3.2:** Added vision (multimodal) variants at 11B and 90B, plus 1B and 3B edge variants designed for on-device deployment — small enough to run on mobile-class chips for many tasks.

**Llama 3.3:** A 70B-only release with improved instruction tuning and tool use. It achieved performance comparable to Llama 3.1 405B on most benchmarks at a fraction of the inference cost — a compelling cost-quality trade-off for production use.
Checklist: Do You Understand This?
- What is Chinchilla-optimal training, and how does it differ from the approach GPT-3 took?
- Why did Llama's open-weight release matter for the broader AI ecosystem?
- What is Grouped Query Attention (GQA), and how does it reduce KV cache memory requirements?
- Explain the difference between MHA, GQA, and MQA in terms of the number of K/V heads per query head group.
- Why does RoPE generalize better to longer contexts than learned absolute positional embeddings?
- What changed between Llama 2 and Llama 3 in vocabulary size, and why does vocabulary size affect model efficiency?
- What capabilities did Llama 3.2 add that earlier Llama models lacked?