Llama 3 — Architecture & Design Choices
Meta's Llama series is arguably the most consequential open-weight model family released to date. By releasing model weights publicly, Meta enabled a global ecosystem of researchers, fine-tuners, and product builders who otherwise could not afford to train foundation models. Understanding Llama's architectural choices reveals the design decisions shared by most serious open-weight models of 2024–2025.
Llama 1 (2023) — Democratizing Open Research
Llama 1 came in four sizes: 7B, 13B, 33B, and 65B parameters. All were trained exclusively on publicly available data — Common Crawl, C4, GitHub, Wikipedia, Books, ArXiv, and StackExchange — avoiding the proprietary or licensed datasets used by closed-model labs.
The training followed the Chinchilla scaling insight: rather than training a very large model on relatively few tokens (as GPT-3 did with 175B parameters on roughly 300B tokens), Llama 1 trained smaller models on far more data (1–1.4 trillion tokens). DeepMind's Chinchilla paper showed that for a fixed compute budget, you get better performance by training a smaller model on more tokens — their 70B Chinchilla model outperformed the 280B Gopher trained with comparable compute. Llama 1 in fact trained *past* the Chinchilla-optimal point, spending extra training compute to get smaller models that are cheaper to serve at inference time.
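The trade-off can be sketched with two rules of thumb from the Chinchilla paper — training compute C ≈ 6·N·D FLOPs for N parameters and D tokens, and a compute-optimal ratio of roughly 20 tokens per parameter. The constants are approximations, not exact values from the paper's fitted scaling laws:

```python
def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    """Approximate compute-optimal (params, tokens) for a FLOP budget.

    Rules of thumb (approximations):
      - training compute C ~= 6 * N * D   (N params, D tokens)
      - compute-optimal data  D ~= 20 * N
    """
    n_params = (compute_flops / (6 * 20)) ** 0.5
    return n_params, 20 * n_params

# GPT-3's approximate training budget: 175B params on ~300B tokens.
gpt3_flops = 6 * 175e9 * 300e9            # ~3.15e23 FLOPs
n_opt, d_opt = chinchilla_optimal(gpt3_flops)
print(f"{n_opt / 1e9:.0f}B params on {d_opt / 1e12:.1f}T tokens")
# → 51B params on 1.0T tokens
```

For GPT-3's budget, the rule of thumb prescribes roughly a 51B model trained on about 1T tokens — a much smaller model seeing far more data, which is exactly the direction Llama 1 took.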
Meta released Llama 1 weights for research use. Within days the model was running on consumer laptops via llama.cpp, and the open-source fine-tuning community exploded — Alpaca, Vicuna, WizardLM, and dozens of derivatives followed.
Llama 2 (2023) — RLHF and Broader Commercial Use
Llama 2 extended the range to 7B, 13B, and 70B (a 34B variant was trained but never publicly released). Two major changes:
1. **RLHF alignment.** Llama-2-chat variants were trained with Reinforcement Learning from Human Feedback — the same pipeline as InstructGPT. This produced aligned, instruction-following models competitive with proprietary assistants for many tasks.
2. **Broader license.** Llama 2's license permitted commercial use for companies with under 700M monthly active users. This was a significant expansion from Llama 1's research-only restriction, enabling product companies to build on the weights directly.
Architecturally, Llama 2 70B introduced Grouped Query Attention (GQA) for inference efficiency. Smaller Llama 2 models still used standard multi-head attention. The 70B model's GQA was a preview of what Llama 3 would apply universally.
Llama 3 (2024) — Scale, Vocabulary, and Universal GQA
Llama 3 came in 8B and 70B (later 405B) sizes. Every architectural decision was upgraded from Llama 2:
| Feature | Llama 2 | Llama 3 |
|---|---|---|
| Vocabulary size | 32,000 tokens | 128,000 tokens (4× larger) |
| Context length | 4,096 tokens | 8,192 → 128K (Llama 3.1) |
| GQA | 70B only | All sizes |
| Positional embeddings | RoPE | RoPE (extended scaling) |
| Training tokens | 2T tokens | 15T+ tokens |
| Training hardware (405B) | — | 16,000 H100 GPUs |
The 128K vocabulary is a notable change. More tokens in the vocabulary means common words and subwords tokenize to fewer tokens — better token efficiency, especially for code, math, and non-English languages. Llama 2's 32K vocabulary was on the small side even for its era; 128K matches the scale used by recent frontier models.
The 405B model was trained on 15.6 trillion tokens — an unprecedentedly large training run for an open-weight model. The data mix was approximately 50% general web content, 25% code, and 25% math and science documents, with extensive deduplication and quality filtering.
Grouped Query Attention (GQA) — Why It Matters
Standard Multi-Head Attention (MHA) maintains a separate key (K) and value (V) matrix per attention head. During inference, these K/V pairs must be stored in the KV cache for each token in the context window. Memory consumption grows with the number of heads and the sequence length.
For a 70B-class model with 80 layers, 64 attention heads per layer, and a 128K context window, the MHA KV cache can exceed the memory occupied by the model weights themselves — a major practical bottleneck for long-context inference.
Grouped Query Attention (GQA) solves this by sharing K/V heads across groups of query heads. If you have 32 query heads split into 8 groups, each group shares 1 K/V pair — instead of 32 K/V pairs per layer, you need only 8. Memory for the KV cache drops proportionally, enabling much longer contexts on the same hardware with minimal quality degradation.
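A back-of-the-envelope KV cache calculation makes the savings concrete. The sketch below assumes a Llama-3-70B-like configuration (80 layers, 64 query heads, head dimension 128, 8 KV heads under GQA, fp16 cache); treat the exact per-model numbers as approximations:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    # Leading factor of 2 accounts for storing both K and V per layer;
    # fp16 is 2 bytes per element.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

SEQ = 128 * 1024  # 128K-token context

mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, seq_len=SEQ)
gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=SEQ)

print(f"MHA: {mha / 2**30:.0f} GiB, GQA: {gqa / 2**30:.0f} GiB "
      f"({mha // gqa}x smaller)")
# → MHA: 320 GiB, GQA: 40 GiB (8x smaller)
```

One 128K-token sequence would need roughly 320 GiB of fp16 KV cache under full MHA — more than the ~140 GB the fp16 weights of a 70B model occupy — while GQA with 8 KV heads cuts that to about 40 GiB, an 8× reduction at the same context length.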
- **MHA (Multi-Head Attention):** 1 K/V head per query head. Maximum expressiveness, maximum KV cache memory. GPT-1/2/3, early Llama.
- **GQA (Grouped Query Attention):** 1 K/V head per group of query heads. Balance between expressiveness and memory. Llama 3, Mistral, Qwen, Gemma — the current standard.
- **MQA (Multi-Query Attention):** 1 shared K/V head for all query heads. Maximum memory savings, some quality degradation. Used by early Falcon, PaLM.
RoPE — Rotary Position Embeddings
Llama uses RoPE (Rotary Position Embeddings) instead of the absolute learned positional embeddings used by GPT-1/2/3. RoPE encodes position by rotating the query and key vectors in the complex plane by an angle proportional to the token's position index.
The key advantage: the dot product of a RoPE-rotated query and key depends only on the *relative offset* between their positions, not on the absolute positions themselves. This makes RoPE a form of relative positional encoding, which generalizes better to sequence lengths longer than those seen during training (with appropriate scaling adjustments like YaRN or RoPE frequency scaling).
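The relative-offset property can be verified numerically. Below is a minimal NumPy sketch of RoPE — rotating consecutive 2D pairs of a vector by angles proportional to position, with the conventional base of 10000 (the pairing of dimensions here is one common convention, not the only one):

```python
import numpy as np

def rope(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Rotate each consecutive 2D pair of x by pos * theta_i (RoPE)."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # per-pair rotation frequencies
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(64), rng.standard_normal(64)

# The Q·K attention score depends only on the relative offset (7 here),
# not on where the pair sits in the sequence:
a = rope(q, 3) @ rope(k, 10)        # positions 3 and 10
b = rope(q, 500) @ rope(k, 507)     # positions 500 and 507
print(np.allclose(a, b))  # → True
```

Because each pair is rotated by a pure rotation matrix, the score between a query at position m and a key at position n reduces to a function of n − m, which is exactly the relative-encoding behavior described above.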
Virtually every open-weight model released in 2024–2025 uses RoPE: Llama, Mistral, Qwen, Gemma, DeepSeek, Phi. It has become the de facto standard for transformer language models.
Llama 3.1, 3.2, 3.3 — Multimodal and Edge Variants
**Llama 3.1:** Extended context to 128K tokens across 8B, 70B, and 405B, and added tool use (function calling). The 405B model became a genuine frontier-class open model, competitive with GPT-4 on many benchmarks.

**Llama 3.2:** Added vision (multimodal) variants at 11B and 90B, plus 1B and 3B edge variants designed for on-device deployment — small enough to run on mobile-class chips for many tasks.

**Llama 3.3:** A 70B-only release with improved instruction tuning and tool use. It achieved performance comparable to Llama 3.1 405B on most benchmarks at a fraction of the inference cost — a compelling cost-quality trade-off for production use.
Checklist: Do You Understand This?
- What is Chinchilla-optimal training, and how does it differ from the approach GPT-3 took?
- Why did Llama's open-weight release matter for the broader AI ecosystem?
- What is Grouped Query Attention (GQA), and how does it reduce KV cache memory requirements?
- Explain the difference between MHA, GQA, and MQA in terms of the number of K/V heads per query head group.
- Why does RoPE generalize better to longer contexts than learned absolute positional embeddings?
- What changed between Llama 2 and Llama 3 in vocabulary size, and why does vocabulary size affect model efficiency?
- What capabilities did Llama 3.2 add that earlier Llama models lacked?