🧠 All Things AI

Llama 3 — Architecture & Design Choices

Meta's Llama series is the most consequential open-weight model family in AI history. By releasing model weights publicly, Meta enabled a global ecosystem of researchers, fine-tuners, and product builders who otherwise could not afford to train foundation models. Understanding Llama's architectural choices reveals the design decisions that define most serious open-weight models in 2024–2025.

Llama 1 (2023) — Democratizing Open Research

Llama 1 came in four sizes: 7B, 13B, 33B, and 65B parameters. All were trained exclusively on publicly available data — Common Crawl, C4, GitHub, Wikipedia, Books, ArXiv, and StackExchange — avoiding the proprietary or licensed datasets used by closed-model labs.

Training was guided by DeepMind's Chinchilla scaling laws: rather than training a huge model on too few tokens (as GPT-3 did), Llama 1 trained smaller models on far more tokens (1–1.4 trillion). Chinchilla's insight was that for a fixed compute budget, you get better performance by training a smaller model on more data, roughly 20 tokens per parameter at the optimum. Llama 1 in fact trained well past that point, spending extra training compute to get a smaller, cheaper-to-serve model. A Chinchilla-optimal 7B model significantly outperforms a 175B model trained with the same FLOP budget.
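The compute arithmetic can be sketched with the standard C ≈ 6·N·D approximation (training FLOPs ≈ 6 × parameters × tokens) and Chinchilla's roughly-20-tokens-per-parameter heuristic. The numbers below are illustrative back-of-the-envelope figures, not Meta's actual accounting:

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute via the common C ~ 6*N*D rule of thumb."""
    return 6 * n_params * n_tokens

def chinchilla_optimal_tokens(n_params: float) -> float:
    """Chinchilla's headline heuristic: ~20 training tokens per parameter."""
    return 20 * n_params

# A 7B model at the Chinchilla-optimal token count:
n = 7e9
d_opt = chinchilla_optimal_tokens(n)   # 1.4e11 tokens (140B)
budget = training_flops(n, d_opt)      # ~5.9e21 FLOPs

# The same FLOP budget spent on a 175B model buys far fewer tokens:
d_175b = budget / (6 * 175e9)          # ~5.6e9 tokens, ~0.03 tokens/param
print(f"{d_opt:.2e} tokens for 7B vs {d_175b:.2e} tokens for 175B")
```

Note that Llama 1's 7B model actually saw over 1 trillion tokens, far beyond the ~140B-token Chinchilla optimum, because inference cost (not training cost) dominates once a model is deployed.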

Meta released Llama 1 weights for research use. Within days the model was running on consumer laptops via llama.cpp, and the open-source fine-tuning community exploded — Alpaca, Vicuna, WizardLM, and dozens of derivatives followed.

Llama 2 (2023) — RLHF and Broader Commercial Use

Llama 2 extended the range to 7B, 13B, 34B, and 70B (the 34B model was trained but never publicly released). Two major changes:

RLHF-Trained Chat Models

Llama-2-chat variants were trained with Reinforcement Learning from Human Feedback — the same pipeline as InstructGPT. This produced aligned, instruction-following models competitive with proprietary assistants for many tasks.

Commercial Use Allowed

Llama 2's license permitted commercial use for companies with under 700M monthly active users. This was a significant expansion from Llama 1's research-only restriction, enabling product companies to build on the weights directly.

Architecturally, Llama 2 70B introduced Grouped Query Attention (GQA) for inference efficiency. Smaller Llama 2 models still used standard multi-head attention. The 70B model's GQA was a preview of what Llama 3 would apply universally.

Llama 3 (2024) — Scale, Vocabulary, and Universal GQA

Llama 3 came in 8B and 70B (later 405B) sizes. Every architectural decision was upgraded from Llama 2:

Feature                  | Llama 2       | Llama 3
-------------------------|---------------|---------------------------
Vocabulary size          | 32,000 tokens | 128,000 tokens (4× larger)
Context length           | 4,096 tokens  | 8,192 → 128K (Llama 3.1)
GQA                      | 70B only      | All sizes
Positional embeddings    | RoPE          | RoPE (extended scaling)
Training tokens          | 2T            | 15T+
Training compute (405B)  | —             | 16,000 H100 GPUs

The 128K vocabulary is a notable change. More tokens in the vocabulary means common words and subwords tokenize to fewer tokens — better token efficiency, especially for code, math, and non-English languages. Llama 2's 32K vocabulary was on the small side even for its era; 128K matches the scale used by recent frontier models.
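One way to see why a larger vocabulary helps: more merged pieces in the vocabulary means the same text can be covered by fewer, longer tokens. The toy greedy longest-match tokenizer below is a simplified stand-in for real BPE, and both vocabularies are invented purely for illustration:

```python
def greedy_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match tokenization (a toy stand-in for BPE)."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):       # try the longest piece first
            piece = text[i:j]
            if piece in vocab or j == i + 1:    # single chars always allowed
                tokens.append(piece)
                i = j
                break
    return tokens

small_vocab = {"def", " to", "ken", "ize"}                     # few merges
large_vocab = small_vocab | {" tokenize", "greedy_", "(text"}  # more merges

text = "def greedy_tokenize(text"
print(len(greedy_tokenize(text, small_vocab)))  # 18 tokens
print(len(greedy_tokenize(text, large_vocab)))  # 8 tokens
```

The larger vocabulary covers the identifier and code punctuation with multi-character pieces, so the same string costs fewer tokens — the same effect that makes a 128K vocabulary more token-efficient on code and non-English text.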

The 405B model was trained on 15.6 trillion tokens — an unprecedentedly large training run for an open-weight model. Per Meta's Llama 3 paper, the final data mix was roughly 50% general-knowledge web content, 25% mathematical and reasoning data, 17% code, and 8% multilingual text, with extensive deduplication and quality filtering.

Grouped Query Attention (GQA) — Why It Matters

Standard Multi-Head Attention (MHA) maintains a separate key (K) and value (V) matrix per attention head. During inference, these K/V pairs must be stored in the KV cache for each token in the context window. Memory consumption grows with the number of heads and the sequence length.

For a 70B-class model with standard multi-head attention and a 128K context window, the KV cache for even a single long sequence can exceed the memory occupied by the model weights themselves — a major practical bottleneck for long-context inference.

Grouped Query Attention (GQA) solves this by sharing K/V heads across groups of query heads. If you have 32 query heads split into 8 groups, each group shares 1 K/V pair — instead of 32 K/V pairs per layer, you need only 8. Memory for the KV cache drops proportionally, enabling much longer contexts on the same hardware with minimal quality degradation.
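The saving is easy to quantify. The sketch below assumes roughly Llama-3-70B-like hyperparameters (80 layers, 64 query heads, 8 KV heads, head dimension 128) with fp16 cache entries; the exact figures are illustrative:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV cache size: both K and V (hence the factor 2) are stored per layer,
    per KV head, per position, per head dimension."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed roughly Llama-3-70B-like hyperparameters:
layers, q_heads, kv_heads, head_dim, ctx = 80, 64, 8, 128, 128_000

mha = kv_cache_bytes(layers, q_heads, head_dim, ctx)   # one K/V head per query head
gqa = kv_cache_bytes(layers, kv_heads, head_dim, ctx)  # 8 shared K/V heads

print(f"MHA: {mha / 2**30:.1f} GiB, GQA: {gqa / 2**30:.1f} GiB")
# MHA: 312.5 GiB, GQA: 39.1 GiB
```

Under these assumptions a single full-context sequence would need over 300 GiB of KV cache with MHA — more than twice the ~140 GB the fp16 weights themselves occupy — while GQA with 8 KV heads cuts that by the full 8× ratio of query heads to KV heads.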

MHA (Multi-Head Attention)

1 K/V head per query head. Maximum expressiveness, maximum KV cache memory. GPT-1/2/3, early Llama.

GQA (Grouped Query Attention)

1 K/V head per group of query heads. Balance between expressiveness and memory. Llama 3, Mistral, Qwen, Gemma — the current standard.

MQA (Multi-Query Attention)

1 shared K/V head for all query heads. Maximum memory savings, some quality degradation. Used by early Falcon, PaLM.
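To make the three variants concrete, here is a minimal NumPy sketch of grouped-query attention (single layer, no causal mask, names invented for illustration). Setting `n_kv_heads == n_q_heads` recovers MHA, and `n_kv_heads == 1` recovers MQA:

```python
import numpy as np

def grouped_query_attention(q, k, v, n_q_heads, n_kv_heads):
    """Attention where groups of query heads share K/V heads.
    q: (seq, n_q_heads, d); k, v: (seq, n_kv_heads, d)."""
    group = n_q_heads // n_kv_heads
    # Repeat each K/V head so every query head has a (shared) partner.
    k = np.repeat(k, group, axis=1)            # (seq, n_q_heads, d)
    v = np.repeat(v, group, axis=1)
    d = q.shape[-1]
    # Per-head attention scores: (heads, seq_q, seq_k)
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return np.einsum("hqk,khd->qhd", weights, v)     # back to (seq, heads, d)

rng = np.random.default_rng(0)
seq, d, n_q, n_kv = 4, 8, 8, 2        # 8 query heads in 2 groups of 4
q = rng.normal(size=(seq, n_q, d))
k = rng.normal(size=(seq, n_kv, d))   # the cache holds only 2 K/V heads, not 8
v = rng.normal(size=(seq, n_kv, d))
out = grouped_query_attention(q, k, v, n_q, n_kv)
print(out.shape)  # (4, 8, 8)
```

The key point is that only the 2-head `k` and `v` tensors would live in the KV cache; the repeat to 8 heads happens transiently at compute time (real implementations avoid even that materialization).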

RoPE — Rotary Position Embeddings

Llama uses RoPE (Rotary Position Embeddings) instead of the absolute learned positional embeddings used by GPT-1/2/3. RoPE encodes position by rotating the query and key vectors in the complex plane by an angle proportional to the token's position index.

The key advantage: the dot product of two RoPE-rotated vectors naturally encodes their relative position — the farther apart two tokens are, the more their rotated Q/K vectors diverge. This makes RoPE a form of relative positional encoding, which generalizes better to sequence lengths longer than those seen during training (with appropriate scaling adjustments like YaRN or RoPE frequency scaling).
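That relative-position property can be checked numerically. The sketch below applies the standard RoPE rotation (paired dimensions, one frequency per pair, `base**(-2i/d)`) and verifies that the Q·K dot product depends only on the offset between positions, not on their absolute values:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs of dimensions of x by angles pos * theta_i."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # one frequency per dim pair
    ang = pos * theta
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin        # 2-D rotation of each pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Dot products depend only on the relative offset (here 3), not position:
a = rope(q, 5) @ rope(k, 2)
b = rope(q, 105) @ rope(k, 102)
print(np.isclose(a, b))  # True
```

Because the rotation is orthogonal it also preserves vector norms, so RoPE changes only the relative phase between queries and keys — which is exactly why long-context extensions like frequency scaling can stretch it after training.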

Virtually every open-weight model released in 2024–2025 uses RoPE: Llama, Mistral, Qwen, Gemma, DeepSeek, Phi. It has become the de facto standard for transformer language models.

Llama 3.1, 3.2, 3.3 — Multimodal and Edge Variants

Llama 3.1 (2024)

Extended context to 128K tokens across 8B, 70B, and 405B. Added tool use (function calling). The 405B model became a genuine frontier-class open model, competitive with GPT-4 on many benchmarks.

Llama 3.2 (2024)

Added vision (multimodal) variants at 11B and 90B. Added 1B and 3B edge variants designed for on-device deployment — efficient enough to run on mobile chips without quantization for many tasks.

Llama 3.3 (2024)

70B-only release with improved instruction tuning and tool use. Achieved performance comparable to Llama 3.1 405B on most benchmarks at a fraction of the inference cost — a compelling cost-quality trade-off for production use.

Checklist: Do You Understand This?

  • What is Chinchilla-optimal training, and how does it differ from the approach GPT-3 took?
  • Why did Llama's open-weight release matter for the broader AI ecosystem?
  • What is Grouped Query Attention (GQA), and how does it reduce KV cache memory requirements?
  • Explain the difference between MHA, GQA, and MQA in terms of the number of K/V heads per query head group.
  • Why does RoPE generalize better to longer contexts than learned absolute positional embeddings?
  • What changed between Llama 2 and Llama 3 in vocabulary size, and why does vocabulary size affect model efficiency?
  • What capabilities did Llama 3.2 add that earlier Llama models lacked?