DeepSeek Architecture & Training
DeepSeek is a Chinese AI research lab (a subsidiary of High-Flyer Capital Management) that released a series of open-weight models in 2024–2025 that stunned the industry with their combination of frontier-class capability and dramatically lower training cost. DeepSeek-V3 and DeepSeek-R1 reset expectations about how much compute is actually required to produce models competitive with GPT-4 and o1.
Their results do not come primarily from finding cheaper compute; they reflect genuine architectural and training innovations that make the same compute go further. Understanding these innovations is important for any engineer working at scale on LLM infrastructure.
DeepSeek-V2 (2024) — MLA and Efficient MoE
DeepSeek-V2 has 236 billion total parameters and 21 billion active per token. It introduced two major innovations: Multi-head Latent Attention (MLA) and a new MoE design called DeepSeekMoE. The reported training cost was approximately $5 million, compared to estimates of $100 million or more for GPT-4.
Multi-head Latent Attention (MLA)
The KV cache is one of the most significant memory bottlenecks in LLM inference. For a model serving many users with long contexts simultaneously, the KV cache can consume most available GPU memory, limiting batch sizes and therefore throughput.
Standard Multi-Head Attention stores the full K and V matrices for every layer and every token in the context. For a large model at long context, this is enormous. MLA addresses this with a low-rank bottleneck compression: instead of storing K and V directly, compress them through a small latent vector and reconstruct K/V from it at attention time.
Standard MHA:
- Store: K ∈ [seq_len, n_heads, head_dim] and V ∈ [seq_len, n_heads, head_dim] per layer.
- For 128K context, 60 layers, 128 heads at head_dim 128 in FP16: roughly 500 GB per sequence just for KV cache.

MLA:
- Store: a latent vector c ∈ [seq_len, latent_dim] per layer, where latent_dim << n_heads × head_dim.
- DeepSeek reports a 5–13× KV cache reduction versus standard MHA: same context, far less memory per sequence.
The tradeoff: at attention time, K and V must be reconstructed from the compressed latent via an up-projection — a small additional compute cost. But the memory savings are so large (enabling much larger batch sizes and longer contexts) that the net effect is a significant inference throughput improvement.
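The memory arithmetic can be checked with a quick sketch. The model shape follows the example above (60 layers, 128 heads of dimension 128 at 128K context); the latent dimension of 512 is an assumption, loosely based on the MLA latent size reported for DeepSeek-V2, so treat the exact ratio as illustrative:

```python
# Back-of-envelope KV-cache sizes: standard MHA vs. MLA.
# latent_dim = 512 is an assumed value; real MLA also caches a few extra
# decoupled-RoPE key dims per token, so reported reductions are smaller.

def mha_kv_bytes(seq_len, n_layers, n_heads, head_dim, bytes_per_elem=2):
    # K and V each store [seq_len, n_heads, head_dim] per layer.
    return 2 * seq_len * n_layers * n_heads * head_dim * bytes_per_elem

def mla_kv_bytes(seq_len, n_layers, latent_dim, bytes_per_elem=2):
    # Only the compressed latent c of shape [seq_len, latent_dim] per layer.
    return seq_len * n_layers * latent_dim * bytes_per_elem

seq, L, H, D = 128 * 1024, 60, 128, 128
mha = mha_kv_bytes(seq, L, H, D)
mla = mla_kv_bytes(seq, L, latent_dim=512)
print(f"MHA: {mha / 2**30:.0f} GiB, MLA: {mla / 2**30:.1f} GiB, ratio {mha / mla:.0f}x")
# → MHA: 480 GiB, MLA: 7.5 GiB, ratio 64x
```

The raw latent-to-full-KV ratio overstates the savings relative to the reported 5–13× figure, which accounts for the additional cached components in the real architecture.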
DeepSeekMoE — Finer-Grained Experts
Standard MoE (as in Mixtral) uses a relatively small number of large experts (e.g., 8 experts, each the full FFN size). DeepSeekMoE uses a different strategy: many small experts (fine-grained expert segmentation). Instead of 8 full-size FFN experts, use 64 experts at 1/8 the FFN size each, activating several (e.g., K=6) per token.
Using many smaller experts provides two benefits:
- Finer specialization: more expert "slots" means routing can achieve finer specialization. Different tokens can be routed to different combinations of micro-experts, giving more expressive routing than a handful of large experts.
- Better load balancing: with many small experts, each expert covers a smaller conceptual domain. This makes it easier to achieve uniform load balancing, since there are more distinct niches for experts to specialize into, reducing competition.
DeepSeekMoE also introduces shared experts: a small number of experts that always activate for every token, regardless of routing. These shared experts accumulate general-purpose knowledge, while the routed experts specialize. This mirrors how a generalist FFN behaves, but preserves the efficiency benefits of sparse routing.
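A minimal numpy sketch of this layer design, combining top-k routing over many small experts with always-on shared experts. The expert counts echo the illustrative numbers above; all dimensions, initializations, and the single-token interface are toy choices, not DeepSeek's actual configuration:

```python
import numpy as np

# Toy DeepSeekMoE-style layer: 64 small routed experts (top-6 active)
# plus 2 shared experts that run for every token.
rng = np.random.default_rng(0)
d_model, n_routed, n_shared, top_k = 16, 64, 2, 6
expert_dim = 8  # each routed expert is a small fraction of a dense FFN

def make_expert():
    return (rng.normal(size=(d_model, expert_dim)) * 0.1,
            rng.normal(size=(expert_dim, d_model)) * 0.1)

routed = [make_expert() for _ in range(n_routed)]
shared = [make_expert() for _ in range(n_shared)]
gate_w = rng.normal(size=(d_model, n_routed)) * 0.1

def moe_forward(x):  # x: [d_model], one token
    logits = x @ gate_w
    top = np.argsort(logits)[-top_k:]          # indices of the top-k experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                   # softmax over selected experts
    out = np.zeros(d_model)
    for w, i in zip(weights, top):             # sparse: only k experts run
        w_in, w_out = routed[i]
        out += w * (np.maximum(x @ w_in, 0) @ w_out)
    for w_in, w_out in shared:                 # shared experts always run
        out += np.maximum(x @ w_in, 0) @ w_out
    return out, top

y, active = moe_forward(rng.normal(size=d_model))
print(y.shape, sorted(active.tolist()))
```

Note that the shared experts bypass the router entirely, which is what lets them accumulate general-purpose knowledge while the routed experts specialize.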
DeepSeek-V3 (2024) — FP8 Training and Multi-Token Prediction
DeepSeek-V3 scales the V2 architecture to 671 billion total parameters and 37 billion active per token. It was trained on 14.8 trillion tokens. The claimed training cost was approximately $6 million. Two new techniques contributed significantly to this efficiency:
FP8 Training
Most large models train in BF16 (16-bit bfloat) or FP32. DeepSeek-V3 used FP8 (8-bit floating point) for most matrix multiplications during training. FP8 requires roughly half the memory bandwidth of BF16, and FP8 Tensor Cores deliver roughly twice the matmul throughput.
The challenge: FP8 has limited dynamic range and precision, causing instability with naive use. DeepSeek developed careful per-tile scaling strategies that maintain training stability. H100 GPUs have native FP8 Tensor Core support — this is what made V3's efficiency leap possible at the hardware level.
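The idea behind per-tile scaling can be sketched as follows. Only the scaling logic is emulated in float32 here; real FP8 kernels also round mantissa bits and run on Tensor Cores. The 128×128 tile size is an illustrative choice, and E4M3_MAX is the largest finite value in the FP8 E4M3 format:

```python
import numpy as np

# Sketch: compute one scale factor per tile so each tile's absmax maps
# into FP8 range, instead of one global scale for the whole tensor.
E4M3_MAX = 448.0   # largest finite value in the FP8 E4M3 format
TILE = 128         # illustrative tile size

def per_tile_scales(a):
    """One scale per TILE x TILE block: block absmax maps to E4M3_MAX."""
    scales = {}
    for i in range(0, a.shape[0], TILE):
        for j in range(0, a.shape[1], TILE):
            tile = a[i:i+TILE, j:j+TILE]
            scales[(i, j)] = max(float(np.abs(tile).max()) / E4M3_MAX, 1e-12)
    return scales

rng = np.random.default_rng(0)
a = rng.normal(size=(256, 256)).astype(np.float32)
a[0, 0] = 1000.0   # an outlier inflates only its own tile's scale
scales = per_tile_scales(a)
# With a single global scale, this outlier would crush the precision of
# every other value in the tensor; per-tile scaling isolates the damage.
print(scales[(0, 0)], scales[(128, 128)])
```

This is why fine-grained scaling matters for FP8 stability: activation outliers are common in large transformers, and a tensor-wide scale would waste FP8's already-limited dynamic range on them.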
Multi-Token Prediction (MTP)
Standard language model training predicts one next token per position. Multi-token prediction trains the model to predict the next K tokens simultaneously from each position, using K separate prediction heads, each trained on a different offset.
Why it helps training: each forward pass produces K times the gradient signal for nearly the same compute, and the model learns longer-range dependencies more efficiently. DeepSeek uses one additional prediction head during training, then discards it at inference: no inference cost, denser training signal.
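A toy numpy sketch of the loss structure: head k is trained to predict the token k positions ahead, so one forward pass yields supervision at multiple offsets. All sizes and names here are illustrative, and V3's real MTP modules are lightweight transformer blocks rather than plain linear heads:

```python
import numpy as np

# Toy multi-token-prediction loss: head k targets tokens[i + k].
rng = np.random.default_rng(0)
vocab, seq, n_heads = 50, 10, 2   # 2 heads: predict next and next-next token

hidden = rng.normal(size=(seq, 8))                       # trunk activations
heads = [rng.normal(size=(8, vocab)) * 0.1 for _ in range(n_heads)]
tokens = rng.integers(0, vocab, size=seq)

def mtp_loss(hidden, tokens):
    total, count = 0.0, 0
    for k, head in enumerate(heads, start=1):
        logits = hidden @ head                           # [seq, vocab]
        for i in range(len(tokens) - k):
            p = np.exp(logits[i] - logits[i].max())      # stable softmax
            p /= p.sum()
            total += -np.log(p[tokens[i + k]])           # cross-entropy at offset k
            count += 1
    return total / count

loss = mtp_loss(hidden, tokens)
print(f"mean MTP loss: {loss:.3f}")
```

At inference only the k=1 head (the standard next-token head) is kept, which is why the extra signal is free at serving time.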
DeepSeek-R1 (2025) — Reasoning via GRPO
DeepSeek-R1 is a reasoning model — like OpenAI's o1/o3, it solves complex math, coding, and logical problems by generating extended chains of reasoning before producing a final answer. What makes R1 architecturally notable is how it was trained to reason.
OpenAI's o1 training details are undisclosed. DeepSeek published R1's approach: GRPO (Group Relative Policy Optimization), a variant of reinforcement learning that avoids the need for a separate trained reward model.
Standard RLHF with PPO:
- Train a separate reward model on human preferences
- Use that reward model to score model outputs
- PPO updates the model weights to maximize reward
- Requires two large models in memory simultaneously
- Reward model quality caps the final model quality

GRPO:
- Sample a group of outputs for each prompt
- Score each with a rule-based verifier (math: is the answer correct?)
- Normalize scores within the group: better than average = positive reward
- No separate reward model required; no PPO critic network
- Lower memory footprint, simpler training pipeline
The rule-based verifier is the key enabler: for math and code, there is an objective ground truth — the answer is either correct or not. You do not need human preferences to train a reasoning model if you can verify answers automatically. GRPO exploits this, making it a more scalable and cheaper approach for verifiable domains.
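The group-normalization step above can be sketched in a few lines. The exact-match verifier and the epsilon constant are toy choices for illustration, not DeepSeek's actual implementation:

```python
import numpy as np

# GRPO's core reward shaping: score a group of sampled answers with a
# rule-based verifier, then normalize rewards within the group so that
# better-than-average answers get positive advantage.

def verify(answer: str, ground_truth: str) -> float:
    # Toy rule-based verifier: exact string match on the final answer.
    return 1.0 if answer.strip() == ground_truth.strip() else 0.0

def group_advantages(rewards):
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)  # epsilon guards all-same groups

# One prompt, a group of 4 sampled answers; ground truth is "42".
group = ["42", "41", "42", "7"]
rewards = [verify(a, "42") for a in group]
adv = group_advantages(rewards)
print(rewards, np.round(adv, 2))
```

The group mean plays the role PPO's learned critic would otherwise play as a baseline, which is why no value network is needed.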
DeepSeek also released distilled versions of R1 — R1-7B, R1-14B, and R1-32B — produced by fine-tuning smaller open models on reasoning traces generated by the full 671B R1. The 14B and 32B distilled models match or beat GPT-4o on many reasoning benchmarks, making them among the most cost-effective reasoning models available for local deployment.
Training Cost Breakthrough — What It Actually Means
DeepSeek's claim of ~$6M training cost for a GPT-4-class model made headlines. Context is important:
What the figure covers: the final training run on 14.8T tokens using their FP8 pipeline. What it omits: infrastructure R&D, failed training runs, data curation, the earlier V2 work that informed V3, and researcher salaries.
What it does demonstrate: the architectural innovations (MLA, DeepSeekMoE, FP8 training, multi-token prediction) compound to produce a step-change in training efficiency. The algorithmic work is real, and frontier-quality models are no longer exclusively the domain of labs with $100M+ compute budgets.
Checklist: Do You Understand This?
- What problem does Multi-head Latent Attention (MLA) solve, and how does it compress the KV cache?
- In DeepSeekMoE, why does using many small experts rather than a few large experts reduce expert collapse?
- What are shared experts in DeepSeekMoE, and why are they useful?
- Why does FP8 training reduce compute and memory bandwidth compared to BF16? What hardware feature enables it on H100s?
- Explain multi-token prediction: what does the model predict during training, and why is it discarded at inference?
- What is GRPO, and how does it differ from PPO-based RLHF in terms of the reward signal source?
- Why is a rule-based verifier sufficient for training a reasoning model on math and code tasks?
- What aspects of the $6M training cost claim are real, and what costs does it omit?