🧠 All Things AI
Advanced

DeepSeek Architecture & Training

DeepSeek is a Chinese AI research lab (a subsidiary of High-Flyer Capital Management) that released a series of open-weight models in 2024–2025 that stunned the industry with their combination of frontier-class capability and dramatically lower training cost. DeepSeek-V3 and DeepSeek-R1 reset expectations about how much compute is actually required to produce models competitive with GPT-4 and o1.

Their results are not primarily from finding cheaper compute — they reflect genuine architectural and training innovations that make the same compute go further. Understanding these innovations is important for any engineer working at scale on LLM infrastructure.

DeepSeek-V2 (2024) — MLA and Efficient MoE

DeepSeek-V2 has 236 billion total parameters and 21 billion active per token. It introduced two major innovations: Multi-head Latent Attention (MLA) and a new MoE design called DeepSeekMoE. DeepSeek reported that V2 cut training costs by 42.5% relative to their earlier 67B dense model while improving quality — an early signal of the efficiency gains that V3 would later push much further.

Multi-head Latent Attention (MLA)

The KV cache is one of the most significant memory bottlenecks in LLM inference. For a model serving many users with long contexts simultaneously, the KV cache can consume most available GPU memory, limiting batch sizes and therefore throughput.

Standard Multi-Head Attention stores the full K and V matrices for every layer and every token in the context. For a large model at long context, this is enormous. MLA addresses this with a low-rank bottleneck compression: instead of storing K and V directly, compress them through a small latent vector and reconstruct K/V from it at attention time.
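The compress-then-reconstruct idea can be sketched in a few lines of numpy. This is a minimal illustration with made-up dimensions (not DeepSeek's actual configuration), and it omits MLA's decoupled RoPE handling — the point is only to show what gets cached versus what gets rebuilt:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions, not DeepSeek's actual ones.
d_model, n_heads, head_dim = 1024, 16, 64   # n_heads * head_dim = 1024
latent_dim = 128                            # latent_dim << n_heads * head_dim

# Learned projections (random stand-ins here).
W_down = rng.standard_normal((d_model, latent_dim)) * 0.02            # compress
W_up_k = rng.standard_normal((latent_dim, n_heads * head_dim)) * 0.02 # rebuild K
W_up_v = rng.standard_normal((latent_dim, n_heads * head_dim)) * 0.02 # rebuild V

seq_len = 8
h = rng.standard_normal((seq_len, d_model))  # hidden states of 8 cached tokens

# What gets cached: the small latent, not K/V themselves.
c_kv = h @ W_down                            # [seq_len, latent_dim]

# At attention time, K and V are reconstructed from the latent.
K = (c_kv @ W_up_k).reshape(seq_len, n_heads, head_dim)
V = (c_kv @ W_up_v).reshape(seq_len, n_heads, head_dim)

cache_mha = seq_len * n_heads * head_dim * 2  # elements if K and V were cached
cache_mla = seq_len * latent_dim              # elements for the latent only
print(f"cached elements: MHA={cache_mha}, MLA={cache_mla}, "
      f"ratio={cache_mha / cache_mla:.0f}x")  # ratio=16x at these sizes
```

In a real implementation the up-projections can be folded into the query/output projections so that K/V are never fully materialized; the sketch keeps them explicit for clarity.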

MLA vs Standard MHA — KV Cache Comparison
Standard MHA KV Cache

Store: K ∈ [seq_len, n_heads, head_dim] and V ∈ [seq_len, n_heads, head_dim] per layer.

For 128K context, 60 layers, 128 heads at 128 head_dim in FP16: about 3.75 MB per token, or roughly 480 GB per sequence just for KV cache.

MLA Compressed KV Cache

Store: latent vector c ∈ [seq_len, latent_dim] per layer, where latent_dim << n_heads × head_dim.

DeepSeek reports a 93.3% KV cache reduction for V2 relative to their earlier 67B model. Same context, far less memory per sequence.
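The sizing arithmetic is easy to check. The short script below uses the MHA dimensions from the box above and, on the MLA side, DeepSeek-V2's published latent size (d_c = 512) plus its 64-dim decoupled RoPE key per layer — treat the exact ratio as illustrative:

```python
# KV-cache sizing in FP16 (2 bytes per element).
seq_len    = 128 * 1024  # 128K context
n_layers   = 60
n_heads    = 128
head_dim   = 128
bytes_fp16 = 2

# Standard MHA caches K and V: 2 * n_heads * head_dim elements per token per layer.
mha_bytes = seq_len * n_layers * (2 * n_heads * head_dim) * bytes_fp16

# MLA caches the latent (d_c = 512) plus the 64-dim decoupled RoPE key.
mla_bytes = seq_len * n_layers * (512 + 64) * bytes_fp16

print(f"MHA: {mha_bytes / 2**30:.0f} GiB per sequence")   # 480 GiB
print(f"MLA: {mla_bytes / 2**30:.1f} GiB per sequence")   # 8.4 GiB
print(f"reduction: {mha_bytes / mla_bytes:.0f}x")         # ~57x
```

Against full MHA at these dimensions the per-element saving is much larger than the headline 93.3% figure, because that figure compares against DeepSeek's earlier 67B model rather than against pure MHA at V2's scale.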

The tradeoff: at attention time, K and V must be reconstructed from the compressed latent via an up-projection — a small additional compute cost. But the memory savings are so large (enabling much larger batch sizes and longer contexts) that the net effect is a significant inference throughput improvement.

DeepSeekMoE — Finer-Grained Experts

Standard MoE (as in Mixtral) uses a relatively small number of large experts (e.g., 8 experts, each the full FFN size). DeepSeekMoE uses a different strategy: many small experts (fine-grained expert segmentation). Instead of 8 experts at FFN-size, use 64 experts at 1/8 FFN-size each, still activating K=6 per token.

Using many smaller experts provides two benefits:

Better Specialization

More expert "slots" means the routing can achieve finer specialization. Different tokens can be routed to different combinations of micro-experts, giving more expressive routing than a handful of large experts.

Reduced Expert Collapse

With many small experts, each expert covers a smaller conceptual domain. This makes it easier to achieve uniform load balancing — there are more distinct niches for experts to specialize into, reducing competition.

DeepSeekMoE also introduces shared experts: a small number of experts that always activate for every token, regardless of routing. These shared experts accumulate general-purpose knowledge, while the routed experts specialize. This mirrors how a generalist FFN behaves, but preserves the efficiency benefits of sparse routing.
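A toy forward pass makes the shared-plus-routed structure concrete. The sketch below uses single weight matrices as stand-ins for the small expert FFNs, and all sizes are invented for illustration — nothing here matches DeepSeek's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes, not DeepSeek's actual configuration.
d_model  = 64
n_routed = 16   # many small routed experts
n_shared = 2    # always-on shared experts
top_k    = 4    # routed experts activated per token

# Each "expert" is a single weight matrix standing in for a small FFN.
routed_experts = [rng.standard_normal((d_model, d_model)) * 0.02
                  for _ in range(n_routed)]
shared_experts = [rng.standard_normal((d_model, d_model)) * 0.02
                  for _ in range(n_shared)]
router_W = rng.standard_normal((d_model, n_routed)) * 0.02

def moe_layer(x):
    """x: [d_model] for one token. Shared experts always run; top-k routed run."""
    out = sum(x @ W for W in shared_experts)       # shared: bypass the router
    scores = x @ router_W                          # router logits over routed experts
    top = np.argsort(scores)[-top_k:]              # indices of the top-k experts
    gates = np.exp(scores[top]) / np.exp(scores[top]).sum()  # softmax over selected
    for g, idx in zip(gates, top):
        out = out + g * (x @ routed_experts[idx])  # weighted sum of expert outputs
    return out

token = rng.standard_normal(d_model)
y = moe_layer(token)
print(y.shape)  # (64,)
```

Note that per-token compute depends only on `n_shared + top_k`, not on `n_routed` — which is why splitting a few large experts into many small ones changes routing expressiveness without changing the active-parameter budget.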

DeepSeek-V3 (2024) — FP8 Training and Multi-Token Prediction

DeepSeek-V3 scales the V2 architecture to 671 billion total parameters and 37 billion active per token. It was trained on 14.8 trillion tokens. The claimed training cost was approximately $6 million. Two new techniques contributed significantly to this efficiency:

FP8 Training

Most large models train in BF16 (16-bit bfloat) or FP32. DeepSeek-V3 used FP8 (8-bit floating point) for most matrix multiplications during training. FP8 requires roughly half the memory bandwidth and half the compute of BF16 for the same operation.

The challenge: FP8 has limited dynamic range and precision, causing instability with naive use. DeepSeek developed careful per-tile scaling strategies that maintain training stability. H100 GPUs have native FP8 Tensor Core support — this is what made V3's efficiency leap possible at the hardware level.
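The per-tile scaling idea can be sketched without real FP8 hardware. The snippet below simulates only the scaling bookkeeping: each tile's max magnitude is mapped onto the E4M3 representable range (max finite value 448), so small values that would underflow in raw FP8 survive. It deliberately skips the mantissa rounding a real FP8 cast performs, and it is a pedagogical sketch, not DeepSeek's kernel:

```python
import numpy as np

FP8_E4M3_MAX = 448.0   # largest finite value in the e4m3 format

def quantize_tile(tile):
    """Per-tile scaling: map the tile's max magnitude onto the FP8 range."""
    scale = np.abs(tile).max() / FP8_E4M3_MAX
    q = np.clip(tile / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    # A real kernel would now round q to e4m3 (3 mantissa bits); we keep
    # full precision here to show only the scaling bookkeeping.
    return q, scale

def dequantize_tile(q, scale):
    return q * scale

rng = np.random.default_rng(0)
x = rng.standard_normal((128, 128)) * 1e-3   # tiny values: would underflow raw FP8
q, s = quantize_tile(x)
x_hat = dequantize_tile(q, s)

print(np.abs(q).max())        # 448.0: full use of the FP8 dynamic range
print(np.allclose(x, x_hat))  # True (lossless here, since mantissa rounding is skipped)
```

With one scale per tile rather than per tensor, an outlier in one region of a weight or activation matrix cannot crush the precision of every other region — that locality is what keeps FP8 training stable.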

Multi-Token Prediction (MTP)

Standard language model training predicts one next token per position. Multi-token prediction trains the model to predict the next K tokens simultaneously from each position — K separate prediction heads, each trained on a different offset.

Why it helps training: each forward pass produces a denser gradient signal for roughly the same trunk compute, and the model learns longer-range dependencies more efficiently. DeepSeek-V3 uses one additional MTP prediction module during training (predicting one extra future token), then drops it at inference — no inference cost, better training signal density.
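A simplified version of the loss is easy to write down. The sketch below attaches a second linear head predicting the token at offset 2 from the same hidden states; note that DeepSeek-V3's actual MTP module is a small transformer block chained after the trunk, not a bare linear head, so this is a structural illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model, vocab = 10, 32, 50                 # toy sizes

H = rng.standard_normal((T, d_model))          # hidden states from the trunk
tokens = rng.integers(0, vocab, size=T + 2)    # token ids; shifted targets available

W_main = rng.standard_normal((d_model, vocab)) * 0.02  # predicts token t+1
W_mtp  = rng.standard_normal((d_model, vocab)) * 0.02  # extra head: predicts token t+2

def xent(logits, targets):
    """Mean cross-entropy of logits [T, vocab] against integer targets [T]."""
    logits = logits - logits.max(axis=-1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

loss_main = xent(H @ W_main, tokens[1:T + 1])  # standard next-token loss
loss_mtp  = xent(H @ W_mtp,  tokens[2:T + 2])  # offset-2 loss from the same H
loss = loss_main + loss_mtp                    # denser signal per forward pass
print(round(float(loss), 3))
```

At inference, only `W_main` is kept; the MTP head exists purely to enrich the training gradient.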

DeepSeek-R1 (2025) — Reasoning via GRPO

DeepSeek-R1 is a reasoning model — like OpenAI's o1/o3, it solves complex math, coding, and logical problems by generating extended chains of reasoning before producing a final answer. What makes R1 architecturally notable is how it was trained to reason.

OpenAI's o1 training details are undisclosed. DeepSeek published R1's approach: GRPO (Group Relative Policy Optimization), a variant of reinforcement learning that avoids the need for a separate trained reward model.

GRPO vs Standard RLHF (PPO)
Standard PPO (InstructGPT-style RLHF)
  • Train a separate reward model on human preferences
  • Use that reward model to score model outputs
  • PPO updates model weights to maximize reward
  • Requires two large models in memory simultaneously
  • Reward model quality caps the final model quality
GRPO (DeepSeek-R1)
  • Sample a group of outputs for each prompt
  • Score each with a rule-based verifier (math: is the answer correct?)
  • Normalize scores within the group — better than average = positive reward
  • No separate reward model required; no PPO critic network
  • Lower memory footprint, simpler training pipeline

The rule-based verifier is the key enabler: for math and code, there is an objective ground truth — the answer is either correct or not. You do not need human preferences to train a reasoning model if you can verify answers automatically. GRPO exploits this, making it a more scalable and cheaper approach for verifiable domains.
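The core of GRPO's reward computation fits in a few lines: verifier scores for a group of samples are normalized within the group, so "better than the group average" becomes a positive advantage with no learned reward model or critic. A minimal sketch, with a hypothetical group of verifier scores:

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: normalize verifier rewards within a group
    of samples for the same prompt. No reward model, no critic network."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Hypothetical group of 6 sampled answers to one math prompt, scored by a
# rule-based verifier: 1.0 = final answer correct, 0.0 = incorrect.
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0]
adv = grpo_advantages(rewards)
print(adv.round(2))  # correct samples get positive advantage, incorrect negative
```

These advantages then weight a PPO-style clipped policy-gradient update on the sampled completions; the group normalization is what replaces both the reward model and the value function.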

DeepSeek also released distilled versions of R1 — the R1-Distill family, dense models from 1.5B to 70B parameters based on Qwen and Llama — produced by fine-tuning those smaller models on reasoning traces generated by the full R1. The larger distilled models (14B and 32B) match or beat much larger non-reasoning models on many reasoning benchmarks, making them among the most cost-effective reasoning models available for local deployment.

Training Cost Breakthrough — What It Actually Means

DeepSeek's claim of ~$6M training cost for a GPT-4-class model made headlines. Context is important:

What the $6M figure includes

The final training run on 14.8T tokens using their FP8 pipeline. Does not typically include: infrastructure R&D cost, failed training runs, data curation, the earlier V2 work that informed V3, or researcher salaries.

What it genuinely demonstrates

That architectural innovations (MLA, DeepSeekMoE, FP8, MTP) compound to produce a step-change in training efficiency. The algorithmic work is real. Frontier-quality models are not exclusively the domain of labs with $100M+ compute budgets.

Checklist: Do You Understand This?

  • What problem does Multi-head Latent Attention (MLA) solve, and how does it compress the KV cache?
  • In DeepSeekMoE, why does using many small experts rather than a few large experts reduce expert collapse?
  • What are shared experts in DeepSeekMoE, and why are they useful?
  • Why does FP8 training reduce compute and memory bandwidth compared to BF16? What hardware feature enables it on H100s?
  • Explain multi-token prediction: what does the model predict during training, and why is it discarded at inference?
  • What is GRPO, and how does it differ from PPO-based RLHF in terms of the reward signal source?
  • Why is a rule-based verifier sufficient for training a reasoning model on math and code tasks?
  • What aspects of the $6M training cost claim are real, and what costs does it omit?