🧠 All Things AI
Advanced

DeepSeek Architecture & Training

DeepSeek is a Chinese AI research lab (a subsidiary of High-Flyer Capital Management) that released a series of open-weight models in 2024–2025 that stunned the industry with their combination of frontier-class capability and dramatically lower training cost. DeepSeek-V3 and DeepSeek-R1 reset expectations about how much compute is actually required to produce models competitive with GPT-4 and o1.

Their results are not primarily from finding cheaper compute — they reflect genuine architectural and training innovations that make the same compute go further. Understanding these innovations is important for any engineer working at scale on LLM infrastructure.

DeepSeek-V2 (2024) — MLA and Efficient MoE

DeepSeek-V2 has 236 billion total parameters and 21 billion active per token. It introduced two major innovations: Multi-head Latent Attention (MLA) and a new MoE design called DeepSeekMoE. DeepSeek reported that V2 cut training costs by 42.5% relative to their earlier 67B dense model while improving quality — an early signal of the efficiency gains that V3 would later push much further.

Multi-head Latent Attention (MLA)

The KV cache is one of the most significant memory bottlenecks in LLM inference. For a model serving many users with long contexts simultaneously, the KV cache can consume most available GPU memory, limiting batch sizes and therefore throughput.

Standard Multi-Head Attention stores the full K and V matrices for every layer and every token in the context. For a large model at long context, this is enormous. MLA addresses this with a low-rank bottleneck compression: instead of storing K and V directly, compress them through a small latent vector and reconstruct K/V from it at attention time.
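The compress-then-reconstruct idea can be sketched in a few lines of numpy. This is a minimal illustration with made-up dimensions (not DeepSeek's actual configuration), and it omits MLA's decoupled RoPE handling — the point is only to show what gets cached versus what gets rebuilt:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions, not DeepSeek's actual ones.
d_model, n_heads, head_dim = 1024, 16, 64   # n_heads * head_dim = 1024
latent_dim = 128                            # latent_dim << n_heads * head_dim

# Learned projections (random stand-ins here).
W_down = rng.standard_normal((d_model, latent_dim)) * 0.02            # compress
W_up_k = rng.standard_normal((latent_dim, n_heads * head_dim)) * 0.02 # rebuild K
W_up_v = rng.standard_normal((latent_dim, n_heads * head_dim)) * 0.02 # rebuild V

seq_len = 8
h = rng.standard_normal((seq_len, d_model))  # hidden states of 8 cached tokens

# What gets cached: the small latent, not K/V themselves.
c_kv = h @ W_down                            # [seq_len, latent_dim]

# At attention time, K and V are reconstructed from the latent.
K = (c_kv @ W_up_k).reshape(seq_len, n_heads, head_dim)
V = (c_kv @ W_up_v).reshape(seq_len, n_heads, head_dim)

cache_mha = seq_len * n_heads * head_dim * 2  # elements if K and V were cached
cache_mla = seq_len * latent_dim              # elements for the latent only
print(f"cached elements: MHA={cache_mha}, MLA={cache_mla}, "
      f"ratio={cache_mha / cache_mla:.0f}x")  # ratio=16x at these sizes
```

In a real implementation the up-projections can be folded into the query/output projections so that K/V are never fully materialized; the sketch keeps them explicit for clarity.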

MLA vs Standard MHA — KV Cache Comparison
Standard MHA KV Cache

Store: K ∈ [seq_len, n_heads, head_dim] and V ∈ [seq_len, n_heads, head_dim] per layer.

For 128K context, 60 layers, 128 heads at 128 head_dim in FP16: about 3.75 MB per token, or roughly 480 GB per sequence just for KV cache.

MLA Compressed KV Cache

Store: latent vector c ∈ [seq_len, latent_dim] per layer, where latent_dim << n_heads × head_dim.

DeepSeek reports a 93.3% KV cache reduction for V2 relative to their earlier 67B model. Same context, far less memory per sequence.
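The sizing arithmetic is easy to check. The short script below uses the MHA dimensions from the box above and, on the MLA side, DeepSeek-V2's published latent size (d_c = 512) plus its 64-dim decoupled RoPE key per layer — treat the exact ratio as illustrative:

```python
# KV-cache sizing in FP16 (2 bytes per element).
seq_len    = 128 * 1024  # 128K context
n_layers   = 60
n_heads    = 128
head_dim   = 128
bytes_fp16 = 2

# Standard MHA caches K and V: 2 * n_heads * head_dim elements per token per layer.
mha_bytes = seq_len * n_layers * (2 * n_heads * head_dim) * bytes_fp16

# MLA caches the latent (d_c = 512) plus the 64-dim decoupled RoPE key.
mla_bytes = seq_len * n_layers * (512 + 64) * bytes_fp16

print(f"MHA: {mha_bytes / 2**30:.0f} GiB per sequence")   # 480 GiB
print(f"MLA: {mla_bytes / 2**30:.1f} GiB per sequence")   # 8.4 GiB
print(f"reduction: {mha_bytes / mla_bytes:.0f}x")         # ~57x
```

Against full MHA at these dimensions the per-element saving is much larger than the headline 93.3% figure, because that figure compares against DeepSeek's earlier 67B model rather than against pure MHA at V2's scale.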

The tradeoff: at attention time, K and V must be reconstructed from the compressed latent via an up-projection — a small additional compute cost. But the memory savings are so large (enabling much larger batch sizes and longer contexts) that the net effect is a significant inference throughput improvement.

DeepSeekMoE — Finer-Grained Experts

Standard MoE (as in Mixtral) uses a relatively small number of large experts (e.g., 8 experts, each the full FFN size). DeepSeekMoE uses a different strategy: many small experts (fine-grained expert segmentation). Instead of 8 experts at FFN-size, use 64 experts at 1/8 FFN-size each, still activating K=6 per token.

Using many smaller experts provides two benefits:

Better Specialization

More expert "slots" means the routing can achieve finer specialization. Different tokens can be routed to different combinations of micro-experts, giving more expressive routing than a handful of large experts.

Reduced Expert Collapse

With many small experts, each expert covers a smaller conceptual domain. This makes it easier to achieve uniform load balancing — there are more distinct niches for experts to specialize into, reducing competition.

DeepSeekMoE also introduces shared experts: a small number of experts that always activate for every token, regardless of routing. These shared experts accumulate general-purpose knowledge, while the routed experts specialize. This mirrors how a generalist FFN behaves, but preserves the efficiency benefits of sparse routing.
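A toy forward pass makes the shared-plus-routed structure concrete. The sketch below uses single weight matrices as stand-ins for the small expert FFNs, and all sizes are invented for illustration — nothing here matches DeepSeek's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes, not DeepSeek's actual configuration.
d_model  = 64
n_routed = 16   # many small routed experts
n_shared = 2    # always-on shared experts
top_k    = 4    # routed experts activated per token

# Each "expert" is a single weight matrix standing in for a small FFN.
routed_experts = [rng.standard_normal((d_model, d_model)) * 0.02
                  for _ in range(n_routed)]
shared_experts = [rng.standard_normal((d_model, d_model)) * 0.02
                  for _ in range(n_shared)]
router_W = rng.standard_normal((d_model, n_routed)) * 0.02

def moe_layer(x):
    """x: [d_model] for one token. Shared experts always run; top-k routed run."""
    out = sum(x @ W for W in shared_experts)       # shared: bypass the router
    scores = x @ router_W                          # router logits over routed experts
    top = np.argsort(scores)[-top_k:]              # indices of the top-k experts
    gates = np.exp(scores[top]) / np.exp(scores[top]).sum()  # softmax over selected
    for g, idx in zip(gates, top):
        out = out + g * (x @ routed_experts[idx])  # weighted sum of expert outputs
    return out

token = rng.standard_normal(d_model)
y = moe_layer(token)
print(y.shape)  # (64,)
```

Note that per-token compute depends only on `n_shared + top_k`, not on `n_routed` — which is why splitting a few large experts into many small ones changes routing expressiveness without changing the active-parameter budget.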

DeepSeek-V3 (2024) — FP8 Training and Multi-Token Prediction

DeepSeek-V3 scales the V2 architecture to 671 billion total parameters and 37 billion active per token. It was trained on 14.8 trillion tokens. The claimed training cost was approximately $6 million. Two new techniques contributed significantly to this efficiency:

FP8 Training

Most large models train in BF16 (16-bit bfloat) or FP32. DeepSeek-V3 used FP8 (8-bit floating point) for most matrix multiplications during training. FP8 requires roughly half the memory bandwidth and half the compute of BF16 for the same operation.

The challenge: FP8 has limited dynamic range and precision, causing instability with naive use. DeepSeek developed careful per-tile scaling strategies that maintain training stability. H100 GPUs have native FP8 Tensor Core support — this is what made V3's efficiency leap possible at the hardware level.
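The per-tile scaling idea can be sketched without real FP8 hardware. The snippet below simulates only the scaling bookkeeping: each tile's max magnitude is mapped onto the E4M3 representable range (max finite value 448), so small values that would underflow in raw FP8 survive. It deliberately skips the mantissa rounding a real FP8 cast performs, and it is a pedagogical sketch, not DeepSeek's kernel:

```python
import numpy as np

FP8_E4M3_MAX = 448.0   # largest finite value in the e4m3 format

def quantize_tile(tile):
    """Per-tile scaling: map the tile's max magnitude onto the FP8 range."""
    scale = np.abs(tile).max() / FP8_E4M3_MAX
    q = np.clip(tile / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    # A real kernel would now round q to e4m3 (3 mantissa bits); we keep
    # full precision here to show only the scaling bookkeeping.
    return q, scale

def dequantize_tile(q, scale):
    return q * scale

rng = np.random.default_rng(0)
x = rng.standard_normal((128, 128)) * 1e-3   # tiny values: would underflow raw FP8
q, s = quantize_tile(x)
x_hat = dequantize_tile(q, s)

print(np.abs(q).max())        # 448.0: full use of the FP8 dynamic range
print(np.allclose(x, x_hat))  # True (lossless here, since mantissa rounding is skipped)
```

With one scale per tile rather than per tensor, an outlier in one region of a weight or activation matrix cannot crush the precision of every other region — that locality is what keeps FP8 training stable.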

Multi-Token Prediction (MTP)

Standard language model training predicts one next token per position. Multi-token prediction trains the model to predict the next K tokens simultaneously from each position — K separate prediction heads, each trained on a different offset.

Why it helps training: each forward pass produces a denser gradient signal for roughly the same trunk compute, and the model learns longer-range dependencies more efficiently. DeepSeek-V3 uses one additional MTP prediction module during training (predicting one extra future token), then drops it at inference — no inference cost, better training signal density.
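A simplified version of the loss is easy to write down. The sketch below attaches a second linear head predicting the token at offset 2 from the same hidden states; note that DeepSeek-V3's actual MTP module is a small transformer block chained after the trunk, not a bare linear head, so this is a structural illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model, vocab = 10, 32, 50                 # toy sizes

H = rng.standard_normal((T, d_model))          # hidden states from the trunk
tokens = rng.integers(0, vocab, size=T + 2)    # token ids; shifted targets available

W_main = rng.standard_normal((d_model, vocab)) * 0.02  # predicts token t+1
W_mtp  = rng.standard_normal((d_model, vocab)) * 0.02  # extra head: predicts token t+2

def xent(logits, targets):
    """Mean cross-entropy of logits [T, vocab] against integer targets [T]."""
    logits = logits - logits.max(axis=-1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

loss_main = xent(H @ W_main, tokens[1:T + 1])  # standard next-token loss
loss_mtp  = xent(H @ W_mtp,  tokens[2:T + 2])  # offset-2 loss from the same H
loss = loss_main + loss_mtp                    # denser signal per forward pass
print(round(float(loss), 3))
```

At inference, only `W_main` is kept; the MTP head exists purely to enrich the training gradient.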

DeepSeek-R1 (2025) — Reasoning via GRPO

DeepSeek-R1 is a reasoning model — like OpenAI's o1/o3, it solves complex math, coding, and logical problems by generating extended chains of reasoning before producing a final answer. What makes R1 architecturally notable is how it was trained to reason.

OpenAI's o1 training details are undisclosed. DeepSeek published R1's approach: GRPO (Group Relative Policy Optimization), a variant of reinforcement learning that avoids the need for a separate trained reward model.

GRPO vs Standard RLHF (PPO)
Standard PPO (InstructGPT-style RLHF)
  • Train a separate reward model on human preferences
  • Use that reward model to score model outputs
  • PPO updates model weights to maximize reward
  • Requires two large models in memory simultaneously
  • Reward model quality caps the final model quality
GRPO (DeepSeek-R1)
  • Sample a group of outputs for each prompt
  • Score each with a rule-based verifier (math: is the answer correct?)
  • Normalize scores within the group — better than average = positive reward
  • No separate reward model required; no PPO critic network
  • Lower memory footprint, simpler training pipeline

The rule-based verifier is the key enabler: for math and code, there is an objective ground truth — the answer is either correct or not. You do not need human preferences to train a reasoning model if you can verify answers automatically. GRPO exploits this, making it a more scalable and cheaper approach for verifiable domains.
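The core of GRPO's reward computation fits in a few lines: verifier scores for a group of samples are normalized within the group, so "better than the group average" becomes a positive advantage with no learned reward model or critic. A minimal sketch, with a hypothetical group of verifier scores:

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: normalize verifier rewards within a group
    of samples for the same prompt. No reward model, no critic network."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Hypothetical group of 6 sampled answers to one math prompt, scored by a
# rule-based verifier: 1.0 = final answer correct, 0.0 = incorrect.
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0]
adv = grpo_advantages(rewards)
print(adv.round(2))  # correct samples get positive advantage, incorrect negative
```

These advantages then weight a PPO-style clipped policy-gradient update on the sampled completions; the group normalization is what replaces both the reward model and the value function.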

DeepSeek also released distilled versions of R1 — the R1-Distill family, dense models from 1.5B to 70B parameters based on Qwen and Llama — produced by fine-tuning those smaller models on reasoning traces generated by the full R1. The larger distilled models (14B and 32B) match or beat much larger non-reasoning models on many reasoning benchmarks, making them among the most cost-effective reasoning models available for local deployment.

Training Cost Breakthrough — What It Actually Means

DeepSeek's claim of ~$6M training cost for a GPT-4-class model made headlines. Context is important:

What the $6M figure includes

The final training run on 14.8T tokens using their FP8 pipeline. Does not typically include: infrastructure R&D cost, failed training runs, data curation, the earlier V2 work that informed V3, or researcher salaries.

What it genuinely demonstrates

That architectural innovations (MLA, DeepSeekMoE, FP8, MTP) compound to produce a step-change in training efficiency. The algorithmic work is real. Frontier-quality models are not exclusively the domain of labs with $100M+ compute budgets.

Checklist: Do You Understand This?

  • What problem does Multi-head Latent Attention (MLA) solve, and how does it compress the KV cache?
  • In DeepSeekMoE, why does using many small experts rather than a few large experts reduce expert collapse?
  • What are shared experts in DeepSeekMoE, and why are they useful?
  • Why does FP8 training reduce compute and memory bandwidth compared to BF16? What hardware feature enables it on H100s?
  • Explain multi-token prediction: what does the model predict during training, and why is it discarded at inference?
  • What is GRPO, and how does it differ from PPO-based RLHF in terms of the reward signal source?
  • Why is a rule-based verifier sufficient for training a reasoning model on math and code tasks?
  • What aspects of the $6M training cost claim are real, and what costs does it omit?