DeepSeek-R1 Training Approach
DeepSeek-R1 (January 2025) is the first open-weight reasoning model to match o1-level performance on major benchmarks — and unlike o1, DeepSeek published the complete training methodology. This page covers the GRPO reinforcement learning algorithm, the cold-start problem and how they solved it, the reward signals used, rejection sampling fine-tuning, and the open-source impact of the release.
What DeepSeek-R1 Is
DeepSeek-R1 is a family of reasoning models released in January 2025 by DeepSeek, a Chinese AI lab. The flagship model (DeepSeek-R1, based on DeepSeek-V3: 671B total parameters, with ~37B activated per token in a Mixture-of-Experts architecture) matches or exceeds OpenAI o1 on most standard reasoning benchmarks, at a fraction of the training cost. More importantly for the research community, DeepSeek released a detailed technical report describing the complete training approach — making R1 the first frontier-level reasoning model with a fully documented, reproducible methodology.
| Benchmark | DeepSeek-R1 | OpenAI o1 | Notes |
|---|---|---|---|
| AIME 2024 | 79.8% | 83.3% | Slight edge to o1; both far above GPT-4o (13.4%) |
| MATH-500 | 97.3% | 96.4% | R1 slightly ahead |
| Codeforces rating | ~2029 (96.3rd percentile) | ~1891 (89th percentile) | R1 ahead on competitive programming |
| GPQA Diamond | 71.5% | 78.3% | o1 ahead on PhD-level science |
GRPO: The RL Algorithm
DeepSeek-R1 uses Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm developed by DeepSeek that eliminates the need for a separate critic or value model — which is a significant simplification over Proximal Policy Optimization (PPO), the algorithm used in most RLHF pipelines.
How GRPO works
- For each training prompt, sample a group of G outputs from the current policy (e.g., G = 8)
- Compute a reward for each output using the reward function (e.g., check math answer correctness)
- Normalize rewards within the group: subtract the group mean and divide by group standard deviation
- Update the policy to increase the probability of outputs with above-average normalized reward and decrease those with below-average reward
- Apply a KL-divergence penalty to prevent the policy from drifting too far from the reference model
The critical insight of GRPO: by normalizing rewards within a group of samples for the same prompt, you get a relative advantage signal without needing to train a separate value function. PPO needs a critic model (roughly as large as the policy model) to estimate the baseline. GRPO replaces this with a within-group average, roughly halving the memory and compute requirements for the RL training step.
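The group-normalization step can be sketched in a few lines. This is a minimal illustration of the advantage computation only, not the full policy update; the function name is ours:

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of G samples for the same prompt.

    The group mean plays the role of PPO's learned value baseline, so no
    separate critic network is needed.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# G = 8 samples of one prompt, scored with a rule-based 0/1 accuracy reward
adv = grpo_advantages([1, 0, 0, 1, 1, 0, 0, 0])
# Correct samples receive positive advantages, incorrect ones negative;
# the policy gradient then scales each sample's log-probabilities by these.
```

In the full algorithm these advantages weight a clipped policy-gradient objective, with the KL penalty applied against the reference model.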
| | PPO (standard RLHF) | GRPO (DeepSeek-R1) |
|---|---|---|
| Critic/value model | Required (separate model) | Not needed — replaced by group mean |
| Reward baseline | Value function prediction | Within-group reward mean |
| Memory requirement | ~2× policy model (policy + critic) | ~1× policy model |
| Stability | Well-studied, stable | Simpler; avoids critic training instability |
The Cold-Start Problem
DeepSeek first tried training a model with pure RL from scratch on reasoning tasks — they called this R1-Zero. The result was instructive: the model did develop reasoning capabilities and even showed an "aha moment" where it spontaneously learned to re-examine its assumptions mid-reasoning. But R1-Zero had significant practical problems: it mixed languages (switching between English and Chinese mid-reasoning), produced hard-to-read output, and was unstable during training.
The cold-start problem in RL training: when training starts from a base model that has seen no examples of the target behavior, exploration is extremely inefficient. The model must stumble onto long reasoning chains before it can learn that they are rewarded, and without any priming it rarely produces them spontaneously.
DeepSeek's cold-start solution for R1
Instead of pure RL from scratch, R1 uses a small number of long-form chain-of-thought examples to prime the model before RL begins. This "cold start" SFT phase:
- Uses thousands (not millions) of high-quality CoT examples
- Establishes the format: reasoning inside a thinking block, final answer after
- Makes RL exploration significantly more efficient — the model already knows how to think in the right format
- Addresses the readability and language-mixing problems of R1-Zero
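As a concrete illustration of the target format (the prompt and wording here are hypothetical; the R1 report specifies only the think-block-then-answer structure):

```python
# Hypothetical cold-start SFT example. The exact template wording is an
# assumption; only the reasoning-then-answer structure is from the paper.
example = {
    "prompt": "What is 17 * 24?",
    "completion": (
        "<think>\n"
        "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.\n"
        "Re-checking: 408 / 24 = 17, so the product is consistent.\n"
        "</think>\n"
        "408"
    ),
}
```

A few thousand examples in this shape are enough to make RL exploration tractable, because the policy already emits well-formed reasoning blocks from the first update.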
Reward Signals: No Human Labels for Core RL
One of the most significant design choices in DeepSeek-R1: the core RL training uses rule-based rewards, not human preference labels. This dramatically reduces the cost of data collection.
Accuracy rewards
For math problems: check the final answer against the ground truth. Correct = reward 1, incorrect = reward 0. This requires only question-answer pairs, which exist in abundance from competition math datasets. No human judgment needed — the answer is either right or wrong.
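A minimal sketch of such a reward, assuming the final answer follows the closing think tag (real graders normalize math expressions, e.g. with a symbolic checker, rather than comparing raw strings):

```python
def accuracy_reward(model_output: str, ground_truth: str) -> float:
    """Rule-based accuracy reward: 1.0 if the final answer matches, else 0.0.

    Assumes the answer is whatever follows the last </think> tag; this
    extraction convention is ours, not from the paper.
    """
    answer = model_output.split("</think>")[-1].strip()
    return 1.0 if answer == ground_truth.strip() else 0.0

accuracy_reward("<think>2+2=4</think>4", "4")  # → 1.0
accuracy_reward("<think>2+2=5</think>5", "4")  # → 0.0
```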
Format rewards
The model is required to produce its reasoning inside a <think> block and its final answer after. A simple heuristic reward penalizes outputs that do not follow this format. This ensures the model produces parseable outputs without requiring human review.
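A sketch of the check, assuming a leading <think> block followed by a non-empty answer (DeepSeek has not published the exact heuristic, so the regex below is illustrative):

```python
import re

# Leading <think> block with non-empty reasoning, then a non-empty answer.
THINK_FORMAT = re.compile(r"^<think>.+?</think>\s*\S", re.DOTALL)

def format_reward(output: str) -> float:
    """Heuristic format reward: 1.0 if the output follows the
    reasoning-then-answer template, else 0.0."""
    return 1.0 if THINK_FORMAT.match(output.strip()) else 0.0
```

Because both reward types are cheap deterministic functions, every sampled output in a GRPO group can be scored without a learned reward model in the loop.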
What is deliberately excluded
Human preference labels (RLHF-style comparisons) are not used for the math/code RL phase. Neural reward models trained on human preferences are also not used — because they are susceptible to reward hacking (the policy learns to game the reward model rather than improve actual performance). Rule-based rewards are harder to hack.
Rejection Sampling Fine-Tuning
After the initial RL training phase, DeepSeek uses rejection sampling fine-tuning, an iterative quality improvement process:
Rejection sampling fine-tuning loop
- Use the RL-trained model to generate many candidate solutions for each training problem
- Verify each solution — keep only the correct ones (for math) or high-quality ones (for general tasks)
- Fine-tune the model on this filtered dataset of verified correct solutions (standard SFT)
- Repeat the RL training phase on the SFT-initialized model
- Repeat the cycle to iteratively improve quality
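One round of this loop can be sketched as follows; `generate` and `verify` are hypothetical stand-ins for the model's sampler and the rule-based checker:

```python
def rejection_sampling_round(problems, generate, verify, k=16):
    """Collect verified-correct solutions to use as the next SFT dataset.

    For each problem: sample k candidates, keep only those that pass
    verification, and return the surviving (problem, solution) pairs.
    """
    sft_data = []
    for problem in problems:
        candidates = [generate(problem) for _ in range(k)]
        correct = [c for c in candidates if verify(problem, c)]
        sft_data.extend((problem, c) for c in correct)
    return sft_data
```

The filtered pairs are then used for standard SFT, after which RL resumes from the fine-tuned checkpoint.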
This combines the best of both approaches: RL for exploration and capability expansion, SFT for stability and quality consolidation. The output of each RL phase is a higher-quality dataset for the next SFT phase. Over multiple iterations, the model improves without requiring new human annotations.
Language Mixing and Readability Issues
When trained with pure RL (as in R1-Zero), models trained on multilingual data tend to mix languages during reasoning — switching between English and Chinese mid-chain, sometimes mid-sentence. This is a particularly striking failure mode: the model optimizes only for correct answers, so it uses whichever language's patterns produce the most useful intermediate tokens for the task, regardless of readability.
This is not a trivial problem. The reasoning chains produced by R1-Zero, while often leading to correct answers, were nearly unreadable to humans — a significant limitation for interpretability and trust. DeepSeek addressed this through:
- The cold-start SFT phase on English-language CoT examples to establish a consistent language baseline
- Language consistency rewards added to the RL reward function
- Additional SFT phases on human-readable chain-of-thought data as part of the iterative training loop
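The R1 report describes the language consistency reward as the proportion of target-language words in the chain of thought. The character-level version below is our simplification, using ASCII letters as a proxy for English:

```python
def language_consistency_reward(text: str) -> float:
    """Fraction of alphabetic characters drawn from the target language's
    script (here: ASCII letters for English). A simplified sketch; the
    paper's reward counts target-language words, and its exact
    implementation is not published.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(1 for ch in letters if ch.isascii()) / len(letters)
```

During RL this term is added to the accuracy and format rewards, trading a small amount of raw task performance for readable, single-language reasoning chains.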
Distillation: Small Models with Big Reasoning
One of the most impactful contributions of the R1 release: the use of R1's reasoning traces to fine-tune much smaller models. DeepSeek generated a large dataset of R1's chain-of-thought solutions to math and reasoning problems, then fine-tuned open base models (Qwen 2.5 and LLaMA 3 series) on this data.
| Distilled model | Base | AIME 2024 | vs. base model |
|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-7B | Qwen 2.5 7B | 55.5% | ~4× improvement over base |
| DeepSeek-R1-Distill-Qwen-14B | Qwen 2.5 14B | 69.7% | Exceeds o1-mini |
| DeepSeek-R1-Distill-Qwen-32B | Qwen 2.5 32B | 72.6% | Near o1 level |
| DeepSeek-R1-Distill-LLaMA-70B | LLaMA 3.1 70B | 70.0% | Dramatic improvement over base LLaMA |
The distillation results revealed something important: a 7B model fine-tuned on R1's reasoning traces significantly outperforms much larger base models on reasoning benchmarks. The reasoning pattern — how to structure thought, when to re-examine assumptions, how to decompose problems — can be transferred from a large capable model to a small model via supervised fine-tuning on chain-of-thought data. This does not require RL training on the small model.
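The distillation pipeline reduces to ordinary SFT on teacher traces. A sketch, with `teacher_generate` as a hypothetical stand-in for querying R1:

```python
def make_distill_examples(problems, teacher_generate):
    """Build a distillation SFT dataset from a teacher's full traces.

    Each completion contains the teacher's reasoning (think block) plus its
    final answer, so the student imitates the reasoning pattern
    token-by-token. No RL is run on the student.
    """
    examples = []
    for problem in problems:
        trace = teacher_generate(problem)  # e.g. "<think>...</think>answer"
        examples.append({"prompt": problem, "completion": trace})
    return examples
```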
The distilled R1 models became among the most widely used open-weight reasoning models in 2025, particularly for local deployment — the 7B and 14B variants can run on consumer hardware.
Open-Source Impact
DeepSeek released both the model weights and the training methodology for R1. This had several significant effects on the AI ecosystem:
Validated reproducibility of frontier reasoning
Before R1, many in the community assumed that OpenAI's reasoning capability depended on proprietary techniques that could not be replicated. R1 demonstrated that the core approach — RL on verifiable rewards with extended CoT — can be executed successfully using only published, reproducible methods. This was confirmed when community researchers and labs began reproducing aspects of R1's training within weeks of the paper's release.
Enabled community research and iteration
With both weights and methodology public, researchers could build on R1 directly — testing alternative reward signals, different RL algorithms, different base models, and domain-specific fine-tuning. The open release compressed what would have been months of independent parallel research into weeks of collaborative improvement.
Provided small-scale reasoning models
The distilled R1 models gave developers access to capable reasoning models that can run locally — something OpenAI's o1 cannot provide, as it is API-only. This enabled privacy-sensitive use cases, offline deployment, and experimentation without per-token API costs.
Intensified compute cost competition
R1 demonstrated competitive reasoning capability at a reported training cost of approximately $5–6 million — a fraction of what frontier US labs spend. This accelerated pressure on the assumption that capability requires enormous compute budgets, and intensified industry focus on training efficiency.
Checklist: Do You Understand This?
- Can you describe GRPO and explain specifically how it differs from PPO — what it eliminates and what it replaces that with?
- Can you explain the cold-start problem in RL training and describe DeepSeek's two-phase solution?
- Can you name the two types of rule-based rewards used in R1's RL training and explain why human preference labels are deliberately avoided?
- Can you describe the rejection sampling fine-tuning loop and explain how it iteratively improves model quality?
- Can you explain why R1-Zero produced language-mixed, hard-to-read reasoning chains, and how this was addressed in full R1?
- Can you explain how distillation works in the R1 context — what data is used and what process produces the small reasoning models?
- Can you state the AIME 2024 score for the 7B distilled model and explain why that result is significant relative to base model size?
- Can you describe two specific impacts of the R1 open release on the broader AI ecosystem?