🧠 All Things AI
Intermediate

DeepSeek-R1 & Open Reasoning

When DeepSeek released R1 in January 2025, it triggered a global reassessment of AI costs and accessibility: a fully open-weight reasoning model competitive with OpenAI's o1, available for free download. This page explains how it works, how it was trained, and what it means for builders.

Why DeepSeek-R1 Was a Shock

Before January 2025, the implicit assumption was that frontier reasoning capability required:

  • Massive proprietary training runs (>$100M compute)
  • Closed model weights (never released)
  • API-only access with per-token pricing

DeepSeek-R1 broke all three assumptions simultaneously. It matched or exceeded o1 on major benchmarks, was released with open weights (free to download), and was available via a public API at a fraction of OpenAI's pricing ($0.55/$2.19 per million input/output tokens versus o1's $15/$60).
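To make that pricing gap concrete, the quoted per-million-token prices can be turned into a workload cost comparison. A small sketch (the workload size below is invented for illustration):

```python
# Published prices in USD per million tokens: (input, output).
PRICES = {
    "deepseek-r1": (0.55, 2.19),
    "openai-o1": (15.00, 60.00),
}

def workload_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for a workload under a model's per-token pricing."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Hypothetical workload: 10M input tokens, 2M output tokens.
r1 = workload_cost("deepseek-r1", 10_000_000, 2_000_000)
o1 = workload_cost("openai-o1", 10_000_000, 2_000_000)
print(f"R1: ${r1:.2f}  o1: ${o1:.2f}  ratio: {o1 / r1:.0f}x")
```

At these list prices the same workload costs roughly 27× more on o1 than on R1.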

The release triggered what analysts called a "global AI cost reset" — demonstrating that reasoning capability was not an exclusive property of the largest US labs.

Architecture: MoE with 671B Total / 37B Active

DeepSeek-R1 is built on DeepSeek-V3's architecture — a Mixture of Experts (MoE) model with:

  • 671B total parameters across all expert networks
  • 37B active parameters per forward pass — only a subset of experts activates for each token

This MoE design is why DeepSeek can achieve high capability at lower compute cost: the model is large in total parameters (giving it breadth of knowledge) but only activates a fraction of those parameters per token (keeping inference cost low).
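The routing idea can be sketched in a few lines. This is a toy top-k router, not DeepSeek's actual implementation (R1's MoE uses far more experts, shared experts, and load-balancing machinery the sketch omits):

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Toy Mixture-of-Experts layer: route token x to its top-k experts only.

    x: (d,) token hidden state; gate_w: (n_experts, d) router weights;
    experts: list of callables (d,) -> (d,).
    """
    scores = gate_w @ x                    # router logits, one per expert
    top = np.argsort(scores)[-k:]          # indices of the k highest-scoring experts
    weights = np.exp(scores[top])
    weights /= weights.sum()               # softmax over the selected experts only
    # Only k of n experts run: compute cost scales with k, capacity with n.
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n = 8, 16
gate_w = rng.normal(size=(n, d))
experts = [(lambda W: (lambda x: W @ x))(rng.normal(size=(d, d))) for _ in range(n)]
y = moe_forward(rng.normal(size=d), gate_w, experts, k=2)
print(y.shape)  # (8,)
```

The key property is visible in the sketch: the parameter count grows with the number of experts `n`, but per-token compute grows only with `k`.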

How It Was Trained: RL Without Supervised CoT Data

The training approach is what made DeepSeek-R1 scientifically significant. Most reasoning models at the time were fine-tuned on human-written chain-of-thought examples. DeepSeek took a different path:

DeepSeek-R1-Zero: Pure Reinforcement Learning

R1-Zero was trained using only reinforcement learning (RL), with no supervised fine-tuning on reasoning examples. The reward signal was simple: is the final answer correct?

Remarkably, the model taught itself to reason — generating verification steps, backtracking when it went wrong, and exploring alternative approaches — purely because these behaviours improved its final answer accuracy. On AIME 2024, R1-Zero improved from 15.6% to 71% through RL alone.

This validated a key hypothesis: reasoning capability can emerge from reinforcement learning on outcome signals, without explicit instruction on how to reason.
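A minimal sketch of an outcome-only reward like the one R1-Zero was trained against (simplified: the real setup also rewarded output-format compliance, and answer matching is more robust than a string comparison):

```python
import re

def outcome_reward(completion: str, gold_answer: str) -> float:
    """Reward 1.0 iff the final boxed answer matches the reference, else 0.0.

    No credit is given for the reasoning itself -- the model only learns
    that whatever it did in between led to a right or wrong final answer.
    """
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0

print(outcome_reward(r"... so the result is \boxed{42}", "42"))  # 1.0
print(outcome_reward(r"... therefore \boxed{41}", "42"))         # 0.0
```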

DeepSeek-R1: Cold Start + RL + Refinement

The full R1 model extended this approach with additional stages:

  1. Cold start fine-tuning — A small set of human-written examples to bootstrap readable, well-formatted reasoning traces (R1-Zero's outputs were occasionally hard to read)
  2. RL training (GRPO) — Group Relative Policy Optimization, a more efficient RL framework than PPO, rewarding only correct final answers
  3. Rejection sampling — Filter out low-quality outputs and retrain on the best-quality reasoning traces
  4. Human preference alignment — A final stage incorporating human feedback to improve helpfulness and safety
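GRPO's core trick in step 2 is worth spelling out: for each prompt, a group of completions is sampled, and each completion's advantage is its reward relative to the rest of the group, which removes the separate value (critic) network PPO needs. A simplified sketch of just that advantage computation (the full GRPO objective also includes a clipped policy ratio and a KL penalty):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages for G completions of the SAME prompt.

    rewards: one final-answer reward per sampled completion.
    Returns (r_i - group mean) / group std for each completion, so the
    policy is pushed toward completions that beat their own group.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid div-by-zero if all rewards equal
    return [(r - mean) / std for r in rewards]

# 4 sampled answers to one prompt; two were correct (reward 1), two wrong (0).
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # [1.0, -1.0, 1.0, -1.0]
```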

The resulting model matches or exceeds o1 on most major benchmarks:

| Benchmark | DeepSeek-R1 | OpenAI o1 |
|---|---|---|
| AIME 2024 (maths) | 79.8% | 79.2% |
| MATH-500 | 97.3% | 96.4% |
| Codeforces rating | 2,029 | 1,891 |
| GPQA Diamond (science) | 71.5% | 77.3% |

Distilled Models: Reasoning in Smaller Packages

DeepSeek also released six distilled models — smaller models trained to reproduce R1's reasoning patterns at much lower compute cost:

  • 1.5B, 7B, 8B, 14B, 32B, 70B parameter sizes
  • Based on Qwen2.5 and Llama 3 architectures (dense models, not DeepSeek's MoE)
  • The 7B distill outperforms GPT-4o on maths benchmarks
  • The 70B distill rivals o1 on several benchmarks

These distilled models can run on consumer hardware — a 7B model runs on a modern gaming GPU (16GB VRAM), and 14B models run on higher-end consumer setups.

Running DeepSeek-R1 Locally

The full 671B model requires substantial hardware (multiple A100 80GB GPUs or equivalent). But the distilled models run locally:

| Model | Min VRAM | Recommended hardware | Quality level |
|---|---|---|---|
| deepseek-r1:7b | 8 GB | RTX 3080 / M1 Pro | Strong maths, moderate general reasoning |
| deepseek-r1:14b | 16 GB | RTX 4080 / M2 Max | Good general reasoning, strong coding |
| deepseek-r1:32b | 32 GB | RTX 4090 / M3 Ultra | Near o1 quality on many tasks |
| deepseek-r1:70b | 48 GB+ | 2× RTX 4090 / Mac Studio M3 Ultra | Close to full R1 for most use cases |

With Ollama, running the 7B model is simple:

```shell
ollama pull deepseek-r1:7b
ollama run deepseek-r1:7b
```

Visible Reasoning Trace

Unlike OpenAI's o-series, which hides or summarises its thinking, DeepSeek-R1 outputs the full reasoning trace between <think> and </think> tags before the answer. This is valuable for:

  • Debugging: understanding why the model reached a conclusion
  • Learning: seeing how systematic reasoning approaches complex problems
  • Auditing: verifying reasoning for high-stakes use cases
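Working with the trace programmatically usually starts by splitting it from the answer. A minimal parser sketch (the tag names are as emitted by R1; the function itself is illustrative and assumes a single well-formed think block):

```python
import re

def split_r1_output(raw: str) -> tuple:
    """Separate an R1-style response into (reasoning, answer).

    R1 emits its chain of thought between <think>...</think>, followed by
    the final answer. If no trace is found, treat the whole text as answer.
    """
    match = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
    if match is None:
        return "", raw.strip()
    thinking = match.group(1).strip()
    answer = raw[match.end():].strip()
    return thinking, answer

raw = "<think>2+2: add the numbers. 2+2=4.</think>The answer is 4."
thinking, answer = split_r1_output(raw)
print(answer)  # The answer is 4.
```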

Data Residency Advantage

Running DeepSeek-R1 locally (or on your own cloud infrastructure) means reasoning happens without sending data to external APIs. This matters for:

  • Source code that cannot leave your environment
  • Customer data subject to GDPR or healthcare privacy regulations
  • Internal business strategy or unreleased financial information

Limitations and Considerations

  • Knowledge cutoff: DeepSeek-R1 has a 2024 training cutoff. It doesn't know about post-cutoff events without retrieval augmentation.
  • Safety alignment differences: As a Chinese lab's model, R1's safety training differs from US models. It may decline politically sensitive Chinese topics more readily, and US-specific safety categories may behave differently. Evaluate for your use case.
  • English writing quality: Strong on maths and coding; slightly weaker than GPT/Claude on nuanced long-form English prose.
  • No native tool use: DeepSeek-R1 does not have built-in tool calling during reasoning (unlike o3/o4-mini). Tool use must be added at the orchestration layer.
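Adding tool use at the orchestration layer means your code, not the model, detects tool requests in the output text and loops. A minimal sketch, assuming you prompt the model to emit a made-up `TOOL: name(args)` convention (the convention, `TOOLS` registry, and `fake_model` are all invented for illustration; R1 has no native tool-calling format):

```python
import re

# Hypothetical tool registry -- names are invented for illustration.
TOOLS = {"add": lambda a, b: str(int(a) + int(b))}

def run_with_tools(call_model, prompt: str, max_turns: int = 5) -> str:
    """Orchestration loop: call the model, execute any TOOL: request it
    emits, feed the result back, repeat until a plain answer comes out."""
    transcript = prompt
    for _ in range(max_turns):
        reply = call_model(transcript)
        match = re.search(r"TOOL:\s*(\w+)\(([^)]*)\)", reply)
        if match is None:
            return reply                  # no tool request: final answer
        name, raw_args = match.groups()
        result = TOOLS[name](*[a.strip() for a in raw_args.split(",")])
        transcript += f"\n{reply}\nRESULT: {result}"
    return reply

# Stand-in for a real model call: asks for a tool once, then answers.
def fake_model(transcript):
    return "TOOL: add(2, 3)" if "RESULT:" not in transcript else "The sum is 5."

print(run_with_tools(fake_model, "What is 2 + 3?"))  # The sum is 5.
```

In practice `call_model` would be a request to your local Ollama instance or self-hosted endpoint; the loop structure stays the same.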

Checklist: Do You Understand This?

  • Why did DeepSeek-R1's January 2025 release cause a "global AI cost reset"?
  • What is the key difference in R1's training compared to most reasoning models?
  • What does MoE (Mixture of Experts) mean, and why does it matter for R1's cost?
  • What are distilled models, and how do they differ from the full R1?
  • What hardware do you need to run deepseek-r1:7b locally?
  • What data residency advantage does self-hosting R1 provide?