🧠 All Things AI
Intermediate

DeepSeek-R1 & Open Reasoning

When DeepSeek released R1 in January 2025, it triggered a global reassessment of AI costs and accessibility: a fully open-weight reasoning model competitive with OpenAI's o1, available for free download. This page explains how it works, how it was trained, and what it means for builders.

Why DeepSeek-R1 Was a Shock

Before January 2025, the implicit assumption was that frontier reasoning capability required:

  • Massive proprietary training runs (>$100M compute)
  • Closed model weights (never released)
  • API-only access with per-token pricing

DeepSeek-R1 broke all three assumptions simultaneously. It matched or exceeded o1 on major benchmarks, was released with open weights (free to download), and was available via a public API at a fraction of OpenAI's pricing ($0.55/$2.19 per million input/output tokens versus o1's $15/$60).
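To make that pricing gap concrete, the quoted per-million-token prices can be turned into a workload cost comparison. A small sketch (the workload size below is invented for illustration):

```python
# Published prices in USD per million tokens: (input, output).
PRICES = {
    "deepseek-r1": (0.55, 2.19),
    "openai-o1": (15.00, 60.00),
}

def workload_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for a workload under a model's per-token pricing."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Hypothetical workload: 10M input tokens, 2M output tokens.
r1 = workload_cost("deepseek-r1", 10_000_000, 2_000_000)
o1 = workload_cost("openai-o1", 10_000_000, 2_000_000)
print(f"R1: ${r1:.2f}  o1: ${o1:.2f}  ratio: {o1 / r1:.0f}x")
```

At these list prices the same workload costs roughly 27× more on o1 than on R1.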

The release triggered what analysts called a "global AI cost reset" — demonstrating that reasoning capability was not an exclusive property of the largest US labs.

Architecture: MoE with 671B Total / 37B Active

DeepSeek-R1 is built on DeepSeek-V3's architecture — a Mixture of Experts (MoE) model with:

  • 671B total parameters across all expert networks
  • 37B active parameters per forward pass — only a subset of experts activates for each token

This MoE design is why DeepSeek can achieve high capability at lower compute cost: the model is large in total parameters (giving it breadth of knowledge) but only activates a fraction of those parameters per token (keeping inference cost low).
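The routing idea can be sketched in a few lines. This is a toy top-k router, not DeepSeek's actual implementation (R1's MoE uses far more experts, shared experts, and load-balancing machinery the sketch omits):

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Toy Mixture-of-Experts layer: route token x to its top-k experts only.

    x: (d,) token hidden state; gate_w: (n_experts, d) router weights;
    experts: list of callables (d,) -> (d,).
    """
    scores = gate_w @ x                    # router logits, one per expert
    top = np.argsort(scores)[-k:]          # indices of the k highest-scoring experts
    weights = np.exp(scores[top])
    weights /= weights.sum()               # softmax over the selected experts only
    # Only k of n experts run: compute cost scales with k, capacity with n.
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n = 8, 16
gate_w = rng.normal(size=(n, d))
experts = [(lambda W: (lambda x: W @ x))(rng.normal(size=(d, d))) for _ in range(n)]
y = moe_forward(rng.normal(size=d), gate_w, experts, k=2)
print(y.shape)  # (8,)
```

The key property is visible in the sketch: the parameter count grows with the number of experts `n`, but per-token compute grows only with `k`.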

How It Was Trained: RL Without Supervised CoT Data

The training approach is what made DeepSeek-R1 scientifically significant. Most reasoning models at the time were fine-tuned on human-written chain-of-thought examples. DeepSeek took a different path:

DeepSeek-R1-Zero: Pure Reinforcement Learning

R1-Zero was trained using only reinforcement learning (RL), with no supervised fine-tuning on reasoning examples. The reward signal was simple: is the final answer correct?

Remarkably, the model taught itself to reason — generating verification steps, backtracking when it went wrong, and exploring alternative approaches — purely because these behaviours improved its final answer accuracy. On AIME 2024, R1-Zero improved from 15.6% to 71% through RL alone.

This validated a key hypothesis: reasoning capability can emerge from reinforcement learning on outcome signals, without explicit instruction on how to reason.
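A minimal sketch of an outcome-only reward like the one R1-Zero was trained against (simplified: the real setup also rewarded output-format compliance, and answer matching is more robust than a string comparison):

```python
import re

def outcome_reward(completion: str, gold_answer: str) -> float:
    """Reward 1.0 iff the final boxed answer matches the reference, else 0.0.

    No credit is given for the reasoning itself -- the model only learns
    that whatever it did in between led to a right or wrong final answer.
    """
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0

print(outcome_reward(r"... so the result is \boxed{42}", "42"))  # 1.0
print(outcome_reward(r"... therefore \boxed{41}", "42"))         # 0.0
```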

DeepSeek-R1: Cold Start + RL + Refinement

The full R1 model extended this approach with additional stages:

  1. Cold start fine-tuning — A small set of human-written examples to bootstrap readable, well-formatted reasoning traces (R1-Zero's outputs were occasionally hard to read)
  2. RL training (GRPO) — Group Relative Policy Optimization, a more efficient RL framework than PPO, rewarding only correct final answers
  3. Rejection sampling — Filter out low-quality outputs and retrain on the best-quality reasoning traces
  4. Human preference alignment — A final stage incorporating human feedback to improve helpfulness and safety
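GRPO's core trick in step 2 is worth spelling out: for each prompt, a group of completions is sampled, and each completion's advantage is its reward relative to the rest of the group, which removes the separate value (critic) network PPO needs. A simplified sketch of just that advantage computation (the full GRPO objective also includes a clipped policy ratio and a KL penalty):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages for G completions of the SAME prompt.

    rewards: one final-answer reward per sampled completion.
    Returns (r_i - group mean) / group std for each completion, so the
    policy is pushed toward completions that beat their own group.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid div-by-zero if all rewards equal
    return [(r - mean) / std for r in rewards]

# 4 sampled answers to one prompt; two were correct (reward 1), two wrong (0).
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # [1.0, -1.0, 1.0, -1.0]
```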

The resulting model matches or exceeds o1 on most major benchmarks:

| Benchmark | DeepSeek-R1 | OpenAI o1 |
|---|---|---|
| AIME 2024 (maths) | 79.8% | 79.2% |
| MATH-500 | 97.3% | 96.4% |
| Codeforces rating | 2,029 | 1,891 |
| GPQA Diamond (science) | 71.5% | 77.3% |

Distilled Models: Reasoning in Smaller Packages

DeepSeek also released six distilled models — smaller models trained to reproduce R1's reasoning patterns at much lower compute cost:

  • 1.5B, 7B, 8B, 14B, 32B, 70B parameter sizes
  • Based on Qwen2.5 and Llama 3 architectures (dense models, not DeepSeek's MoE)
  • The 7B distill outperforms GPT-4o on maths benchmarks
  • The 70B distill rivals o1 on several benchmarks

These distilled models can run on consumer hardware — a 7B model runs on a modern gaming GPU (16GB VRAM), and 14B models run on higher-end consumer setups.

Running DeepSeek-R1 Locally

The full 671B model requires substantial hardware (multiple A100 80GB GPUs or equivalent). But the distilled models run locally:

| Model | Min VRAM | Recommended hardware | Quality level |
|---|---|---|---|
| deepseek-r1:7b | 8 GB | RTX 3080 / M1 Pro | Strong maths, moderate general reasoning |
| deepseek-r1:14b | 16 GB | RTX 4080 / M2 Max | Good general reasoning, strong coding |
| deepseek-r1:32b | 32 GB | RTX 4090 / M3 Ultra | Near o1 quality on many tasks |
| deepseek-r1:70b | 48 GB+ | 2× RTX 4090 / Mac Studio M3 Ultra | Close to full R1 for most use cases |

With Ollama, running the 7B model is simple:

```shell
ollama pull deepseek-r1:7b
ollama run deepseek-r1:7b
```

Visible Reasoning Trace

Unlike OpenAI's o-series, which hides or summarises its thinking, DeepSeek-R1 outputs the full reasoning trace between <think> and </think> tags before the answer. This is valuable for:

  • Debugging: understanding why the model reached a conclusion
  • Learning: seeing how systematic reasoning approaches complex problems
  • Auditing: verifying reasoning for high-stakes use cases
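Working with the trace programmatically usually starts by splitting it from the answer. A minimal parser sketch (the tag names are as emitted by R1; the function itself is illustrative and assumes a single well-formed think block):

```python
import re

def split_r1_output(raw: str) -> tuple:
    """Separate an R1-style response into (reasoning, answer).

    R1 emits its chain of thought between <think>...</think>, followed by
    the final answer. If no trace is found, treat the whole text as answer.
    """
    match = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
    if match is None:
        return "", raw.strip()
    thinking = match.group(1).strip()
    answer = raw[match.end():].strip()
    return thinking, answer

raw = "<think>2+2: add the numbers. 2+2=4.</think>The answer is 4."
thinking, answer = split_r1_output(raw)
print(answer)  # The answer is 4.
```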

Data Residency Advantage

Running DeepSeek-R1 locally (or on your own cloud infrastructure) means reasoning happens without sending data to external APIs. This matters for:

  • Source code that cannot leave your environment
  • Customer data subject to GDPR or healthcare privacy regulations
  • Internal business strategy or unreleased financial information

Limitations and Considerations

  • Knowledge cutoff: DeepSeek-R1 has a 2024 training cutoff. It doesn't know about post-cutoff events without retrieval augmentation.
  • Safety alignment differences: As a Chinese lab's model, R1's safety training differs from US models. It may decline politically sensitive Chinese topics more readily, and US-specific safety categories may behave differently. Evaluate for your use case.
  • English writing quality: Strong on maths and coding; slightly weaker than GPT/Claude on nuanced long-form English prose.
  • No native tool use: DeepSeek-R1 does not have built-in tool calling during reasoning (unlike o3/o4-mini). Tool use must be added at the orchestration layer.
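Adding tool use at the orchestration layer means your code, not the model, detects tool requests in the output text and loops. A minimal sketch, assuming you prompt the model to emit a made-up `TOOL: name(args)` convention (the convention, `TOOLS` registry, and `fake_model` are all invented for illustration; R1 has no native tool-calling format):

```python
import re

# Hypothetical tool registry -- names are invented for illustration.
TOOLS = {"add": lambda a, b: str(int(a) + int(b))}

def run_with_tools(call_model, prompt: str, max_turns: int = 5) -> str:
    """Orchestration loop: call the model, execute any TOOL: request it
    emits, feed the result back, repeat until a plain answer comes out."""
    transcript = prompt
    for _ in range(max_turns):
        reply = call_model(transcript)
        match = re.search(r"TOOL:\s*(\w+)\(([^)]*)\)", reply)
        if match is None:
            return reply                  # no tool request: final answer
        name, raw_args = match.groups()
        result = TOOLS[name](*[a.strip() for a in raw_args.split(",")])
        transcript += f"\n{reply}\nRESULT: {result}"
    return reply

# Stand-in for a real model call: asks for a tool once, then answers.
def fake_model(transcript):
    return "TOOL: add(2, 3)" if "RESULT:" not in transcript else "The sum is 5."

print(run_with_tools(fake_model, "What is 2 + 3?"))  # The sum is 5.
```

In practice `call_model` would be a request to your local Ollama instance or self-hosted endpoint; the loop structure stays the same.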

Checklist: Do You Understand This?

  • Why did DeepSeek-R1's January 2025 release cause a "global AI cost reset"?
  • What is the key difference in R1's training compared to most reasoning models?
  • What does MoE (Mixture of Experts) mean, and why does it matter for R1's cost?
  • What are distilled models, and how do they differ from the full R1?
  • What hardware do you need to run deepseek-r1:7b locally?
  • What data residency advantage does self-hosting R1 provide?