🧠 All Things AI
Intermediate

OpenAI o-Series (o1, o3, o4-mini)

OpenAI's o-series is a separate model family from GPT, designed specifically for reasoning-intensive tasks. Understanding the distinctions between o1, o3, and o4-mini — and how they differ from GPT-4o — helps you route queries to the right model.

Why a Separate o-Series?

GPT models (GPT-4o, GPT-5) are optimised for breadth: fast, multimodal, instruction-following across all types of tasks. The o-series is optimised for depth: spending substantially more compute at inference to achieve much higher accuracy on hard reasoning problems. They are complementary — not replacements for each other.

The "o" originally stood for nothing specific (early internal codename), though it is commonly associated with "omni reasoning" in later usage. There is no o2 — OpenAI skipped that number to avoid confusion with the UK telecom brand O2.

o1: First-Generation Reasoning (September 2024)

o1 was the first publicly released reasoning model. At launch it scored 83.3% on AIME 2024 (a hard maths competition), compared to GPT-4o's ~12% on the same benchmark — a dramatic demonstration that test-time compute could unlock qualitatively different capability.

Key o1 characteristics:

  • Thinking trace is fully hidden — users only see the final answer
  • No tool use during reasoning — could only call tools after completing its thinking
  • High latency: 30–120 seconds typical for hard problems
  • Higher cost than GPT-4o; has since been superseded by o3 and o4-mini
  • o1-mini: a cheaper variant with less capability; also now superseded

Current status

OpenAI has stated that for most real-world use cases, o3 and o4-mini are both smarter and cheaper than o1. There is little reason to use o1 for new work — prefer o4-mini for cost-efficiency and o3 for maximum capability.

o3: Full Reasoning with Tools (April 2025)

o3 represents a substantial advance over o1. Its key improvements:

  • Native tool use during reasoning — o3 can call tools (web search, code interpreter, file analysis, image generation) from within the thinking trace, then incorporate results back into its reasoning
  • Higher benchmark scores — 91.6% on AIME 2024, 88.9% on AIME 2025, making 20% fewer major errors on difficult real-world tasks versus o1
  • Visual reasoning — o3 can integrate images directly into the reasoning chain (not just at input), enabling it to reason about diagrams, charts, and screenshots
  • Deliberative alignment — o3 uses a safety-focused approach where safety-relevant policies are explicitly part of the reasoning trace

o4-mini: Cost-Efficient Reasoning (April 2025)

o4-mini is a smaller model optimised for fast, high-volume reasoning. Despite being "mini," it is not significantly less capable than o3 on most tasks:

Benchmark comparison (o3 vs o4-mini):

  • AIME 2024 (maths olympiad): o3 91.6%, o4-mini 93.4% (o4-mini wins)
  • AIME 2025: o3 88.9%, o4-mini 92.7% (o4-mini wins)
  • MMMU (multimodal): o3 82.9%, o4-mini 81.6% (o3 slightly better)
  • MathVista: o3 86.8%, o4-mini 84.3% (o3 slightly better)
  • Cost per 1M tokens (input / output): o3 $10 / $40, o4-mini $1.10 / $4.40 (o4-mini ~9× cheaper)

o4-mini is the default recommendation for most reasoning tasks. Choose o3 when you need the absolute highest quality on multi-step complex reasoning or have visual analysis requirements that demand the larger model.

Thinking Token Budget in the API

Both o3 and o4-mini expose a reasoning_effort parameter (or equivalent) in the API. You can set:

  • low — quick, cheap, good for moderately hard problems
  • medium — default; good balance
  • high — maximum reasoning; best for hardest problems

Alternatively, you can set a maximum thinking token budget directly (e.g., max 8,192 thinking tokens). This caps latency and cost predictably.
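As a minimal sketch, assuming the request takes effort as a `reasoning={"effort": ...}` field (the helper name and exact request shape here are illustrative, not the definitive API):

```python
def reasoning_request(prompt: str, effort: str = "medium") -> dict:
    """Hypothetical helper: build the keyword arguments for a reasoning call.

    Assumes an API shape where effort is passed as reasoning={"effort": ...};
    check your SDK's documentation for the exact parameter name.
    """
    if effort not in {"low", "medium", "high"}:
        raise ValueError(f"unknown effort level: {effort}")
    return {
        "model": "o4-mini",
        "reasoning": {"effort": effort},
        "input": prompt,
    }
```

Centralising request construction like this makes it easy to swap effort levels (or models) in one place when tuning cost versus quality.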

Tip: Start with o4-mini at medium effort. If the answer quality is insufficient, try o4-mini high before escalating to o3. Many teams find o4-mini medium covers ~90% of their reasoning needs at 1/10th the o3 cost.
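The escalation strategy in the tip can be sketched as a small ladder. Here `ask_model` and `good_enough` are caller-supplied stand-ins (no real API call is made in this sketch):

```python
# Cheapest configuration first; escalate only when quality is insufficient.
LADDER = [("o4-mini", "medium"), ("o4-mini", "high"), ("o3", "high")]

def solve_with_escalation(prompt, ask_model, good_enough):
    """ask_model(model, effort, prompt) -> answer; good_enough(answer) -> bool.

    Walks the ladder and returns the first acceptable (model, effort, answer).
    Falls through with the strongest attempt if nothing passes the check.
    """
    for model, effort in LADDER:
        answer = ask_model(model, effort, prompt)
        if good_enough(answer):
            return model, effort, answer
    return model, effort, answer
```

The `good_enough` check might be a schema validation, a unit test on generated code, or a cheap grader model, depending on your task.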

Native Tool Use Within Reasoning

A major capability introduced with o3 and o4-mini is the ability to use tools during the reasoning trace — not just after it. This means:

  • Mid-reasoning web search — The model can search for a fact it needs to complete a reasoning step, then continue the trace with that information
  • Code execution in reasoning — Run Python to verify a calculation, then use the result in subsequent reasoning
  • Multi-tool chaining in one call — A single o3/o4-mini call can search the web, run code, and analyse a file as part of its thinking process

This makes o3/o4-mini significantly more useful as the "brain" in agentic pipelines — rather than a planning step that then calls tools, the model can gather information and plan simultaneously.
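A single request that permits mid-reasoning tool use might be assembled like this. The tool type strings below are illustrative placeholders, not exact API identifiers — consult your SDK's tool documentation for the real names:

```python
def agentic_request(prompt: str) -> dict:
    """Sketch of one call that lets the model search and run code
    from within its reasoning trace (tool names are assumptions)."""
    return {
        "model": "o3",
        "input": prompt,
        # Declaring tools up front lets the model invoke them mid-reasoning
        # instead of planning first and calling tools in a separate step.
        "tools": [
            {"type": "web_search"},
            {"type": "code_interpreter"},
        ],
    }
```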

o-Series vs GPT-4o: Routing Guide

Routing by task type:

  • Writing, editing, summarising: GPT-4o / GPT-5 (o-series is overkill)
  • Real-time chat / voice: GPT-4o / GPT-5 (o-series is too slow)
  • Complex maths / logic: o4-mini / o3 (dramatically better)
  • Hard code generation: try GPT-4o first; escalate to o4-mini / o3 if it fails
  • Multi-step agent planning: o4-mini / o3 (much more reliable); GPT-4o is marginal
  • Image understanding: GPT-4o for basic cases; o3 for complex visual reasoning
  • High-volume simple tasks: GPT-4o (or GPT-4o mini); o-series costs too much
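The routing guide above can be condensed into a small dispatcher. The task-type labels and the `high_stakes` flag are illustrative choices, not part of any API:

```python
# Task categories that benefit from a reasoning model (per the guide above).
REASONING_TASKS = {"math", "logic", "agent_planning", "hard_code"}

def route(task_type: str, high_stakes: bool = False) -> str:
    """Hypothetical router: map a task category to a model name.

    Reasoning-heavy tasks go to o4-mini by default, o3 when maximum
    quality is required; everything else stays on the GPT family.
    """
    if task_type in REASONING_TASKS:
        return "o3" if high_stakes else "o4-mini"
    if task_type == "high_volume_simple":
        return "gpt-4o-mini"
    return "gpt-4o"
```

In production you would typically let a cheap classifier (or the calling feature itself) supply `task_type` rather than hard-coding it.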

Latency Expectations

Set realistic expectations when using o-series models in production:

  • o4-mini (low effort): 5–15 seconds to first token of final answer
  • o4-mini (medium effort): 15–45 seconds
  • o3 (high effort): 1–5+ minutes for the hardest problems

These latencies make o-series models far better suited to async workflows (batch processing, background jobs) than to real-time, user-facing interactions. Design your UX accordingly — a progress indicator and an async response pattern are usually necessary.
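One minimal version of that async pattern: submit the slow call to a worker thread and return a future immediately, so the UI can show a progress indicator and poll for completion. `slow_reasoning_call` here is a stand-in stub, not a real API call:

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor

# Shared pool; in a real service you would size and shut this down properly.
_EXECUTOR = ThreadPoolExecutor(max_workers=4)

def slow_reasoning_call(prompt: str) -> str:
    """Stand-in for a 15-45 s o4-mini call; sleeps briefly for illustration."""
    time.sleep(0.1)
    return f"answer to: {prompt}"

def submit_async(prompt: str) -> Future:
    """Return immediately; the caller polls fut.done() or awaits fut.result()
    while the UI shows a spinner or progress indicator."""
    return _EXECUTOR.submit(slow_reasoning_call, prompt)
```

The same shape works with a job queue (Celery, SQS, etc.) when calls must survive process restarts.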

Checklist: Do You Understand This?

  • What is the key difference between the o-series and the GPT series?
  • Why was there no o2?
  • How does o4-mini compare to o3 in terms of benchmark performance and cost?
  • What capability did o3/o4-mini introduce over o1 regarding tool use?
  • When should you route to o4-mini vs GPT-4o in a production system?
  • What UX pattern is appropriate for o-series model calls in user-facing apps?