🧠 All Things AI
Intermediate

OpenAI o-Series (o1, o3, o4-mini)

OpenAI's o-series is a separate model family from GPT, designed specifically for reasoning-intensive tasks. Understanding the distinctions between o1, o3, and o4-mini — and how they differ from GPT-4o — helps you route queries to the right model.

Why a Separate o-Series?

GPT models (GPT-4o, GPT-5) are optimised for breadth: fast, multimodal, instruction-following across all types of tasks. The o-series is optimised for depth: spending substantially more compute at inference to achieve much higher accuracy on hard reasoning problems. They are complementary — not replacements for each other.

The "o" originally stood for nothing specific (early internal codename), though it is commonly associated with "omni reasoning" in later usage. There is no o2 — OpenAI skipped that number to avoid confusion with the UK telecom brand O2.

o1: First-Generation Reasoning (September 2024)

o1 was the first publicly released reasoning model. At launch it scored 83.3% on AIME 2024 (a hard maths competition), compared to GPT-4o's ~12% on the same benchmark — a dramatic demonstration that test-time compute could unlock qualitatively different capability.

Key o1 characteristics:

  • Thinking trace is fully hidden — users only see the final answer
  • No tool use during reasoning — could only call tools after completing its thinking
  • High latency: 30–120 seconds typical for hard problems
  • Higher cost than GPT-4o; has since been superseded by o3 and o4-mini
  • o1-mini: a cheaper variant with less capability; also now superseded

Current status

OpenAI has stated that for most real-world use cases, o3 and o4-mini are both smarter and cheaper than o1. There is little reason to use o1 for new work — prefer o4-mini for cost-efficiency and o3 for maximum capability.

o3: Full Reasoning with Tools (April 2025)

o3 represents a substantial advance over o1. Its key improvements:

  • Native tool use during reasoning — o3 can call tools (web search, code interpreter, file analysis, image generation) from within the thinking trace, then incorporate results back into its reasoning
  • Higher benchmark scores — 91.6% on AIME 2024, 88.9% on AIME 2025, making 20% fewer major errors on difficult real-world tasks versus o1
  • Visual reasoning — o3 can integrate images directly into the reasoning chain (not just at input), enabling it to reason about diagrams, charts, and screenshots
  • Deliberative alignment — o3 uses a safety-focused approach where safety-relevant policies are explicitly part of the reasoning trace

o4-mini: Cost-Efficient Reasoning (April 2025)

o4-mini is a smaller model optimised for fast, high-volume reasoning. Despite being "mini," it is not significantly less capable than o3 on most tasks:

Benchmark comparison (o3 vs o4-mini):

  • AIME 2024 (maths olympiad): o3 91.6%, o4-mini 93.4% (o4-mini wins)
  • AIME 2025: o3 88.9%, o4-mini 92.7% (o4-mini wins)
  • MMMU (multimodal): o3 82.9%, o4-mini 81.6% (o3 slightly better)
  • MathVista: o3 86.8%, o4-mini 84.3% (o3 slightly better)
  • Cost per 1M tokens (input / output): o3 $10 / $40, o4-mini $1.10 / $4.40 (o4-mini ~9× cheaper)

o4-mini is the default recommendation for most reasoning tasks. Choose o3 when you need the absolute highest quality on multi-step complex reasoning or have visual analysis requirements that demand the larger model.

Thinking Token Budget in the API

Both o3 and o4-mini expose a reasoning_effort parameter (or equivalent) in the API. You can set:

  • low — quick, cheap, good for moderately hard problems
  • medium — default; good balance
  • high — maximum reasoning; best for hardest problems

Alternatively, you can set a maximum thinking token budget directly (e.g., max 8,192 thinking tokens). This caps latency and cost predictably.
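As a minimal sketch, assuming the request takes effort as a `reasoning={"effort": ...}` field (the helper name and exact request shape here are illustrative, not the definitive API):

```python
def reasoning_request(prompt: str, effort: str = "medium") -> dict:
    """Hypothetical helper: build the keyword arguments for a reasoning call.

    Assumes an API shape where effort is passed as reasoning={"effort": ...};
    check your SDK's documentation for the exact parameter name.
    """
    if effort not in {"low", "medium", "high"}:
        raise ValueError(f"unknown effort level: {effort}")
    return {
        "model": "o4-mini",
        "reasoning": {"effort": effort},
        "input": prompt,
    }
```

Centralising request construction like this makes it easy to swap effort levels (or models) in one place when tuning cost versus quality.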

Tip: Start with o4-mini at medium effort. If the answer quality is insufficient, try o4-mini high before escalating to o3. Many teams find o4-mini medium covers ~90% of their reasoning needs at 1/10th the o3 cost.
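The escalation strategy in the tip can be sketched as a small ladder. Here `ask_model` and `good_enough` are caller-supplied stand-ins (no real API call is made in this sketch):

```python
# Cheapest configuration first; escalate only when quality is insufficient.
LADDER = [("o4-mini", "medium"), ("o4-mini", "high"), ("o3", "high")]

def solve_with_escalation(prompt, ask_model, good_enough):
    """ask_model(model, effort, prompt) -> answer; good_enough(answer) -> bool.

    Walks the ladder and returns the first acceptable (model, effort, answer).
    Falls through with the strongest attempt if nothing passes the check.
    """
    for model, effort in LADDER:
        answer = ask_model(model, effort, prompt)
        if good_enough(answer):
            return model, effort, answer
    return model, effort, answer
```

The `good_enough` check might be a schema validation, a unit test on generated code, or a cheap grader model, depending on your task.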

Native Tool Use Within Reasoning

A major capability introduced with o3 and o4-mini is the ability to use tools during the reasoning trace — not just after it. This means:

  • Mid-reasoning web search — The model can search for a fact it needs to complete a reasoning step, then continue the trace with that information
  • Code execution in reasoning — Run Python to verify a calculation, then use the result in subsequent reasoning
  • Multi-tool chaining in one call — A single o3/o4-mini call can search the web, run code, and analyse a file as part of its thinking process

This makes o3/o4-mini significantly more useful as the "brain" in agentic pipelines — rather than a planning step that then calls tools, the model can gather information and plan simultaneously.
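A single request that permits mid-reasoning tool use might be assembled like this. The tool type strings below are illustrative placeholders, not exact API identifiers — consult your SDK's tool documentation for the real names:

```python
def agentic_request(prompt: str) -> dict:
    """Sketch of one call that lets the model search and run code
    from within its reasoning trace (tool names are assumptions)."""
    return {
        "model": "o3",
        "input": prompt,
        # Declaring tools up front lets the model invoke them mid-reasoning
        # instead of planning first and calling tools in a separate step.
        "tools": [
            {"type": "web_search"},
            {"type": "code_interpreter"},
        ],
    }
```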

o-Series vs GPT-4o: Routing Guide

Routing by task type:

  • Writing, editing, summarising: GPT-4o / GPT-5 (o-series is overkill)
  • Real-time chat / voice: GPT-4o / GPT-5 (o-series is too slow)
  • Complex maths / logic: o4-mini / o3 (dramatically better)
  • Hard code generation: try GPT-4o first; escalate to o4-mini / o3 if it fails
  • Multi-step agent planning: o4-mini / o3 (much more reliable); GPT-4o is marginal
  • Image understanding: GPT-4o for basic cases; o3 for complex visual reasoning
  • High-volume simple tasks: GPT-4o (or GPT-4o mini); o-series costs too much
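The routing guide above can be condensed into a small dispatcher. The task-type labels and the `high_stakes` flag are illustrative choices, not part of any API:

```python
# Task categories that benefit from a reasoning model (per the guide above).
REASONING_TASKS = {"math", "logic", "agent_planning", "hard_code"}

def route(task_type: str, high_stakes: bool = False) -> str:
    """Hypothetical router: map a task category to a model name.

    Reasoning-heavy tasks go to o4-mini by default, o3 when maximum
    quality is required; everything else stays on the GPT family.
    """
    if task_type in REASONING_TASKS:
        return "o3" if high_stakes else "o4-mini"
    if task_type == "high_volume_simple":
        return "gpt-4o-mini"
    return "gpt-4o"
```

In production you would typically let a cheap classifier (or the calling feature itself) supply `task_type` rather than hard-coding it.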

Latency Expectations

Set realistic expectations when using o-series models in production:

  • o4-mini (low effort): 5–15 seconds to first token of final answer
  • o4-mini (medium effort): 15–45 seconds
  • o3 (high effort): 1–5+ minutes for the hardest problems

These latencies make o-series models far better suited to async workflows (batch processing, background jobs) than to real-time, user-facing interactions. Design your UX accordingly — a progress indicator and an async response pattern are usually necessary.
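One minimal version of that async pattern: submit the slow call to a worker thread and return a future immediately, so the UI can show a progress indicator and poll for completion. `slow_reasoning_call` here is a stand-in stub, not a real API call:

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor

# Shared pool; in a real service you would size and shut this down properly.
_EXECUTOR = ThreadPoolExecutor(max_workers=4)

def slow_reasoning_call(prompt: str) -> str:
    """Stand-in for a 15-45 s o4-mini call; sleeps briefly for illustration."""
    time.sleep(0.1)
    return f"answer to: {prompt}"

def submit_async(prompt: str) -> Future:
    """Return immediately; the caller polls fut.done() or awaits fut.result()
    while the UI shows a spinner or progress indicator."""
    return _EXECUTOR.submit(slow_reasoning_call, prompt)
```

The same shape works with a job queue (Celery, SQS, etc.) when calls must survive process restarts.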

Checklist: Do You Understand This?

  • What is the key difference between the o-series and the GPT series?
  • Why was there no o2?
  • How does o4-mini compare to o3 in terms of benchmark performance and cost?
  • What capability did o3/o4-mini introduce over o1 regarding tool use?
  • When should you route to o4-mini vs GPT-4o in a production system?
  • What UX pattern is appropriate for o-series model calls in user-facing apps?