Intermediate

Inference Economics & Cost Curves

AI inference prices have fallen faster than those of almost any other technology in history: by over 99% in under three years for the cheapest GPT-4-class models. Understanding what drives these curves, where they're headed, and how to build business models around them is essential for anyone building AI products.

2023 cost: ~$60 / 1M tokens (GPT-4)
2026 cost: ~$0.15–1 / 1M tokens

Price per 1M tokens (chart):
GPT-4 (Jun 2023): ~$60
Claude 3 Haiku: ~$1.25 (output)
GPT-4o mini: ~$0.60
Gemini Flash: ~$0.15

>99% cost reduction in under 3 years: intelligence is commoditising faster than almost any technology in computing history

The Cost Collapse

In mid-2023, generating 1 million tokens of GPT-4 output cost around $60. By early 2026, comparable capability costs under $1 per million tokens, and as little as $0.15–0.40 on the fastest models. At those lower prices the reduction exceeds 99% in under three years; even at $1 it is roughly 98%.

This isn't a gradual, linear decline. It is closer to an exponential curve, driven by simultaneous improvements in hardware efficiency, model architecture, and inference software, plus competitive pressure. So far the rate shows no sign of slowing.
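The headline figures can be checked with a few lines of arithmetic. The sketch below uses the prices quoted above ($60 and $0.15–1 per 1M tokens) and assumes roughly 30 months between mid-2023 and early 2026; the function names are illustrative only.

```python
import math


def percent_reduction(old_price: float, new_price: float) -> float:
    """Percentage drop from old_price to new_price."""
    return (1 - new_price / old_price) * 100


def halving_time_months(old_price: float, new_price: float, months: float) -> float:
    """Months per price halving, assuming a constant exponential decline."""
    halvings = math.log2(old_price / new_price)
    return months / halvings


if __name__ == "__main__":
    # 30 months is an assumed span for mid-2023 to early 2026.
    for new_price in (1.00, 0.40, 0.15):
        drop = percent_reduction(60.0, new_price)
        half = halving_time_months(60.0, new_price, months=30)
        print(f"$60 -> ${new_price:.2f}: {drop:.2f}% lower, "
              f"halving every {half:.1f} months")
```

Under these assumptions the price halves every 3.5 to 5 months or so, and the ">99%" figure holds for any price below $0.60 per 1M tokens.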

Mid 2023 benchmarks

GPT-4: ~$60/1M output tokens
Claude 2: ~$24/1M output tokens
State-of-the-art = expensive and slow

Early 2026 benchmarks

Claude Haiku 4.5: ~$1/1M input tokens, ~$5/1M output tokens
DeepSeek V3 API: ~$0.27/1M input tokens
Gemini 2.5 Flash: <$0.15/1M tokens

What Drives the Cost Curve

Five compounding forces drive inference costs down:

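Hardware improvements: With Dennard scaling long over, gains now come from specialised tensor hardware, lower-precision number formats, and higher memory bandwidth. Each new GPU generation (H100 → B200 → Rubin) delivers more FLOPS per dollar. NVIDIA's H100 delivers up to ~4× the throughput of the A100, depending on workload and precision; the B200 roughly doubles it again. Hardware gains compound with model efficiency gains.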
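Model architecture efficiency: Mixture-of-Experts (MoE) lets a model with a large total parameter count activate only a fraction of its parameters per token. DeepSeek-V3 (671B total / 37B active) processes each token at roughly the compute cost of a 37B dense model, about 18× fewer active parameters than a dense model of the same total size. All 671B parameters must still be held in memory, so the saving is in compute per token rather than in weight storage.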
Inference software: Serving engines such as vLLM, SGLang, and TensorRT-LLM, together with kernels such as FlashAttention, have dramatically improved GPU utilisation. Techniques like continuous batching and speculative decoding can push GPU utilisation from ~40% to 80–90%, roughly halving effective per-token cost.
Quantisation: Running models in INT8 or INT4 precision (vs FP16) cuts weight memory and memory bandwidth requirements by roughly 2× or 4× and increases throughput. Quality loss is minimal for most tasks with modern quantisation methods (GPTQ, AWQ).
Competition: DeepSeek's January 2025 R1 release demonstrated frontier-class reasoning at a dramatically lower reported training cost and intensified price competition among OpenAI, Anthropic, and Google. Competitive dynamics now force price reductions independent of underlying cost improvements.
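Three of these forces reduce to simple ratios. The sketch below is back-of-envelope arithmetic, not a benchmark. It uses the DeepSeek-V3 parameter counts and the utilisation figures from the list above, assumes weight memory is simply parameters × bits / 8, and assumes cost per token scales inversely with GPU utilisation. All function names are hypothetical.

```python
def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Share of an MoE model's parameters used per token."""
    return active_params_b / total_params_b


def weight_memory_gb(params_b: float, bits_per_weight: int) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes).

    Ignores KV cache, activations, and quantisation metadata.
    """
    return params_b * bits_per_weight / 8


def relative_cost_per_token(old_utilisation: float, new_utilisation: float) -> float:
    """Cost per token relative to before, assuming cost ~ 1 / utilisation."""
    return old_utilisation / new_utilisation


if __name__ == "__main__":
    frac = active_fraction(671, 37)
    print(f"MoE: {frac:.1%} of parameters active per token (~{1 / frac:.0f}x fewer)")

    for bits in (16, 8, 4):
        print(f"{bits}-bit weights for 671B parameters: "
              f"{weight_memory_gb(671, bits):,.1f} GB")

    for new_util in (0.80, 0.90):
        rel = relative_cost_per_token(0.40, new_util)
        print(f"Utilisation 40% -> {new_util:.0%}: cost per token x{rel:.2f}")
```

The numbers show why the forces compound rather than overlap: MoE cuts compute per token but leaves all 671B weights resident in memory, which is the part quantisation shrinks, while better utilisation spreads the same hardware cost over more tokens.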

What This Means for Products

Use cases that become viable

Mass document processing (millions of PDFs)
Per-user personalisation at consumer scale
Background agents running continuously
AI-generated first drafts for every piece of content
Real-time AI in mobile apps (not just cloud)
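A rough cost estimate shows why a workload like mass document processing moves from impractical to routine. In this sketch the document count and tokens per document are hypothetical placeholders, and every token is charged at a single per-million price for simplicity; the two prices are the 2023 and 2026 figures quoted earlier.

```python
def batch_cost_usd(num_docs: int, tokens_per_doc: int, price_per_million: float) -> float:
    """Total cost of processing a batch at a flat price per 1M tokens."""
    total_tokens = num_docs * tokens_per_doc
    return total_tokens / 1_000_000 * price_per_million


if __name__ == "__main__":
    docs = 1_000_000        # hypothetical corpus size
    tokens_per_doc = 2_000  # hypothetical average document length

    for label, price in (("2023 (~$60/1M)", 60.0), ("2026 (~$0.40/1M)", 0.40)):
        cost = batch_cost_usd(docs, tokens_per_doc, price)
        print(f"{label}: ${cost:,.0f}")
```

With these placeholder inputs the same job drops from about $120,000 to about $800, which is the difference between a budgeted project and a background task.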

Business model risks

Pricing power erodes as customers expect prices to track falling inference costs
Thin-wrapper products commoditise quickly
Differentiation must come from data, UX, or workflow — not the model
Customers will not accept 2023-era prices in 2026

Training vs Inference Economics

Training and inference have very different economic structures:

Training is a one-time fixed cost that has also fallen dramatically relative to capability. GPT-3 (2020) is estimated to have cost ~$4.6M to train; DeepSeek-V3, the base model behind DeepSeek-R1 (2025), reported ~$5.6M in GPU cost for its final training run while being far more capable. Frontier models from OpenAI/Anthropic/Google are estimated to cost $50–150M+ per training run, but open-weight distillations of their capability cost far less.
Inference is the recurring cost — paid every time a user or application calls the model. This is where the 99% collapse has happened and where product cost structures are defined.
The ratio is shifting: As inference gets cheaper, the relative importance of training investment to overall AI economics decreases. A model trained for $5M but served efficiently at $0.10/1M tokens can be highly profitable. A model trained for $150M but priced aggressively faces margin pressure.
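The interaction between a one-time training cost and a recurring per-token cost can be made concrete with a break-even calculation. In this sketch the $5M and $150M training costs and the $0.10 serving cost come from the text above; the $0.30 selling price per 1M tokens is a purely hypothetical assumption, as is the function name.

```python
def breakeven_million_tokens(training_cost: float,
                             price_per_million: float,
                             serving_cost_per_million: float) -> float:
    """Millions of tokens that must be sold to recover the training cost."""
    margin = price_per_million - serving_cost_per_million
    if margin <= 0:
        raise ValueError("price must exceed serving cost to ever break even")
    return training_cost / margin


if __name__ == "__main__":
    price = 0.30         # hypothetical selling price per 1M tokens
    serving_cost = 0.10  # serving cost per 1M tokens, from the text

    for training_cost in (5_000_000, 150_000_000):
        m_tokens = breakeven_million_tokens(training_cost, price, serving_cost)
        print(f"${training_cost / 1e6:.0f}M training run: "
              f"break-even at {m_tokens / 1e6:,.0f} trillion tokens")
```

Under these assumptions the $5M model recovers its training cost after about 25 trillion tokens, while the $150M model needs about 750 trillion at the same price and serving cost.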

Where Is the Curve Going?

A widely held view among AI economists and infrastructure researchers is that the cost curve has not flattened. Drivers of continued decline include:

NVIDIA Blackwell and Rubin GPU generations (2025–2027), expected to deliver 2–4× further hardware efficiency
Edge inference on NPUs (mobile, PC) that shifts cost from cloud to user hardware
Further architectural improvements in model efficiency (post-MoE architectures)
Specialised inference chips from Google (TPUs), AWS (Trainium/Inferentia), and Meta (MTIA)
Increased global GPU supply as TSMC and Samsung expand capacity

The practical implication: if your product relies on AI inference being expensive to protect margins, reconsider. Build on the assumption that by 2027, costs will be another 10× lower than today.
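One way to act on this is to stress-test unit economics against a falling cost curve. The sketch below assumes a constant exponential decline that reaches the stated 10× over the planning horizon; the 24-month horizon and the $0.40 starting cost are hypothetical inputs.

```python
def projected_cost(cost_today: float, total_factor: float,
                   horizon_months: float, month: float) -> float:
    """Cost at `month`, if cost falls by `total_factor` over `horizon_months`
    at a constant exponential rate."""
    return cost_today / total_factor ** (month / horizon_months)


if __name__ == "__main__":
    cost_today = 0.40  # hypothetical cost per 1M tokens today
    for month in (0, 6, 12, 18, 24):
        cost = projected_cost(cost_today, total_factor=10,
                              horizon_months=24, month=month)
        print(f"month {month:2d}: ${cost:.3f} per 1M tokens")
```

With these inputs the cost is about $0.13 per 1M tokens after 12 months and $0.04 after 24, so a price that looks comfortable today may need revisiting well before the horizon ends.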

Checklist: Do You Understand This?

AI inference costs have fallen over 99% from 2023 to 2026, driven by hardware, architecture, software, and competition
MoE architectures (DeepSeek, Mixtral) deliver large model capability at small model compute cost
vLLM and similar inference servers dramatically improve GPU utilisation, cutting effective per-token cost
Training is a one-time fixed cost; inference is the recurring margin question
Product differentiation must come from data, workflow, and UX — not the model itself — as commodity inference becomes the norm
The curve is expected to continue: plan for another 10× cost reduction by 2027