Inference Economics & Cost Curves
AI inference costs have fallen faster than almost any technology in history — over 99% in three years. Understanding what drives these curves, where they're headed, and how to build business models around them is essential for anyone building AI products.
>99% cost reduction in under 3 years — intelligence is commoditising faster than any technology in computing history
The Cost Collapse
In mid-2023, generating 1 million tokens of GPT-4-class output cost around $60. By early 2026, comparable capability costs under $1, and as little as $0.15–0.40 for the fastest models. At those prices the reduction exceeds 99% in under three years.
This is not a gradual decline. It is a roughly exponential curve driven by simultaneous improvements in hardware efficiency, model architecture, inference software, and competitive pressure, and so far the rate shows no sign of slowing.
Mid 2023 benchmarks
- GPT-4: ~$60/1M output tokens
- Claude 2: ~$24/1M output tokens
- State-of-the-art = expensive and slow
Early 2026 benchmarks
- Claude Haiku 4.5: ~$0.40/1M output tokens
- DeepSeek V3 API: ~$0.27/1M input tokens (an input price, so not directly comparable with the output prices listed here)
- Gemini 2.5 Flash: <$0.15/1M tokens
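The arithmetic behind the headline figure can be checked directly. The sketch below uses only the output-token prices quoted above and assumes, for illustration, that about 2.5 years separate the two snapshots and that GPT-4 and Claude Haiku 4.5 count as "comparable capability" in the article's sense. The function names are illustrative, not from any library.

```python
def percent_reduction(old_price: float, new_price: float) -> float:
    """Percentage drop from old_price to new_price."""
    return (1 - new_price / old_price) * 100


def annual_decline_factor(old_price: float, new_price: float, years: float) -> float:
    """Constant yearly divisor that would take old_price to new_price over `years`."""
    return (old_price / new_price) ** (1 / years)


# Output-token prices quoted above, in USD per 1M tokens.
gpt4_mid_2023 = 60.00
haiku_early_2026 = 0.40
years_elapsed = 2.5  # assumption: mid-2023 to early 2026

drop = percent_reduction(gpt4_mid_2023, haiku_early_2026)
factor = annual_decline_factor(gpt4_mid_2023, haiku_early_2026, years_elapsed)

print(f"reduction: {drop:.1f}%")
print(f"implied decline: about {factor:.1f}x cheaper per year")
```

Under these assumptions the drop works out to a little over 99%, which corresponds to prices falling by a factor of roughly seven each year if the decline were perfectly smooth.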
What Drives the Cost Curve
Five compounding forces drive inference costs down:
- Hardware improvements (generational gains taking over where Dennard scaling left off): Each new GPU generation (A100 → H100 → B200 → Rubin) delivers more FLOPS per dollar. NVIDIA's H100 delivers up to ~4× the throughput of the A100 on some workloads, and the B200 roughly doubles that again. Hardware gains compound with model efficiency gains.
- Model architecture efficiency: Mixture-of-Experts (MoE) allows models with large total parameter counts to activate only a fraction per token. DeepSeek-V3 (671B total / 37B active) processes each token at roughly the compute cost of a 37B dense model. That ratio of total to active parameters, about 18×, directly reduces per-token inference compute.
- Inference software: Serving frameworks such as vLLM, SGLang, and TensorRT-LLM, together with kernels like FlashAttention, have dramatically improved GPU utilisation. Techniques like speculative decoding and continuous batching push GPU utilisation from ~40% to 80–90%, roughly halving effective per-token cost.
- Quantisation: Running models in INT8 or INT4 precision (vs FP16) reduces memory bandwidth requirements and increases throughput. Quality loss is minimal for most tasks with modern quantisation methods (GPTQ, AWQ).
- Competition: DeepSeek's January 2025 R1 release demonstrated frontier-class reasoning at dramatically lower training cost, triggering immediate price cuts from OpenAI, Anthropic, and Google. Competitive dynamics now force price reductions independent of underlying cost improvements.
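To see how these forces compound, the sketch below turns GPU price, throughput, and utilisation into a cost per million tokens, then multiplies the gains quoted in the list. The GPU price and peak throughput are hypothetical placeholders, and treating the factors as independent multipliers is a simplifying assumption, since real gains overlap.

```python
def cost_per_million_tokens(gpu_hour_usd: float,
                            peak_tokens_per_sec: float,
                            utilisation: float) -> float:
    """USD to generate 1M tokens on one GPU at a given average utilisation."""
    tokens_per_hour = peak_tokens_per_sec * utilisation * 3600
    return gpu_hour_usd / tokens_per_hour * 1_000_000


# Hypothetical GPU: $2.00/hour, 1,000 tokens/s at full utilisation.
GPU_HOUR_USD = 2.00
PEAK_TOKENS_PER_SEC = 1_000

# Inference software: utilisation rising from ~40% to ~80% (figures from the list above).
before = cost_per_million_tokens(GPU_HOUR_USD, PEAK_TOKENS_PER_SEC, utilisation=0.40)
after = cost_per_million_tokens(GPU_HOUR_USD, PEAK_TOKENS_PER_SEC, utilisation=0.80)
software_gain = before / after

# MoE: total vs active parameters for DeepSeek-V3 (671B / 37B).
moe_gain = 671 / 37

# Hardware: the "doubles it again" figure quoted for one GPU generation.
hardware_gain = 2.0

# Assumption: the three gains are independent and simply multiply.
combined = software_gain * moe_gain * hardware_gain

print(f"cost at 40% utilisation: ${before:.2f} per 1M tokens")
print(f"cost at 80% utilisation: ${after:.2f} per 1M tokens")
print(f"software {software_gain:.1f}x * MoE {moe_gain:.1f}x * hardware {hardware_gain:.1f}x "
      f"= {combined:.0f}x")
```

Even with only three of the five forces included, the idealised product is well over an order of magnitude, which is why modest individual improvements add up to a steep overall curve.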
What This Means for Products
Use cases that become viable
- Mass document processing (millions of PDFs)
- Per-user personalisation at consumer scale
- Background agents running continuously
- AI-generated first drafts for every piece of content
- Real-time AI in mobile apps (not just cloud)
Business model risks
- Pricing power erodes as costs fall faster than revenues
- Thin-wrapper products commoditise quickly
- Differentiation must come from data, UX, or workflow — not the model
- Customers will refuse to pay 2023-era prices by 2026
Training vs Inference Economics
Training and inference have very different economic structures:
- Training is a largely up-front cost, and the cost of reaching a given level of capability has also fallen dramatically. GPT-3 (2020) cost an estimated ~$4.6M to train; DeepSeek-V3 (late 2024), the base model behind DeepSeek-R1, reported roughly $5.6M in GPU cost for its final training run while being far more capable. Frontier models from OpenAI/Anthropic/Google cost $50–150M+ per training run, but open-weight distillations of their capability cost far less.
- Inference is the recurring cost — paid every time a user or application calls the model. This is where the 99% collapse has happened and where product cost structures are defined.
- The ratio is shifting: As inference gets cheaper, the relative importance of training investment to overall AI economics decreases. A model trained for $5M but served efficiently at $0.10/1M tokens can be highly profitable. A model trained for $150M but priced aggressively faces margin pressure.
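The two scenarios in the last bullet can be compared with a simple break-even calculation: how many tokens must be sold before the per-token margin repays the training bill. The training costs and the $0.10 serving cost come from the text; the selling prices are hypothetical, and the function name is illustrative.

```python
def breakeven_million_tokens(training_cost_usd: float,
                             price_per_million: float,
                             serving_cost_per_million: float) -> float:
    """Millions of tokens that must be sold to recover the training cost."""
    margin = price_per_million - serving_cost_per_million
    if margin <= 0:
        raise ValueError("no positive margin per token; training cost is never recovered")
    return training_cost_usd / margin


SERVING_COST = 0.10  # USD per 1M tokens, from the text

# Scenario A: $5M training run, hypothetical price of $0.40 per 1M tokens.
small = breakeven_million_tokens(5_000_000, 0.40, SERVING_COST)

# Scenario B: $150M training run, hypothetical aggressive price of $0.15 per 1M tokens.
large = breakeven_million_tokens(150_000_000, 0.15, SERVING_COST)

# Convert millions of tokens to trillions of tokens for readability.
print(f"scenario A breaks even after ~{small / 1e6:.1f} trillion tokens")
print(f"scenario B breaks even after ~{large / 1e6:.0f} trillion tokens")
```

With these placeholder prices, the expensive model needs on the order of a hundred times more volume to recover its training cost, which is the margin pressure the bullet describes.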
Where Is the Curve Going?
The consensus view among AI economists and infrastructure researchers is that the cost curve has not flattened. Drivers of continued decline include:
- NVIDIA Blackwell and Rubin GPU generations (2025–2027) delivering 2–4× further hardware efficiency
- Edge inference on NPUs (mobile, PC) that shifts cost from cloud to user hardware
- Further architectural improvements in model efficiency (post-MoE architectures)
- Specialised inference chips from Google (TPUs), AWS (Trainium/Inferentia), and Meta (MTIA)
- Increased global GPU supply as TSMC and Samsung expand capacity
The practical implication: if your product relies on AI inference being expensive to protect margins, reconsider. Build on the assumption that by 2027, costs will be another 10× lower than today.
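For planning purposes, the 10× assumption can be written as a simple projection. The sketch below assumes a smooth exponential decline and takes, as a hypothetical input, that one tenfold drop takes about 1.5 years (early 2026 to some point in 2027); both the function name and that interval are illustrative choices, not figures from the text.

```python
def projected_cost(cost_today: float, years_ahead: float, tenfold_every_years: float) -> float:
    """Cost after `years_ahead`, if cost falls 10x every `tenfold_every_years` years."""
    return cost_today / 10 ** (years_ahead / tenfold_every_years)


COST_TODAY = 0.40            # USD per 1M output tokens (Claude Haiku 4.5 figure above)
TENFOLD_EVERY_YEARS = 1.5    # assumption: one 10x drop between early 2026 and 2027

for years_ahead in (0.0, 0.5, 1.0, 1.5):
    cost = projected_cost(COST_TODAY, years_ahead, TENFOLD_EVERY_YEARS)
    print(f"+{years_ahead:.1f} years: ${cost:.3f} per 1M tokens")
```

Running a product's unit economics against the last row, rather than today's price, is one way to test whether margins survive the decline.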
Checklist: Do You Understand This?
- AI inference costs have fallen over 99% from 2023 to 2026, driven by hardware, architecture, software, and competition
- MoE architectures (DeepSeek, Mixtral) deliver large model capability at small model compute cost
- vLLM and similar inference servers dramatically improve GPU utilisation, cutting effective per-token cost
- Training is a one-time fixed cost; inference is the recurring margin question
- Product differentiation must come from data, workflow, and UX — not the model itself — as commodity inference becomes the norm
- The curve is expected to continue: plan for another 10× cost reduction by 2027