Cost & Performance Trade-offs
Not every query needs the most powerful model. Using GPT-4o for every request when GPT-4o-mini would suffice costs 10–20× more with no quality gain. Making smart cost-performance decisions requires understanding the latency-throughput-cost triangle, the model tier landscape, and the infrastructure optimisations that reduce per-token cost. This page gives you the framework for these decisions.
The LatencyβThroughputβCost Triangle
Latency and throughput are inversely related in LLM serving: optimising for one degrades the other. Cost scales with both. Understanding this triangle is prerequisite to making good deployment decisions.
| Mode | Optimises for | Trade-off | Best for |
|---|---|---|---|
| Low latency | Fast TTFT, fast response for each request | Lower throughput, higher cost/token | Real-time chat, voice AI, interactive agents |
| High throughput | Maximum tokens/second across all requests | Higher latency per request (batching delay) | Batch processing, nightly jobs, data enrichment |
| Cost optimised | Minimum $ per output token | Often slower model, higher latency | High-volume classification, extraction, summarisation |
Model Tiers in 2025
Every major provider now offers a model family spanning capability tiers. The cost difference between tiers is typically 10–30×. Matching tier to task is the single highest-impact cost optimisation.
| Tier | Examples | Relative cost | Best for |
|---|---|---|---|
| Frontier / Reasoning | o3, Claude Opus, Gemini Ultra | 100× | Complex reasoning, code generation, novel research |
| Flagship | GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro | 10–20× | Production chat, complex RAG, agent orchestration |
| Efficient | GPT-4o-mini, Claude Haiku 4.5, Gemini Flash | 1–3× | High-volume classification, extraction, simple Q&A |
| Local / Open | Llama 3.1 8B, Mistral 7B, Phi-3 | Fixed hardware | Privacy-sensitive, offline, high-volume at scale |
Intelligent Model Routing
Rather than using one model for everything, route each request to the cheapest model that can handle it. A classifier (fast and cheap) decides the tier; the chosen model processes the request. This is the most impactful architecture pattern for cost reduction in high-volume systems.
Routing pattern:
- Fast classifier (Haiku / GPT-4o-mini, <100ms) evaluates query complexity
- Simple queries (FAQ, extraction, classification) → efficient model
- Medium queries (multi-step reasoning, nuanced writing) → flagship model
- Complex queries (novel research, complex code, multi-document synthesis) → frontier model
Routing classifiers themselves must be fast and cheap: their cost and latency must not exceed the savings from routing. Target: <5% of total request cost.
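A minimal sketch of this routing pattern in Python. The keyword/length heuristic stands in for the fast LLM classifier described above, and the model IDs and thresholds are illustrative placeholders, not recommendations:

```python
# Illustrative tiered router. The keyword/length heuristic stands in for a
# fast LLM classifier; model IDs and thresholds are placeholders.

TIER_MODELS = {
    "efficient": "gpt-4o-mini",
    "flagship": "gpt-4o",
    "frontier": "o3",
}

FRONTIER_HINTS = ("prove", "design", "refactor", "synthesis")
FLAGSHIP_HINTS = ("explain", "compare", "summarise", "draft")

def route(query: str) -> str:
    """Return the cheapest model expected to handle the query."""
    q = query.lower()
    words = len(q.split())
    if any(h in q for h in FRONTIER_HINTS) or words > 200:
        return TIER_MODELS["frontier"]
    if any(h in q for h in FLAGSHIP_HINTS) or words > 50:
        return TIER_MODELS["flagship"]
    return TIER_MODELS["efficient"]  # FAQ, extraction, classification
```

In production the heuristic body of `route` would be replaced by a call to the classifier model, keeping the same cheapest-first control flow.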
Routing validation:
Measure quality on each tier separately: routing to a cheaper model is only correct if quality meets the bar. A/B test the routing logic against all-frontier as a baseline before enabling it in production.
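One way to make this validation concrete is a per-tier quality gate against the all-frontier baseline. The 0.02 margin below is a policy choice for illustration, not a recommendation:

```python
# Pick the cheapest tier whose eval-set quality stays within a margin of the
# all-frontier baseline. Quality scores in [0, 1] come from your own evals.

def cheapest_passing_tier(quality_by_tier: dict, baseline: float,
                          max_drop: float = 0.02) -> str:
    for tier in ("efficient", "flagship", "frontier"):  # cheapest first
        if quality_by_tier[tier] >= baseline - max_drop:
            return tier
    return "frontier"  # nothing passes: fall back to the strongest model
```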
Key Cost & Performance Metrics
Performance metrics
- TTFT: time to first token – user-perceived responsiveness
- TBT: time between tokens – determines streaming smoothness
- Tokens/sec: generation throughput per model instance
- P95/P99 latency: tail latency – what slow users experience
- Context processing speed: how fast the model reads input tokens
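The first three metrics fall out directly from token arrival timestamps. A sketch (the timestamps in the usage example are invented):

```python
# Derive TTFT, mean TBT, and tokens/sec from token arrival times (seconds,
# measured from when the request was sent).

def latency_metrics(request_sent: float, token_times: list) -> dict:
    ttft = token_times[0] - request_sent
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    return {
        "ttft_s": ttft,                                        # responsiveness
        "mean_tbt_s": sum(gaps) / len(gaps) if gaps else 0.0,  # smoothness
        "tokens_per_s": len(token_times) / (token_times[-1] - request_sent),
    }
```

For P95/P99, collect per-request values of these metrics and take percentiles across requests rather than averaging.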
Cost metrics
- Cost per task: USD to complete one end-user action (your unit economics metric)
- Input vs output token ratio: output tokens are 3–5× more expensive than input on most APIs
- Cost per hour: monitor for runaway spending; alert on anomalies
- Cache hit rate: prompt caching can reduce cost 50–90% for repeated prefixes
Infrastructure Optimisations
Prompt caching
Cache the static prefix of your system prompt. Anthropic and OpenAI both offer prompt caching: repeated system prompt tokens are charged at 10–20% of the normal input price. For applications with long system prompts, this alone can reduce cost by 50–80%.
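The arithmetic behind that saving, with placeholder prices (not any provider's actual rates):

```python
# Effective input cost with a cached system-prompt prefix. Cached tokens are
# billed at a discount (10% here, per the 10-20% range above); the price,
# discount, and hit rate are placeholders.

def input_cost_usd(prefix_tokens: int, suffix_tokens: int,
                   price_per_mtok: float = 2.50,   # hypothetical $/1M input tokens
                   cache_discount: float = 0.10,
                   hit_rate: float = 0.90) -> float:
    billed = (prefix_tokens * hit_rate * cache_discount   # cache hits
              + prefix_tokens * (1 - hit_rate)            # cache misses
              + suffix_tokens)                            # always full price
    return billed * price_per_mtok / 1e6
```

With a 10,000-token prefix and a 500-token user message, a 90% hit rate cuts effective input cost by well over half, consistent with the 50–80% range above.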
Quantisation (local inference)
4-bit quantisation reduces GPU memory requirement by ~4× and typically increases throughput, with minimal quality loss. At INT4, a 7B model fits in 4–6 GB VRAM; a 13B in 8 GB. Use GGUF (llama.cpp) or GPTQ/AWQ for quantised models.
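A back-of-envelope sizing check consistent with the numbers above. The fixed overhead term is a rough assumption; real deployments also need KV-cache headroom that grows with context length:

```python
# Rough VRAM needed for quantised weights: params * bits/8 bytes, plus a
# fixed overhead guess for KV cache and activations. A sketch, not a sizing
# tool.

def vram_gb(params_billion: float, bits: int, overhead_gb: float = 1.5) -> float:
    weights_gb = params_billion * bits / 8   # 1e9 params * bits/8 bytes = GB
    return weights_gb + overhead_gb
```

A 7B model at INT4 comes to 3.5 GB of weights plus overhead, inside the 4–6 GB range cited above; a 13B lands at roughly 8 GB.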
Batching (high-throughput workloads)
Batch multiple requests through the model simultaneously. Continuous batching (vLLM, TGI) maximises GPU utilisation: instead of processing requests sequentially, the model handles many in parallel. Trade-off: adds latency per request (batching delay). Not appropriate for real-time chat.
Speculative decoding
A small draft model generates candidate tokens; the large model validates them in parallel. Delivers 2–4× speedup with identical output quality. Requires a paired draft model; supported natively in vLLM and TGI. Best for latency-sensitive applications using large models.
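The speedup depends on how often the target model accepts draft tokens. A simplified model in the spirit of the standard speculative-sampling analysis, assuming each of the draft's k proposals is accepted independently with probability alpha and ignoring the draft model's own cost:

```python
# Expected tokens emitted per target-model step when the draft proposes k
# tokens, each accepted independently with probability alpha (< 1).

def expected_tokens_per_step(alpha: float, k: int) -> float:
    # Geometric series: 1 + alpha + alpha**2 + ... + alpha**k
    return (1 - alpha ** (k + 1)) / (1 - alpha)
```

With alpha = 0.8 and k = 4 drafts per step, the target model emits about 3.4 tokens per forward pass, in line with the 2–4× range above.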
Output length control
Output tokens dominate both cost and latency. Constrain max_tokens tightly. Use structured output (JSON) instead of prose where applicable β structured output is typically shorter. Instruct the model explicitly: "Respond in 3 sentences or fewer."
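The latency side of this claim is roughly linear in TBT, which is why capping max_tokens pays off directly:

```python
# Response time is approximately TTFT plus one TBT per output token, so
# max_tokens directly bounds the tail of the response.

def response_time_s(ttft_s: float, output_tokens: int, tbt_s: float) -> float:
    return ttft_s + output_tokens * tbt_s
```

At an illustrative 0.4 s TTFT and 20 ms TBT, cutting a response from 300 to 100 tokens cuts total latency from about 6.4 s to about 2.4 s.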
Running a Cost Analysis
Cost per task calculation:
- Measure: average input tokens per task, average output tokens per task
- Apply provider pricing: (input_tokens × input_price) + (output_tokens × output_price)
- Add retrieval cost if RAG: embedding API calls + vector DB query cost
- Add tool call cost if agents: each LLM call in the agent loop is charged separately
- Annualise at your expected volume: cost_per_task × tasks_per_day × 365
- Compare to local inference TCO: hardware amortisation + energy + ops overhead
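The steps above as a sketch. All prices and token counts below are placeholders, not any provider's actual rates:

```python
# Cost per task and annualised spend, following the checklist above. Prices
# and token counts are placeholders.

def cost_per_task_usd(input_tokens: int, output_tokens: int,
                      input_price_per_mtok: float, output_price_per_mtok: float,
                      retrieval_cost_usd: float = 0.0,
                      agent_llm_calls: int = 1) -> float:
    llm = (input_tokens * input_price_per_mtok
           + output_tokens * output_price_per_mtok) / 1e6
    # Each LLM call in an agent loop is charged separately.
    return llm * agent_llm_calls + retrieval_cost_usd

def annual_cost_usd(cost_per_task: float, tasks_per_day: int) -> float:
    return cost_per_task * tasks_per_day * 365
```

At 2,000 input and 500 output tokens with hypothetical $2.50/$10.00 per-million-token prices, a task costs $0.01; at 10,000 tasks/day that annualises to $36,500, the figure to compare against local-inference TCO.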
Use artificialanalysis.ai for up-to-date price/performance benchmarks across providers and models.
Checklist: Do You Understand This?
- What is the latency-throughput trade-off in LLM serving, and what deployment mode suits interactive chat vs batch processing?
- What is the rough cost multiplier between the efficient tier (e.g., Haiku) and the frontier tier (e.g., o3)?
- How does intelligent model routing work, and what constraint governs the routing classifier's cost?
- Why do output tokens cost more than input tokens, and what is the practical implication for system design?
- How does prompt caching work, and what is the typical cost reduction for applications with long system prompts?
- What is speculative decoding and what speedup does it deliver?