🧠 All Things AI
Intermediate

Cost & Performance Trade-offs

Not every query needs the most powerful model. Using GPT-4o for every request when GPT-4o-mini would suffice multiplies cost 10–20× with no quality gain. Making smart cost-performance decisions requires understanding the latency-throughput-cost triangle, the model tier landscape, and the infrastructure optimisations that reduce per-token cost. This page gives you the framework for these decisions.

The Latency–Throughput–Cost Triangle

Latency and throughput are inversely related in LLM serving: optimising for one degrades the other. Cost scales with both. Understanding this triangle is a prerequisite for making good deployment decisions.

Mode | Optimises for | Trade-off | Best for
Low latency | Fast TTFT, fast response for each request | Lower throughput, higher cost/token | Real-time chat, voice AI, interactive agents
High throughput | Maximum tokens/second across all requests | Higher latency per request (batching delay) | Batch processing, nightly jobs, data enrichment
Cost optimised | Minimum $ per output token | Often a slower model, higher latency | High-volume classification, extraction, summarisation

Model Tiers in 2025

Every major provider now offers a model family spanning capability tiers. The cost difference between tiers is typically 10–30×. Matching tier to task is the single highest-impact cost optimisation.

Tier | Examples | Relative cost | Best for
Frontier / Reasoning | o3, Claude Opus, Gemini Ultra | 100× | Complex reasoning, code generation, novel research
Flagship | GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro | 10–20× | Production chat, complex RAG, agent orchestration
Efficient | GPT-4o-mini, Claude Haiku 4.5, Gemini Flash | 1–3× | High-volume classification, extraction, simple Q&A
Local / Open | Llama 3.1 8B, Mistral 7B, Phi-3 | Fixed hardware | Privacy-sensitive, offline, high-volume at scale

Intelligent Model Routing

Rather than using one model for everything, route each request to the cheapest model that can handle it. A classifier (fast and cheap) decides the tier; the chosen model processes the request. This is the most impactful architecture pattern for cost reduction in high-volume systems.

Routing pattern:

  1. Fast classifier (Haiku / GPT-4o-mini, <100ms) evaluates query complexity
  2. Simple queries (FAQ, extraction, classification) → efficient model
  3. Medium queries (multi-step reasoning, nuanced writing) → flagship model
  4. Complex queries (novel research, complex code, multi-document synthesis) → frontier model
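The routing pattern above can be sketched in a few lines. Everything here is an illustrative placeholder: the model names and prices are invented, and the keyword heuristic stands in for what would, in production, be a fast LLM classifier call.

```python
# Hypothetical model routing sketch: a cheap classifier picks the tier,
# then the request goes to the cheapest model that can handle it.
# Model names and per-token prices are placeholders, not real quotes.

TIERS = {
    "simple":  {"model": "efficient-model", "usd_per_1m_out": 0.60},
    "medium":  {"model": "flagship-model",  "usd_per_1m_out": 10.00},
    "complex": {"model": "frontier-model",  "usd_per_1m_out": 60.00},
}

def classify(query: str) -> str:
    """Stand-in for the fast classifier (<100 ms in production).
    Here: a crude keyword heuristic purely for illustration."""
    q = query.lower()
    if any(k in q for k in ("prove", "design", "research", "synthesise")):
        return "complex"
    if any(k in q for k in ("explain", "compare", "draft", "why")):
        return "medium"
    return "simple"

def route(query: str) -> str:
    """Return the cheapest model judged able to handle the query."""
    return TIERS[classify(query)]["model"]
```

In a real system the classifier would be a model call with its own latency and cost budget, which is exactly why the next constraint matters.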

Routing classifiers themselves must be fast and cheap: their cost and latency must not exceed the savings from routing. Target: <5% of total request cost.

Routing validation:

Measure quality on each tier separately; routing to a cheaper model is only correct if quality meets the bar. A/B test the routing logic against all-frontier as a baseline before enabling it in production.
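One way to express the two validation criteria as a single check. The thresholds below (1 point of quality drop, 5% classifier cost share) are assumptions you would tune to your own quality bar:

```python
# Hypothetical routing validation check: compare the routed configuration
# against an all-frontier baseline on quality, and confirm the classifier
# stays within its cost budget. Threshold values are assumptions.

def routing_is_acceptable(routed_quality: float,
                          baseline_quality: float,
                          classifier_cost: float,
                          total_request_cost: float,
                          max_quality_drop: float = 0.01,
                          max_classifier_share: float = 0.05) -> bool:
    # Routing is only correct if quality stays within the bar...
    quality_ok = (baseline_quality - routed_quality) <= max_quality_drop
    # ...AND the classifier's own cost stays a small fraction of the total.
    cost_ok = classifier_cost / total_request_cost <= max_classifier_share
    return quality_ok and cost_ok
```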

Key Cost & Performance Metrics

Performance metrics

  • TTFT: time-to-first-token, i.e. user-perceived responsiveness
  • TBT: time between tokens; determines streaming smoothness
  • Tokens/sec: generation throughput per model instance
  • P95/P99 latency: tail latency, i.e. what slow users experience
  • Context processing speed: how fast the model reads input tokens
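A minimal sketch of how TTFT, TBT, and tokens/sec can be derived from the arrival timestamps of one streamed response (the timestamps in the test are fabricated for illustration):

```python
# Derive streaming performance metrics from token-arrival timestamps.

def streaming_metrics(request_sent_at: float, token_times: list[float]):
    """Return (TTFT, mean TBT, tokens/sec) for one streamed response."""
    # TTFT: delay until the first token arrives.
    ttft = token_times[0] - request_sent_at
    # TBT: mean gap between consecutive tokens (streaming smoothness).
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    tbt = sum(gaps) / len(gaps) if gaps else 0.0
    # Throughput over the generation phase only (excludes TTFT).
    tokens_per_sec = len(gaps) / (token_times[-1] - token_times[0])
    return ttft, tbt, tokens_per_sec
```

In production you would aggregate these per request and report P95/P99 over the distribution, not just means.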

Cost metrics

  • Cost per task: USD to complete one end-user action (your unit economics metric)
  • Input vs output token ratio: output tokens are 3–5× more expensive than input on most APIs
  • Cost per hour: monitor for runaway spending; alert on anomalies
  • Cache hit rate: prompt caching can reduce cost 50–90% for repeated prefixes

Infrastructure Optimisations

Prompt caching

Cache the static prefix of your system prompt. Anthropic and OpenAI both offer prompt caching: repeated system-prompt tokens are charged at 10–20% of the normal input price. For applications with long system prompts, this alone can reduce cost by 50–80%.
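A back-of-envelope estimate of the savings, assuming cached prefix tokens are billed at 10% of the normal input price. The discount and the $3/1M price are illustrative; check your provider's current rates:

```python
# Estimate input-cost savings from prompt caching, assuming the cached
# prefix is billed at a fraction of the normal input price.

def input_cost(prefix_tokens: int, variable_tokens: int, price_per_1m: float,
               cached: bool = False, cache_discount: float = 0.10) -> float:
    """USD input cost for one request; only the prefix can be cached."""
    prefix_rate = price_per_1m * (cache_discount if cached else 1.0)
    return (prefix_tokens * prefix_rate + variable_tokens * price_per_1m) / 1e6

# 8k-token system prompt, 500-token user turn, $3 per 1M input tokens.
uncached = input_cost(8000, 500, 3.00)
cached = input_cost(8000, 500, 3.00, cached=True)
savings = 1 - cached / uncached  # fraction of input cost saved
```

With these assumed numbers the saving is roughly 85%, which is why long static prefixes are the best caching candidates.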

Quantisation (local inference)

4-bit quantisation reduces the GPU memory requirement by ~4× and typically improves decode throughput (which is memory-bandwidth bound), with minimal quality loss for most tasks. At INT4, a 7B model fits in 4–6 GB VRAM; a 13B in 8 GB. Use GGUF (llama.cpp) or GPTQ/AWQ for quantised models.
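The memory figures can be sanity-checked with simple arithmetic: weight memory is roughly parameters × bits ÷ 8. The 1.2 overhead factor below is an assumption covering KV cache and activations; real usage depends on context length and runtime:

```python
# Rough VRAM estimate for a quantised model: weight bytes plus an
# assumed overhead factor for KV cache and activations.

def vram_gb(params_billion: float, bits: int, overhead: float = 1.2) -> float:
    """Approximate GB of VRAM needed to serve the model."""
    weight_bytes = params_billion * 1e9 * bits / 8
    return weight_bytes * overhead / 1e9

int4_7b = vram_gb(7, 4)    # INT4: within the 4-6 GB range cited above
fp16_7b = vram_gb(7, 16)   # FP16 baseline, ~4x larger
```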

Batching (high-throughput workloads)

Batch multiple requests through the model simultaneously. Continuous batching (vLLM, TGI) maximises GPU utilisation: instead of processing requests sequentially, the model handles many in parallel. Trade-off: added per-request latency (batching delay). Not appropriate for real-time chat.
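A toy model of the trade-off. All numbers here are illustrative assumptions, not benchmarks: larger batches raise aggregate throughput (sub-linearly, as the GPU saturates) but add queueing delay for individual requests:

```python
# Toy model of the batching trade-off: aggregate throughput vs the
# extra per-request delay from waiting for a batch to fill.

def batch_stats(batch_size: int,
                tokens_per_sec_single: float = 40.0,
                scaling: float = 0.9,
                arrival_interval_s: float = 0.05):
    """Return (aggregate tokens/sec, worst-case batching delay in s)."""
    # Assumed sub-linear throughput scaling with batch size.
    throughput = tokens_per_sec_single * batch_size * scaling
    # Worst case: the first request waits for the whole batch to fill.
    max_batching_delay = (batch_size - 1) * arrival_interval_s
    return throughput, max_batching_delay
```

Under these assumptions a batch of 32 delivers far higher aggregate throughput than single-request serving, at the price of over a second of worst-case queueing delay, which is exactly why this mode suits batch jobs rather than interactive chat.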

Speculative decoding

A small draft model generates candidate tokens; the large model validates them in parallel. Delivers a 2–4× speedup with identical output quality. Requires a paired draft model; supported natively in vLLM and TGI. Best for latency-sensitive applications using large models.
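The speedup can be reasoned about with the standard analysis from the speculative decoding paper (Leviathan et al., 2023): with draft length gamma and per-token acceptance rate alpha (modelled as independent), the large model emits (1 − alpha^(gamma+1)) / (1 − alpha) tokens per forward pass. A quick sketch:

```python
# Expected tokens emitted per large-model forward pass in speculative
# decoding, under the independence assumption of Leviathan et al. (2023).

def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """alpha: per-token acceptance rate; gamma: draft length."""
    if alpha == 1.0:
        return gamma + 1.0  # every draft token accepted, plus one bonus token
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

# e.g. alpha=0.8, gamma=4: ~3.4 tokens per large-model pass, i.e. a
# roughly 3x reduction in large-model forward passes before overheads.
```

Actual wall-clock speedup is lower than this ratio because the draft model's own compute is not free.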

Output length control

Output tokens dominate both cost and latency. Constrain max_tokens tightly. Use structured output (JSON) instead of prose where applicable β€” structured output is typically shorter. Instruct the model explicitly: "Respond in 3 sentences or fewer."
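A quick illustration of why output length is the lever to pull, assuming output priced at 4× input (the prices below are placeholders, not a real rate card):

```python
# Why trimming output beats trimming input: output tokens carry a
# higher price. Prices (USD per 1M tokens) are illustrative.

IN_PRICE, OUT_PRICE = 2.50, 10.00

def request_cost(in_tok: int, out_tok: int) -> float:
    """USD cost of one request at the assumed rates."""
    return (in_tok * IN_PRICE + out_tok * OUT_PRICE) / 1e6

verbose = request_cost(1000, 800)   # unconstrained prose answer
concise = request_cost(1000, 150)   # same prompt, tight max_tokens + JSON
```

With these assumptions the concise response costs well under half as much, with the prompt unchanged.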

Running a Cost Analysis

Cost per task calculation:

  • Measure: average input tokens per task, average output tokens per task
  • Apply provider pricing: (input_tokens × input_price) + (output_tokens × output_price)
  • Add retrieval cost if RAG: embedding API calls + vector DB query cost
  • Add tool call cost if agents: each LLM call in the agent loop is charged separately
  • Annualise at your expected volume: cost_per_task × tasks_per_day × 365
  • Compare to local inference TCO: hardware amortisation + energy + ops overhead
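The steps above as a worked calculation. All prices, token counts, and volumes are placeholders; substitute your provider's current rates and your measured averages:

```python
# Worked cost-per-task calculation: LLM tokens, plus optional retrieval
# and agent tool-call costs, annualised at an assumed volume.

def cost_per_task(in_tok: int, out_tok: int,
                  in_price_1m: float, out_price_1m: float,
                  retrieval_cost: float = 0.0,
                  tool_calls: int = 0,
                  cost_per_tool_call: float = 0.0) -> float:
    """USD to complete one end-user task (prices are per 1M tokens)."""
    llm = (in_tok * in_price_1m + out_tok * out_price_1m) / 1e6
    return llm + retrieval_cost + tool_calls * cost_per_tool_call

# Assumed averages: 1500 input / 400 output tokens, small RAG cost.
per_task = cost_per_task(1500, 400, 2.50, 10.00, retrieval_cost=0.0002)
annual = per_task * 20_000 * 365  # assumed 20k tasks/day
```

An annual figure like this is what you compare against local inference TCO (hardware amortisation, energy, and ops overhead).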

Use artificialanalysis.ai for up-to-date price/performance benchmarks across providers and models.

Checklist: Do You Understand This?

  • What is the latency-throughput trade-off in LLM serving, and what deployment mode suits interactive chat vs batch processing?
  • What is the rough cost multiplier between the efficient tier (e.g., Haiku) and the frontier tier (e.g., o3)?
  • How does intelligent model routing work, and what constraint governs the routing classifier's cost?
  • Why do output tokens cost more than input tokens, and what is the practical implication for system design?
  • How does prompt caching work, and what is the typical cost reduction for applications with long system prompts?
  • What is speculative decoding and what speedup does it deliver?