🧠 All Things AI
Intermediate

Cost & Performance Trade-offs

Not every query needs the most powerful model. Using GPT-4o for every request when GPT-4o-mini would suffice multiplies cost 10–20× with no quality gain. Making smart cost-performance decisions requires understanding the latency-throughput-cost triangle, the model tier landscape, and the infrastructure optimisations that reduce per-token cost. This page gives you the framework for these decisions.

The Latency–Throughput–Cost Triangle

Latency and throughput are inversely related in LLM serving: optimising for one degrades the other. Cost scales with both. Understanding this triangle is a prerequisite for making good deployment decisions.

Mode | Optimises for | Trade-off | Best for
Low latency | Fast TTFT, fast response for each request | Lower throughput, higher cost/token | Real-time chat, voice AI, interactive agents
High throughput | Maximum tokens/second across all requests | Higher latency per request (batching delay) | Batch processing, nightly jobs, data enrichment
Cost optimised | Minimum $ per output token | Often a slower model, higher latency | High-volume classification, extraction, summarisation

Model Tiers in 2025

Every major provider now offers a model family spanning capability tiers. The cost difference between tiers is typically 10–30×. Matching tier to task is the single highest-impact cost optimisation.

Tier | Examples | Relative cost | Best for
Frontier / Reasoning | o3, Claude Opus, Gemini Ultra | 100× | Complex reasoning, code generation, novel research
Flagship | GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro | 10–20× | Production chat, complex RAG, agent orchestration
Efficient | GPT-4o-mini, Claude Haiku 4.5, Gemini Flash | 1–3× | High-volume classification, extraction, simple Q&A
Local / Open | Llama 3.1 8B, Mistral 7B, Phi-3 | Fixed hardware | Privacy-sensitive, offline, high-volume at scale

Intelligent Model Routing

Rather than using one model for everything, route each request to the cheapest model that can handle it. A classifier (fast and cheap) decides the tier; the chosen model processes the request. This is the most impactful architecture pattern for cost reduction in high-volume systems.

Routing pattern:

  1. Fast classifier (Haiku / GPT-4o-mini, <100ms) evaluates query complexity
  2. Simple queries (FAQ, extraction, classification) → efficient model
  3. Medium queries (multi-step reasoning, nuanced writing) → flagship model
  4. Complex queries (novel research, complex code, multi-document synthesis) → frontier model
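The routing pattern above can be sketched in a few lines. Everything here is an illustrative placeholder: the model names and prices are invented, and the keyword heuristic stands in for what would, in production, be a fast LLM classifier call.

```python
# Hypothetical model routing sketch: a cheap classifier picks the tier,
# then the request goes to the cheapest model that can handle it.
# Model names and per-token prices are placeholders, not real quotes.

TIERS = {
    "simple":  {"model": "efficient-model", "usd_per_1m_out": 0.60},
    "medium":  {"model": "flagship-model",  "usd_per_1m_out": 10.00},
    "complex": {"model": "frontier-model",  "usd_per_1m_out": 60.00},
}

def classify(query: str) -> str:
    """Stand-in for the fast classifier (<100 ms in production).
    Here: a crude keyword heuristic purely for illustration."""
    q = query.lower()
    if any(k in q for k in ("prove", "design", "research", "synthesise")):
        return "complex"
    if any(k in q for k in ("explain", "compare", "draft", "why")):
        return "medium"
    return "simple"

def route(query: str) -> str:
    """Return the cheapest model judged able to handle the query."""
    return TIERS[classify(query)]["model"]
```

In a real system the classifier would be a model call with its own latency and cost budget, which is exactly why the next constraint matters.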

Routing classifiers themselves must be fast and cheap: their cost and latency must not exceed the savings from routing. Target: <5% of total request cost.

Routing validation:

Measure quality on each tier separately; routing to a cheaper model is only correct if quality meets the bar. A/B test the routing logic against all-frontier as a baseline before enabling it in production.
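One way to express the two validation criteria as a single check. The thresholds below (1 point of quality drop, 5% classifier cost share) are assumptions you would tune to your own quality bar:

```python
# Hypothetical routing validation check: compare the routed configuration
# against an all-frontier baseline on quality, and confirm the classifier
# stays within its cost budget. Threshold values are assumptions.

def routing_is_acceptable(routed_quality: float,
                          baseline_quality: float,
                          classifier_cost: float,
                          total_request_cost: float,
                          max_quality_drop: float = 0.01,
                          max_classifier_share: float = 0.05) -> bool:
    # Routing is only correct if quality stays within the bar...
    quality_ok = (baseline_quality - routed_quality) <= max_quality_drop
    # ...AND the classifier's own cost stays a small fraction of the total.
    cost_ok = classifier_cost / total_request_cost <= max_classifier_share
    return quality_ok and cost_ok
```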

Key Cost & Performance Metrics

Performance metrics

  • TTFT: time-to-first-token, i.e. user-perceived responsiveness
  • TBT: time between tokens; determines streaming smoothness
  • Tokens/sec: generation throughput per model instance
  • P95/P99 latency: tail latency, i.e. what slow users experience
  • Context processing speed: how fast the model reads input tokens
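A minimal sketch of how TTFT, TBT, and tokens/sec can be derived from the arrival timestamps of one streamed response (the timestamps in the test are fabricated for illustration):

```python
# Derive streaming performance metrics from token-arrival timestamps.

def streaming_metrics(request_sent_at: float, token_times: list[float]):
    """Return (TTFT, mean TBT, tokens/sec) for one streamed response."""
    # TTFT: delay until the first token arrives.
    ttft = token_times[0] - request_sent_at
    # TBT: mean gap between consecutive tokens (streaming smoothness).
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    tbt = sum(gaps) / len(gaps) if gaps else 0.0
    # Throughput over the generation phase only (excludes TTFT).
    tokens_per_sec = len(gaps) / (token_times[-1] - token_times[0])
    return ttft, tbt, tokens_per_sec
```

In production you would aggregate these per request and report P95/P99 over the distribution, not just means.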

Cost metrics

  • Cost per task: USD to complete one end-user action (your unit economics metric)
  • Input vs output token ratio: output tokens are 3–5× more expensive than input on most APIs
  • Cost per hour: monitor for runaway spending; alert on anomalies
  • Cache hit rate: prompt caching can reduce cost 50–90% for repeated prefixes

Infrastructure Optimisations

Prompt caching

Cache the static prefix of your system prompt. Anthropic and OpenAI both offer prompt caching: repeated system-prompt tokens are charged at 10–20% of the normal input price. For applications with long system prompts, this alone can reduce cost by 50–80%.
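A back-of-envelope estimate of the savings, assuming cached prefix tokens are billed at 10% of the normal input price. The discount and the $3/1M price are illustrative; check your provider's current rates:

```python
# Estimate input-cost savings from prompt caching, assuming the cached
# prefix is billed at a fraction of the normal input price.

def input_cost(prefix_tokens: int, variable_tokens: int, price_per_1m: float,
               cached: bool = False, cache_discount: float = 0.10) -> float:
    """USD input cost for one request; only the prefix can be cached."""
    prefix_rate = price_per_1m * (cache_discount if cached else 1.0)
    return (prefix_tokens * prefix_rate + variable_tokens * price_per_1m) / 1e6

# 8k-token system prompt, 500-token user turn, $3 per 1M input tokens.
uncached = input_cost(8000, 500, 3.00)
cached = input_cost(8000, 500, 3.00, cached=True)
savings = 1 - cached / uncached  # fraction of input cost saved
```

With these assumed numbers the saving is roughly 85%, which is why long static prefixes are the best caching candidates.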

Quantisation (local inference)

4-bit quantisation reduces the GPU memory requirement by ~4× and typically improves decode throughput (which is memory-bandwidth bound), with minimal quality loss for most tasks. At INT4, a 7B model fits in 4–6 GB VRAM; a 13B in 8 GB. Use GGUF (llama.cpp) or GPTQ/AWQ for quantised models.
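The memory figures can be sanity-checked with simple arithmetic: weight memory is roughly parameters × bits ÷ 8. The 1.2 overhead factor below is an assumption covering KV cache and activations; real usage depends on context length and runtime:

```python
# Rough VRAM estimate for a quantised model: weight bytes plus an
# assumed overhead factor for KV cache and activations.

def vram_gb(params_billion: float, bits: int, overhead: float = 1.2) -> float:
    """Approximate GB of VRAM needed to serve the model."""
    weight_bytes = params_billion * 1e9 * bits / 8
    return weight_bytes * overhead / 1e9

int4_7b = vram_gb(7, 4)    # INT4: within the 4-6 GB range cited above
fp16_7b = vram_gb(7, 16)   # FP16 baseline, ~4x larger
```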

Batching (high-throughput workloads)

Batch multiple requests through the model simultaneously. Continuous batching (vLLM, TGI) maximises GPU utilisation: instead of processing requests sequentially, the model handles many in parallel. Trade-off: added per-request latency (batching delay). Not appropriate for real-time chat.
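A toy model of the trade-off. All numbers here are illustrative assumptions, not benchmarks: larger batches raise aggregate throughput (sub-linearly, as the GPU saturates) but add queueing delay for individual requests:

```python
# Toy model of the batching trade-off: aggregate throughput vs the
# extra per-request delay from waiting for a batch to fill.

def batch_stats(batch_size: int,
                tokens_per_sec_single: float = 40.0,
                scaling: float = 0.9,
                arrival_interval_s: float = 0.05):
    """Return (aggregate tokens/sec, worst-case batching delay in s)."""
    # Assumed sub-linear throughput scaling with batch size.
    throughput = tokens_per_sec_single * batch_size * scaling
    # Worst case: the first request waits for the whole batch to fill.
    max_batching_delay = (batch_size - 1) * arrival_interval_s
    return throughput, max_batching_delay
```

Under these assumptions a batch of 32 delivers far higher aggregate throughput than single-request serving, at the price of over a second of worst-case queueing delay, which is exactly why this mode suits batch jobs rather than interactive chat.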

Speculative decoding

A small draft model generates candidate tokens; the large model validates them in parallel. Delivers a 2–4× speedup with identical output quality. Requires a paired draft model; supported natively in vLLM and TGI. Best for latency-sensitive applications using large models.
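The speedup can be reasoned about with the standard analysis from the speculative decoding paper (Leviathan et al., 2023): with draft length gamma and per-token acceptance rate alpha (modelled as independent), the large model emits (1 − alpha^(gamma+1)) / (1 − alpha) tokens per forward pass. A quick sketch:

```python
# Expected tokens emitted per large-model forward pass in speculative
# decoding, under the independence assumption of Leviathan et al. (2023).

def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """alpha: per-token acceptance rate; gamma: draft length."""
    if alpha == 1.0:
        return gamma + 1.0  # every draft token accepted, plus one bonus token
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

# e.g. alpha=0.8, gamma=4: ~3.4 tokens per large-model pass, i.e. a
# roughly 3x reduction in large-model forward passes before overheads.
```

Actual wall-clock speedup is lower than this ratio because the draft model's own compute is not free.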

Output length control

Output tokens dominate both cost and latency. Constrain max_tokens tightly. Use structured output (JSON) instead of prose where applicable β€” structured output is typically shorter. Instruct the model explicitly: "Respond in 3 sentences or fewer."
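A quick illustration of why output length is the lever to pull, assuming output priced at 4× input (the prices below are placeholders, not a real rate card):

```python
# Why trimming output beats trimming input: output tokens carry a
# higher price. Prices (USD per 1M tokens) are illustrative.

IN_PRICE, OUT_PRICE = 2.50, 10.00

def request_cost(in_tok: int, out_tok: int) -> float:
    """USD cost of one request at the assumed rates."""
    return (in_tok * IN_PRICE + out_tok * OUT_PRICE) / 1e6

verbose = request_cost(1000, 800)   # unconstrained prose answer
concise = request_cost(1000, 150)   # same prompt, tight max_tokens + JSON
```

With these assumptions the concise response costs well under half as much, with the prompt unchanged.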

Running a Cost Analysis

Cost per task calculation:

  • Measure: average input tokens per task, average output tokens per task
  • Apply provider pricing: (input_tokens × input_price) + (output_tokens × output_price)
  • Add retrieval cost if RAG: embedding API calls + vector DB query cost
  • Add tool call cost if agents: each LLM call in the agent loop is charged separately
  • Annualise at your expected volume: cost_per_task × tasks_per_day × 365
  • Compare to local inference TCO: hardware amortisation + energy + ops overhead
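The steps above as a worked calculation. All prices, token counts, and volumes are placeholders; substitute your provider's current rates and your measured averages:

```python
# Worked cost-per-task calculation: LLM tokens, plus optional retrieval
# and agent tool-call costs, annualised at an assumed volume.

def cost_per_task(in_tok: int, out_tok: int,
                  in_price_1m: float, out_price_1m: float,
                  retrieval_cost: float = 0.0,
                  tool_calls: int = 0,
                  cost_per_tool_call: float = 0.0) -> float:
    """USD to complete one end-user task (prices are per 1M tokens)."""
    llm = (in_tok * in_price_1m + out_tok * out_price_1m) / 1e6
    return llm + retrieval_cost + tool_calls * cost_per_tool_call

# Assumed averages: 1500 input / 400 output tokens, small RAG cost.
per_task = cost_per_task(1500, 400, 2.50, 10.00, retrieval_cost=0.0002)
annual = per_task * 20_000 * 365  # assumed 20k tasks/day
```

An annual figure like this is what you compare against local inference TCO (hardware amortisation, energy, and ops overhead).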

Use artificialanalysis.ai for up-to-date price/performance benchmarks across providers and models.

Checklist: Do You Understand This?

  • What is the latency-throughput trade-off in LLM serving, and what deployment mode suits interactive chat vs batch processing?
  • What is the rough cost multiplier between the efficient tier (e.g., Haiku) and the frontier tier (e.g., o3)?
  • How does intelligent model routing work, and what constraint governs the routing classifier's cost?
  • Why do output tokens cost more than input tokens, and what is the practical implication for system design?
  • How does prompt caching work, and what is the typical cost reduction for applications with long system prompts?
  • What is speculative decoding and what speedup does it deliver?