Understanding Token Costs
Every API call to an LLM is billed by token count. Understanding exactly what gets billed, and how the pricing levers (caching, batching, model choice) interact, is the foundation of AI cost management. The headline price per million tokens is just the starting point.
Input vs Output Pricing
All major providers split pricing into input tokens (what you send) and output tokens (what the model generates back). Output tokens are consistently more expensive: typically 3× to 5× the input price, and up to about 8× for some models in the table below.
| Model | Input / 1M | Output / 1M | Output multiplier |
|---|---|---|---|
| Gemini 2.5 Flash-Lite | $0.10 | $0.40 | 4× |
| DeepSeek V3 | $0.27 | $1.10 | 4× |
| Gemini 2.5 Flash | $0.30 | $2.50 | 8× |
| GPT-4.1 Mini | $0.40 | $1.60 | 4× |
| Claude Haiku 4.5 | $1.00 | $5.00 | 5× |
| Gemini 2.5 Pro | $1.25 | $10.00 | 8× |
| GPT-4o | $2.50 | $10.00 | 4× |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 5× |
| GPT-4.1 | $5.00 | $15.00 | 3× |
| Claude Opus 4.7 | $5.00 | $25.00 | 5× |
| o3 (reasoning) | $2.00 | $8.00 | 4× (+ hidden reasoning tokens) |
Prices are per 1M tokens as of May 2026. Note: reasoning models (o3, o4-mini) also generate internal "thinking tokens" before the final answer. These are billed as output tokens, even though they are not shown in the response, and can significantly increase total cost on complex tasks.
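To make the input/output split concrete, here is a minimal sketch that computes the cost of a single request from per-1M prices. The function name `request_cost` and the token counts are illustrative assumptions; the prices come from the Claude Sonnet 4.6 row of the table.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Return the dollar cost of one request, given prices per 1M tokens."""
    input_cost = input_tokens / 1_000_000 * input_price_per_m
    output_cost = output_tokens / 1_000_000 * output_price_per_m
    return input_cost + output_cost


# Hypothetical request: 3,000 input tokens and 600 output tokens
# on a model priced at $3 (input) / $15 (output) per 1M tokens.
cost = request_cost(3_000, 600, 3.00, 15.00)
print(f"${cost:.4f}")  # $0.0180
```

In this example the output is only one sixth of the total tokens but accounts for half of the bill, which is why response length deserves as much attention as prompt length.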
What Counts as Input
Every token sent in the API request body is billed as input:
- System prompt: billed on every single request, even if identical each time
- Chat history: all prior turns in the conversation are re-sent and re-billed
- Retrieved context (RAG): document chunks passed in as context
- User message: the actual user input
- Tool definitions: if you define tools/functions, their JSON schemas count as input tokens
A system prompt of 2,000 tokens on a model at $3/1M input costs $0.006 per request, which is $6 per 1,000 requests or $6,000 per 1M requests. For high-volume systems, system prompt length is a significant cost driver.
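Because chat history is re-sent on every turn, billed input grows faster than the conversation itself. The sketch below illustrates this under simplifying assumptions: every turn has the same message and reply length, nothing is cached, and the helper name `billed_input_tokens` is made up for this example.

```python
def billed_input_tokens(system_tokens: int, user_tokens: int,
                        reply_tokens: int, turns: int) -> int:
    """Total input tokens billed over a conversation, without caching.

    Assumes each turn adds a user message of `user_tokens` and a model
    reply of `reply_tokens`, and that the full history is re-sent each turn.
    """
    total = 0
    history = 0  # tokens from earlier turns that must be re-sent
    for _ in range(turns):
        total += system_tokens + history + user_tokens
        history += user_tokens + reply_tokens
    return total


# Hypothetical chat: 2,000-token system prompt, 100-token user messages,
# 300-token replies, 10 turns, on a model priced at $3 per 1M input tokens.
tokens = billed_input_tokens(2_000, 100, 300, 10)
cost = tokens / 1_000_000 * 3.00
print(tokens, f"${cost:.3f}")  # 39000 $0.117
```

Of those 39,000 billed tokens, only 1,000 are new user input. The rest is the repeated system prompt (20,000) and re-sent history (18,000).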
Prompt Caching: 90% Off Repeated Input
All major providers now offer prompt caching: if a request begins with the same prefix (system prompt, context, conversation history) as a recent request, the matching tokens are read from cache and billed at a steep discount.
Anthropic (Claude)
- Cache read: 10% of standard input
- Cache write: 1.25× input (5-min TTL) or 2.0× input (1-hr TTL)
- Requires explicit cache_control breakpoints in the API call
OpenAI
- Cache read: 10% of standard input
- Automatic: no API changes needed
- Activates for prompts with 1,024+ tokens
Google (Gemini)
- Cache read: 10% of standard input
- Explicit context caching API
- Supports very large context windows efficiently
The key insight: a 2,000-token system prompt billed at $3/1M normally costs $0.006 per request. With caching at 10%, it costs $0.0006 per cache hit, a 90% reduction. For high-volume deployments with a stable system prompt, caching alone can cut costs by 40 to 80%.
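The realized saving depends on how often requests actually hit the cache and on any write premium paid on misses. The following sketch estimates the cost of a repeated prefix under those two factors. It assumes, as a simplification, that every miss is billed at the cache-write rate; `cached_prefix_cost` and its parameters are hypothetical names, not part of any provider SDK.

```python
def cached_prefix_cost(prefix_tokens: int, price_per_m: float, requests: int,
                       hit_rate: float, read_multiplier: float = 0.10,
                       write_multiplier: float = 1.0) -> float:
    """Dollar cost of sending the same prefix `requests` times.

    Hits are billed at `read_multiplier` x the input price; misses are
    billed at `write_multiplier` x the input price (simplifying assumption).
    """
    per_request = prefix_tokens / 1_000_000 * price_per_m
    hits = requests * hit_rate
    misses = requests - hits
    return hits * per_request * read_multiplier + misses * per_request * write_multiplier


# 2,000-token system prompt at $3/1M over 1,000 requests.
baseline = cached_prefix_cost(2_000, 3.00, 1_000, hit_rate=0.0)
# Assumed 80% hit rate with a 1.25x write premium on misses (5-min TTL pricing).
with_cache = cached_prefix_cost(2_000, 3.00, 1_000, hit_rate=0.8,
                                write_multiplier=1.25)
print(f"${with_cache:.2f} vs ${baseline:.2f}")  # $1.98 vs $6.00
```

With these assumed numbers the prefix cost falls by about 67%, which shows why the per-hit discount of 90% overstates the saving whenever the hit rate is below 100%.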
Batch API: 50% Off Async Processing
Both Anthropic and OpenAI offer 50% discounts on all tokens when you use their batch APIs for asynchronous, non-real-time workloads.
Use batch when:
- Processing a queue of documents
- Nightly or scheduled data processing jobs
- Generating embeddings or classifications at scale
- Background summarization pipelines
- Evals running against a test set
Batch is NOT suitable when:
- A user is waiting for the response (interactive)
- Your SLA requires sub-second latency
- You need real-time streaming
- Requests depend on each other's outputs
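Typical batch processing time: under 1 hour (24-hour maximum). Anthropic supports up to 100,000 requests per batch submission, while OpenAI's documented limit is 50,000 requests per batch file, so check each provider's current limits before sizing submissions.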
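When a queue is larger than the per-batch limit, it has to be split on the client side before submission. Below is a minimal sketch of that split. It makes no provider API calls; `chunk_requests` is a hypothetical helper, and its default of 100,000 simply mirrors the figure quoted above, so pass the limit that applies to your provider.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def chunk_requests(requests: Sequence[T],
                   max_per_batch: int = 100_000) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `requests`, each at most `max_per_batch` long."""
    if max_per_batch < 1:
        raise ValueError("max_per_batch must be at least 1")
    for start in range(0, len(requests), max_per_batch):
        yield requests[start:start + max_per_batch]


# Hypothetical queue of 250,000 document IDs awaiting summarization.
queue = [f"doc-{i}" for i in range(250_000)]
batch_sizes = [len(batch) for batch in chunk_requests(queue)]
print(batch_sizes)  # [100000, 100000, 50000]
```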
Stacking Discounts
Prompt caching and batch discounts apply independently, so they can be combined:
| Optimization applied | GPT-4o cost (input/1M) | Savings vs standard |
|---|---|---|
| Standard (no optimization) | $2.50 | n/a |
| Batch only (50% off) | $1.25 | 50% |
| Cache hits only (90% off input) | $0.25 | 90% |
| Batch + cache hits combined | $0.125 | 95% |
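The table follows from multiplying the discounts together. Here is a small sketch that reproduces it, assuming the two discounts stack multiplicatively as described above; `effective_input_price` is an illustrative name, and the multipliers are the 50% and 10% figures from this section.

```python
def effective_input_price(base_price_per_m: float, batch: bool = False,
                          cache_hit: bool = False,
                          batch_multiplier: float = 0.5,
                          cache_read_multiplier: float = 0.10) -> float:
    """Input price per 1M tokens after applying the selected discounts."""
    price = base_price_per_m
    if batch:
        price *= batch_multiplier
    if cache_hit:
        price *= cache_read_multiplier
    return price


base = 2.50  # GPT-4o input price per 1M tokens, from the table
for batch, cache_hit in [(False, False), (True, False), (False, True), (True, True)]:
    price = effective_input_price(base, batch, cache_hit)
    saving = 1 - price / base
    print(f"batch={batch!s:5} cache_hit={cache_hit!s:5} ${price:.3f} ({saving:.0%} saved)")
```

The combined figure applies only to input tokens that are cache hits inside a batch job; output tokens in the same job receive the batch discount alone.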
Estimating Costs Before You Build
Before committing to a model for a workload, estimate the cost:
- Measure average input tokens per request (system prompt + context + user message)
- Measure average output tokens per response
- Estimate request volume per day/month
- Calculate: (input_tokens × input_price + output_tokens × output_price) × request_count, with prices expressed per token (the per-1M price divided by 1,000,000)
- Apply expected cache hit rate (typically 50 to 80% for stable system prompts) to the cacheable part of the input
- Apply batch discount if applicable (50%)
- Compare across 3 to 4 model tiers to see the range
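The steps above can be sketched as a single estimator. This is a rough model under stated assumptions: the input is split into a stable, cacheable prefix and a dynamic remainder, cache-write premiums are ignored, and the traffic figures are invented for illustration. The function name `estimate_monthly_cost` is hypothetical; the prices are taken from the table earlier in this article.

```python
def estimate_monthly_cost(prefix_tokens: int, dynamic_input_tokens: int,
                          output_tokens: int, requests: int,
                          input_price_per_m: float, output_price_per_m: float,
                          cache_hit_rate: float = 0.0,
                          cache_read_multiplier: float = 0.10,
                          batch: bool = False,
                          batch_multiplier: float = 0.5) -> float:
    """Estimated dollar cost for `requests` requests with average token counts."""
    # Blend cached and uncached pricing for the stable prefix.
    prefix_multiplier = cache_hit_rate * cache_read_multiplier + (1 - cache_hit_rate)
    billed_input = prefix_tokens * prefix_multiplier + dynamic_input_tokens
    per_request = (billed_input / 1_000_000 * input_price_per_m
                   + output_tokens / 1_000_000 * output_price_per_m)
    if batch:
        per_request *= batch_multiplier
    return per_request * requests


# Assumed workload: 2,000-token system prompt, 500 tokens of context and
# user input, 400 output tokens, 1M requests per month, 70% cache hit rate.
tiers = {
    "Gemini 2.5 Flash-Lite": (0.10, 0.40),
    "GPT-4.1 Mini": (0.40, 1.60),
    "Claude Sonnet 4.6": (3.00, 15.00),
}
estimates = {
    name: estimate_monthly_cost(2_000, 500, 400, 1_000_000, inp, out,
                                cache_hit_rate=0.7)
    for name, (inp, out) in tiers.items()
}
for name, dollars in estimates.items():
    print(f"{name}: ${dollars:,.0f} per month")
```

Running the same workload through several tiers makes the spread visible before any code is written against a specific model.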
Tip: run your typical prompts through a tokenizer (the OpenAI Tokenizer, or the token counts the Claude API reports in each response's usage field) to get accurate numbers before making model decisions.
Checklist: Do You Understand This?
- Output tokens are 3× to 8× more expensive than input tokens, so outputs can drive cost more than inputs
- Input billing includes: system prompt + chat history + retrieved context + user message + tool definitions
- Prompt caching: 90% off repeated input tokens; all major providers support it at 10% of standard rate
- Batch API: 50% off everything for async workloads; use for document processing, evals, nightly jobs
- Caching + batching stack: up to 95% cost reduction on the right workloads