Beginner

Understanding Token Costs

Every API call to an LLM is billed by token count. Understanding exactly what gets billed — and how the pricing levers (caching, batching, model choice) interact — is the foundation of AI cost management. The headline price per million tokens is just the starting point.

Input vs Output Pricing

All major providers split pricing into input tokens (what you send) and output tokens (what the model generates back). Output tokens are consistently more expensive — typically 3× to 5× the input price.

Model	Input / 1M	Output / 1M	Output multiplier
Gemini 2.5 Flash-Lite	$0.10	$0.40	4×
DeepSeek V3	$0.27	$1.10	4×
Gemini 2.5 Flash	$0.30	$2.50	8×
GPT-4.1 Mini	$0.40	$1.60	4×
Claude Haiku 4.5	$1.00	$5.00	5×
Gemini 2.5 Pro	$1.25	$10.00	8×
GPT-4o	$2.50	$10.00	4×
Claude Sonnet 4.6	$3.00	$15.00	5×
GPT-4.1	$5.00	$15.00	3×
Claude Opus 4.7	$5.00	$25.00	5×
o3 (reasoning)	$2.00	$8.00	4× (+ hidden reasoning tokens)

Prices per 1M tokens as of May 2026. Note: reasoning models (o3, o4-mini) also generate internal “thinking tokens” before the final answer. These are billed at input rates and can significantly increase total cost on complex tasks.

What Counts as Input

Every token sent in the API request body is billed as input:

System prompt — billed on every single request, even if identical each time
Chat history — all prior turns in the conversation are re-sent and re-billed
Retrieved context (RAG) — document chunks passed in as context
User message — the actual user input
Tool definitions — if you define tools/functions, their JSON schemas count as input tokens

A system prompt of 2,000 tokens on a model at $3/1M input costs $0.006 per request — or $6 per 1,000 requests, $6,000 per 1M requests. For high-volume systems, system prompt length is a significant cost driver.

Prompt Caching — 90% Off Repeated Input

All major providers now offer prompt caching: if the same prefix (system prompt, context, conversation history) appeared in a recent request, cached tokens are re-used at a massive discount.

Anthropic (Claude)

Cache read: 10% of standard input
Cache write: 1.25× input (5-min TTL) or 2.0× input (1-hr TTL)
Requires explicit cache_control breakpoints in the API call

OpenAI

Cache read: 10% of standard input
Automatic — no API changes needed
Activates for prompts with 1,024+ tokens

Google (Gemini)

Cache read: 10% of standard input
Explicit context caching API
Supports very large context windows efficiently

The key insight: a 2,000-token system prompt billed at $3/1M normally costs $0.006 per request. With caching at 10%, it costs $0.0006 per cache hit — a 90% reduction. For high-volume deployments with a stable system prompt, caching alone can cut costs by 40–80%.

Batch API — 50% Off Async Processing

Both Anthropic and OpenAI offer 50% discounts on all tokens when you use their batch APIs for asynchronous, non-real-time workloads:

Use batch when:

Processing a queue of documents
Nightly or scheduled data processing jobs
Generating embeddings or classifications at scale
Background summarization pipelines
Evals running against a test set

Batch is NOT suitable when:

A user is waiting for the response (interactive)
Your SLA requires sub-second latency
You need real-time streaming
Requests depend on each other's outputs

Typical batch processing time: <1 hour (24-hour maximum). Both providers support up to 100,000 requests per batch submission.

Stacking Discounts

Prompt caching and batch discounts apply independently — they can be combined:

Optimization applied	GPT-4o cost (input/1M)	Savings vs standard
Standard (no optimization)	$2.50	—
Batch only (50% off)	$1.25	50%
Cache hits only (90% off input)	$0.25	90%
Batch + cache hits combined	$0.125	95%

Estimating Costs Before You Build

Before committing to a model for a workload, estimate the cost:

Measure average input tokens per request (system prompt + context + user message)
Measure average output tokens per response
Estimate request volume per day/month
Calculate: (input_tokens × input_price + output_tokens × output_price) × request_count
Apply expected cache hit rate (typically 50–80% for stable system prompts)
Apply batch discount if applicable (50%)
Compare across 3–4 model tiers to see the range

Tip: run your typical prompts through a tokenizer (OpenAI Tokenizer, Claude's API response token count) to get accurate numbers before making model decisions.

Checklist: Do You Understand This?

Output tokens are 4–8× more expensive than input tokens — outputs drive cost more than inputs
Input billing includes: system prompt + chat history + retrieved context + tool definitions
Prompt caching: 90% off repeated input tokens — all major providers support it at 10% of standard rate
Batch API: 50% off everything for async workloads — use for document processing, evals, nightly jobs
Caching + batching stack: up to 95% cost reduction on the right workloads