Beginner

Understanding Token Costs

Every API call to an LLM is billed by token count. Understanding exactly what gets billed β€” and how the pricing levers (caching, batching, model choice) interact β€” is the foundation of AI cost management. The headline price per million tokens is just the starting point.

Input vs Output Pricing

All major providers split pricing into input tokens (what you send) and output tokens (what the model generates back). Output tokens are consistently more expensive β€” typically 3Γ— to 5Γ— the input price.

ModelInput / 1MOutput / 1MOutput multiplier
Gemini 2.5 Flash-Lite$0.10$0.404Γ—
DeepSeek V3$0.27$1.104Γ—
Gemini 2.5 Flash$0.30$2.508Γ—
GPT-4.1 Mini$0.40$1.604Γ—
Claude Haiku 4.5$1.00$5.005Γ—
Gemini 2.5 Pro$1.25$10.008Γ—
GPT-4o$2.50$10.004Γ—
Claude Sonnet 4.6$3.00$15.005Γ—
GPT-4.1$5.00$15.003Γ—
Claude Opus 4.7$5.00$25.005Γ—
o3 (reasoning)$2.00$8.004Γ— (+ hidden reasoning tokens)

Prices per 1M tokens as of May 2026. Note: reasoning models (o3, o4-mini) also generate internal β€œthinking tokens” before the final answer. These are billed at input rates and can significantly increase total cost on complex tasks.

What Counts as Input

Every token sent in the API request body is billed as input:

  • System prompt β€” billed on every single request, even if identical each time
  • Chat history β€” all prior turns in the conversation are re-sent and re-billed
  • Retrieved context (RAG) β€” document chunks passed in as context
  • User message β€” the actual user input
  • Tool definitions β€” if you define tools/functions, their JSON schemas count as input tokens

A system prompt of 2,000 tokens on a model at $3/1M input costs $0.006 per request β€” or $6 per 1,000 requests, $6,000 per 1M requests. For high-volume systems, system prompt length is a significant cost driver.

Prompt Caching β€” 90% Off Repeated Input

All major providers now offer prompt caching: if the same prefix (system prompt, context, conversation history) appeared in a recent request, cached tokens are re-used at a massive discount.

Anthropic (Claude)

  • Cache read: 10% of standard input
  • Cache write: 1.25Γ— input (5-min TTL) or 2.0Γ— input (1-hr TTL)
  • Requires explicit cache_control breakpoints in the API call

OpenAI

  • Cache read: 10% of standard input
  • Automatic β€” no API changes needed
  • Activates for prompts with 1,024+ tokens

Google (Gemini)

  • Cache read: 10% of standard input
  • Explicit context caching API
  • Supports very large context windows efficiently

The key insight: a 2,000-token system prompt billed at $3/1M normally costs $0.006 per request. With caching at 10%, it costs $0.0006 per cache hit β€” a 90% reduction. For high-volume deployments with a stable system prompt, caching alone can cut costs by 40–80%.

Batch API β€” 50% Off Async Processing

Both Anthropic and OpenAI offer 50% discounts on all tokens when you use their batch APIs for asynchronous, non-real-time workloads:

Use batch when:

  • Processing a queue of documents
  • Nightly or scheduled data processing jobs
  • Generating embeddings or classifications at scale
  • Background summarization pipelines
  • Evals running against a test set

Batch is NOT suitable when:

  • A user is waiting for the response (interactive)
  • Your SLA requires sub-second latency
  • You need real-time streaming
  • Requests depend on each other's outputs

Typical batch processing time: <1 hour (24-hour maximum). Both providers support up to 100,000 requests per batch submission.

Stacking Discounts

Prompt caching and batch discounts apply independently β€” they can be combined:

Optimization appliedGPT-4o cost (input/1M)Savings vs standard
Standard (no optimization)$2.50β€”
Batch only (50% off)$1.2550%
Cache hits only (90% off input)$0.2590%
Batch + cache hits combined$0.12595%

Estimating Costs Before You Build

Before committing to a model for a workload, estimate the cost:

  1. Measure average input tokens per request (system prompt + context + user message)
  2. Measure average output tokens per response
  3. Estimate request volume per day/month
  4. Calculate: (input_tokens Γ— input_price + output_tokens Γ— output_price) Γ— request_count
  5. Apply expected cache hit rate (typically 50–80% for stable system prompts)
  6. Apply batch discount if applicable (50%)
  7. Compare across 3–4 model tiers to see the range

Tip: run your typical prompts through a tokenizer (OpenAI Tokenizer, Claude's API response token count) to get accurate numbers before making model decisions.

Checklist: Do You Understand This?

  • Output tokens are 4–8Γ— more expensive than input tokens β€” outputs drive cost more than inputs
  • Input billing includes: system prompt + chat history + retrieved context + tool definitions
  • Prompt caching: 90% off repeated input tokens β€” all major providers support it at 10% of standard rate
  • Batch API: 50% off everything for async workloads β€” use for document processing, evals, nightly jobs
  • Caching + batching stack: up to 95% cost reduction on the right workloads

Page built: 01 Jun 2026