Token & Cost Drivers for AI Systems
AI cost management starts with understanding the cost equation and what drives each component. Unlike traditional infrastructure where you pay for compute time, LLM APIs charge per token — and the number of tokens is a function of your prompt design, context management strategy, and output verbosity. Cost visibility is a prerequisite for cost control.
The Cost Equation
Request cost = (input_tokens × input_price_per_M) + (output_tokens × output_price_per_M)
Example — Claude Sonnet 4.6 (approximate 2025 pricing):
input_tokens = 5,000 (system prompt 2K + conversation history 1K + user message 2K)
output_tokens = 500
cost = (5,000 × $3/1M) + (500 × $15/1M)
= $0.015 + $0.0075 = $0.0225 per request
At 10,000 requests/day: $225/day = ~$6,750/month
# With prompt caching (90% of system prompt cached):
# cached tokens = 1,800 (system prompt) × $0.30/1M = $0.00054
# uncached input = 3,200 × $3/1M = $0.0096
# output = 500 × $15/1M = $0.0075
# cost = $0.0177 per request (21% saving)
Input Cost Drivers
| Driver | Why it inflates cost | Mitigation |
|---|---|---|
| Long system prompts | Repeated on every request regardless of user message length; 2K token system prompt × 1M requests = 2B tokens | Use prompt caching; keep system prompts concise; move rare instructions to conditional injection |
| Full conversation history | Naive history inclusion grows with conversation length; 20-turn conversation = 10K+ tokens of context | Summarise old turns; keep last N turns only; use memory compression |
| Large RAG context chunks | Over-retrieval — returning 10 chunks when 2 would suffice; large chunk sizes | Tune retrieval to return fewer, more relevant chunks; use reranking to cut chunks before sending to LLM |
| Image inputs | Images consume 85-1,600 tokens each depending on resolution; high-resolution image = 1,600 tokens | Resize images to minimum required resolution; use low-detail mode when visual detail is not critical |
| Tool definitions | Function schemas sent with every request in agentic systems; 20 tools × 200 tokens each = 4K tokens per request | Use tool-selection routing: send only the tools relevant to the task type; cache tool definitions |
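The history-trimming mitigation from the table can be sketched in a few lines. This is a minimal version that keeps only the last N turns; a production system would additionally summarise the dropped turns (the summarisation call is omitted here):

```python
def trim_history(messages, keep_last=6):
    """Keep the system prompt plus the most recent turns only.

    Older turns are dropped outright in this sketch; a real system
    would replace them with a compact summary message instead.
    """
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    return system + turns[-keep_last:]
```

For a 20-turn conversation this caps the history contribution at a constant number of turns instead of letting it grow without bound.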
Output Cost Drivers
| Driver | Why it inflates cost | Mitigation |
|---|---|---|
| No max_tokens limit | Model generates until stop condition; a verbose model can produce 4,000+ tokens for a simple answer | Always set max_tokens based on expected output; log when responses hit the limit |
| Chain-of-thought in output | Extended thinking or reasoning traces in output can be 5-10× the final answer length | For Claude, thinking tokens are billed as output tokens; cap spend with the budget_tokens parameter, and do not enable extended thinking for non-reasoning tasks |
| Verbose prompt instructions | "Explain in detail..." or "Provide comprehensive..." instructions produce longer outputs | Scope output explicitly: "Answer in 3 sentences maximum" / "Return JSON only, no explanation" |
| Large structured output schemas | JSON with many optional fields — model fills all fields even when empty | Request only required fields; use response_format with minimal schema |
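The "log when responses hit the limit" mitigation can be sketched as a post-call check. This assumes a response shape with a `stop_reason` field, as in the Anthropic Messages API, where the value `"max_tokens"` indicates the output was truncated at the limit:

```python
import logging

MAX_TOKENS = 512  # sized to the expected answer, not the model maximum


def check_truncation(response: dict) -> bool:
    """Return True and log a warning if the response hit max_tokens.

    Frequent truncation means the limit is too tight (bad answers) or
    the prompt invites verbosity (wasted spend) -- either way, worth
    an alert rather than silent clipping.
    """
    if response.get("stop_reason") == "max_tokens":
        logging.warning("Response truncated at max_tokens=%d", MAX_TOKENS)
        return True
    return False
```

A truncation rate above a few percent is a signal to revisit either the limit or the prompt's output instructions.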
Model Tier Cost Multipliers (2025)
| Tier | Cost multiplier | Example models | Output price (per 1M tokens) |
|---|---|---|---|
| Frontier | 100× | Claude Opus, GPT-4.5, Gemini Ultra | ~$15-75 |
| Flagship | 10× | Claude Sonnet, GPT-4o, Gemini 1.5 Pro | ~$3-15 |
| Efficient | 1× | Claude Haiku, GPT-4o-mini, Gemini Flash | ~$0.40-1.25 |
| Local | 0 marginal | Llama 3, Mistral, Phi-4 via Ollama/vLLM | Infrastructure cost only; no per-token charge |
Exact pricing changes frequently. Always check provider pricing pages before budgeting. The multipliers above are illustrative ratios — the relative cost gap between tiers has been narrowing as efficient models improve.
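One practical consequence of the tier ratios is model routing: send simple tasks to the cheap tier and escalate only when needed. The sketch below uses hypothetical model names and illustrative prices; the routing signal (here, a pre-classified task complexity) would come from a classifier or heuristic in a real system:

```python
# Illustrative names and prices only -- check current provider pricing.
TIERS = {
    "efficient": {"model": "claude-haiku",  "output_price": 1.25},
    "flagship":  {"model": "claude-sonnet", "output_price": 15.0},
    "frontier":  {"model": "claude-opus",   "output_price": 75.0},
}


def route(task_complexity: str) -> str:
    """Map task complexity to the cheapest adequate tier.

    Unknown complexity falls through to the frontier tier: paying
    100x on rare hard cases is cheaper than failing them.
    """
    tier = {"simple": "efficient", "standard": "flagship"}.get(
        task_complexity, "frontier")
    return TIERS[tier]["model"]
```

If most traffic is "simple", routing shifts the blended per-request cost toward the 1× tier while preserving quality on the hard tail.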
Hidden Costs
- Embedding generation — calls to an embedding model for every document and query in a RAG system; often overlooked in budget calculations
- Vector database storage and query costs — Pinecone, Weaviate Cloud, and similar services charge for storage and read units
- Observability platform — Langfuse cloud, LangSmith, or custom telemetry infrastructure
- Fine-tuning compute — GPU hours for training; often underestimated in initial project budgets
- Human review labour — sampling conversations for quality review; annotation for RLHF or evaluation datasets
- Guardrails overhead — additional classifier or moderation calls per request add direct inference cost on top of the latency penalty
Cost Attribution
By use case
Tag every LLM call with a use_case identifier. Essential for finding which feature is driving cost growth. A single use case consuming > 50% of spend warrants investigation.
By team
Tag with team identifier for chargeback. Makes teams accountable for their AI spend and creates incentives to optimise. Required for enterprise-scale governance.
By user
Track per-user spend to detect abuse and enforce fair-use limits. Useful for B2B SaaS: AI cost is a per-customer variable cost that must be factored into pricing.
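The three attribution dimensions above can be combined in a single ledger that every LLM call writes to. This is a minimal in-memory sketch; a real system would emit these tags to a metrics or observability backend instead:

```python
from collections import defaultdict


class CostLedger:
    """Accumulate per-request cost under attribution tags.

    Each record is attributed along all three dimensions at once,
    so spend can later be sliced by use case, team, or user.
    """

    def __init__(self):
        self.spend = defaultdict(float)

    def record(self, cost, *, use_case, team, user):
        for key in (("use_case", use_case), ("team", team), ("user", user)):
            self.spend[key] += cost

    def top(self, dimension):
        """Spend by tag within one dimension, highest first."""
        rows = [(k[1], v) for k, v in self.spend.items() if k[0] == dimension]
        return sorted(rows, key=lambda r: -r[1])
```

With this in place, the "single use case consuming > 50% of spend" check is a one-line query against `top("use_case")`.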
Checklist: Do You Understand This?
- Calculate the monthly cost of a RAG chatbot with a 3,000-token system prompt, 2,000 average tokens of retrieved context, 500-token user message, and 400-token output, running 50,000 requests per day at Sonnet-tier pricing.
- Why does leaving max_tokens unset create a cost risk, not just a latency risk?
- Name three hidden costs that are frequently omitted from AI project budget estimates.
- What is the cost multiplier ratio between a frontier model and an efficient model, and what does it imply for routing strategy?
- Why should every LLM call be tagged with a use_case identifier before it reaches the API?
- How does prompt caching change the cost equation for a system with a large static system prompt?