Token & Cost Drivers for AI Systems
AI cost management starts with understanding the cost equation and what drives each component. Unlike traditional infrastructure where you pay for compute time, LLM APIs charge per token — and the number of tokens is a function of your prompt design, context management strategy, and output verbosity. Cost visibility is a prerequisite for cost control.
The Cost Equation
Request cost = (input_tokens × input_price_per_M) + (output_tokens × output_price_per_M)
Example — Claude Sonnet 4.6 (approximate 2025 pricing):
input_tokens = 5,000 (system prompt 2K + conversation history 1K + user message 2K)
output_tokens = 500
cost = (5,000 × $3/1M) + (500 × $15/1M)
= $0.015 + $0.0075 = $0.0225 per request
At 10,000 requests/day: $225/day = ~$6,750/month
# With prompt caching (90% of system prompt cached):
# cached tokens = 1,800 (system prompt) × $0.30/1M = $0.00054
# uncached input = 3,200 × $3/1M = $0.0096
# output = 500 × $15/1M = $0.0075
# cost = $0.0177 per request (21% saving)
Input Cost Drivers
| Driver | Why it inflates cost | Mitigation |
|---|---|---|
| Long system prompts | Repeated on every request regardless of user message length; 2K token system prompt × 1M requests = 2B tokens | Use prompt caching; keep system prompts concise; move rare instructions to conditional injection |
| Full conversation history | Naive history inclusion grows with conversation length; 20-turn conversation = 10K+ tokens of context | Summarise old turns; keep last N turns only; use memory compression |
| Large RAG context chunks | Over-retrieval — returning 10 chunks when 2 would suffice; large chunk sizes | Tune retrieval to return fewer, more relevant chunks; use reranking to cut chunks before sending to LLM |
| Image inputs | Images consume 85-1,600 tokens each depending on resolution; high-resolution image = 1,600 tokens | Resize images to minimum required resolution; use low-detail mode when visual detail is not critical |
| Tool definitions | Function schemas sent with every request in agentic systems; 20 tools × 200 tokens each = 4K tokens per request | Use tool-selection routing: send only the tools relevant to the task type; cache tool definitions |
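The history-trimming mitigation from the table can be sketched in a few lines. This is a minimal version that keeps only the last N turns; a production system would additionally summarise the dropped turns (the summarisation call is omitted here):

```python
def trim_history(messages, keep_last=6):
    """Keep the system prompt plus the most recent turns only.

    Older turns are dropped outright in this sketch; a real system
    would replace them with a compact summary message instead.
    """
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    return system + turns[-keep_last:]
```

For a 20-turn conversation this caps the history contribution at a constant number of turns instead of letting it grow without bound.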
Output Cost Drivers
| Driver | Why it inflates cost | Mitigation |
|---|---|---|
| No max_tokens limit | Model generates until stop condition; a verbose model can produce 4,000+ tokens for a simple answer | Always set max_tokens based on expected output; log when responses hit the limit |
| Chain-of-thought in output | Extended thinking or reasoning traces in output can be 5-10× the final answer length | For Claude, thinking tokens are billed as output tokens; cap spend with the budget_tokens parameter, and do not enable extended thinking for non-reasoning tasks |
| Verbose prompt instructions | "Explain in detail..." or "Provide comprehensive..." instructions produce longer outputs | Scope output explicitly: "Answer in 3 sentences maximum" / "Return JSON only, no explanation" |
| Large structured output schemas | JSON with many optional fields — model fills all fields even when empty | Request only required fields; use response_format with minimal schema |
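The "log when responses hit the limit" mitigation can be sketched as a post-call check. This assumes a response shape with a `stop_reason` field, as in the Anthropic Messages API, where the value `"max_tokens"` indicates the output was truncated at the limit:

```python
import logging

MAX_TOKENS = 512  # sized to the expected answer, not the model maximum


def check_truncation(response: dict) -> bool:
    """Return True and log a warning if the response hit max_tokens.

    Frequent truncation means the limit is too tight (bad answers) or
    the prompt invites verbosity (wasted spend) -- either way, worth
    an alert rather than silent clipping.
    """
    if response.get("stop_reason") == "max_tokens":
        logging.warning("Response truncated at max_tokens=%d", MAX_TOKENS)
        return True
    return False
```

A truncation rate above a few percent is a signal to revisit either the limit or the prompt's output instructions.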
Model Tier Cost Multipliers (2025)
| Tier | Cost multiplier | Example models | Output price (per 1M tokens) |
|---|---|---|---|
| Frontier | 100× | Claude Opus, GPT-4.5, Gemini Ultra | ~$15-75 |
| Flagship | 10× | Claude Sonnet, GPT-4o, Gemini 1.5 Pro | ~$3-15 |
| Efficient | 1× | Claude Haiku, GPT-4o-mini, Gemini Flash | ~$0.40-1.25 |
| Local | 0 marginal | Llama 3, Mistral, Phi-4 via Ollama/vLLM | Infrastructure cost only; no per-token charge |
Exact pricing changes frequently. Always check provider pricing pages before budgeting. The multipliers above are illustrative ratios — the relative cost gap between tiers has been narrowing as efficient models improve.
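One practical consequence of the tier ratios is model routing: send simple tasks to the cheap tier and escalate only when needed. The sketch below uses hypothetical model names and illustrative prices; the routing signal (here, a pre-classified task complexity) would come from a classifier or heuristic in a real system:

```python
# Illustrative names and prices only -- check current provider pricing.
TIERS = {
    "efficient": {"model": "claude-haiku",  "output_price": 1.25},
    "flagship":  {"model": "claude-sonnet", "output_price": 15.0},
    "frontier":  {"model": "claude-opus",   "output_price": 75.0},
}


def route(task_complexity: str) -> str:
    """Map task complexity to the cheapest adequate tier.

    Unknown complexity falls through to the frontier tier: paying
    100x on rare hard cases is cheaper than failing them.
    """
    tier = {"simple": "efficient", "standard": "flagship"}.get(
        task_complexity, "frontier")
    return TIERS[tier]["model"]
```

If most traffic is "simple", routing shifts the blended per-request cost toward the 1× tier while preserving quality on the hard tail.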
Hidden Costs
- Embedding generation — calls to an embedding model for every document and query in a RAG system; often overlooked in budget calculations
- Vector database storage and query costs — Pinecone, Weaviate Cloud, and similar services charge for storage and read units
- Observability platform — Langfuse cloud, LangSmith, or custom telemetry infrastructure
- Fine-tuning compute — GPU hours for training; often underestimated in initial project budgets
- Human review labour — sampling conversations for quality review; annotation for RLHF or evaluation datasets
- Guardrails overhead — additional classifier or moderation calls per request add direct inference cost on top of the latency penalty
Cost Attribution
By use case
Tag every LLM call with a use_case identifier. Essential for finding which feature is driving cost growth. A single use case consuming > 50% of spend warrants investigation.
By team
Tag with team identifier for chargeback. Makes teams accountable for their AI spend and creates incentives to optimise. Required for enterprise-scale governance.
By user
Track per-user spend to detect abuse and enforce fair-use limits. Useful for B2B SaaS: AI cost is a per-customer variable cost that must be factored into pricing.
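The three attribution dimensions above can be combined in a single ledger that every LLM call writes to. This is a minimal in-memory sketch; a real system would emit these tags to a metrics or observability backend instead:

```python
from collections import defaultdict


class CostLedger:
    """Accumulate per-request cost under attribution tags.

    Each record is attributed along all three dimensions at once,
    so spend can later be sliced by use case, team, or user.
    """

    def __init__(self):
        self.spend = defaultdict(float)

    def record(self, cost, *, use_case, team, user):
        for key in (("use_case", use_case), ("team", team), ("user", user)):
            self.spend[key] += cost

    def top(self, dimension):
        """Spend by tag within one dimension, highest first."""
        rows = [(k[1], v) for k, v in self.spend.items() if k[0] == dimension]
        return sorted(rows, key=lambda r: -r[1])
```

With this in place, the "single use case consuming > 50% of spend" check is a one-line query against `top("use_case")`.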
Checklist: Do You Understand This?
- Calculate the monthly cost of a RAG chatbot with a 3,000-token system prompt, 2,000 average tokens of retrieved context, 500-token user message, and 400-token output, running 50,000 requests per day at Sonnet-tier pricing.
- Why does leaving max_tokens unset create a cost risk, not just a latency risk?
- Name three hidden costs that are frequently omitted from AI project budget estimates.
- What is the cost multiplier ratio between a frontier model and an efficient model, and what does it imply for routing strategy?
- Why should every LLM call be tagged with a use_case identifier before it reaches the API?
- How does prompt caching change the cost equation for a system with a large static system prompt?