
Token & Cost Drivers for AI Systems

AI cost management starts with understanding the cost equation and what drives each component. Unlike traditional infrastructure where you pay for compute time, LLM APIs charge per token — and the number of tokens is a function of your prompt design, context management strategy, and output verbosity. Cost visibility is a prerequisite for cost control.

The Cost Equation

Request cost = (input_tokens × input_price_per_M) + (output_tokens × output_price_per_M)

Example — Claude Sonnet 4.6 (approximate 2025 pricing):

input_tokens = 5,000 (system prompt 2K + conversation history 1K + user message 2K)

output_tokens = 500

cost = (5,000 × $3/1M) + (500 × $15/1M)

= $0.015 + $0.0075 = $0.0225 per request

At 10,000 requests/day: $225/day = ~$6,750/month
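The arithmetic above can be sanity-checked with a small helper. The prices are the illustrative Sonnet-tier figures from the example, not live pricing:

```python
# Per-request cost from token counts and per-million-token prices.
# Substitute current provider pricing before budgeting for real.

def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Return USD cost of a single request."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

cost = request_cost(5_000, 500, input_price_per_m=3.0, output_price_per_m=15.0)
print(f"${cost:.4f} per request")               # → $0.0225 per request
print(f"${cost * 10_000 * 30:,.0f} per month")  # → $6,750 per month (10K req/day)
```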

With prompt caching (90% of the system prompt cached):

cached tokens = 1,800 (system prompt) × $0.30/1M = $0.00054

uncached input = 3,200 × $3/1M = $0.0096

output = 500 × $15/1M = $0.0075

cost = $0.00054 + $0.0096 + $0.0075 = $0.0176 per request (~22% saving)
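The caching-adjusted cost can be sketched the same way. The $0.30/1M cache-read rate is the illustrative figure from the example above, not a quoted price:

```python
# Cost with prompt caching: cached tokens are read back at a discounted
# rate (assumed $0.30/1M here vs $3/1M for uncached input tokens).

def cached_request_cost(cached_tokens: int, uncached_input_tokens: int,
                        output_tokens: int, cache_read_per_m: float = 0.30,
                        input_per_m: float = 3.0, output_per_m: float = 15.0) -> float:
    return (cached_tokens * cache_read_per_m
            + uncached_input_tokens * input_per_m
            + output_tokens * output_per_m) / 1_000_000

baseline = (5_000 * 3.0 + 500 * 15.0) / 1_000_000   # $0.0225 uncached
cached = cached_request_cost(1_800, 3_200, 500)     # $0.01764 with caching
print(f"${cached:.5f} vs ${baseline:.4f} uncached")
print(f"saving: {1 - cached / baseline:.1%}")       # → saving: 21.6%
```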

Input Cost Drivers

| Driver | Why it inflates cost | Mitigation |
| --- | --- | --- |
| Long system prompts | Repeated on every request regardless of user message length; a 2K-token system prompt × 1M requests = 2B tokens | Use prompt caching; keep system prompts concise; move rare instructions to conditional injection |
| Full conversation history | Naive history inclusion grows with conversation length; a 20-turn conversation = 10K+ tokens of context | Summarise old turns; keep only the last N turns; use memory compression |
| Large RAG context chunks | Over-retrieval: returning 10 chunks when 2 would suffice; large chunk sizes | Tune retrieval to return fewer, more relevant chunks; use reranking to cut chunks before sending to the LLM |
| Image inputs | Images consume 85-1,600 tokens each depending on resolution; a high-resolution image = 1,600 tokens | Resize images to the minimum required resolution; use low-detail mode when visual detail is not critical |
| Tool definitions | Function schemas are sent with every request in agentic systems; 20 tools × 200 tokens each = 4K tokens per request | Use tool-selection routing: send only the tools relevant to the task type; cache tool definitions |
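One mitigation from the table, keeping only the last N turns of history, can be sketched as follows. The 4-characters-per-token estimate is a rough assumption; use the provider's tokenizer for real budgeting:

```python
# Keep only the last N conversation turns to bound input token growth.

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token (assumption, not a tokenizer).
    return max(1, len(text) // 4)

def trim_history(messages: list[dict], keep_last_turns: int = 4) -> list[dict]:
    """Drop all but the last N user/assistant turn pairs."""
    return messages[-2 * keep_last_turns:]

# Simulated 20-turn conversation (40 alternating messages).
history = [{"role": "user" if i % 2 == 0 else "assistant",
            "content": f"turn {i}: " + "x" * 400} for i in range(40)]
trimmed = trim_history(history, keep_last_turns=4)
before = sum(estimate_tokens(m["content"]) for m in history)
after = sum(estimate_tokens(m["content"]) for m in trimmed)
print(f"{before} → {after} estimated context tokens")
```

A production variant would summarise the dropped turns into a single message rather than discarding them outright.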

Output Cost Drivers

| Driver | Why it inflates cost | Mitigation |
| --- | --- | --- |
| No max_tokens limit | The model generates until a stop condition; a verbose model can produce 4,000+ tokens for a simple answer | Always set max_tokens based on expected output; log when responses hit the limit |
| Chain-of-thought in output | Extended thinking or reasoning traces in output can be 5-10× the final answer length | For Claude: thinking tokens are billed as output tokens; the budget_tokens parameter caps spend; do not enable extended thinking for non-reasoning tasks |
| Verbose prompt instructions | "Explain in detail..." or "Provide comprehensive..." instructions produce longer outputs | Scope output explicitly: "Answer in 3 sentences maximum" / "Return JSON only, no explanation" |
| Large structured output schemas | JSON with many optional fields: the model fills every field even when empty | Request only required fields; use response_format with a minimal schema |
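The first mitigation in the table, logging when responses hit the limit, can be sketched against a response shaped like Anthropic's Messages API (which reports a stop_reason field); the logger wiring and use-case tag are illustrative:

```python
# Warn when a response was truncated by max_tokens — a signal that the
# limit is too low or the prompt is inviting runaway output.
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("llm-cost")

def check_truncation(response: dict, use_case: str) -> bool:
    """Return True (and warn) if generation stopped at the token limit."""
    hit_limit = response.get("stop_reason") == "max_tokens"
    if hit_limit:
        log.warning("max_tokens hit for use_case=%s (output=%d tokens)",
                    use_case, response["usage"]["output_tokens"])
    return hit_limit

resp = {"stop_reason": "max_tokens", "usage": {"output_tokens": 1024}}
print(check_truncation(resp, "support-chat"))  # → True
```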

Model Tier Cost Multipliers (2025)

Frontier (100×)

Claude Opus, GPT-4.5, Gemini Ultra

~$15-75 per 1M output tokens

Flagship (10×)

Claude Sonnet, GPT-4o, Gemini 1.5 Pro

~$3-15 per 1M output tokens

Efficient (1×)

Claude Haiku, GPT-4o-mini, Gemini Flash

~$0.40-1.25 per 1M output tokens

Local (0 marginal)

Llama 3, Mistral, Phi-4 via Ollama/vLLM

Infrastructure cost only; no per-token charge

Exact pricing changes frequently. Always check provider pricing pages before budgeting. The multipliers above are illustrative ratios — the relative cost gap between tiers has been narrowing as efficient models improve.
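A routing policy built on these multipliers might look like the toy sketch below; the keyword heuristics and tier labels are placeholders, not a recommendation:

```python
# Route each task to the cheapest tier that can plausibly handle it,
# using the illustrative 1× / 10× / 100× cost multipliers above.

TIER_MULTIPLIER = {"efficient": 1, "flagship": 10, "frontier": 100}

def route(task: str) -> str:
    """Pick a model tier from crude task-complexity keywords (placeholder logic)."""
    text = task.lower()
    if any(k in text for k in ("prove", "architecture", "multi-step")):
        return "frontier"
    if any(k in text for k in ("summarise", "summarize", "draft")):
        return "flagship"
    return "efficient"

print(route("Classify this ticket"))               # → efficient
print(route("Draft a reply to the client"))        # → flagship
print(route("Design the system architecture"))     # → frontier
```

In practice routing is usually done with a small classifier model or confidence-based escalation rather than keywords, but the cost logic is the same: default cheap, escalate only when needed.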

Hidden Costs

  • Embedding generation — calls to an embedding model for every document and query in a RAG system; often overlooked in budget calculations
  • Vector database storage and query costs — Pinecone, Weaviate Cloud, and similar services charge for storage and read units
  • Observability platform — Langfuse cloud, LangSmith, or custom telemetry infrastructure
  • Fine-tuning compute — GPU hours for training; often underestimated in initial project budgets
  • Human review labour — sampling conversations for quality review; annotation for RLHF or evaluation datasets
  • Guardrails tax — additional classifier or moderation calls per request add token cost directly, plus indirect infrastructure cost from the added latency
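The first hidden cost, embedding generation, is easy to estimate up front. The $0.02/1M price below is an assumed placeholder; check your embedding provider's rate:

```python
# Back-of-envelope embedding cost for indexing a RAG corpus.
# Embedding models are typically priced per input token only.

def embedding_cost(num_docs: int, avg_tokens_per_doc: int,
                   price_per_m: float = 0.02) -> float:
    """USD cost to embed the whole corpus once (re-indexing costs extra)."""
    return num_docs * avg_tokens_per_doc * price_per_m / 1_000_000

# 1M documents averaging 800 tokens each
print(f"${embedding_cost(1_000_000, 800):.2f} one-off indexing cost")  # → $16.00
```

Note that query-time embeddings, re-indexing after document updates, and vector storage are recurring on top of this one-off figure.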

Cost Attribution

By use case

Tag every LLM call with a use_case identifier. Essential for finding which feature is driving cost growth. A single use case consuming > 50% of spend warrants investigation.

By team

Tag with team identifier for chargeback. Makes teams accountable for their AI spend and creates incentives to optimise. Required for enterprise-scale governance.

By user

Track per-user spend to detect abuse and enforce fair-use limits. Useful for B2B SaaS: AI cost is a per-customer variable cost that must be factored into pricing.
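All three attribution dimensions reduce to the same mechanism: tag each call, then aggregate. A minimal in-memory sketch (a real system would emit these tags to an observability platform rather than a local list):

```python
# Tag every LLM call with use_case / team / user and aggregate spend
# along any dimension for chargeback or abuse detection.
from collections import defaultdict

ledger: list[dict] = []

def record_call(cost_usd: float, use_case: str, team: str, user: str) -> None:
    ledger.append({"cost": cost_usd, "use_case": use_case,
                   "team": team, "user": user})

def spend_by(dimension: str) -> dict[str, float]:
    """Total spend grouped by 'use_case', 'team', or 'user'."""
    totals: dict[str, float] = defaultdict(float)
    for call in ledger:
        totals[call[dimension]] += call["cost"]
    return dict(totals)

record_call(0.0225, "support-chat", "cx", "acme-corp")
record_call(0.0450, "support-chat", "cx", "globex")
record_call(0.0020, "ticket-triage", "ops", "acme-corp")
print(spend_by("use_case"))
print(spend_by("user"))
```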

Checklist: Do You Understand This?

  • Calculate the monthly cost of a RAG chatbot with a 3,000-token system prompt, 2,000 average tokens of retrieved context, 500-token user message, and 400-token output, running 50,000 requests per day at Sonnet-tier pricing.
  • Why does not setting a max_tokens limit create a cost risk — not just a latency risk?
  • Name three hidden costs that are frequently omitted from AI project budget estimates.
  • What is the cost multiplier ratio between a frontier model and an efficient model, and what does it imply for routing strategy?
  • Why should every LLM call be tagged with a use_case identifier before it reaches the API?
  • How does prompt caching change the cost equation for a system with a large static system prompt?