Cost Optimization Patterns
Beyond choosing the right model tier, several implementation patterns can dramatically reduce AI costs on existing workloads. Applied together, these patterns routinely achieve 50β90% cost reduction without sacrificing output quality. The key is applying them systematically rather than hoping model prices drop further.
Pattern 1 β Prompt Caching
Savings potential: 40β80% on input costs (when your system prompt and context are large and stable)
Prompt caching stores the KV (key-value) cache of your prompt prefix so repeated requests reuse it at 10% of the standard token price. The biggest wins come when:
- You have a large system prompt (1,000+ tokens) that doesn't change between requests
- You prepend the same documents or context to many different user queries
- You maintain long conversation histories in a persistent chat interface
Best candidates for caching:
- RAG system prompt + retrieved context template
- Large tool/function schema definitions
- Company knowledge base or policy documents
- Conversation history in multi-turn chat
- Few-shot examples in the system prompt
Caching won't help when:
- Every request has a unique, large context (no repeating prefix)
- Your system prompt is short (<500 tokens)
- You rarely hit the same prefix twice within the cache TTL window
Rule of thumb: if more than 50% of your average input tokens are from a stable prefix, prompt caching is your highest-leverage optimization.
Pattern 2 β Batch Processing
Savings potential: 50% flat discount on all tokens (works everywhere Anthropic or OpenAI Batch API is available)
Any workload that doesn't need a real-time response should use batch APIs. The trade-off is simple: accept up to 24-hour processing time, get 50% off everything.
Identifying your batch-eligible workloads is the first step:
- Document processing pipelines (contracts, invoices, reports)
- Nightly data enrichment or classification jobs
- Generating embeddings for a new corpus
- Running evaluation tests against a benchmark
- Pre-generating content variations (A/B test copy, email personalization)
Practical note: most pipelines labelled βreal-timeβ actually tolerate minutes or hours. Challenge the assumption that a workload needs sub-second response. Many can be queued and processed asynchronously.
Pattern 3 β Model Routing (Cascade)
Savings potential: 25β60% on mixed workloads by matching request complexity to model tier
Not every request in a system has the same complexity. A routing layer classifies incoming requests and directs them to the cheapest tier that can handle them:
User query or task arrives at the system
Cheap classifier (T3 model or rule-based) scores the request: simple / medium / complex
Simple β T3 Haiku/Flash-Lite. Medium β T2 Sonnet/Flash. Complex β T1 Opus/o3
Each request processed by the cheapest tier capable of handling it
For high-stakes outputs, escalate T3/T2 responses to T1 for verification
Routing tools like LiteLLM (open-source, self-hosted) and OpenRouter (SaaS, ~5% markup) provide unified APIs across providers to make routing implementation simpler. LiteLLM has zero markup but requires self-hosting. OpenRouter is turnkey but adds cost at scale.
Pattern 4 β Output Length Control
Savings potential: 20β50% (output tokens are 4β8Γ more expensive than input)
Output tokens dominate cost. Prompts that encourage verbose answers cost significantly more than prompts engineered for concise, structured outputs:
- Ask for specific formats β βReturn a JSON object with keys X, Y, Zβ produces less output than βdescribe the resultβ
- Use structured outputs β JSON mode forces the model to populate a schema rather than narrate an answer
- Set max_tokens β cap output length at your expected maximum to prevent runaway generation
- Specify length explicitly β βIn 2β3 bullet pointsβ produces a fraction of the tokens βSummarize this documentβ produces
Pattern 5 β Fine-Tuning for Narrow Tasks
Savings potential: allows T3 to match T2/T1 quality on specific tasks at T3 prices
Fine-tuning a small model on examples of your specific task can produce quality that matches or exceeds a much larger model:
- Collect 50β500 examples of (input, ideal output) for your specific task
- Fine-tune GPT-4o-mini, Haiku, or an open-weight model (Llama, Phi-4-mini) on these examples
- The fine-tuned T3 model learns the specific output format, tone, and patterns for your domain
- Use this fine-tuned model instead of a T2 or T1 model for that specific task
Fine-tuning works best for: structured extraction with a fixed schema, consistent tone and style matching, domain-specific classification, and tasks with predictable output patterns. It does not help for open-ended reasoning.
Pattern 6 β Stacking Discounts
Savings potential: up to 95% when caching + batching are combined on eligible workloads
The highest savings come from layering multiple optimization techniques on the same workload. A document processing pipeline optimized end-to-end might look like:
| Optimization applied | Effective cost vs standard |
|---|---|
| No optimization (T2, standard) | 100% |
| Switch to T3 model | ~15β30% (70β85% savings) |
| T3 + prompt caching (stable system prompt) | ~8β15% |
| T3 + prompt caching + Batch API | ~4β8% (92β96% savings) |
| T3 fine-tuned + caching + batch | ~3β5% (95β97% savings) |
Not every workload supports all layers. Real-time interactive workloads can't use batch API. Tasks with constantly varying contexts won't benefit much from caching. Apply each pattern only where it fits.
Pattern 7 β Cost Visibility Before Optimization
Before optimizing, measure. Most teams don't know where their AI spend is going:
- Tag every API call with a feature, user type, or workflow label
- Track token counts and costs per endpoint/workflow, not just in aggregate
- Identify the top 20% of request types that drive 80% of cost
- Optimize those first β not the long tail
- Re-measure after each optimization before applying the next
Tools: OpenAI usage dashboard, Anthropic console usage analytics, LiteLLM's proxy cost tracking, CloudZero or PointFive for multi-provider AI FinOps.
Checklist: Do You Understand This?
- Prompt caching: 90% off repeated input β highest value when system prompt + context is large and stable
- Batch API: 50% off everything β apply to any non-real-time workload (documents, evals, nightly jobs)
- Model routing: classify request complexity, route to cheapest tier that can handle it (25β60% savings)
- Output length control: ask for structured/concise outputs; set max_tokens; output tokens are 4β8Γ pricier
- Fine-tuning: T3 model + task-specific training can match T2/T1 quality on narrow tasks at T3 cost
- Stacking: caching + batch + right-tier model = 90β97% cost reduction on eligible workloads