Intermediate

Cost Optimization Patterns

Beyond choosing the right model tier, several implementation patterns can dramatically reduce AI costs on existing workloads. Applied together, these patterns routinely achieve 50–90% cost reduction without sacrificing output quality. The key is applying them systematically rather than hoping model prices drop further.

Pattern 1 β€” Prompt Caching

Savings potential: 40–80% on input costs (when your system prompt and context are large and stable)

Prompt caching stores the KV (key-value) cache of your prompt prefix so repeated requests reuse it at 10% of the standard token price. The biggest wins come when:

  • You have a large system prompt (1,000+ tokens) that doesn't change between requests
  • You prepend the same documents or context to many different user queries
  • You maintain long conversation histories in a persistent chat interface

Best candidates for caching:

  • RAG system prompt + retrieved context template
  • Large tool/function schema definitions
  • Company knowledge base or policy documents
  • Conversation history in multi-turn chat
  • Few-shot examples in the system prompt

Caching won't help when:

  • Every request has a unique, large context (no repeating prefix)
  • Your system prompt is short (<500 tokens)
  • You rarely hit the same prefix twice within the cache TTL window

Rule of thumb: if more than 50% of your average input tokens are from a stable prefix, prompt caching is your highest-leverage optimization.

Pattern 2 β€” Batch Processing

Savings potential: 50% flat discount on all tokens (works everywhere Anthropic or OpenAI Batch API is available)

Any workload that doesn't need a real-time response should use batch APIs. The trade-off is simple: accept up to 24-hour processing time, get 50% off everything.

Identifying your batch-eligible workloads is the first step:

  • Document processing pipelines (contracts, invoices, reports)
  • Nightly data enrichment or classification jobs
  • Generating embeddings for a new corpus
  • Running evaluation tests against a benchmark
  • Pre-generating content variations (A/B test copy, email personalization)

Practical note: most pipelines labelled β€œreal-time” actually tolerate minutes or hours. Challenge the assumption that a workload needs sub-second response. Many can be queued and processed asynchronously.

Pattern 3 β€” Model Routing (Cascade)

Savings potential: 25–60% on mixed workloads by matching request complexity to model tier

Not every request in a system has the same complexity. A routing layer classifies incoming requests and directs them to the cheapest tier that can handle them:

1
Incoming request

User query or task arrives at the system

2
Classify complexity

Cheap classifier (T3 model or rule-based) scores the request: simple / medium / complex

3
Route to tier

Simple β†’ T3 Haiku/Flash-Lite. Medium β†’ T2 Sonnet/Flash. Complex β†’ T1 Opus/o3

4
Generate response

Each request processed by the cheapest tier capable of handling it

5
Quality check (optional)

For high-stakes outputs, escalate T3/T2 responses to T1 for verification

Routing tools like LiteLLM (open-source, self-hosted) and OpenRouter (SaaS, ~5% markup) provide unified APIs across providers to make routing implementation simpler. LiteLLM has zero markup but requires self-hosting. OpenRouter is turnkey but adds cost at scale.

Pattern 4 β€” Output Length Control

Savings potential: 20–50% (output tokens are 4–8Γ— more expensive than input)

Output tokens dominate cost. Prompts that encourage verbose answers cost significantly more than prompts engineered for concise, structured outputs:

  • Ask for specific formats β€” β€œReturn a JSON object with keys X, Y, Z” produces less output than β€œdescribe the result”
  • Use structured outputs β€” JSON mode forces the model to populate a schema rather than narrate an answer
  • Set max_tokens β€” cap output length at your expected maximum to prevent runaway generation
  • Specify length explicitly β€” β€œIn 2–3 bullet points” produces a fraction of the tokens β€œSummarize this document” produces

Pattern 5 β€” Fine-Tuning for Narrow Tasks

Savings potential: allows T3 to match T2/T1 quality on specific tasks at T3 prices

Fine-tuning a small model on examples of your specific task can produce quality that matches or exceeds a much larger model:

  • Collect 50–500 examples of (input, ideal output) for your specific task
  • Fine-tune GPT-4o-mini, Haiku, or an open-weight model (Llama, Phi-4-mini) on these examples
  • The fine-tuned T3 model learns the specific output format, tone, and patterns for your domain
  • Use this fine-tuned model instead of a T2 or T1 model for that specific task

Fine-tuning works best for: structured extraction with a fixed schema, consistent tone and style matching, domain-specific classification, and tasks with predictable output patterns. It does not help for open-ended reasoning.

Pattern 6 β€” Stacking Discounts

Savings potential: up to 95% when caching + batching are combined on eligible workloads

The highest savings come from layering multiple optimization techniques on the same workload. A document processing pipeline optimized end-to-end might look like:

Optimization appliedEffective cost vs standard
No optimization (T2, standard)100%
Switch to T3 model~15–30% (70–85% savings)
T3 + prompt caching (stable system prompt)~8–15%
T3 + prompt caching + Batch API~4–8% (92–96% savings)
T3 fine-tuned + caching + batch~3–5% (95–97% savings)

Not every workload supports all layers. Real-time interactive workloads can't use batch API. Tasks with constantly varying contexts won't benefit much from caching. Apply each pattern only where it fits.

Pattern 7 β€” Cost Visibility Before Optimization

Before optimizing, measure. Most teams don't know where their AI spend is going:

  • Tag every API call with a feature, user type, or workflow label
  • Track token counts and costs per endpoint/workflow, not just in aggregate
  • Identify the top 20% of request types that drive 80% of cost
  • Optimize those first β€” not the long tail
  • Re-measure after each optimization before applying the next

Tools: OpenAI usage dashboard, Anthropic console usage analytics, LiteLLM's proxy cost tracking, CloudZero or PointFive for multi-provider AI FinOps.

Checklist: Do You Understand This?

  • Prompt caching: 90% off repeated input β€” highest value when system prompt + context is large and stable
  • Batch API: 50% off everything β€” apply to any non-real-time workload (documents, evals, nightly jobs)
  • Model routing: classify request complexity, route to cheapest tier that can handle it (25–60% savings)
  • Output length control: ask for structured/concise outputs; set max_tokens; output tokens are 4–8Γ— pricier
  • Fine-tuning: T3 model + task-specific training can match T2/T1 quality on narrow tasks at T3 cost
  • Stacking: caching + batch + right-tier model = 90–97% cost reduction on eligible workloads

Page built: 01 Jun 2026