Intermediate

Cost Optimization Patterns

Beyond choosing the right model tier, several implementation patterns can dramatically reduce AI costs on existing workloads. Applied together, these patterns routinely achieve 50–90% cost reduction without sacrificing output quality. The key is applying them systematically rather than hoping model prices drop further.

Pattern 1 — Prompt Caching

Savings potential: 40–80% on input costs (when your system prompt and context are large and stable)

Prompt caching stores the KV (key-value) cache of your prompt prefix so repeated requests reuse it at 10% of the standard token price. The biggest wins come when:

You have a large system prompt (1,000+ tokens) that doesn't change between requests
You prepend the same documents or context to many different user queries
You maintain long conversation histories in a persistent chat interface

Best candidates for caching:

RAG system prompt + retrieved context template
Large tool/function schema definitions
Company knowledge base or policy documents
Conversation history in multi-turn chat
Few-shot examples in the system prompt

Caching won't help when:

Every request has a unique, large context (no repeating prefix)
Your system prompt is short (<500 tokens)
You rarely hit the same prefix twice within the cache TTL window

Rule of thumb: if more than 50% of your average input tokens are from a stable prefix, prompt caching is your highest-leverage optimization.

Pattern 2 — Batch Processing

Savings potential: 50% flat discount on all tokens (works everywhere Anthropic or OpenAI Batch API is available)

Any workload that doesn't need a real-time response should use batch APIs. The trade-off is simple: accept up to 24-hour processing time, get 50% off everything.

Identifying your batch-eligible workloads is the first step:

Document processing pipelines (contracts, invoices, reports)
Nightly data enrichment or classification jobs
Generating embeddings for a new corpus
Running evaluation tests against a benchmark
Pre-generating content variations (A/B test copy, email personalization)

Practical note: most pipelines labelled “real-time” actually tolerate minutes or hours. Challenge the assumption that a workload needs sub-second response. Many can be queued and processed asynchronously.

Pattern 3 — Model Routing (Cascade)

Savings potential: 25–60% on mixed workloads by matching request complexity to model tier

Not every request in a system has the same complexity. A routing layer classifies incoming requests and directs them to the cheapest tier that can handle them:

Incoming request

User query or task arrives at the system

Classify complexity

Cheap classifier (T3 model or rule-based) scores the request: simple / medium / complex

Route to tier

Simple → T3 Haiku/Flash-Lite. Medium → T2 Sonnet/Flash. Complex → T1 Opus/o3

Generate response

Each request processed by the cheapest tier capable of handling it

Quality check (optional)

For high-stakes outputs, escalate T3/T2 responses to T1 for verification

Routing tools like LiteLLM (open-source, self-hosted) and OpenRouter (SaaS, ~5% markup) provide unified APIs across providers to make routing implementation simpler. LiteLLM has zero markup but requires self-hosting. OpenRouter is turnkey but adds cost at scale.

Pattern 4 — Output Length Control

Savings potential: 20–50% (output tokens are 4–8× more expensive than input)

Output tokens dominate cost. Prompts that encourage verbose answers cost significantly more than prompts engineered for concise, structured outputs:

Ask for specific formats — “Return a JSON object with keys X, Y, Z” produces less output than “describe the result”
Use structured outputs — JSON mode forces the model to populate a schema rather than narrate an answer
Set max_tokens — cap output length at your expected maximum to prevent runaway generation
Specify length explicitly — “In 2–3 bullet points” produces a fraction of the tokens “Summarize this document” produces

Pattern 5 — Fine-Tuning for Narrow Tasks

Savings potential: allows T3 to match T2/T1 quality on specific tasks at T3 prices

Fine-tuning a small model on examples of your specific task can produce quality that matches or exceeds a much larger model:

Collect 50–500 examples of (input, ideal output) for your specific task
Fine-tune GPT-4o-mini, Haiku, or an open-weight model (Llama, Phi-4-mini) on these examples
The fine-tuned T3 model learns the specific output format, tone, and patterns for your domain
Use this fine-tuned model instead of a T2 or T1 model for that specific task

Fine-tuning works best for: structured extraction with a fixed schema, consistent tone and style matching, domain-specific classification, and tasks with predictable output patterns. It does not help for open-ended reasoning.

Pattern 6 — Stacking Discounts

Savings potential: up to 95% when caching + batching are combined on eligible workloads

The highest savings come from layering multiple optimization techniques on the same workload. A document processing pipeline optimized end-to-end might look like:

Optimization applied	Effective cost vs standard
No optimization (T2, standard)	100%
Switch to T3 model	~15–30% (70–85% savings)
T3 + prompt caching (stable system prompt)	~8–15%
T3 + prompt caching + Batch API	~4–8% (92–96% savings)
T3 fine-tuned + caching + batch	~3–5% (95–97% savings)

Not every workload supports all layers. Real-time interactive workloads can't use batch API. Tasks with constantly varying contexts won't benefit much from caching. Apply each pattern only where it fits.

Pattern 7 — Cost Visibility Before Optimization

Before optimizing, measure. Most teams don't know where their AI spend is going:

Tag every API call with a feature, user type, or workflow label
Track token counts and costs per endpoint/workflow, not just in aggregate
Identify the top 20% of request types that drive 80% of cost
Optimize those first — not the long tail
Re-measure after each optimization before applying the next

Tools: OpenAI usage dashboard, Anthropic console usage analytics, LiteLLM's proxy cost tracking, CloudZero or PointFive for multi-provider AI FinOps.

Checklist: Do You Understand This?

Prompt caching: 90% off repeated input — highest value when system prompt + context is large and stable
Batch API: 50% off everything — apply to any non-real-time workload (documents, evals, nightly jobs)
Model routing: classify request complexity, route to cheapest tier that can handle it (25–60% savings)
Output length control: ask for structured/concise outputs; set max_tokens; output tokens are 4–8× pricier
Fine-tuning: T3 model + task-specific training can match T2/T1 quality on narrow tasks at T3 cost
Stacking: caching + batch + right-tier model = 90–97% cost reduction on eligible workloads