🧠 All Things AI
Advanced

Caching Strategies for AI Systems

LLM calls are slow and expensive. At enterprise scale, the same system prompts, the same FAQ questions, and the same document processing tasks repeat thousands of times per day. Caching is the single highest-leverage optimisation available — but the strategy depends on whether you are caching tokens, meanings, or responses.

Three Caching Layers

| Cache type | How it works | Best for | Typical saving |
|---|---|---|---|
| Prompt cache (native) | Provider caches the processed KV state of repeated prefix tokens; subsequent calls reuse the cache | Long repeated system prompts; large document context sent with every call | Up to 90% cost reduction on cached tokens; 85% latency reduction on the cached prefix |
| Semantic cache | Embed the incoming query; return the cached response if embedding similarity exceeds a threshold | FAQ-style queries; repeated intents phrased differently | 40-70% cache hit rate on FAQ workloads; full LLM call cost avoided on a hit |
| Response cache (exact) | Hash the input string; return the stored response on an exact match | Deterministic inputs: fixed report templates, scheduled processing, static lookups | 100% cost saving on a hit; limited to inputs that truly repeat exactly |

Native Prompt Caching (Anthropic and OpenAI)

Both Anthropic (Claude) and OpenAI support prompt caching for repeated prompt prefixes. The provider processes the prefix once and caches the internal KV state. Subsequent requests that share the same prefix reuse the cache instead of reprocessing.

Anthropic prompt caching

  • Mark cache breakpoints with "cache_control": {"type": "ephemeral"} in the messages API
  • Minimum cacheable prefix: 1,024 tokens (Opus/Sonnet) or 2,048 tokens (Haiku)
  • Cache duration: 5 minutes (refreshed on each hit)
  • Cache write cost: 1.25× base input price; cache read cost: 0.1× base input price
  • Works for: system prompts, large documents, tool definitions, few-shot examples
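A minimal sketch of marking a cache breakpoint in an Anthropic Messages API request, expressed as the raw request payload. The model id and the system prompt text are placeholders; in practice the system block must reach the minimum cacheable prefix length before the provider will cache it.

```python
# Sketch: a cache breakpoint in an Anthropic Messages API request body.
# Everything up to and including the block marked with cache_control is
# cached; the user turn after it is processed fresh on every call.

LONG_SYSTEM_PROMPT = "You are a helpful assistant..."  # imagine 1,024+ tokens here

request = {
    "model": "claude-sonnet-4-20250514",  # placeholder model id
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # cache breakpoint
        }
    ],
    "messages": [
        {"role": "user", "content": "What is our refund policy?"}
    ],
}
```

The same structure works for tool definitions and few-shot examples: anything you want cached must sit before the `cache_control` breakpoint.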

OpenAI prompt caching

  • Automatic — no explicit markup required; applies to prompts of 1,024 tokens or longer
  • Cache duration: 5-10 minutes of inactivity before eviction
  • Cache read cost: 50% discount on input tokens
  • Works for: system prompt prefix (must be at the start of messages array)
  • Cache hit reported in usage.prompt_tokens_details.cached_tokens
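Because OpenAI's caching is automatic, the only way to confirm it is working is to inspect the usage block of each response. A small sketch, using a stand-in dict shaped like the API's usage payload rather than a live call:

```python
# Sketch: confirming an OpenAI prompt-cache hit from the response usage block.
# `response` is a stand-in dict with the same shape as the API's usage payload.

response = {
    "usage": {
        "prompt_tokens": 2048,
        "prompt_tokens_details": {"cached_tokens": 1920},
    }
}

usage = response["usage"]
cached = usage["prompt_tokens_details"]["cached_tokens"]
hit_rate = cached / usage["prompt_tokens"]
print(f"cached {cached} of {usage['prompt_tokens']} prompt tokens ({hit_rate:.0%})")
```

Logging this ratio per request is the simplest way to catch a dynamic prefix silently breaking the cache.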

Setup requirement for maximum cache efficiency

```python
# Cache only activates when the prefix is IDENTICAL across requests.
#
# Requirements:
# 1. System prompt must be static (no dynamic inserts at the start)
# 2. Any dynamic content (user name, date) must come AFTER the cached prefix
# 3. Tool definitions must appear before the cache breakpoint if you want them cached

# Common mistake — this BREAKS the cache:
# system = f"Today is {date}. You are a helpful assistant..."  # dynamic prefix

# Correct — the date goes in the user message, not the system prompt prefix:
# system = "You are a helpful assistant..."                    # static; cacheable
# user = f"Today is {date}. User question: {question}"
```

Semantic Caching

Semantic caching answers "is this query similar enough to a previous query that the cached response is still correct?" — not just "is it identical?"

How it works

  • Embed incoming query using a fast embedding model
  • Search cached query embeddings for cosine similarity > threshold (typically 0.95)
  • On hit: return cached response immediately; no LLM call
  • On miss: call LLM, store query embedding + response in cache
  • Tooling: GPTCache (open source) / Redis with vector index / Weaviate cache layer

False positive risk

  • "What is the capital of France?" and "What is the capital of Spain?" may be semantically similar but need different answers
  • Similarity threshold must be tuned per use case — 0.95 is a starting point, not a rule
  • Do not use semantic cache for queries that include specific entities (names, dates, numbers) — exact match is safer
  • Always log cache hit/miss rate and sample hits for quality review
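One mitigation from the list above, sketched as a guard that bypasses the semantic cache for entity-heavy queries. The heuristics here (digits, capitalised mid-sentence words) are illustrative assumptions, not a production-grade named-entity pass.

```python
# Sketch: bypass the semantic cache for queries containing specific
# entities, where near-identical embeddings can still require different
# answers. Heuristics only; a real system might run lightweight NER.
import re

def semantic_cache_eligible(query: str) -> bool:
    if re.search(r"\d", query):
        return False  # dates, amounts, order IDs: exact match only
    # Capitalised words after the first token often name specific entities.
    if any(w[:1].isupper() for w in query.split()[1:]):
        return False
    return True
```

Queries that fail this check can still go through the exact-match response cache, where a hit is guaranteed correct.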

Cache Invalidation Triggers

| Trigger | Which cache to invalidate | Why |
|---|---|---|
| RAG corpus updated | Semantic cache (full flush or partial by topic) | Cached answers based on old retrieved context are now stale |
| System prompt changed | Native prompt cache (automatic — prefix no longer matches); semantic cache (policy answers may differ) | Behaviour of the model has changed; old responses no longer reflect current policy |
| Model version change | Semantic cache (full flush) | Different model may produce different responses to the same query |
| User data deletion (GDPR) | Response cache and semantic cache entries derived from that user's data | Cached responses may contain PII that must be deleted |
| TTL expiry | Response cache entries past their configured TTL | Time-sensitive answers (prices, availability, status) become stale |
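Two of these triggers (system prompt changed, model version change) can be handled without an explicit flush by folding the version identifiers into the cache key, so old entries simply stop matching. A sketch, where the version strings are placeholders:

```python
# Sketch: version-namespaced cache keys. Bumping the model or prompt
# version makes every old entry unreachable, which is an implicit flush;
# stale entries then age out via the cache's normal eviction.
import hashlib

def versioned_cache_key(query: str, model: str, prompt_version: str) -> str:
    raw = f"{model}|{prompt_version}|{query}"
    return hashlib.sha256(raw.encode()).hexdigest()

old = versioned_cache_key("refund policy?", "model-v1", "prompt-3")
new = versioned_cache_key("refund policy?", "model-v2", "prompt-3")
```

The trade-off is storage: superseded entries linger until evicted, so pair this with a TTL rather than relying on it alone.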

Cache Security — User-Level Access Control

Critical: never serve User A's cached response to User B

A cached response may contain information that user A is authorised to see but user B is not. If your cache key does not include the user's access scope, you will inadvertently expose restricted data. The cache key must encode: query hash + user access scope hash (or role). For document-level access control, the cache key must include the user's document permissions. Never use a global semantic cache for multi-tenant systems without scope isolation.

```python
# Correct cache key construction for multi-tenant systems.
# Note: Python's built-in hash() is unsuitable here — it raises TypeError
# on a list and is not stable across processes. Use a content hash instead.
import hashlib

def build_cache_key(query: str, user: User) -> str:
    scope_hash = hashlib.sha256(
        "|".join(sorted(user.document_permissions)).encode()
    ).hexdigest()
    query_hash = hashlib.sha256(query.encode()).hexdigest()
    return f"{scope_hash}:{query_hash}"

# Single-tenant or all-users-same-access systems can use the query hash alone:
# def build_cache_key(query: str) -> str:
#     return hashlib.sha256(query.encode()).hexdigest()  # only safe if all users have identical access
```

What Not to Cache

  • Personalised responses — answers that depend on user-specific data from memory or profile
  • Time-sensitive data — anything involving current prices, availability, live status, or "today"
  • Responses that include the user's own PII — caching creates additional retention obligations
  • Agentic action outputs — actions have side effects; returning a cached "I sent the email" response without actually sending it is worse than no cache
  • Low-volume unique queries — caching overhead (embedding, lookup) costs more than the LLM call for one-off queries
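The list above can be encoded as a policy check that runs before any cache write. The traits are assumptions about what your request pipeline already knows about a query; the names here are hypothetical.

```python
# Sketch: a cacheability gate derived from the "what not to cache" list.
# A request must clear every check before its response is stored.
from dataclasses import dataclass

@dataclass
class RequestTraits:
    personalised: bool = False      # depends on user memory/profile
    time_sensitive: bool = False    # prices, availability, "today"
    contains_pii: bool = False      # caching adds retention obligations
    has_side_effects: bool = False  # agentic actions must never be replayed

def should_cache(traits: RequestTraits) -> bool:
    return not (
        traits.personalised
        or traits.time_sensitive
        or traits.contains_pii
        or traits.has_side_effects
    )
```

The low-volume-unique-query case is better handled by hit-rate monitoring than by a per-request flag: if a cache segment's hit rate stays near zero, retire it.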

Checklist: Do You Understand This?

  • What is the difference between native prompt caching and semantic caching — when do you use each?
  • What is the minimum prefix length for Anthropic prompt caching to activate, and what is the cost of a cache read vs a cache write?
  • Why must dynamic content (dates, user names) appear after the cached prefix — not before it?
  • What is the false positive risk in semantic caching, and how do you mitigate it?
  • What five events should trigger cache invalidation — and which cache type is affected by each?
  • How must the cache key be constructed in a multi-tenant system to prevent access control violations?