🧠 All Things AI
Advanced

Caching Strategies for AI Systems

LLM calls are slow and expensive. At enterprise scale, the same system prompts, the same FAQ questions, and the same document processing tasks repeat thousands of times per day. Caching is the single highest-leverage optimisation available — but the strategy depends on whether you are caching tokens, meanings, or responses.

Three Caching Layers

| Cache type | How it works | Best for | Typical saving |
|---|---|---|---|
| Prompt cache (native) | Provider caches the processed KV state of repeated prefix tokens; subsequent calls reuse the cache | Long repeated system prompts; large document context sent with every call | Up to 90% cost reduction on cached tokens; 85% latency reduction on the cached prefix |
| Semantic cache | Embed the incoming query; return the cached response if embedding similarity exceeds a threshold | FAQ-style queries; repeated intents phrased differently | 40-70% cache hit rate on FAQ workloads; full LLM call cost avoided on a hit |
| Response cache (exact) | Hash the input string; return the stored response on an exact match | Deterministic inputs: fixed report templates, scheduled processing, static lookups | 100% cost saving on a hit; limited to inputs that truly repeat exactly |

Native Prompt Caching (Anthropic and OpenAI)

Both Anthropic (Claude) and OpenAI support prompt caching for repeated prompt prefixes. The provider processes the prefix once and caches the internal KV state. Subsequent requests that share the same prefix reuse the cache instead of reprocessing.

Anthropic prompt caching

  • Mark cache breakpoints with "cache_control": {"type": "ephemeral"} in the messages API
  • Minimum cacheable prefix: 1,024 tokens (Opus/Sonnet) or 2,048 tokens (Haiku)
  • Cache duration: 5 minutes (refreshed on each hit)
  • Cache write cost: 1.25× base input price; cache read cost: 0.1× base input price
  • Works for: system prompts, large documents, tool definitions, few-shot examples
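A minimal sketch of marking a cache breakpoint in an Anthropic Messages API request, expressed as the raw request payload. The model id and the system prompt text are placeholders; in practice the system block must reach the minimum cacheable prefix length before the provider will cache it.

```python
# Sketch: a cache breakpoint in an Anthropic Messages API request body.
# Everything up to and including the block marked with cache_control is
# cached; the user turn after it is processed fresh on every call.

LONG_SYSTEM_PROMPT = "You are a helpful assistant..."  # imagine 1,024+ tokens here

request = {
    "model": "claude-sonnet-4-20250514",  # placeholder model id
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # cache breakpoint
        }
    ],
    "messages": [
        {"role": "user", "content": "What is our refund policy?"}
    ],
}
```

The same structure works for tool definitions and few-shot examples: anything you want cached must sit before the `cache_control` breakpoint.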

OpenAI prompt caching

  • Automatic — no explicit markup required; applies to prompts of 1,024 tokens or longer
  • Cache duration: 5-10 minutes of inactivity before eviction
  • Cache read cost: 50% discount on input tokens
  • Works for: system prompt prefix (must be at the start of messages array)
  • Cache hit reported in usage.prompt_tokens_details.cached_tokens
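Because OpenAI's caching is automatic, the only way to confirm it is working is to inspect the usage block of each response. A small sketch, using a stand-in dict shaped like the API's usage payload rather than a live call:

```python
# Sketch: confirming an OpenAI prompt-cache hit from the response usage block.
# `response` is a stand-in dict with the same shape as the API's usage payload.

response = {
    "usage": {
        "prompt_tokens": 2048,
        "prompt_tokens_details": {"cached_tokens": 1920},
    }
}

usage = response["usage"]
cached = usage["prompt_tokens_details"]["cached_tokens"]
hit_rate = cached / usage["prompt_tokens"]
print(f"cached {cached} of {usage['prompt_tokens']} prompt tokens ({hit_rate:.0%})")
```

Logging this ratio per request is the simplest way to catch a dynamic prefix silently breaking the cache.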

Setup requirement for maximum cache efficiency

```python
# Cache only activates when the prefix is IDENTICAL across requests.
#
# Requirements:
# 1. System prompt must be static (no dynamic inserts at the start)
# 2. Any dynamic content (user name, date) must come AFTER the cached prefix
# 3. Tool definitions must appear before the cache breakpoint if you want them cached

# Common mistake — this BREAKS the cache:
# system = f"Today is {date}. You are a helpful assistant..."  # dynamic prefix

# Correct — the date goes in the user message, not the system prompt prefix:
# system = "You are a helpful assistant..."                    # static; cacheable
# user = f"Today is {date}. User question: {question}"
```

Semantic Caching

Semantic caching answers "is this query similar enough to a previous query that the cached response is still correct?" — not just "is it identical?"

How it works

  • Embed incoming query using a fast embedding model
  • Search cached query embeddings for cosine similarity > threshold (typically 0.95)
  • On hit: return cached response immediately; no LLM call
  • On miss: call LLM, store query embedding + response in cache
  • Tooling: GPTCache (open source) / Redis with vector index / Weaviate cache layer

False positive risk

  • "What is the capital of France?" and "What is the capital of Spain?" may be semantically similar but need different answers
  • Similarity threshold must be tuned per use case — 0.95 is a starting point, not a rule
  • Do not use semantic cache for queries that include specific entities (names, dates, numbers) — exact match is safer
  • Always log cache hit/miss rate and sample hits for quality review
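One mitigation from the list above, sketched as a guard that bypasses the semantic cache for entity-heavy queries. The heuristics here (digits, capitalised mid-sentence words) are illustrative assumptions, not a production-grade named-entity pass.

```python
# Sketch: bypass the semantic cache for queries containing specific
# entities, where near-identical embeddings can still require different
# answers. Heuristics only; a real system might run lightweight NER.
import re

def semantic_cache_eligible(query: str) -> bool:
    if re.search(r"\d", query):
        return False  # dates, amounts, order IDs: exact match only
    # Capitalised words after the first token often name specific entities.
    if any(w[:1].isupper() for w in query.split()[1:]):
        return False
    return True
```

Queries that fail this check can still go through the exact-match response cache, where a hit is guaranteed correct.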

Cache Invalidation Triggers

| Trigger | Which cache to invalidate | Why |
|---|---|---|
| RAG corpus updated | Semantic cache (full flush or partial by topic) | Cached answers based on old retrieved context are now stale |
| System prompt changed | Native prompt cache (automatic — prefix no longer matches); semantic cache (policy answers may differ) | Behaviour of the model has changed; old responses no longer reflect current policy |
| Model version change | Semantic cache (full flush) | Different model may produce different responses to the same query |
| User data deletion (GDPR) | Response cache and semantic cache entries derived from that user's data | Cached responses may contain PII that must be deleted |
| TTL expiry | Response cache entries past their configured TTL | Time-sensitive answers (prices, availability, status) become stale |
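Two of these triggers (system prompt changed, model version change) can be handled without an explicit flush by folding the version identifiers into the cache key, so old entries simply stop matching. A sketch, where the version strings are placeholders:

```python
# Sketch: version-namespaced cache keys. Bumping the model or prompt
# version makes every old entry unreachable, which is an implicit flush;
# stale entries then age out via the cache's normal eviction.
import hashlib

def versioned_cache_key(query: str, model: str, prompt_version: str) -> str:
    raw = f"{model}|{prompt_version}|{query}"
    return hashlib.sha256(raw.encode()).hexdigest()

old = versioned_cache_key("refund policy?", "model-v1", "prompt-3")
new = versioned_cache_key("refund policy?", "model-v2", "prompt-3")
```

The trade-off is storage: superseded entries linger until evicted, so pair this with a TTL rather than relying on it alone.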

Cache Security — User-Level Access Control

Critical: never serve User A's cached response to User B

A cached response may contain information that user A is authorised to see but user B is not. If your cache key does not include the user's access scope, you will inadvertently expose restricted data. The cache key must encode: query hash + user access scope hash (or role). For document-level access control, the cache key must include the user's document permissions. Never use a global semantic cache for multi-tenant systems without scope isolation.

```python
# Correct cache key construction for multi-tenant systems.
# Note: Python's built-in hash() is unsuitable here — it raises TypeError
# on a list and is not stable across processes. Use a content hash instead.
import hashlib

def build_cache_key(query: str, user: User) -> str:
    scope_hash = hashlib.sha256(
        "|".join(sorted(user.document_permissions)).encode()
    ).hexdigest()
    query_hash = hashlib.sha256(query.encode()).hexdigest()
    return f"{scope_hash}:{query_hash}"

# Single-tenant or all-users-same-access systems can use the query hash alone:
# def build_cache_key(query: str) -> str:
#     return hashlib.sha256(query.encode()).hexdigest()  # only safe if all users have identical access
```

What Not to Cache

  • Personalised responses — answers that depend on user-specific data from memory or profile
  • Time-sensitive data — anything involving current prices, availability, live status, or "today"
  • Responses that include the user's own PII — caching creates additional retention obligations
  • Agentic action outputs — actions have side effects; returning a cached "I sent the email" response without actually sending it is worse than no cache
  • Low-volume unique queries — caching overhead (embedding, lookup) costs more than the LLM call for one-off queries
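The list above can be encoded as a policy check that runs before any cache write. The traits are assumptions about what your request pipeline already knows about a query; the names here are hypothetical.

```python
# Sketch: a cacheability gate derived from the "what not to cache" list.
# A request must clear every check before its response is stored.
from dataclasses import dataclass

@dataclass
class RequestTraits:
    personalised: bool = False      # depends on user memory/profile
    time_sensitive: bool = False    # prices, availability, "today"
    contains_pii: bool = False      # caching adds retention obligations
    has_side_effects: bool = False  # agentic actions must never be replayed

def should_cache(traits: RequestTraits) -> bool:
    return not (
        traits.personalised
        or traits.time_sensitive
        or traits.contains_pii
        or traits.has_side_effects
    )
```

The low-volume-unique-query case is better handled by hit-rate monitoring than by a per-request flag: if a cache segment's hit rate stays near zero, retire it.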

Checklist: Do You Understand This?

  • What is the difference between native prompt caching and semantic caching — when do you use each?
  • What is the minimum prefix length for Anthropic prompt caching to activate, and what is the cost of a cache read vs a cache write?
  • Why must dynamic content (dates, user names) appear after the cached prefix — not before it?
  • What is the false positive risk in semantic caching, and how do you mitigate it?
  • What five events should trigger cache invalidation — and which cache type is affected by each?
  • How must the cache key be constructed in a multi-tenant system to prevent access control violations?