State & Memory Management
LLMs are stateless by default — every API call starts from scratch. Building a chatbot that feels coherent across a conversation, let alone across sessions, requires you to manage memory explicitly. This is one of the most consequential architectural decisions in a chatbot build: the wrong memory strategy either bloats your context window and costs you money, or drops critical information and breaks continuity.
Why LLMs Need Explicit Memory
When you call an LLM API, the model has no memory of any previous call. It sees only what is in the current prompt. To give a chatbot continuity, you must include all relevant prior context in every prompt. This creates a fundamental tension:
| Problem | Consequence |
|---|---|
| Context window is finite (even with 1M tokens, sending everything is slow and expensive) | Cannot send the full conversation history indefinitely |
| Information shared early gets “pushed out” as the conversation grows | Bot forgets the user's name, goal, or constraints from turn 1 by turn 50 |
| Sessions are isolated — starting a new chat means starting from scratch | Returning users must re-explain their context every session |
| Different information ages differently — preferences persist longer than last question | A single memory store cannot handle all information types efficiently |
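Statelessness is easiest to see in code. The sketch below shows why a chatbot must replay all relevant history on every call; the message format follows the common OpenAI-style chat schema (an assumption for illustration, not tied to one vendor), and the actual API call is left as a placeholder.

```python
def build_request(system_prompt, history, user_message):
    """Assemble the complete message list for a single stateless call."""
    return (
        [{"role": "system", "content": system_prompt}]
        + history                        # every prior turn, replayed
        + [{"role": "user", "content": user_message}]
    )

history = []
for turn in ["My name is Ada.", "What is my name?"]:
    messages = build_request("You are a helpful assistant.", history, turn)
    # reply = client.chat.completions.create(model=..., messages=messages)
    reply = "(model reply)"              # placeholder: no API call made here
    history += [{"role": "user", "content": turn},
                {"role": "assistant", "content": reply}]
```

On the second turn the request already carries three prior messages; without them, the model could not answer "What is my name?".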
Memory Taxonomy
Think of LLM memory in two dimensions: scope (within-session vs. across-session) and storage location (in-context vs. external).
Each memory tier has different latency, cost, and durability characteristics
Strategy 1 — In-Context Window Management
The simplest memory strategy: include conversation history directly in the prompt. No external storage required. Three patterns exist within this approach.
Full History Buffer
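A minimal sketch of a full history buffer (illustrative class name, not a library API): every turn is kept verbatim, which gives precise recall at the cost of unbounded growth.

```python
class FullHistoryBuffer:
    """Keep every turn verbatim; precise recall, unbounded token growth."""

    def __init__(self):
        self.messages = []

    def add(self, role, content):
        self.messages.append({"role": role, "content": content})

    def render(self):
        # The entire history goes into every prompt.
        return list(self.messages)
```

Suitable only for short transactions; token cost grows linearly with conversation length.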
Sliding Window
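A sliding window keeps only the last N messages. A `deque` with `maxlen` gives this behaviour for free (a sketch, not a library implementation):

```python
from collections import deque

class SlidingWindowMemory:
    """Keep only the last `max_turns` messages; older turns are dropped."""

    def __init__(self, max_turns=10):
        self.messages = deque(maxlen=max_turns)

    def add(self, role, content):
        # Appending beyond maxlen silently evicts the oldest message.
        self.messages.append({"role": role, "content": content})

    def render(self):
        return list(self.messages)
```

Cost is bounded, but anything said before the window start is gone; see the "lost information at window boundary" failure mode below.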
Summary Buffer (Hybrid)
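The hybrid pattern keeps recent turns verbatim and folds older turns into a running summary. In the sketch below, `summarize_fn` is a stand-in for an LLM summarisation call (an assumption; in production this would itself be a model request):

```python
class SummaryBufferMemory:
    """Recent turns kept verbatim; older turns folded into a summary.

    `summarize_fn(previous_summary, overflow_messages)` stands in for an
    LLM call that condenses dropped messages into an updated summary.
    """

    def __init__(self, summarize_fn, keep_last=6):
        self.summarize_fn = summarize_fn
        self.keep_last = keep_last
        self.summary = ""
        self.messages = []

    def add(self, role, content):
        self.messages.append({"role": role, "content": content})
        if len(self.messages) > self.keep_last:
            overflow = self.messages[:-self.keep_last]
            self.messages = self.messages[-self.keep_last:]
            self.summary = self.summarize_fn(self.summary, overflow)

    def render(self):
        head = ([{"role": "system", "content": f"Summary so far: {self.summary}"}]
                if self.summary else [])
        return head + list(self.messages)
```

Early context survives in compressed form while the token cost stays bounded.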
LangChain implements this pattern as ConversationSummaryBufferMemory.

Strategy 2 — External Long-Term Memory
External memory persists across sessions. Information is written to a database during conversations and retrieved — selectively — into future prompts. This is the only way to give your chatbot genuine cross-session continuity.
Vector Store Memory
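The core mechanics can be sketched with a toy in-memory index: store (embedding, text) pairs, retrieve by cosine similarity. Here `embed` is a stand-in for a real embedding model (a sentence-transformer or an embeddings API); any vector-producing callable works for illustration.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class VectorMemory:
    """Toy semantic memory: store embedded texts, retrieve by similarity."""

    def __init__(self, embed):
        self.embed = embed
        self.items = []                  # list of (vector, text)

    def store(self, text):
        self.items.append((self.embed(text), text))

    def retrieve(self, query, k=3):
        qv = self.embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(qv, it[0]),
                        reverse=True)
        return [text for _, text in ranked[:k]]
```

A production system would replace the list scan with a vector database (Pinecone, Qdrant, pgvector), but the store/retrieve contract is the same.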
Structured / Key-Value Memory
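Structured memory stores explicit fields rather than free text, which makes lookup, update, audit, and deletion straightforward. A minimal sketch (illustrative class name; in production this would be backed by a database table):

```python
import time

class UserProfileStore:
    """Structured long-term memory: explicit fields with direct lookup.

    Explicit per-user storage makes GDPR/CCPA deletion requests a
    single operation rather than a search problem.
    """

    def __init__(self):
        self._profiles = {}   # user_id -> {field: {"value", "updated_at"}}

    def set(self, user_id, field, value):
        self._profiles.setdefault(user_id, {})[field] = {
            "value": value, "updated_at": time.time()}

    def get(self, user_id):
        return {f: v["value"]
                for f, v in self._profiles.get(user_id, {}).items()}

    def delete_user(self, user_id):
        self._profiles.pop(user_id, None)   # deletion by design
```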
Hybrid Memory Architecture (Production Pattern)
Production chatbots typically use three tiers simultaneously:
Each tier has a different retrieval strategy — only working memory and user profile are always included
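Prompt assembly across the tiers might look like the sketch below (function and block names are illustrative): the user profile and recent turns are always included, while episodic memories come from a vector search against the current message and are capped at top-k.

```python
def assemble_prompt(system_prompt, profile, recent_turns,
                    retrieved_memories, user_message):
    """Combine the three memory tiers into one prompt."""
    profile_block = "Known user facts:\n" + "\n".join(
        f"- {k}: {v}" for k, v in profile.items())
    memory_block = ("Possibly relevant past context:\n" + "\n".join(
        f"- {m}" for m in retrieved_memories)) if retrieved_memories else ""
    system = "\n\n".join(
        part for part in [system_prompt, profile_block, memory_block] if part)
    return ([{"role": "system", "content": system}]
            + recent_turns
            + [{"role": "user", "content": user_message}])
```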
Working State — Task Bots Specifically
Task bots need a separate concept: working state — the current values of the slots being collected for an in-progress task.
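Working state is just the slot values collected so far, held separately from the transcript. A sketch for a hypothetical flight-booking task (slot names are illustrative):

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class BookingState:
    """Slots collected so far for one in-progress booking task."""
    origin: Optional[str] = None
    destination: Optional[str] = None
    date: Optional[str] = None

    def missing_slots(self):
        # Drives the next question the bot should ask.
        return [k for k, v in asdict(self).items() if v is None]

    def to_json(self):
        return json.dumps(asdict(self))

# Persist server-side keyed by session, e.g. with Redis:
#   r.setex(f"session:{session_id}", 1800, state.to_json())
# so a browser refresh or timeout does not lose collected slots.
```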
What to Store — The Memory Selection Problem
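To make the selection problem concrete, here is a deliberately crude heuristic filter; it is illustrative only. Production systems (Mem0 among them) typically run an LLM extraction pass over the conversation to decide what is durable enough to persist.

```python
# Phrases that suggest a durable fact or preference rather than
# transient chit-chat. Purely illustrative, not an exhaustive list.
DURABLE_MARKERS = ("my name is", "i prefer", "i am allergic",
                   "i live in", "always", "never")

def worth_storing(message: str) -> bool:
    """Crude filter: persist statements that look like lasting facts."""
    text = message.lower()
    return any(marker in text for marker in DURABLE_MARKERS)
```

The point is the decision boundary, not the mechanism: storing everything pollutes retrieval, storing nothing defeats long-term memory.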
Memory Libraries and Tools
| Tool | Type | Best for |
|---|---|---|
| LangChain Memory | Library (Python) | Quick implementation of buffer, summary, and vector memory; integrates with LangChain chains |
| LangGraph Checkpointing | State graph (Python) | Persistent graph state across turns and sessions; production agentic systems |
| Mem0 | Managed memory service | Cross-session personalised memory with automatic extraction and retrieval |
| Redis | In-memory store | Session state (working memory), short-term buffer with TTL; also supports vector search (RedisVL) |
| Pinecone / Qdrant / pgvector | Vector database | Semantic long-term memory retrieval; episodic and semantic memory stores |
Failure Modes
Context window overflow mid-conversation
When the combined prompt + history + new message exceeds the model's context limit, the API throws an error — or silently truncates from the beginning. Always implement a token counter and a truncation/summarisation trigger before you hit the limit.
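A minimal version of that trigger, using a rough characters-per-token heuristic (swap in a real tokenizer such as tiktoken for accurate counts):

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def trim_to_budget(messages, budget_tokens, keep_first=1):
    """Drop oldest messages (after the first `keep_first`, typically the
    system prompt) until the estimated total fits the budget."""
    msgs = list(messages)
    def total():
        return sum(estimate_tokens(m["content"]) for m in msgs)
    while total() > budget_tokens and len(msgs) > keep_first + 1:
        del msgs[keep_first]     # drop the oldest non-system message
        # (a summariser could be invoked here instead of hard-dropping)
    return msgs
```

Run this before every API call, with the budget set safely below the model's context limit to leave room for the response.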
Lost information at window boundary
Sliding window memory drops the oldest messages. If the user stated a key constraint at turn 2 and you are at turn 25 with a window of 15, that constraint is gone. Solution: extract key constraints into a pinned “user context” block always included regardless of window position.
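The pinned-block fix is a small amount of prompt assembly (a sketch; block wording is illustrative):

```python
def render_prompt(pinned_facts, window_messages):
    """Pinned user-context block is always included, regardless of how
    many old turns the sliding window has dropped."""
    pinned = "User context (always keep):\n" + "\n".join(
        f"- {fact}" for fact in pinned_facts)
    return [{"role": "system", "content": pinned}] + list(window_messages)
```

The extraction step that populates `pinned_facts` (spotting constraints like budgets or deadlines as they are stated) can itself be an LLM call or a rule-based pass.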
Memory retrieval pollution
Retrieving too many or poorly ranked memories floods the context with irrelevant past conversations. Apply strict top-k limits (3–5 memories max), set a similarity score threshold, and test retrieval quality independently of generation quality.
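Both guards fit in a few lines. `scored_memories` is assumed to be (similarity, text) pairs as returned by a vector search:

```python
def select_memories(scored_memories, k=3, min_score=0.75):
    """Apply a similarity threshold and a strict top-k cap so that weak
    matches never reach the prompt."""
    passing = [(s, t) for s, t in scored_memories if s >= min_score]
    passing.sort(key=lambda st: st[0], reverse=True)
    return [text for _, text in passing[:k]]
```

Tune `min_score` empirically against a labelled retrieval test set, separately from any generation evaluation.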
Stale memory
User preferences change over time. Implement TTLs on stored memories, or include a last-updated timestamp. Allow users to explicitly correct or clear stored facts.
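A timestamp-based TTL filter might look like this (each stored memory is assumed to carry a `stored_at` field written at save time):

```python
import time

def fresh_memories(memories, max_age_seconds, now=None):
    """Drop memories older than the TTL before they reach the prompt."""
    now = time.time() if now is None else now
    return [m for m in memories
            if now - m["stored_at"] <= max_age_seconds]
```

Alternatively, keep stale memories but surface their age to the model ("last updated 8 months ago") so it can hedge.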
Not persisting working state server-side
Storing task state only in the LLM conversation thread means a network hiccup, browser refresh, or timeout loses all collected slots. Always persist working state to a server-side session store (Redis with a session TTL).
Privacy and Memory
Design memory with deletion in mind from day one. Users have the right to know what you remember about them and to request deletion. This has GDPR and CCPA implications in regulated markets.
Strategy Chooser
| Your situation | Recommended strategy |
|---|---|
| Short transactions, under 20 turns, precise recall needed | Full history buffer (in-context) |
| Long conversations, recent context most relevant, cost-sensitive | Sliding window (last N turns) |
| Long conversations where early context still matters | Summary buffer (ConversationSummaryBufferMemory) |
| Multi-step task with structured slots | Working state (Redis session store) + sliding window for conversation |
| Personalised assistant with returning users | Hybrid: structured user profile (DB) + vector episodic memory |
| You want managed memory without building it yourself | Mem0 managed memory layer |
Checklist: Do You Understand This?
- Can you explain why LLMs need explicit memory management and what “stateless by default” means?
- Can you describe the difference between in-context memory and external memory?
- What is a sliding window, and what is the key failure mode it introduces?
- How does a summary buffer improve on a pure sliding window?
- Can you describe the three-tier hybrid memory architecture used in production chatbots?
- What is working state in a task bot, and why should it be persisted server-side?
- What is the “memory selection problem”, and how does Mem0 approach it?
- Can you name three failure modes in memory management and how to mitigate them?
- What privacy requirements should you design into a memory system from the start?