🧠 All Things AI
Intermediate

State & Memory Management

LLMs are stateless by default — every API call starts from scratch. Building a chatbot that feels coherent across a conversation, let alone across sessions, requires you to manage memory explicitly. This is one of the most consequential architectural decisions in a chatbot build: the wrong memory strategy either bloats your context window and costs you money, or drops critical information and breaks continuity.

Why LLMs Need Explicit Memory

When you call an LLM API, the model has no memory of any previous call. It sees only what is in the current prompt. To give a chatbot continuity, you must include all relevant prior context in every prompt. This creates a fundamental tension:

| Problem | Consequence |
| --- | --- |
| Context window is finite (even with 1M tokens, sending everything is slow and expensive) | Cannot send the full conversation history indefinitely |
| Information shared early gets “pushed out” as the conversation grows | Bot forgets the user's name, goal, or constraints from turn 1 by turn 50 |
| Sessions are isolated — starting a new chat means starting from scratch | Returning users must re-explain their context every session |
| Different information ages differently — preferences persist longer than the last question asked | A single memory store cannot handle all information types efficiently |

Memory Taxonomy

Think of LLM memory in two dimensions: scope (within-session vs. across-session) and storage location (in-context vs. external).

| Storage | Memory type | What it holds | Scope |
| --- | --- | --- | --- |
| In-context | In-context buffer | Full history or sliding window — in prompt | Within-session only — lost when conversation ends |
| In-context | Summary buffer | Compressed rolling summary + recent turns | Within-session only — lost when conversation ends |
| External | Working state | Redis session store — task slots | Within-session only — scoped to the active task |
| External | User profile | DB — preferences, goals, facts | Cross-session — persists across conversations |
| External | Vector memory | Pinecone/pgvector — episodic + semantic | Cross-session — persists across conversations |

Each memory tier has different latency, cost, and durability characteristics

Strategy 1 — In-Context Window Management

The simplest memory strategy: include conversation history directly in the prompt. No external storage required. Three patterns exist within this approach.

Full History Buffer

How it works: Every message in the conversation is included verbatim in the prompt. System prompt + full history + new user message → LLM.
Best for: Short interactions (under ~20 turns), high-stakes tasks where precision matters, cases where you need the exact phrasing of past messages.
Trade-offs: Cost and latency grow linearly with conversation length. With very large context windows (Claude 3.7: 200K tokens; Gemini 2.5 Pro: 1M+ tokens) this is viable for longer sessions, but still expensive at scale.
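A full-history buffer takes only a few lines. The sketch below is a minimal illustration, assuming `call_llm` as a stand-in for your actual API client (it is not a real library call):

```python
def call_llm(messages):
    """Placeholder for a real chat-completion API call."""
    return f"(reply to: {messages[-1]['content']})"

class FullHistoryBuffer:
    """Keeps every turn verbatim and sends the whole transcript each call."""

    def __init__(self, system_prompt):
        self.messages = [{"role": "system", "content": system_prompt}]

    def send(self, user_text):
        self.messages.append({"role": "user", "content": user_text})
        reply = call_llm(self.messages)  # full history included every call
        self.messages.append({"role": "assistant", "content": reply})
        return reply

bot = FullHistoryBuffer("You are a helpful booking assistant.")
bot.send("Hi, I'm Dana.")
bot.send("Book a table for two.")
# After two turns the prompt already holds 5 messages:
# system + 2 user + 2 assistant — and it only grows from here.
```

Note that `self.messages` grows without bound — exactly the linear cost/latency growth described above.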

Sliding Window

How it works: Keep only the last N turns in the prompt. Older messages are dropped. Often combined with a pinned system prompt and optionally a pinned “user context” block at the top.
Best for: Long conversations where recent context is most relevant; cost-sensitive applications; customer support bots.
Trade-offs: Information from early in the conversation is permanently lost when it falls outside the window. A user's stated constraint from message 2 will be forgotten by message 20 if N is small.
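A sliding window is naturally expressed as a bounded deque. This sketch also shows the pinned “user context” block mentioned above, which survives regardless of window position (the class and field names are illustrative, not from any library):

```python
from collections import deque

class SlidingWindowBuffer:
    """Keep only the last `max_turns` user/assistant exchanges in the prompt;
    the system prompt and an optional pinned context block are always re-attached."""

    def __init__(self, system_prompt, max_turns=15, pinned_context=""):
        self.system_prompt = system_prompt
        self.pinned_context = pinned_context       # never falls out of the window
        self.turns = deque(maxlen=max_turns * 2)   # each turn = user + assistant msg

    def add(self, role, content):
        self.turns.append({"role": role, "content": content})
        # deque silently evicts the oldest message once maxlen is reached

    def build_prompt(self):
        system = self.system_prompt
        if self.pinned_context:
            system += "\n\nKnown user context:\n" + self.pinned_context
        return [{"role": "system", "content": system}, *self.turns]
```

The eviction happens automatically on `add` — which is precisely the failure mode to watch: anything not promoted into `pinned_context` is permanently lost once it leaves the deque.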

Summary Buffer (Hybrid)

How it works: Maintain two sections in the prompt: (1) a rolling LLM-generated summary of everything older than the window, and (2) the verbatim recent N turns. As the window fills, the oldest turn gets compressed into the summary. LangChain calls this ConversationSummaryBufferMemory.
Best for: Long sessions where early context still matters; graceful degradation (gist preserved) rather than hard cutoff.
Trade-offs: Summarisation introduces an extra LLM call on every window-fill event, adding latency and cost. The summary is lossy — nuance and exact phrasing are not preserved.
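The mechanics can be sketched without any framework. Here `summarize` is a placeholder for the extra LLM call the text describes — in the sketch it just appends a one-line gist so the example runs offline:

```python
def summarize(old_summary, dropped_turn):
    """Placeholder for the LLM summarisation call fired on window-fill."""
    gist = f"{dropped_turn['role']} said: {dropped_turn['content'][:40]}"
    return (old_summary + " " + gist).strip()

class SummaryBuffer:
    """Rolling summary of old turns + verbatim recent turns."""

    def __init__(self, max_recent=6):
        self.summary = ""        # lossy, compressed history
        self.recent = []         # exact recent turns
        self.max_recent = max_recent

    def add(self, role, content):
        self.recent.append({"role": role, "content": content})
        while len(self.recent) > self.max_recent:
            dropped = self.recent.pop(0)  # oldest turn leaves the window...
            self.summary = summarize(self.summary, dropped)  # ...into the summary

    def build_prompt(self, system_prompt):
        system = system_prompt
        if self.summary:
            system += "\n\nConversation so far (summary): " + self.summary
        return [{"role": "system", "content": system}, *self.recent]
```

Every `pop` that triggers `summarize` is an extra model call in production — batching several evictions into one summarisation call is a common cost optimisation.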

Strategy 2 — External Long-Term Memory

External memory persists across sessions. Information is written to a database during conversations and retrieved — selectively — into future prompts. This is the only way to give your chatbot genuine cross-session continuity.

Vector Store Memory

How it works: Conversation turns (or extracted facts) are embedded and stored in a vector database. At the start of each new turn, the user's message is embedded and used to retrieve the most semantically relevant memories. Retrieved memories are injected into the prompt as context.
What to store: Key facts the user has stated (preferences, goals, constraints), significant decisions made in past sessions, topic summaries. Not every message — be selective.
Trade-offs: Retrieval quality depends on embedding quality and chunk design. Irrelevant memories can pollute the context. Requires vector DB infrastructure and memory management logic.
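The store-then-retrieve loop looks like this. To keep the sketch self-contained, a toy character-trigram `embed` stands in for a real embedding model, and a Python list stands in for the vector DB — both are assumptions, not how you would do it in production:

```python
import math

def embed(text):
    """Toy embedding (character-trigram counts) standing in for a real model."""
    vec = {}
    t = text.lower()
    for i in range(len(t) - 2):
        tri = t[i:i + 3]
        vec[tri] = vec.get(tri, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class VectorMemory:
    """Embed facts on write; retrieve top-k above a score threshold on read."""

    def __init__(self, top_k=3, min_score=0.1):
        self.items = []          # (embedding, text) pairs — the "vector DB"
        self.top_k = top_k
        self.min_score = min_score  # guards against retrieval pollution

    def store(self, fact):
        self.items.append((embed(fact), fact))

    def retrieve(self, query):
        scored = [(cosine(embed(query), e), t) for e, t in self.items]
        scored = [(s, t) for s, t in scored if s >= self.min_score]
        return [t for s, t in sorted(scored, reverse=True)[:self.top_k]]

mem = VectorMemory()
mem.store("User prefers vegetarian restaurants")
mem.store("User's budget is under $50 per booking")
hits = mem.retrieve("any vegetarian options nearby?")
```

The `top_k` and `min_score` parameters are the two main levers against the retrieval-pollution failure mode discussed later.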

Structured / Key-Value Memory

How it works: Extract structured facts from conversations and store them in a conventional database. Examples: user name, location, language preference, account type, stated goals. Retrieved as a structured “user context” block injected at the top of every prompt.
Best for: Well-defined, predictable facts; deterministic retrieval without semantic search overhead.
Trade-offs: Requires a schema — you must know in advance what facts matter. Does not handle unstructured or emergent memory needs.
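A minimal sketch of the schema-first approach, with an in-memory dict standing in for the DB row (the field names are illustrative):

```python
# Fixed schema — you must decide up front which facts matter.
PROFILE_SCHEMA = ("name", "location", "language", "account_type", "goals")

def update_profile(profile, key, value):
    """Deterministic write: only schema fields are accepted."""
    if key not in PROFILE_SCHEMA:
        raise KeyError(f"unknown profile field: {key}")
    profile[key] = value

def render_context_block(profile):
    """Serialise known facts into the pinned block injected at prompt top."""
    lines = [f"- {k}: {profile[k]}" for k in PROFILE_SCHEMA if k in profile]
    return "Known user context:\n" + "\n".join(lines) if lines else ""
```

The `KeyError` on unknown fields is the trade-off made explicit: anything outside the schema simply has nowhere to go, which is why this pattern is usually paired with vector memory for emergent facts.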

Hybrid Memory Architecture (Production Pattern)

Production chatbots typically use three tiers simultaneously:

| Tier | Contents | Store | Inclusion |
| --- | --- | --- | --- |
| Working memory | Recent N turns + task slots | Redis + in-context | Always included |
| User profile | Name, prefs, account type | DB | Always included — small + deterministic |
| Episodic memory | Past session summaries, key decisions | Vector store | Retrieved per query, top-k |

The assembled prompt combines all three tiers → LLM.

Each tier has a different retrieval strategy — only working memory and user profile are always included
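The final assembly step is simple once each tier produces its own block. A sketch of the combiner, assuming the profile block and episodic memories have already been fetched by the code in the previous sections:

```python
def assemble_prompt(system_prompt, profile_block, episodic_memories, recent_turns):
    """Merge the three tiers into one message list for the LLM.

    - profile_block:     pre-rendered string from the user-profile DB (always in)
    - episodic_memories: top-k strings retrieved from the vector store
    - recent_turns:      working memory — list of {"role", "content"} dicts
    """
    system = system_prompt
    if profile_block:
        system += "\n\n" + profile_block
    if episodic_memories:
        system += "\n\nRelevant past context:\n" + "\n".join(
            f"- {m}" for m in episodic_memories)
    return [{"role": "system", "content": system}, *recent_turns]
```

Keeping the tiers in the system message (rather than interleaved with turns) makes it easy to audit exactly what memory the model saw on any given call.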

Working State — Task Bots Specifically

Task bots need a separate concept: working state — the current values of the slots being collected for an in-progress task.

What it contains: The schema of the current task (e.g. booking: {date, time, party_size, preferences}) and the current fill status of each slot (filled / empty / invalid).
Where to store it: Session-scoped key-value store (Redis with TTL, or a DB row keyed by session ID). Not in the LLM context alone — if the user refreshes or reconnects, state must survive.
How to inject it: At the start of each turn, serialise the current slot state into the system prompt: “Collected so far — date: 15 March, time: 7pm, party_size: not yet collected.” The LLM then asks only for missing slots.
Handle corrections: Users change their minds. If a user says “actually, make it 8pm”, the LLM must update the slot and confirm. Design the state schema to support updates, not just initial fills.
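The slot mechanics above can be sketched as plain functions over a dict (in production the dict would live in Redis under the session ID; the slot names are the booking example from the text):

```python
REQUIRED_SLOTS = ("date", "time", "party_size")  # the task schema

def update_slots(slots, **changes):
    """Apply initial fills AND corrections: a later value simply overwrites
    an earlier one, so 'actually, make it 8pm' just re-sets `time`."""
    slots.update({k: v for k, v in changes.items() if k in REQUIRED_SLOTS})
    return slots

def missing_slots(slots):
    return [s for s in REQUIRED_SLOTS if s not in slots]

def serialise_state(slots):
    """Render current fill status for injection into the system prompt."""
    parts = [f"{s}: {slots.get(s, 'not yet collected')}" for s in REQUIRED_SLOTS]
    return "Collected so far — " + ", ".join(parts)
```

Because corrections are just overwrites, the same code path handles “my party is 4 people” on turn 2 and “make that 5” on turn 9.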

What to Store — The Memory Selection Problem

Store: Explicitly stated preferences, stated goals, key decisions and their reasoning, corrections the user made to the bot's behaviour.
Store (conditionally): Session summaries at end of conversation — what was discussed, what was resolved, what remains open.
Do not store: Every message verbatim (creates noise and retrieval pollution), transient questions with no future relevance, information the user explicitly wants private.
Mem0 approach (2025): Use the LLM itself to evaluate what is worth remembering. After each turn (or batch), a secondary LLM call evaluates the conversation and extracts structured memories. Mem0 (mem0.ai) provides a managed memory layer implementing this pattern.
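The extract-after-each-turn loop can be sketched as follows. The real pattern uses a secondary LLM call as the judge; here a crude keyword heuristic stands in for that call so the example runs offline — it is not Mem0's actual API or logic:

```python
def extract_memories(turn_text):
    """Placeholder for the secondary LLM call that decides what is worth
    remembering. A keyword heuristic stands in for the model here."""
    markers = ("i prefer", "my goal", "always", "never")
    sentences = [s.strip() for s in turn_text.split(".")]
    return [s for s in sentences if any(m in s.lower() for m in markers)]

long_term = []
for turn in ["I prefer window seats. What time is it?",
             "My goal is to book before Friday."]:
    long_term.extend(extract_memories(turn))
# long_term keeps only the durable facts; "What time is it?" is discarded.
```

The key design point survives the simplification: extraction runs out-of-band (per turn or per batch), and only its output — not raw transcripts — enters long-term storage.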

Memory Libraries and Tools

| Tool | Type | Best for |
| --- | --- | --- |
| LangChain Memory | Library (Python) | Quick implementation of buffer, summary, and vector memory; integrates with LangChain chains |
| LangGraph Checkpointing | State graph (Python) | Persistent graph state across turns and sessions; production agentic systems |
| Mem0 | Managed memory service | Cross-session personalised memory with automatic extraction and retrieval |
| Redis | In-memory store | Session state (working memory), short-term buffer with TTL; also supports vector search (RedisVL) |
| Pinecone / Qdrant / pgvector | Vector database | Semantic long-term memory retrieval; episodic and semantic memory stores |

Failure Modes

Context window overflow mid-conversation

When the combined prompt + history + new message exceeds the model's context limit, the API throws an error — or silently truncates from the beginning. Always implement a token counter and a truncation/summarisation trigger before you hit the limit.
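A sketch of such a trigger, using a rough ~4-characters-per-token heuristic in place of a real tokenizer (such as tiktoken) — in production you would count tokens with the tokenizer matching your model:

```python
def estimate_tokens(text):
    """Rough heuristic (~4 chars/token); swap in a real tokenizer in production."""
    return max(1, len(text) // 4)

def trim_to_budget(messages, max_tokens, reserve=500):
    """Drop the oldest non-system messages until the prompt fits,
    keeping `reserve` tokens free for the model's reply."""
    budget = max_tokens - reserve

    def total(msgs):
        return sum(estimate_tokens(m["content"]) for m in msgs)

    msgs = list(messages)
    while total(msgs) > budget and len(msgs) > 2:
        msgs.pop(1)  # index 0 is the system prompt; evict the turn after it
    return msgs
```

Running this (or a summarisation trigger) before every API call turns a hard context-limit error into a predictable, testable degradation.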

Lost information at window boundary

Sliding window memory drops the oldest messages. If the user stated a key constraint at turn 2 and you are at turn 25 with a window of 15, that constraint is gone. Solution: extract key constraints into a pinned “user context” block always included regardless of window position.

Memory retrieval pollution

Retrieving too many or poorly ranked memories floods the context with irrelevant past conversations. Apply strict top-k limits (3–5 memories max), set a similarity score threshold, and test retrieval quality independently of generation quality.

Stale memory

User preferences change over time. Implement TTLs on stored memories, or include a last-updated timestamp. Allow users to explicitly correct or clear stored facts.
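A minimal expiry sweep over timestamped memories, assuming each stored memory carries a `last_updated` POSIX timestamp (the field name is illustrative):

```python
import time

def prune_stale(memories, max_age_seconds):
    """Drop memories whose `last_updated` timestamp is older than the TTL.
    In a real system this would be a DB query or a Redis key TTL."""
    now = time.time()
    return [m for m in memories if now - m["last_updated"] <= max_age_seconds]
```

Updating `last_updated` whenever a memory is retrieved and used turns this into an access-based TTL, matching the retention rule in the privacy section below.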

Not persisting working state server-side

Storing task state only in the LLM conversation thread means a network hiccup, browser refresh, or timeout loses all collected slots. Always persist working state to a server-side session store (Redis with a session TTL).

Privacy and Memory

Design memory with deletion in mind from day one. Users have the right to know what you remember about them and to request deletion. This has GDPR and CCPA implications in regulated markets.

• Maintain a user-facing memory view — show users what is stored about them
• Implement per-fact deletion, not just full wipe
• Do not store sensitive data (health, financial, authentication) without explicit consent and encryption at rest
• Apply retention TTLs — memories older than N months that have not been accessed should be expired automatically

Strategy Chooser

| Your situation | Recommended strategy |
| --- | --- |
| Short interactions, under 20 turns, precise recall needed | Full history buffer (in-context) |
| Long conversations, recent context most relevant, cost-sensitive | Sliding window (last N turns) |
| Long conversations where early context still matters | Summary buffer (ConversationSummaryBufferMemory) |
| Multi-step task with structured slots | Working state (Redis session store) + sliding window for conversation |
| Personalised assistant with returning users | Hybrid: structured user profile (DB) + vector episodic memory |
| You want managed memory without building it yourself | Mem0 managed memory layer |

Checklist: Do You Understand This?

  • Can you explain why LLMs need explicit memory management and what “stateless by default” means?
  • Can you describe the difference between in-context memory and external memory?
  • What is a sliding window, and what is the key failure mode it introduces?
  • How does a summary buffer improve on a pure sliding window?
  • Can you describe the three-tier hybrid memory architecture used in production chatbots?
  • What is working state in a task bot, and why should it be persisted server-side?
  • What is the “memory selection problem”, and how does Mem0 approach it?
  • Can you name three failure modes in memory management and how to mitigate them?
  • What privacy requirements should you design into a memory system from the start?