RAG Chatbot Architecture
The RAG (Retrieval-Augmented Generation) chatbot is the most widely deployed AI architecture in production today. It lets an LLM answer questions grounded in your private documents: without fine-tuning, with far less risk of hallucinating facts that are not in your corpus, and with citations linking every answer back to a source.
System Overview
A RAG chatbot has two distinct phases: the offline ingestion pipeline, which processes and indexes your documents, and the online query pipeline, which retrieves relevant context and generates grounded answers:
One offline pipeline plus one online pipeline: ingestion runs once (and again on document updates); the query pipeline runs per request
Ingestion Pipeline
Document loading
The first stage is loading raw documents from wherever they live. Most production RAG systems need multiple loaders:
- PDFs: pdfplumber or PyMuPDF for digital-native files; OCR (Tesseract, Mistral OCR) for scanned documents
- Web pages: Playwright or BeautifulSoup for HTML; handle JavaScript-rendered content
- Office documents: python-docx for Word, openpyxl for Excel, python-pptx for slides
- Databases / APIs: direct SQL query or API fetch, structured as text records
- Cloud storage: Google Drive, SharePoint, S3 connectors (many available via LlamaIndex)
Key output: plain text plus metadata (source URL, document title, page number, date, author). Preserve metadata: you need it for citations later.
Chunking strategy
Chunking is the most impactful design decision in a RAG system. Chunks too large lose retrieval precision; chunks too small lose context. Common strategies:
| Strategy | Chunk size | Best for | Trade-off |
|---|---|---|---|
| Fixed-size with overlap | 512–1024 tokens, 10–20% overlap | General text, narrative docs | May split mid-sentence; simple to implement |
| Sentence / paragraph boundary | Variable, 200–800 tokens | Articles, reports, manuals | More coherent chunks; irregular size complicates batching |
| Semantic chunking | Variable, embedding-based boundary detection | Long mixed-topic documents | Best chunk coherence; extra embedding compute cost |
| Hierarchical (parent-child) | Small child (128 tokens) + large parent (1024) | Detailed Q&A with broad context | Retrieve small, inject large parent; doubles storage |
| Structure-aware | By heading / section / code block | Documentation, code repos, legal docs | Requires document structure parsing |
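The first row of the table can be sketched in a few lines. Here "tokens" are approximated by a pre-split list; a real system would use the embedding model's own tokenizer (e.g. tiktoken) rather than whitespace splitting:

```python
def chunk_fixed(tokens, size=512, overlap=64):
    """Split a token list into fixed-size chunks, each sharing
    `overlap` tokens with the previous chunk."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # final chunk reached the end; avoid a tiny duplicate tail
    return chunks

# Toy usage: 100 "tokens", 40-token chunks, 8-token overlap
chunks = chunk_fixed(list(range(100)), size=40, overlap=8)
```

The overlap ensures a sentence cut at a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated storage.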
2025 recommendation
Use contextual chunking (Anthropic, 2024): before embedding each chunk, prepend a one-sentence AI-generated summary of the chunk's position in the document ("This chunk is from Section 3 of the Q3 2025 earnings report, discussing revenue by region"). Anthropic reports this reduces retrieval failures by up to 49% (when combined with BM25) by giving free-standing chunks the surrounding context they otherwise lack. The overhead is one LLM call per chunk at ingestion time.
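A minimal sketch of the ingestion-time step. The prompt wording is illustrative (not Anthropic's published prompt), and `contextualize` is a helper invented for this example:

```python
def contextual_prompt(document_excerpt: str, chunk: str) -> str:
    # Prompt sent to a cheap LLM once per chunk at ingestion time.
    # Wording is illustrative, not a published reference prompt.
    return (
        "<document>\n" + document_excerpt + "\n</document>\n"
        "Here is the chunk we want to situate within the document:\n"
        "<chunk>\n" + chunk + "\n</chunk>\n"
        "Give one short sentence situating this chunk within the document, "
        "to improve search retrieval. Answer with only the sentence."
    )

def contextualize(chunk: str, context_sentence: str) -> str:
    # Prepend the summary so it is embedded together with the chunk text.
    return context_sentence + "\n\n" + chunk
```

At query time nothing changes: the contextualized text was embedded, so retrieval simply benefits from the extra context.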
Embedding and vector storage
Each chunk is converted to a dense vector (embedding) and stored in a vector database. The embedding model must be the same at ingestion and query time.
| Embedding model | Dimensions | Notes |
|---|---|---|
| text-embedding-3-large (OpenAI) | 3,072 | Strong performance, $0.00013/1K tokens |
| text-embedding-3-small (OpenAI) | 1,536 | Roughly 6.5× cheaper, still good for most use cases |
| voyage-3 (Voyage AI) | 1,024 | State-of-the-art on MTEB, domain-specific variants |
| nomic-embed-text (local) | 768 | Strong open-source option; runs locally |
| mxbai-embed-large (local) | 1,024 | Top open-source on MTEB, runs on CPU |
Vector store options: Pinecone (managed, production-ready, serverless tier), Qdrant (open-source, self-host or cloud, fast), pgvector (Postgres extension: no extra database if you already use Postgres), Weaviate (multi-modal, hybrid search built-in), Chroma (simple local development).
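Under the hood, dense retrieval is nearest-neighbour search over embeddings. A brute-force sketch in plain Python; a real vector store replaces this linear scan with an approximate nearest-neighbour index:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (assumed non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def dense_search(query_vec, index, top_k=5):
    """index: list of (chunk_id, embedding) pairs.
    Returns the top_k chunk ids, best match first."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:top_k]]
```

Because similarity is computed between the query embedding and stored embeddings, the same embedding model must produce both, which is exactly the constraint stated above.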
Query Pipeline
Every user message passes through this sequence before an answer is generated:
Target: total latency <2 s including retrieval, reranking, and LLM TTFT
Query rewriting
Raw user questions are often poor retrieval queries. A small LLM call transforms the question before retrieval:
- Standalone query: remove conversational references ("what about the second one?" → "what are the pricing details for the Enterprise plan?")
- Multi-query expansion: generate 3–5 alternative phrasings of the same question, retrieve for all, deduplicate results
- HyDE (Hypothetical Document Embeddings): generate a hypothetical ideal answer, embed it, and retrieve documents similar to that ideal answer rather than to the question
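Multi-query expansion ends with a merge-and-deduplicate step. A minimal sketch, assuming each rephrasing has already returned a ranked list of chunk ids:

```python
def merge_multi_query(result_lists):
    """Merge retrieval results from several query rephrasings,
    deduplicating by chunk id while preserving first-seen order."""
    seen = set()
    merged = []
    for results in result_lists:
        for chunk_id in results:
            if chunk_id not in seen:
                seen.add(chunk_id)
                merged.append(chunk_id)
    return merged
```

First-seen order is the simplest policy; a production system would more likely fuse the ranked lists with RRF, as described under hybrid search below.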
Hybrid search
Dense vector search alone misses exact-match queries ("what does Section 4.2.1 say?"). BM25 keyword search alone misses semantic matches. Combine both:
- Run dense retrieval (cosine similarity in vector store) and BM25 keyword search in parallel
- Merge results using Reciprocal Rank Fusion (RRF), a score that rewards items ranked high in multiple result lists
- Take the top 20–50 merged candidates forward to the reranker
Pinecone, Weaviate, and Qdrant all support hybrid search natively.
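RRF itself is only a few lines. A sketch using the commonly chosen constant k = 60; each document's score is the sum of 1/(k + rank) over every list it appears in:

```python
def rrf(result_lists, k=60):
    """Reciprocal Rank Fusion over several ranked lists of doc ids.
    Returns doc ids sorted by fused score, best first."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["a", "b", "c"]   # dense retrieval, best first
bm25 = ["b", "d"]         # keyword retrieval, best first
fused = rrf([dense, bm25])
```

"b" wins because it appears near the top of both lists, even though neither retriever ranked it first; that is exactly the behaviour hybrid search relies on.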
Reranking
Vector similarity is fast but imprecise. A cross-encoder reranker takes the query and each candidate chunk together and scores how relevant the chunk actually is to the query; this is much more accurate than cosine distance, but too slow to apply to millions of documents. The two-stage approach (retrieve 50 → rerank → keep top 5) gives you both speed and precision.
Reranker options: Cohere Rerank 3.5 (API, best performance), FlashRank (open-source, fast local), BGE-Reranker (open-source, strong on BEIR benchmark). A reranker typically reduces retrieval failures by 30–40%.
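The two-stage shape can be sketched with a stand-in scorer. A real system would call a cross-encoder (Cohere Rerank, BGE-Reranker) where `overlap_score` appears; term overlap is used here only so the example runs without a model:

```python
def overlap_score(query, chunk):
    # Stand-in for a cross-encoder: fraction of query terms in the chunk.
    q_terms = set(query.lower().split())
    c_terms = set(chunk.lower().split())
    return len(q_terms & c_terms) / len(q_terms)

def two_stage_retrieve(query, retrieve_fn, score_fn, fetch_k=50, keep_k=5):
    """Stage 1: cheap retrieval of fetch_k candidates.
    Stage 2: precise (slower) scoring of just those candidates."""
    candidates = retrieve_fn(query, fetch_k)
    reranked = sorted(candidates, key=lambda c: score_fn(query, c),
                      reverse=True)
    return reranked[:keep_k]

docs = ["refund policy details", "shipping times", "our policy"]
top = two_stage_retrieve("refund policy",
                         retrieve_fn=lambda q, k: docs[:k],
                         score_fn=overlap_score,
                         fetch_k=3, keep_k=2)
```

The key property is that the expensive `score_fn` runs on at most `fetch_k` chunks, never on the whole corpus.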
Context assembly
After reranking, the top K chunks (typically 3β8) are assembled into a context block and injected into the generation prompt. Best practices:
- Include chunk metadata (document title, page, date) alongside content; it helps the LLM cite accurately
- Order chunks by relevance score, most relevant first
- Add a system instruction: "Answer using only the provided context. If the answer is not in the context, say so."
- Use XML tags to delimit each source:
<source id="1" title="...">...</source>
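The practices above can be sketched as one assembly function. The field names (`id`, `title`, `page`, `text`) are assumptions for this example:

```python
def build_context(chunks):
    """chunks: list of dicts with 'id', 'title', 'page', 'text',
    already ordered most-relevant first by the reranker."""
    instruction = ("Answer using only the provided context. "
                   "If the answer is not in the context, say so.")
    blocks = []
    for c in chunks:
        # XML tags delimit each source; metadata rides along for citations.
        blocks.append(
            f'<source id="{c["id"]}" title="{c["title"]}" page="{c["page"]}">\n'
            f'{c["text"]}\n'
            f'</source>'
        )
    return instruction + "\n\n" + "\n".join(blocks)
```

The numeric `id` attributes matter downstream: they are what the model's inline `[1]`-style references point back to.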
Generation and Citation Rendering
The LLM receives the context block plus the user question and generates a grounded answer. To enable citation rendering:
- Instruct the model to include inline source references: "As stated in [1], the policy requires..."
- Ask the model to output structured JSON:
{"answer": "...", "citations": [{"id": 1, "text": "..."}, ...]}
- The UI maps citation IDs back to the original chunks and renders clickable source links
- For streaming responses, stream the answer text first, then parse and render citations after the full response arrives
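Mapping inline references back to chunks is a small parsing step. A sketch assuming the `[n]` citation style from the first bullet:

```python
import re

def extract_citations(answer, chunks_by_id):
    """Find inline [n] references in the answer and resolve them to
    source chunks. Ids with no matching chunk are ignored, not raised,
    since models occasionally emit stray references."""
    cited_ids = {int(m) for m in re.findall(r"\[(\d+)\]", answer)}
    return {i: chunks_by_id[i] for i in sorted(cited_ids) if i in chunks_by_id}
```

The UI can then render each resolved id as a clickable link to the original document and page recorded in the chunk metadata.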
Critical guardrail: Even with RAG, LLMs can blend retrieved facts with training knowledge. Add a faithfulness check: use an LLM-as-judge to verify that every claim in the answer is supported by a cited chunk. Flag or suppress answers that fail the check. Production systems target >90% faithfulness.
Multi-Turn Conversation Memory
A basic RAG system retrieves using only the current question. This breaks on follow-up questions that reference prior turns ("explain that in simpler terms"). History-aware RAG:
- Maintain a conversation buffer (last 5–10 turns) in the session
- Before retrieval, use a small LLM call to rewrite the current query as a standalone question incorporating relevant context from conversation history
- Retrieve using the rewritten standalone query
- Include the (summarised) conversation history in the generation prompt for coherent multi-turn answers
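A sketch of the buffer plus the rewrite step; the prompt wording is illustrative:

```python
from collections import deque

class ConversationBuffer:
    """Keeps the last max_turns (role, text) pairs for a session."""

    def __init__(self, max_turns=10):
        self.turns = deque(maxlen=max_turns)  # oldest turns drop off

    def add(self, role, text):
        self.turns.append((role, text))

    def rewrite_prompt(self, question):
        """Prompt for a small LLM that turns a follow-up into a
        standalone retrieval query."""
        history = "\n".join(f"{role}: {text}" for role, text in self.turns)
        return (
            "Given the conversation below, rewrite the final user question "
            "as a standalone search query that needs no prior context.\n\n"
            f"{history}\n\nQuestion: {question}\nStandalone query:"
        )
```

The rewritten query goes to retrieval; the buffered history itself goes only into the generation prompt.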
Evaluation with RAGAS
RAGAS (Retrieval Augmented Generation Assessment) provides the standard evaluation framework for RAG chatbots. Four core metrics:
| Metric | What it measures | Target |
|---|---|---|
| Faithfulness | Are all answer claims supported by retrieved context? | >90% |
| Answer relevance | Does the answer actually address the question? | >85% |
| Context precision | How much of the retrieved context was actually useful? | >70% |
| Context recall | Did retrieval surface all the context needed to answer? | >80% |
Run RAGAS against a golden dataset of question/answer/source triples before shipping. Regression test on every significant change to chunking, embedding model, or retrieval configuration. Other evaluation tools: LangSmith, Arize Phoenix, DeepEval.
Streaming Responses
Users perceive RAG chatbots as slow because retrieval adds latency before generation begins. Streaming reduces perceived wait time:
- Show a typing indicator or "searching..." message immediately on query submission
- Stream LLM tokens to the UI as they arrive (Server-Sent Events or WebSockets)
- Render citation references after streaming completes: parse the inline markers and attach source links once the full answer has arrived
- Target first-token latency <2 seconds (retrieval + reranking + LLM TTFT combined)
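The token-streaming step can be sketched as a Server-Sent Events formatter; the `done` event name is an assumption for this example, chosen so the client knows when to parse citations:

```python
def sse_events(token_stream):
    """Wrap LLM tokens in SSE frames ('data: ...' followed by a blank
    line), then emit a terminal 'done' event so the client can parse
    and render citations after the full answer has arrived."""
    for token in token_stream:
        yield f"data: {token}\n\n"
    yield "event: done\ndata: [DONE]\n\n"
```

Because this is a generator, the web framework can flush each frame to the browser as soon as the LLM produces the token, which is what cuts perceived latency.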
Production Component Choices (2025–2026)
| Component | Managed / cloud option | Self-hosted option |
|---|---|---|
| Orchestration | LangChain / LlamaIndex cloud | LangChain, LlamaIndex (local) |
| Embedding | OpenAI text-embedding-3, Voyage AI | nomic-embed-text, mxbai-embed-large |
| Vector store | Pinecone, Weaviate Cloud | Qdrant, pgvector, Chroma |
| Reranker | Cohere Rerank 3.5 | BGE-Reranker, FlashRank |
| Generation LLM | Claude Sonnet / GPT-4o | Llama 3.1 70B via Ollama/vLLM |
| Observability | LangSmith, Arize Phoenix | Langfuse (self-hosted Docker) |
Checklist: Do You Understand This?
- What are the two main phases of a RAG system and which runs offline vs online?
- Why does chunking strategy matter so much, and what is the trade-off between chunk size and retrieval quality?
- What is hybrid search, and why is it better than dense-only or keyword-only retrieval?
- What does a reranker do and why is it applied after retrieval rather than replacing it?
- How does history-aware query rewriting enable multi-turn RAG conversations?
- Name the four RAGAS metrics and explain what each one measures.
- What faithfulness check prevents the LLM from blending retrieved facts with hallucinated claims?