RAG Chatbot Architecture

A RAG (Retrieval-Augmented Generation) chatbot is one of the most widely deployed AI architectures in production today. It lets an LLM answer questions grounded in your private documents: no fine-tuning, a much lower risk of hallucinating facts that are not in your corpus, and citations linking every answer back to a source.

System Overview

A RAG chatbot has two distinct phases: the offline ingestion pipeline, which processes and indexes your documents, and the online query pipeline, which retrieves relevant context and generates grounded answers:

Ingestion (offline / batch):
Document Loader (PDF, HTML, Drive, S3) → Chunker (512 tokens + overlap) → Embedding Model (voyage-3 / text-embedding-3) → Vector Store (Pinecone / Qdrant / pgvector)

Query (online / per-request):
Query Rewriter (standalone + HyDE) → Hybrid Search (dense + BM25 → RRF) → Reranker (Cohere / BGE cross-encoder) → LLM + Citations (grounded generation)

Memory (multi-turn):
Conversation Buffer (last 5–10 turns) · History-Aware Rewrite (resolve references) · Session Store (Redis / DB)

One offline pipeline and one online pipeline: ingestion runs as a batch job; the query pipeline runs on every request.

Ingestion Pipeline

Document loading

The first stage is loading raw documents from wherever they live. Most production RAG systems need multiple loaders:

  • PDFs – pdfplumber or PyMuPDF for digital-native files; OCR (Tesseract, Mistral OCR) for scanned ones
  • Web pages – Playwright or BeautifulSoup for HTML; handle JavaScript-rendered content
  • Office documents – python-docx for Word, openpyxl for Excel, python-pptx for slides
  • Databases / APIs – direct SQL query or API fetch, structured as text records
  • Cloud storage – Google Drive, SharePoint, S3 connectors (many available via LlamaIndex)

Key output: plain text plus metadata (source URL, document title, page number, date, author). Preserve metadata; you need it for citations later.

Chunking strategy

Chunking is the most impactful design decision in a RAG system. Chunks too large lose retrieval precision; chunks too small lose context. Common strategies:

| Strategy | Chunk size | Best for | Trade-off |
|---|---|---|---|
| Fixed-size with overlap | 512–1024 tokens, 10–20% overlap | General text, narrative docs | May split mid-sentence; simple to implement |
| Sentence / paragraph boundary | Variable, 200–800 tokens | Articles, reports, manuals | More coherent chunks; irregular size complicates batching |
| Semantic chunking | Variable, embedding-based boundary detection | Long mixed-topic documents | Best chunk coherence; extra embedding compute cost |
| Hierarchical (parent-child) | Small child (128 tokens) + large parent (1024) | Detailed Q&A with broad context | Retrieve small, inject large parent; doubles storage |
| Structure-aware | By heading / section / code block | Documentation, code repos, legal docs | Requires document structure parsing |
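The first strategy in the table is simple enough to sketch directly. This toy chunker (the name chunk_fixed is illustrative) counts whitespace-separated words as a stand-in for real tokenizer tokens; a production pipeline would count tokens with the embedding model's own tokenizer:

```python
def chunk_fixed(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into fixed-size chunks with overlapping windows.

    Whitespace words stand in for tokenizer tokens here.
    """
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the document
    return chunks
```

The overlap means each chunk repeats the tail of its predecessor, so a sentence split by a window boundary still appears whole in at least one chunk.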

2025 recommendation

Use contextual chunking (Anthropic, 2024): before embedding each chunk, prepend a one-sentence AI-generated summary of the chunk's position in the document ("This chunk is from Section 3 of the Q3 2025 earnings report, discussing revenue by region"). This reduces retrieval failures by 49% for free-standing chunks that lack surrounding context. The overhead is one LLM call per chunk at ingestion time.
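A minimal sketch of the contextual-chunking idea, assuming any callable LLM client; the prompt wording and the contextualize helper are illustrative, not Anthropic's exact recipe:

```python
CONTEXT_PROMPT = """<document>
{document}
</document>
Here is a chunk from that document:
<chunk>
{chunk}
</chunk>
Write one sentence situating this chunk within the overall document."""


def contextualize(chunk: str, document: str, llm) -> str:
    """Prepend an LLM-written one-sentence context to a chunk before
    embedding it. `llm` is any callable mapping a prompt to a string."""
    context = llm(CONTEXT_PROMPT.format(document=document, chunk=chunk))
    return f"{context.strip()}\n\n{chunk}"
```

The contextualized text is what gets embedded and stored; at query time nothing changes.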

Embedding and vector storage

Each chunk is converted to a dense vector (embedding) and stored in a vector database. The embedding model must be the same at ingestion and query time.

| Embedding model | Dimensions | Notes |
|---|---|---|
| text-embedding-3-large (OpenAI) | 3,072 | Strong performance, $0.00013/1K tokens |
| text-embedding-3-small (OpenAI) | 1,536 | 5× cheaper, still good for most use cases |
| voyage-3 (Voyage AI) | 1,024 | State-of-the-art on MTEB, domain-specific variants |
| nomic-embed-text (local) | 768 | Strong open-source option; runs locally |
| mxbai-embed-large (local) | 1,024 | Top open-source on MTEB, runs on CPU |

Vector store options: Pinecone (managed, production-ready, serverless tier), Qdrant (open-source, self-host or cloud, fast), pgvector (Postgres extension: no extra database if you already use Postgres), Weaviate (multi-modal, hybrid search built-in), Chroma (simple local development).
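To make the store's contract concrete, here is a toy in-memory version with brute-force cosine similarity; a hypothetical stand-in for Pinecone/Qdrant/pgvector, not how you would search at scale:

```python
import math


class ToyVectorStore:
    """Brute-force in-memory stand-in for a real vector database."""

    def __init__(self):
        self.items = []  # (vector, chunk_text, metadata) triples

    def upsert(self, vector, text, metadata):
        self.items.append((vector, text, metadata))

    def query(self, vector, top_k=5):
        """Return the top_k (score, text, metadata) triples by cosine."""
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(y * y for y in b))
            return dot / (na * nb) if na and nb else 0.0

        scored = [(cosine(vector, v), t, m) for v, t, m in self.items]
        scored.sort(key=lambda s: s[0], reverse=True)
        return scored[:top_k]
```

The key invariant the toy shares with the real thing: the query vector must come from the same embedding model as the stored vectors, or the cosine scores are meaningless.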

Query Pipeline

Every user message passes through this sequence before an answer is generated:

1. User query – raw conversational message
2. Query rewriter – standalone + multi-query + HyDE
3. Hybrid search – dense + BM25 → RRF merge (top 50)
4. Reranker – cross-encoder scores the top 50, keeps the top 5–8
5. Context assembler – chunks + metadata + system instruction
6. LLM generation – grounded answer with inline citations
Target: total latency <2 s including retrieval, reranking, and LLM TTFT

Query rewriting

Raw user questions are often poor retrieval queries. A small LLM call transforms the question before retrieval:

  • Standalone query – remove conversational references ("what about the second one?" → "what are the pricing details for the Enterprise plan?")
  • Multi-query expansion – generate 3–5 alternative phrasings of the same question, retrieve for all, deduplicate results
  • HyDE (Hypothetical Document Embeddings) – generate a hypothetical ideal answer, embed it, and retrieve documents similar to that ideal answer rather than to the question
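These rewrites can be sketched around a single pluggable LLM callable; the prompt strings and the expand_queries helper are illustrative assumptions:

```python
MULTI_QUERY_PROMPT = (
    "Generate {n} alternative phrasings of this question, one per line.\n"
    "Question: {question}\nPhrasings:"
)

HYDE_PROMPT = (
    "Write a short passage that would be an ideal answer to this question. "
    "It is used only for retrieval, so plausible wording matters more than "
    "verified facts.\nQuestion: {question}\nPassage:"
)


def expand_queries(question: str, llm, n: int = 3) -> list[str]:
    """Return the original question plus multi-query and HyDE variants,
    deduplicated, ready for retrieval. `llm` maps a prompt to a string."""
    alts = llm(MULTI_QUERY_PROMPT.format(n=n, question=question)).splitlines()
    hyde = llm(HYDE_PROMPT.format(question=question))
    candidates = [question] + [a.strip() for a in alts if a.strip()] + [hyde.strip()]
    seen, queries = set(), []
    for q in candidates:  # deduplicate, preserving order
        if q.lower() not in seen:
            seen.add(q.lower())
            queries.append(q)
    return queries
```

Each returned query is retrieved independently; the result lists are then merged downstream (for example with RRF).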

Hybrid search

Dense vector search alone misses exact-match queries ("what does Section 4.2.1 say?"). BM25 keyword search alone misses semantic matches. Combine both:

  • Run dense retrieval (cosine similarity in vector store) and BM25 keyword search in parallel
  • Merge results using Reciprocal Rank Fusion (RRF), a score that rewards items ranked high in multiple result lists
  • Take the top 20–50 merged candidates forward to the reranker

Pinecone, Weaviate, and Qdrant all support hybrid search natively.
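RRF itself is only a few lines of pure Python. With the conventional constant k = 60, each document earns 1/(k + rank) from every list it appears in:

```python
def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion over several ranked lists of doc ids.

    A document's combined score is the sum of 1 / (k + rank) across
    the lists it appears in; higher combined score ranks first.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document that ranks near the top of both the dense and the BM25 list outranks one that tops only a single list, which is exactly the behaviour hybrid search wants.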

Reranking

Vector similarity is fast but imprecise. A cross-encoder reranker takes the query and each candidate chunk together and scores how relevant the chunk actually is to the query; this is much more accurate than cosine distance, but too slow to apply to millions of documents. The two-stage approach (retrieve 50 → rerank → keep top 5) gives you both speed and precision.

Reranker options: Cohere Rerank 3.5 (API, best performance), FlashRank (open-source, fast local), BGE-Reranker (open-source, strong on BEIR benchmark). A reranker typically reduces retrieval failures by 30–40%.
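The two-stage pattern can be sketched independently of any particular reranker; here search and score are pluggable stand-ins (in production, score would be a Cohere Rerank or local cross-encoder call):

```python
def retrieve_then_rerank(query: str, search, score,
                         fetch_k: int = 50, top_k: int = 5) -> list[str]:
    """Two-stage retrieval: a fast `search` over the whole corpus,
    then an accurate but slower pairwise `score` on the candidates.

    search: query -> ranked list of chunks (e.g. hybrid dense + BM25)
    score:  (query, chunk) -> relevance float (e.g. a cross-encoder)
    """
    candidates = search(query)[:fetch_k]          # cheap, corpus-wide
    candidates.sort(key=lambda chunk: score(query, chunk), reverse=True)
    return candidates[:top_k]                     # expensive, 50 pairs only
```

The expensive scorer runs on at most fetch_k pairs per query, which is why the pattern stays fast even over millions of documents.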

Context assembly

After reranking, the top K chunks (typically 3–8) are assembled into a context block and injected into the generation prompt. Best practices:

  • Include chunk metadata (document title, page, date) alongside content; it helps the LLM cite accurately
  • Order chunks by relevance score, most relevant first
  • Add a system instruction: "Answer using only the provided context. If the answer is not in the context, say so."
  • Use XML tags to delimit each source: <source id="1" title="...">...</source>
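Putting those practices together, a hypothetical assembler might look like this (the dict field names and instruction wording are illustrative):

```python
def assemble_context(chunks: list[dict]) -> str:
    """Format reranked chunks as XML-tagged sources behind a grounding
    instruction. Each chunk dict carries text, title, and page fields."""
    instruction = (
        "Answer using only the provided context. If the answer is not "
        "in the context, say so. Cite sources inline as [1], [2], ..."
    )
    sources = [
        f'<source id="{i}" title="{c["title"]}" page="{c["page"]}">\n'
        f'{c["text"]}\n</source>'
        for i, c in enumerate(chunks, start=1)  # chunks arrive sorted by relevance
    ]
    return instruction + "\n\n" + "\n".join(sources)
```

The numeric ids assigned here are the same ids the model is asked to cite, so the UI can map [n] straight back to a chunk.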

Generation and Citation Rendering

The LLM receives the context block plus the user question and generates a grounded answer. To enable citation rendering:

  • Instruct the model to include inline source references: "As stated in [1], the policy requires..."
  • Ask the model to output structured JSON: {"answer": "...", "citations": [{"id": 1, "text": "..."}, ...]}
  • The UI maps citation IDs back to the original chunks and renders clickable source links
  • For streaming responses, stream the answer text first, then parse and render citations after the full response arrives
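Mapping inline [n] references back to chunks is a small parsing step; this sketch assumes the bracket-number convention described above:

```python
import re


def extract_citations(answer: str, sources: list[str]) -> list[tuple[int, str]]:
    """Find inline [n] references in an answer and map them back to
    source chunks, in first-mention order, for rendering links.
    References outside the valid id range are ignored."""
    ids = []
    for match in re.finditer(r"\[(\d+)\]", answer):
        n = int(match.group(1))
        if n not in ids and 1 <= n <= len(sources):
            ids.append(n)
    return [(n, sources[n - 1]) for n in ids]
```

For streaming UIs, this runs once after the full response has arrived, as the last bullet above suggests.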

Critical guardrail: Even with RAG, LLMs can blend retrieved facts with training knowledge. Add a faithfulness check: use an LLM-as-judge to verify that every claim in the answer is supported by a cited chunk. Flag or suppress answers that fail the check. Production systems target >90% faithfulness.

Multi-Turn Conversation Memory

A basic RAG system retrieves using only the current question. This breaks on follow-up questions that reference prior turns ("explain that in simpler terms"). History-aware RAG:

  • Maintain a conversation buffer (last 5–10 turns) in the session
  • Before retrieval, use a small LLM call to rewrite the current query as a standalone question incorporating relevant context from conversation history
  • Retrieve using the rewritten standalone query
  • Include the (summarised) conversation history in the generation prompt for coherent multi-turn answers
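A minimal sketch of that loop, assuming a generic llm callable; the class name and prompt wording are illustrative:

```python
from collections import deque


class ConversationMemory:
    """Keeps the last `max_turns` turns and produces a standalone
    retrieval query via a small rewrite LLM call."""

    def __init__(self, max_turns: int = 8):
        # deque(maxlen=...) silently drops the oldest turn when full
        self.turns = deque(maxlen=max_turns)

    def add(self, role: str, text: str) -> None:
        self.turns.append((role, text))

    def standalone_query(self, message: str, llm) -> str:
        history = "\n".join(f"{role}: {text}" for role, text in self.turns)
        prompt = (
            "Rewrite the latest user message as a standalone question, "
            "resolving any references to the conversation history.\n\n"
            f"History:\n{history}\n\nLatest message: {message}\n\n"
            "Standalone question:"
        )
        return llm(prompt).strip()
```

Retrieval then uses the rewritten standalone question, while the generation prompt still receives the (summarised) history for conversational coherence.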

Evaluation with RAGAS

RAGAS (Retrieval Augmented Generation Assessment) provides the standard evaluation framework for RAG chatbots. Four core metrics:

| Metric | What it measures | Target |
|---|---|---|
| Faithfulness | Are all answer claims supported by retrieved context? | >90% |
| Answer relevance | Does the answer actually address the question? | >85% |
| Context precision | How much of the retrieved context was actually useful? | >70% |
| Context recall | Did retrieval surface all the context needed to answer? | >80% |

Run RAGAS against a golden dataset of question/answer/source triples before shipping. Regression test on every significant change to chunking, embedding model, or retrieval configuration. Other evaluation tools: LangSmith, Arize Phoenix, DeepEval.

Streaming Responses

Users perceive RAG chatbots as slow because retrieval adds latency before generation begins. Streaming reduces perceived wait time:

  • Show a typing indicator or "searching..." message immediately on query submission
  • Stream LLM tokens to the UI as they arrive (Server-Sent Events or WebSockets)
  • Render citation references progressively; parse and link them after streaming completes
  • Target first-token latency <2 seconds (retrieval + reranking + LLM TTFT combined)

Production Component Choices (2025–2026)

| Component | Managed / cloud option | Self-hosted option |
|---|---|---|
| Orchestration | LangChain / LlamaIndex cloud | LangChain, LlamaIndex (local) |
| Embedding | OpenAI text-embedding-3, Voyage AI | nomic-embed-text, mxbai-embed-large |
| Vector store | Pinecone, Weaviate Cloud | Qdrant, pgvector, Chroma |
| Reranker | Cohere Rerank 3.5 | BGE-Reranker, FlashRank |
| Generation LLM | Claude Sonnet / GPT-4o | Llama 3.1 70B via Ollama/vLLM |
| Observability | LangSmith, Arize Phoenix | Langfuse (self-hosted Docker) |

Checklist: Do You Understand This?

  • What are the two main phases of a RAG system and which runs offline vs online?
  • Why does chunking strategy matter so much, and what is the trade-off between chunk size and retrieval quality?
  • What is hybrid search, and why is it better than dense-only or keyword-only retrieval?
  • What does a reranker do and why is it applied after retrieval rather than replacing it?
  • How does history-aware query rewriting enable multi-turn RAG conversations?
  • Name the four RAGAS metrics and explain what each one measures.
  • What faithfulness check prevents the LLM from blending retrieved facts with hallucinated claims?