RAG Chatbot Architecture
The RAG (Retrieval-Augmented Generation) chatbot is the most widely deployed AI architecture in production today. It lets an LLM answer questions grounded in your private documents: without fine-tuning, with far less risk of hallucinating facts that are not in your corpus, and with citations linking every answer back to a source.
System Overview
A RAG chatbot has two distinct phases: the offline ingestion pipeline, which processes and indexes your documents, and the online query pipeline, which retrieves relevant context and generates grounded answers:
One offline pipeline plus one online pipeline: ingestion runs once (and again on document updates); the query pipeline runs per request
Ingestion Pipeline
Document loading
The first stage is loading raw documents from wherever they live. Most production RAG systems need multiple loaders:
- PDFs: pdfplumber or PyMuPDF for digital-native files; OCR (Tesseract, Mistral OCR) for scanned documents
- Web pages: Playwright or BeautifulSoup for HTML; handle JavaScript-rendered content
- Office documents: python-docx for Word, openpyxl for Excel, python-pptx for slides
- Databases / APIs: direct SQL query or API fetch, structured as text records
- Cloud storage: Google Drive, SharePoint, S3 connectors (many available via LlamaIndex)
Key output: plain text plus metadata (source URL, document title, page number, date, author). Preserve metadata: you need it for citations later.
Chunking strategy
Chunking is the most impactful design decision in a RAG system. Chunks too large lose retrieval precision; chunks too small lose context. Common strategies:
| Strategy | Chunk size | Best for | Trade-off |
|---|---|---|---|
| Fixed-size with overlap | 512–1024 tokens, 10–20% overlap | General text, narrative docs | May split mid-sentence; simple to implement |
| Sentence / paragraph boundary | Variable, 200–800 tokens | Articles, reports, manuals | More coherent chunks; irregular size complicates batching |
| Semantic chunking | Variable, embedding-based boundary detection | Long mixed-topic documents | Best chunk coherence; extra embedding compute cost |
| Hierarchical (parent-child) | Small child (128 tokens) + large parent (1024) | Detailed Q&A with broad context | Retrieve small, inject large parent; doubles storage |
| Structure-aware | By heading / section / code block | Documentation, code repos, legal docs | Requires document structure parsing |
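The first row of the table can be sketched in a few lines. Here "tokens" are approximated by a pre-split list; a real system would use the embedding model's own tokenizer (e.g. tiktoken) rather than whitespace splitting:

```python
def chunk_fixed(tokens, size=512, overlap=64):
    """Split a token list into fixed-size chunks, each sharing
    `overlap` tokens with the previous chunk."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # final chunk reached the end; avoid a tiny duplicate tail
    return chunks

# Toy usage: 100 "tokens", 40-token chunks, 8-token overlap
chunks = chunk_fixed(list(range(100)), size=40, overlap=8)
```

The overlap ensures a sentence cut at a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated storage.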
2025 recommendation
Use contextual chunking (Anthropic, 2024): before embedding each chunk, prepend a one-sentence AI-generated summary of the chunk's position in the document ("This chunk is from Section 3 of the Q3 2025 earnings report, discussing revenue by region"). Anthropic reports this reduces retrieval failures by up to 49% (when combined with BM25) by giving free-standing chunks the surrounding context they otherwise lack. The overhead is one LLM call per chunk at ingestion time.
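A minimal sketch of the ingestion-time step. The prompt wording is illustrative (not Anthropic's published prompt), and `contextualize` is a helper invented for this example:

```python
def contextual_prompt(document_excerpt: str, chunk: str) -> str:
    # Prompt sent to a cheap LLM once per chunk at ingestion time.
    # Wording is illustrative, not a published reference prompt.
    return (
        "<document>\n" + document_excerpt + "\n</document>\n"
        "Here is the chunk we want to situate within the document:\n"
        "<chunk>\n" + chunk + "\n</chunk>\n"
        "Give one short sentence situating this chunk within the document, "
        "to improve search retrieval. Answer with only the sentence."
    )

def contextualize(chunk: str, context_sentence: str) -> str:
    # Prepend the summary so it is embedded together with the chunk text.
    return context_sentence + "\n\n" + chunk
```

At query time nothing changes: the contextualized text was embedded, so retrieval simply benefits from the extra context.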
Embedding and vector storage
Each chunk is converted to a dense vector (embedding) and stored in a vector database. The embedding model must be the same at ingestion and query time.
| Embedding model | Dimensions | Notes |
|---|---|---|
| text-embedding-3-large (OpenAI) | 3,072 | Strong performance, $0.00013/1K tokens |
| text-embedding-3-small (OpenAI) | 1,536 | Roughly 6.5× cheaper, still good for most use cases |
| voyage-3 (Voyage AI) | 1,024 | State-of-the-art on MTEB, domain-specific variants |
| nomic-embed-text (local) | 768 | Strong open-source option; runs locally |
| mxbai-embed-large (local) | 1,024 | Top open-source on MTEB, runs on CPU |
Vector store options: Pinecone (managed, production-ready, serverless tier), Qdrant (open-source, self-host or cloud, fast), pgvector (Postgres extension: no extra database if you already use Postgres), Weaviate (multi-modal, hybrid search built-in), Chroma (simple local development).
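Under the hood, dense retrieval is nearest-neighbour search over embeddings. A brute-force sketch in plain Python; a real vector store replaces this linear scan with an approximate nearest-neighbour index:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (assumed non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def dense_search(query_vec, index, top_k=5):
    """index: list of (chunk_id, embedding) pairs.
    Returns the top_k chunk ids, best match first."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:top_k]]
```

Because similarity is computed between the query embedding and stored embeddings, the same embedding model must produce both, which is exactly the constraint stated above.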
Query Pipeline
Every user message passes through this sequence before an answer is generated:
Target: total latency <2 s including retrieval, reranking, and LLM TTFT
Query rewriting
Raw user questions are often poor retrieval queries. A small LLM call transforms the question before retrieval:
- Standalone query: remove conversational references ("what about the second one?" → "what are the pricing details for the Enterprise plan?")
- Multi-query expansion: generate 3–5 alternative phrasings of the same question, retrieve for all, deduplicate results
- HyDE (Hypothetical Document Embeddings): generate a hypothetical ideal answer, embed it, and retrieve documents similar to that ideal answer rather than to the question
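Multi-query expansion ends with a merge-and-deduplicate step. A minimal sketch, assuming each rephrasing has already returned a ranked list of chunk ids:

```python
def merge_multi_query(result_lists):
    """Merge retrieval results from several query rephrasings,
    deduplicating by chunk id while preserving first-seen order."""
    seen = set()
    merged = []
    for results in result_lists:
        for chunk_id in results:
            if chunk_id not in seen:
                seen.add(chunk_id)
                merged.append(chunk_id)
    return merged
```

First-seen order is the simplest policy; a production system would more likely fuse the ranked lists with RRF, as described under hybrid search below.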
Hybrid search
Dense vector search alone misses exact-match queries ("what does Section 4.2.1 say?"). BM25 keyword search alone misses semantic matches. Combine both:
- Run dense retrieval (cosine similarity in vector store) and BM25 keyword search in parallel
- Merge results using Reciprocal Rank Fusion (RRF), a score that rewards items ranked high in multiple result lists
- Take the top 20–50 merged candidates forward to the reranker
Pinecone, Weaviate, and Qdrant all support hybrid search natively.
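RRF itself is only a few lines. A sketch using the commonly chosen constant k = 60; each document's score is the sum of 1/(k + rank) over every list it appears in:

```python
def rrf(result_lists, k=60):
    """Reciprocal Rank Fusion over several ranked lists of doc ids.
    Returns doc ids sorted by fused score, best first."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["a", "b", "c"]   # dense retrieval, best first
bm25 = ["b", "d"]         # keyword retrieval, best first
fused = rrf([dense, bm25])
```

"b" wins because it appears near the top of both lists, even though neither retriever ranked it first; that is exactly the behaviour hybrid search relies on.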
Reranking
Vector similarity is fast but imprecise. A cross-encoder reranker takes the query and each candidate chunk together and scores how relevant the chunk actually is to the query; this is much more accurate than cosine distance, but too slow to apply to millions of documents. The two-stage approach (retrieve 50 → rerank → keep top 5) gives you both speed and precision.
Reranker options: Cohere Rerank 3.5 (API, best performance), FlashRank (open-source, fast local), BGE-Reranker (open-source, strong on BEIR benchmark). A reranker typically reduces retrieval failures by 30–40%.
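The two-stage shape can be sketched with a stand-in scorer. A real system would call a cross-encoder (Cohere Rerank, BGE-Reranker) where `overlap_score` appears; term overlap is used here only so the example runs without a model:

```python
def overlap_score(query, chunk):
    # Stand-in for a cross-encoder: fraction of query terms in the chunk.
    q_terms = set(query.lower().split())
    c_terms = set(chunk.lower().split())
    return len(q_terms & c_terms) / len(q_terms)

def two_stage_retrieve(query, retrieve_fn, score_fn, fetch_k=50, keep_k=5):
    """Stage 1: cheap retrieval of fetch_k candidates.
    Stage 2: precise (slower) scoring of just those candidates."""
    candidates = retrieve_fn(query, fetch_k)
    reranked = sorted(candidates, key=lambda c: score_fn(query, c),
                      reverse=True)
    return reranked[:keep_k]

docs = ["refund policy details", "shipping times", "our policy"]
top = two_stage_retrieve("refund policy",
                         retrieve_fn=lambda q, k: docs[:k],
                         score_fn=overlap_score,
                         fetch_k=3, keep_k=2)
```

The key property is that the expensive `score_fn` runs on at most `fetch_k` chunks, never on the whole corpus.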
Context assembly
After reranking, the top K chunks (typically 3β8) are assembled into a context block and injected into the generation prompt. Best practices:
- Include chunk metadata (document title, page, date) alongside content; it helps the LLM cite accurately
- Order chunks by relevance score, most relevant first
- Add a system instruction: "Answer using only the provided context. If the answer is not in the context, say so."
- Use XML tags to delimit each source:
<source id="1" title="...">...</source>
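The practices above can be sketched as one assembly function. The field names (`id`, `title`, `page`, `text`) are assumptions for this example:

```python
def build_context(chunks):
    """chunks: list of dicts with 'id', 'title', 'page', 'text',
    already ordered most-relevant first by the reranker."""
    instruction = ("Answer using only the provided context. "
                   "If the answer is not in the context, say so.")
    blocks = []
    for c in chunks:
        # XML tags delimit each source; metadata rides along for citations.
        blocks.append(
            f'<source id="{c["id"]}" title="{c["title"]}" page="{c["page"]}">\n'
            f'{c["text"]}\n'
            f'</source>'
        )
    return instruction + "\n\n" + "\n".join(blocks)
```

The numeric `id` attributes matter downstream: they are what the model's inline `[1]`-style references point back to.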
Generation and Citation Rendering
The LLM receives the context block plus the user question and generates a grounded answer. To enable citation rendering:
- Instruct the model to include inline source references: "As stated in [1], the policy requires..."
- Ask the model to output structured JSON:
{"answer": "...", "citations": [{"id": 1, "text": "..."}, ...]}
- The UI maps citation IDs back to the original chunks and renders clickable source links
- For streaming responses, stream the answer text first, then parse and render citations after the full response arrives
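Mapping inline references back to chunks is a small parsing step. A sketch assuming the `[n]` citation style from the first bullet:

```python
import re

def extract_citations(answer, chunks_by_id):
    """Find inline [n] references in the answer and resolve them to
    source chunks. Ids with no matching chunk are ignored, not raised,
    since models occasionally emit stray references."""
    cited_ids = {int(m) for m in re.findall(r"\[(\d+)\]", answer)}
    return {i: chunks_by_id[i] for i in sorted(cited_ids) if i in chunks_by_id}
```

The UI can then render each resolved id as a clickable link to the original document and page recorded in the chunk metadata.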
Critical guardrail: Even with RAG, LLMs can blend retrieved facts with training knowledge. Add a faithfulness check: use an LLM-as-judge to verify that every claim in the answer is supported by a cited chunk. Flag or suppress answers that fail the check. Production systems target >90% faithfulness.
Multi-Turn Conversation Memory
A basic RAG system retrieves using only the current question. This breaks on follow-up questions that reference prior turns ("explain that in simpler terms"). History-aware RAG:
- Maintain a conversation buffer (last 5–10 turns) in the session
- Before retrieval, use a small LLM call to rewrite the current query as a standalone question incorporating relevant context from conversation history
- Retrieve using the rewritten standalone query
- Include the (summarised) conversation history in the generation prompt for coherent multi-turn answers
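A sketch of the buffer plus the rewrite step; the prompt wording is illustrative:

```python
from collections import deque

class ConversationBuffer:
    """Keeps the last max_turns (role, text) pairs for a session."""

    def __init__(self, max_turns=10):
        self.turns = deque(maxlen=max_turns)  # oldest turns drop off

    def add(self, role, text):
        self.turns.append((role, text))

    def rewrite_prompt(self, question):
        """Prompt for a small LLM that turns a follow-up into a
        standalone retrieval query."""
        history = "\n".join(f"{role}: {text}" for role, text in self.turns)
        return (
            "Given the conversation below, rewrite the final user question "
            "as a standalone search query that needs no prior context.\n\n"
            f"{history}\n\nQuestion: {question}\nStandalone query:"
        )
```

The rewritten query goes to retrieval; the buffered history itself goes only into the generation prompt.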
Evaluation with RAGAS
RAGAS (Retrieval Augmented Generation Assessment) provides the standard evaluation framework for RAG chatbots. Four core metrics:
| Metric | What it measures | Target |
|---|---|---|
| Faithfulness | Are all answer claims supported by retrieved context? | >90% |
| Answer relevance | Does the answer actually address the question? | >85% |
| Context precision | How much of the retrieved context was actually useful? | >70% |
| Context recall | Did retrieval surface all the context needed to answer? | >80% |
Run RAGAS against a golden dataset of question/answer/source triples before shipping. Regression test on every significant change to chunking, embedding model, or retrieval configuration. Other evaluation tools: LangSmith, Arize Phoenix, DeepEval.
Streaming Responses
Users perceive RAG chatbots as slow because retrieval adds latency before generation begins. Streaming reduces perceived wait time:
- Show a typing indicator or "searching..." message immediately on query submission
- Stream LLM tokens to the UI as they arrive (Server-Sent Events or WebSockets)
- Render citation references after streaming completes: parse the inline markers and attach source links once the full answer has arrived
- Target first-token latency <2 seconds (retrieval + reranking + LLM TTFT combined)
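The token-streaming step can be sketched as a Server-Sent Events formatter; the `done` event name is an assumption for this example, chosen so the client knows when to parse citations:

```python
def sse_events(token_stream):
    """Wrap LLM tokens in SSE frames ('data: ...' followed by a blank
    line), then emit a terminal 'done' event so the client can parse
    and render citations after the full answer has arrived."""
    for token in token_stream:
        yield f"data: {token}\n\n"
    yield "event: done\ndata: [DONE]\n\n"
```

Because this is a generator, the web framework can flush each frame to the browser as soon as the LLM produces the token, which is what cuts perceived latency.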
Production Component Choices (2025–2026)
| Component | Managed / cloud option | Self-hosted option |
|---|---|---|
| Orchestration | LangChain / LlamaIndex cloud | LangChain, LlamaIndex (local) |
| Embedding | OpenAI text-embedding-3, Voyage AI | nomic-embed-text, mxbai-embed-large |
| Vector store | Pinecone, Weaviate Cloud | Qdrant, pgvector, Chroma |
| Reranker | Cohere Rerank 3.5 | BGE-Reranker, FlashRank |
| Generation LLM | Claude Sonnet / GPT-4o | Llama 3.1 70B via Ollama/vLLM |
| Observability | LangSmith, Arize Phoenix | Langfuse (self-hosted Docker) |
Checklist: Do You Understand This?
- What are the two main phases of a RAG system and which runs offline vs online?
- Why does chunking strategy matter so much, and what is the trade-off between chunk size and retrieval quality?
- What is hybrid search, and why is it better than dense-only or keyword-only retrieval?
- What does a reranker do and why is it applied after retrieval rather than replacing it?
- How does history-aware query rewriting enable multi-turn RAG conversations?
- Name the four RAGAS metrics and explain what each one measures.
- What faithfulness check prevents the LLM from blending retrieved facts with hallucinated claims?