Intermediate

Document Ingestion

Document ingestion is the first stage of RAG — loading source documents and converting them into clean, indexable text. The quality of ingestion directly affects retrieval quality downstream. Garbage in, garbage out.

Raw Documents (PDF / HTML / DOCX / CSV) → Loader (extract text + metadata) → Cleaner (strip noise, normalise) → Chunker (split into pieces) → Embedder (convert to vectors) → Vector Store (index for retrieval)

Figure: Knowledge base ingestion pipeline
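The Cleaner and Chunker stages of the pipeline can be sketched with the standard library alone. This is a minimal illustration, not a recommended implementation: the `Doc` container, the `clean` and `chunk` helpers, and the `size`/`overlap` defaults are all hypothetical names and values chosen for the example, and the Embedder and Vector Store stages are left out.

```python
import re
import unicodedata
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical container: extracted text plus source metadata."""
    text: str
    metadata: dict = field(default_factory=dict)


def clean(text: str) -> str:
    """Cleaner stage: normalise Unicode and collapse whitespace noise."""
    text = unicodedata.normalize("NFKC", text)   # e.g. non-breaking space -> space
    text = re.sub(r"[ \t]+", " ", text)          # runs of spaces/tabs -> one space
    text = re.sub(r" ?\n ?", "\n", text)         # drop spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)       # keep at most one blank line
    return text.strip()


def chunk(doc: Doc, size: int = 200, overlap: int = 50) -> list[Doc]:
    """Chunker stage: fixed-size character windows with overlap.

    size and overlap are assumed defaults for illustration only.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(doc.text), step):
        piece = doc.text[start:start + size]
        # Carry the source metadata forward and record the chunk offset.
        chunks.append(Doc(piece, {**doc.metadata, "start": start}))
        if start + size >= len(doc.text):
            break  # the window already reached the end of the text
    return chunks
```

Carrying metadata through every stage, as `chunk` does here, is what lets a retrieved chunk be traced back to its source document later.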

Document Loader Libraries

Three popular Python libraries provide document loaders covering most common formats:

LangChain loaders (langchain-community) — a large collection of document loaders for PDFs, HTML, DOCX, CSV, JSON, Notion, Google Drive, Confluence, and many more. Good for quick prototyping; outputs Document objects with page_content and metadata.
LlamaIndex readers (llama-index) — similar scope; includes readers for S3, Slack, GitHub, database tables. More integrated with LlamaIndex's retrieval ecosystem.
Unstructured (unstructured library) — specialised in parsing complex document layouts. Better than generic loaders for PDFs with tables, multi-column layouts, or mixed content.

For simple prototypes, use LangChain loaders. For production systems with complex PDFs, evaluate Unstructured.

PDF Extraction

PDFs are the most common and most problematic format. Three approaches by quality:

Text layer extraction (pypdf, pdfplumber) — extracts existing text from the PDF text layer. Fast and cheap. Fails silently on scanned PDFs (image-only), produces garbled text from multi-column layouts, and loses most table structure.
pdfplumber — better than pypdf for table extraction; provides coordinate-based text extraction that handles multi-column layouts better. Start here for PDFs with tables.
OCR fallback (pytesseract or unstructured with OCR) — for scanned PDFs with no text layer. Significantly slower; quality depends on scan quality. Only use when text layer extraction fails.

Always check extracted text quality before indexing. A quick heuristic: if a page yields far fewer characters than a page of that kind normally holds (for example, a few dozen characters from what should be a full page of prose), the PDF is likely scanned and needs OCR.
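One way to turn the heuristic into a check is to count non-whitespace characters per page and flag the sparse ones. The sketch below assumes you already have one extracted string per page; the function name `pages_needing_ocr` and the `min_chars` threshold of 100 are hypothetical and should be tuned on your own documents.

```python
def pages_needing_ocr(page_texts, min_chars=100):
    """Return the indexes of pages whose extracted text looks too sparse.

    min_chars is an assumed threshold, not a recommended value.
    """
    flagged = []
    for index, text in enumerate(page_texts):
        # Treat a missing result (None) as empty text, and count only
        # non-whitespace characters so layout padding is ignored.
        dense_chars = sum(1 for ch in (text or "") if not ch.isspace())
        if dense_chars < min_chars:
            flagged.append(index)
    return flagged
```

With pdfplumber, the input could be built as `[page.extract_text() for page in pdf.pages]`. Pages that are legitimately sparse, such as title pages or blank separators, will be flagged too, so treat the result as a list of candidates for OCR rather than a verdict.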

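import pdfplumber

def extract_pdf(path: str) -> str:
    """Extract page text and tables from a PDF that has a text layer."""
    text_parts = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            # Extract the page's running text (skipped when nothing is found,
            # as on scanned pages).
            text = page.extract_text()
            if text:
                text_parts.append(text)
            # Extract tables as markdown-style rows. Note: extract_text() above
            # also returns the text inside tables, so table content appears
            # twice: once flattened, once as pipe-separated rows.
            for table in page.extract_tables():
                table_text = "\n".join(" | ".join(str(cell or "") for cell in row) for row in table)
                text_parts.append(table_text)
    return "\n\n".join(text_parts)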

HTML Cleaning

Raw HTML from web pages contains navigation, ads, footers, scripts, and styling — all noise that degrades embedding quality. Clean before indexing:

BeautifulSoup — extract specific elements (.main-content, article tags) and strip unwanted tags
trafilatura — purpose-built for extracting main article content from web pages; handles most news/blog/documentation sites well
readability-lxml — similar to trafilatura; works well for article content but less so for documentation with heavy code examples

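import trafilatura

def extract_html(url: str) -> str:
    # fetch_url returns None when the download fails
    downloaded = trafilatura.fetch_url(url)
    if downloaded is None:
        return ""
    # extract returns None when no main content is found
    return trafilatura.extract(downloaded, include_tables=True) or ""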
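To show what "stripping unwanted tags" means mechanically, here is a rough sketch that uses Python's built-in `html.parser` in place of BeautifulSoup, so it runs with no dependencies. The `NOISE_TAGS` set and the `strip_noise` helper are hypothetical names for this example. It is a crude fallback, not a substitute for trafilatura's main-content detection.

```python
from html.parser import HTMLParser

# Assumed list of elements to discard; adjust for the sites you ingest.
NOISE_TAGS = {"script", "style", "noscript", "nav", "header", "footer", "aside"}


class _TextExtractor(HTMLParser):
    """Collect text nodes that are not inside any noise element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside at least one noise element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def strip_noise(html: str) -> str:
    """Return visible text with scripts, styles, and page chrome removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # One text node per line; inline markup such as <b> therefore splits lines.
    return "\n".join(parser.parts)
```

Because every text node becomes its own line, a sentence containing inline tags is broken across lines. A real cleaner would rejoin inline content, which is one reason to prefer the purpose-built libraries listed above.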

DOCX and Office Formats

Word (.docx): python-docx extracts paragraphs and table cells. It does not extract images or embedded objects.
Excel (.xlsx): openpyxl reads cell values. Consider whether structured data is better handled via a SQL tool than by embedding.
PowerPoint (.pptx): python-pptx extracts text from slide shapes. Slide-per-chunk is usually the right chunking unit for presentations.
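For Word files, a minimal sketch of paragraph and table extraction with python-docx might look like this. The helper names `docx_to_text` and `load_docx` are hypothetical; the attributes used (`paragraphs`, `tables`, `rows`, `cells`, `text`) are the ones python-docx exposes on a loaded document.

```python
def docx_to_text(document) -> str:
    """Collect paragraph and table-cell text from a python-docx Document."""
    # Body paragraphs first, skipping empty ones.
    parts = [p.text for p in document.paragraphs if p.text.strip()]
    # Then each table row as one pipe-separated line.
    for table in document.tables:
        for row in table.rows:
            cells = [cell.text.strip() for cell in row.cells]
            parts.append(" | ".join(cells))
    return "\n".join(parts)


def load_docx(path: str) -> str:
    # Imported lazily so docx_to_text can be exercised without python-docx.
    from docx import Document  # provided by the python-docx package
    return docx_to_text(Document(path))
```

This version emits all paragraphs before all tables, so a table loses its position relative to the surrounding text. If that ordering matters for retrieval, the document body has to be walked in order instead.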

CSV and Structured Data

The key decision for structured data: embed or query?

Embed if: Each row has a text description field (product descriptions, article summaries, support tickets) and the primary query pattern is semantic search
Use SQL tool if: Queries require filtering, aggregation, or exact lookup (find all orders over $500 in Q3, count records by category)
Hybrid: Load structured data into a database; use a SQL MCP server for direct queries; optionally embed the description field for semantic search
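The hybrid option can be sketched with the standard library: load the rows into a database for filter and aggregation queries, and pull out only the description field as text to embed. Everything below is illustrative. The sample CSV is made up, the table and helper names are hypothetical, an in-memory SQLite database stands in for whatever database you use, and the embedding step itself is omitted.

```python
import csv
import io
import sqlite3

# Made-up sample data for illustration only.
CSV_DATA = """order_id,category,amount,description
1,audio,620.00,Noise-cancelling headphones with a travel case
2,audio,45.50,Replacement ear cushions
3,video,899.99,4K action camera with waterproof housing
"""


def load_rows(csv_text: str) -> list[dict]:
    """Parse CSV text into one dict per row, keyed by the header names."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def build_db(rows: list[dict]) -> sqlite3.Connection:
    """Query path: structured columns go into a SQL table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders "
        "(order_id INTEGER, category TEXT, amount REAL, description TEXT)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (:order_id, :category, :amount, :description)",
        rows,
    )
    return conn


def texts_to_embed(rows: list[dict]) -> list[tuple[str, str]]:
    """Embed path: only the free-text field, paired with the row ID."""
    return [(row["order_id"], row["description"]) for row in rows]


rows = load_rows(CSV_DATA)
conn = build_db(rows)

# Exact filtering and aggregation are answered by SQL, not by embeddings.
big_orders = conn.execute(
    "SELECT order_id FROM orders WHERE amount > 500 ORDER BY order_id"
).fetchall()
per_category = conn.execute(
    "SELECT category, COUNT(*) FROM orders GROUP BY category ORDER BY category"
).fetchall()
```

Keeping the row ID next to each description means a semantic search hit can be joined back to the full structured record in the database.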

Checklist: Do You Understand This?

LangChain / LlamaIndex loaders: quick prototyping; Unstructured: complex PDF layouts and tables
PDF: text layer extraction first (pypdf/pdfplumber); OCR fallback for scanned PDFs only
HTML: clean with trafilatura or BeautifulSoup before indexing — strip nav, ads, scripts
DOCX: python-docx; XLSX: openpyxl; PPTX: python-pptx (one chunk per slide)
CSV/structured data: embed text description fields; use SQL tool for filter/aggregation queries