Intermediate
Document Ingestion
Loaders, parsers, and format handling: processing PDF, HTML, DOCX, and CSV files into indexable content.
What You Will Learn
- Document loaders: LangChain, LlamaIndex, custom loaders
- PDF extraction: text layers, OCR fallback, table extraction
- HTML cleaning: removing nav, ads, scripts before indexing
- DOCX and Office format handling
- CSV and structured data: when to embed rows vs. when to query them with a SQL tool
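As an illustration of the HTML cleaning item above, here is a minimal sketch that uses only Python's standard-library `html.parser`. The set of skipped tags (`SKIP_TAGS`) and the names `TextExtractor` and `clean_html` are assumptions made for this example, not part of any particular loader library. Real pages usually need site-specific rules as well, for instance to remove ads identified by class or id.

```python
from html.parser import HTMLParser

# Assumption: these tags usually hold navigation, scripts, or page chrome
# rather than content worth indexing. Adjust the set per site.
SKIP_TAGS = {"script", "style", "noscript", "nav", "header", "footer", "aside"}


class TextExtractor(HTMLParser):
    """Collects visible text, ignoring anything nested inside SKIP_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside at least one skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def clean_html(html: str) -> str:
    """Return the text of `html` with boilerplate elements removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = (
        "<nav><a href='/'>Home</a></nav>"
        "<h1>Guide</h1>"
        "<script>var x = 1;</script>"
        "<p>Body text</p>"
        "<footer>Footer links</footer>"
    )
    print(clean_html(sample))
```

Tracking a depth counter instead of a boolean flag keeps the extractor correct when skipped elements are nested, such as a `<script>` inside a `<nav>`. The sketch assumes reasonably well-formed markup. For messy real-world HTML, a more tolerant parser is likely a better choice.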
This page is under development. Content is being added progressively. Check back soon for the full article.