Web Dev Tools

Document Parsing for RAG

Turning PDFs, DOCX, HTML, and slide decks into clean chunks for retrieval.

The hardest part of most RAG pipelines isn't the LLM — it's getting good text out of PDFs and Word docs. This category exists to solve that.

Open source / self-host parsers

  • Unstructured.io — open core (Python); the most popular OSS parser. PDF / DOCX / HTML / EPUB / images / emails. Hosted API also available.
  • marker (Vik Paruchuri) — PDF → Markdown via an ML model; very high quality on academic / structured documents.
  • docling (IBM) — open-source; DOCX, PDF, PPTX, HTML, images; layout-aware.
  • MegaParse — open-source; aggressive PDF structure preservation.
  • pdfplumber — Python; great for tables.
  • PyMuPDF / mupdf-js — fast PDF text extraction; available in JS via @mupdf/mupdfjs.
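Whichever parser you pick, raw PDF text usually needs light cleanup before chunking: words hyphenated across line breaks, hard-wrapped lines, stray whitespace. A minimal stdlib sketch (the heuristics are illustrative, not from any of the libraries above):

```python
import re

def clean_extracted_text(raw: str) -> str:
    """Light post-processing for parser output: re-join words hyphenated
    across line breaks, collapse intra-paragraph newlines, and normalize
    whitespace. Heuristics only; complex layouts still need a
    layout-aware parser."""
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)   # de-hyphenate line breaks
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)  # single newline -> space
    text = re.sub(r"[ \t]+", " ", text)           # collapse runs of spaces
    text = re.sub(r"\n{3,}", "\n\n", text)        # cap blank lines at one break
    return text.strip()
```

This keeps paragraph breaks (double newlines) intact while flattening the hard wrapping most extractors emit.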

Hosted document AI

  • LlamaParse (LlamaIndex) — gold-standard hosted PDF/DOCX parser; handles complex tables and scanned docs gracefully. Free tier.
  • Mindee — receipts, invoices, IDs, custom doc extraction. Free tier.
  • Reducto — newer, very accurate hosted parser.
  • Anthropic Files API, OpenAI Files / Responses with file_search — let the model handle parsing inside the API.
  • AWS Textract, Google Document AI, Azure Document Intelligence — cloud heavyweights.
  • Veryfi / Klippa / Rossum — vertical-specific (receipts, invoices).

TypeScript-native (run in Node / Workers)

  • pdf-parse — minimal PDF→text in Node.
  • pdf2json — older alternative.
  • pdf-lib — for PDF manipulation; not text extraction.
  • @mupdf/mupdfjs — MuPDF in WASM; fast text + images extraction.
  • unpdf (UnJS) — PDF utilities for serverless / edge runtimes.
  • @vendia/pdf-extractor — focused on edge runtimes.
  • mammoth — DOCX → HTML / markdown.
  • @docusaurus/utils mdast parsing — for markdown.
  • turndown — HTML → Markdown.
  • @mozilla/readability — extract main article content from HTML.

OCR fallback for scanned PDFs

See OCR & Computer Vision — Tesseract.js, Mindee, Textract.

LLMs as parsers

By 2026, LLMs (Claude, GPT-4o, Gemini) are accurate enough that, for many use cases, the simplest pipeline is:

  1. PDF → page images.
  2. Send pages to a vision-capable LLM with a "convert to markdown" prompt.
  3. Cache the result keyed by file hash.

It's slower and more expensive than dedicated parsers but far more accurate on weird layouts. Pair with a real parser for cost-sensitive paths.
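Step 3 is what makes the cost tolerable. A stdlib sketch of hash-keyed caching, where `parse_with_llm` stands in for the page-image → vision-LLM call (its name and the cache layout are illustrative):

```python
import hashlib
from pathlib import Path

def parse_with_llm(pdf_bytes: bytes) -> str:
    """Hypothetical stand-in for steps 1-2: render pages to images and
    send them to a vision-capable LLM with a 'convert to markdown'
    prompt. Stubbed here so the caching logic is self-contained."""
    return "# parsed markdown (stub)"

def parse_cached(pdf_path: str, cache_dir: str = ".parse-cache") -> str:
    """Cache parsed markdown keyed by the file's content hash, so
    re-uploads of an identical document never hit the model twice."""
    pdf_bytes = Path(pdf_path).read_bytes()
    key = hashlib.sha256(pdf_bytes).hexdigest()
    cache_file = Path(cache_dir) / f"{key}.md"
    if cache_file.exists():
        return cache_file.read_text()
    markdown = parse_with_llm(pdf_bytes)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(markdown)
    return markdown
```

Keying on the content hash (not the filename) means renamed copies of the same PDF share one cache entry.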

Chunking after parsing

  • Chonkie — modern, fast, opinionated chunker (Python + TS).
  • LangChain text splitters — recursive, semantic, markdown-aware.
  • LlamaIndex node parsers — similar set.
  • semantic-text-splitter (Rust + WASM) — fast.
  • @mastra/rag chunkers — bundled with Mastra.
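The splitters above share one core idea: try the coarsest separator first and only fall back to finer ones for oversized pieces. A minimal stdlib sketch (separator hierarchy and chunk size are illustrative; the real libraries also re-merge small pieces and support overlap):

```python
def recursive_split(text: str, chunk_size: int = 200,
                    separators=("\n\n", "\n", ". ", " ")) -> list[str]:
    """Minimal recursive character splitter: keep text whole if it fits,
    otherwise split on the coarsest separator present and recurse."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    for sep in separators:
        if sep in text:
            chunks: list[str] = []
            for piece in text.split(sep):
                chunks.extend(recursive_split(piece, chunk_size, separators))
            return chunks
    # No separator left: hard-cut at chunk_size.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

Splitting paragraph-first, then sentence-first, is what keeps chunks semantically coherent compared to a naive fixed-width cut.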

Tables specifically

  • Camelot / Tabula (Python) — PDF tables.
  • pdfplumber — Python; programmatic.
  • LlamaParse premium mode — best hosted table extraction.
  • Reducto — table-focused.

Pick this if…

  • Highest-quality hosted PDF parsing: LlamaParse or Reducto.
  • Self-host, OSS, broad format support: Unstructured.io.
  • Academic / scientific PDFs: marker.
  • Edge runtime / Workers: unpdf or @mupdf/mupdfjs.
  • Receipts / invoices / IDs: Mindee or AWS Textract.
  • Just feed to the LLM: Anthropic Files API or OpenAI Responses API; simplest if cost isn't tight.
  • Chunking after parse: Chonkie or LangChain splitters.
