Document Parsing for RAG
Turning PDFs, DOCX, HTML, and slide decks into clean chunks for retrieval.
The hardest part of most RAG pipelines isn't the LLM — it's getting good text out of PDFs and Word docs. This category exists to solve that.
Open source / self-host parsers
- ★ Unstructured.io — open core (Python); the most popular OSS parser. PDF / DOCX / HTML / EPUB / images / emails. Hosted API also available.
- ★ marker (Vik Paruchuri) — PDF → Markdown via an ML model; very high quality on academic / structured documents.
- docling (IBM) — open-source; DOCX, PDF, PPTX, HTML, images; layout-aware.
- MegaParse — open-source; aggressive PDF structure preservation.
- pdfplumber — Python; great for tables.
- PyMuPDF / mupdf-js — fast PDF text extraction; available in JS via @mupdf/mupdfjs.
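Whichever extractor you pick, raw PDF text usually arrives with hard line breaks mid-paragraph and words hyphenated across lines, which hurts both chunking and retrieval. A minimal normalization pass, as a stdlib-only sketch (the function name and the exact cleanup rules are illustrative, not from any library above):

```python
import re

def normalize_pdf_text(text: str) -> str:
    """Clean raw extractor output before chunking:
    - rejoin words hyphenated across line breaks ("implemen-\\ntation")
    - flatten single newlines inside paragraphs to spaces
    - keep blank lines as paragraph boundaries
    """
    # "implemen-\ntation" -> "implementation"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Split on real paragraph breaks (2+ newlines), flatten the rest
    paragraphs = re.split(r"\n{2,}", text)
    paragraphs = [re.sub(r"\s*\n\s*", " ", p).strip() for p in paragraphs]
    return "\n\n".join(p for p in paragraphs if p)
```

Dedicated parsers like marker or LlamaParse do this for you; a pass like this mostly matters when you use a bare extractor such as pdfplumber or PyMuPDF.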
Hosted document AI
- ★ LlamaParse (LlamaIndex) — gold-standard hosted PDF/DOCX parser; complex tables and scanned docs handled gracefully. Free tier.
- ★ Mindee — receipts, invoices, IDs, custom doc extraction. Free tier.
- Reducto — newer, very accurate hosted parser.
- Anthropic Files API, OpenAI Files / Responses with file_search — let the model handle parsing inside the API.
- AWS Textract, Google Document AI, Azure Document Intelligence — cloud heavyweights.
- Veryfi / Klippa / Rossum — vertical-specific (receipts, invoices).
TypeScript-native (run in Node / Workers)
- ★ pdf-parse — minimal PDF → text in Node.
- pdf2json — older alternative.
- pdf-lib — for PDF manipulation; not text extraction.
- @mupdf/mupdfjs — MuPDF in WASM; fast text + image extraction.
- unpdf (UnJS) — PDF utilities for serverless / edge runtimes.
- @vendia/pdf-extractor — focused on edge runtimes.
- mammoth — DOCX → HTML / Markdown.
- @docusaurus/utils mdast parsing — for Markdown.
- turndown — HTML → Markdown.
- @mozilla/readability — extract main article content from HTML.
OCR fallback for scanned PDFs
See OCR & Computer Vision — Tesseract.js, Mindee, Textract.
LLMs as parsers
By 2026 LLMs (Claude, GPT-4o, Gemini) are accurate enough that for many use cases the simplest pipeline is:
- PDF → page images.
- Send pages to a vision-capable LLM with a "convert to markdown" prompt.
- Cache the result keyed by file hash.
It's slower and more expensive than dedicated parsers but far more accurate on weird layouts. Pair with a real parser for cost-sensitive paths.
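Step 3 above is what makes the cost bearable: the expensive vision-LLM call runs once per unique file. A sketch of content-addressed caching, assuming a `parse_fn` that wraps your actual vision-LLM call (bytes in, Markdown out — the stub and cache location are hypothetical):

```python
import hashlib
from pathlib import Path

def file_hash(data: bytes) -> str:
    """Content-addressed key: identical bytes -> identical cached parse."""
    return hashlib.sha256(data).hexdigest()

def parse_with_cache(pdf_bytes: bytes, parse_fn,
                     cache_dir: Path = Path(".parse-cache")) -> str:
    """Run the (slow, expensive) LLM parse once per unique file.

    parse_fn: your vision-LLM call, bytes -> markdown string.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / f"{file_hash(pdf_bytes)}.md"
    if cached.exists():
        return cached.read_text()
    markdown = parse_fn(pdf_bytes)
    cached.write_text(markdown)
    return markdown
```

Keying on the file hash rather than the filename means re-uploaded or renamed copies of the same document never trigger a second parse.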
Chunking after parsing
- ★ Chonkie — modern, fast, opinionated chunker (Python + TS).
- LangChain text splitters — recursive, semantic, markdown-aware.
- LlamaIndex node parsers — similar set.
- semantic-text-splitter (Rust + WASM) — fast.
- @mastra/rag chunkers — bundled with Mastra.
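The recursive strategy these libraries implement is simple at heart: try the coarsest separator first (paragraph break), and only fall back to finer ones (sentence, word) when a piece is still too long. A stdlib-only sketch of the idea (not any library's actual API):

```python
def recursive_split(text, max_len=500, seps=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator that keeps chunks under max_len."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not seps:
        # No separator left: hard character cut as a last resort
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, rest = seps[0], seps[1:]
    chunks, buf = [], ""
    for part in text.split(sep):
        candidate = buf + sep + part if buf else part
        if len(candidate) <= max_len:
            buf = candidate
        else:
            if buf:
                chunks.extend(recursive_split(buf, max_len, rest))
            buf = part  # may still be too long; recursion handles it
    if buf:
        chunks.extend(recursive_split(buf, max_len, rest))
    return chunks
```

Production chunkers add overlap between chunks and token-based (not character-based) length limits; reach for Chonkie or the LangChain splitters rather than shipping this.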
Tables specifically
- Camelot / Tabula (Python) — PDF tables.
- pdfplumber — Python; programmatic.
- LlamaParse premium mode — best hosted table extraction.
- Reducto — table-focused.
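Whichever extractor you use, keeping tables as Markdown in your chunks preserves row/column structure for the LLM at query time. A sketch that renders rows in the list-of-lists shape pdfplumber's table extraction produces (cells as strings, None for empty cells); the function itself is illustrative:

```python
def rows_to_markdown(rows):
    """Render extracted table rows (first row = header) as a Markdown table."""
    def fmt(row):
        cells = ("" if c is None else str(c).strip() for c in row)
        return "| " + " | ".join(cells) + " |"
    if not rows:
        return ""
    header, body = rows[0], rows[1:]
    lines = [fmt(header), "| " + " | ".join("---" for _ in header) + " |"]
    lines += [fmt(r) for r in body]
    return "\n".join(lines)
```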
Pick this if…
- Highest-quality hosted PDF parsing: LlamaParse or Reducto.
- Self-host, OSS, broad format support: Unstructured.io.
- Academic / scientific PDFs: marker.
- Edge runtime / Workers: unpdf or @mupdf/mupdfjs.
- Receipts / invoices / IDs: Mindee or AWS Textract.
- Just feed to the LLM: Anthropic Files API or OpenAI Responses API; simplest if cost isn't tight.
- Chunking after parse: Chonkie or LangChain splitters.