Web Dev Tools

Document Parsing for RAG

Turning PDFs, DOCX, HTML, and slide decks into clean chunks for retrieval.

The hardest part of most RAG pipelines isn't the LLM — it's getting good text out of PDFs and Word docs. This category exists to solve that.

Open source / self-host parsers

  • Unstructured.io — open core (Python); the most popular OSS parser. PDF / DOCX / HTML / EPUB / images / emails. Hosted API also available.
  • marker (Vik Paruchuri) — PDF → Markdown via an ML model; very high quality on academic / structured documents.
  • docling (IBM) — open-source; DOCX, PDF, PPTX, HTML, images; layout-aware.
  • MegaParse — open-source; aggressive PDF structure preservation.
  • pdfplumber — Python; great for tables.
  • PyMuPDF / mupdf-js — fast PDF text extraction; available in JS via @mupdf/mupdfjs.
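Whichever parser you pick, raw PDF text usually needs light cleanup before chunking: words hyphenated across line breaks, hard-wrapped lines, stray whitespace. A minimal stdlib sketch (the heuristics are illustrative, not from any of the libraries above):

```python
import re

def clean_extracted_text(raw: str) -> str:
    """Light post-processing for parser output: re-join words hyphenated
    across line breaks, collapse intra-paragraph newlines, and normalize
    whitespace. Heuristics only; complex layouts still need a
    layout-aware parser."""
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)   # de-hyphenate line breaks
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)  # single newline -> space
    text = re.sub(r"[ \t]+", " ", text)           # collapse runs of spaces
    text = re.sub(r"\n{3,}", "\n\n", text)        # cap blank lines at one break
    return text.strip()
```

This keeps paragraph breaks (double newlines) intact while flattening the hard wrapping most extractors emit.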

Hosted document AI

  • LlamaParse (LlamaIndex) — gold-standard hosted PDF/DOCX parser; handles complex tables and scanned docs gracefully. Free tier.
  • Mindee — receipts, invoices, IDs, custom doc extraction. Free tier.
  • Reducto — newer, very accurate hosted parser.
  • Anthropic Files API, OpenAI Files / Responses with file_search — let the model handle parsing inside the API.
  • AWS Textract, Google Document AI, Azure Document Intelligence — cloud heavyweights.
  • Veryfi / Klippa / Rossum — vertical-specific (receipts, invoices).

TypeScript-native (run in Node / Workers)

  • pdf-parse — minimal PDF→text in Node.
  • pdf2json — older alternative.
  • pdf-lib — for PDF manipulation; not text extraction.
  • @mupdf/mupdfjs — MuPDF in WASM; fast text + images extraction.
  • unpdf (UnJS) — PDF utilities for serverless / edge runtimes.
  • @vendia/pdf-extractor — focused on edge runtimes.
  • mammoth — DOCX → HTML / markdown.
  • @docusaurus/utils mdast parsing — for markdown.
  • turndown — HTML → Markdown.
  • @mozilla/readability — extract main article content from HTML.

OCR fallback for scanned PDFs

See OCR & Computer Vision — Tesseract.js, Mindee, Textract.

LLMs as parsers

By 2026, LLMs (Claude, GPT-4o, Gemini) are accurate enough that, for many use cases, the simplest pipeline is:

  1. PDF → page images.
  2. Send pages to a vision-capable LLM with a "convert to markdown" prompt.
  3. Cache the result keyed by file hash.

It's slower and more expensive than dedicated parsers but far more accurate on weird layouts. Pair with a real parser for cost-sensitive paths.
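Step 3 is what makes the cost tolerable. A stdlib sketch of hash-keyed caching, where `parse_with_llm` stands in for the page-image → vision-LLM call (its name and the cache layout are illustrative):

```python
import hashlib
from pathlib import Path

def parse_with_llm(pdf_bytes: bytes) -> str:
    """Hypothetical stand-in for steps 1-2: render pages to images and
    send them to a vision-capable LLM with a 'convert to markdown'
    prompt. Stubbed here so the caching logic is self-contained."""
    return "# parsed markdown (stub)"

def parse_cached(pdf_path: str, cache_dir: str = ".parse-cache") -> str:
    """Cache parsed markdown keyed by the file's content hash, so
    re-uploads of an identical document never hit the model twice."""
    pdf_bytes = Path(pdf_path).read_bytes()
    key = hashlib.sha256(pdf_bytes).hexdigest()
    cache_file = Path(cache_dir) / f"{key}.md"
    if cache_file.exists():
        return cache_file.read_text()
    markdown = parse_with_llm(pdf_bytes)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(markdown)
    return markdown
```

Keying on the content hash (not the filename) means renamed copies of the same PDF share one cache entry.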

Chunking after parsing

  • Chonkie — modern, fast, opinionated chunker (Python + TS).
  • LangChain text splitters — recursive, semantic, markdown-aware.
  • LlamaIndex node parsers — similar set.
  • semantic-text-splitter (Rust + WASM) — fast.
  • @mastra/rag chunkers — bundled with Mastra.
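The splitters above share one core idea: try the coarsest separator first and only fall back to finer ones for oversized pieces. A minimal stdlib sketch (separator hierarchy and chunk size are illustrative; the real libraries also re-merge small pieces and support overlap):

```python
def recursive_split(text: str, chunk_size: int = 200,
                    separators=("\n\n", "\n", ". ", " ")) -> list[str]:
    """Minimal recursive character splitter: keep text whole if it fits,
    otherwise split on the coarsest separator present and recurse."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    for sep in separators:
        if sep in text:
            chunks: list[str] = []
            for piece in text.split(sep):
                chunks.extend(recursive_split(piece, chunk_size, separators))
            return chunks
    # No separator left: hard-cut at chunk_size.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

Splitting paragraph-first, then sentence-first, is what keeps chunks semantically coherent compared to a naive fixed-width cut.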

Tables specifically

  • Camelot / Tabula (Python) — PDF tables.
  • pdfplumber — Python; programmatic.
  • LlamaParse premium mode — best hosted table extraction.
  • Reducto — table-focused.

Pick this if…

  • Highest-quality hosted PDF parsing: LlamaParse or Reducto.
  • Self-host, OSS, broad format support: Unstructured.io.
  • Academic / scientific PDFs: marker.
  • Edge runtime / Workers: unpdf or @mupdf/mupdfjs.
  • Receipts / invoices / IDs: Mindee or AWS Textract.
  • Just feed to the LLM: Anthropic Files API or OpenAI Responses API; simplest if cost isn't tight.
  • Chunking after parse: Chonkie or LangChain splitters.
