Add pdfmux to PDF processing tools #20

Open
NameetP wants to merge 1 commit into tstanislawek:main from NameetP:add-pdfmux

NameetP commented Apr 16, 2026

Adding pdfmux under PDF processing tools.

What it is

A Python orchestrator that classifies each PDF page (digital text, scanned, tables, mixed) and routes it to the optimal backend — PyMuPDF for digital text, Docling for tables, RapidOCR for scans, Gemini Flash for hard pages. Emits Markdown plus a per-page confidence score so document understanding pipelines can quarantine low-trust pages instead of feeding noise to KIE / LIR / classification models downstream.
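
To make the classify-then-route idea concrete, here is a minimal sketch in plain Python. It does not use pdfmux's actual API: the names `BACKENDS`, `FALLBACK_BACKEND`, `choose_backend`, and `route_pages` are hypothetical, and treating "mixed" pages as the hard case that goes to Gemini Flash is an assumption, not something stated above.

```python
# Hypothetical routing table mirroring the description above.
# This is an illustration only, not pdfmux's real interface.
BACKENDS = {
    "digital": "PyMuPDF",   # digital text pages
    "tables": "Docling",    # table-heavy pages
    "scanned": "RapidOCR",  # scanned pages
}

# Assumption: anything not covered above (e.g. "mixed") counts as a hard page.
FALLBACK_BACKEND = "Gemini Flash"


def choose_backend(page_kind: str) -> str:
    """Return the backend name for one classified page."""
    return BACKENDS.get(page_kind, FALLBACK_BACKEND)


def route_pages(page_kinds: list[str]) -> list[tuple[int, str]]:
    """Pair each 1-based page number with the backend chosen for it."""
    return [(number, choose_backend(kind))
            for number, kind in enumerate(page_kinds, start=1)]
```

For example, `route_pages(["digital", "scanned", "mixed"])` pairs page 1 with PyMuPDF, page 2 with RapidOCR, and page 3 with the fallback backend.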

Why it fits this list

  • Covers the messy ingestion step that precedes the KIE/LIR tasks this list focuses on. pdfmux doesn't do KIE itself; it produces clean Markdown + confidence signals that feed into the information extraction work downstream.
  • Confidence scoring is the differentiator. Rather than a binary "did it extract or not?", the score lets IE pipelines route low-confidence pages to human review or re-extraction.
  • Sits next to peers already in this section: borb, pdfplumber, pdfminer.six, Layout Parser, deepdoctection.
  • Open source (MIT), on PyPI: pip install pdfmux. Active maintainership — v1.5.0 released April 2026.
  • Benchmarked on 1,422 pages across 11 real-world business documents (10-Ks, S-1s, academic papers, legal opinions, FDA reports) with 100% confidence retained.
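
The confidence-gating point above could look roughly like this in a downstream pipeline. This is a sketch under assumptions: `PageResult`, `split_by_confidence`, the 0.0 to 1.0 score range, and the 0.8 threshold are all hypothetical and not taken from pdfmux.

```python
from dataclasses import dataclass


@dataclass
class PageResult:
    """Hypothetical per-page output: Markdown plus a confidence score."""
    page_number: int
    markdown: str
    confidence: float  # assumed to lie in the range 0.0 to 1.0


def split_by_confidence(
    pages: list[PageResult], threshold: float = 0.8
) -> tuple[list[PageResult], list[PageResult]]:
    """Split pages into (trusted, quarantined) around an assumed threshold."""
    trusted = [p for p in pages if p.confidence >= threshold]
    quarantined = [p for p in pages if p.confidence < threshold]
    return trusted, quarantined
```

Trusted pages would go on to the KIE or classification step, while quarantined pages could be sent to human review or re-extraction.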

Links

pdfmux is a Python orchestrator for PDF-to-Markdown conversion. It
classifies each page and routes to the optimal backend (PyMuPDF,
Docling, RapidOCR, Gemini Flash), emitting Markdown plus a per-page
confidence score so document understanding pipelines can quarantine
low-trust pages.

Relevant to this list because:
- Handles exactly the messy document ingestion step that precedes KIE
- Exposes a confidence signal useful for downstream IE quality gating
- Open source (MIT), actively maintained (v1.5.0 released April 2026)
- Benchmarked on 1,422 pages across 11 real-world business documents