Add pdfmux to PDF processing tools #20
Open
NameetP wants to merge 1 commit into
pdfmux is a Python orchestrator for PDF-to-Markdown conversion. It classifies each page and routes it to the optimal backend (PyMuPDF, Docling, RapidOCR, Gemini Flash), emitting Markdown plus a per-page confidence score so document understanding pipelines can quarantine low-trust pages.

Relevant to this list because:
- Handles exactly the messy document ingestion step that precedes KIE
- Exposes a confidence signal useful for downstream IE quality gating
- Open source (MIT), actively maintained (v1.5.0 released April 2026)
- Benchmarked on 1,422 pages across 11 real-world business documents
Adding pdfmux under PDF processing tools.
What it is
A Python orchestrator that classifies each PDF page (digital text, scanned, tables, mixed) and routes it to the optimal backend — PyMuPDF for digital text, Docling for tables, RapidOCR for scans, Gemini Flash for hard pages. Emits Markdown plus a per-page confidence score so document understanding pipelines can quarantine low-trust pages instead of feeding noise to KIE / LIR / classification models downstream.
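The classify-then-route flow and the confidence gate described above can be sketched roughly as follows. This is an illustrative sketch only, not pdfmux's actual API: `PageKind`, `PageResult`, `route`, `split_by_confidence`, and the 0.8 threshold are all hypothetical names and values chosen for the example, and mapping "mixed" pages to Gemini Flash is an assumption based on the "hard pages" wording.

```python
from dataclasses import dataclass
from enum import Enum


class PageKind(Enum):
    """Page classes named in the description (hypothetical enum)."""
    DIGITAL = "digital"
    SCANNED = "scanned"
    TABLES = "tables"
    MIXED = "mixed"


# Routing table following the description above. Sending MIXED pages to
# Gemini Flash is an assumption; the description only says "hard pages".
BACKEND_FOR = {
    PageKind.DIGITAL: "PyMuPDF",
    PageKind.TABLES: "Docling",
    PageKind.SCANNED: "RapidOCR",
    PageKind.MIXED: "Gemini Flash",
}


@dataclass
class PageResult:
    """Hypothetical per-page output: Markdown plus a confidence score."""
    page_number: int
    backend: str
    markdown: str
    confidence: float


def route(kind: PageKind) -> str:
    """Return the backend name for a classified page."""
    return BACKEND_FOR[kind]


def split_by_confidence(results, threshold=0.8):
    """Split results into (trusted, quarantined) by confidence score.

    The threshold is an illustrative value, not a pdfmux default.
    """
    trusted = [r for r in results if r.confidence >= threshold]
    quarantined = [r for r in results if r.confidence < threshold]
    return trusted, quarantined
```

A downstream KIE step would then consume only the trusted pages and send the quarantined ones to manual review or a retry with a different backend.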
Why it fits this list
pip install pdfmux. Active maintainership: v1.5.0 released April 2026.
Links