feat: Add automatic OCR support for scanned PDFs by anup00900 · Pull Request #4810 · pymupdf/PyMuPDF

anup00900 · 2025-11-26T11:06:06Z

- Add is_scanned_page() function to detect scanned vs native PDFs
- Add get_text_smart() method with automatic OCR detection
- Add to_markdown() methods for Page and Document classes
- Add document_to_markdown() for full document conversion

New API:

- page.is_scanned() - Detect if page is image-based
- page.get_text_smart() - Auto-apply OCR when needed
- page.to_markdown() - Convert page to Markdown with OCR
- doc.to_markdown() - Convert entire document to Markdown

These enhancements make PyMuPDF work seamlessly with scanned PDFs by automatically detecting and applying OCR when text extraction would otherwise return empty results.
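The detect-then-OCR flow described above could be sketched roughly as follows. This is an illustration, not the code from this PR: the helper names `looks_scanned` and `get_text_with_ocr_fallback`, the 20-character threshold, and the 80% image-coverage threshold are all assumptions chosen for the example. The only PyMuPDF calls used are `Page.get_text()`, `Page.get_image_info()`, `Page.rect`, and `Page.get_textpage_ocr()`.

```python
def looks_scanned(text, image_area, page_area,
                  min_chars=20, min_image_coverage=0.8):
    """Heuristic guess whether a page is an image-only scan.

    Both thresholds are illustrative assumptions, not values from the PR.
    """
    has_little_text = len(text.strip()) < min_chars
    coverage = image_area / page_area if page_area > 0 else 0.0
    return has_little_text and coverage >= min_image_coverage


def get_text_with_ocr_fallback(page, language="eng", dpi=300):
    """Return native text, or OCR text if the page looks scanned.

    `page` is expected to be a pymupdf.Page. The function name is
    hypothetical and is not part of PyMuPDF.
    """
    text = page.get_text()

    # Sum the display areas of all images on the page. Overlapping images
    # are counted more than once, which is acceptable for a rough heuristic.
    image_area = sum(
        (x1 - x0) * (y1 - y0)
        for x0, y0, x1, y1 in (info["bbox"] for info in page.get_image_info())
    )
    page_area = page.rect.width * page.rect.height

    if not looks_scanned(text, image_area, page_area):
        return text

    # OCR the full rendered page and extract text from the OCR text page.
    textpage = page.get_textpage_ocr(language=language, dpi=dpi, full=True)
    return page.get_text(textpage=textpage)
```

With a document opened via `pymupdf.open(...)`, each page can be passed to `get_text_with_ocr_fallback(page)`. The OCR branch requires a Tesseract installation whose language data PyMuPDF can locate.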

This addresses a common user complaint about empty text extraction from scanned PDFs, though no specific issue was filed.

github-actions · 2025-11-26T11:06:19Z

Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by posting a Pull Request comment in the format below.

I have read the CLA Document and I hereby sign the CLA

You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

anup00900 · 2025-11-26T11:11:33Z

I have read the CLA Document and I hereby sign the CLA

JorjMcKie · 2025-11-26T11:21:52Z

Thank you for your suggestion.

This feature is already included in the pymupdf4llm package when it is run in layout mode - see here.

It includes heuristics for deciding whether a page would indeed benefit from being OCR'd - far beyond the criteria present in this PR.

We will not duplicate efforts here. At a later point in time we might consider extending PyMuPDF's API by including calls to respective pymupdf4llm functions, but it is not among our pressing To Dos.

So, hoping for your understanding, I am going to close this PR.

anup00900 force-pushed the feature/auto-ocr-scanned-pdfs branch from 59c4421 to 58c6dab on November 26, 2025 11:08

JorjMcKie closed this Nov 26, 2025

github-actions Bot locked and limited conversation to collaborators Nov 26, 2025

anup00900 wants to merge 1 commit into pymupdf:main from anup00900:feature/auto-ocr-scanned-pdfs
