feat: Add automatic OCR support for scanned PDFs#4810

Closed
anup00900 wants to merge 1 commit into
pymupdf:mainfrom
anup00900:feature/auto-ocr-scanned-pdfs

@anup00900

  • Add is_scanned_page() function to detect scanned vs native PDFs
  • Add get_text_smart() method with automatic OCR detection
  • Add to_markdown() methods for Page and Document classes
  • Add document_to_markdown() for full document conversion

New API:

  • page.is_scanned() - Detect if page is image-based
  • page.get_text_smart() - Auto-apply OCR when needed
  • page.to_markdown() - Convert page to Markdown with OCR
  • doc.to_markdown() - Convert entire document to Markdown

These enhancements make PyMuPDF work seamlessly with scanned PDFs by automatically detecting and applying OCR when text extraction would otherwise return empty results.

This addresses a common user complaint about empty text extraction from scanned PDFs, though no specific issue was filed.
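For illustration only: the PR's diff is not shown in this conversation, so the following is a minimal sketch of how "detect, then fall back to OCR" could be done with calls PyMuPDF already provides (`Page.get_text()`, `Page.get_images()`, `Page.get_textpage_ocr()`). The helper names `looks_scanned` and `get_text_with_ocr_fallback`, and the `min_chars` threshold, are hypothetical and are not the PR's `is_scanned()` / `get_text_smart()` implementation.

```python
def looks_scanned(page, min_chars=20):
    """Heuristic guess: almost no extractable text, but at least one image.

    `min_chars` is an arbitrary illustrative threshold, not a value from the PR.
    `page` is expected to behave like a PyMuPDF Page object.
    """
    text = page.get_text().strip()
    return len(text) < min_chars and len(page.get_images()) > 0


def get_text_with_ocr_fallback(page, language="eng", dpi=300):
    """Return the native text layer, or OCR text when the page looks scanned."""
    if not looks_scanned(page):
        return page.get_text()
    # get_textpage_ocr() requires a Tesseract installation that PyMuPDF can find.
    textpage = page.get_textpage_ocr(language=language, dpi=dpi, full=True)
    # Extract text from the OCR-derived text page instead of the native layer.
    return page.get_text(textpage=textpage)
```

With PyMuPDF and Tesseract installed, this could be tried as `doc = pymupdf.open("scan.pdf")` followed by `get_text_with_ocr_fallback(doc[0])`, where `scan.pdf` is a placeholder file name. A character-count threshold is a crude criterion; pages that mix a thin text layer with large images would need more careful handling.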

@github-actions
Contributor

github-actions Bot commented Nov 26, 2025


Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by posting a pull request comment in the format below.


I have read the CLA Document and I hereby sign the CLA


You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

@anup00900 anup00900 force-pushed the feature/auto-ocr-scanned-pdfs branch from 59c4421 to 58c6dab Compare November 26, 2025 11:08
@anup00900
Author

I have read the CLA Document and I hereby sign the CLA

@JorjMcKie
Collaborator

Thank you for your suggestion.

This feature is already included in the pymupdf4llm package when it is run in layout mode - see here.

It includes heuristics for deciding whether a page would indeed benefit from being OCR'd - far beyond the criteria present in this PR.

We will not duplicate efforts here. At a later point in time we might consider extending PyMuPDF's API by including calls to respective pymupdf4llm functions, but it is not among our pressing To Dos.

So, hoping for your understanding, I am going to close this PR.

@JorjMcKie JorjMcKie closed this Nov 26, 2025
@github-actions github-actions Bot locked and limited conversation to collaborators Nov 26, 2025