feat: Add automatic OCR support for scanned PDFs #4810
I have read the CLA Document and I hereby sign the CLA. You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.
- Add is_scanned_page() function to detect scanned vs native PDFs
- Add get_text_smart() method with automatic OCR detection
- Add to_markdown() methods for Page and Document classes
- Add document_to_markdown() for full document conversion

New API:
- page.is_scanned() - Detect if page is image-based
- page.get_text_smart() - Auto-apply OCR when needed
- page.to_markdown() - Convert page to Markdown with OCR
- doc.to_markdown() - Convert entire document to Markdown

These enhancements make PyMuPDF work seamlessly with scanned PDFs by automatically detecting and applying OCR when text extraction would otherwise return empty results.

Fixes: #[issue_number if any]
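
For illustration, the fallback described above can be approximated with calls PyMuPDF already provides. This is a minimal sketch, not the code in this PR: the helper names `looks_scanned` and `get_text_with_ocr_fallback`, the `min_chars` threshold, and the "little text plus at least one image" heuristic are all assumptions made for the example. It also assumes a Tesseract installation, which `Page.get_textpage_ocr()` requires.

```python
def looks_scanned(page, min_chars=20):
    """Guess whether a page is image-based.

    Hypothetical heuristic (not the PR's criteria): the page yields
    almost no extractable text and contains at least one image.
    """
    text = page.get_text().strip()
    has_images = len(page.get_images()) > 0
    return len(text) < min_chars and has_images


def get_text_with_ocr_fallback(page, language="eng", dpi=300):
    """Return native text, or OCR text when the page looks scanned."""
    if not looks_scanned(page):
        return page.get_text()
    # full=True asks PyMuPDF to OCR the whole rendered page
    # instead of only the embedded images.
    textpage = page.get_textpage_ocr(language=language, dpi=dpi, full=True)
    return page.get_text(textpage=textpage)
```

With a document opened via `pymupdf.open(path)`, calling `get_text_with_ocr_fallback(doc[0])` would return the native text of the first page, or OCR output if the page appears to be a scan. A character-count threshold is a deliberately crude test; pages with a thin hidden text layer or mixed content would need stronger criteria.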
59c4421 to 58c6dab
I have read the CLA Document and I hereby sign the CLA
Thank you for your suggestion. This feature is already included in the pymupdf4llm package when it is run in layout mode - see here. It includes heuristics for deciding whether a page would indeed benefit from being OCR'd, which go far beyond the criteria present in this PR. We will not duplicate efforts here. At a later point in time we might consider extending PyMuPDF's API by including calls to the respective pymupdf4llm functions, but that is not among our pressing to-dos. So, hoping for your understanding, I am going to close this PR.
These enhancements make PyMuPDF work seamlessly with scanned PDFs by automatically detecting and applying OCR when text extraction would otherwise return empty results.
This addresses a common user complaint about empty text extraction from scanned PDFs, though no specific issue was filed.