Replies: 3 comments
-
Perhaps PaddleOCR is a better choice, as it can achieve state-of-the-art accuracy in the analysis of scanned PDFs: https://www.paddleocr.ai/latest/en/version3.x/algorithm/PP-StructureV3/PP-StructureV3.html
-
markitdown uses pdfminer under the hood, which only reads text that is actually encoded in the PDF. For scanned PDFs (images of pages), there is no embedded text, so markitdown will return empty or minimal output - this is a pdfminer limitation, not a markitdown one. To handle scanned PDFs you need to run OCR first and produce a searchable PDF, then pass that through markitdown. Common options:
Once you have a searchable PDF with embedded text, markitdown converts it cleanly.
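As a rough illustration of the "OCR first, then markitdown" flow described above, here is a minimal Python sketch. It is not an official recipe. The helper names (`needs_ocr`, `pdf_to_markdown`), the 20-character threshold, and the `.ocr.pdf` output name are made up for this example, and OCRmyPDF is only one possible OCR tool that I am assuming here; the thread does not name it. The two library calls are imported lazily, so the fallback logic can be tried with stand-in functions.

```python
from pathlib import Path
from typing import Callable


def needs_ocr(text: str, min_chars: int = 20) -> bool:
    """Heuristic (assumption): near-empty output suggests a scanned PDF."""
    # Ignore whitespace so a page of blank lines still counts as empty.
    return len("".join(text.split())) < min_chars


def extract_with_markitdown(pdf_path: str) -> str:
    # Imported lazily so the rest of the sketch runs without markitdown installed.
    from markitdown import MarkItDown

    return MarkItDown().convert(pdf_path).text_content


def ocr_with_ocrmypdf(src: str, dst: str) -> None:
    # Assumption: OCRmyPDF is used as the OCR step; any tool that writes
    # a searchable PDF to `dst` would fit here.
    import ocrmypdf

    ocrmypdf.ocr(src, dst)


def pdf_to_markdown(
    pdf_path: str,
    extract: Callable[[str], str] = extract_with_markitdown,
    ocr: Callable[[str, str], None] = ocr_with_ocrmypdf,
    min_chars: int = 20,
) -> str:
    """Convert a PDF to Markdown, running OCR only if no text comes back."""
    text = extract(pdf_path)
    if not needs_ocr(text, min_chars):
        return text  # native PDF: embedded text was found

    # Scanned PDF: write a searchable copy next to the original, then retry.
    src = Path(pdf_path)
    searchable = str(src.with_name(src.stem + ".ocr.pdf"))
    ocr(pdf_path, searchable)
    return extract(searchable)
```

With both packages installed, `pdf_to_markdown("scan.pdf")` would try plain extraction first and only fall back to OCR when almost no text is returned. The threshold is a guess and probably needs tuning for documents that mix scanned and native pages.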
-
Also, it seems that since March, markitdown supports an OCR plugin. Here is the README for the plugin. Note: you will need to supply your own API key for an LLM provider (like OpenAI) for it to work.
-
I'm working on a program that renames documents based on each file's metadata, as well as context extracted from PDFs. PyPDF works well enough for native PDFs but struggles with anything scanned. How does this handle scanned documents?