Does markitdown support scanned PDF's? #1361

mindphil · 2025-07-18T15:21:39Z

I'm working on a program that renames documents based on each file's metadata, as well as context extracted from PDFs. PyPDF works well enough for native PDFs but struggles with anything scanned. How does markitdown handle scanned documents?

cuicheng01 · 2025-09-15T03:33:58Z

Perhaps PaddleOCR is a better choice, as it can achieve state-of-the-art accuracy in the analysis of scanned PDFs. https://www.paddleocr.ai/latest/en/version3.x/algorithm/PP-StructureV3/PP-StructureV3.html

VANDRANKI · 2026-04-15T14:24:12Z

markitdown uses pdfminer under the hood, which only reads text that is actually encoded in the PDF. For scanned PDFs (images of pages), there is no embedded text, so markitdown will return empty or minimal output - this is a pdfminer limitation, not a markitdown one.
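One way to spot this case before deciding whether OCR is needed is to check how much text actually comes back. The sketch below is only a heuristic under stated assumptions: `looks_scanned`, `extract_text`, and the 20-characters-per-page threshold are illustrative names and values, not part of markitdown.

```python
def looks_scanned(extracted_text: str, page_count: int = 1,
                  min_chars_per_page: int = 20) -> bool:
    """Heuristic: almost no text per page suggests image-only (scanned) pages.

    The threshold is an arbitrary assumption; tune it for your documents.
    """
    # Ignore whitespace and form feeds so blank output counts as empty.
    visible = "".join(extracted_text.split())
    return len(visible) < min_chars_per_page * max(page_count, 1)


def extract_text(pdf_path: str) -> str:
    """Convert a PDF with markitdown and return the extracted text."""
    # Imported lazily so the heuristic above can be used without markitdown.
    from markitdown import MarkItDown

    return MarkItDown().convert(pdf_path).text_content
```

If `looks_scanned(extract_text(path))` returns true, the file probably has no text layer and needs OCR before markitdown can do anything useful with it.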

To handle scanned PDFs you need to run OCR first and produce a searchable PDF, then pass that through markitdown. Common options:

tesseract + ocrmypdf: OCR the file into a searchable PDF, then run markitdown on the output
PaddleOCR (as mentioned above) for higher accuracy on complex layouts
Cloud OCR services (Azure Document Intelligence, Google Document AI) if accuracy is critical

Once you have a searchable PDF with embedded text, markitdown converts it cleanly.
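The OCR-then-convert flow with ocrmypdf could look roughly like the sketch below. It assumes the `ocrmypdf` command-line tool and the `markitdown` Python package are installed; the helper names and the `.ocr.pdf` output suffix are made up for illustration.

```python
import subprocess
from pathlib import Path


def ocr_output_path(src: Path) -> Path:
    """Hypothetical naming convention: scan.pdf -> scan.ocr.pdf, same folder."""
    return src.with_name(src.stem + ".ocr.pdf")


def build_ocr_command(src: Path, dst: Path) -> list[str]:
    """Build the ocrmypdf invocation.

    --skip-text leaves pages that already contain text untouched,
    so mixed native/scanned PDFs are handled without re-OCR.
    """
    return ["ocrmypdf", "--skip-text", str(src), str(dst)]


def ocr_then_convert(pdf_path: str) -> str:
    """Run OCR to add a text layer, then convert the result with markitdown."""
    src = Path(pdf_path)
    dst = ocr_output_path(src)
    # check=True raises CalledProcessError if ocrmypdf exits with an error.
    subprocess.run(build_ocr_command(src, dst), check=True)

    # Imported lazily so the helpers above work without markitdown installed.
    from markitdown import MarkItDown

    return MarkItDown().convert(str(dst)).text_content
```

Calling `ocr_then_convert("scan.pdf")` would write `scan.ocr.pdf` next to the original and return the Markdown text extracted from it.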

lciesielski · 2026-04-21T13:26:46Z

Also, it seems that since March, markitdown supports an OCR plugin. Here is the README for the plugin.

Note: You will need to supply your own API key for an LLM provider (like OpenAI) for it to work.
