Image Extraction Improvements for Issue #511 #1136

Draft
ljluestc wants to merge 1 commit into run-llama:main from ljluestc:fix/pdf-image-extraction-511

Image Extraction Improvements for Issue #511

Problem Statement

Users reported that images in uploaded PDFs are not being recognized or extracted. Specifically, a user uploaded a Chinese microwave oven instruction manual (weibolu.pdf) containing text, images, and tables, but the output did not contain the images.

Job ID: 6a79d5d9-ce02-4103-9055-db03be7e7613

Root Cause Analysis

  1. Default parsing mode does not prioritize image extraction - users need to opt into premium or agent-based parsing
  2. Fast mode explicitly skips OCR and image extraction without clear warning
  3. No diagnostic tooling existed to help users understand why images weren't extracted
  4. No convenient API to get both text and images in a single call
  5. Chinese documents require language='zh' for optimal OCR
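The causes above lend themselves to a simple pre-flight check on parser options. The following is a minimal sketch, not the code in this PR: `image_extraction_warnings` is a hypothetical helper, and while the option names `fast_mode`, `premium_mode`, and `language` come from this description, the way they are interpreted here is an assumption.

```python
from typing import Any, Dict, List


def image_extraction_warnings(options: Dict[str, Any]) -> List[str]:
    """Return warnings for options likely to suppress image extraction.

    Hypothetical helper for illustration; not part of LlamaParse.
    """
    warnings: List[str] = []
    if options.get("fast_mode"):
        # Root cause 2: fast mode skips OCR and image extraction.
        warnings.append("fast_mode skips OCR and image extraction")
    if not options.get("premium_mode"):
        # Root cause 1: the default mode does not prioritize images.
        warnings.append("premium_mode is off; default parsing does not prioritize images")
    if not options.get("language"):
        # Root cause 5: OCR quality depends on the configured language.
        warnings.append("no language set; non-English documents may need one")
    return warnings


if __name__ == "__main__":
    for message in image_extraction_warnings({"fast_mode": True}):
        print("warning:", message)
```

An empty list means none of the listed causes apply to the given options; it does not guarantee that images will be extracted.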

Solution Overview

This PR adds comprehensive image extraction diagnostics, warnings, and convenience methods to help users successfully extract images from PDFs.

Migration Guide

For users currently using load_data():

# Before
documents = parser.load_data("document.pdf")

# After (if you need images)
text_documents, image_documents = parser.load_data_with_images("document.pdf")
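To illustrate what returning text and images together might involve, here is a rough sketch that splits per-page parse output into two lists. It is not the implementation in this PR: `TextDoc`, `ImageDoc`, `split_text_and_images`, and the page layout (a `md` string plus an `images` list of dicts with a `name` key) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass
class TextDoc:
    """Stand-in for a text/markdown document (hypothetical)."""
    page: int
    text: str


@dataclass
class ImageDoc:
    """Stand-in for an image document (hypothetical)."""
    page: int
    name: str


def split_text_and_images(
    pages: List[Dict[str, Any]],
) -> Tuple[List[TextDoc], List[ImageDoc]]:
    """Build one text document per page and one image document per image."""
    text_docs: List[TextDoc] = []
    image_docs: List[ImageDoc] = []
    for index, page in enumerate(pages, start=1):
        text_docs.append(TextDoc(page=index, text=page.get("md", "")))
        # Pages without an "images" key simply contribute no image documents.
        for image in page.get("images", []):
            image_docs.append(ImageDoc(page=index, name=image["name"]))
    return text_docs, image_docs


if __name__ == "__main__":
    sample_pages = [
        {"md": "# Manual", "images": [{"name": "figure_1.png"}]},
        {"md": "Safety notes"},
    ]
    texts, images = split_text_and_images(sample_pages)
    print(len(texts), "text documents,", len(images), "image documents")
```

Returning a tuple keeps the existing `load_data()` call unchanged for callers who only need text.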

For users currently using parse() or aparse():

# Before
result = await parser.aparse("document.pdf")

# After (with diagnostics)
result = await parser.aparse("document.pdf")
if not result.has_images():
    result.print_image_extraction_report()
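The diagnostic flow above can be pictured with a small stand-in class. This is a sketch only: `FakeJobResult` is not the real `JobResult`, the `images_per_page` field is hypothetical, and the shape of the summary dictionary is an assumption; only the method names come from this PR description.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FakeJobResult:
    """Stand-in for illustration; not the real JobResult class."""

    # Number of extracted images on each page (hypothetical field).
    images_per_page: List[int] = field(default_factory=list)

    def has_images(self) -> bool:
        """True if at least one page has at least one image."""
        return any(count > 0 for count in self.images_per_page)

    def get_image_extraction_summary(self) -> Dict[str, int]:
        """Summarize page and image counts (assumed return shape)."""
        return {
            "total_pages": len(self.images_per_page),
            "pages_with_images": sum(1 for c in self.images_per_page if c > 0),
            "total_images": sum(self.images_per_page),
        }

    def print_image_extraction_report(self) -> None:
        """Print the summary, plus a hint when nothing was extracted."""
        summary = self.get_image_extraction_summary()
        print(
            f"{summary['total_images']} image(s) on "
            f"{summary['pages_with_images']} of {summary['total_pages']} page(s)"
        )
        if not self.has_images():
            print("No images extracted; check language, premium_mode, and take_screenshot.")


if __name__ == "__main__":
    result = FakeJobResult(images_per_page=[0, 0])
    if not result.has_images():
        result.print_image_extraction_report()
```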

Changes

- Add load_data_with_images / aload_data_with_images convenience methods to LlamaParse (Python) that return both text/markdown documents and ImageDocument objects in a single call
- Add loadDataWithImages to LlamaParseReader (TypeScript) that returns documents and image metadata together
- Add JobResult diagnostic helpers: has_images(), get_image_extraction_summary(), get_image_extraction_troubleshooting(), print_image_extraction_report()
- Emit a warning when parse() detects that no images were extracted, suggesting the language, premium_mode, and take_screenshot options
- Clarify the fast_mode description to warn that it skips image extraction
- Improve docstrings on load_data / loadData to point users toward image-aware methods
- Add 17 unit tests for the new JobResult methods

Related Issues

Closes run-llama#511

changeset-bot Bot commented Mar 27, 2026

⚠️ No Changeset found

Latest commit: cce0830

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


ljluestc changed the title from "# Image Extraction Improvements for Issue #511" to "Image Extraction Improvements for Issue #511" on Mar 27, 2026

Development

Successfully merging this pull request may close these issues.

Images in uploaded pdfs are not recognized
