
refactor: make dpi explicit on convert_pdf_to_image for dedup with unstructured #501

Merged
cragwolfe merged 2 commits into main from refactor/dedup-convert-pdf-to-image
Apr 3, 2026


@KRRT7 KRRT7 commented Apr 3, 2026

Summary

  • Make dpi an explicit parameter (default 200) on convert_pdf_to_image instead of reading inference_config.PDF_RENDER_DPI internally
  • Enables unstructured to import and call this function directly, eliminating the duplicate _render_pdf_pages implementation
  • No behavior change — both internal callers already pass dpi explicitly
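
The pattern described in the summary can be sketched as follows. This is a minimal illustration, not the actual `convert_pdf_to_image` implementation: `render_pdf_stub` and `HypotheticalConfig` are made-up stand-ins, and the value 350 is an arbitrary example. Only the default of 200 comes from this PR.

```python
from dataclasses import dataclass


@dataclass
class HypotheticalConfig:
    """Stand-in for a caller-owned config object (illustrative name and value)."""

    PDF_RENDER_DPI: int = 350


def render_pdf_stub(filename: str, dpi: int = 200) -> dict:
    """Stand-in for a PDF renderer whose resolution comes only from its argument.

    The function never reads a global or module-level config, so any caller
    can import it and supply its own resolution.
    """
    # A real renderer would rasterize pages here; the stub just records its inputs.
    return {"filename": filename, "dpi": dpi}


caller_config = HypotheticalConfig()

# Omitting dpi falls back to the documented default of 200.
default_result = render_pdf_stub("example.pdf")

# A caller with its own config passes the value through explicitly.
explicit_result = render_pdf_stub("example.pdf", dpi=caller_config.PDF_RENDER_DPI)
```

Because the renderer no longer depends on a particular config module, a downstream package can reuse it without duplicating the function.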

Changelog

## 1.6.2

### Enhancement
- Make `dpi` an explicit parameter on `convert_pdf_to_image` (default 200) instead of reading from config internally, enabling unstructured to use this as the single source of truth for PDF rendering

Depends on / blocks

Remove internal config dependency so unstructured can use this as the
single source of truth for PDF rendering without duplicating the function.

PDF_RENDER_DPI is no longer referenced now that convert_pdf_to_image takes
dpi as an explicit parameter; callers pass their own config value instead.
KRRT7 commented Apr 3, 2026

Added a second commit that removes PDF_RENDER_DPI from InferenceConfig — it's now dead code since convert_pdf_to_image takes dpi as an explicit parameter instead of reading from config. Callers (i.e. unstructured) pass their own config value directly.

@cragwolfe cragwolfe enabled auto-merge (squash) April 3, 2026 02:57
@cragwolfe cragwolfe merged commit b48efdd into main Apr 3, 2026
10 checks passed
@cragwolfe cragwolfe deleted the refactor/dedup-convert-pdf-to-image branch April 3, 2026 02:58
github-merge-queue Bot pushed a commit to Unstructured-IO/unstructured that referenced this pull request Apr 3, 2026
…erence (#4315)

## Summary

- Delete `_render_pdf_pages` from `pdf_image_utils.py` (~70 lines)
- Delegate `convert_pdf_to_image` and `convert_pdf_to_images` to
`unstructured-inference`'s implementation (which already has lazy
per-page rendering since v1.5.5)
- Pass `env_config.PDF_RENDER_DPI` explicitly instead of relying on
internal config
- Bump `unstructured-inference` dep to `>=1.6.2`

Peak memory for `path_only=True` drops from O(n_pages) to O(1 page) —
97% reduction on a 100-page PDF.
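
To illustrate why lazy per-page rendering bounds peak memory when only paths are needed, here is a minimal sketch. It is not the `unstructured-inference` implementation: `render_page_stub`, `eager_paths`, and `lazy_paths` are hypothetical names, and the "image" is a fake byte buffer.

```python
from typing import Iterator, List


def render_page_stub(page_number: int, dpi: int) -> bytes:
    """Stand-in for rasterizing one page; returns a fake pixel buffer."""
    return bytes(dpi)  # buffer size scales with dpi purely for illustration


def eager_paths(n_pages: int, dpi: int) -> List[str]:
    """Render every page first, then derive paths: all buffers are alive at once."""
    images = [render_page_stub(i, dpi) for i in range(1, n_pages + 1)]
    return [f"page_{i}.png" for i, _ in enumerate(images, start=1)]


def lazy_paths(n_pages: int, dpi: int) -> Iterator[str]:
    """Render one page at a time: only the current buffer is alive per step."""
    for i in range(1, n_pages + 1):
        image = render_page_stub(i, dpi)
        # A real implementation would write `image` to disk here.
        del image  # drop the buffer before moving on to the next page
        yield f"page_{i}.png"


# Both strategies produce the same paths; they differ only in peak memory.
paths = list(lazy_paths(3, dpi=200))
```

The eager version holds one buffer per page until it returns, while the generator releases each buffer before rendering the next.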

## Depends on

- [ ] Unstructured-IO/unstructured-inference#501 (make `dpi` explicit)

---------

Co-authored-by: codeflash-ai[bot] <178395242+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: Kevin Turcios <turcioskevinr@gmail.com>