refactor: deduplicate PDF rendering by delegating to unstructured-inference by codeflash-ai[bot] · Pull Request #4315 · Unstructured-IO/unstructured

codeflash-ai · 2026-04-03T01:20:12Z

Summary

- Delete _render_pdf_pages from pdf_image_utils.py (~70 lines)
- Delegate convert_pdf_to_image and convert_pdf_to_images to unstructured-inference's implementation (which has had lazy per-page rendering since v1.5.5)
- Pass env_config.PDF_RENDER_DPI explicitly instead of relying on internal config
- Bump unstructured-inference dep to >=1.6.2

Peak memory for path_only=True drops from O(n_pages) to O(1 page), a 97% reduction on a 100-page PDF.
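A minimal sketch of the delegation described in the summary, not the merged code. `EnvConfig`, its placeholder DPI value, and `inference_convert_pdf_to_image` are hypothetical stand-ins for `env_config` and the unstructured-inference function; only the idea of passing `dpi` explicitly comes from this PR.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EnvConfig:
    """Hypothetical stand-in for unstructured's env_config."""

    PDF_RENDER_DPI: int = 300  # placeholder value, not the library's real default


env_config = EnvConfig()


def inference_convert_pdf_to_image(
    filename: str,
    dpi: int = 200,
    output_folder: Optional[str] = None,
    path_only: bool = False,
) -> List[str]:
    """Stand-in for the delegated renderer; it only records the dpi it was given."""
    return [f"{filename}@{dpi}dpi"]


def convert_pdf_to_images(
    filename: str,
    output_folder: Optional[str] = None,
    path_only: bool = False,
) -> List[str]:
    """Thin wrapper: no rendering logic of its own, dpi is passed explicitly."""
    return inference_convert_pdf_to_image(
        filename,
        dpi=env_config.PDF_RENDER_DPI,  # read here, at the call site
        output_folder=output_folder,
        path_only=path_only,
    )
```

Because the wrapper reads the DPI at call time, a change to the caller's configuration takes effect without the delegated function having to know about that configuration.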

Depends on

- Unstructured-IO/unstructured-inference#501 (refactor: make dpi explicit on convert_pdf_to_image for dedup with unstructured)

…structured (#501)

## Summary

- Make `dpi` an explicit parameter (default 200) on `convert_pdf_to_image` instead of reading `inference_config.PDF_RENDER_DPI` internally
- Enables unstructured to import and call this function directly, eliminating the duplicate `_render_pdf_pages` implementation
- No behavior change; both internal callers already pass `dpi` explicitly

## Changelog

```
## 1.6.2

### Enhancement

- Make `dpi` an explicit parameter on `convert_pdf_to_image` (default 200) instead of reading from config internally, enabling unstructured to use this as the single source of truth for PDF rendering
```

## Depends on / blocks

- Blocks Unstructured-IO/unstructured#4315 (dedup of PDF rendering)
- Blocks Unstructured-IO/core-product#1480 (version bump)

- Add override-dependencies to relax inference's numpy/pandas floors (numpy>=2.4.2 → >=1.26.0, pandas>=3.0.0 → >=1.5.0), which conflict with kdbai-client via pykx
- Add Python version marker so 3.11 falls back to inference >=1.2.0 (1.6.x requires Python >=3.12)

codeflash-ai Bot added Skip-Changelog and removed Skip-Changelog labels Apr 3, 2026

mem: lazy per-page rendering in _render_pdf_pages to reduce peak memory

ef1063e

Render and save each PDF page individually instead of accumulating all PIL images in a dict before saving. With path_only=True, peak memory drops from O(n_pages) to O(1 page).
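The difference between the two strategies in this commit message can be sketched as follows. This is an illustration under stated assumptions, not the code from the commit: `render_pages` is a hypothetical renderer that yields raw byte buffers in place of PIL images, and the `.bin` file names are made up.

```python
import tempfile
from pathlib import Path
from typing import Iterator, List


def render_pages(n_pages: int, page_size: int = 1024) -> Iterator[bytes]:
    """Hypothetical renderer: yields one page buffer at a time."""
    for i in range(n_pages):
        yield bytes([i % 256]) * page_size


def save_pages_eager(n_pages: int, out_dir: Path) -> List[Path]:
    """Accumulate every page in a dict first: O(n_pages) buffers alive at once."""
    images = {no: buf for no, buf in enumerate(render_pages(n_pages), start=1)}
    paths = []
    for page_no, buf in images.items():
        path = out_dir / f"page_{page_no}.bin"
        path.write_bytes(buf)
        paths.append(path)
    return paths


def save_pages_lazy(n_pages: int, out_dir: Path) -> List[Path]:
    """Render and save one page per iteration: only the current buffer is held."""
    paths = []
    for page_no, buf in enumerate(render_pages(n_pages), start=1):
        path = out_dir / f"page_{page_no}.bin"
        path.write_bytes(buf)
        paths.append(path)  # keep only the path, not the page data
    return paths
```

Both functions write identical files. The lazy variant drops each buffer as soon as the next page is rendered, which is why returning only paths keeps peak memory at roughly one page.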

codeflash-ai Bot force-pushed the mem/lazy-pdf-rendering branch from 973a994 to ef1063e Compare April 3, 2026 01:23

refactor: deduplicate PDF rendering by delegating to unstructured-inf…

6fcc55f

…erence Delete _render_pdf_pages and delegate convert_pdf_to_image to unstructured-inference's implementation, which already has lazy per-page rendering. Bumps inference dep to >=1.6.2.

KRRT7 mentioned this pull request Apr 3, 2026

refactor: make dpi explicit on convert_pdf_to_image for dedup with unstructured Unstructured-IO/unstructured-inference#501

Merged

KRRT7 changed the title ~~mem: lazy per-page rendering in _render_pdf_pages to reduce peak memory~~ refactor: deduplicate PDF rendering by delegating to unstructured-inference Apr 3, 2026

KRRT7 marked this pull request as draft April 3, 2026 02:47

KRRT7 marked this pull request as ready for review April 3, 2026 03:38

KRRT7 added 2 commits April 2, 2026 22:38

merge: resolve changelog conflict with upstream main

de87c19

chore: bump version to 0.22.14

2551cb4

0.22.13 is already taken on main by the standardize_quotes change.

cragwolfe approved these changes Apr 3, 2026

KRRT7 enabled auto-merge April 3, 2026 03:44

fix: update test assertion to match inference 1.6.2 error message casing

7bbab45

KRRT7 added this pull request to the merge queue Apr 3, 2026

Merged via the queue into main with commit affb9d6 Apr 3, 2026
52 checks passed

KRRT7 deleted the mem/lazy-pdf-rendering branch April 3, 2026 05:01

KRRT7 merged 6 commits into main from mem/lazy-pdf-rendering

codeflash-ai Bot commented Apr 3, 2026 •

edited by KRRT7
