refactor: deduplicate PDF rendering by delegating to unstructured-inference #4315

Merged
KRRT7 merged 6 commits into main from mem/lazy-pdf-rendering on Apr 3, 2026

@codeflash-ai codeflash-ai Bot commented Apr 3, 2026

Summary

  • Delete _render_pdf_pages from pdf_image_utils.py (~70 lines)
  • Delegate convert_pdf_to_image and convert_pdf_to_images to unstructured-inference's implementation (which already has lazy per-page rendering since v1.5.5)
  • Pass env_config.PDF_RENDER_DPI explicitly instead of relying on internal config
  • Bump unstructured-inference dep to >=1.6.2

Peak memory for path_only=True drops from O(n_pages) to O(1 page) — 97% reduction on a 100-page PDF.
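
The memory claim above can be illustrated with a minimal sketch of the lazy pattern. It is not the code from either repository: `iter_rendered_pages`, `save_pages_lazily`, and `fake_render_page` are hypothetical names, and the renderer is a stub that returns bytes where a real rasterizer would return an image. Each page is rendered, written to disk, and released before the next one is produced, so only the list of paths grows with the page count.

```python
import tempfile
from pathlib import Path
from typing import Callable, Iterator, List


def iter_rendered_pages(
    n_pages: int, render_page: Callable[[int], bytes]
) -> Iterator[bytes]:
    """Yield one rendered page at a time instead of building a full collection."""
    for page_number in range(1, n_pages + 1):
        yield render_page(page_number)


def save_pages_lazily(
    n_pages: int, render_page: Callable[[int], bytes], output_dir: str
) -> List[str]:
    """Write each page as soon as it is rendered and keep only its path."""
    paths: List[str] = []
    pages = iter_rendered_pages(n_pages, render_page)
    for page_number, image_bytes in enumerate(pages, start=1):
        path = Path(output_dir) / f"page_{page_number}.png"
        path.write_bytes(image_bytes)
        paths.append(str(path))
        # image_bytes is rebound on the next iteration, so at most one
        # rendered page is referenced by this loop at any time.
    return paths


def fake_render_page(page_number: int) -> bytes:
    """Hypothetical stand-in for a real page rasterizer (assumption)."""
    return f"page-{page_number}".encode()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for saved in save_pages_lazily(3, fake_render_page, tmp):
            print(Path(saved).name)
```

The eager alternative would collect every rendered page in a dict or list before saving any of them, which is what makes peak memory scale with the number of pages.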

Depends on

Render and save each PDF page individually instead of accumulating all
PIL images in a dict before saving. With path_only=True, peak memory
drops from O(n_pages) to O(1 page).
codeflash-ai Bot force-pushed the mem/lazy-pdf-rendering branch from 973a994 to ef1063e on April 3, 2026 01:23
…erence

Delete _render_pdf_pages and delegate convert_pdf_to_image to
unstructured-inference's implementation, which already has lazy per-page
rendering. Bumps inference dep to >=1.6.2.
KRRT7 changed the title from "mem: lazy per-page rendering in _render_pdf_pages to reduce peak memory" to "refactor: deduplicate PDF rendering by delegating to unstructured-inference" on Apr 3, 2026
KRRT7 marked this pull request as draft on April 3, 2026 02:47
cragwolfe pushed a commit to Unstructured-IO/unstructured-inference that referenced this pull request Apr 3, 2026
…structured (#501)

## Summary

- Make `dpi` an explicit parameter (default 200) on
`convert_pdf_to_image` instead of reading
`inference_config.PDF_RENDER_DPI` internally
- Enables unstructured to import and call this function directly,
eliminating the duplicate `_render_pdf_pages` implementation
- No behavior change — both internal callers already pass `dpi`
explicitly
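
As a rough illustration of the design choice in the bullets above (not the actual `convert_pdf_to_image` signature beyond its `dpi` default of 200), the sketch below shows a function that takes `dpi` as an argument instead of reading a config object internally. `RenderConfig`, its value of 350, and `rendered_pixel_size` are hypothetical; the only outside fact relied on is that a PDF point is 1/72 inch.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class RenderConfig:
    """Hypothetical caller-side config; the value 350 is illustrative only."""

    PDF_RENDER_DPI: int = 350


def rendered_pixel_size(
    width_pt: float, height_pt: float, dpi: int = 200
) -> Tuple[int, int]:
    """Return the pixel size of a page rendered at `dpi`.

    PDF page sizes are expressed in points (1 pt = 1/72 inch), so the
    scale factor is dpi / 72. The function never consults a config
    object; the caller decides which DPI applies.
    """
    return round(width_pt * dpi / 72), round(height_pt * dpi / 72)


if __name__ == "__main__":
    config = RenderConfig()
    # US Letter is 612 x 792 points.
    print(rendered_pixel_size(612, 792))  # default dpi of 200
    print(rendered_pixel_size(612, 792, dpi=config.PDF_RENDER_DPI))
```

Because the value is passed in, two packages with different config systems can share one rendering function without either reaching into the other's settings.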

## Changelog

```
## 1.6.2

### Enhancement
- Make `dpi` an explicit parameter on `convert_pdf_to_image` (default 200) instead of reading from config internally, enabling unstructured to use this as the single source of truth for PDF rendering
```

## Depends on / blocks

- Blocks Unstructured-IO/unstructured#4315 (dedup of PDF rendering)
- Blocks Unstructured-IO/core-product#1480 (version bump)
- Add override-dependencies to relax inference's numpy/pandas floors
  (numpy>=2.4.2 → >=1.26.0, pandas>=3.0.0 → >=1.5.0) which conflict
  with kdbai-client via pykx
- Add Python version marker so 3.11 falls back to inference >=1.2.0
  (1.6.x requires Python >=3.12)
KRRT7 marked this pull request as ready for review on April 3, 2026 03:38
KRRT7 added 2 commits April 2, 2026 22:38
0.22.13 is already taken on main by the standardize_quotes change.
KRRT7 enabled auto-merge on April 3, 2026 03:44
KRRT7 added this pull request to the merge queue on Apr 3, 2026
Merged via the queue into main with commit affb9d6 Apr 3, 2026
52 checks passed
KRRT7 deleted the mem/lazy-pdf-rendering branch on April 3, 2026 05:01