Refactor pdf image rendering to support chunked isolated execution by PastelStorm · Pull Request #4319 · Unstructured-IO/unstructured

PastelStorm · 2026-04-04T17:02:50Z

Note

Medium Risk
Touches core PDF OCR/rendering and image extraction paths; chunking and new renderer indirection could change performance and page-to-element mapping, though changes are well-covered by new tests.

Overview
Refactors PDF rendering/OCR to support chunked, runtime-resolved rendering. PDF page rasterization is now done in configurable chunks via PDFIUM_CHUNK_SIZE (default 8) for process_file_with_ocr() and save_elements(), and convert_pdf_to_image resolves the underlying renderer at call time to allow downstream monkey-patching.
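To illustrate the chunking described above, here is a minimal sketch of splitting a page count into fixed-size page ranges. It is not the PR's code: `iter_page_chunks` is a hypothetical helper, and the 1-indexed inclusive ranges are an assumption. Only the default of 8 comes from the summary.

```python
from typing import Iterator, Tuple


def iter_page_chunks(total_pages: int, chunk_size: int = 8) -> Iterator[Tuple[int, int]]:
    """Yield 1-indexed, inclusive (first_page, last_page) ranges.

    Hypothetical helper: the real renderer may index pages differently.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    for first in range(1, total_pages + 1, chunk_size):
        # The final chunk is clipped to the last page of the document.
        yield first, min(first + chunk_size - 1, total_pages)


# A 19-page document with the default chunk size of 8:
print(list(iter_page_chunks(19)))  # -> [(1, 8), (9, 16), (17, 19)]
```

Rendering one range at a time bounds how many page images are held in memory at once, which is the usual motivation for this kind of chunking.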

Improves robustness around streaming inputs and edge cases. File-like inputs have their read position restored after OCR/image extraction, invalid PDFIUM_CHUNK_SIZE values fall back with a warning, and OCR now errors if a PDF renders pages but the provided layout is empty.
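The two behaviours above (restoring a file-like object's read position, and falling back on an invalid `PDFIUM_CHUNK_SIZE`) could look roughly like the sketch below. The helper names `resolve_chunk_size` and `preserve_position` are hypothetical, and treating non-integers and values below 1 as invalid is an assumption; the summary does not say which values the PR rejects.

```python
import io
import logging
import os
from contextlib import contextmanager
from typing import IO, Iterator, Mapping, Optional

logger = logging.getLogger("pdf_chunk_example")
DEFAULT_CHUNK_SIZE = 8  # default stated in the PR summary


def resolve_chunk_size(env: Optional[Mapping[str, str]] = None) -> int:
    """Read PDFIUM_CHUNK_SIZE, falling back to the default with a warning."""
    env = os.environ if env is None else env
    raw = env.get("PDFIUM_CHUNK_SIZE")
    if raw is None:
        return DEFAULT_CHUNK_SIZE
    try:
        value = int(raw)
    except ValueError:
        value = 0  # treated as invalid below
    if value < 1:  # assumption: zero and negatives are invalid
        logger.warning(
            "Invalid PDFIUM_CHUNK_SIZE=%r; falling back to %d", raw, DEFAULT_CHUNK_SIZE
        )
        return DEFAULT_CHUNK_SIZE
    return value


@contextmanager
def preserve_position(file: IO[bytes]) -> Iterator[IO[bytes]]:
    """Restore the stream's read position on exit, even if an error occurs."""
    start = file.tell()
    try:
        yield file
    finally:
        file.seek(start)


buf = io.BytesIO(b"%PDF-1.7 example bytes")
buf.seek(5)
with preserve_position(buf) as f:
    f.read()  # stands in for OCR or image extraction consuming the stream
print(buf.tell())  # -> 5
```

Restoring the position in a `finally` block means a caller that passes the same stream to a later step sees it where they left it, whether or not processing succeeded.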

Enhances CSV partitioning for tricky inputs. Delimiter detection is hardened for small/truncated/quoted data and long first rows; delimiter-less single-column CSVs are parsed with CSV quoting semantics instead of raw text.
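As a rough illustration of the kind of hardening described, the sketch below picks a delimiter only when every complete row agrees on a column count greater than one, ignores a possibly truncated final line, and otherwise parses the data as a single quoted column. This is a simplified heuristic using the standard `csv` module, not the detection logic in `unstructured/partition/csv.py`; `detect_delimiter`, `read_rows`, and the candidate list are hypothetical.

```python
import csv
import io
from typing import List, Optional

CANDIDATES = (",", ";", "\t", "|")  # assumed candidate set


def detect_delimiter(sample: str) -> Optional[str]:
    """Return the candidate giving the widest consistent row width, else None."""
    lines = sample.splitlines()
    # A sample cut mid-row would skew the column counts, so drop the last
    # line when the sample does not end on a line break.
    if len(lines) > 1 and not sample.endswith(("\n", "\r")):
        lines = lines[:-1]
    best, best_cols = None, 1
    for delim in CANDIDATES:
        # csv.reader honours quoting, so delimiters inside quotes do not count.
        widths = {len(row) for row in csv.reader(lines, delimiter=delim) if row}
        if len(widths) == 1:
            (width,) = widths
            if width > best_cols:
                best, best_cols = delim, width
    return best


def read_rows(text: str) -> List[List[str]]:
    """Parse with the detected delimiter; single-column data keeps CSV quoting."""
    delim = detect_delimiter(text) or ","
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delim)
    return [row for row in reader if row]


print(detect_delimiter("a;b;c\n1;2;3\n"))  # -> ;
print(read_rows('"hello, world"\n"a ""quoted"" cell"\n'))
# -> [['hello, world'], ['a "quoted" cell']]
```

Because the single-column case still goes through `csv.reader`, surrounding quotes are removed and doubled quotes are unescaped rather than being passed through as raw text.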

Also bumps CI GitHub Actions versions (notably actions/checkout@v5 and dorny/paths-filter@v4 plus Node24 env), updates license ignore list, fixes a pandas read-only array mutation in Tesseract OCR, bumps version to 0.22.17, and refreshes ingest fixture outputs accordingly.

Reviewed by Cursor Bugbot for commit 6235628. Bugbot is set up for automated code reviews on this repo.

unstructured/partition/pdf_image/pdf_image_utils.py

unstructured/partition/csv.py

socket-security · 2026-04-04T18:57:02Z

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff	Package	Supply Chain Security	Vulnerability	Quality	Maintenance	License
	github/actions/checkout@93cb6efe18208431cddfb8368fd83d5badbf9bfd
	github/dorny/paths-filter@de90cc6fb38fc0963ad72b210f1f284cd68cea36 ⏵ fbd0ab8f3e69293af611ebaee6363fc25e6d187d	⁺¹

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 429480c.

.github/workflows/ci.yml

cragwolfe · 2026-04-04T19:46:11Z

gpt-pro review: https://docs.google.com/document/d/1Eofs39phN6LNtv8xEWVk4BNcsJ8Zy_OVBQq8h8zYbCk/edit?usp=sharing

Refactor pdf image rendering to support chunked isolated execution

e7cf82f

cursor bot reviewed Apr 4, 2026

unstructured/partition/pdf_image/pdf_image_utils.py (Outdated)

unstructured/partition/pdf_image/pdf_image_utils.py (Outdated)

PastelStorm added 4 commits April 4, 2026 10:11

fixes

3f7f656

cursor bot and shfmt fixes

e15db79

more lint

4485b28

fix tests broken after deps bump

35969ea

cursor bot reviewed Apr 4, 2026

unstructured/partition/csv.py (Outdated)

unstructured/partition/csv.py (Outdated)

PastelStorm added 6 commits April 4, 2026 10:48

boop

4fb3879

fix csv regression

b4940d8

cursor bot feedback

edf4efd

lint

97f67bd

fixtures

178b3d3

switch to node 24 actions

429480c

cursor bot reviewed Apr 4, 2026

.github/workflows/ci.yml (Outdated)

indentation

c690e50

PastelStorm added 4 commits April 4, 2026 14:31

Crag's GPT pro review

d7038cf

lint

267f8e6

more lint

e54620c

lint

6235628

PastelStorm closed this Apr 5, 2026

PastelStorm wants to merge 16 commits into main from evoss/refactor-pdf-image-rendering
