Refactor pdf image rendering to support chunked isolated execution #4319

Closed
PastelStorm wants to merge 16 commits into main from evoss/refactor-pdf-image-rendering

@PastelStorm PastelStorm commented Apr 4, 2026

Note

Medium Risk
Touches core PDF OCR/rendering and image extraction paths; chunking and new renderer indirection could change performance and page-to-element mapping, though changes are well-covered by new tests.

Overview
Refactors PDF rendering/OCR to support chunked, runtime-resolved rendering. PDF page rasterization is now done in configurable chunks via PDFIUM_CHUNK_SIZE (default 8) for process_file_with_ocr() and save_elements(), and convert_pdf_to_image resolves the underlying renderer at call time to allow downstream monkey-patching.
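As a rough illustration of the chunking idea described above (not the code from this PR), the sketch below splits a page count into fixed-size page ranges and renders one range at a time, so only one chunk of page images needs to be held at once. The names `iter_page_chunks`, `render_in_chunks`, and the `render_range` callback are hypothetical; only the default of 8 comes from the description.

```python
from typing import Callable, Iterator, List, Tuple

DEFAULT_CHUNK_SIZE = 8  # default stated in the PR description for PDFIUM_CHUNK_SIZE


def iter_page_chunks(
    page_count: int, chunk_size: int = DEFAULT_CHUNK_SIZE
) -> Iterator[Tuple[int, int]]:
    """Yield 1-based, inclusive (first_page, last_page) ranges covering all pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be >= 1")
    for first in range(1, page_count + 1, chunk_size):
        yield first, min(first + chunk_size - 1, page_count)


def render_in_chunks(
    page_count: int,
    render_range: Callable[[int, int], List[str]],  # hypothetical renderer callback
    chunk_size: int = DEFAULT_CHUNK_SIZE,
) -> Iterator[str]:
    """Render one chunk at a time and yield its pages before rendering the next."""
    for first, last in iter_page_chunks(page_count, chunk_size):
        yield from render_range(first, last)


# Stand-in renderer that records which ranges were requested.
calls: List[Tuple[int, int]] = []


def fake_render(first: int, last: int) -> List[str]:
    calls.append((first, last))
    return [f"page-{n}" for n in range(first, last + 1)]


pages = list(render_in_chunks(20, fake_render))
```

Because the ranges are 1-based and inclusive, the page number of each rendered image can be recovered from the chunk's first page plus its offset, which is one way to keep the page-to-element mapping stable across chunks.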

Improves robustness around streaming inputs and edge cases. File-like inputs have their read position restored after OCR/image extraction, invalid PDFIUM_CHUNK_SIZE values fall back with a warning, and OCR now errors if a PDF renders pages but the provided layout is empty.
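A minimal sketch of two of the behaviours listed above, under stated assumptions: `read_chunk_size` and `preserve_position` are hypothetical helper names, and the exact validation rules in the PR may differ. The first falls back to the default with a warning when `PDFIUM_CHUNK_SIZE` is not a positive integer; the second restores a file-like object's read position after it has been consumed.

```python
import io
import logging
import os
from contextlib import contextmanager
from typing import Iterator, Mapping, Optional

logger = logging.getLogger(__name__)
DEFAULT_CHUNK_SIZE = 8  # default stated in the PR description


def read_chunk_size(env: Optional[Mapping[str, str]] = None) -> int:
    """Parse PDFIUM_CHUNK_SIZE, falling back to the default on invalid values."""
    env = os.environ if env is None else env
    raw = env.get("PDFIUM_CHUNK_SIZE")
    if raw is None:
        return DEFAULT_CHUNK_SIZE
    try:
        value = int(raw)
    except ValueError:
        value = 0  # treated as invalid below
    if value < 1:
        logger.warning(
            "Invalid PDFIUM_CHUNK_SIZE=%r; using default %d", raw, DEFAULT_CHUNK_SIZE
        )
        return DEFAULT_CHUNK_SIZE
    return value


@contextmanager
def preserve_position(file: io.IOBase) -> Iterator[io.IOBase]:
    """Restore the stream's read position on exit, even if the body raises."""
    start = file.tell()
    try:
        yield file
    finally:
        file.seek(start)


# Usage: the caller's position survives a full read inside the block.
buf = io.BytesIO(b"%PDF-1.7 example bytes")
buf.seek(5)
with preserve_position(buf) as f:
    f.read()
position_after = buf.tell()
```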

Enhances CSV partitioning for tricky inputs. Delimiter detection is hardened for small/truncated/quoted data and long first rows; delimiter-less single-column CSVs are parsed with CSV quoting semantics instead of raw text.
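To make the single-column case concrete, here is a small sketch that is not the PR's detection logic: it swaps in a simple heuristic (pick the candidate delimiter that splits every row into the same number of columns, more than one) and, when none qualifies, still parses the text with `csv.reader` so quoted cells are unquoted. `detect_delimiter`, `parse_csv_text`, and the unlikely fallback separator are assumptions made for illustration.

```python
import csv
import io
from typing import List, Optional, Sequence

UNLIKELY_DELIMITER = "\x1f"  # assumption: ASCII unit separator will not occur in data


def detect_delimiter(
    text: str, candidates: Sequence[str] = (",", ";", "\t", "|")
) -> Optional[str]:
    """Return the candidate giving a consistent multi-column split, else None."""
    best: Optional[str] = None
    best_width = 1
    for delim in candidates:
        try:
            rows = [r for r in csv.reader(io.StringIO(text), delimiter=delim) if r]
        except csv.Error:
            continue
        widths = {len(r) for r in rows}
        if len(widths) == 1:
            width = next(iter(widths))
            if width > best_width:
                best, best_width = delim, width
    return best


def parse_csv_text(text: str) -> List[List[str]]:
    """Parse with the detected delimiter, or as one quoted column if none is found."""
    delim = detect_delimiter(text) or UNLIKELY_DELIMITER
    return [r for r in csv.reader(io.StringIO(text), delimiter=delim) if r]


single_column = parse_csv_text('name\n"Smith, John"\n"Doe, Jane"\n')
two_columns = parse_csv_text("a;b\nc;d\n")
```

In the single-column input the commas sit inside quotes, so no candidate produces more than one column and the cells come back unquoted rather than as raw text lines.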

Also bumps CI GitHub Actions versions (notably actions/checkout@v5 and dorny/paths-filter@v4 plus Node24 env), updates license ignore list, fixes a pandas read-only array mutation in Tesseract OCR, bumps version to 0.22.17, and refreshes ingest fixture outputs accordingly.

Reviewed by Cursor Bugbot for commit 6235628. Bugbot is set up for automated code reviews on this repo.

socket-security bot commented Apr 4, 2026

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

| Diff | Package | Supply Chain Security | Vulnerability | Quality | Maintenance | License |
| --- | --- | --- | --- | --- | --- | --- |
| Added | github/actions/checkout@93cb6efe18208431cddfb8368fd83d5badbf9bfd | 100 | 100 | 100 | 100 | 100 |
| Updated | github/dorny/paths-filter@de90cc6fb38fc0963ad72b210f1f284cd68cea36 ⏵ fbd0ab8f3e69293af611ebaee6363fc25e6d187d | 100 +1 | 100 | 100 | 100 | 100 |

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 429480c.

@cragwolfe
Contributor

gpt-pro review: https://docs.google.com/document/d/1Eofs39phN6LNtv8xEWVk4BNcsJ8Zy_OVBQq8h8zYbCk/edit?usp=sharing

@PastelStorm PastelStorm closed this Apr 5, 2026