Fix to_image() omitting filled AcroForm widget values (closes #1367) #1372
Open
Cyberfilo wants to merge 2 commits into
`get_page_image` opens a `pypdfium2.PdfDocument` and immediately loads the page without initializing PDFium's form environment. Filled AcroForm field text is drawn via PDFium's form-rendering layer (`FPDF_FFLDraw`), which only runs when a form environment exists. The fix: call `pdfium_doc.init_forms()` after open and before `get_page(page_ix)`, matching pypdfium2's documented order. `init_forms()` is a no-op on documents without a form, so the change is safe for non-form PDFs. Closes jsvine#1367.
Summary
`page.to_image()` (which goes through `pdfplumber.display.get_page_image`) currently opens a `pypdfium2.PdfDocument` and loads the page without initializing PDFium's form environment. Filled AcroForm field text is rendered via PDFium's form-rendering layer (`FPDF_FFLDraw`), which only runs when a form environment exists. Result: filled form-field text is missing from the PIL bitmap, even though every standard PDF viewer shows it.
This is purely a PDFium rasterization gap — pdfminer text extraction (`page.chars`, `extract_text()`) is unaffected.
Reproduction
(Reproducer + sample PDF in #1367.)
Fix
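One-line addition in `pdfplumber/display.py::get_page_image`: a call to `pdfium_doc.init_forms()` placed after the document is opened and before `get_page(page_ix)`.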
Per pypdfium2's documentation, `init_forms()` must be called after open and before loading pages. On a document without a form it is a no-op, so this is safe for the common non-form case.
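For reviewers who want to see the call order in isolation, here is a minimal sketch of the open, `init_forms()`, `get_page()`, render sequence. It is not the actual body of `get_page_image`: the function name `render_page_with_forms` and the `opener` parameter are hypothetical, added only so the ordering can be exercised without PDFium installed.

```python
def render_page_with_forms(path, page_ix=0, resolution=72, password=None, opener=None):
    """Rasterize one page, initializing PDFium's form environment first.

    `opener` is a hypothetical hook (not part of pdfplumber); it defaults
    to pypdfium2.PdfDocument when omitted.
    """
    if opener is None:
        import pypdfium2  # lazy import: only needed for real rendering

        opener = pypdfium2.PdfDocument

    pdfium_doc = opener(path, password=password)

    # Must run after open and before any page is loaded. Per the PR text,
    # this is a no-op on documents without a form.
    pdfium_doc.init_forms()

    pdfium_page = pdfium_doc.get_page(page_ix)

    # PDF user space is 72 units per inch, so scale = dpi / 72.
    bitmap = pdfium_page.render(scale=resolution / 72)
    return bitmap.to_pil()
```

With pypdfium2 installed, `render_page_with_forms("form.pdf", resolution=150)` should return a PIL image; `form.pdf` is a placeholder path, not the sample from #1367.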
Test plan
No new test is added: the bug is in the PDFium rendering path (binary output), and the existing test suite already covers `to_image()` for non-form PDFs, which continue to pass with the no-op `init_forms()` call. For form PDFs the existing tests cannot detect the regression; the rasterized bitmap differs by only a few hundred pixels per filled field.
Happy to add a pixel-comparison or perceptual-hash test against the sample PDF in #1367 if maintainers want — let me know.
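If a pixel-comparison test is wanted, its core could look roughly like the sketch below. The helper `count_differing_pixels` is hypothetical and works on packed 8-bit RGB bytes so it has no dependencies; the `tolerance` parameter is an assumption to absorb small antialiasing differences between PDFium builds.

```python
def count_differing_pixels(rgb_a: bytes, rgb_b: bytes, tolerance: int = 0) -> int:
    """Count pixels whose R, G, or B channel differs by more than `tolerance`.

    Both inputs are packed 8-bit RGB buffers (3 bytes per pixel) taken
    from images of identical dimensions.
    """
    if len(rgb_a) != len(rgb_b):
        raise ValueError("images must have the same dimensions")
    if len(rgb_a) % 3:
        raise ValueError("expected packed 8-bit RGB data")

    differing = 0
    for i in range(0, len(rgb_a), 3):
        # A pixel counts as different if any one channel exceeds the tolerance.
        if any(abs(rgb_a[i + c] - rgb_b[i + c]) > tolerance for c in range(3)):
            differing += 1
    return differing
```

A test could feed it `image.convert("RGB").tobytes()` from two Pillow images, for example a fresh render and a stored reference render, and assert on the count. Which pair to compare and what threshold to use are left open here.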
CHANGELOG
Added under `## Unreleased`.
Closes #1367.