Fix to_image() omitting filled AcroForm widget values (closes #1367) #1372

Open
Cyberfilo wants to merge 2 commits into jsvine:develop from Cyberfilo:fix/1367-init-forms-for-to-image

@Cyberfilo

Summary

`page.to_image()` (which goes through `pdfplumber.display.get_page_image`) currently opens a `pypdfium2.PdfDocument` and loads the page without initializing PDFium's form environment. Filled AcroForm field text is rendered via PDFium's form-rendering layer (`FPDF_FFLDraw`), which only runs when a form environment exists. Result: filled form-field text is missing from the PIL bitmap, even though every standard PDF viewer shows it.

This is purely a PDFium rasterization gap; pdfminer text extraction (`page.chars`, `extract_text()`) is unaffected.

Reproduction

import pdfplumber

with pdfplumber.open("filled_form.pdf") as pdf:
    im = pdf.pages[0].to_image(resolution=150).original
    im.save("out.png")  # filled field values absent

(Reproducer + sample PDF in #1367.)

Fix

One-line addition in `pdfplumber/display.py::get_page_image`:

pdfium_doc = pypdfium2.PdfDocument(src, password=password)
...
pdfium_doc.init_forms()  # <-- new
pdfium_page = pdfium_doc.get_page(page_ix)

Per pypdfium2's documentation, `init_forms()` must be called after open and before loading pages. On a document without a form it is a no-op, so this is safe for the common non-form case.
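
The ordering constraint can be sketched in isolation as below. This is a minimal illustration, not the code in `display.py`: the helper name `load_page_with_forms` is hypothetical, and `doc` is assumed to be a `pypdfium2.PdfDocument` (or any object exposing `init_forms()` and `get_page()`).

```python
def load_page_with_forms(doc, page_ix):
    """Initialize the form environment, then load the requested page.

    `doc` is assumed to be a pypdfium2.PdfDocument, or any object that
    exposes init_forms() and get_page(). The helper name is hypothetical.
    """
    # The form environment has to exist before any page is loaded;
    # otherwise the form layer is not drawn when the page is rendered.
    doc.init_forms()
    return doc.get_page(page_ix)
```

Keeping the two calls together in one place makes it harder for a later refactor to load a page before the form environment exists.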

Test plan

No new test is added: the bug is in the PDFium rendering path, whose output is a bitmap, and the existing test suite already covers `to_image()` for non-form PDFs. Those tests continue to pass with the no-op `init_forms()` call. For form PDFs, the existing tests have no way to detect the missing text, since the rasterized bitmap differs only in a few hundred pixels per filled field.

Happy to add a pixel-comparison or perceptual-hash test against the sample PDF in #1367 if maintainers want; let me know.
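
A pixel-comparison check could look roughly like the sketch below. It is not part of this PR; the helper name `changed_pixel_fraction` is hypothetical, and it assumes both renders are supplied as raw, equally sized pixel buffers (for example from PIL's `Image.tobytes()` after converting to RGB).

```python
def changed_pixel_fraction(before, after, channels=3, tolerance=0):
    """Return the fraction of pixels that differ between two raw buffers.

    `before` and `after` are bytes-like objects holding interleaved channel
    values (e.g. RGB). A pixel counts as changed when any of its channels
    differs by more than `tolerance`.
    """
    if len(before) != len(after):
        raise ValueError("buffers must have the same length")
    if len(before) % channels != 0:
        raise ValueError("buffer length must be a multiple of `channels`")

    n_pixels = len(before) // channels
    if n_pixels == 0:
        return 0.0

    changed = 0
    for start in range(0, len(before), channels):
        # Compare the pixel channel by channel.
        if any(
            abs(before[start + k] - after[start + k]) > tolerance
            for k in range(channels)
        ):
            changed += 1
    return changed / n_pixels
```

A test built on this would render the sample page with and without the fix and assert that the fraction is greater than zero; a small `tolerance` would absorb antialiasing noise if renders vary slightly across platforms.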

CHANGELOG

Added under `## Unreleased`.

Closes #1367.

Cyberfilo added 2 commits May 22, 2026 12:03
`get_page_image` opens a `pypdfium2.PdfDocument` and immediately loads
the page without initializing PDFium's form environment. Filled
AcroForm field text is drawn via PDFium's form-rendering layer
(`FPDF_FFLDraw`), which only runs when a form environment exists.
The fix: call `pdfium_doc.init_forms()` after open and before
`get_page(page_ix)`, matching pypdfium2's documented order.

`init_forms()` is a no-op on documents without a form, so the change
is safe for non-form PDFs.

Closes jsvine#1367.