
Commit f79f8e5

Author: Bob Strahan
fix(ocr): add page.flatten() for fillable PDFs without appearance streams (#240)
init_forms() alone is insufficient for fillable PDFs that lack pre-generated appearance streams for form fields (common in government forms like VA-21-22a). page.flatten() forces PDFium to generate appearances and merge them into page content before rendering, ensuring all form field values are visible.

Changes:
- ocr/service.py: add page.flatten() before _extract_page_image() in rendering loop
- bda_processresults_function/index.py: add page.flatten() before render()
- test_ocr_service.py: verify both init_forms() and flatten() are called
- CHANGELOG.md: update fix description with two-part explanation
1 parent 1c3b99d commit f79f8e5

4 files changed

Lines changed: 17 additions & 1 deletion


CHANGELOG.md

Lines changed: 1 addition & 1 deletion
@@ -38,7 +38,7 @@ SPDX-License-Identifier: MIT-0
 
 ### Fixed
 
-- **Fillable PDF form fields missing from rendered page images** — Fixed bug where fillable PDF form fields (text inputs, checkboxes, radio buttons, dropdowns) were not rendered in page images, causing OCR and extraction to miss user-entered data. Root cause: pypdfium2's `render(may_draw_forms=True)` requires `PdfDocument.init_forms()` to be called first to initialize the form rendering engine. Added `init_forms()` call in both Pattern 2 (`OcrService`) and Pattern 1 (`create_pdf_page_images`) PDF rendering pipelines. ([#240](https://github.com/aws-solutions-library-samples/accelerated-intelligent-document-processing-on-aws/issues/240))
+- **Fillable PDF form fields missing from rendered page images** — Fixed bug where fillable PDF form fields (text inputs, checkboxes, radio buttons, dropdowns) were not rendered in page images, causing OCR and extraction to miss user-entered data. Two-part fix: (1) `PdfDocument.init_forms()` initializes the form rendering engine so PDFium can process form fields, and (2) `page.flatten()` merges form field appearances into page content before rendering — required because many fillable PDFs (especially government forms) lack pre-generated appearance streams. Applied in both Pattern 2 (`OcrService`) and Pattern 1 (`create_pdf_page_images`) PDF rendering pipelines. ([#240](https://github.com/aws-solutions-library-samples/accelerated-intelligent-document-processing-on-aws/issues/240))
 
 - **Discovery subscription handler dropping errorMessage and other fields** — Fixed bug where the UI subscription handler did `{ ...oldJob, status: updatedJob.status }`, discarding all fields except status from real-time subscription updates. Error messages, discovered class names, and status messages were being sent by the backend but silently dropped by the UI. Now spreads all fields: `{ ...oldJob, ...updatedJob }`.
 

lib/idp_common_pkg/idp_common/ocr/service.py

Lines changed: 6 additions & 0 deletions
@@ -426,6 +426,12 @@ def process_document(self, document: Document) -> Document:
         page_images: Dict[int, bytes] = {}
         for i in pages_to_render:
             page = pdf_document[i]
+            # Flatten form fields into page content before rendering.
+            # Many fillable PDFs (e.g., government forms) lack appearance
+            # streams for form fields — flatten() forces PDFium to generate
+            # them and merge into page content so render() can display them.
+            # Requires init_forms() to have been called before page retrieval.
+            page.flatten()
             page_images[i] = self._extract_page_image(page, True, i + 1)
 
         pdf_document.close()
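
The ordering this hunk depends on can be sketched in isolation. The helper below is hypothetical: `render_pages_with_forms` and its `render_page` callback are not part of the repository. It assumes only a document object that exposes `init_forms()`, indexing, and `close()`, and page objects that expose `flatten()`, as the diff shows for pypdfium2. Closing the document in a `finally` block is a choice made for this sketch, not something the commit does.

```python
from typing import Any, Callable, Dict, Iterable


def render_pages_with_forms(
    pdf_document: Any,
    pages_to_render: Iterable[int],
    render_page: Callable[[Any, int], bytes],
) -> Dict[int, bytes]:
    """Render selected pages after merging form fields into page content.

    Hypothetical helper: `pdf_document` is assumed to behave like the
    pypdfium2 document used in the diff (init_forms(), indexing, close()).
    """
    # Step 1: initialize the form environment before any page is retrieved.
    pdf_document.init_forms()

    page_images: Dict[int, bytes] = {}
    try:
        for i in pages_to_render:
            page = pdf_document[i]
            # Step 2: merge form field appearances into the page content.
            page.flatten()
            # Step 3: render; the callback receives a 1-based page number.
            page_images[i] = render_page(page, i + 1)
    finally:
        # Release the document even if flattening or rendering fails.
        pdf_document.close()
    return page_images
```

Because the helper only relies on duck typing, it can be exercised with simple fakes or mocks without a real PDF.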

lib/idp_common_pkg/tests/unit/ocr/test_ocr_service.py

Lines changed: 5 additions & 0 deletions
@@ -353,6 +353,11 @@ def test_process_document_calls_init_forms_for_fillable_pdfs(
     # Verify init_forms() was called to enable fillable PDF form rendering
     mock_pdf_doc.init_forms.assert_called_once()
 
+    # Verify flatten() was called on the page to merge form fields
+    # into page content (needed for PDFs without appearance streams)
+    mock_page = mock_pdf_doc.__getitem__.return_value
+    mock_page.flatten.assert_called_once()
+
 @patch("boto3.client")
 @patch("idp_common.ocr.service.pdfium.PdfDocument")
 def test_process_document_success(
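
A note on how the new assertions reach the page mock: indexing a `MagicMock` returns the same `__getitem__.return_value` object for every index, so `flatten.assert_called_once()` passes only when exactly one page is rendered. The standalone sketch below uses only `unittest.mock`. The `flatten_pages` function is a hypothetical stand-in for the rendering loop, not the code under test.

```python
from unittest.mock import MagicMock


def flatten_pages(pdf_document, page_indices):
    """Hypothetical stand-in for the rendering loop under test."""
    pdf_document.init_forms()
    for i in page_indices:
        pdf_document[i].flatten()


# One page: both assertions from the diff hold.
mock_pdf_doc = MagicMock()
flatten_pages(mock_pdf_doc, [0])
mock_pdf_doc.init_forms.assert_called_once()
mock_page = mock_pdf_doc.__getitem__.return_value
mock_page.flatten.assert_called_once()

# Several pages: every index yields the same mock page, so the calls
# accumulate and a call-count check is the appropriate assertion.
multi_page_doc = MagicMock()
flatten_pages(multi_page_doc, [0, 1, 2])
shared_page = multi_page_doc.__getitem__.return_value
flatten_call_count = shared_page.flatten.call_count  # 3
```

If the test fixture ever renders more than one page, comparing `call_count` to the number of rendered pages would be the equivalent check.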

patterns/unified/src/bda_processresults_function/index.py

Lines changed: 5 additions & 0 deletions
@@ -165,6 +165,11 @@ def create_pdf_page_images(bda_result_bucket, output_bucket, object_key):
     for page_num in range(len(pdf_document)):
         # Render page to a PIL image
         page = pdf_document[page_num]
+        # Flatten form fields into page content before rendering.
+        # Many fillable PDFs (e.g., government forms) lack appearance
+        # streams for form fields — flatten() forces PDFium to generate
+        # them and merge into page content so render() can display them.
+        page.flatten()
         pil_img = page.render(scale=150 / 72).to_pil()
 
         # Save the image to a BytesIO object as JPEG
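
The `scale=150 / 72` argument in this hunk follows from PDF user space defaulting to 72 points per inch, so the factor corresponds to roughly 150 DPI. Below is a small sketch of that arithmetic. The helper name and the US Letter dimensions are illustrative assumptions, and the renderer's exact rounding may differ.

```python
POINTS_PER_INCH = 72  # default PDF user-space unit


def expected_pixel_size(width_pt: float, height_pt: float, dpi: int = 150):
    """Approximate rendered size in pixels for a page measured in points."""
    return (
        round(width_pt * dpi / POINTS_PER_INCH),
        round(height_pt * dpi / POINTS_PER_INCH),
    )


# US Letter is 612 x 792 points (8.5 x 11 inches).
letter_size = expected_pixel_size(612, 792)  # (1275, 1650)
```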
