
Commit 7c1d21d

feat(ocr): initialize form rendering for fillable PDFs in OcrService (#241)
- **Fillable PDF form fields missing from rendered page images** — Fixed bug where fillable PDF form fields (text inputs, checkboxes, radio buttons, dropdowns) were not rendered in page images, causing OCR and extraction to miss user-entered data. Root cause: pypdfium2's `render(may_draw_forms=True)` requires `PdfDocument.init_forms()` to be called first to initialize the form rendering engine. Added `init_forms()` call in both Pattern 2 (`OcrService`) and Pattern 1 (`create_pdf_page_images`) PDF rendering pipelines. ([#240](https://github.com/aws-solutions-library-samples/accelerated-intelligent-document-processing-on-aws/issues/240))
1 parent c237721 commit 7c1d21d

4 files changed

Lines changed: 61 additions & 0 deletions


CHANGELOG.md

Lines changed: 2 additions & 0 deletions
@@ -56,6 +56,8 @@ SPDX-License-Identifier: MIT-0
 
 ### Fixed
 
+- **Fillable PDF form fields missing from rendered page images** — Fixed bug where fillable PDF form fields (text inputs, checkboxes, radio buttons, dropdowns) were not rendered in page images, causing OCR and extraction to miss user-entered data. Root cause: pypdfium2's `render(may_draw_forms=True)` requires `PdfDocument.init_forms()` to be called first to initialize the form rendering engine. Added `init_forms()` call in both Pattern 2 (`OcrService`) and Pattern 1 (`create_pdf_page_images`) PDF rendering pipelines. ([#240](https://github.com/aws-solutions-library-samples/accelerated-intelligent-document-processing-on-aws/issues/240))
+
 - **Discovery subscription handler dropping errorMessage and other fields** — Fixed bug where the UI subscription handler did `{ ...oldJob, status: updatedJob.status }`, discarding all fields except status from real-time subscription updates. Error messages, discovered class names, and status messages were being sent by the backend but silently dropped by the UI. Now spreads all fields: `{ ...oldJob, ...updatedJob }`.
 
 - **Discovery processor S3 race condition causing NoSuchKey failures** — The discovery upload resolver sends the SQS message before the browser finishes uploading the file to S3 via presigned POST. Previously worked around with a hardcoded `time.sleep(30)`. Replaced with `_wait_for_s3_object()` that polls S3 with exponential backoff (2s initial, 10s max, 60s timeout), proceeding as soon as the file appears.
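
The polling behaviour described in the last changelog entry can be sketched as follows. This is not the project's `_wait_for_s3_object()`; it is a minimal illustration that assumes the delay doubles on each attempt (the changelog gives only the 2s initial delay, 10s cap, and 60s timeout). The names `wait_for_object`, `exists`, `sleep`, and `clock` are hypothetical, and the existence check is passed in as a callable so the sketch does not depend on any AWS client.

```python
import time
from typing import Callable


def wait_for_object(
    exists: Callable[[], bool],
    initial_delay: float = 2.0,
    max_delay: float = 10.0,
    timeout: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `exists` until it returns True or `timeout` seconds elapse.

    Returns True as soon as the object is seen, False on timeout.
    The doubling factor is an assumption, not taken from the commit.
    """
    deadline = clock() + timeout
    delay = initial_delay
    while True:
        if exists():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        sleep(min(delay, remaining))
        # Exponential backoff, capped at max_delay.
        delay = min(delay * 2, max_delay)
```

In a real handler, `exists` would presumably wrap an S3 existence check such as a `head_object` call that treats a not-found error as `False`; injecting `sleep` and `clock` keeps the loop testable without real waiting.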

lib/idp_common_pkg/idp_common/ocr/service.py

Lines changed: 4 additions & 0 deletions
@@ -388,6 +388,10 @@ def process_document(self, document: Document) -> Document:
         if is_pdf:
             # Determine which pages need processing (retry-safe: skip completed pages)
             pdf_document = pdfium.PdfDocument(file_content)
+            # Initialize form rendering engine so fillable PDF form fields
+            # (text inputs, checkboxes, etc.) appear in rendered page images.
+            # Without this, may_draw_forms=True in render() has no effect.
+            pdf_document.init_forms()
             num_pages = len(pdf_document)
             document.num_pages = num_pages
 
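
The ordering requirement behind this change (initialize forms first, then load and render pages) can be illustrated with a small helper. This is a sketch, not code from the repository: `render_pages_with_forms` is a hypothetical name, and the argument is assumed to behave like an opened `pypdfium2.PdfDocument` (supporting `init_forms()`, `len()`, indexing, and pages with `render()`).

```python
from typing import Any, List


def render_pages_with_forms(pdf_document: Any, scale: float = 2.0) -> List[Any]:
    """Render every page of an already-opened document, including form fields.

    `pdf_document` is assumed to expose the pypdfium2 PdfDocument interface.
    Returns the list of rendered bitmaps, one per page.
    """
    # Initialize the form environment before any page is loaded; per the
    # commit, may_draw_forms=True has no effect without this call.
    pdf_document.init_forms()

    bitmaps = []
    for index in range(len(pdf_document)):
        page = pdf_document[index]
        # may_draw_forms=True is already the default; spelled out for clarity.
        bitmaps.append(page.render(scale=scale, may_draw_forms=True))
    return bitmaps
```

With pypdfium2 installed, the helper would be called as `render_pages_with_forms(pdfium.PdfDocument(file_content))`, mirroring how `pdf_document` is constructed in the hunk above.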

lib/idp_common_pkg/tests/unit/ocr/test_ocr_service.py

Lines changed: 51 additions & 0 deletions
@@ -302,6 +302,57 @@ def test_init_with_preprocessing_config(self):
 
         assert service.preprocessing_config == preprocessing_config
 
+    @patch("boto3.client")
+    @patch("idp_common.ocr.service.pdfium.PdfDocument")
+    def test_process_document_calls_init_forms_for_fillable_pdfs(
+        self, mock_pdfium_doc, mock_boto_client, mock_document, mock_pdf_content
+    ):
+        """Test that init_forms() is called on PDF documents to enable fillable form field rendering.
+
+        Fillable PDFs (AcroForm) have form fields like text inputs, checkboxes, and
+        radio buttons stored as separate overlay layers. Without calling init_forms(),
+        pypdfium2's render() will not include these form field values in the output
+        image even when may_draw_forms=True (the default). This test ensures we
+        always initialize the form rendering engine before rendering pages.
+
+        Regression test for: https://github.com/aws-solutions-library-samples/accelerated-intelligent-document-processing-on-aws/issues/240
+        """
+        # Mock S3 client
+        mock_s3_client = MagicMock()
+        mock_s3_client.get_object.return_value = {"Body": BytesIO(mock_pdf_content)}
+        mock_boto_client.return_value = mock_s3_client
+
+        # Mock PDF document
+        mock_pdf_doc = MagicMock()
+        mock_pdf_doc.__len__.return_value = 1
+        mock_pdf_doc.__iter__.return_value = iter(range(1))
+        mock_pdfium_doc.return_value = mock_pdf_doc
+
+        with (
+            patch(
+                "idp_common.ocr.service.OcrService._extract_page_image"
+            ) as mock_extract,
+            patch(
+                "idp_common.ocr.service.OcrService._process_page_with_image"
+            ) as mock_process,
+        ):
+            mock_extract.return_value = b"image_data"
+            mock_process.return_value = (
+                {
+                    "raw_text_uri": "s3://output/raw.json",
+                    "parsed_text_uri": "s3://output/parsed.json",
+                    "text_confidence_uri": "s3://output/confidence.json",
+                    "image_uri": "s3://output/image.jpg",
+                },
+                {"OCR/textract/detect_document_text": {"pages": 1}},
+            )
+
+            service = OcrService()
+            service.process_document(mock_document)
+
+            # Verify init_forms() was called to enable fillable PDF form rendering
+            mock_pdf_doc.init_forms.assert_called_once()
+
     @patch("boto3.client")
     @patch("idp_common.ocr.service.pdfium.PdfDocument")
     def test_process_document_success(

patterns/unified/src/bda_processresults_function/index.py

Lines changed: 4 additions & 0 deletions
@@ -156,6 +156,10 @@ def create_pdf_page_images(bda_result_bucket, output_bucket, object_key):
 
     # Open the PDF using pypdfium2
     pdf_document = pdfium.PdfDocument(pdf_content)
+    # Initialize form rendering engine so fillable PDF form fields
+    # (text inputs, checkboxes, etc.) appear in rendered page images.
+    # Without this, may_draw_forms=True in render() has no effect.
+    pdf_document.init_forms()
 
     # Process each page
     for page_num in range(len(pdf_document)):
