Empty Extraction for PDF Documents

**Describe the bug**
When I process the attached PDF, most extracted fields are incorrectly empty, and the single-page thumbnails are missing most of the filled-in data.

**To Reproduce**
Steps to reproduce the behavior:
1. Deploy a fresh IDP.
2. Click 'Discovery'. Upload `VAF 21-22a_example_1.pdf`.
3. After discovery is finished, ensure that it created a new document class.
4. Go to 'Upload Document(s).' Upload `VAF 21-22a_example_1.pdf`. Use the default config. Click 'Upload'.
5. Once the document status is COMPLETED, click on the document.
6. Under 'Document Sections', click 'View Data'. Under 'Visual Editor', you'll see that most of the entered data is missing from both the 'Document Pages' (the images) and 'Document Data' (the dropdowns).

**Expected behavior**
- Thumbnail matches the original
- The extraction is mostly, if not 100%, correct

**Screenshots**

<img width="921" height="792" alt="Image" src="https://github.com/user-attachments/assets/0e02e276-5e2c-4a5c-ac02-c1d05c9b48e8" />

<img width="1767" height="668" alt="Image" src="https://github.com/user-attachments/assets/6d0195db-c50f-4e6d-a740-8f906b2e2b71" />

[VAF 21-22a_example_1.pdf](https://github.com/user-attachments/files/26036830/VAF.21-22a_example_1.pdf)

[VAF_21-22a_example_1.pdf_section1_prediction.json](https://github.com/user-attachments/files/26036836/VAF_21-22a_example_1.pdf_section1_prediction.json)

**AWS Region**
us-east-1

**Accelerator Processing Mode**
Pipeline mode

**Accelerator Version / Build**
Which Version did you deploy?  0.5.2

**Custom Stack Parameters**
What non-default stack parameters did you configure when deploying? None

**Custom Configurations**
Ran a discovery job, which created a new document class. See the reproduction steps above.

**Output of the 'TroubleShoot' agent (if issue is a document processing failure)**
```
Summary of Findings
The Critical Problem: While all processing stages completed successfully (Classification, Extraction, OCR, Summarization, etc.), the extraction is completely failing to extract key personal information - specifically the VeteranInformation, ClaimantInformation, and ServiceOrganizationInformation sections show 0.00 confidence scores across all fields.

Specific Issues Identified:
Complete Data Extraction Failure on Key Sections:

VeteranInformation: 0/13 fields extracted (0% success rate)
ClaimantInformation: 0/5 fields extracted (0% success rate)
ServiceOrganizationInformation: 0/7 fields extracted (0% success rate)
All confidence scores are 0.00 across these critical sections
Partial Success on Secondary Sections:

AuthorizationInformation: 3/3 fields extracted successfully (95-92% confidence)
ConditionsOfAppointment: 4/4 fields extracted successfully (99% confidence)
FeesAndPenalties: 2/2 fields extracted successfully (99% confidence)
Output Image Issue: The fact that output images have "removed all inputted information" suggests that the extraction model is likely struggling with the document layout or image quality, potentially:

Document orientation issues
Low image quality or resolution
Text not being properly recognized by the OCR stage
Potential mismatch between document structure and the VA Form 21-22a schema
Recommended Next Steps:
Verify the input PDF quality - Check if the document is clear, properly scanned, and in portrait orientation
Review the OCR output - The OCR stage completed but may have failed to properly read the personal information fields
Check document structure - Ensure the VAF form follows the standard VA Form 21-22a format
Re-upload the document - Try processing the document again to see if this was a temporary extraction failure
```

The PDF is of normal quality, and everything is typed, so there are no text-quality issues. The failure seems to be in the OCR stage, which extracts nothing. When I run the same document through Textract directly, all of the text is extracted properly.
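To cross-check the per-section counts in the agent output against the attached prediction JSON, a small script along these lines may help. It is only a sketch: the layout of the prediction file is not documented in this issue, so the helper names (`is_empty`, `count_leaves`, `summarize`) and the sample data are hypothetical, and you would need to pass in whichever sub-object of the JSON holds the extracted sections.

```python
from typing import Any


def is_empty(value: Any) -> bool:
    """Treat None, blank strings, and empty containers as 'empty'."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip() == ""
    if isinstance(value, (list, dict)):
        return len(value) == 0
    return False


def count_leaves(node: Any) -> tuple[int, int]:
    """Return (filled, total) leaf counts under a nested JSON value."""
    if isinstance(node, dict) and node:
        pairs = [count_leaves(v) for v in node.values()]
    elif isinstance(node, list) and node:
        pairs = [count_leaves(v) for v in node]
    else:
        # Scalars and empty containers are counted as a single leaf.
        return (0 if is_empty(node) else 1, 1)
    return (sum(f for f, _ in pairs), sum(t for _, t in pairs))


def summarize(sections: dict[str, Any]) -> dict[str, tuple[int, int]]:
    """Map each top-level section name to its (filled, total) counts."""
    return {name: count_leaves(body) for name, body in sections.items()}


# Hypothetical sample; real field names and nesting may differ.
sample = {
    "VeteranInformation": {"FirstName": None, "LastName": ""},
    "FeesAndPenalties": {"FeeType": "None", "Penalty": "See form"},
}
for name, (filled, total) in summarize(sample).items():
    print(f"{name}: {filled}/{total} fields filled")
```

For the real file, load it with `json.load` and call `summarize` on the object that contains the section dictionaries.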

FWIW, I found another bug with the 'Agent Companion Chat' during this conversation. If you leave the page while the agent is responding, and then you return to the page, it's frozen.

**Link to DeepWiki answer**
https://deepwiki.com/search/when-i-uploaded-vaf2122aexampl_b6924ee9-2536-4fb5-aadd-1bca24850784?mode=fast

I tried adding `["TABLES", "FORMS", "SIGNATURES", "LAYOUT"]` to the OCR Features in the configuration, but it didn't help.
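For anyone who wants to repeat the direct Textract comparison outside the accelerator, the following is a minimal sketch using boto3's `start_document_analysis` with the same feature list. The bucket and key are placeholders, the helper names are hypothetical, and this is not how the accelerator itself invokes OCR. It assumes the PDF has already been uploaded to S3 and that credentials are configured.

```python
from typing import Any

# Feature types accepted by Textract document analysis.
# QUERIES also requires a QueriesConfig, which this sketch does not build.
VALID_FEATURES = {"TABLES", "FORMS", "QUERIES", "SIGNATURES", "LAYOUT"}


def build_analysis_request(bucket: str, key: str, features: list[str]) -> dict[str, Any]:
    """Build keyword arguments for Textract's StartDocumentAnalysis call."""
    unknown = set(features) - VALID_FEATURES
    if unknown:
        raise ValueError(f"Unsupported feature types: {sorted(unknown)}")
    return {
        "DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": key}},
        "FeatureTypes": list(features),
    }


def start_analysis(bucket: str, key: str, features: list[str],
                   region: str = "us-east-1") -> str:
    """Start an asynchronous analysis job and return its JobId."""
    import boto3  # imported lazily so the request builder works without AWS access

    client = boto3.client("textract", region_name=region)
    response = client.start_document_analysis(
        **build_analysis_request(bucket, key, features)
    )
    return response["JobId"]


# Placeholder bucket and key; replace with the real upload location.
request = build_analysis_request(
    "my-example-bucket",
    "VAF 21-22a_example_1.pdf",
    ["TABLES", "FORMS", "SIGNATURES", "LAYOUT"],
)
```

The job runs asynchronously, so the results would then be fetched with `get_document_analysis` using the returned `JobId`.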



Empty Extraction for PDF Documents #240
