Empty Extraction for PDF Documents #240

@adam-weinberger

Describe the bug
When I process the attached PDFs, most fields are incorrectly empty, and the single-page thumbnails are missing most of the filled-in data.

To Reproduce
Steps to reproduce the behavior:

  1. Deploy a fresh IDP.
  2. Click 'Discovery'. Upload VAF 21-22a_example_1.pdf.
  3. After discovery is finished, ensure that it created a new document class.
  4. Go to 'Upload Document(s).' Upload VAF 21-22a_example_1.pdf. Use the default config. Click 'Upload'.
  5. Once the document status is COMPLETED, click on the document.
  6. Under 'Document Sections', click 'View Data'. Under 'Visual Editor', you'll see that most of the inputted data is missing from both the 'Document Pages' (the images) and 'Document Data' (the drop downs).

Expected behavior

  • Thumbnail matches the original
  • The extraction is mostly, if not 100%, correct

Screenshots

Image Image

VAF 21-22a_example_1.pdf

VAF_21-22a_example_1.pdf_section1_prediction.json

AWS Region
us-east-1

Accelerator Processing Mode
Pipeline mode

Accelerator Version / Build
Which Version did you deploy? 0.5.2

Custom Stack Parameters
What non-default stack parameters did you configure when deploying? None

Custom Configurations
Ran a discovery job. Created a new class. See above

Output of the 'TroubleShoot' agent (if issue is a document processing failure)

Summary of Findings
The Critical Problem: While all processing stages completed successfully (Classification, Extraction, OCR, Summarization, etc.), the extraction is completely failing to extract key personal information - specifically the VeteranInformation, ClaimantInformation, and ServiceOrganizationInformation sections show 0.00 confidence scores across all fields.

Specific Issues Identified:

Complete Data Extraction Failure on Key Sections:

  • VeteranInformation: 0/13 fields extracted (0% success rate)
  • ClaimantInformation: 0/5 fields extracted (0% success rate)
  • ServiceOrganizationInformation: 0/7 fields extracted (0% success rate)
  • All confidence scores are 0.00 across these critical sections

Partial Success on Secondary Sections:

  • AuthorizationInformation: 3/3 fields extracted successfully (95-92% confidence)
  • ConditionsOfAppointment: 4/4 fields extracted successfully (99% confidence)
  • FeesAndPenalties: 2/2 fields extracted successfully (99% confidence)

Output Image Issue: The fact that output images have "removed all inputted information" suggests that the extraction model is likely struggling with the document layout or image quality, potentially:

  • Document orientation issues
  • Low image quality or resolution
  • Text not being properly recognized by the OCR stage
  • Potential mismatch between document structure and the VA Form 21-22a schema

Recommended Next Steps:

  1. Verify the input PDF quality - Check if the document is clear, properly scanned, and in portrait orientation
  2. Review the OCR output - The OCR stage completed but may have failed to properly read the personal information fields
  3. Check document structure - Ensure the VAF form follows the standard VA Form 21-22a format
  4. Re-upload the document - Try processing the document again to see if this was a temporary extraction failure
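
For anyone triaging, the per-section counts in the agent summary above could be recomputed from a prediction file with something like the sketch below. It assumes, hypothetically, that the extracted data can be reduced to a mapping of section name to `{field: value}`; the real layout of `VAF_21-22a_example_1.pdf_section1_prediction.json` may differ, and the function name and the sample field names are made up for illustration.

```python
from typing import Any, Dict, Mapping


def is_filled(value: Any) -> bool:
    """Treat None and blank/whitespace-only strings as 'not extracted'."""
    return value is not None and str(value).strip() != ""


def section_fill_rates(sections: Mapping[str, Mapping[str, Any]]) -> Dict[str, str]:
    """Return 'filled/total' for each section (hypothetical input layout)."""
    rates: Dict[str, str] = {}
    for name, fields in sections.items():
        filled = sum(1 for value in fields.values() if is_filled(value))
        rates[name] = f"{filled}/{len(fields)}"
    return rates


# Made-up sample data, only to show the call shape (not taken from the real file).
sample = {
    "VeteranInformation": {"FirstName": None, "LastName": ""},
    "FeesAndPenalties": {"FeeType": "No fee"},
}
print(section_fill_rates(sample))
```

With the real file, the loaded JSON would first have to be mapped into that section-to-fields shape.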

The PDF is normal quality. Everything is typed, so there are no text quality issues. The failure seems to be in the OCR stage, which extracts nothing. When I run the same document through Amazon Textract directly, all of the text is extracted properly.

FWIW, I found another bug with the 'Agent Companion Chat' during this conversation. If you leave the page while the agent is responding and then return to it, the chat is frozen.

Link to DeepWiki answer
https://deepwiki.com/search/when-i-uploaded-vaf2122aexampl_b6924ee9-2536-4fb5-aadd-1bca24850784?mode=fast

I tried adding ["TABLES", "FORMS", "SIGNATURES", "LAYOUT"] to the Features, but it didn't help.
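
For reference, here is a rough sketch of how the direct Textract check could be repeated outside the pipeline with the same feature list. It uses boto3's `analyze_document` on a single page and then counts the returned blocks; the helper names `analyze_page` and `summarize_blocks` are made up for this example. As far as I know, the synchronous call only accepts single-page input, so a multi-page PDF would need the asynchronous API or one call per page. This is not how the accelerator itself invokes Textract.

```python
from collections import Counter
from typing import Any, Dict

# Same feature list that was tried in the OCR configuration.
FEATURES = ["TABLES", "FORMS", "SIGNATURES", "LAYOUT"]


def analyze_page(page_bytes: bytes, region: str = "us-east-1") -> Dict[str, Any]:
    """Call Textract AnalyzeDocument on one page (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so summarize_blocks works without boto3 installed

    client = boto3.client("textract", region_name=region)
    return client.analyze_document(
        Document={"Bytes": page_bytes},
        FeatureTypes=FEATURES,
    )


def summarize_blocks(response: Dict[str, Any]) -> Dict[str, Any]:
    """Count blocks by type and collect the text of LINE blocks."""
    blocks = response.get("Blocks", [])
    counts = Counter(block["BlockType"] for block in blocks)
    lines = [b["Text"] for b in blocks if b["BlockType"] == "LINE" and b.get("Text")]
    return {"block_counts": dict(counts), "lines": lines}
```

Comparing `summarize_blocks` for the original page against the page image shown under 'Document Pages' may help narrow down where the typed values are lost.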
