Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. SPDX-License-Identifier: MIT-0
This pattern implements an intelligent document processing workflow that uses UDOP (Unified Document Processing) for page classification and grouping, followed by Claude for information extraction.
See Fine-Tuning Models on SageMaker
Once you have trained the model, deploy the GenAIIDP stack for Pattern-3 using the path for your new fine tuned model.
The workflow consists of three main processing steps:
- OCR processing using Amazon Textract
- Page classification and grouping using a UDOP model deployed on SageMaker
- Field extraction using Claude via Amazon Bedrock
The Step Functions state machine (workflow.asl.json) orchestrates the following flow:
OCRStep → ClassificationStep → ProcessPageGroups (Map State for Extraction)
Each step includes comprehensive retry logic for handling transient errors:
- Initial retry after 2 seconds
- Exponential backoff with rate of 2
- Maximum of 10 retry attempts
- Purpose: Processes input PDFs using Amazon Textract
- Input:
{ "execution_arn": "<ARN>", "output_bucket": "<BUCKET>", "input": { "detail": { "bucket": { "name": "<BUCKET>" }, "object": { "key": "<KEY>" } } } } - Output:
{ "metadata": { "input_bucket": "<BUCKET>", "object_key": "<KEY>", "output_bucket": "<BUCKET>", "output_prefix": "<PREFIX>", "num_pages": "<NUMBER OF PAGES>" }, "pages": { "<PAGE_NUMBER>": { "rawTextUri": "<S3_URI>", "parsedTextUri": "<S3_URI>", "imageUri": "<S3_URI>" } } }
- Purpose: Classifies pages using UDOP model on SageMaker, and segments into sections using class boundaries
- Input: Output from OCR function plus output bucket
- Output:
{ "metadata": "<FROM_OCR>", "sections": [ { "id": "<GROUP_ID>", "class": "<CLASS>", "pages": [...] } ] }
- Purpose: Extracts fields using Claude via Amazon Bedrock
- Input: Individual section from Classification output
- Output:
{ "section": { "id": "<ID>", "class": "<CLASS>", "page_ids": ["<PAGEID>", "..."], "outputJSONUri": "<S3_URI>" }, "pages": [ { "Id": "<ID>", "Class": "<CLASS>", "TextUri": "<S3_URI>", "ImageUri": "<S3_URI>" } ] }
- Purpose: Aggregates results for all sections
- Input: Extraction output from each section extraction
- Output: Consumed by the GenAIIDP parent stack workflow tracker to update job status/UI etc
{ "Sections": [ { "Id": "<ID>", "PageIds": ["<PAGEID>", "..."], "Class": "<CLASS>", "OutputJSONUri": "<S3_URI>" } ], "Pages": [ { "Id": "<ID>", "Class": "<CLASS>", "TextUri": "<S3_URI>", "ImageUri": "<S3_URI>" } ], "PageCount": "<TOTAL_PAGES>" }
The pattern includes a complete UDOP model deployment:
- SageMaker Endpoint:
sagemaker_classifier_endpoint.yamlprovisions:- SageMaker model
- Endpoint configuration
- Endpoint with auto-scaling
- IAM roles and permissions
To create a new UDOP model fine tuned for your data, see Fine tuning a UDOP model.
The pattern includes a comprehensive CloudWatch dashboard with:
- Document and page throughput (per minute)
- Token usage (input/output/total)
- Bedrock request statistics and latency
- SageMaker endpoint invocation metrics
- Processing latencies by function
- Lambda function errors with detailed logs
- Long-running invocations with duration metrics
- Classification/extraction failures
- API throttling and retry metrics
- SageMaker endpoint errors
- Duration (average, p90, maximum)
- Memory usage and utilization
- Error rates and exception counts
- Concurrent execution metrics
The pattern exports these outputs to the parent stack:
StateMachineName: Name of Step Functions state machineStateMachineArn: ARN of Step Functions state machineStateMachineLogGroup: CloudWatch log group for state machineDashboardName: Name of pattern-specific dashboardDashboardArn: ARN of pattern-specific dashboard
Stack Deployment Parameters:
UDOPModelArtifactPath: S3 path to UDOP model artifacts (see Fine tuning a UDOP model)- Summarization: Control summarization via configuration file
summarization.enabledproperty (replacesIsSummarizationEnabledparameter) ConfigurationDefaultS3Uri: Optional S3 URI to custom configuration (uses default configuration if not specified)MaxConcurrentWorkflows: Workflow concurrency limitLogRetentionDays: CloudWatch log retention periodExecutionTimeThresholdMs: Latency threshold for alerts
Configuration Management:
- Model selection for extraction is now handled through configuration files rather than CloudFormation parameters
- Configuration supports multiple presets per pattern (e.g., default, checkboxed_attributes_extraction, medical_records_summarization)
- Configuration can be updated through the Web UI without stack redeployment
- Extraction model choices are constrained through enum constraints in the configuration schema
- UDOP classification model is still specified via CloudFormation parameter due to SageMaker endpoint requirements
The system uses a combination of prompt engineering and predefined attributes to extract information from documents. You can customize both to match your specific document types and extraction needs.
The extraction prompts are now defined in the configuration files rather than CloudFormation template. The structure is similar to the one shown below:
DEFAULT_SYSTEM_PROMPT = "You are a document assistant. Respond only with JSON..."
BASELINE_PROMPT = """
<background>
You are an expert in business document analysis and information extraction. You can understand and extract key information from various types of business documents including letters, memos, financial documents, scientific papers, news articles, advertisements, emails, forms, handwritten notes, invoices, purchase orders, questionnaires, resumes, scientific publications, and specifications...
</background>
...To modify the extraction behavior:
- Modify the configuration settings through the Web UI or configuration files
- Edit the
system_promptto change the AI assistant's basic behavior - Customize the
task_promptto:- Provide domain expertise for your document types
- Add specific instructions for handling edge cases
- Modify output formatting requirements
- Extraction model selection is now handled through configuration files with enum constraints
- Available models are defined in the configuration schema
- Changes can be made through the Web UI without redeployment
- Classification model (UDOP) is still specified via CloudFormation parameter due to SageMaker endpoint requirements
Attributes to be extracted are defined in the configuration files' classes section using JSON Schema format. The structure is similar to the example below:
Example attribute definition:
classes:
- $schema: "https://json-schema.org/draft/2020-12/schema"
$id: Letter
x-aws-idp-document-type: Letter
type: object
description: A formal written message that is typically sent from one person to another
properties:
SenderName:
type: string
description: The name of the person or entity who wrote or sent the letter. Look for text following or near terms like 'from', 'sender', 'authored by', 'written by', or at the end of the letter before a signature.
SenderAddress:
type: string
description: The physical address of the sender, typically appearing at the top of the letter. May be labeled as 'address', 'location', or 'from address'.
RecipientName:
type: string
description: The name of the person or entity receiving the letter. Look for this after 'to', 'recipient', 'addressee', or at the beginning of the letter.
RecipientAddress:
type: string
description: The physical address where the letter is to be delivered. Often labeled as 'to address' or 'delivery address', typically appearing below the recipient name.
- $schema: "https://json-schema.org/draft/2020-12/schema"
$id: Form
x-aws-idp-document-type: Form
type: object
description: A document with blank spaces for filling in information
properties:
FormType:
type: string
description: The category or purpose of the form, such as 'application', 'registration', 'request', etc. May be identified by 'form name', 'document type', or 'form category'.
FormId:
type: string
description: The unique identifier for the form, typically a number or alphanumeric code. Often labeled as 'form number', 'id', or 'reference number'.To customize attributes:
- Modify the
classessection in the configuration files or through the Web UI - For each attribute, provide a clear name and detailed description
- Changes are applied immediately without requiring function redeployment
Note: Configuration changes through the Web UI take effect immediately for new document processing jobs.
Use the provided test events in the testing/ directory:
# Test OCR function
sam local invoke OCRFunction --env-vars testing/env.json -e testing/OCRFunction-event.json
# Test classification
sam local invoke ClassificationFunction --env-vars testing/env.json -e testing/ClassificationFunctionEvent.json
# Test extraction
sam local invoke ExtractionFunction --env-vars testing/env.json -e testing/ExtractionFunction-event.jsonNote that for proper testing of the classification function, you'll need access to a running SageMaker endpoint with the UDOP model deployed.
-
Configuration Management:
- Use the configuration library for different use cases (default, medical_records, etc.)
- Test configuration changes thoroughly before production deployment
- Leverage the Web UI for extraction configuration updates without redeployment
-
SageMaker Endpoint Management:
- Configure appropriate instance type based on model size and throughput requirements
- Enable auto-scaling for cost optimization during varying loads
- Monitor endpoint performance metrics for potential bottlenecks
-
Retry Handling:
- All functions implement exponential backoff with jitter
- Configure appropriate retry limits for different types of failures
- Handle SageMaker endpoint throttling differently from transient failures
-
Performance Optimization:
- Pre-warm the SageMaker endpoint with sample requests during initialization
- Configure appropriate memory for Lambda functions based on document size
- Use efficient image preprocessing and data handling techniques
-
Monitoring:
- Set up CloudWatch alarms for critical metrics
- Use detailed logging with correlation IDs for request tracking
- Monitor token usage and cost metrics for GenAI components
-
Security:
- Use KMS encryption for data at rest and in transit
- Implement least privilege IAM roles for all components
- Protect SageMaker endpoints with appropriate security groups
-
Error Handling:
- Implement graceful degradation strategies
- Provide clear error messages with actionable information
- Track error rates by category for targeted improvements
