# Troubleshooting Guide
Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. SPDX-License-Identifier: MIT-0
This guide provides solutions for common issues and optimization techniques for the GenAIIDP solution.
For automated troubleshooting, use the Error Analyzer tool:
- What it is: AI-powered agent that automatically diagnoses document processing failures
- When to use: Document-specific failures, system-wide error patterns, performance issues
- How to access: Web UI → Failed document → Troubleshoot button
- Documentation: See Error Analyzer for complete guide
**Quick Start:**

```text
# Document-specific analysis
Query: "document: filename.pdf"

# System-wide analysis
Query: "Show recent processing errors"
```
The Error Analyzer automatically:
- Searches CloudWatch Logs across all Lambda functions
- Correlates errors with DynamoDB tracking data
- Identifies root causes with AI reasoning
- Provides actionable recommendations
For issues not covered by the Error Analyzer, use the manual troubleshooting steps below.
## Document Processing Issues

| Issue | Resolution |
|---|---|
| Workflow execution fails | Check CloudWatch logs for specific error messages. Look in the Step Functions execution history to identify which step failed. |
| PDF document not processing | Verify the PDF is not password protected or encrypted. Ensure it's not corrupted by opening it in another application. |
| OCR fails on document | Check if the document is scanned at sufficient quality. Verify the document doesn't exceed size limits (typically 5MB for Textract). |
| Classification returns "other" | Review document class definitions. Consider adding more detailed class descriptions or adding few-shot examples. |
| Extraction missing fields | Review attribute descriptions and prompt engineering. Check if fields are present but in an unusual format or location. |
## Web UI Issues

| Issue | Resolution |
|---|---|
| Cannot login to Web UI | Verify Cognito user status and permissions in AWS Console. Check email for temporary credentials if first-time login. |
| Web UI loads but shows errors | Check browser console for specific error messages. Verify API endpoints are accessible. |
| Cannot see document history | Verify AWS AppSync API permissions. Check CloudWatch Logs for API errors. |
| Configuration changes not saving | Check browser console for validation errors. Verify that the configuration Lambda function has correct permissions. |
## Model and Performance Issues

| Issue | Resolution |
|---|---|
| Bedrock model throttling | Check CloudWatch metrics for throttling events. Consider increasing MaxConcurrentWorkflows parameter or requesting service quota increases. |
| SageMaker endpoint errors | Verify endpoint status in SageMaker console. Check endpoint logs for specific error messages. |
| Slow document processing | Monitor CloudWatch metrics to identify bottlenecks. Consider optimizing model selection or increasing concurrency limits. |
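Throttling failures are usually transient, so model invocations can be wrapped in retry logic while a quota increase is pending. A minimal sketch of exponential backoff with jitter — the function names are illustrative, not part of the solution, and real code should catch `botocore.exceptions.ClientError` and inspect the error code rather than matching on the message:

```python
import random
import time

def call_with_backoff(invoke, max_retries=5, base_delay=1.0):
    """Retry a throttled call with exponential backoff and full jitter.

    `invoke` is any zero-argument callable, e.g. a wrapper around
    bedrock_runtime.invoke_model.
    """
    for attempt in range(max_retries):
        try:
            return invoke()
        except Exception as exc:
            # Give up on non-throttling errors or on the last attempt
            if "Throttling" not in str(exc) or attempt == max_retries - 1:
                raise
            # Sleep a random duration in [0, base_delay * 2^attempt]
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
```

Backoff reduces pressure on the shared quota, but it does not replace right-sizing `MaxConcurrentWorkflows` or requesting a service quota increase.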
## Infrastructure Issues

| Issue | Resolution |
|---|---|
| Lambda function timeouts | Increase function timeout or memory allocation. Consider breaking processing into smaller chunks. |
| DynamoDB capacity exceeded | Check CloudWatch metrics for throttling. Consider increasing provisioned capacity or switching to on-demand capacity. |
| DynamoDB config upload fails: "Item size has exceeded the maximum allowed size" | This error occurred in versions prior to the compression fix when configurations had ~45+ document classes, exceeding DynamoDB's 400KB item limit. Solution: Upgrade to the latest version, which gzip-compresses configuration data (supporting 3,000+ classes). Existing configs auto-migrate on next write. See GitHub Issue #200. |
| S3 permission errors | Verify bucket policies and IAM role permissions. Check for cross-account access issues. |
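The gzip approach behind the configuration-size fix can be illustrated with a small round-trip sketch. This is an illustration of the technique, not the solution's actual code; repetitive JSON such as many similar document-class definitions compresses well, which is why far more classes fit under the 400 KB item limit:

```python
import base64
import gzip
import json

DDB_ITEM_LIMIT = 400 * 1024  # DynamoDB's 400 KB item size limit

def pack_config(config: dict) -> str:
    """Gzip-compress a JSON configuration and base64-encode it so it can be
    stored in a single DynamoDB string attribute."""
    raw = json.dumps(config).encode("utf-8")
    return base64.b64encode(gzip.compress(raw)).decode("ascii")

def unpack_config(blob: str) -> dict:
    """Reverse pack_config: base64-decode, decompress, parse JSON."""
    return json.loads(gzip.decompress(base64.b64decode(blob)))

# A large, repetitive config (many document classes) compresses dramatically
config = {"classes": [{"name": f"class_{i}", "description": "lorem ipsum " * 50}
                      for i in range(500)]}
packed = pack_config(config)
assert unpack_config(packed) == config
assert len(packed) < len(json.dumps(config))
```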
## Agent Analysis Issues

| Issue | Resolution |
|---|---|
| Agent query shows "processing failed" | Check CloudWatch logs for the Agent Processing Lambda function ({StackName}-AgentProcessorFunction-*). Look for specific error messages, timeout issues, or permission errors. |
| External MCP agent not appearing | Verify the External MCP Agents secret is properly configured with valid JSON array format. Check CloudWatch logs for agent registration errors. |
| Agent responses are incomplete | Check CloudWatch logs for token limits, model throttling, or timeout issues in the Agent Processing function. |
Optimize performance through proper resource sizing:

- **Lambda Memory**: Scale based on document complexity
  - OCR Function: 1024-2048 MB recommended
  - Classification/Extraction: 512-1024 MB for text-only, 1024-2048 MB for image-based processing
- **Timeouts**: Configure appropriate timeouts
  - Step Functions: 5-15 minutes for standard documents
  - Lambda functions: 1-3 minutes for individual processing steps
  - SQS visibility timeout: 5-6x Lambda function timeout
- **Concurrency Settings**
  - Set the `MaxConcurrentWorkflows` parameter based on expected volume
  - Consider Lambda reserved concurrency for critical functions
  - Monitor and adjust based on actual usage patterns
- **Document Size and Quality**
  - Optimize input document size (600-1200 DPI recommended for scans)
  - Reduce file size when possible without losing quality
  - Consider preprocessing large documents to split them
- **Model Selection**
  - Balance accuracy vs. speed based on use case requirements
  - Test different models with representative documents
  - Consider smaller models for simple documents, larger models for complex extraction
- **Batch Processing**
  - For high volumes, stagger document uploads
  - Use the load simulation scripts to test capacity
  - Monitor queue depth and processing latency
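The visibility-timeout rule of thumb above (5-6x the Lambda timeout, so in-flight messages are not redelivered while an invocation is still running) can be applied with a small helper; the function name is illustrative:

```python
def sqs_visibility_timeout(lambda_timeout_seconds: int, multiplier: int = 6) -> int:
    """Return the recommended SQS visibility timeout for a queue that
    triggers a Lambda with the given timeout (5-6x rule of thumb)."""
    return lambda_timeout_seconds * multiplier

# e.g. a 3-minute (180 s) processing Lambda -> 1080 s visibility timeout
print(sqs_visibility_timeout(180))  # 1080
```

The computed value can then be applied with `sqs.set_queue_attributes` or, preferably, set in the CloudFormation template so it stays consistent across deployments.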
If messages end up in a Dead Letter Queue:
- Review the messages in the DLQ using the AWS Console
- Check CloudWatch Logs for corresponding errors
- Fix the underlying issue (permission, configuration, etc.)
- Use the AWS SDK or Console to move messages back to the main queue:
```python
import boto3

sqs = boto3.client('sqs')

# Get messages from DLQ
response = sqs.receive_message(
    QueueUrl='dlq-url',
    MaxNumberOfMessages=10,
    VisibilityTimeout=30
)

# Move each message to the main queue, then delete it from the DLQ
for message in response.get('Messages', []):
    sqs.send_message(
        QueueUrl='main-queue-url',
        MessageBody=message['Body']
    )
    sqs.delete_message(
        QueueUrl='dlq-url',
        ReceiptHandle=message['ReceiptHandle']
    )
```

If too many workflows are running and need to be stopped:
- Use the provided script to stop workflows:

  ```bash
  ./scripts/stop_workflows.sh <stack-name> <pattern-name>
  ```

- Purge the SQS queue if needed:
  - Navigate to SQS in the AWS Console
  - Select the queue
  - Choose "Purge" from the Actions menu
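As an alternative to the script, running executions can also be stopped through the Step Functions API. A minimal sketch, assuming configured AWS credentials — the `name_pattern` filter is an illustrative addition, not something the stop script necessarily supports:

```python
def running_execution_arns(executions, name_pattern=""):
    """Pure helper: pick the ARNs of executions whose name contains
    name_pattern (an empty pattern matches every execution)."""
    return [e["executionArn"] for e in executions
            if name_pattern in e.get("name", "")]

def stop_running_workflows(state_machine_arn, name_pattern=""):
    """Stop all RUNNING executions of a state machine."""
    # boto3 is imported lazily so the pure helper above stays usable offline
    import boto3
    sfn = boto3.client("stepfunctions")
    stopped = 0
    for page in sfn.get_paginator("list_executions").paginate(
            stateMachineArn=state_machine_arn, statusFilter="RUNNING"):
        for arn in running_execution_arns(page["executions"], name_pattern):
            sfn.stop_execution(executionArn=arn, cause="Manual bulk stop")
            stopped += 1
    return stopped
```

Remember that stopping executions does not remove the corresponding messages from the queue; purge the queue as described above if the backlog should be discarded.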
If the WAF is blocking legitimate access:
- Check the `WAFAllowedIPv4Ranges` parameter value
- Update with correct CIDR blocks for allowed IP ranges
- Remember Lambda functions have automatic access regardless of WAF settings
For Cognito authentication problems:
- Verify user exists in Cognito User Pool
- Check user attributes (email verified, status)
- Reset user password if needed
- Review identity pool configuration
- Check browser console for specific authentication errors
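When checking user status and attributes, the response from Cognito's `AdminGetUser` API can be flattened into exactly the fields listed above. `summarize_cognito_user` is an illustrative helper, not part of the solution:

```python
def summarize_cognito_user(user: dict) -> dict:
    """Flatten an AdminGetUser response into the fields checked above."""
    attrs = {a["Name"]: a["Value"] for a in user.get("UserAttributes", [])}
    return {
        "status": user.get("UserStatus"),
        "enabled": user.get("Enabled", False),
        "email_verified": attrs.get("email_verified") == "true",
        "email": attrs.get("email"),
    }

# In real use (assumes configured AWS credentials):
#   import boto3
#   cognito = boto3.client("cognito-idp")
#   user = cognito.admin_get_user(UserPoolId=pool_id, Username=username)
#   print(summarize_cognito_user(user))
```

A `FORCE_CHANGE_PASSWORD` status with first-time logins, or `email_verified` being false, are the most common culprits behind failed sign-ins.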
- Throttling: Request quota increases or reduce concurrency
- Content Filtering: Review guardrail configuration if content is being filtered unexpectedly
- Prompt Issues: Test prompts directly in Bedrock console or notebook
- Region Availability: Verify model availability in your region
- Endpoint Cold Start: Consider using provisioned concurrency
- GPU Utilization: Monitor utilization and adjust instance type if needed
- Memory Errors: Check inference logs for out-of-memory errors
- Model Loading Errors: Verify model artifacts are correct
Use X-Ray tracing for advanced diagnostics:
- Enable X-Ray tracing in the CloudFormation template
- View service map in X-Ray console
- Analyze trace details for latency and error hotspots
Trace document processing across systems:
- Extract correlation ID from log entries
- Search across log groups using CloudWatch Insights:
```
fields @timestamp, @message
| filter @message like "correlation-id-here"
| sort @timestamp asc
```
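The same query can be run programmatically against multiple log groups via the CloudWatch Logs `StartQuery`/`GetQueryResults` APIs. `build_correlation_query` and `run_insights_query` are illustrative helpers:

```python
import time

def build_correlation_query(correlation_id: str) -> str:
    """Build the CloudWatch Logs Insights query shown above."""
    return (f'fields @timestamp, @message\n'
            f'| filter @message like "{correlation_id}"\n'
            f'| sort @timestamp asc')

def run_insights_query(log_group_names, correlation_id, start_time, end_time):
    """Start an Insights query across log groups and poll until it finishes.

    start_time/end_time are Unix timestamps in seconds.
    """
    # boto3 is imported lazily so the query builder stays usable offline
    import boto3
    logs = boto3.client("logs")
    query_id = logs.start_query(
        logGroupNames=log_group_names,
        startTime=start_time,
        endTime=end_time,
        queryString=build_correlation_query(correlation_id),
    )["queryId"]
    while True:
        result = logs.get_query_results(queryId=query_id)
        if result["status"] in ("Complete", "Failed", "Cancelled"):
            return result
        time.sleep(1)
```

This is useful for tracing a single document across all of the solution's Lambda log groups in one pass instead of searching each group by hand.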
Test system capacity and identify bottlenecks:
- Use the load testing scripts in the `./scripts/` directory
- Start with low document rates and increase gradually
- Monitor CloudWatch metrics for saturation points
- Identify bottlenecks and optimize configuration
## Build and Publish Issues

| Issue | Resolution |
|---|---|
| Generic "Failed to build" error | Use the `--verbose` flag to see detailed error messages: `idp-cli publish --source-dir . --region <region> --verbose` |
| Python version mismatch | Ensure Python 3.13 is installed and available in PATH. Check with `python3 --version` |
| SAM build fails | Verify SAM CLI is installed and up to date. Check that Docker is running if using containerized builds |
| Missing dependencies | Install required packages: `pip install boto3 typer rich botocore` |
| Permission errors | Verify AWS credentials are configured and have the necessary S3/CloudFormation permissions |
**Python Runtime Error:**

```text
Error: PythonPipBuilder:Validation - Binary validation failed for python, searched for python in following locations: [...] which did not satisfy constraints for runtime: python3.12
```

Resolution: Install Python 3.13 and ensure it's in your PATH, or use the `--use-container` flag for containerized builds.

**Docker Not Running:**

```text
Error: Running AWS SAM projects locally requires Docker
```

Resolution: Start the Docker daemon before running the publish script.

**AWS Credentials Not Found:**

```text
Error: Unable to locate credentials
```

Resolution: Configure AWS credentials using `aws configure` or set environment variables.
For detailed debugging information, always use the `--verbose` flag when troubleshooting build issues:

```bash
# Standard usage
idp-cli publish --source-dir . --region us-east-1

# Verbose mode for troubleshooting
idp-cli publish --source-dir . --region us-east-1 --verbose
```

Verbose mode provides:
- Exact SAM build commands being executed
- Complete stdout/stderr from failed operations
- Python environment and dependency information
- Detailed error traces and stack traces
## Container Deployment Issues

| Issue | Resolution |
|---|---|
| Lambda package exceeds 250MB limit | Pattern-2 uses container images automatically. For Pattern-1/3, consider reducing dependency size or switching to container images in a future update. |
| Docker daemon not running | Start Docker Desktop or Docker service before running container deployment |
| ECR login failed | Ensure AWS credentials have ECR permissions. The script will automatically handle ECR login |
| Container build fails | Check Dockerfile syntax and ensure all referenced files exist |
| Image push timeout | Check network connectivity and ECR repository permissions |
Container Deployment Behavior:
- Pattern-2 builds and pushes container images automatically when Pattern-2 changes are detected.
- Ensure Docker Desktop/service is running and your AWS credentials have ECR permissions.
- Use `--verbose` to see detailed build and push logs.