| title | Error Analyzer (Troubleshooting Tool) - PREVIEW |
|---|
The Error Analyzer is an intelligent AI-powered troubleshooting tool that helps diagnose and resolve document processing failures in the GenAI IDP Accelerator. It uses Amazon Bedrock's Claude Sonnet 4 model with the Strands agent framework to automatically analyze CloudWatch logs, DynamoDB tracking data, and Step Functions execution history to identify root causes and provide actionable recommendations.
This tool is not yet mature - we expect to refine and improve the capabilities in successive releases. Try it, and give us feedback via GitHub Issues.
- Automatic Failure Diagnosis: AI agent automatically investigates document processing failures
- Intelligent Query Routing: Distinguishes between document-specific and system-wide analysis
- Multi-Source Analysis: Correlates data from CloudWatch Logs, DynamoDB, and Step Functions
- Contextual Recommendations: Provides specific guidance for configuration, operational, or code issues
- Real-Time Updates: Live job status with progress tracking and resumption capability
- Evidence-Based Analysis: Shows detailed log evidence supporting diagnostic conclusions
- Document Processing Failures: Investigate why a specific document failed to process
- Recurring Error Patterns: Identify systemic issues affecting multiple documents
- Performance Investigation: Analyze timeout errors and processing bottlenecks
- System Health Checks: Review recent errors across the entire system
- Troubleshooting Support: Generate detailed error reports for support escalation
flowchart TD
UI[Web UI - TroubleshootModal] -->|GraphQL Mutation| Submit[submitAgentQuery]
Submit --> Agent[Error Analyzer Agent]
Agent -->|Route Query| Router{analyze_errors Tool}
Router -->|Document-Specific| DocAnalysis[analyze_document_failure]
Router -->|System-Wide| SysAnalysis[analyze_recent_system_errors]
DocAnalysis --> GetContext[get_document_context]
DocAnalysis --> SearchDocLogs[search_document_logs]
SysAnalysis --> FindTable[find_tracking_table]
SysAnalysis --> ScanDB[scan_dynamodb_table]
SysAnalysis --> SearchStackLogs[search_stack_logs]
GetContext --> DDB[(DynamoDB<br/>TrackingTable)]
SearchDocLogs --> CW[(CloudWatch Logs)]
SearchStackLogs --> CW
ScanDB --> DDB
DocAnalysis --> Result[Analysis Result]
SysAnalysis --> Result
Result -->|GraphQL Subscription| UI
The Error Analyzer uses 8 specialized tools organized in a modular architecture:
flowchart LR
subgraph "Main Router"
Router[analyze_errors]
end
subgraph "Document Analysis"
DocFail[analyze_document_failure]
GetCtx[get_document_context]
SearchDoc[search_document_logs]
end
subgraph "System Analysis"
SysErr[analyze_recent_system_errors]
FindTbl[find_tracking_table]
ScanTbl[scan_dynamodb_table]
SearchStk[search_stack_logs]
end
Router --> DocFail
Router --> SysErr
DocFail --> GetCtx
DocFail --> SearchDoc
SysErr --> FindTbl
SysErr --> ScanTbl
SysErr --> SearchStk
Tool Descriptions:
- analyze_errors (Main Router)
  - Classifies query intent (document-specific vs system-wide)
  - Routes to appropriate analysis tool
  - Manages time range parsing
- analyze_document_failure (Document-Specific)
  - Investigates individual document failures
  - Retrieves execution context and Lambda request IDs
  - Searches document-specific logs
- analyze_recent_system_errors (System-Wide)
  - Analyzes error patterns across the system
  - Categorizes errors by type
  - Provides statistical summaries
- get_document_context (Lambda Integration)
  - Retrieves document tracking data
  - Extracts Step Functions execution ARN
  - Provides Lambda request IDs for tracing
- search_document_logs (CloudWatch)
  - Filters logs by document ObjectKey
  - Searches across multiple log groups
  - Returns events with timestamps and context
- search_stack_logs (CloudWatch)
  - System-wide log pattern matching
  - Multi-pattern prioritized search
  - Adaptive sampling for context management
- find_tracking_table (DynamoDB Discovery)
  - Locates TrackingTable by stack name
  - Validates table existence
- scan_dynamodb_table (DynamoDB Query)
  - Scans for failed documents
  - Filters by status and time range
  - Returns document metadata
flowchart TD
Query[User Query] --> Check{Document-Specific<br/>Pattern?}
Check -->|Match| DocPattern["Patterns:<br/>• document: filename.pdf<br/>• file: report.docx<br/>• ObjectKey: path/file"]
Check -->|No Match| GeneralPattern["General Queries:<br/>• Recent errors<br/>• System failures<br/>• Processing issues"]
DocPattern --> DocAnalysis[Document-Specific<br/>Analysis]
GeneralPattern --> SysAnalysis[System-Wide<br/>Analysis]
DocAnalysis --> DocResult["Results:<br/>• Execution context<br/>• Document-specific logs<br/>• Lambda request IDs"]
SysAnalysis --> SysResult["Results:<br/>• Error categories<br/>• Failed documents<br/>• Pattern statistics"]
- Navigate to Dashboard: Open the GenAI IDP Web UI
- Find Failed Document: Locate a document with FAILED status
- Click Troubleshoot Button: Opens the TroubleshootModal
- Automatic Analysis: Agent immediately begins analyzing the failure
The TroubleshootModal displays:
- Document Information: Shows the ObjectKey being analyzed
- Status Indicator:
  - PENDING: Job submitted, waiting to start
  - PROCESSING: Agent actively analyzing with real-time messages
  - COMPLETED: Analysis finished, results available
  - FAILED: Analysis encountered an error
- Agent Messages: Live progress updates during processing
- Results Display: Formatted analysis with collapsible sections
- Job Resumption: If you close and reopen the modal, the existing job resumes
Results are structured in three sections:
1. Root Cause
The underlying technical reason for the failure. Focuses on the primary
cause rather than symptoms.
Example: "Bedrock throttling exception due to exceeding token rate limits
for the configured model."
2. Recommendations
Specific, actionable steps to resolve the issue. Limited to top three
recommendations with clear guidance.
Example:
• Increase provisioned throughput for the Bedrock model
• Adjust retry configuration in classification settings
• Consider using batch processing to reduce concurrent requests
3. Evidence (Collapsible)
<details>
<summary><strong>Evidence</strong></summary>
**Log Group:**
/aws-stack-name/lambda/ClassificationFunction
**Log Stream:**
2025/01/03/[$LATEST]abc123def456
[ERROR] 2025-01-03T14:23:45.123Z ThrottlingException: Rate exceeded
</details>
Use these patterns to analyze a specific document:
document: lending_package.pdf
file: bank_statement.docx
ObjectKey: uploads/2024/contract.pdf
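The query must include one of the keywords (document, file, or ObjectKey) followed immediately by a colon and then the filename or object key. Whitespace after the colon is optional, and keyword matching is case-insensitive.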
Use natural language for general analysis:
Find recent processing errors
What errors occurred in the last week?
Show me system failures
Summarize recent problems
The agent interprets time ranges intelligently:
| Query Phrase | Time Range |
|---|---|
| "recent" or "recently" | 1 hour |
| "last hour" | 1 hour |
| "last day" or "yesterday" | 24 hours |
| "last week" | 168 hours (7 days) |
| No time specified | 24 hours (default) |
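The table can be read as a simple phrase-to-hours lookup. The following is a minimal sketch of that mapping, not the agent's actual parser (the agent may well rely on the model rather than fixed rules); the function name `parse_time_range_hours` and the matching order are assumptions made for illustration.

```python
import re

# Phrase-to-hours mapping taken from the table above. The function name
# and the order in which phrases are checked are illustrative assumptions.
TIME_RANGE_PHRASES = [
    (r"\blast week\b", 168),
    (r"\b(last day|yesterday)\b", 24),
    (r"\blast hour\b", 1),
    (r"\brecent(ly)?\b", 1),
]

DEFAULT_HOURS = 24  # mirrors time_range_hours_default


def parse_time_range_hours(query: str, default_hours: int = DEFAULT_HOURS) -> int:
    """Return the lookback window, in hours, implied by a query phrase."""
    lowered = query.lower()
    for pattern, hours in TIME_RANGE_PHRASES:
        if re.search(pattern, lowered):
            return hours
    # No recognized phrase: fall back to the configured default.
    return default_hours


print(parse_time_range_hours("What errors occurred in the last week?"))  # 168
print(parse_time_range_hours("Find recent processing errors"))  # 1
print(parse_time_range_hours("Show me system failures"))  # 24
```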
The Error Analyzer is configured in the CloudFormation template under the agents section of the configuration schema:
agents:
error_analyzer:
type: object
sectionLabel: Error Analysis Agent
properties:
model_id:
type: string
enum: [
"anthropic.claude-3-sonnet-20240229-v1:0",
"us.anthropic.claude-3-5-sonnet-20241022-v2:0",
"us.anthropic.claude-3-7-sonnet-20250219-v1:0",
"us.anthropic.claude-sonnet-4-20250514-v1:0"
]
default: "us.anthropic.claude-sonnet-4-20250514-v1:0"
system_prompt:
type: string
format: textarea
parameters:
type: object
properties:
max_log_events:
type: integer
default: 5
time_range_hours_default:
type: integer
default: 24
Purpose: Selects the Bedrock model for error analysis
Recommended: us.anthropic.claude-sonnet-4-20250514-v1:0
- Superior reasoning for complex error diagnosis
- Better structured output formatting
- More accurate root cause identification
Alternative Options:
- us.anthropic.claude-3-7-sonnet-20250219-v1:0: Good balance of cost and capability
- us.anthropic.claude-3-5-sonnet-20241022-v2:0: Cost-effective for simple errors
- anthropic.claude-3-sonnet-20240229-v1:0: Legacy option
Purpose: Defines agent behavior and response formatting
Key Requirements:
- Enforce three-section structure (Root Cause, Recommendations, Evidence)
- Specify evidence formatting with collapsible HTML details
- Define recommendation guidelines for different issue types
- Set time range parsing rules
Default Prompt Highlights:
You are an intelligent error analysis agent for the GenAI IDP system.
ALWAYS format your response with exactly these three sections:
## Root Cause
## Recommendations
<details><summary><strong>Evidence</strong></summary>...</details>
RECOMMENDATION GUIDELINES:
For code/system bugs: Do not suggest code modifications
For configuration issues: Direct users to UI configuration panel
For operational issues: Provide immediate troubleshooting steps
Purpose: Limits log events returned to manage context window
Default: 5 events
Range: 1-100
Considerations:
- Higher values provide more context but consume more tokens
- For system-wide analysis, uses adaptive sampling across patterns
- Individual log messages are truncated if exceeding 200 characters
Tuning Guidance:
- Simple errors: 3-5 events sufficient
- Complex investigations: 10-20 events
- Pattern analysis: 20-50 events
Purpose: Default lookback period when not specified in query
Default: 24 hours (1 day)
Range: 1-168 hours (1 week)
Considerations:
- Longer ranges increase CloudWatch Logs query time
- Wider time windows may return less relevant results
- Balance between coverage and performance
Tuning Guidance:
- Active development: 1-6 hours
- Production monitoring: 24 hours
- Post-mortem analysis: 72-168 hours
agents:
error_analyzer:
model_id: us.anthropic.claude-sonnet-4-20250514-v1:0
system_prompt: |
You are an intelligent error analysis agent for the GenAI IDP system.
Use the analyze_errors tool to investigate issues. ALWAYS format your
response with exactly these three sections in this order:
## Root Cause
Identify the specific underlying technical reason why the error occurred.
## Recommendations
Provide specific, actionable steps to resolve the issue. Limit to top
three recommendations only.
<details>
<summary><strong>Evidence</strong></summary>
Format log entries with their source information...
</details>
parameters:
max_log_events: 5
time_range_hours_default: 24
The Root Cause section identifies the underlying technical reason for the failure, not just symptoms.
Good Root Cause Examples:
✓ "Bedrock model returned ValidationException due to malformed JSON in the
extraction prompt, caused by unescaped special characters in attribute
descriptions"
✓ "Lambda function timeout after 900 seconds during assessment processing,
triggered by processing a document with 247 pages exceeding memory limits"
✓ "Access denied error when reading OCR results from S3, caused by missing
kms:Decrypt permission on the customer-managed encryption key"
Poor Root Cause Examples (too vague):
✗ "The document failed to process"
✗ "There was an error in the system"
✗ "Lambda function had a problem"
Recommendations are specific, actionable steps tailored to the issue type.
For configuration issues, the agent directs users to the UI:
Recommendations:
• Navigate to Configuration panel in the Web UI
• Update 'extraction.model' to use a higher-capacity model
• Adjust 'assessment.granular.max_workers' from 4 to 2 to reduce memory pressure
For operational issues, provides immediate troubleshooting:
Recommendations:
• Retry the failed document - error appears transient
• Check AWS Service Health Dashboard for Bedrock service issues
• Monitor CloudWatch metrics for throttling patterns in next 30 minutes
For code issues, focuses on reporting not fixing:
Recommendations:
• Report to development team with error details and timestamp
• Include Lambda request ID: abc-123-def-456 for debugging
• Avoid reprocessing this document type until patch is deployed
The Evidence section provides verifiable log data supporting the analysis.
Structure:
<details>
<summary><strong>Evidence</strong></summary>
**Log Group:**
/aws-stack-name/lambda/ExtractionFunction
**Log Stream:**
2025/01/03/[$LATEST]abc123def456
**Events:**
[ERROR] 2025-01-03T15:42:13.456Z RequestId: xyz-789 ValidationException: JSON parsing error at line 42
</details>
Reading Evidence:
- Log Group: Identifies which Lambda function encountered the error
- Log Stream: Provides exact execution instance for deep-dive investigation
- Events: Shows actual error messages with timestamps
- Truncation: Long messages truncated to "... [truncated]" for readability
The agent uses regex pattern matching to determine analysis type:
# Document-specific patterns (require colon immediately after keyword)
document: filename.pdf # Matches
file: report.docx # Matches
ObjectKey: path/file # Matches
# General analysis (no specific pattern)
recent errors # System-wide
find failures # System-wide
what happened # System-wide
Pattern Detection Logic:
If query matches "document:\s*([^\s]+)" → Document-Specific Analysis
If query matches "file:\s*([^\s]+)" → Document-Specific Analysis
If query matches "ObjectKey:\s*([^\s]+)" → Document-Specific Analysis
Otherwise → System-Wide Analysis
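To see how these rules behave on edge cases, here is a small self-contained sketch that applies the three regexes listed above with case-insensitive matching. The function name `classify` is a stand-in chosen for this example; it is not the tool's own entry point.

```python
import re
from typing import Tuple

# The three document-specific patterns listed above.
DOC_PATTERNS = [
    r"document:\s*([^\s]+)",
    r"file:\s*([^\s]+)",
    r"ObjectKey:\s*([^\s]+)",
]


def classify(query: str) -> Tuple[str, str]:
    """Return (intent, document_id); document_id is empty for system-wide queries."""
    for pattern in DOC_PATTERNS:
        match = re.search(pattern, query, re.IGNORECASE)
        if match:
            return ("document_specific", match.group(1))
    return ("general_analysis", "")


# A keyword without a colon falls through to system-wide analysis.
print(classify("Analyze document contract.pdf"))
# The capture group stops at the first whitespace character.
print(classify("document: my report.pdf"))
# A query that mixes intents is routed by the first matching pattern.
print(classify("Find errors for file: x.pdf and system"))
```

Two consequences of the `[^\s]+` capture group are worth keeping in mind: a filename containing spaces is cut at the first space, and any query containing one of the keywords plus a colon is treated as document-specific, even if the rest of the sentence asks a system-wide question.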
System-wide analysis uses prioritized pattern matching:
flowchart TD
Start[System-Wide Query] --> P1[Pattern 1: ERROR<br/>Priority: High<br/>Max Events: 5]
P1 --> P2[Pattern 2: Exception<br/>Priority: Medium<br/>Max Events: 3]
P2 --> P3[Pattern 3: ValidationException<br/>Priority: Medium<br/>Max Events: 2]
P3 --> P4[Pattern 4: Failed<br/>Priority: Low<br/>Max Events: 2]
P4 --> P5[Pattern 5: Timeout<br/>Priority: Low<br/>Max Events: 1]
P5 --> Dedupe[Deduplication &<br/>Filtering]
Dedupe --> Result[Final Event Set]
Prioritization Strategy:
- ERROR: Highest priority, captures 5 events
- Exception: Important errors, captures 3 events
- ValidationException: Specific validation issues, 2 events
- Failed: General failures, 2 events
- Timeout: Performance issues, 1 event
Context Management:
- Respects max_log_events parameter as total limit
- Uses adaptive sampling across patterns
- Deduplicates similar error messages
- Truncates long messages at 200 characters
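The interaction between per-pattern caps, the overall `max_log_events` limit, deduplication, and truncation can be sketched as follows. This is an illustration under stated assumptions, not the analyzer's code: the names `sample_events` and `truncate` are invented here, substring matching is case-sensitive, and "similar" messages are approximated by exact matches after lowercasing and trimming.

```python
from typing import List

# Patterns and per-pattern caps as listed in the prioritization strategy.
PATTERN_LIMITS = [
    ("ERROR", 5),
    ("Exception", 3),
    ("ValidationException", 2),
    ("Failed", 2),
    ("Timeout", 1),
]
MAX_MESSAGE_CHARS = 200
TRUNCATION_SUFFIX = "... [truncated]"


def truncate(message: str, limit: int = MAX_MESSAGE_CHARS) -> str:
    """Cut a message at `limit` characters and mark it as truncated."""
    if len(message) <= limit:
        return message
    return message[:limit] + TRUNCATION_SUFFIX


def sample_events(messages: List[str], max_log_events: int = 5) -> List[str]:
    """Pick events pattern by pattern, highest priority first."""
    selected: List[str] = []
    seen = set()
    for pattern, cap in PATTERN_LIMITS:
        taken = 0
        for message in messages:
            if len(selected) >= max_log_events:
                return selected  # overall limit reached
            if taken >= cap:
                break  # per-pattern cap reached
            if pattern not in message:
                continue
            key = message.strip().lower()  # assumed notion of "similar"
            if key in seen:
                continue
            seen.add(key)
            selected.append(truncate(message))
            taken += 1
    return selected


logs = ["[ERROR] A", "[ERROR] A", "Task Failed: B", "Timeout waiting"]
print(sample_events(logs))  # ['[ERROR] A', 'Task Failed: B', 'Timeout waiting']
```

Because the per-pattern caps add up to more than the default `max_log_events` of 5, lower-priority patterns only contribute when higher-priority ones leave room.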
System-wide analysis categorizes errors for pattern identification:
Category Definitions:
- validation_errors
  - Keywords: "validation", "invalid"
  - Indicates data quality or format issues
  - Often fixable through configuration
- processing_errors
  - Keywords: "exception", "error"
  - Core processing failures
  - May require code fixes
- timeout_errors
  - Keywords: "timeout"
  - Performance/resource issues
  - Adjustable through memory/timeout settings
- access_errors
  - Keywords: "access", "denied"
  - Permission problems
  - Requires IAM policy updates
- system_errors
  - Catch-all for other errors
  - Infrastructure or service issues
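A keyword-based categorizer along these lines might look like the sketch below. The keyword lists come from the definitions above; the function names and, importantly, the evaluation order are assumptions. Because "exception" and "error" appear inside most messages, this sketch checks the narrower categories first so that a ValidationException is not counted as a generic processing error.

```python
from collections import Counter
from typing import Iterable

# Keyword lists from the category definitions. The order is an assumption:
# narrower categories are tested before the broad processing_errors bucket.
CATEGORY_KEYWORDS = [
    ("validation_errors", ("validation", "invalid")),
    ("timeout_errors", ("timeout",)),
    ("access_errors", ("access", "denied")),
    ("processing_errors", ("exception", "error")),
]


def categorize(message: str) -> str:
    """Return the first category whose keywords appear in the message."""
    lowered = message.lower()
    for category, keywords in CATEGORY_KEYWORDS:
        if any(keyword in lowered for keyword in keywords):
            return category
    return "system_errors"  # catch-all


def summarize(messages: Iterable[str]) -> Counter:
    """Count messages per category."""
    return Counter(categorize(m) for m in messages)


print(summarize([
    "ValidationException: Invalid attribute schema",
    "Read timeout on endpoint",
    "AccessDeniedException: not authorized",
    "KeyError: 'pages'",
    "Container restarted",
]))
```

Note that a literal match on "timeout" does not catch a message such as "Task timed out after 900.00 seconds", so treat the keyword lists as illustrative rather than as the exact matching rules.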
Category Summary Example:
{
"error_categories": {
"validation_errors": {
"count": 12,
"sample": "ValidationException: Invalid attribute schema..."
},
"timeout_errors": {
"count": 5,
"sample": "Task timed out after 900.00 seconds"
}
}
}
The Web UI maintains job state for seamless resumption:
stateDiagram-v2
[*] --> Creating: Open Modal
Creating --> Pending: Job Created
Pending --> Processing: Agent Starts
Processing --> Completed: Success
Processing --> Failed: Error
Processing --> Stored: User Closes Modal
Stored --> Processing: User Reopens Modal
Completed --> [*]: User Closes
Failed --> [*]: User Closes
How It Works:
- Job Creation: Modal creates job with unique jobId
- Parent Tracking: Component stores job state in parent
- Modal Close: Job continues running in background
- Modal Reopen: Automatically resumes displaying existing job
- Status Updates: Real-time updates via GraphQL subscription
User Experience:
- Users can close modal without losing analysis
- Reopening shows current progress or final results
- No need to re-submit for in-progress or completed jobs
- New jobs only created when previous job is COMPLETED/FAILED
- Document Processing Failures
  - Scenario: Specific document failed with FAILED status
  - Query: "document: customer_form_2024.pdf"
  - Benefit: Pinpoints exact Lambda function and error cause
- Recurring Error Patterns
  - Scenario: Multiple documents failing with similar errors
  - Query: "What errors occurred in the last 6 hours?"
  - Benefit: Identifies systemic issues affecting multiple documents
- Performance Investigation
  - Scenario: Documents timing out during processing
  - Query: "Show me timeout errors in the last day"
  - Benefit: Reveals resource constraints and bottlenecks
- Post-Deployment Validation
  - Scenario: New configuration deployed, checking for issues
  - Query: "Recent processing errors"
  - Benefit: Quick health check after changes
- Support Ticket Creation
  - Scenario: Need detailed error report for escalation
  - Query: "document: problem_file.pdf"
  - Benefit: Generates formatted report with evidence
- Pre-deployment Testing: Use evaluation tools and test sets instead
- Performance Optimization: Use CloudWatch metrics and dashboards
- Capacity Planning: Use monitoring and reporting features
- Cost Analysis: Use the cost calculator and pricing reports
Good ✓
document: lending_package_2024_Q1.pdf
file: bank_statement_january.docx
ObjectKey: uploads/healthcare/prior_auth_12345.pdf
Poor ✗
document lending package # Missing colon
find document # Too vague
check that failed file # No specific ID
Good ✓
Show errors in the last hour # Recent issues
What happened yesterday? # Specific timeframe
Recent system failures # "Recent" maps to the last hour
Poor ✗
Show all errors ever # Too broad, slow query
Find problems # No time context
Check everything # Vague and expensive
Good ✓
document: contract.pdf # Clear document-specific
Recent validation errors # Clear system-wide
What went wrong today? # Natural language OK
Poor ✗
Analyze document contract.pdf # Ambiguous format
Find errors for file: x.pdf and system # Mixed intents
Example Analysis:
Root Cause: Lambda function exhausted 4096 MB memory limit while processing
a 150-page document with high-resolution images during OCR conversion
Symptoms (don't focus on these):
- Lambda timeout after 15 minutes
- No results written to S3
- Document stuck in PROCESSING status
Action: Address the root cause (memory limit) rather than symptoms (timeout).
The agent limits recommendations to top three most impactful actions:
Recommendation Priority:
1. Immediate Fix: Increase OCR Lambda memory to 8192 MB
2. Short-term: Implement image preprocessing to reduce resolution
3. Long-term: Add document size validation before processing
Don't try to implement all suggestions at once - start with #1.
Cross-reference recommendations with evidence:
Recommendation: "Increase Lambda memory allocation"
Evidence Validation:
✓ Log shows: "@maxMemoryUsed: 4089 MB" (near 4096 MB limit)
✓ Event type: "Task timed out after 900.00 seconds"
✓ Pattern: Occurs on documents with >100 pages
If evidence doesn't support recommendation, request clarification.
System-wide analysis categorizes errors:
Categories Found:
- validation_errors: 15 (most common)
- timeout_errors: 3
- access_errors: 1
Action: Focus on validation_errors first as they affect most documents
Choose model based on error complexity:
# Simple validation errors, frequent analysis
model_id: us.anthropic.claude-3-5-sonnet-20241022-v2:0
# Complex multi-component failures, critical analysis
model_id: us.anthropic.claude-sonnet-4-20250514-v1:0 # Recommended
# Legacy support only
model_id: anthropic.claude-3-sonnet-20240229-v1:0
Tune based on analysis type:
parameters:
# Development/testing - need detailed context
max_log_events: 20
# Production monitoring - balance detail and cost
max_log_events: 5 # Default
# Quick health checks - minimize cost
max_log_events: 3
Set default based on deployment frequency:
parameters:
# Frequent deployments (multiple per day)
time_range_hours_default: 6
# Daily deployments
time_range_hours_default: 24 # Default
# Weekly deployments
time_range_hours_default: 72
Symptom: Error message "Error-Analyzer-Agent-v1 agent is not available"
Causes:
- Agent configuration not deployed
- Stack configuration outdated
- Agent ID mismatch
Resolution:
1. Check template.yaml includes agents.error_analyzer section
2. Verify configuration deployed:
aws cloudformation describe-stacks --stack-name <stack-name>
3. Check available agents in Web UI Configuration panel
4. Redeploy stack if configuration missing
Symptom: Job status shows FAILED or times out
Causes:
- Lambda function timeout (15 min limit)
- Insufficient memory
- Invalid permissions
- Bedrock throttling
Resolution:
1. Check CloudWatch Logs for agent Lambda function:
/<stack-name>/lambda/AgentFunction
2. Look for specific error messages:
- "Task timed out" → Increase memory or reduce query scope
- "AccessDeniedException" → Check IAM permissions
- "ThrottlingException" → Wait and retry
3. For document-specific queries, ensure document exists:
- Verify ObjectKey is correct
- Check document in DynamoDB TrackingTable
Symptom: Analysis missing Root Cause or Recommendations sections
Causes:
- Model output formatting issue
- System prompt not enforced
- Token limit exceeded
Resolution:
1. Verify system_prompt includes formatting requirements:
"ALWAYS format your response with exactly these three sections"
2. Check model_id is using recommended Claude Sonnet 4:
us.anthropic.claude-sonnet-4-20250514-v1:0
3. If token limit reached, reduce max_log_events or time range
Symptom: "Access denied" or "Permission denied" errors
Causes:
- Missing CloudWatch Logs permissions
- DynamoDB access denied
- KMS key permissions
Resolution:
IAM Permissions Required:
- CloudWatch Logs:
* logs:FilterLogEvents
* logs:DescribeLogGroups
* logs:DescribeLogStreams
- DynamoDB:
* dynamodb:GetItem
* dynamodb:Query
* dynamodb:Scan
- KMS (if using customer-managed keys):
* kms:Decrypt
* kms:DescribeKey
Check Lambda execution role has these permissions.
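As a starting point for reviewing the execution role, the permissions above can be assembled into an IAM policy document. This is a hedged sketch: the statement IDs and the helper name `build_policy` are invented for this example, and `"Resource": "*"` is a placeholder that should be narrowed to the stack's log groups, tracking table, and key ARNs.

```python
import json

# Actions are the ones listed above. "Resource": "*" is a placeholder
# assumption; scope it to specific ARNs before using the policy.
REQUIRED_ACTIONS = {
    "CloudWatchLogs": [
        "logs:FilterLogEvents",
        "logs:DescribeLogGroups",
        "logs:DescribeLogStreams",
    ],
    "DynamoDB": ["dynamodb:GetItem", "dynamodb:Query", "dynamodb:Scan"],
    "Kms": ["kms:Decrypt", "kms:DescribeKey"],
}


def build_policy(include_kms: bool = True) -> dict:
    """Build a policy document; skip KMS when no customer-managed key is used."""
    statements = []
    for name, actions in REQUIRED_ACTIONS.items():
        if name == "Kms" and not include_kms:
            continue
        statements.append(
            {
                "Sid": f"ErrorAnalyzer{name}",  # hypothetical statement ID
                "Effect": "Allow",
                "Action": actions,
                "Resource": "*",
            }
        )
    return {"Version": "2012-10-17", "Statement": statements}


print(json.dumps(build_policy(), indent=2))
```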
Symptom: Evidence section is empty or missing
Causes:
- No matching log events in time range
- CloudWatch log retention expired
- Incorrect log group names
Resolution:
1. Increase time range: "Show errors in the last week"
2. Check log retention in CloudWatch console
3. Verify log group naming convention:
/{StackName}/lambda/{FunctionName}
4. Use system-wide query to check if any logs available
Mutations:
mutation SubmitAgentQuery {
submitAgentQuery(
query: "document: lending_package.pdf"
agentIds: ["Error-Analyzer-Agent-v1"]
) {
jobId
status
}
}
Queries:
query GetAgentJobStatus($jobId: ID!) {
getAgentJobStatus(jobId: $jobId) {
jobId
status
result
agent_messages
error
}
}
Subscriptions:
subscription OnAgentJobComplete($jobId: ID!) {
onAgentJobComplete(jobId: $jobId) {
jobId
status
}
}
Log Group Discovery:
# Pattern for stack log groups
log_group_pattern = f"/{stack_name}/lambda/"
# Searches across:
- /stack-name/lambda/OCRFunction
- /stack-name/lambda/ClassificationFunction
- /stack-name/lambda/ExtractionFunction
- /stack-name/lambda/AssessmentFunction
- /stack-name/lambda/SummarizationFunction
Log Filtering:
# Document-specific filter
filter_pattern = f'"ObjectKey" = "{document_id}" "ERROR"'
# System-wide patterns (prioritized)
patterns = ["ERROR", "Exception", "ValidationException", "Failed", "Timeout"]
Table Schema (relevant fields):
{
"ObjectKey": "uploads/document.pdf", # Partition key
"ObjectStatus": "FAILED", # Document status
"ExecutionArn": "arn:aws:states:...", # Step Functions ARN
"CompletionTime": "2025-01-03T15:30:00Z",
"ErrorMessage": "Processing failed...", # Optional error
"LastModified": "2025-01-03T15:30:00Z"
}
Query Patterns:
# Find document by ObjectKey
response = table.get_item(Key={"ObjectKey": document_id})
# Scan for recent failures
response = table.scan(
FilterExpression="ObjectStatus = :status AND CompletionTime > :time",
ExpressionAttributeValues={
":status": "FAILED",
":time": threshold_timestamp
}
)
Execution Context:
# Extract execution ID from DynamoDB
execution_arn = "arn:aws:states:us-east-1:123456789012:execution:StateMachine:abc-123"
execution_id = execution_arn.split(":")[-1] # "abc-123"
# Used for log filtering
filter_pattern = f'"execution_id" = "{execution_id}"'
Location: lib/idp_common_pkg/idp_common/agents/error_analyzer/tools/error_analysis_tool.py
Function Signature:
@tool
def analyze_errors(query: str, time_range_hours: int = 1) -> Dict[str, Any]:
"""
Intelligent error analysis with precise query classification.
Args:
query: User's error analysis query
time_range_hours: Hours to look back (default: 1, uses config default)
Returns:
Dict containing analysis results or error information
"""
Classification Logic:
def _classify_query_intent(query: str) -> Tuple[str, str]:
"""
Classify query as document-specific vs general system analysis.
Returns:
Tuple of (intent_type, document_id)
- intent_type: "document_specific" or "general_analysis"
- document_id: Extracted document ID or empty string
"""
specific_doc_patterns = [
r"document:\s*([^\s]+)",
r"file:\s*([^\s]+)",
r"ObjectKey:\s*([^\s]+)",
]
for pattern in specific_doc_patterns:
match = re.search(pattern, query, re.IGNORECASE)
if match:
return ("document_specific", match.group(1).strip())
return ("general_analysis", "")
Location: lib/idp_common_pkg/idp_common/agents/error_analyzer/tools/document_analysis_tool.py
Purpose: Document-specific failure analysis
Key Operations:
- Retrieves document context from DynamoDB
- Searches CloudWatch logs filtered by ObjectKey
- Extracts Lambda request IDs for tracing
- Correlates execution context with errors
Location: lib/idp_common_pkg/idp_common/agents/error_analyzer/tools/general_analysis_tool.py
Purpose: System-wide error pattern analysis
Key Operations:
- Scans DynamoDB for recent failures
- Multi-pattern CloudWatch log search
- Error categorization and statistics
- Adaptive sampling for context management
CloudWatch Logs Queries:
- Each query scans specified time range across log groups
- Longer time ranges increase query latency
- Max 10,000 events per FilterLogEvents call
Cost Optimization:
# Efficient queries
max_log_events = 5 # Minimal context window usage
time_range_hours = 1 # Recent errors only
# Expensive queries (use sparingly)
max_log_events = 50 # Large context window
time_range_hours = 168 # Full week scan
Token Usage:
- System prompt: ~800 tokens
- Log events: ~100-200 tokens each
- Analysis response: ~500-1000 tokens
- Total per query: ~2000-4000 tokens average
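The per-component figures above give a rough way to estimate how `max_log_events` affects token usage. The sketch below simply adds them up; the function name `estimate_tokens` is hypothetical, and the estimate leaves out the user query and any tool-call overhead, so real totals will be somewhat higher.

```python
from typing import Tuple

# Approximate figures from the token usage list above.
SYSTEM_PROMPT_TOKENS = 800
TOKENS_PER_LOG_EVENT = (100, 200)  # (low, high)
RESPONSE_TOKENS = (500, 1000)  # (low, high)


def estimate_tokens(max_log_events: int) -> Tuple[int, int]:
    """Return a (low, high) token estimate for one analysis query."""
    low = SYSTEM_PROMPT_TOKENS + max_log_events * TOKENS_PER_LOG_EVENT[0] + RESPONSE_TOKENS[0]
    high = SYSTEM_PROMPT_TOKENS + max_log_events * TOKENS_PER_LOG_EVENT[1] + RESPONSE_TOKENS[1]
    return low, high


print(estimate_tokens(5))   # (1800, 2800)
print(estimate_tokens(50))  # (6300, 11800)
```

Under these figures, raising `max_log_events` from 5 to 50 roughly triples to quadruples the token count per query, which is why the larger setting is best reserved for occasional deep investigations.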
- Troubleshooting Guide: General troubleshooting for common issues (manual steps)
  - Use the Error Analyzer for automated diagnosis
  - Refer to Troubleshooting Guide for manual resolution steps, performance tuning, and infrastructure issues
- Monitoring: CloudWatch dashboards and metrics
- Web UI: User interface features and navigation
- Architecture: Overall system architecture
- Configuration: Configuration management
Use Error Analyzer for:
- Document processing failures (root cause analysis)
- Recent error patterns across the system
- Automated log correlation and diagnosis
- Quick troubleshooting with AI-powered recommendations
Use Manual Troubleshooting Guide for:
- Infrastructure and deployment issues
- Performance optimization and tuning
- Security and authentication problems
- Build and configuration management
- DLQ processing and queue management
Error Analyzer:
- AI-powered root cause identification
- Automated correlation across services
- Natural language query interface
- Actionable recommendations
- Integrated with IDP workflow
CloudWatch Insights:
- Manual query writing required
- Single log group analysis
- Technical query language
- Raw log data output
- Generic AWS service
Yes, the system prompt is fully customizable in the configuration:
- Navigate to Configuration panel in Web UI
- Expand "Agent Configuration" section
- Edit "Error Analysis Agent" → "system_prompt"
- Save configuration
Caution: Modifying the system prompt may affect output formatting and quality.
The Error Analyzer supports:
- Multiple users: Each can have active jobs
- Job per document: One active job per user per document
- System-wide queries: Unlimited concurrent queries
- Resource limits: Subject to Lambda concurrency and Bedrock quotas
Timeout Handling:
- Lambda has 15-minute timeout
- Job status set to FAILED
- Partial results (if any) are saved
- User can retry with narrower scope:
  - Reduce time_range_hours
  - Reduce max_log_events
  - Use document-specific query
Export Options:
- Copy from UI: Select and copy formatted text
- API Access: Use getAgentJobStatus query
- CloudWatch Logs: Agent logs contain full results
- Future Enhancement: Export to PDF/JSON (roadmap)
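For the API access option, a client needs to send the `getAgentJobStatus` query and read the `result` field from the response. The sketch below only builds the request body and parses a response; it makes no network call. The endpoint URL and authentication are deployment-specific and omitted, the helper names are invented for this example, and treating `result` as a plain string is an assumption.

```python
import json

# Query text copied from the getAgentJobStatus example in this document.
GET_AGENT_JOB_STATUS = """
query GetAgentJobStatus($jobId: ID!) {
  getAgentJobStatus(jobId: $jobId) {
    jobId
    status
    result
    agent_messages
    error
  }
}
"""


def build_status_request(job_id: str) -> str:
    """Return the JSON body for a GraphQL-over-HTTP POST request."""
    return json.dumps({"query": GET_AGENT_JOB_STATUS, "variables": {"jobId": job_id}})


def extract_result(response_body: str) -> str:
    """Return the analysis text, or raise if the job has not completed."""
    job = json.loads(response_body)["data"]["getAgentJobStatus"]
    if job["status"] != "COMPLETED":
        raise ValueError(f"job {job['jobId']} is {job['status']}, not COMPLETED")
    return job["result"]


# Example with a hand-written response in the standard GraphQL envelope.
sample = json.dumps(
    {"data": {"getAgentJobStatus": {"jobId": "job-1", "status": "COMPLETED", "result": "## Root Cause ..."}}}
)
print(extract_result(sample))
```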
Retention Policy:
- In-memory: Active jobs only
- DynamoDB: Not persisted (stateless)
- CloudWatch Logs: Per log group retention (default: 7-90 days)
- Recommendation: Screenshot or copy important analyses
Yes, if custom Lambda functions:
- Write to CloudWatch Logs with stack-based log group names
- Include ObjectKey in log messages
- Follow standard error logging patterns
The analyzer will automatically discover and search these logs.
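One way a custom Lambda function could meet the second and third conditions is to emit a structured error line that carries the document's ObjectKey. This is a minimal sketch, not a required format: the logger name, the helper names, and the JSON field layout are assumptions, and whether the analyzer's log filter matches this exact shape should be verified against your deployment.

```python
import json
import logging

logger = logging.getLogger("custom_function")  # hypothetical logger name


def build_error_record(object_key: str, error: Exception) -> str:
    """Serialize an error so the document's ObjectKey appears in the message."""
    return json.dumps(
        {
            "level": "ERROR",
            "ObjectKey": object_key,
            "error_type": type(error).__name__,
            "message": str(error),
        }
    )


def log_processing_error(object_key: str, error: Exception) -> None:
    """Write the record at ERROR level; in Lambda this ends up in CloudWatch Logs."""
    logger.error(build_error_record(object_key, error))


try:
    raise ValueError("page 3 could not be parsed")
except ValueError as exc:
    log_processing_error("uploads/2024/contract.pdf", exc)
```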
- Single Agent: Only Error-Analyzer-Agent-v1 supported
- English Only: Optimized for English log messages
- AWS Services: CloudWatch and DynamoDB only (no external logs)
- Pattern Matching: Regex-based classification may miss edge cases
- Context Window: Limited by Bedrock model token limits
- Long Document IDs: ObjectKeys >200 characters may be truncated
- Special Characters: Some Unicode in logs may cause parsing issues
- High Volume: Systems with >1000 errors/hour may hit throttling
- Multi-Region: Analyzer only searches current region
- Multi-language Support: Non-English log analysis
- Custom Patterns: User-defined error patterns
- Trend Analysis: Historical error pattern tracking
- Predictive Alerts: Proactive failure prediction
- Export Features: PDF/JSON report generation
- Integration: Slack/Teams notifications
The Error Analyzer is a powerful AI-driven troubleshooting tool that:
✓ Automates failure diagnosis with AI-powered analysis
✓ Accelerates root cause identification from hours to minutes
✓ Correlates data across CloudWatch, DynamoDB, and Step Functions
✓ Provides actionable, context-specific recommendations
✓ Integrates seamlessly with the Web UI workflow
✓ Supports both document-specific and system-wide analysis
For optimal results:
- Use Claude Sonnet 4 model for complex errors
- Be specific with document IDs in queries
- Focus on root causes, not symptoms
- Verify recommendations with evidence
- Adjust configuration based on deployment patterns
The Error Analyzer significantly reduces troubleshooting time and improves operational efficiency for GenAI IDP deployments.