Commit ee24b16

Improve troubleshooting execution steering guide (#161)
* docs: Improve troubleshooting execution steering guide

  Expand trigger keywords in POWER.md for better routing to the troubleshooting guide. Rewrite troubleshooting-executions.md with clearer diagnostic steps, ARN-based workflow, console link generation, structured output format, and additional usage examples.

* docs: Add CloudWatch Logs querying to troubleshooting guide

  Add steps to fetch the log group and query logs filtered by invocation ID and execution name when execution history alone is insufficient to determine root cause.

* docs: Add alias resolution and list executions step

  Add step 0 to resolve function alias to version and list executions when user provides function name instead of full ARN. Also add STOPPED status and improve permissions error guidance.
1 parent 5867312 commit ee24b16

2 files changed

Lines changed: 106 additions & 22 deletions

aws-lambda-durable-functions-power/POWER.md

Lines changed: 1 addition & 1 deletion
@@ -87,7 +87,7 @@ Load the appropriate reference file based on what the user is working on:
 - **Testing**, **local testing**, **cloud testing**, **test runner**, or **flaky tests** -> see [testing-patterns.md](steering/testing-patterns.md)
 - **Deployment**, **CloudFormation**, **CDK**, **SAM**, **log groups**, **deploy**, or **infrastructure** -> see [deployment-iac.md](steering/deployment-iac.md)
 - **Advanced patterns**, **GenAI agents**, **completion policies**, **step semantics**, or **custom serialization** -> see [advanced-patterns.md](steering/advanced-patterns.md)
-- **troubleshooting**, **stuck execution**, **failed execution**, **debug execution ID**, or **execution history** -> see [troubleshooting-executions.md](steering/troubleshooting-executions.md)
+- **troubleshooting**, **stuck execution**, **failed execution**, **debug execution ID**, **execution history**, **execution error**, **why did my execution fail**, **execution timed out**, **callback not received**, **diagnose execution**, or **root cause execution** -> see [troubleshooting-executions.md](steering/troubleshooting-executions.md)
 
 ## Quick Reference

aws-lambda-durable-functions-power/steering/troubleshooting-executions.md

Lines changed: 105 additions & 21 deletions
@@ -19,8 +19,8 @@ When spawning the troubleshooting agent, provide:
 
 ```
 Diagnose durable function execution issue:
-- Function: <function-name>:<alias> (must be qualified ARN)
-- Execution ID: <execution-id>
+- Durable Execution ARN: <durable-execution-arn>
+- Region: <region> (infer from ARN)
 
 CRITICAL SAFETY RULES:
 - This is READ-ONLY diagnosis
@@ -29,31 +29,115 @@ CRITICAL SAFETY RULES:
 - Only suggest manual remediation if user explicitly requests it
 
 Steps:
-1. Run: aws lambda get-durable-execution-history --function-name <function> --execution-id <id>
-2. Analyze execution status (RUNNING/SUCCEEDED/FAILED/TIMED_OUT)
-3. Check for stuck operations (PENDING/RUNNING status)
-4. Identify failed operations and error messages
-5. Calculate operation durations and timeline
-6. Diagnose specific issue:
-- Stuck in WAIT_FOR_CALLBACK: Extract callback ID, suggest manual callback
-- Failed operations: Show error and retry attempts
-- Timeout: Calculate total duration, identify slow operations
-- Unexpected behavior: Compare operation order with expected flow
-7. Provide specific recommendations and next steps
-
-Use jq for JSON parsing and analysis.
+0. If the user provides a function name + alias (e.g., my-function:prod) instead of a full ARN:
+- Resolve the alias to a version: aws lambda get-alias --function-name <functionName> --name <alias> --region <region> --query 'FunctionVersion' --output text
+- List executions for that function: aws lambda list-durable-executions-by-function --function-name <functionName>:<version> --region <region>
+- Ask the user to identify the execution, or use the most recent one.
+
+1. Fetch the execution history directly:
+Run: aws lambda get-durable-execution-history --durable-execution-arn <durable-execution-arn> --region <region> --include-execution-data
+
+2. If the command succeeds, analyze and provide a user-friendly diagnosis:
+a. Report the execution status (RUNNING/SUCCEEDED/FAILED/STOPPED/TIMED_OUT)
+b. Identify the root cause:
+- Failed operations: Show the EXACT error message verbatim in a code block
+- Stuck in WAIT_FOR_CALLBACK: Extract callback ID, show how long it's been waiting
+- Timeout: Show which operation was running when timeout occurred
+- Unexpected behavior: Compare operation order with expected flow
+c. Calculate operation durations and timeline
+d. Provide a clear, plain-language explanation of what went wrong and why
+
+3. If the command fails:
+- Execution not found: Tell the user the execution ID may be incorrect or the execution may have been purged. Ask them to verify the ARN.
+- Permissions/network error: check that your caller identity has lambda:GetDurableExecutionHistory on the function ARN.
+- In either case, direct them to the console as a fallback (see step 4)
+
+4. ALWAYS provide a direct link to the Execution Details page in the Lambda console.
+Parse the ARN (arn:<partition>:lambda:<region>:<accountId>:function:<functionName>:<functionVersion>/durable-execution/<executionName>/<invocationId>)
+to extract region, functionName, functionVersion, executionName, and invocationId, then construct:
+https://<region>.console.aws.amazon.com/lambda/home?region=<region>#/functions/<functionName>/versions/<functionVersion>/executions/<executionName>/<invocationId>
+
+Frame it as: "**[View this execution in the console](<url>)**"
+
+5. Provide specific, actionable next steps based on the diagnosis.
+
+6. If unable to determine the root cause from execution history:
+- Provide the console link (step 4)
+- Offer to fetch the log group and pull relevant logs:
+a. Get the log group:
+aws lambda get-function-configuration --function-name <functionName>:<functionVersion> --region <region> --query 'LoggingConfig.LogGroup'
+b. Query logs filtered by invocation ID (parsed from the ARN):
+aws logs filter-log-events --log-group-name <logGroup> --region <region> --filter-pattern '"<invocationId>"'
+c. If the function uses SDK structured logging (context.logger), query for step-level logs.
+Use execution start time from step 2 as start-time, and execution end time (or current time if still running) as end-time:
+QUERY_ID=$(aws logs start-query --log-group-name <logGroup> --region <region> \
+--start-time <startEpoch> --end-time <endEpoch> \
+--query-string 'fields @timestamp, @message | filter executionName = "<executionName>" | sort @timestamp asc' \
+--query 'queryId' --output text)
+Then poll for results:
+aws logs get-query-results --query-id $QUERY_ID --region <region>
+- If unable to determine the log group, ask the user to check the function's logs
+
+OUTPUT FORMAT:
+- Lead with a short summary: "Your execution [status] because [reason]."
+- Show the error message verbatim in a code block if one exists.
+- Explain what this means in plain language.
+- Provide the console link for visual debugging.
+- List recommended next steps.
+
+Use jq for JSON parsing if available, otherwise use --query (AWS CLI built-in JMESPath) or python -m json.tool.
 ```
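
Step 4 of the rewritten prompt asks the agent to parse the durable execution ARN and build a console link from its parts. A minimal Python sketch of that step follows, assuming the ARN layout and URL template are exactly as quoted in the prompt. The helper names `parse_durable_execution_arn` and `console_url` are hypothetical, the account ID in the usage line is a placeholder, and the URL template is taken as-is from the guide, so other partitions may need a different console domain.

```python
import re
from typing import NamedTuple


class DurableExecutionArn(NamedTuple):
    """Fields named in the ARN layout quoted in step 4."""
    partition: str
    region: str
    account_id: str
    function_name: str
    function_version: str
    execution_name: str
    invocation_id: str


# Layout assumed from the guide:
# arn:<partition>:lambda:<region>:<accountId>:function:<functionName>:<functionVersion>
#   /durable-execution/<executionName>/<invocationId>
_ARN_PATTERN = re.compile(
    r"^arn:(?P<partition>[^:]+):lambda:(?P<region>[^:]+):(?P<account_id>[^:]+)"
    r":function:(?P<function_name>[^:]+):(?P<function_version>[^:/]+)"
    r"/durable-execution/(?P<execution_name>[^/]+)/(?P<invocation_id>[^/]+)$"
)


def parse_durable_execution_arn(arn: str) -> DurableExecutionArn:
    """Split a durable execution ARN into its parts, or raise ValueError."""
    match = _ARN_PATTERN.match(arn)
    if match is None:
        raise ValueError(f"not a durable execution ARN: {arn!r}")
    return DurableExecutionArn(**match.groupdict())


def console_url(arn: str) -> str:
    """Build the Execution Details link using the template from step 4."""
    parts = parse_durable_execution_arn(arn)
    return (
        f"https://{parts.region}.console.aws.amazon.com/lambda/home"
        f"?region={parts.region}#/functions/{parts.function_name}"
        f"/versions/{parts.function_version}"
        f"/executions/{parts.execution_name}/{parts.invocation_id}"
    )


# Placeholder account ID; the other values mirror the first usage example.
example_arn = (
    "arn:aws:lambda:us-west-2:123456789012:function:my-function:3"
    "/durable-execution/abc-123/inv-456"
)
print(console_url(example_arn))
```

Rejecting malformed input with `ValueError` keeps the agent from emitting a broken link; in that case the "execution not found" guidance in step 3 (ask the user to verify the ARN) still applies.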
 
 ## Example Usage
 
 ```
-User: "My durable function execution abc-123 is stuck on my-function:prod"
+User: "My durable function execution abc-123 is stuck on my-function:prod in us-west-2"
 
-Claude: [Spawns Task agent with troubleshooting instructions]
 Agent: [Runs get-durable-execution-history command]
 Agent: [Analyzes with jq queries]
-Agent: [Returns: "Execution stuck in WAIT_FOR_CALLBACK operation 'wait-for-approval'.
-Callback ID: xyz789. Waiting since 2026-02-14. Timeout in 12 hours.
-Recommendation: Check if approval email was sent, or manually send callback."]
-Claude: [Presents findings and offers to send manual callback if needed]
+
+Claude: Your execution is **waiting for an external callback** that hasn't arrived yet.
+
+The operation `wait-for-approval` has been pending since 2026-02-14 (3 days).
+It will time out in 12 hours if no callback is received.
+
+Callback ID: `xyz789`
+
+**[View this execution in the console](https://us-west-2.console.aws.amazon.com/lambda/home?region=us-west-2#/functions/my-function/versions/3/executions/abc-123/inv-456)**
+```
+https://us-west-2.console.aws.amazon.com/lambda/home?region=us-west-2#/functions/my-function/versions/3/executions/abc-123/inv-456
+```
+
+**Next steps:**
+1. Check if the approval email was sent to the recipient
+2. Verify the callback ID `xyz789` was included in the email
+3. If you'd like me to help send the callback manually, let me know
+```
+
+```
+User: "Execution def-456 failed on order-processor:live in us-east-1"
+
+Agent: [Runs get-durable-execution-history command]
+Agent: [Analyzes with jq queries]
+
+Claude: Your execution **failed** at the `charge-payment` step after 3 retry attempts.
+
+Error message:
+```
+PaymentGatewayError: Card declined - insufficient funds (code: card_declined)
+```
+
+This means the payment provider rejected the charge. The durable function
+retried 3 times with exponential backoff but the error persisted because
+it's a permanent failure (not a transient network issue).
+
+**[View this execution in the console](https://us-east-1.console.aws.amazon.com/lambda/home?region=us-east-1#/functions/order-processor/versions/5/executions/def-456/inv-789)**
+```
+https://us-east-1.console.aws.amazon.com/lambda/home?region=us-east-1#/functions/order-processor/versions/5/executions/def-456/inv-789
+```
+
+**Next steps:**
+1. This is a business logic failure, not an infrastructure issue
+2. Consider adding a non-retryable error classification for `card_declined`
+3. Implement a compensation step to release the reserved inventory
 ```
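
The guide asks the agent to report how long a callback has been waiting and to pass the execution start and end times to `aws logs start-query` as epoch values, falling back to the current time when the execution is still running. A small Python sketch of those two calculations follows. It assumes timestamps arrive as ISO 8601 strings, which is an assumption since the history output format is not shown in this commit; the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp; naive values are assumed to be UTC."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def query_window(start: str, end: Optional[str], now: datetime) -> Tuple[int, int]:
    """Return (startEpoch, endEpoch) in whole seconds.

    A missing end time is treated as "still running", so `now` is used,
    matching the fallback described in step 6c. `now` must be timezone-aware.
    """
    start_dt = parse_timestamp(start)
    end_dt = parse_timestamp(end) if end else now
    return int(start_dt.timestamp()), int(end_dt.timestamp())


def waiting_time(started: str, now: datetime) -> timedelta:
    """Elapsed time since an operation started; `now` must be timezone-aware."""
    return now - parse_timestamp(started)


# Dates chosen to mirror the first usage example ("pending since 2026-02-14 (3 days)").
now = datetime(2026, 2, 17, tzinfo=timezone.utc)
print(waiting_time("2026-02-14T00:00:00Z", now).days)  # 3
```

Passing `now` explicitly rather than reading the clock inside the functions keeps the arithmetic reproducible when the same diagnosis is re-run.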
