Improve troubleshooting execution steering guide (#161)

vishalsatam · web-flow · commit ee24b160d089 · 2026-05-06T19:57:40.000-04:00
* docs: Improve troubleshooting execution steering guide

Expand trigger keywords in POWER.md for better routing to the
troubleshooting guide. Rewrite troubleshooting-executions.md with
clearer diagnostic steps, ARN-based workflow, console link generation,
structured output format, and additional usage examples.

* docs: Add CloudWatch Logs querying to troubleshooting guide

Add steps to fetch the log group and query logs filtered by invocation
ID and execution name when execution history alone is insufficient to
determine root cause.

* docs: Add alias resolution and list executions step

Add step 0 to resolve function alias to version and list executions
when user provides function name instead of full ARN. Also add STOPPED
status and improve permissions error guidance.
diff --git a/aws-lambda-durable-functions-power/POWER.md b/aws-lambda-durable-functions-power/POWER.md
@@ -87,7 +87,7 @@ Load the appropriate reference file based on what the user is working on:
 - **Testing**, **local testing**, **cloud testing**, **test runner**, or **flaky tests** -> see [testing-patterns.md](steering/testing-patterns.md)
 - **Deployment**, **CloudFormation**, **CDK**, **SAM**, **log groups**, **deploy**, or **infrastructure** -> see [deployment-iac.md](steering/deployment-iac.md)
 - **Advanced patterns**, **GenAI agents**, **completion policies**, **step semantics**, or **custom serialization** -> see [advanced-patterns.md](steering/advanced-patterns.md)
-- **troubleshooting**, **stuck execution**, **failed execution**, **debug execution ID**, or **execution history** -> see [troubleshooting-executions.md](steering/troubleshooting-executions.md)
+- **troubleshooting**, **stuck execution**, **failed execution**, **debug execution ID**, **execution history**, **execution error**, **why did my execution fail**, **execution timed out**, **callback not received**, **diagnose execution**, or **root cause execution** -> see [troubleshooting-executions.md](steering/troubleshooting-executions.md)
 
 ## Quick Reference
 
diff --git a/aws-lambda-durable-functions-power/steering/troubleshooting-executions.md b/aws-lambda-durable-functions-power/steering/troubleshooting-executions.md
@@ -19,8 +19,8 @@ When spawning the troubleshooting agent, provide:
 
 ```
 Diagnose durable function execution issue:
-- Function: <function-name>:<alias> (must be qualified ARN)
-- Execution ID: <execution-id>
+- Durable Execution ARN: <durable-execution-arn>
+- Region: <region> (infer from ARN)
 
 CRITICAL SAFETY RULES:
 - This is READ-ONLY diagnosis
@@ -29,31 +29,115 @@ CRITICAL SAFETY RULES:
 - Only suggest manual remediation if user explicitly requests it
 
 Steps:
-1. Run: aws lambda get-durable-execution-history --function-name <function> --execution-id <id>
-2. Analyze execution status (RUNNING/SUCCEEDED/FAILED/TIMED_OUT)
-3. Check for stuck operations (PENDING/RUNNING status)
-4. Identify failed operations and error messages
-5. Calculate operation durations and timeline
-6. Diagnose specific issue:
-   - Stuck in WAIT_FOR_CALLBACK: Extract callback ID, suggest manual callback
-   - Failed operations: Show error and retry attempts
-   - Timeout: Calculate total duration, identify slow operations
-   - Unexpected behavior: Compare operation order with expected flow
-7. Provide specific recommendations and next steps
-
-Use jq for JSON parsing and analysis.
+0. If the user provides a function name + alias (e.g., my-function:prod) instead of a full ARN:
+   - Resolve the alias to a version: aws lambda get-alias --function-name <functionName> --name <alias> --region <region> --query 'FunctionVersion' --output text
+   - List executions for that function: aws lambda list-durable-executions-by-function --function-name <functionName>:<version> --region <region>
+   - Ask the user to identify the execution, or use the most recent one.
+
+1. Fetch the execution history directly:
+   Run: aws lambda get-durable-execution-history --durable-execution-arn <durable-execution-arn> --region <region> --include-execution-data
+
+2. If the command succeeds, analyze and provide a user-friendly diagnosis:
+   a. Report the execution status (RUNNING/SUCCEEDED/FAILED/STOPPED/TIMED_OUT)
+   b. Identify the root cause:
+      - Failed operations: Show the EXACT error message verbatim in a code block
+      - Stuck in WAIT_FOR_CALLBACK: Extract callback ID, show how long it's been waiting
+      - Timeout: Show which operation was running when timeout occurred
+      - Unexpected behavior: Compare operation order with expected flow
+   c. Calculate operation durations and timeline
+   d. Provide a clear, plain-language explanation of what went wrong and why
+
+3. If the command fails:
+   - Execution not found: Tell the user the execution ID may be incorrect or the execution may have been purged. Ask them to verify the ARN.
+   - Permissions/network error: check that your caller identity has lambda:GetDurableExecutionHistory on the function ARN.
+   - In either case, direct them to the console as a fallback (see step 4)
+
+4. ALWAYS provide a direct link to the Execution Details page in the Lambda console.
+   Parse the ARN (arn:<partition>:lambda:<region>:<accountId>:function:<functionName>:<functionVersion>/durable-execution/<executionName>/<invocationId>)
+   to extract region, functionName, functionVersion, executionName, and invocationId, then construct:
+   https://<region>.console.aws.amazon.com/lambda/home?region=<region>#/functions/<functionName>/versions/<functionVersion>/executions/<executionName>/<invocationId>
+
+   Frame it as: "**[View this execution in the console](<url>)**"
+
+5. Provide specific, actionable next steps based on the diagnosis.
+
+6. If unable to determine the root cause from execution history:
+   - Provide the console link (step 4)
+   - Offer to fetch the log group and pull relevant logs:
+     a. Get the log group:
+        aws lambda get-function-configuration --function-name <functionName>:<functionVersion> --region <region> --query 'LoggingConfig.LogGroup'
+     b. Query logs filtered by invocation ID (parsed from the ARN):
+        aws logs filter-log-events --log-group-name <logGroup> --region <region> --filter-pattern '"<invocationId>"'
+     c. If the function uses SDK structured logging (context.logger), query for step-level logs.
+        Use execution start time from step 2 as start-time, and execution end time (or current time if still running) as end-time:
+        QUERY_ID=$(aws logs start-query --log-group-name <logGroup> --region <region> \
+          --start-time <startEpoch> --end-time <endEpoch> \
+          --query-string 'fields @timestamp, @message | filter executionName = "<executionName>" | sort @timestamp asc' \
+          --query 'queryId' --output text)
+        Then poll for results:
+        aws logs get-query-results --query-id $QUERY_ID --region <region>
+   - If unable to determine the log group, ask the user to check the function's logs
+
+OUTPUT FORMAT:
+- Lead with a short summary: "Your execution [status] because [reason]."
+- Show the error message verbatim in a code block if one exists.
+- Explain what this means in plain language.
+- Provide the console link for visual debugging.
+- List recommended next steps.
+
+Use jq for JSON parsing if available, otherwise use --query (AWS CLI built-in JMESPath) or python -m json.tool.
 ```
 
 ## Example Usage
 
 ```
-User: "My durable function execution abc-123 is stuck on my-function:prod"
+User: "My durable function execution abc-123 is stuck on my-function:prod in us-west-2"
 
-Claude: [Spawns Task agent with troubleshooting instructions]
 Agent: [Runs get-durable-execution-history command]
 Agent: [Analyzes with jq queries]
-Agent: [Returns: "Execution stuck in WAIT_FOR_CALLBACK operation 'wait-for-approval'.
-         Callback ID: xyz789. Waiting since 2026-02-14. Timeout in 12 hours.
-         Recommendation: Check if approval email was sent, or manually send callback."]
-Claude: [Presents findings and offers to send manual callback if needed]
+
+Claude: Your execution is **waiting for an external callback** that hasn't arrived yet.
+
+        The operation `wait-for-approval` has been pending since 2026-02-14 (3 days).
+        It will time out in 12 hours if no callback is received.
+
+        Callback ID: `xyz789`
+
+        **[View this execution in the console](https://us-west-2.console.aws.amazon.com/lambda/home?region=us-west-2#/functions/my-function/versions/3/executions/abc-123/inv-456)**
+        ```
+        https://us-west-2.console.aws.amazon.com/lambda/home?region=us-west-2#/functions/my-function/versions/3/executions/abc-123/inv-456
+        ```
+
+        **Next steps:**
+        1. Check if the approval email was sent to the recipient
+        2. Verify the callback ID `xyz789` was included in the email
+        3. If you'd like me to help send the callback manually, let me know
+```
+
+```
+User: "Execution def-456 failed on order-processor:live in us-east-1"
+
+Agent: [Runs get-durable-execution-history command]
+Agent: [Analyzes with jq queries]
+
+Claude: Your execution **failed** at the `charge-payment` step after 3 retry attempts.
+
+        Error message:
+        ```
+        PaymentGatewayError: Card declined - insufficient funds (code: card_declined)
+        ```
+
+        This means the payment provider rejected the charge. The durable function
+        retried 3 times with exponential backoff but the error persisted because
+        it's a permanent failure (not a transient network issue).
+
+        **[View this execution in the console](https://us-east-1.console.aws.amazon.com/lambda/home?region=us-east-1#/functions/order-processor/versions/5/executions/def-456/inv-789)**
+        ```
+        https://us-east-1.console.aws.amazon.com/lambda/home?region=us-east-1#/functions/order-processor/versions/5/executions/def-456/inv-789
+        ```
+
+        **Next steps:**
+        1. This is a business logic failure, not an infrastructure issue
+        2. Consider adding a non-retryable error classification for `card_declined`
+        3. Implement a compensation step to release the reserved inventory
 ```