Commit 2583e03

alexghr authored and critesjosh committed
chore: dispatch CB on failed deployments (#22367)
1 parent 2500ac7 commit 2583e03

2 files changed: +182 -12 lines changed
Lines changed: 143 additions & 0 deletions
@@ -0,0 +1,143 @@
# ClaudeBox Deploy Investigation

Instructions for ClaudeBox when investigating a deployment failure.
This is triggered by `deploy-network.yml` when a deployment fails.

## Context

You will receive a prompt like:

> Deployment of NETWORK (version SEMVER) failed.
> Follow .claude/claudebox/deploy-investigation.md to investigate.
> GitHub Actions run: RUN_URL. Network: NETWORK. Version: SEMVER.
> Docker image: IMAGE_TAG. Git ref: REF. Namespace: NAMESPACE.
> Deploy contracts: true|false.

Extract these variables from the prompt:
- `NETWORK`: the network name (e.g., `testnet`, `staging-public`, `next-net`)
- `SEMVER`: the version being deployed
- `RUN_URL`: the GitHub Actions run URL
- `RUN_ID`: the numeric ID at the end of `RUN_URL`
- `NAMESPACE`: the Kubernetes namespace (usually same as NETWORK)
- `IMAGE_TAG`: the Docker image tag
- `DEPLOY_CONTRACTS`: whether fresh contract deployment was requested
24+
## Constraints
25+
26+
You are running inside ClaudeBox. You do **not** have `gh` CLI or `git push`.
27+
Use MCP tools instead: `github_api`, `respond_to_user`.
28+
29+
## Workflow
30+
31+
### 1. Fetch the Failed Job
32+
33+
```
34+
github_api(method="GET", path="repos/AztecProtocol/aztec-packages/actions/runs/<RUN_ID>/jobs")
35+
```
36+
37+
Find the job with `conclusion: "failure"`. Extract its `id` and note which step failed
38+
(look at the `steps` array for the step with `conclusion: "failure"`).
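
The selection described above could be sketched as follows, assuming the response has already been decoded into a dictionary with a top-level `jobs` list, as the GitHub REST API returns for this endpoint. `find_failed_steps` is a hypothetical helper name.

```python
def find_failed_steps(jobs_response: dict) -> list[tuple[int, str, str]]:
    """Return (job_id, job_name, step_name) for every failed step of every failed job."""
    failures = []
    for job in jobs_response.get("jobs", []):
        if job.get("conclusion") != "failure":
            continue
        failed_steps = [
            step["name"]
            for step in job.get("steps", [])
            if step.get("conclusion") == "failure"
        ]
        # A job can fail without a failed step being recorded; keep the job anyway.
        for step_name in failed_steps or ["<unknown step>"]:
            failures.append((job["id"], job["name"], step_name))
    return failures
```

The returned job `id` is the `<JOB_ID>` needed for the log download in the next step.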

### 2. Download GitHub Actions Job Logs

```
github_api(method="GET", path="repos/AztecProtocol/aztec-packages/actions/jobs/<JOB_ID>/logs")
```

The GitHub Actions logs are a wrapper. The **actual detailed logs** are on
ci.aztec-labs.com. Look for lines like:

```
Executing: <command> (http://ci.aztec-labs.com/<HASH>)
0 . failed (Xs) (http://ci.aztec-labs.com/<HASH>)
```

Extract the `ci.aztec-labs.com/<HASH>` URL(s) from the logs. The hash after the
failed command is the one you need.
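
One way to pull the hashes out of the wrapper log is a line-by-line regular expression scan. This is a sketch under an assumption: the hash is taken to be a run of ASCII letters and digits, which the source does not specify. `extract_ci_hashes` is a hypothetical helper.

```python
import re

# Assumption: hashes are alphanumeric. Adjust the character class if the
# real format differs.
CI_URL_RE = re.compile(r"https?://ci\.aztec-labs\.com/([0-9A-Za-z]+)")


def extract_ci_hashes(log_text: str) -> tuple[list[str], list[str]]:
    """Return (hashes on lines mentioning 'failed', all hashes), each in first-seen order."""
    failed, seen = [], []
    for line in log_text.splitlines():
        for ci_hash in CI_URL_RE.findall(line):
            if ci_hash not in seen:
                seen.append(ci_hash)
            if "failed" in line and ci_hash not in failed:
                failed.append(ci_hash)
    return failed, seen
```

The first list holds the candidates to download first; the second is useful when no line is marked as failed.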

### 3. Download and Analyze CI Logs

Use `yarn ci dlog` to download the actual failure logs:

```bash
cd yarn-project && yarn ci dlog <HASH> > /tmp/<HASH>.log 2>&1
```

Read the downloaded log to find the root cause. These logs contain the real
Terraform output, Helm errors, script failures, etc.

If the log references further nested ci.aztec-labs.com URLs, follow them the same
way to get to the deepest failure.

Look for:
- Terraform errors (plan/apply failures, state locks, quota limits)
- Helm/Kubernetes errors (pod timeouts, image pull failures)
- Contract deployment errors (L1 tx reverts, gas issues)
- Script errors (missing env vars, bad config)

### 4. Categorize the Failure

**Infrastructure failures:**
- Terraform plan/apply errors (resource conflicts, quota limits, state locks)
- GCP authentication issues
- GKE cluster connectivity problems
- Helm release failures (timeout waiting for pods)

**Application failures:**
- Container crash loops (OOMKilled, startup probe failures)
- Contract deployment failures (L1 transaction reverts, gas issues)
- Configuration errors (missing env vars, invalid addresses)

**Network/external failures:**
- L1 RPC endpoint unreachable
- Docker image not found or pull rate limited
- DNS resolution failures
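
A first-pass triage over a downloaded log might score each category by keyword hits. The patterns below are illustrative assumptions picked to echo the lists above; they are not an exhaustive or authoritative rule set, and a human reading of the log should take precedence. `CATEGORY_PATTERNS` and `categorize` are hypothetical names.

```python
import re

# Illustrative patterns only; tune them against real logs.
CATEGORY_PATTERNS = {
    "infrastructure": [r"state lock", r"quota", r"timed out waiting for the condition"],
    "application": [r"OOMKilled", r"CrashLoopBackOff", r"revert"],
    "external": [r"ImagePullBackOff", r"toomanyrequests", r"no such host"],
}


def categorize(log_text: str) -> str:
    """Return the category with the most pattern hits, or 'unknown' if none match."""
    scores = {
        category: sum(
            len(re.findall(pattern, log_text, re.IGNORECASE)) for pattern in patterns
        )
        for category, patterns in CATEGORY_PATTERNS.items()
    }
    best = max(scores, key=scores.get)  # ties resolve to the first category listed
    return best if scores[best] > 0 else "unknown"
```

Counting hits rather than stopping at the first match keeps a stray keyword in an otherwise unrelated log from deciding the category on its own.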

### 5. Query Network Logs (conditional)

If the deployment got far enough that pods were created but the network failed to
start (e.g., "waiting for network to be ready" timeout, Helm succeeded but health
checks failed), query network logs for application-level errors.

Read `.claude/skills/network-logs/SKILL.md` for instructions, then use the
`network-logs` agent to query:
- Namespace: `<NAMESPACE>`
- Freshness: 30 minutes
- Focus: startup errors, crash loops, contract deployment failures

**Skip this step** if the failure was clearly infrastructure-level (Terraform error,
image pull failure, GCP auth issue, etc.).

### 6. Report Findings

Use `respond_to_user()` to reply to the Slack alert thread with a concise summary.

Format for Slack mrkdwn:

```
:red_circle: *Deploy Investigation: <NETWORK> v<SEMVER>*

*Root cause*: <one-line summary of what went wrong>

*Failed step*: "<step name>"

*Error*:
> <key error message, 1-3 lines>

*Category*: <infrastructure | application | external>

*CI log*: http://ci.aztec-labs.com/<HASH>

<If applicable>
*Suggested fix*: <what to do next — retry, fix config, increase quota, etc.>

<If network logs were queried>
*Network logs*: <relevant findings>

<RUN_URL|Full workflow logs>
```
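
For illustration, the template above could be filled in by a small formatter. This is a sketch only: `format_report` and its parameter names are hypothetical, and the optional sections are emitted only when a value is supplied.

```python
def format_report(network, semver, root_cause, failed_step, error, category,
                  ci_hash, run_url, suggested_fix=None, network_logs=None) -> str:
    """Build the Slack mrkdwn summary, keeping at most three lines of the error."""
    quoted = "\n".join(f"> {line}" for line in error.splitlines()[:3])
    parts = [
        f":red_circle: *Deploy Investigation: {network} v{semver}*",
        f"*Root cause*: {root_cause}",
        f'*Failed step*: "{failed_step}"',
        f"*Error*:\n{quoted}",
        f"*Category*: {category}",
        f"*CI log*: http://ci.aztec-labs.com/{ci_hash}",
    ]
    if suggested_fix:
        parts.append(f"*Suggested fix*: {suggested_fix}")
    if network_logs:
        parts.append(f"*Network logs*: {network_logs}")
    # Slack mrkdwn link syntax: <url|label>
    parts.append(f"<{run_url}|Full workflow logs>")
    return "\n\n".join(parts)
```

The resulting string would then be passed to `respond_to_user()`; how that tool accepts its argument is not shown in the source.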

Keep it concise and actionable — operators need to act quickly on deploy failures.

If you cannot determine the root cause, say so and provide what you found.

**Do NOT attempt to fix the deployment or re-run it. Investigation only.**

.github/workflows/deploy-network.yml

Lines changed: 39 additions & 12 deletions
@@ -207,24 +207,51 @@ jobs:
             fi
           } >> "$GITHUB_STEP_SUMMARY"

-      - name: Notify Slack on failure
+      - name: Notify Slack and dispatch ClaudeBox on failure
         if: failure()
         env:
           SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}
+          GH_TOKEN: ${{ secrets.AZTEC_BOT_GITHUB_TOKEN }}
         run: |
-          if [ -n "${SLACK_BOT_TOKEN}" ]; then
-            read -r -d '' data <<EOF || true
-            {
-              "channel": "#alerts-${{ inputs.network }}",
-              "text": "Deploy Network workflow FAILED for *${{ inputs.network }}* (version ${{ inputs.semver }}): <https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}|View Run>"
-            }
-          EOF
-            curl -X POST https://slack.com/api/chat.postMessage \
-              -H "Authorization: Bearer $SLACK_BOT_TOKEN" \
-              -H "Content-type: application/json" \
-              --data "$data"
+          if [ -z "${SLACK_BOT_TOKEN:-}" ]; then
+            echo "No SLACK_BOT_TOKEN, skipping notification"
+            exit 0
           fi

+          CHANNEL="#alerts-${{ inputs.network }}"
+          RUN_URL="https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}"
+          TEXT="Deploy Network workflow FAILED for *${{ inputs.network }}* (version ${{ inputs.semver }}): <${RUN_URL}|View Run> (🤖)"
+
+          # Post to Slack and capture timestamp for permalink
+          RESP=$(curl -sS -X POST https://slack.com/api/chat.postMessage \
+            -H "Authorization: Bearer $SLACK_BOT_TOKEN" \
+            -H "Content-type: application/json" \
+            -d "$(jq -n --arg c "$CHANNEL" --arg t "$TEXT" '{channel:$c, text:$t}')")
+          echo "Slack response: $RESP"
+
+          TS=$(echo "$RESP" | jq -r '.ts // empty')
+          CHANNEL_ID=$(echo "$RESP" | jq -r '.channel // empty')
+
+          LINK=""
+          if [[ -n "$TS" && -n "$CHANNEL_ID" ]]; then
+            LINK="https://aztecprotocol.slack.com/archives/$CHANNEL_ID/p${TS//./}"
+          fi
+
+          # Dispatch ClaudeBox to investigate the failure
+          PROMPT="Deployment of ${{ inputs.network }} (version ${{ inputs.semver }}) failed. \
+          Follow .claude/claudebox/deploy-investigation.md to investigate. \
+          GitHub Actions run: ${RUN_URL}. \
+          Network: ${{ inputs.network }}. Version: ${{ inputs.semver }}. \
+          Docker image: ${{ inputs.docker_image_tag || inputs.semver }}. \
+          Git ref: ${{ steps.checkout-ref.outputs.ref }}. \
+          Namespace: ${{ inputs.namespace || inputs.network }}. \
+          Deploy contracts: ${{ inputs.deploy_contracts }}."
+
+          gh workflow run claudebox.yml \
+            -f prompt="$PROMPT" \
+            -f link="${LINK:-$RUN_URL}" \
+            -f target_ref="${{ steps.checkout-ref.outputs.ref }}" || true
+
   update-irm:
     needs: deploy-network
     if: inputs.network == 'testnet' || inputs.network == 'mainnet'
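
The permalink logic in the added step (strip the dot from the message timestamp, prefix it with `p`, and fall back to the run URL when Slack returned nothing usable) can be mirrored in a few lines of Python. This is a sketch for illustration only; `slack_permalink` and `dispatch_link` are hypothetical names, and the workspace subdomain is copied from the script above.

```python
def slack_permalink(channel_id: str, ts: str, workspace: str = "aztecprotocol") -> str:
    """Mirror of the shell expression p${TS//./}; returns '' if either input is empty."""
    if not channel_id or not ts:
        return ""
    return f"https://{workspace}.slack.com/archives/{channel_id}/p{ts.replace('.', '')}"


def dispatch_link(channel_id: str, ts: str, run_url: str) -> str:
    """Mirror of ${LINK:-$RUN_URL}: prefer the Slack permalink, else the run URL."""
    return slack_permalink(channel_id, ts) or run_url
```

With the fallback, the dispatched workflow always receives a link even when the Slack post fails.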
