# ClaudeBox Deploy Investigation

Instructions for ClaudeBox when investigating a deployment failure.
This is triggered by `deploy-network.yml` when a deployment fails.

## Context

You will receive a prompt like:
> Deployment of NETWORK (version SEMVER) failed.
> Follow .claude/claudebox/deploy-investigation.md to investigate.
> GitHub Actions run: RUN_URL. Network: NETWORK. Version: SEMVER.
> Docker image: IMAGE_TAG. Git ref: REF. Namespace: NAMESPACE.
> Deploy contracts: true|false.

Extract these variables from the prompt:
- `NETWORK`: the network name (e.g., `testnet`, `staging-public`, `next-net`)
- `SEMVER`: the version being deployed
- `RUN_URL`: the GitHub Actions run URL
- `RUN_ID`: the numeric ID at the end of `RUN_URL`
- `NAMESPACE`: the Kubernetes namespace (usually the same as `NETWORK`)
- `IMAGE_TAG`: the Docker image tag
- `DEPLOY_CONTRACTS`: whether a fresh contract deployment was requested
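
For example, `RUN_ID` can be derived from `RUN_URL` with plain parameter expansion. This is a sketch that assumes the usual `.../actions/runs/<id>` URL shape (the URL below is a made-up example):

```shell
# Derive RUN_ID from a run URL like:
#   https://github.com/AztecProtocol/aztec-packages/actions/runs/1234567890
RUN_URL="https://github.com/AztecProtocol/aztec-packages/actions/runs/1234567890"
RUN_ID="${RUN_URL##*/}"   # strip everything up to the last '/'
echo "$RUN_ID"            # 1234567890
```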

## Constraints

You are running inside ClaudeBox. You do **not** have the `gh` CLI or `git push`.
Use the MCP tools instead: `github_api` and `respond_to_user`.

## Workflow

### 1. Fetch the Failed Job

```
github_api(method="GET", path="repos/AztecProtocol/aztec-packages/actions/runs/<RUN_ID>/jobs")
```

Find the job with `conclusion: "failure"`. Extract its `id` and note which step failed
(look in the `steps` array for the step with `conclusion: "failure"`).
### 2. Download GitHub Actions Job Logs

```
github_api(method="GET", path="repos/AztecProtocol/aztec-packages/actions/jobs/<JOB_ID>/logs")
```

The GitHub Actions logs are a wrapper. The **actual detailed logs** are on
ci.aztec-labs.com. Look for lines like:

```
Executing: <command> (http://ci.aztec-labs.com/<HASH>)
 0 . failed (Xs) (http://ci.aztec-labs.com/<HASH>)
```

Extract the `ci.aztec-labs.com/<HASH>` URL(s) from the logs. The hash after the
failed command is the one you need.
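
One way to pull the URLs out of a saved wrapper log is a `grep` pass. This is a sketch; the alphanumeric hash pattern and the `/tmp/gha-job.log` filename are assumptions:

```shell
# Extract all ci.aztec-labs.com log URLs from a saved GitHub Actions job log.
grep -oE 'http://ci\.aztec-labs\.com/[A-Za-z0-9]+' /tmp/gha-job.log | sort -u
```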

### 3. Download and Analyze CI Logs

Use `yarn ci dlog` to download the actual failure logs:

```bash
cd yarn-project && yarn ci dlog <HASH> > /tmp/<HASH>.log 2>&1
```

Read the downloaded log to find the root cause. These logs contain the real
Terraform output, Helm errors, script failures, etc.

If the log references further nested ci.aztec-labs.com URLs, follow them the same
way to get to the deepest failure.

Look for:
- Terraform errors (plan/apply failures, state locks, quota limits)
- Helm/Kubernetes errors (pod timeouts, image pull failures)
- Contract deployment errors (L1 tx reverts, gas issues)
- Script errors (missing env vars, bad config)
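
A quick first pass can surface likely root-cause lines before reading the whole file. The patterns below are illustrative only, and `HASH` is assumed to hold the hash from step 2:

```shell
# Surface candidate root-cause lines from the downloaded CI log.
HASH="${HASH:-example}"                  # hash extracted in step 2 (placeholder default)
LOG="/tmp/${HASH}.log"                   # the file written by `yarn ci dlog`
grep -inE 'error|failed|revert|timeout|OOMKilled|state lock|quota' "$LOG" | head -40
```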

### 4. Categorize the Failure

**Infrastructure failures:**
- Terraform plan/apply errors (resource conflicts, quota limits, state locks)
- GCP authentication issues
- GKE cluster connectivity problems
- Helm release failures (timeout waiting for pods)

**Application failures:**
- Container crash loops (OOMKilled, startup probe failures)
- Contract deployment failures (L1 transaction reverts, gas issues)
- Configuration errors (missing env vars, invalid addresses)

**Network/external failures:**
- L1 RPC endpoint unreachable
- Docker image not found or pull rate limited
- DNS resolution failures

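The categories above can be approximated with a grep-based triage helper. This is a sketch, not a definitive classifier; the signature strings are assumptions drawn from the failure lists above, and real logs will need judgment:

```shell
# Guess the failure category from a downloaded CI log (illustrative patterns only).
categorize() {
  local log="$1"
  if grep -qiE 'terraform|state lock|quota|helm' "$log"; then
    echo "infrastructure"
  elif grep -qiE 'OOMKilled|CrashLoopBackOff|revert|missing env' "$log"; then
    echo "application"
  elif grep -qiE 'ImagePullBackOff|rate limit|dns|connection refused' "$log"; then
    echo "external"
  else
    echo "unknown"
  fi
}

printf 'Error acquiring the state lock\n' > /tmp/sample-ci.log
categorize /tmp/sample-ci.log   # prints "infrastructure"
```

Note that match order matters: a log mentioning both Terraform and a revert is binned as infrastructure, which is usually the right call since the infra step ran first.
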
### 5. Query Network Logs (conditional)

If the deployment got far enough that pods were created but the network failed to
start (e.g., "waiting for network to be ready" timeout, Helm succeeded but health
checks failed), query network logs for application-level errors.

Read `.claude/skills/network-logs/SKILL.md` for instructions, then use the
`network-logs` agent to query:
- Namespace: `<NAMESPACE>`
- Freshness: 30 minutes
- Focus: startup errors, crash loops, contract deployment failures

**Skip this step** if the failure was clearly infrastructure-level (Terraform error,
image pull failure, GCP auth issue, etc.).

### 6. Report Findings

Use `respond_to_user()` to reply to the Slack alert thread with a concise summary.

Format for Slack mrkdwn:

```
:red_circle: *Deploy Investigation: <NETWORK> v<SEMVER>*

*Root cause*: <one-line summary of what went wrong>

*Failed step*: "<step name>"

*Error*:
> <key error message, 1-3 lines>

*Category*: <infrastructure | application | external>

*CI log*: http://ci.aztec-labs.com/<HASH>

<If applicable>
*Suggested fix*: <what to do next — retry, fix config, increase quota, etc.>

<If network logs were queried>
*Network logs*: <relevant findings>

<RUN_URL|Full workflow logs>
```

Keep it concise and actionable — operators need to act quickly on deploy failures.

If you cannot determine the root cause, say so and provide what you found.

**Do NOT attempt to fix the deployment or re-run it. Investigation only.**