# ClaudeBox Deploy Investigation

Instructions for ClaudeBox when investigating a deployment failure.
This is triggered by `deploy-network.yml` when a deployment fails.

## Context

You will receive a prompt like:
> Deployment of NETWORK (version SEMVER) failed.
> Follow .claude/claudebox/deploy-investigation.md to investigate.
> GitHub Actions run: RUN_URL. Network: NETWORK. Version: SEMVER.
> Docker image: IMAGE_TAG. Git ref: REF. Namespace: NAMESPACE.
> Deploy contracts: true|false.

Extract these variables from the prompt:
- `NETWORK`: the network name (e.g., `testnet`, `staging-public`, `next-net`)
- `SEMVER`: the version being deployed
- `RUN_URL`: the GitHub Actions run URL
- `RUN_ID`: the numeric ID at the end of `RUN_URL`
- `NAMESPACE`: the Kubernetes namespace (usually the same as `NETWORK`)
- `IMAGE_TAG`: the Docker image tag
- `DEPLOY_CONTRACTS`: whether a fresh contract deployment was requested
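
For example, `RUN_ID` can be derived from `RUN_URL` with plain parameter expansion. This is a sketch that assumes the usual `.../actions/runs/<id>` URL shape (the URL below is a made-up example):

```shell
# Derive RUN_ID from a run URL like:
#   https://github.com/AztecProtocol/aztec-packages/actions/runs/1234567890
RUN_URL="https://github.com/AztecProtocol/aztec-packages/actions/runs/1234567890"
RUN_ID="${RUN_URL##*/}"   # strip everything up to the last '/'
echo "$RUN_ID"            # 1234567890
```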

## Constraints

You are running inside ClaudeBox. You do **not** have the `gh` CLI or `git push`.
Use the MCP tools instead: `github_api` and `respond_to_user`.

## Workflow

### 1. Fetch the Failed Job

```
github_api(method="GET", path="repos/AztecProtocol/aztec-packages/actions/runs/<RUN_ID>/jobs")
```

Find the job with `conclusion: "failure"`. Extract its `id` and note which step failed
(look in the `steps` array for the step with `conclusion: "failure"`).
### 2. Download GitHub Actions Job Logs

```
github_api(method="GET", path="repos/AztecProtocol/aztec-packages/actions/jobs/<JOB_ID>/logs")
```

The GitHub Actions logs are a wrapper. The **actual detailed logs** are on
ci.aztec-labs.com. Look for lines like:

```
Executing: <command> (http://ci.aztec-labs.com/<HASH>)
 0 . failed (Xs) (http://ci.aztec-labs.com/<HASH>)
```

Extract the `ci.aztec-labs.com/<HASH>` URL(s) from the logs. The hash after the
failed command is the one you need.
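
One way to pull the URLs out of a saved wrapper log is a `grep` pass. This is a sketch; the alphanumeric hash pattern and the `/tmp/gha-job.log` filename are assumptions:

```shell
# Extract all ci.aztec-labs.com log URLs from a saved GitHub Actions job log.
grep -oE 'http://ci\.aztec-labs\.com/[A-Za-z0-9]+' /tmp/gha-job.log | sort -u
```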

### 3. Download and Analyze CI Logs

Use `yarn ci dlog` to download the actual failure logs:

```bash
cd yarn-project && yarn ci dlog <HASH> > /tmp/<HASH>.log 2>&1
```

Read the downloaded log to find the root cause. These logs contain the real
Terraform output, Helm errors, script failures, etc.

If the log references further nested ci.aztec-labs.com URLs, follow them the same
way to get to the deepest failure.

Look for:
- Terraform errors (plan/apply failures, state locks, quota limits)
- Helm/Kubernetes errors (pod timeouts, image pull failures)
- Contract deployment errors (L1 tx reverts, gas issues)
- Script errors (missing env vars, bad config)
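
A quick first pass can surface likely root-cause lines before reading the whole file. The patterns below are illustrative only, and `HASH` is assumed to hold the hash from step 2:

```shell
# Surface candidate root-cause lines from the downloaded CI log.
HASH="${HASH:-example}"                  # hash extracted in step 2 (placeholder default)
LOG="/tmp/${HASH}.log"                   # the file written by `yarn ci dlog`
grep -inE 'error|failed|revert|timeout|OOMKilled|state lock|quota' "$LOG" | head -40
```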

### 4. Categorize the Failure

**Infrastructure failures:**
- Terraform plan/apply errors (resource conflicts, quota limits, state locks)
- GCP authentication issues
- GKE cluster connectivity problems
- Helm release failures (timeout waiting for pods)

**Application failures:**
- Container crash loops (OOMKilled, startup probe failures)
- Contract deployment failures (L1 transaction reverts, gas issues)
- Configuration errors (missing env vars, invalid addresses)

**Network/external failures:**
- L1 RPC endpoint unreachable
- Docker image not found or pull rate limited
- DNS resolution failures

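The categories above can be approximated with a grep-based triage helper. This is a sketch, not a definitive classifier; the signature strings are assumptions drawn from the failure lists above, and real logs will need judgment:

```shell
# Guess the failure category from a downloaded CI log (illustrative patterns only).
categorize() {
  local log="$1"
  if grep -qiE 'terraform|state lock|quota|helm' "$log"; then
    echo "infrastructure"
  elif grep -qiE 'OOMKilled|CrashLoopBackOff|revert|missing env' "$log"; then
    echo "application"
  elif grep -qiE 'ImagePullBackOff|rate limit|dns|connection refused' "$log"; then
    echo "external"
  else
    echo "unknown"
  fi
}

printf 'Error acquiring the state lock\n' > /tmp/sample-ci.log
categorize /tmp/sample-ci.log   # prints "infrastructure"
```

Note that match order matters: a log mentioning both Terraform and a revert is binned as infrastructure, which is usually the right call since the infra step ran first.
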
### 5. Query Network Logs (conditional)

If the deployment got far enough that pods were created but the network failed to
start (e.g., "waiting for network to be ready" timeout, Helm succeeded but health
checks failed), query network logs for application-level errors.

Read `.claude/skills/network-logs/SKILL.md` for instructions, then use the
`network-logs` agent to query:
- Namespace: `<NAMESPACE>`
- Freshness: 30 minutes
- Focus: startup errors, crash loops, contract deployment failures

**Skip this step** if the failure was clearly infrastructure-level (Terraform error,
image pull failure, GCP auth issue, etc.).

### 6. Report Findings

Use `respond_to_user()` to reply to the Slack alert thread with a concise summary.

Format for Slack mrkdwn:

```
:red_circle: *Deploy Investigation: <NETWORK> v<SEMVER>*

*Root cause*: <one-line summary of what went wrong>

*Failed step*: "<step name>"

*Error*:
> <key error message, 1-3 lines>

*Category*: <infrastructure | application | external>

*CI log*: http://ci.aztec-labs.com/<HASH>

<If applicable>
*Suggested fix*: <what to do next — retry, fix config, increase quota, etc.>

<If network logs were queried>
*Network logs*: <relevant findings>

<RUN_URL|Full workflow logs>
```

Keep it concise and actionable — operators need to act quickly on deploy failures.

If you cannot determine the root cause, say so and provide what you found.

**Do NOT attempt to fix the deployment or re-run it. Investigation only.**