Commit 4576993

feat: merge-train/spartan (#22352)

BEGIN_COMMIT_OVERRIDE
chore: fix mempool limit test (#22332)
fix(bot): bot fee juice funding (#21949)
fix(foundation): flush current batch on BatchQueue.stop() (#22341)
chore: (A-750) read JSON body then parse to avoid double stream consumption on error message (#22247)
chore: bump log level in stg-public (#22354)
chore: fix main.tf syntax (#22356)
chore: wire up spartan checks to make (#22358)
fix(p2p): reduce flakiness in proposal tx collector benchmark (#22240)
fix: disable sponsored fpc and test accounts for devnet (#22331)
chore: add v4-devnet-3 to tf network ingress (#22327)
chore: remove unused env var (#22365)
chore: add pdb (#22364)
chore: dispatch CB on failed deployments (#22367)
chore: (A-749) single character url join (#22269)
feat: support different docker image for HA validator nodes (#22371)
chore: fix the daily healthchecks (#22373)
chore: remove v4-devnet-2 references (#22372)
fix: rename #team-alpha → #e-team-alpha slack channel (#22374)
chore(pipeline): timetable adjustments under pipelining (#21076)
feat(pipeline): handle pipeline prunes (#21250)
fix: handle error types serialization errors (#22379)
feat(spartan): configurable HA validator replica count (#22384)
fix(e2e): increase prune timeout in epochs_mbps_pipeline test (#22392)
fix(epoch-cache): use TTL-based caching with finalization tracking and correct lag (#22204)
chore: deflake e2e ha sync test (#22403)
chore(ci): skip prunes-uncheckpointed test in epochs_mbps_pipeline (#22401)
refactor(slasher): remove empire slasher model (#21830)
fix: use strict equality in world-state ops queue (#22398)
fix: remove unused BLOCK reqresp sub-protocol (#22407)
refactor(sequencer): sign last block before archiver sync (#22117)
feat(world-state): add genesis timestamp support and GenesisData type (#22359)
fix: use Int64Value instead of Uint32Value for 64-bit map sizes (#22400)
chore: Reduce logging verbosity (#22423)
fix(p2p): include values in tx validation error messages (#22422)
END_COMMIT_OVERRIDE

2 parents 2f77ec5 + 3744482, commit 4576993

246 files changed: 3493 additions & 6746 deletions

.claude/claudebox/deploy-investigation.md

Lines changed: 143 additions & 0 deletions
# ClaudeBox Deploy Investigation

Instructions for ClaudeBox when investigating a deployment failure.
This is triggered by `deploy-network.yml` when a deployment fails.

## Context

You will receive a prompt like:

> Deployment of NETWORK (version SEMVER) failed.
> Follow .claude/claudebox/deploy-investigation.md to investigate.
> GitHub Actions run: RUN_URL. Network: NETWORK. Version: SEMVER.
> Docker image: IMAGE_TAG. Git ref: REF. Namespace: NAMESPACE.
> Deploy contracts: true|false.

Extract these variables from the prompt:

- `NETWORK`: the network name (e.g., `testnet`, `staging-public`, `next-net`)
- `SEMVER`: the version being deployed
- `RUN_URL`: the GitHub Actions run URL
- `RUN_ID`: the numeric ID at the end of `RUN_URL`
- `NAMESPACE`: the Kubernetes namespace (usually the same as `NETWORK`)
- `IMAGE_TAG`: the Docker image tag
- `DEPLOY_CONTRACTS`: whether a fresh contract deployment was requested
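As a sketch of the `RUN_ID` extraction (a hypothetical helper, not part of the playbook itself):

```python
import re

def extract_run_id(run_url: str) -> str:
    """RUN_ID is the trailing numeric segment of a GitHub Actions run URL."""
    m = re.search(r"/actions/runs/(\d+)", run_url)
    if not m:
        raise ValueError(f"not a GitHub Actions run URL: {run_url}")
    return m.group(1)

# Example in the URL shape used by the prompt above (run number is made up)
print(extract_run_id("https://github.com/AztecProtocol/aztec-packages/actions/runs/12345678901"))
# → 12345678901
```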
## Constraints

You are running inside ClaudeBox. You do **not** have the `gh` CLI or `git push`.
Use MCP tools instead: `github_api`, `respond_to_user`.

## Workflow

### 1. Fetch the Failed Job

```
github_api(method="GET", path="repos/AztecProtocol/aztec-packages/actions/runs/<RUN_ID>/jobs")
```

Find the job with `conclusion: "failure"`. Extract its `id` and note which step failed
(look at the `steps` array for the step with `conclusion: "failure"`).

### 2. Download GitHub Actions Job Logs

```
github_api(method="GET", path="repos/AztecProtocol/aztec-packages/actions/jobs/<JOB_ID>/logs")
```

The GitHub Actions logs are a wrapper. The **actual detailed logs** are on
ci.aztec-labs.com. Look for lines like:

```
Executing: <command> (http://ci.aztec-labs.com/<HASH>)
0 . failed (Xs) (http://ci.aztec-labs.com/<HASH>)
```

Extract the `ci.aztec-labs.com/<HASH>` URL(s) from the logs. The hash after the
failed command is the one you need.
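Pulling those hashes out of the wrapper log can be sketched as follows (hypothetical helper and sample log, not part of the playbook):

```python
import re

SAMPLE_LOG = """\
Executing: ./spartan/bootstrap.sh deploy (http://ci.aztec-labs.com/abc123)
0 . failed (42s) (http://ci.aztec-labs.com/def456)
"""

def ci_log_hashes(log_text: str) -> list[str]:
    """Collect ci.aztec-labs.com hashes in order of appearance."""
    return re.findall(r"http://ci\.aztec-labs\.com/([0-9a-f]+)", log_text)

hashes = ci_log_hashes(SAMPLE_LOG)
print(hashes)      # all hashes, in order of appearance
print(hashes[-1])  # the hash on the "failed" line is the one to download
```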
### 3. Download and Analyze CI Logs

Use `yarn ci dlog` to download the actual failure logs:

```bash
cd yarn-project && yarn ci dlog <HASH> > /tmp/<HASH>.log 2>&1
```

Read the downloaded log to find the root cause. These logs contain the real
Terraform output, Helm errors, script failures, etc.

If the log references further nested ci.aztec-labs.com URLs, follow them the same
way to get to the deepest failure.

Look for:

- Terraform errors (plan/apply failures, state locks, quota limits)
- Helm/Kubernetes errors (pod timeouts, image pull failures)
- Contract deployment errors (L1 tx reverts, gas issues)
- Script errors (missing env vars, bad config)

### 4. Categorize the Failure

**Infrastructure failures:**

- Terraform plan/apply errors (resource conflicts, quota limits, state locks)
- GCP authentication issues
- GKE cluster connectivity problems
- Helm release failures (timeout waiting for pods)

**Application failures:**

- Container crash loops (OOMKilled, startup probe failures)
- Contract deployment failures (L1 transaction reverts, gas issues)
- Configuration errors (missing env vars, invalid addresses)

**Network/external failures:**

- L1 RPC endpoint unreachable
- Docker image not found or pull rate limited
- DNS resolution failures
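The categories above can be approximated with keyword heuristics; a toy sketch (the hint lists are illustrative, not from the playbook, and the real call is made by reading the log, not string matching):

```python
# Hypothetical keyword hints distilled from the failure categories above.
CATEGORY_HINTS = {
    "infrastructure": ["terraform", "state lock", "quota", "gke", "helm", "timed out waiting"],
    "application": ["oomkilled", "crashloopbackoff", "reverted", "missing env"],
    "external": ["rate limit", "dns", "connection refused", "rpc"],
}

def guess_category(error_line: str) -> str:
    """Return the first category whose hints match, else 'unknown'."""
    line = error_line.lower()
    for category, hints in CATEGORY_HINTS.items():
        if any(h in line for h in hints):
            return category
    return "unknown"

print(guess_category("Error: Error acquiring the state lock"))          # infrastructure
print(guess_category("Back-off restarting failed container (OOMKilled)"))  # application
```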
### 5. Query Network Logs (conditional)

If the deployment got far enough that pods were created but the network failed to
start (e.g., a "waiting for network to be ready" timeout, or Helm succeeded but health
checks failed), query network logs for application-level errors.

Read `.claude/skills/network-logs/SKILL.md` for instructions, then use the
`network-logs` agent to query:

- Namespace: `<NAMESPACE>`
- Freshness: 30 minutes
- Focus: startup errors, crash loops, contract deployment failures

**Skip this step** if the failure was clearly infrastructure-level (Terraform error,
image pull failure, GCP auth issue, etc.).

### 6. Report Findings

Use `respond_to_user()` to reply to the Slack alert thread with a concise summary.

Format for Slack mrkdwn:

```
:red_circle: *Deploy Investigation: <NETWORK> v<SEMVER>*

*Root cause*: <one-line summary of what went wrong>

*Failed step*: "<step name>"

*Error*:
> <key error message, 1-3 lines>

*Category*: <infrastructure | application | external>

*CI log*: http://ci.aztec-labs.com/<HASH>

<If applicable>
*Suggested fix*: <what to do next — retry, fix config, increase quota, etc.>

<If network logs were queried>
*Network logs*: <relevant findings>

<RUN_URL|Full workflow logs>
```
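The template can be assembled mechanically; a minimal sketch covering only the required fields (hypothetical helper, optional sections omitted):

```python
def build_report(network: str, semver: str, root_cause: str, failed_step: str,
                 error: str, category: str, ci_hash: str, run_url: str) -> str:
    """Assemble the Slack mrkdwn report in the template's shape."""
    # The template quotes the key error message, capped at 3 lines.
    quoted_error = "\n".join("> " + line for line in error.splitlines()[:3])
    return (
        f":red_circle: *Deploy Investigation: {network} v{semver}*\n\n"
        f"*Root cause*: {root_cause}\n\n"
        f'*Failed step*: "{failed_step}"\n\n'
        f"*Error*:\n{quoted_error}\n\n"
        f"*Category*: {category}\n\n"
        f"*CI log*: http://ci.aztec-labs.com/{ci_hash}\n\n"
        f"<{run_url}|Full workflow logs>"
    )

print(build_report("testnet", "1.2.3",
                   "Helm timeout waiting for validator pods",
                   "helm upgrade",
                   "Error: timed out waiting for the condition",
                   "infrastructure", "abc123",
                   "https://github.com/AztecProtocol/aztec-packages/actions/runs/1"))
```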
Keep it concise and actionable — operators need to act quickly on deploy failures.

If you cannot determine the root cause, say so and provide what you found.

**Do NOT attempt to fix the deployment or re-run it. Investigation only.**

.claude/skills/merge-trains/SKILL.md

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@ A merge train is an automated batching system (inspired by [Rust rollups](https:
 | `merge-train/ci` | CI infrastructure / ci3 | `#help-ci` |
 | `merge-train/docs` | Documentation | `#dev-rels` |
 | `merge-train/fairies` | aztec-nr | `#team-fairies` |
-| `merge-train/spartan` | Spartan / infra / yarn-project sequencer and prover orchestration | `#team-alpha` |
+| `merge-train/spartan` | Spartan / infra / yarn-project sequencer and prover orchestration | `#e-team-alpha` |
 
 ## How to Use a Merge Train

.github/workflows/deploy-network.yml

Lines changed: 48 additions & 12 deletions
@@ -30,6 +30,10 @@ on:
       required: false
       type: boolean
       default: false
+    ha_docker_image:
+      description: "Full docker image for HA validator nodes (optional, defaults to aztec docker image)"
+      required: false
+      type: string
     source_tag:
       description: "Source tag that triggered this deploy"
       required: false
@@ -62,6 +66,10 @@ on:
       required: false
       type: boolean
       default: false
+    ha_docker_image:
+      description: "Full docker image for HA validator nodes (optional, defaults to aztec docker image)"
+      required: false
+      type: string
     source_tag:
       description: "Source tag that triggered this deploy"
       required: false
@@ -172,6 +180,7 @@ jobs:
           AZTEC_DOCKER_IMAGE: "aztecprotocol/aztec:${{ inputs.docker_image_tag || inputs.semver }}"
           CREATE_ROLLUP_CONTRACTS: ${{ inputs.deploy_contracts == true && 'true' || '' }}
           PROVER_AGENT_DOCKER_IMAGE: "aztecprotocol/aztec-prover-agent:${{ inputs.docker_image_tag || inputs.semver }}"
+          VALIDATOR_HA_DOCKER_IMAGE: ${{ inputs.ha_docker_image || '' }}
         run: |
           echo "Deploying network: ${{ inputs.network }}"
           echo "Using image: $AZTEC_DOCKER_IMAGE"
@@ -207,24 +216,51 @@ jobs:
             fi
           } >> "$GITHUB_STEP_SUMMARY"
 
-      - name: Notify Slack on failure
+      - name: Notify Slack and dispatch ClaudeBox on failure
         if: failure()
         env:
           SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}
+          GH_TOKEN: ${{ secrets.AZTEC_BOT_GITHUB_TOKEN }}
         run: |
-          if [ -n "${SLACK_BOT_TOKEN}" ]; then
-            read -r -d '' data <<EOF || true
-          {
-            "channel": "#alerts-${{ inputs.network }}",
-            "text": "Deploy Network workflow FAILED for *${{ inputs.network }}* (version ${{ inputs.semver }}): <https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}|View Run>"
-          }
-          EOF
-            curl -X POST https://slack.com/api/chat.postMessage \
-              -H "Authorization: Bearer $SLACK_BOT_TOKEN" \
-              -H "Content-type: application/json" \
-              --data "$data"
+          if [ -z "${SLACK_BOT_TOKEN:-}" ]; then
+            echo "No SLACK_BOT_TOKEN, skipping notification"
+            exit 0
           fi
 
+          CHANNEL="#alerts-${{ inputs.network }}"
+          RUN_URL="https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}"
+          TEXT="Deploy Network workflow FAILED for *${{ inputs.network }}* (version ${{ inputs.semver }}): <${RUN_URL}|View Run> (🤖)"
+
+          # Post to Slack and capture timestamp for permalink
+          RESP=$(curl -sS -X POST https://slack.com/api/chat.postMessage \
+            -H "Authorization: Bearer $SLACK_BOT_TOKEN" \
+            -H "Content-type: application/json" \
+            -d "$(jq -n --arg c "$CHANNEL" --arg t "$TEXT" '{channel:$c, text:$t}')")
+          echo "Slack response: $RESP"
+
+          TS=$(echo "$RESP" | jq -r '.ts // empty')
+          CHANNEL_ID=$(echo "$RESP" | jq -r '.channel // empty')
+
+          LINK=""
+          if [[ -n "$TS" && -n "$CHANNEL_ID" ]]; then
+            LINK="https://aztecprotocol.slack.com/archives/$CHANNEL_ID/p${TS//./}"
+          fi
+
+          # Dispatch ClaudeBox to investigate the failure
+          PROMPT="Deployment of ${{ inputs.network }} (version ${{ inputs.semver }}) failed. \
+          Follow .claude/claudebox/deploy-investigation.md to investigate. \
+          GitHub Actions run: ${RUN_URL}. \
+          Network: ${{ inputs.network }}. Version: ${{ inputs.semver }}. \
+          Docker image: ${{ inputs.docker_image_tag || inputs.semver }}. \
+          Git ref: ${{ steps.checkout-ref.outputs.ref }}. \
+          Namespace: ${{ inputs.namespace || inputs.network }}. \
+          Deploy contracts: ${{ inputs.deploy_contracts }}."
+
+          gh workflow run claudebox.yml \
+            -f prompt="$PROMPT" \
+            -f link="${LINK:-$RUN_URL}" \
+            -f target_ref="${{ steps.checkout-ref.outputs.ref }}" || true
+
   update-irm:
     needs: deploy-network
     if: inputs.network == 'testnet' || inputs.network == 'mainnet'
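One detail in this diff worth unpacking: the permalink line `LINK="https://.../p${TS//./}"` turns a Slack message timestamp into a permalink by deleting the dot and prefixing `p`. A minimal Python equivalent (workspace URL taken from the diff, channel ID made up):

```python
def slack_permalink(channel_id: str, ts: str,
                    workspace: str = "https://aztecprotocol.slack.com") -> str:
    """Mirror the shell's LINK="$workspace/archives/$CHANNEL_ID/p${TS//./}".

    Slack archive permalinks use the message ts with the '.' removed,
    prefixed by 'p'."""
    return f"{workspace}/archives/{channel_id}/p{ts.replace('.', '')}"

print(slack_permalink("C0123ABCD", "1724240034.000200"))
# → https://aztecprotocol.slack.com/archives/C0123ABCD/p1724240034000200
```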

.github/workflows/network-healthcheck.yml

Lines changed: 25 additions & 1 deletion
@@ -22,4 +22,28 @@ jobs:
       env:
         SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}
         GH_TOKEN: ${{ secrets.AZTEC_BOT_GITHUB_TOKEN }}
-      run: ./ci3/network_healthcheck "${{ inputs.networks || 'v4-devnet-2,testnet,mainnet,staging-public,next-net' }}"
+        CI: "1"
+      run: |
+        NETWORKS="${{ inputs.networks || 'next-net,staging-public,testnet,mainnet' }}"
+
+        PROMPT="Run a network healthcheck for: ${NETWORKS}.
+
+        For each network, query Cloud Logging to report:
+        1. Components running
+        2. Latest L2 block and slot numbers
+        3. Peer counts
+        4. Block production cadence (last ~10 checkpoints)
+        5. Any errors (level >= 50) or warnings (level 40) in the last 8 hours
+        6. Bot status if applicable
+
+        Create a gist with the full healthcheck report. Then post a concise summary to the #e-team-alpha channel via respond_to_user. Flag anything that needs attention (stopped bots, missed slots, errors, low peer counts).
+
+        Format the respond_to_user message as a brief network status overview, e.g.:
+        - testnet: healthy, block 5570, 100 peers
+        - mainnet: healthy, block 1234, 50 peers
+        - devnet: WARNING - bot stopped (insufficient balance)
+        Link to the gist for full details."
+
+        ./ci3/slack_notify_with_claudebox_kickoff "#e-team-alpha" \
+          "Starting network healthcheck for: ${NETWORKS}" \
+          "$PROMPT"

.test_patterns.yml

Lines changed: 7 additions & 0 deletions
@@ -253,6 +253,13 @@ tests:
     owners:
       - *phil
 
+  # Consistently times out — prune detection timing too tight for CI resources
+  # See https://github.com/AztecProtocol/aztec-packages/pull/22392
+  - regex: "epochs_mbps.pipeline.parallel.test.ts.*prunes uncheckpointed"
+    skip: true
+    owners:
+      - *sean
+
   # Blanket flake patterns for unstable p2p, epoch, and l1 tx utils tests
   # This is a temporary measure while team bandwidth is constrained
   # Replaced many specific patterns - see https://github.com/AztecProtocol/aztec-packages/pull/17962 for historical context

Makefile

Lines changed: 5 additions & 2 deletions
@@ -47,15 +47,15 @@ endef
 # PHONY TARGETS - List every target that has a file/dir of the same name.
 #==============================================================================
 
-.PHONY: noir barretenberg noir-projects l1-contracts release-image boxes playground docs aztec-up
+.PHONY: noir barretenberg noir-projects l1-contracts release-image boxes playground docs aztec-up spartan
 
 #==============================================================================
 # BOOTSTRAP TARGETS
 #==============================================================================
 
 # Fast bootstrap.
 fast: release-image barretenberg boxes playground docs aztec-up \
-  bb-tests l1-contracts-tests yarn-project-tests boxes-tests playground-tests aztec-up-tests docs-tests noir-protocol-circuits-tests release-image-tests
+  bb-tests l1-contracts-tests yarn-project-tests boxes-tests playground-tests aztec-up-tests docs-tests noir-protocol-circuits-tests release-image-tests spartan
 
 # Full bootstrap.
 full: fast bb-full-tests bb-cpp-full yarn-project-benches
@@ -364,3 +364,6 @@ aztec-up: yarn-project
 
 aztec-up-tests: aztec-up
 	$(call test,$@,aztec-up)
+
+spartan:
+	$(call build,$@,spartan)

barretenberg/cpp/src/barretenberg/nodejs_module/lmdb_store/lmdb_store_wrapper.cpp

Lines changed: 6 additions & 1 deletion
@@ -35,7 +35,12 @@ LMDBStoreWrapper::LMDBStoreWrapper(const Napi::CallbackInfo& info)
     uint64_t map_size = DEFAULT_MAP_SIZE;
     if (info.Length() > map_size_index) {
         if (info[map_size_index].IsNumber()) {
-            map_size = info[map_size_index].As<Napi::Number>().Uint32Value();
+            // Int64Value is the widest integer accessor in N-API (no Uint64Value exists)
+            int64_t val = info[map_size_index].As<Napi::Number>().Int64Value();
+            if (val <= 0) {
+                throw Napi::TypeError::New(env, "Map size must be a positive number");
+            }
+            map_size = static_cast<uint64_t>(val);
         } else {
             throw Napi::TypeError::New(env, "Map size must be a number or an object");
         }
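The bug this hunk fixes: `Uint32Value()` keeps only the low 32 bits of the requested map size, so any LMDB map size of 4 GiB or more silently wrapped around. A quick illustration of the truncation (Python standing in for the N-API accessors):

```python
# A 10 GiB map size requested from JS...
requested = 10 * 1024**3          # 10737418240

# ...read via a 32-bit accessor keeps only the low 32 bits,
# which is what Uint32Value() effectively did before this commit.
truncated = requested & 0xFFFFFFFF
print(truncated)                  # 2147483648, i.e. silently a 2 GiB map

# A 64-bit signed read (the fix) preserves the value; the code then
# rejects val <= 0 and casts back to uint64_t.
as_int64 = requested
assert as_int64 == 10 * 1024**3
```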

barretenberg/cpp/src/barretenberg/nodejs_module/world_state/world_state.cpp

Lines changed: 26 additions & 5 deletions
@@ -119,19 +119,39 @@ WorldStateWrapper::WorldStateWrapper(const Napi::CallbackInfo& info)
         throw Napi::TypeError::New(env, "Header generator point needs to be a number");
     }
 
+    uint64_t genesis_timestamp = 0;
+    size_t genesis_timestamp_index = 5;
+    if (info.Length() > genesis_timestamp_index) {
+        if (info[genesis_timestamp_index].IsNumber()) {
+            genesis_timestamp = static_cast<uint64_t>(info[genesis_timestamp_index].As<Napi::Number>().Int64Value());
+        } else {
+            throw Napi::TypeError::New(env, "Genesis timestamp needs to be a number");
+        }
+    }
+
     // optional parameters
-    size_t map_size_index = 5;
+    size_t map_size_index = 6;
     if (info.Length() > map_size_index) {
         if (info[map_size_index].IsObject()) {
             Napi::Object obj = info[map_size_index].As<Napi::Object>();
 
             for (auto tree_id : tree_ids) {
                 if (obj.Has(tree_id)) {
-                    map_size[tree_id] = obj.Get(tree_id).As<Napi::Number>().Uint32Value();
+                    // Int64Value is the widest integer accessor in N-API (no Uint64Value exists)
+                    int64_t val = obj.Get(tree_id).As<Napi::Number>().Int64Value();
+                    if (val <= 0) {
+                        throw Napi::TypeError::New(env, "Map size must be a positive number");
+                    }
+                    map_size[tree_id] = static_cast<uint64_t>(val);
                 }
             }
         } else if (info[map_size_index].IsNumber()) {
-            uint64_t size = info[map_size_index].As<Napi::Number>().Uint32Value();
+            // Int64Value is the widest integer accessor in N-API (no Uint64Value exists)
+            int64_t val = info[map_size_index].As<Napi::Number>().Int64Value();
+            if (val <= 0) {
+                throw Napi::TypeError::New(env, "Map size must be a positive number");
+            }
+            uint64_t size = static_cast<uint64_t>(val);
             for (auto tree_id : tree_ids) {
                 map_size[tree_id] = size;
             }
@@ -140,7 +160,7 @@ WorldStateWrapper::WorldStateWrapper(const Napi::CallbackInfo& info)
         }
     }
 
-    size_t thread_pool_size_index = 6;
+    size_t thread_pool_size_index = 7;
     if (info.Length() > thread_pool_size_index) {
         if (!info[thread_pool_size_index].IsNumber()) {
             throw Napi::TypeError::New(env, "Thread pool size must be a number");
@@ -155,7 +175,8 @@ WorldStateWrapper::WorldStateWrapper(const Napi::CallbackInfo& info)
         tree_height,
         tree_prefill,
         prefilled_public_data,
-        initial_header_generator_point);
+        initial_header_generator_point,
+        genesis_timestamp);
 
     _dispatcher.register_target(
         WorldStateMessageType::GET_TREE_INFO,
