feat: merge-train/spartan#23580

Open
AztecBot wants to merge 25 commits into next from merge-train/spartan

@AztecBot AztecBot commented May 27, 2026

BEGIN_COMMIT_OVERRIDE
fix(archiver): skip descendants of invalid-attestations checkpoints (#23502)
chore: scale network validators (#23579)
fix(ci): nightly 10 TPS bench GCP auth and checkout (#23586)
chore: set eth node resource profile (#23583)
fix: wait for checkpoint before sentinel assertions (#23573)
fix: slash attestations for invalid checkpoint proposals (#23506)
test: fix web3signer pipelining e2e_multi_validator_node_key_store.test.ts (#23568)
fix: cap CI devbox hostname (#23591)
test: stabilize invalid checkpoint descendant e2e (#23582)
test(e2e): stabilize invalidation slots in proposer invalidates multiple checkpoints (#23590)
test(e2e): stabilize invalid proposal slashing target slot in attested_invalid_proposal (#23589)
chore(foundation): faster toBufferBE via zero fast-path (#23592)
fix: honour BB_BINARY_PATH (#23570)
chore: bump reth and lighthouse (#23588)
chore: add web3signer and postgres node selectors (#23598)
fix: do not symlink .codex folders (#23593)
chore: fix claude and codex symlinking tests (#23599)
test(e2e): narrow down sentinel check in multiple_validators_sentinel (#23604)
test(e2e): fix proposer invalidates multiple checkpoints timeout (#23608)
END_COMMIT_OVERRIDE

…23502)

## Motivation

`archiver/src/modules/l1_synchronizer.ts` skipped checkpoints with
insufficient/invalid attestations under the assumption that the next
proposer would invalidate them before publishing. When that assumption
was violated — i.e., proposer P2 published a valid-attestations
checkpoint that extended P1's invalid one — the archiver hit
`InitialCheckpointNumberNotSequentialError` in
`block_store.addCheckpoints`, the catch handler rolled back the L1 sync
point, and the next poll re-fetched the same range and re-threw. The
archiver looped indefinitely. The protocol already defines
`OffenseType.PROPOSED_DESCENDANT_OF_CHECKPOINT_WITH_INVALID_ATTESTATIONS`
for exactly this case but the slasher couldn't see valid-attestations
descendants because the archiver threw before emitting any event.

### Human Note

This is particularly relevant under pipelining. Attestors now attest to
a checkpoint _before_ the previous one is pushed to L1, so they can
inadvertently attest to a checkpoint built on top of one that became
invalid when it was published to the rollup contract with wrong
attestations. An honest attestor could therefore get slashed if the
proposer was malicious.

## Approach

In the synchronizer, persist rejected ancestors in the block store keyed
by archive root. On each new checkpoint, before attestation validation,
compare its `header.lastArchiveRoot` against the persisted set — if it
matches, skip the checkpoint as a descendant of an invalid ancestor and
emit a new
`L2BlockSourceEvents.CheckpointBuiltOnInvalidAncestorDetected` event
with enough metadata to resolve the proposer. The slasher's
`AttestationsBlockWatcher` is fixed to slash the proposer (not the
attestors) under the new event.
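
The ordering of the two checks described above can be sketched as follows. This is a minimal illustration, not the archiver's code: `CheckpointLike`, `RejectedAncestors`, and `classifyCheckpoint` are hypothetical names, an in-memory `Set` stands in for the persisted block-store set, and recording a skipped descendant's own archive root (so that deeper descendants are skipped too) is an assumption.

```typescript
// Hypothetical, simplified shapes; the real archiver types carry much more data.
interface CheckpointLike {
  archiveRoot: string; // archive root after applying this checkpoint
  lastArchiveRoot: string; // header.lastArchiveRoot, i.e. the parent's archive root
  attestationsValid: boolean; // result of attestation validation
}

type SyncDecision = 'accepted' | 'invalid-attestations' | 'built-on-invalid-ancestor';

// Stand-in for the persisted set of rejected ancestors, keyed by archive root.
class RejectedAncestors {
  private readonly roots = new Set<string>();
  add(root: string): void {
    this.roots.add(root);
  }
  has(root: string): boolean {
    return this.roots.has(root);
  }
}

function classifyCheckpoint(cp: CheckpointLike, rejected: RejectedAncestors): SyncDecision {
  // The ancestor check runs first, before attestation validation.
  if (rejected.has(cp.lastArchiveRoot)) {
    // Assumption: remember this root too, so grandchildren are also skipped.
    rejected.add(cp.archiveRoot);
    return 'built-on-invalid-ancestor';
  }
  if (!cp.attestationsValid) {
    rejected.add(cp.archiveRoot);
    return 'invalid-attestations';
  }
  return 'accepted';
}
```

In this sketch the `'built-on-invalid-ancestor'` branch is where the new event would be emitted instead of letting the store throw.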

Fixes A-1072
alexghr and others added 2 commits May 27, 2026 14:27
## Summary

- Run `gcp_auth` before `setup_gcp_secrets` in `source_network_env` so
EC2 benchmark jobs can read Secret Manager (e.g. `otel-collector-url`).
- Improve `setup_gcp_secrets.sh` diagnostics and activate the CI service
account before secret fetches.
- Install Terraform on Linux in `install_deps.sh`; add `setup-terraform`
on nightly wait jobs.
- Fix `deploy-network` checkout for pinned submodules (`fetch-depth: 0`,
`lfs: true`).
- Checkout `github.sha` on the benchmark job so workflow_dispatch from a
feature branch runs that branch on EC2 (not `next`).

Validated manually via Nightly Bench 10 TPS workflow_dispatch on this
branch (run succeeded).

## Test plan

- [x] Nightly Bench 10 TPS workflow_dispatch from
`spy/10tps-bench-terraform` (deploy, wait, benchmark)
@PhilWindle PhilWindle requested a review from charlielye as a code owner May 27, 2026 13:56
alexghr and others added 22 commits May 27, 2026 14:08
## Summary

- Stabilizes the multiple-validator sentinel e2e by waiting for a
post-warmup checkpoint before recording the assertion window.
- Reuses the same warm-up helper in the second test so isolated runs
avoid the same fresh-network startup noise before stopping a validator.

## Failed run

Failed CI run: http://ci.aztec-labs.com/07fb31bc0706159f

The failing test was `e2e_p2p_multiple_validators_sentinel > collects
attestations for all validators on a node`. The test expected no
`attestation-missed` entries, but the assertion window started while the
network was still in the first pipelined slots after startup. In the
failed run, slot 8 was built on a pending, not-yet-checkpointed parent,
so some remote validators could not validate/attest in time and the
sentinel recorded a missed attestation.

## Fix

The test now waits for one warm-up slot and then waits for the observed
checkpoint number to advance before capturing `initialSlot`. That keeps
startup pipelining behavior out of the strict sentinel assertion window
while preserving the test's actual coverage: once the network is past
warm-up, every validator should be observed attesting or proposing as
expected.
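
The "wait for the observed checkpoint number to advance" step could look roughly like the polling helper below. It is a sketch under stated assumptions: `waitForCheckpointAdvance` and its injected `getCheckpointNumber` getter are hypothetical names, not the test suite's actual warm-up helper.

```typescript
interface WaitOptions {
  pollMs: number; // delay between polls
  timeoutMs: number; // give up after this long
}

// Resolves with the first observed checkpoint number greater than the one seen at call time.
async function waitForCheckpointAdvance(
  getCheckpointNumber: () => Promise<number>, // hypothetical getter, e.g. backed by a node query
  { pollMs, timeoutMs }: WaitOptions,
): Promise<number> {
  const start = await getCheckpointNumber();
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const current = await getCheckpointNumber();
    if (current > start) {
      return current;
    }
    await new Promise(resolve => setTimeout(resolve, pollMs));
  }
  throw new Error(`Checkpoint number did not advance past ${start} within ${timeoutMs}ms`);
}
```

A test would call this after the warm-up slot and only then capture `initialSlot`, so the assertion window starts on a checkpointed parent.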

## Verification

- `yarn format end-to-end`
- `yarn build`
- `yarn workspace @aztec/end-to-end test:e2e
e2e_p2p/multiple_validators_sentinel.parallel.test.ts -t 'collects
attestations for all validators on a node'`
…est.ts` (#23568)

Fix web3signer e2e `e2e_multi_validator_node_key_store.test.ts` by
removing the minTxsPerBlock override so the pipelining preset can
publish empty checkpoints while txs arrive. Also anchors the test PXE to
the checkpointed chain tip to prevent checkpoint prunes from killing
sent txs.
Running ci.sh grind was failing with `sethostname: invalid argument`.
Codex attributed the failure to a long branch name, causing a long
instance name, which was too long for `sethostname`. Confirmed that
switching to a shorter branch name fixed the issue.

```
--- request build instance (SSH) ---
Requesting m6a.48xlarge spot instance (name: spl_fix-web3signer-pipelining-test_amd64_grind-test-cdfb13e6637062de) (type: m6a.48xlarge) (ami: ami-067627aa971a1dcbb) (bid: 8.3136)...
Waiting for instance id for spot request: sir-dvtzjepj...
Timeout waiting for spot request.
Requesting m6a.48xlarge on-demand instance (name: spl_fix-web3signer-pipelining-test_amd64_grind-test-cdfb13e6637062de) (type: m6a.48xlarge) (ami: ami-067627aa971a1dcbb) (bid: 8.3136)...
Instance id: i-0fd2be01d28ec47e5
Waiting for SSH at 13.58.96.227...
--- connect via SSH ---
Stdout is not a tty, running in background...
Host processes pinned to OS CPUs: 88-95,184-191
HOST: fetching EC2 metadata token...
HOST: metadata token acquired.
HOST: decoding credentials...
HOST: starting devbox container...
HOST: devbox container launched (pid=10513). Monitoring for spot termination...
HOST: preparing devbox (uid/gid, docker run)...
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: sethostname: invalid argument
```
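
For context, Linux limits hostnames to 64 bytes (`HOST_NAME_MAX`), and the instance name in the log above is 68 characters long. One way to cap such a name while keeping its unique trailing suffix is sketched below. `capHostname` is a hypothetical helper, not the script's actual fix, and it assumes ASCII names and that the hostname is derived from the instance name.

```typescript
const HOST_NAME_MAX = 64; // Linux hostname limit in bytes

// Truncates `name` to at most `max` characters, preserving the part after the last '-'
// (assumed here to be the unique hash suffix) so that capped names stay distinct.
function capHostname(name: string, max: number = HOST_NAME_MAX): string {
  if (name.length <= max) {
    return name;
  }
  const lastDash = name.lastIndexOf('-');
  const suffix = lastDash >= 0 ? name.slice(lastDash) : '';
  if (suffix.length >= max) {
    // Suffix alone does not fit; fall back to a plain prefix cut.
    return name.slice(0, max);
  }
  return name.slice(0, max - suffix.length) + suffix;
}
```

Keeping the suffix rather than the prefix is a design choice: a long branch name is what overflows, while the hash is what distinguishes concurrent instances.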
Fixes the invalid checkpoint descendant e2e timing by keeping sequencers
stopped until the test has selected adjacent target proposers, installed
listeners, applied malicious configs, and warped to the intended
pipelined build window.

This avoids applying malicious config to an earlier slot owned by the
same validator, which is what caused the CI run for PR #23502 to miss
the intended P1/P2 checkpoint pair.
…iple checkpoints` (#23590)

Summary:
- Scan for consecutive bad checkpoint slots whose prior pipelined target
slot is not owned by either intended bad proposer.
- Keep the malicious-config injection tied to the selected bad proposers
and remove the now-unnecessary non-null assertion.
- Add an inline comment documenting why the prior pipelined target slot
matters.

Why:
The test applies malicious checkpoint config while sequencers are
already running. With proposer pipelining, the previous target slot can
snapshot that config before the intended bad slots are built. If that
prior proposer is one of the intended bad proposers, the test may spend
the malicious config on the wrong checkpoint and stop validating the
intended two-checkpoint invalidation path. This mirrors the
slot-selection issue fixed for the invalid proposal slashing test, but
applies it to the consecutive checkpoint invalidation scenario.

Testing:
- yarn format end-to-end
- yarn build
- LOG_LEVEL="info; debug:sequencer,publisher,validator" yarn workspace
@aztec/end-to-end test:e2e
e2e_epochs/epochs_invalidate_block.parallel.test.ts -t "proposer
invalidates multiple checkpoints"
…ed_invalid_proposal` (#23589)

## Summary
- skip target slots in attested invalid proposal slashing when the
previous pipelined target slot has the same bad proposer
- log the previous pipelined target proposer while selecting the test
slot

## Why
CI run http://ci.aztec-labs.com/bf99262466eae1dd selected slot 21 for
the invalid checkpoint scenario, but the same bad proposer could first
run a prior pipelined slot and build only a partial checkpoint. That
left the test waiting for block-proposed events on the intended slot
that never arrived. Requiring the previous pipelined target slot to have
a different proposer keeps the malicious config from being consumed by
the wrong slot after the epoch warp.

## Testing
- yarn format end-to-end
- yarn build
- LOG_LEVEL='info; debug:sequencer,publisher,validator' yarn workspace
@aztec/end-to-end test:e2e
e2e_slashing/attested_invalid_proposal.test.ts
## Summary

Adds a zero fast-path to `toBufferBE`, the bigint→big-endian-buffer
conversion underlying `Fr.toBuffer()`. Field elements serialized in
protocol structs are overwhelmingly zero (kernel public inputs are
mostly fixed-size zero-padding), so short-circuiting the zero case
avoids a wasteful `bigint → hex string → Buffer.from(hex)` round-trip.

```ts
if (num === 0n) {
  return Buffer.alloc(width);
}
```
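
To show where the fast path sits, here is a self-contained sketch of a `toBufferBE`-style function built on the hex round-trip described above. It is not the library's exact code: the position of the negative-input check and the overflow check are assumptions made for illustration.

```typescript
// Converts a non-negative bigint to a big-endian Buffer of exactly `width` bytes.
function toBufferBE(num: bigint, width: number): Buffer {
  if (num < 0n) {
    throw new Error(`Cannot convert negative bigint ${num} to a buffer`);
  }
  // Zero fast-path: Buffer.alloc returns a zero-filled buffer, so no hex round-trip is needed.
  if (num === 0n) {
    return Buffer.alloc(width);
  }
  const hex = num.toString(16);
  // Assumption: reject values that do not fit rather than silently truncating.
  if (hex.length > width * 2) {
    throw new Error(`Value does not fit in ${width} bytes`);
  }
  // Slow path: left-pad the hex string to the full width and decode it.
  return Buffer.from(hex.padStart(width * 2, '0'), 'hex');
}
```

Both paths return the same bytes for zero, which is why the change can stay byte-identical while skipping the string work.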

## Why

Profiling `Tx.toBuffer()` showed it spends ~6.7ms almost entirely in
per-field `Fr.toBuffer()` across ~3900 fields, and **96% of those fields
are zero**. The scalar conversion is already near-optimal otherwise — a
64-bit-words variant (`writeBigUInt64BE`×4) is actually *slower* on real
(non-zero) field elements because V8's bigint shifts allocate.

Micro-benchmark of `toBufferBE` variants (width=32, correctness-checked
against current):

| variant | 96%-zero (real) | all-random (worst case) |
|---|---|---|
| current | 452 ns | 382 ns |
| 64-bit words | 215 ns | 503 ns (slower) |
| **zero fast-path** | **55 ns** | 387 ns (free) |

The fast-path is ~8× on the real workload and costs one `=== 0n` compare
on the worst case.

## Impact

End-to-end on `mockTx(42)`:

| | before | after |
|---|---|---|
| `tx.toBuffer()` total | 6.66 ms | 4.20 ms (−37%) |
| `data.toBuffer()` | 4.34 ms | 2.25 ms (−48%) |

`data.toBuffer()` (the kernel public inputs) is the production-relevant
figure: the mock serializes an uncompressed proof, whereas real txs
carry a compressed proof that serializes as a single blob. The benefit
applies to every `Fr.toBuffer()` / serialization path in the monorepo,
not just txs.

The remaining cost is structural — a Buffer is allocated per field and
then `Buffer.concat`'d across thousands of them. Eliminating that needs
a single-preallocated-buffer serializer; this change is the safe,
broadly-beneficial first step.

## Testing

`toBufferBE` previously had no direct unit tests; added coverage for the
zero path, big-endian left-padding, exact-width values, and the
negative-input throw. The conversion is otherwise byte-identical to
before.
Symlinking the `.codex` folders causes the Codex sandbox to fail, and
with it the apply_patch command. The fix is to remove the symlinks for
all the .codex folders and instead create actual folders whose contents
are symlinks. A pre-commit hook checks that all contents are symlinked.

> The issue is the tracked symlink:
>
>   yarn-project/.codex -> .claude
>
> The sandbox is trying to enforce
> /home/santiago/Projects/aztec-4/yarn-project/.codex as a read-only
> path, but yarn-project is also a writable root. Since .codex is a
> symlink inside that writable root, bubblewrap refuses to set up the
> sandbox:
>
>   Fatal error: cannot enforce sandbox read-only path .../.codex
>   because it crosses writable symlink .../.codex
>
> So apply_patch is not uniquely broken. I reproduced the same sandbox
> setup failure with simple sandboxed commands like pwd and ls. Commands
> that are already approved or explicitly escalated can still run
> because they bypass that sandbox path setup.

This issue had been introduced in #23400.
Fixes issue introduced in #23593.

Also fixes the content hash so that the tests run on any change to the
claude or codex folders; the stale hash is what let the test failure go
unnoticed in the PR where it was introduced.
…l` (#23604)

Instead of checking a range of slots, we only check the slot we're
interested in. This prevents any build errors that occurred before
things stabilized from interfering. For instance, the sequencer we stop
could cause the _next_ sequencer to miss their block. Looking only at
`sentinelSlot` removes this nondeterminism.
…23608)

Fixes flake in `proposer invalidates multiple checkpoints`
`e2e_epochs/epochs_invalidate_block.parallel.test.ts` test that caused a
timeout (see [this run](http://ci.aztec-labs.com/8b1c0f4ec6031f2b)). See
below for the Codex analysis and fix.

---

**Test Summary**
`proposer invalidates multiple checkpoints` verifies that two intended
bad checkpoints land with insufficient attestations, a later good
proposer invalidates the first bad checkpoint, and the chain then
progresses.

**Failed Run Error**
CI run `8b1c0f4ec6031f2b` timed out at Jest’s 600s limit. The failure
was not the shutdown L1 send error; that happened after the timeout
while teardown was interrupting pending work.

**Failed vs Successful Divergence**
First meaningful divergence: checkpoint 4 at slot 23.

Failed log: slot 23 published checkpoint 4 with only 1 attestation, then
archivers reported `Insufficient attestations ... actualAttestations:1`.
Successful log: slot 23 collected all 5 attestations before publishing
checkpoint 4, so the first intentionally bad checkpoints were later.

**Timeline**
Failed:
- `15:59:11` selected intended bad slots 25/26, applied bad config to
proposer `0x15...`
- `15:59:35` slot 23 job prepared by that same proposer
- `16:00:15` checkpoint 4 at slot 23 landed with 1 attestation
- repeated rollback/retry consumed enough time to hit Jest timeout

Successful:
- slot 23 checkpoint landed cleanly with 5 attestations
- intended bad checkpoints at slots 24/25 landed with 1 attestation
- checkpoint 5 was invalidated
- test completed successfully

**Hypothesis**
High confidence: the test’s bad-slot selection only excluded
`candidateSlot1 - 1` as a pre-bad pipelined target. In the failed run,
`candidateSlot1 - 2` was still unsnapshotted and owned by a bad
proposer, so applying malicious config leaked into slot 23.

**Evidence**
- Logs: failed run selected slots 25/26 but slot 23 later published with
1 attestation from the newly bad proposer.
- Source: pipelined checkpoint jobs snapshot sequencer config when the
target-slot job is created, so applying config while sequencers are
running can affect any not-yet-created pre-bad job.
- Skeptic check: no contradiction found; it also caught a broken local
timeout race.

**Proposed Fix**
Implemented in
[epochs_invalidate_block.parallel.test.ts](/home/santiago/Projects/aztec-1/yarn-project/end-to-end/src/e2e_epochs/epochs_invalidate_block.parallel.test.ts:393):
the selector now excludes bad proposers from every pre-bad target slot
from `currentSlot + 2` through `candidateSlot1 - 1`, not just the
immediately prior slot.
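
The widened exclusion rule can be sketched as a predicate over candidate slots. This is an illustration only: `isSafeCandidate` and `proposerAt` are hypothetical names, and the proposer addresses in the usage note are made up to mirror the failed run's shape (bad slots 25/26, leak into slot 23).

```typescript
type ProposerLookup = (slot: number) => string; // hypothetical: returns the proposer address for a slot

// Returns true when no pre-bad target slot, from currentSlot + 2 through candidateSlot1 - 1,
// is owned by either proposer of the two intended bad slots.
function isSafeCandidate(currentSlot: number, candidateSlot1: number, proposerAt: ProposerLookup): boolean {
  const badProposers = new Set([proposerAt(candidateSlot1), proposerAt(candidateSlot1 + 1)]);
  for (let slot = currentSlot + 2; slot <= candidateSlot1 - 1; slot++) {
    if (badProposers.has(proposerAt(slot))) {
      // A bad proposer would snapshot the malicious config for an earlier slot.
      return false;
    }
  }
  return true;
}
```

With `currentSlot = 21` and a schedule where slot 23 and slot 25 share a proposer, checking only slot 24 would accept the 25/26 pair, while this predicate rejects it because slot 23 falls inside the scanned range.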

Also fixed the broken timeout race at [line
475](/home/santiago/Projects/aztec-1/yarn-project/end-to-end/src/e2e_epochs/epochs_invalidate_block.parallel.test.ts:475)
by removing the accidental inner `await`.
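
The exact broken line is not shown here, but a common way an inner `await` breaks a timeout race is `await Promise.race([await work, timeout])`: the inner `await` blocks until `work` settles, so the timeout can never win. A minimal sketch of the working pattern, with the hypothetical helper name `withTimeout`:

```typescript
// Races `work` against a timer. Note that `work` is passed as a promise, not awaited first.
async function withTimeout<T>(work: Promise<T>, ms: number, message: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(message)), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    // Clear the timer so it neither fires later nor keeps the process alive.
    clearTimeout(timer);
  }
}
```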

@ludamad ludamad left a comment


🤖 Auto-approved

@AztecBot AztecBot enabled auto-merge May 28, 2026 03:04
@AztecBot

🤖 Auto-merge enabled after 4 hours of inactivity. This PR will be merged automatically once all checks pass.


6 participants