feat: merge-train/spartan#23580
Open
AztecBot wants to merge 25 commits into
…23502)

## Motivation

`archiver/src/modules/l1_synchronizer.ts` skipped checkpoints with insufficient/invalid attestations under the assumption that the next proposer would invalidate them before publishing. When that assumption was violated (i.e., proposer P2 published a valid-attestations checkpoint that extended P1's invalid one), the archiver hit `InitialCheckpointNumberNotSequentialError` in `block_store.addCheckpoints`, the catch handler rolled back the L1 sync point, and the next poll re-fetched the same range and re-threw. The archiver looped indefinitely.

The protocol already defines `OffenseType.PROPOSED_DESCENDANT_OF_CHECKPOINT_WITH_INVALID_ATTESTATIONS` for exactly this case, but the slasher couldn't see valid-attestations descendants because the archiver threw before emitting any event.

### Human Note

This is particularly relevant under pipelining. Attestors now attest to a checkpoint _before_ the previous one is pushed to L1, so they can inadvertently attest to a checkpoint built on top of one that became invalid because it was published to the rollup contract with wrong attestations. An honest attestor could therefore get slashed if the proposer was malicious.

## Approach

In the synchronizer, persist rejected ancestors in the block store keyed by archive root. On each new checkpoint, before attestation validation, compare its `header.lastArchiveRoot` against the persisted set; if it matches, skip the checkpoint as a descendant of an invalid ancestor and emit a new `L2BlockSourceEvents.CheckpointBuiltOnInvalidAncestorDetected` event with enough metadata to resolve the proposer. The slasher's `AttestationsBlockWatcher` is fixed to slash the proposer (not the attestors) under the new event.

Fixes A-1072
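The approach above can be sketched roughly as follows. This is a minimal in-memory illustration, not the archiver's code: `CheckpointLike`, `SyncEvent`, and `RejectedAncestorTracker` are hypothetical names, the `Set` stands in for the persisted block-store set, and recording a skipped descendant's own archive root as rejected is an assumption made so that later descendants are skipped too.

```ts
// Hypothetical, simplified shapes; the real archiver types differ.
interface CheckpointLike {
  number: number;
  archiveRoot: string; // archive root after this checkpoint
  lastArchiveRoot: string; // header.lastArchiveRoot, i.e. the parent's archive root
  proposer: string;
  attestationsValid: boolean; // stands in for the real attestation validation
}

type SyncEvent =
  | { type: 'accepted'; checkpoint: number }
  | { type: 'invalid-attestations'; checkpoint: number }
  | { type: 'built-on-invalid-ancestor'; checkpoint: number; proposer: string };

class RejectedAncestorTracker {
  // Stands in for the set persisted in the block store, keyed by archive root.
  private readonly rejectedRoots = new Set<string>();

  process(cp: CheckpointLike): SyncEvent {
    // Ancestry is checked before attestation validation.
    if (this.rejectedRoots.has(cp.lastArchiveRoot)) {
      // Assumption: the descendant is itself recorded so its children are skipped as well.
      this.rejectedRoots.add(cp.archiveRoot);
      return { type: 'built-on-invalid-ancestor', checkpoint: cp.number, proposer: cp.proposer };
    }
    if (!cp.attestationsValid) {
      this.rejectedRoots.add(cp.archiveRoot);
      return { type: 'invalid-attestations', checkpoint: cp.number };
    }
    return { type: 'accepted', checkpoint: cp.number };
  }
}
```

With this shape, P2's valid-attestations checkpoint on top of P1's rejected one yields an event carrying P2 as the proposer instead of an exception, which is what lets a watcher attribute the offense.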
## Summary

- Run `gcp_auth` before `setup_gcp_secrets` in `source_network_env` so EC2 benchmark jobs can read Secret Manager (e.g. `otel-collector-url`).
- Improve `setup_gcp_secrets.sh` diagnostics and activate the CI service account before secret fetches.
- Install Terraform on Linux in `install_deps.sh`; add `setup-terraform` on nightly wait jobs.
- Fix `deploy-network` checkout for pinned submodules (`fetch-depth: 0`, `lfs: true`).
- Checkout `github.sha` on the benchmark job so workflow_dispatch from a feature branch runs that branch on EC2 (not `next`).

Validated manually via Nightly Bench 10 TPS workflow_dispatch on this branch (run succeeded).

## Test plan

- [x] Nightly Bench 10 TPS workflow_dispatch from `spy/10tps-bench-terraform` (deploy, wait, benchmark)
## Summary

- Stabilizes the multiple-validator sentinel e2e by waiting for a post-warmup checkpoint before recording the assertion window.
- Reuses the same warm-up helper in the second test so isolated runs avoid the same fresh-network startup noise before stopping a validator.

## Failed run

Failed CI run: http://ci.aztec-labs.com/07fb31bc0706159f

The failing test was `e2e_p2p_multiple_validators_sentinel > collects attestations for all validators on a node`. The test expected no `attestation-missed` entries, but the assertion window started while the network was still in the first pipelined slots after startup. In the failed run, slot 8 was built on a pending, not-yet-checkpointed parent, so some remote validators could not validate/attest in time and the sentinel recorded a missed attestation.

## Fix

The test now waits for one warm-up slot and then waits for the observed checkpoint number to advance before capturing `initialSlot`. That keeps startup pipelining behavior out of the strict sentinel assertion window while preserving the test's actual coverage: once the network is past warm-up, every validator should be observed attesting or proposing as expected.

## Verification

- `yarn format end-to-end`
- `yarn build`
- `yarn workspace @aztec/end-to-end test:e2e e2e_p2p/multiple_validators_sentinel.parallel.test.ts -t 'collects attestations for all validators on a node'`
…est.ts` (#23568) Fix web3signer e2e `e2e_multi_validator_node_key_store.test.ts` by removing the minTxsPerBlock override so the pipelining preset can publish empty checkpoints while txs arrive. Also anchors the test PXE to the checkpointed chain tip to prevent checkpoint prunes from killing sent txs.
Running ci.sh grind was failing with `sethostname: invalid argument`. Codex attributed the failure to a long branch name, causing a long instance name, which was too long for `sethostname`. Confirmed that switching to a shorter branch name fixed the issue.

```
--- request build instance (SSH) ---
Requesting m6a.48xlarge spot instance (name: spl_fix-web3signer-pipelining-test_amd64_grind-test-cdfb13e6637062de) (type: m6a.48xlarge) (ami: ami-067627aa971a1dcbb) (bid: 8.3136)...
Waiting for instance id for spot request: sir-dvtzjepj...
Timeout waiting for spot request.
Requesting m6a.48xlarge on-demand instance (name: spl_fix-web3signer-pipelining-test_amd64_grind-test-cdfb13e6637062de) (type: m6a.48xlarge) (ami: ami-067627aa971a1dcbb) (bid: 8.3136)...
Instance id: i-0fd2be01d28ec47e5
Waiting for SSH at 13.58.96.227...
--- connect via SSH ---
Stdout is not a tty, running in background...
Host processes pinned to OS CPUs: 88-95,184-191
HOST: fetching EC2 metadata token...
HOST: metadata token acquired.
HOST: decoding credentials...
HOST: starting devbox container...
HOST: devbox container launched (pid=10513). Monitoring for spot termination...
HOST: preparing devbox (uid/gid, docker run)...
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: sethostname: invalid argument
```
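For illustration only, here is one way a name could be capped before it is used as a hostname. This is a sketch, not the fix that landed in the PR (which is not shown here): `capHostname` is a hypothetical helper, the 64-byte limit is Linux's `HOST_NAME_MAX`, and keeping the trailing `-<hash>` segment so capped names stay unique is an assumption.

```ts
// Linux rejects sethostname() calls with names longer than 64 bytes.
const HOST_NAME_MAX = 64;

// Hypothetical helper: truncate the head of the name and keep the trailing
// "-<suffix>" segment (assumed to be the unique part of the name).
function capHostname(name: string, max: number = HOST_NAME_MAX): string {
  if (name.length <= max) {
    return name;
  }
  const lastDash = name.lastIndexOf('-');
  const suffix = lastDash >= 0 ? name.slice(lastDash) : '';
  if (suffix.length >= max) {
    // Suffix alone does not fit: fall back to a plain truncation.
    return name.slice(0, max);
  }
  return name.slice(0, max - suffix.length) + suffix;
}
```

Applied to the instance name in the log above, the result fits within the limit while still ending in `-cdfb13e6637062de`.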
Fixes the invalid checkpoint descendant e2e timing by keeping sequencers stopped until the test has selected adjacent target proposers, installed listeners, applied malicious configs, and warped to the intended pipelined build window. This avoids applying malicious config to an earlier slot owned by the same validator, which is what caused the CI run for PR #23502 to miss the intended P1/P2 checkpoint pair.
…iple checkpoints` (#23590)

Summary:

- Scan for consecutive bad checkpoint slots whose prior pipelined target slot is not owned by either intended bad proposer.
- Keep the malicious-config injection tied to the selected bad proposers and remove the now-unnecessary non-null assertion.
- Add an inline comment documenting why the prior pipelined target slot matters.

Why:

The test applies malicious checkpoint config while sequencers are already running. With proposer pipelining, the previous target slot can snapshot that config before the intended bad slots are built. If that prior proposer is one of the intended bad proposers, the test may spend the malicious config on the wrong checkpoint and stop validating the intended two-checkpoint invalidation path.

This mirrors the slot-selection issue fixed for the invalid proposal slashing test, but applies it to the consecutive checkpoint invalidation scenario.

Testing:

- yarn format end-to-end
- yarn build
- LOG_LEVEL="info; debug:sequencer,publisher,validator" yarn workspace @aztec/end-to-end test:e2e e2e_epochs/epochs_invalidate_block.parallel.test.ts -t "proposer invalidates multiple checkpoints"
…ed_invalid_proposal` (#23589)

## Summary

- skip target slots in attested invalid proposal slashing when the previous pipelined target slot has the same bad proposer
- log the previous pipelined target proposer while selecting the test slot

## Why

CI run http://ci.aztec-labs.com/bf99262466eae1dd selected slot 21 for the invalid checkpoint scenario, but the same bad proposer could first run a prior pipelined slot and build only a partial checkpoint. That left the test waiting for block-proposed events on the intended slot that never arrived.

Requiring the previous pipelined target slot to have a different proposer keeps the malicious config from being consumed by the wrong slot after the epoch warp.

## Testing

- yarn format end-to-end
- yarn build
- LOG_LEVEL='info; debug:sequencer,publisher,validator' yarn workspace @aztec/end-to-end test:e2e e2e_slashing/attested_invalid_proposal.test.ts
## Summary
Adds a zero fast-path to `toBufferBE`, the bigint→big-endian-buffer
conversion underlying `Fr.toBuffer()`. Field elements serialized in
protocol structs are overwhelmingly zero (kernel public inputs are
mostly fixed-size zero-padding), so short-circuiting the zero case
avoids a wasteful `bigint → hex string → Buffer.from(hex)` round-trip.
```ts
if (num === 0n) {
  return Buffer.alloc(width);
}
```
## Why
Profiling `Tx.toBuffer()` showed it spends ~6.7ms almost entirely in
per-field `Fr.toBuffer()` across ~3900 fields, and **96% of those fields
are zero**. The scalar conversion is already near-optimal otherwise — a
64-bit-words variant (`writeBigUInt64BE`×4) is actually *slower* on real
(non-zero) field elements because V8's bigint shifts allocate.
Micro-benchmark of `toBufferBE` variants (width=32, correctness-checked
against current):
| variant | 96%-zero (real) | all-random (worst case) |
|---|---|---|
| current | 452 ns | 382 ns |
| 64-bit words | 215 ns | 503 ns (slower) |
| **zero fast-path** | **55 ns** | 387 ns (free) |
The fast-path is ~8× on the real workload and costs one `=== 0n` compare
on the worst case.
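To show where the fast-path sits, here is a minimal sketch of a hex-based `toBufferBE` with the zero check in front. It is not the monorepo's exact implementation: the error messages and the choice to throw when the value does not fit in `width` bytes are assumptions.

```ts
// Sketch of a bigint -> big-endian Buffer conversion with a zero fast-path.
function toBufferBE(num: bigint, width: number): Buffer {
  if (num < 0n) {
    throw new Error(`Cannot convert negative bigint ${num} to a buffer`);
  }
  // Fast path: Buffer.alloc returns a zero-filled buffer, so no hex round-trip is needed.
  if (num === 0n) {
    return Buffer.alloc(width);
  }
  // Slow path: bigint -> hex string -> Buffer, left-padded to the requested width.
  const hex = num.toString(16).padStart(width * 2, '0');
  if (hex.length > width * 2) {
    // Assumption: overflow is treated as an error in this sketch.
    throw new Error(`Value does not fit in ${width} bytes`);
  }
  return Buffer.from(hex, 'hex');
}
```

The zero branch returns the same bytes the slow path would produce for zero, so callers see no difference in output.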
## Impact
End-to-end on `mockTx(42)`:
| | before | after |
|---|---|---|
| `tx.toBuffer()` total | 6.66 ms | 4.20 ms (−37%) |
| `data.toBuffer()` | 4.34 ms | 2.25 ms (−48%) |
`data.toBuffer()` (the kernel public inputs) is the production-relevant
figure: the mock serializes an uncompressed proof, whereas real txs
carry a compressed proof that serializes as a single blob. The benefit
applies to every `Fr.toBuffer()` / serialization path in the monorepo,
not just txs.
The remaining cost is structural — a Buffer is allocated per field and
then `Buffer.concat`'d across thousands of them. Eliminating that needs
a single-preallocated-buffer serializer; this change is the safe,
broadly-beneficial first step.
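As a rough illustration of the single-preallocated-buffer idea (not something this change implements), a serializer could allocate one zero-filled buffer for all fields and write only the non-zero ones in place. `serializeFields` is a hypothetical name and the fixed per-field width is an assumption; no performance numbers are claimed for it.

```ts
// Hypothetical sketch: serialize many field elements into one preallocated buffer.
function serializeFields(fields: bigint[], width: number = 32): Buffer {
  // One allocation for the whole output; Buffer.alloc zero-fills it.
  const out = Buffer.alloc(fields.length * width);
  fields.forEach((field, i) => {
    if (field < 0n) {
      throw new Error(`Cannot serialize negative field ${field}`);
    }
    if (field === 0n) {
      return; // Slot is already zero, nothing to write.
    }
    const hex = field.toString(16).padStart(width * 2, '0');
    if (hex.length > width * 2) {
      throw new Error(`Field does not fit in ${width} bytes`);
    }
    // Write the big-endian bytes directly at this field's offset.
    out.write(hex, i * width, width, 'hex');
  });
  return out;
}
```

This avoids the per-field Buffer allocation and the final `Buffer.concat`, at the cost of requiring the total size to be known up front.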
## Testing
`toBufferBE` previously had no direct unit tests; added coverage for the
zero path, big-endian left-padding, exact-width values, and the
negative-input throw. The conversion is otherwise byte-identical to
before.
Fix A-1109
This causes the Codex sandbox to fail and the apply_patch command to fail. The fix is to remove the symlinks for all the .codex folders, and instead create actual folders with symlinks in their contents. A pre-commit hook checks that all contents are symlinked.

> The issue is the tracked symlink:
>
> yarn-project/.codex -> .claude
>
> The sandbox is trying to enforce /home/santiago/Projects/aztec-4/yarn-project/.codex as a read-only
> path, but yarn-project is also a writable root. Since .codex is a symlink inside that writable root,
> bubblewrap refuses to set up the sandbox:
>
> Fatal error: cannot enforce sandbox read-only path .../.codex
> because it crosses writable symlink .../.codex
>
> So apply_patch is not uniquely broken. I reproduced the same sandbox setup failure with simple
> sandboxed commands like pwd and ls. Commands that are already approved or explicitly escalated can
> still run because they bypass that sandbox path setup.

This issue had been introduced in #23400.
Fixes the issue introduced in #23593. Also fixes the content hash so that the tests run on any change to the claude or codex folders; the incorrect hash is what caused the test failure to go unnoticed in the PR where it was introduced.
…l` (#23604) Instead of checking a range of slots, we only check the slot we're interested in. This prevents any build errors that occurred before things stabilized from interfering. For instance, the sequencer we stop could cause the _next_ sequencer to miss its block. Looking only at the `sentinelSlot` removes this nondeterminism.
…23608) Fixes flake in `proposer invalidates multiple checkpoints` `e2e_epochs/epochs_invalidate_block.parallel.test.ts` test that caused a timeout (see [this run](http://ci.aztec-labs.com/8b1c0f4ec6031f2b)). See below for the Codex analysis and fix.

---

**Test Summary**

`proposer invalidates multiple checkpoints` verifies that two intended bad checkpoints land with insufficient attestations, a later good proposer invalidates the first bad checkpoint, and the chain then progresses.

**Failed Run Error**

CI run `8b1c0f4ec6031f2b` timed out at Jest’s 600s limit. The failure was not the shutdown L1 send error; that happened after the timeout while teardown was interrupting pending work.

**Failed vs Successful Divergence**

First meaningful divergence: checkpoint 4 at slot 23.

Failed log: slot 23 published checkpoint 4 with only 1 attestation, then archivers reported `Insufficient attestations ... actualAttestations:1`.

Successful log: slot 23 collected all 5 attestations before publishing checkpoint 4, so the first intentionally bad checkpoints were later.

**Timeline**

Failed:

- `15:59:11` selected intended bad slots 25/26, applied bad config to proposer `0x15...`
- `15:59:35` slot 23 job prepared by that same proposer
- `16:00:15` checkpoint 4 at slot 23 landed with 1 attestation
- repeated rollback/retry consumed enough time to hit Jest timeout

Successful:

- slot 23 checkpoint landed cleanly with 5 attestations
- intended bad checkpoints at slots 24/25 landed with 1 attestation
- checkpoint 5 was invalidated
- test completed successfully

**Hypothesis**

High confidence: the test’s bad-slot selection only excluded `candidateSlot1 - 1` as a pre-bad pipelined target. In the failed run, `candidateSlot1 - 2` was still unsnapshotted and owned by a bad proposer, so applying malicious config leaked into slot 23.

**Evidence**

- Logs: failed run selected slots 25/26 but slot 23 later published with 1 attestation from the newly bad proposer.
- Source: pipelined checkpoint jobs snapshot sequencer config when the target-slot job is created, so applying config while sequencers are running can affect any not-yet-created pre-bad job.
- Skeptic check: no contradiction found; it also caught a broken local timeout race.

**Proposed Fix**

Implemented in [epochs_invalidate_block.parallel.test.ts](/home/santiago/Projects/aztec-1/yarn-project/end-to-end/src/e2e_epochs/epochs_invalidate_block.parallel.test.ts:393): the selector now excludes bad proposers from every pre-bad target slot from `currentSlot + 2` through `candidateSlot1 - 1`, not just the immediately prior slot.

Also fixed the broken timeout race at [line 475](/home/santiago/Projects/aztec-1/yarn-project/end-to-end/src/e2e_epochs/epochs_invalidate_block.parallel.test.ts:475) by removing the accidental inner `await`.
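The selection rule described in the proposed fix can be sketched as below. This is an illustration under assumptions, not the test's code: `selectBadSlots`, `firstCandidate`, `lastCandidate`, and `proposerAt` are hypothetical names, and proposers are plain strings.

```ts
// Hypothetical sketch: pick two consecutive "bad" slots such that no pre-bad
// target slot (currentSlot + 2 .. candidateSlot1 - 1) is owned by a bad proposer.
function selectBadSlots(
  currentSlot: number,
  firstCandidate: number, // earliest slot the test is willing to target (assumption)
  lastCandidate: number, // last slot to consider (assumption)
  proposerAt: (slot: number) => string,
): { slot1: number; slot2: number } | undefined {
  const firstPreBad = currentSlot + 2;
  for (let candidateSlot1 = Math.max(firstCandidate, firstPreBad); candidateSlot1 <= lastCandidate; candidateSlot1++) {
    const badProposers = new Set([proposerAt(candidateSlot1), proposerAt(candidateSlot1 + 1)]);
    let leaks = false;
    // Any not-yet-created pre-bad job could snapshot the malicious config.
    for (let slot = firstPreBad; slot < candidateSlot1; slot++) {
      if (badProposers.has(proposerAt(slot))) {
        leaks = true;
        break;
      }
    }
    if (!leaks) {
      return { slot1: candidateSlot1, slot2: candidateSlot1 + 1 };
    }
  }
  return undefined;
}
```

In a schedule shaped like the failed run, where the proposer of slot 25 also owns slot 23, this rule rejects 25/26 and moves on to a later pair, whereas checking only `candidateSlot1 - 1` would have accepted it.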
🤖 Auto-merge enabled after 4 hours of inactivity. This PR will be merged automatically once all checks pass.
BEGIN_COMMIT_OVERRIDE
fix(archiver): skip descendants of invalid-attestations checkpoints (#23502)
chore: scale network validators (#23579)
fix(ci): nightly 10 TPS bench GCP auth and checkout (#23586)
chore: set eth node resource profile (#23583)
fix: wait for checkpoint before sentinel assertions (#23573)
fix: slash attestations for invalid checkpoint proposals (#23506)
test: fix web3signer pipelining `e2e_multi_validator_node_key_store.test.ts` (#23568)
fix: cap CI devbox hostname (#23591)
test: stabilize invalid checkpoint descendant e2e (#23582)
test(e2e): stabilize invalidation slots in `proposer invalidates multiple checkpoints` (#23590)
test(e2e): stabilize invalid proposal slashing target slot in `attested_invalid_proposal` (#23589)
chore(foundation): faster toBufferBE via zero fast-path (#23592)
fix: honour BB_BINARY_PATH (#23570)
chore: bump reth and lighthouse (#23588)
chore: add web3signer and postgres node selectors (#23598)
fix: do not symlink .codex folders (#23593)
chore: fix claude and codex symlinking tests (#23599)
test(e2e): narrow down sentinel check in `multiple_validators_sentinel` (#23604)
test(e2e): fix `proposer invalidates multiple checkpoints` timeout (#23608)
END_COMMIT_OVERRIDE