feat(ci3): run uploadable benchmarks on a dedicated on-demand instance #24028
Merged
ludamad
approved these changes
Jun 11, 2026
> [!IMPORTANT]
> Depends on the IAM change aztec-labs-eng/iac#6 (grants `ci3-build-instance-role` the launch/SSM/PassRole surface). **That must apply first**, else the build instance's `create-fleet` hits `UnauthorizedOperation`.

## Problem

Spot diversification (create-fleet) means build instances now land on variable EC2 types: m6a/m7a/m6i/r6a/r7a at 16/32/48xlarge, AMD vs Intel. The in-build benchmark phase runs on that box, so wall-time numbers vary by hardware family far more than the 105% regression alert threshold → false regressions. (The instance type isn't even recorded in the bench JSON.)

## Approach

Only the canonical **merge-queue→next** series (the one used for real regression tracking) runs benches on a **dedicated, fixed, on-demand m6a.16xlarge**. PR `ci-full` runs keep running benches inline on the contended build box purely as a **breakage check**: no dedicated box, no upload.

Benches are scheduled by the existing test engine: when the build completes in `build_and_test` (full builds only),

- **upload runs** (`SHOULD_UPLOAD_BENCHMARKS=1`): launch the dedicated box via `./ci.sh bench` as a backgrounded, colored, denoised job (logged like the test engine) and `wait` on it (non-fatal) before returning;
- **otherwise**: `bench_cmds >> $test_cmds_file`, so benches become ordinary test commands.

`ci.sh bench` → `bootstrap_ec2` blocks until the remote `ci-bench` finishes (ending in `cache_upload bench-<treehash>`), so the `wait` is the whole rendezvous. Results reach the GA `Upload benchmarks` step unchanged via that cache key (`ci3_success.sh` `gh-bench`).

## Changes

- **`bootstrap.sh`**: drop inline `bench` from `ci-full`/`ci-full-no-test-cache`; add the `build_and_test` launch/append hook + non-fatal `wait`; new `ci-bench` mode = cache-hit `make full` + `bench` (no test engine).
- **`ci.sh`**: new `bench` launcher with `AWS_INSTANCE=m6a.16xlarge NO_SPOT=1` (pins a fixed on-demand type; `CPUS` not needed since `AWS_INSTANCE` bypasses pool sizing).
- **`ci3/bench_engine`**: drop the 8-core OS isolation / HT-disable / pinning. Dedicated box → benches use the full machine, honouring per-bench `CPUS` via the strict scheduler (defaults to `nproc/2` without `BENCH_CPU_COUNT`). This is what lets the 64-vCPU 16xlarge satisfy the `CPUS=32` bb rollup bench.
- **`.github/ci3_labels_to_env.sh`**: scope `SHOULD_UPLOAD_BENCHMARKS` to merge-queue→next (it now also gates the dedicated box).
- **`ci3/bootstrap_ec2`**: pass it through to the instance.

## Notes

- **One-time baseline shift** in `bench/next`: different machine + no isolation changes absolute numbers once; stable thereafter. May want to annotate the series.
- **Soft failure**: a bench-box failure is logged and the run proceeds (no fresh numbers) rather than blocking the merge.
- **PR benches-as-tests**: `:PARALLEL=0` serial benches lose one-at-a-time isolation and run contended. That is fine for breakage-only; real numbers come from the dedicated box's `bench_engine` path.
- **Validated**: all touched scripts pass `bash -n`; the `AWS_INSTANCE`+`NO_SPOT` fixed-on-demand launch mechanism was verified live during the create-fleet work. Full e2e is exercised by a merge-queue→next run once the iac PR lands.
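
For reviewers, a rough sketch of the launch/append hook and the non-fatal `wait` described under Approach. This is not the code in `bootstrap.sh`: `schedule_benches`, `launch_dedicated_bench`, the `BENCH_LAUNCHER` variable, and the stub `bench_cmds` output are hypothetical stand-ins so the control flow can run on its own.

```shell
# Illustrative only: the function names and BENCH_LAUNCHER are assumptions,
# not identifiers from the repository.

launch_dedicated_bench() {
  # Stand-in for `./ci.sh bench`, which blocks until the remote run finishes.
  # BENCH_LAUNCHER lets the sketch simulate success (`true`) or failure (`false`).
  "${BENCH_LAUNCHER:-true}"
}

bench_cmds() {
  # Stand-in for the real bench command generator.
  printf '%s\n' "bench-a" "bench-b"
}

schedule_benches() {
  local test_cmds_file="$1"
  if [ "${SHOULD_UPLOAD_BENCHMARKS:-0}" = "1" ]; then
    # Upload run: start the dedicated box in the background.
    launch_dedicated_bench &
    local bench_pid=$!
    # ... the test engine would run here while the bench box works ...
    # Non-fatal wait: a failed bench box is logged, never propagated.
    if ! wait "$bench_pid"; then
      echo "WARNING: dedicated bench run failed; continuing without fresh numbers" >&2
    fi
  else
    # Breakage check: benches become ordinary test commands.
    bench_cmds >> "$test_cmds_file"
  fi
  return 0
}
```

Calling `schedule_benches "$test_cmds_file"` with `SHOULD_UPLOAD_BENCHMARKS` unset appends the bench commands to the file; with `SHOULD_UPLOAD_BENCHMARKS=1` it leaves the file untouched and returns 0 even when the launcher fails, which is the soft-failure behaviour from the Notes.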
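
The CPU arithmetic behind the `ci3/bench_engine` change can be sketched as below. This assumes the `nproc/2` default is the scheduler's total CPU budget when `BENCH_CPU_COUNT` is unset, which is one reading of the description rather than something confirmed against the script; `bench_cpu_budget` and `bench_fits` are hypothetical helper names.

```shell
# Illustrative only: helper names are assumptions, and the interpretation of
# the nproc/2 default as a total CPU budget is a guess from the PR text.

bench_cpu_budget() {
  # $1: total vCPUs on the box (falls back to `nproc` when omitted).
  local total="${1:-$(nproc)}"
  # BENCH_CPU_COUNT, if set, overrides the half-the-machine default.
  echo "${BENCH_CPU_COUNT:-$((total / 2))}"
}

bench_fits() {
  # $1: CPUS requested by one bench, $2: available budget.
  [ "$1" -le "$2" ]
}
```

Under that assumption, `bench_cpu_budget 64` yields 32 on the 64-vCPU m6a.16xlarge, so a `CPUS=32` bench fits, while a budget of 8 (the size of the old isolated core set) would not admit it.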