You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
feat(ci3): run uploadable benchmarks on a dedicated on-demand instance (#24028)
> [!IMPORTANT]
> Depends on the IAM change aztec-labs-eng/iac#6 (grants
`ci3-build-instance-role` the launch/SSM/PassRole surface). **That must
apply first**, else the build instance's `create-fleet` hits
`UnauthorizedOperation`.
## Problem
Spot diversification (create-fleet) means build instances now land on
variable EC2 types — m6a/m7a/m6i/r6a/r7a at 16/32/48xlarge, AMD vs
Intel. The in-build benchmark phase runs on that box, so wall-time
numbers vary by hardware family far more than the 105% regression alert
threshold → false regressions. (The instance type isn't even recorded in
the bench JSON.)
## Approach
Only the canonical **merge-queue→next** series (the one used for real
regression tracking) runs benches on a **dedicated, fixed, on-demand
m6a.16xlarge**. PR `ci-full` runs keep running benches inline on the
contended build box purely as a **breakage check** — no dedicated box,
no upload.
Benches are scheduled by the existing test engine: when the build
completes in `build_and_test` (full builds only),
- **upload runs** (`SHOULD_UPLOAD_BENCHMARKS=1`): launch the dedicated
box via `./ci.sh bench` as a backgrounded, colored, denoised job (logged
like the test engine) and `wait` on it (non-fatal) before returning;
- **otherwise**: `bench_cmds >> $test_cmds_file` — benches become
ordinary test commands.
`ci.sh bench` → `bootstrap_ec2` blocks until the remote `ci-bench`
finishes (ending in `cache_upload bench-<treehash>`), so the `wait` is
the whole rendezvous. Results reach the GA `Upload benchmarks` step
unchanged via that cache key (`ci3_success.sh` `gh-bench`).
## Changes
- **`bootstrap.sh`**: drop inline `bench` from
`ci-full`/`ci-full-no-test-cache`; add the `build_and_test`
launch/append hook + non-fatal `wait`; new `ci-bench` mode = cache-hit
`make full` + `bench` (no test engine).
- **`ci.sh`**: new `bench` launcher — `AWS_INSTANCE=m6a.16xlarge
NO_SPOT=1` (pins a fixed on-demand type; `CPUS` not needed since
`AWS_INSTANCE` bypasses pool sizing).
- **`ci3/bench_engine`**: drop the 8-core OS isolation / HT-disable /
pinning. Dedicated box → benches use the full machine, honouring
per-bench `CPUS` via the strict scheduler (defaults to `nproc/2` without
`BENCH_CPU_COUNT`). This is what lets the 64-vCPU 16xlarge satisfy the
`CPUS=32` bb rollup bench.
- **`.github/ci3_labels_to_env.sh`**: scope `SHOULD_UPLOAD_BENCHMARKS`
to merge-queue→next (it now also gates the dedicated box).
**`ci3/bootstrap_ec2`**: pass it through to the instance.
## Notes
- **One-time baseline shift** in `bench/next`: different machine + no
isolation changes absolute numbers once; stable thereafter. May want to
annotate the series.
- **Soft failure**: a bench-box failure is logged and the run proceeds
(no fresh numbers) rather than blocking the merge.
- **PR benches-as-tests**: `:PARALLEL=0` serial benches lose
one-at-a-time isolation and run contended — fine for breakage-only; real
numbers come from the dedicated box's `bench_engine` path.
- Validated: all touched scripts pass `bash -n`; the
`AWS_INSTANCE`+`NO_SPOT` fixed-on-demand launch mechanism was verified
live during the create-fleet work. Full e2e is exercised by a
merge-queue→next run once the iac PR lands.
echo_cmd "chonk-input-update""Spin up an EC2 instance to update pinned Chonk IVC inputs and push the diff."
40
40
echo_cmd "release""Spin up an EC2 instance and run bootstrap release."
41
41
echo_cmd "shell-new""Spin up an EC2 instance, clone the repo, and drop into a shell."
42
-
echo_cmd "shell""Drop into a shell in the current running build instance container."
43
-
echo_cmd "shell-host""Drop into a shell in the current running build host."
42
+
echo_cmd "shell-container""Shell into a running build container. Optional filter tokens (e.g. 'pr-123 bench') select the instance; defaults to the current branch."
43
+
echo_cmd "shell-host""Shell into a running build host. Same instance selection as shell-container."
44
44
echo_cmd "log""Display the log of the given log ID."
45
-
echo_cmd "kill""Terminate running EC2 instance with instance_name."
45
+
echo_cmd "kill""Terminate running build instances matching the filter tokens (default: current branch)."
46
46
echo_cmd "draft""Mark the current PR as draft (no automatic CI runs when pushing)."
47
47
echo_cmd "ready""Mark the current PR as ready (enable automatic CI runs when pushing)."
48
48
echo_cmd "pr-url""Print the URL of the current PR associated with the branch."
@@ -53,27 +53,69 @@ function print_usage {
53
53
54
54
[ -n"$cmd" ] &&shift
55
55
56
-
# Keep this in sync with bootstrap_ec2's instance_name scheme (repo-scoped) so the
57
-
# shell/kill/get-ip helpers find instances launched by a CI run for this repo.
0 commit comments