
Commit fa06339

fix(ci): refresh grind launcher checkout to origin/next before launching (#24039)
## Problem

The dashboard `grind` option always fails to SSH into the build instance:

```
Waiting for SSH at 3.144.255.68...
Timeout: SSH could not login to 3.144.255.68 within 60 seconds.
```

The instance launches fine (spot/on-demand fulfilled, IP assigned) but SSH never connects, so grind cycles through every instance type and gives up.

## Root cause

CI build boxes were migrated from SSH to **SSM**. In `ci3/bootstrap_ec2` the default is now `CI_USE_SSH=0` (SSM); only `shell-new` forces SSH, and `grind-test` does not. So on current `next`, grind runs over SSM like the rest of CI.

But the dashboard launches grind from a long-lived checkout at `REPO_PATH` (the `/grind` handler in `rk.py` shells out to `cd $REPO_PATH && ./ci.sh grind-test ...`). That checkout had drifted to a pre-SSM commit, so grind alone still took the legacy SSH branch, launching into the retired SSH security group + `build-instance` key pair, whose port-22 / key-injection preconditions were torn down during the SSM lockdown. The stale checkout also explains the old AMI (`ami-09d27244b23be8891`) in the logs vs. current `next`'s `ami-067627aa971a1dcbb`.

Nothing kept `REPO_PATH` current: the `ci3-dashboard-deploy.yml` workflow only rebuilds the `rkapp` Flask container (and is path-filtered to `ci3/dashboard/**`), so changes to the `ci3/` launcher scripts never refreshed it.

## Fix

Refresh the launcher checkout to `origin/next` at grind launch time, before shelling out. This is self-healing and independent of deploys. It matches the existing design where the launcher always runs current-`next` orchestration scripts while the grind *target commit* is checked out on the remote box, so this does **not** restrict which branch/commit you can grind.

If the refresh fails (e.g. transient network), the error is surfaced in the run log instead of silently grinding on a stale tree.

## Testing

`python3 -m py_compile ci3/dashboard/rk.py` passes. The behavior change is host-side (requires the dashboard's `REPO_PATH` checkout) and can't be exercised in unit CI; it will take effect on the next dashboard deploy. The immediate one-time unblock is still to refresh `REPO_PATH` on `ci.aztec-labs.com` and restart `rkapp`.

---

*Created by [claudebox](https://claudebox.work/v2/sessions/1c05a513cb601b21) · group: `slackbot`*
2 parents e26c0c1 + 3b462ff commit fa06339

1 file changed

Lines changed: 20 additions & 0 deletions

ci3/dashboard/rk.py

@@ -556,6 +556,26 @@ def make_options(param_name, options, current_value, suffix=''):
     # Dashboard server needs local repo checkout at REPO_PATH
     repo_path = os.environ.get('REPO_PATH')
     if repo_path:
+        # Refresh the launcher checkout to current origin/next before launching.
+        # REPO_PATH only supplies the orchestration scripts (ci.sh/bootstrap_ec2);
+        # the grind target commit is checked out on the remote box. The launcher
+        # must stay current so grind uses the same transport (SSM) as the rest of
+        # CI -- a drifted checkout silently falls back to the retired SSH path and
+        # every instance times out waiting for SSH.
+        refresh = subprocess.run(
+            ['git', '-C', repo_path, 'fetch', '--quiet', 'origin', 'next'],
+            stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
+        )
+        if refresh.returncode == 0:
+            refresh = subprocess.run(
+                ['git', '-C', repo_path, 'checkout', '--quiet', '--force', 'origin/next'],
+                stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
+            )
+        if refresh.returncode != 0:
+            r.setex(run_id, 86400,
+                    f'Failed to refresh launcher checkout at {repo_path}:\n{refresh.stdout}\n'.encode())
+            return redirect(f'/{run_id}')
+
         subprocess.Popen(
             ['bash', '-c', f'cd {repo_path} && RUN_ID={run_id} CPUS={cpus} ./ci.sh grind-test {shlex.quote(full_cmd)} {grind_time} {jobs_pct} {memsuspend_pct} {commit}'],
             stdout=subprocess.DEVNULL,
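
Since the commit notes that the host-side refresh can't be exercised in unit CI, one possible way to make the fetch-then-checkout step testable is to factor it into a helper with an injectable runner. The sketch below is not part of this commit: `refresh_checkout` and its `run` parameter are hypothetical names, and the git invocations are copied from the hunk above.

```python
import subprocess
from typing import Callable, Optional


def refresh_checkout(
    repo_path: str,
    ref: str = "next",
    remote: str = "origin",
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> Optional[str]:
    """Fetch remote/ref and force-checkout it in repo_path.

    Returns None on success, or the combined git output of the failing step.
    `run` is a hypothetical injection point so tests can stub out git.
    """
    steps = (
        ["git", "-C", repo_path, "fetch", "--quiet", remote, ref],
        ["git", "-C", repo_path, "checkout", "--quiet", "--force", f"{remote}/{ref}"],
    )
    for argv in steps:
        result = run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr so one string holds the error
            text=True,
        )
        if result.returncode != 0:
            # Stop at the first failure: a failed fetch must not be followed
            # by a checkout of a stale remote-tracking ref.
            return result.stdout or f"git {argv[3]} exited with {result.returncode}"
    return None
```

Under these assumptions, the handler would call `refresh_checkout(repo_path)` and, on a non-`None` result, write that message to the run log and redirect, as the hunk does inline. A test can then pass a stub `run` that returns a `subprocess.CompletedProcess` with a chosen return code, without touching a real checkout.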
