Commit fa06339
fix(ci): refresh grind launcher checkout to origin/next before launching (#24039)
## Problem
The dashboard `grind` option always fails to SSH into the build
instance:
```
Waiting for SSH at 3.144.255.68...
Timeout: SSH could not login to 3.144.255.68 within 60 seconds.
```
The instance launches fine (spot/on-demand fulfilled, IP assigned) but
SSH never connects, so grind cycles through every instance type and
gives up.
## Root cause
CI build boxes were migrated from SSH to **SSM**. In `ci3/bootstrap_ec2`
the default is now `CI_USE_SSH=0` (SSM); only `shell-new` forces SSH,
and `grind-test` does not. So on current `next`, grind runs over SSM
like the rest of CI.
But the dashboard launches grind from a long-lived checkout at
`REPO_PATH` (the `/grind` handler in `rk.py` shells out to `cd
$REPO_PATH && ./ci.sh grind-test ...`). That checkout had drifted to a
pre-SSM commit, so grind alone still took the legacy SSH branch —
launching into the retired SSH security group + `build-instance` key
pair, whose port-22 / key-injection preconditions were torn down during
the SSM lockdown. The stale checkout also explains the old AMI
(`ami-09d27244b23be8891`) in the logs vs. current `next`'s
`ami-067627aa971a1dcbb`.
Nothing kept `REPO_PATH` current: the `ci3-dashboard-deploy.yml`
workflow only rebuilds the `rkapp` Flask container (and is path-filtered
to `ci3/dashboard/**`), so changes to the `ci3/` launcher scripts never
refreshed it.
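
To illustrate why launcher-script changes never reached the dashboard host, here is a rough sketch of the path filter's effect. It approximates the `ci3/dashboard/**` filter with a plain prefix check, which is a simplification of the real workflow glob semantics, and the function name `triggers_dashboard_deploy` is hypothetical:

```python
from typing import Iterable


def triggers_dashboard_deploy(
    changed_paths: Iterable[str], prefix: str = "ci3/dashboard/"
) -> bool:
    """Approximate the `ci3/dashboard/**` path filter with a prefix test.

    Assumption: a prefix check stands in for the workflow's glob matching.
    """
    return any(path.startswith(prefix) for path in changed_paths)


# A dashboard change triggers the deploy workflow...
print(triggers_dashboard_deploy(["ci3/dashboard/rk.py"]))
# ...but a launcher-script change under ci3/ alone does not.
print(triggers_dashboard_deploy(["ci3/bootstrap_ec2"]))
```

Under this approximation, a commit that only touches `ci3/bootstrap_ec2` never triggers the deploy, so nothing prompts the host to update `REPO_PATH`.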
## Fix
Refresh the launcher checkout to `origin/next` at grind launch time,
before shelling out. This is self-healing and independent of deploys. It
matches the existing design where the launcher always runs
current-`next` orchestration scripts while the grind *target commit* is
checked out on the remote box — so this does **not** restrict which
branch/commit you can grind. If the refresh fails (e.g. transient
network), the error is surfaced in the run log instead of silently
grinding on a stale tree.
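
A minimal sketch of the refresh step, under stated assumptions rather than the exact code added to `rk.py`. The helper names (`run_git`, `refresh_checkout`) are hypothetical, and the specific git commands (`fetch` followed by `reset --hard`) are one plausible way to pin a checkout to `origin/next`; the actual change may use different commands:

```python
import subprocess
from typing import Callable, List, Tuple

# A runner takes git arguments and a working directory and returns
# (exit_code, combined_output). Injecting it keeps the sketch testable.
Runner = Callable[[List[str], str], Tuple[int, str]]


def run_git(args: List[str], cwd: str) -> Tuple[int, str]:
    """Run `git <args>` in cwd and return (exit code, stdout + stderr)."""
    try:
        proc = subprocess.run(
            ["git", *args], cwd=cwd, capture_output=True, text=True
        )
    except OSError as exc:  # e.g. missing directory or git binary
        return 1, str(exc)
    return proc.returncode, (proc.stdout + proc.stderr).strip()


def refresh_checkout(
    repo_path: str, run: Runner = run_git
) -> Tuple[bool, List[str]]:
    """Bring repo_path to origin/next; return (ok, lines for the run log)."""
    log: List[str] = []
    steps = [
        ["fetch", "origin", "next"],
        # Assumption: the launcher checkout holds no local edits worth keeping.
        ["reset", "--hard", "origin/next"],
    ]
    for args in steps:
        code, output = run(args, repo_path)
        log.append(f"$ git {' '.join(args)}")
        if output:
            log.append(output)
        if code != 0:
            # Surface the failure instead of grinding on a stale tree.
            log.append(f"ERROR: refresh failed (exit {code}); not launching grind")
            return False, log
    return True, log
```

A caller would run `refresh_checkout(REPO_PATH)` before shelling out to `./ci.sh grind-test ...`, write the returned lines to the run log, and skip the launch when the first element is `False`.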
## Testing
`python3 -m py_compile ci3/dashboard/rk.py` passes. The behavior change
is host-side (requires the dashboard's `REPO_PATH` checkout) and can't
be exercised in unit CI; it will take effect on the next dashboard
deploy. The immediate one-time unblock is still to refresh `REPO_PATH`
on `ci.aztec-labs.com` and restart `rkapp`.
---
*Created by
[claudebox](https://claudebox.work/v2/sessions/1c05a513cb601b21) ·
group: `slackbot`*

1 file changed
Lines changed: 20 additions & 0 deletions