| 1 | +--- |
| 2 | +id: DOC-CI-RUNNERS |
| 3 | +title: CI runner pool operations runbook |
| 4 | +type: reference |
| 5 | +status: current |
| 6 | +tags: [reference, ci, runners, ops, runbook] |
| 7 | +--- |
| 8 | + |
| 9 | +# CI runner pool operations runbook |
| 10 | + |
| 11 | +Rivet's gating CI jobs (`fmt`, `clippy`, `test`, `mutation`, `Kani`, |
| 12 | +Playwright) all run on the self-hosted pool labeled |
| 13 | +`[self-hosted, linux, x64, <class>]`. That pool is a **single point of |
| 14 | +failure**: when it goes offline, every gate queues indefinitely with no |
| 15 | +fallback, and — without the liveness workflow that fires on GitHub-hosted |
| 16 | +runners — the failure is invisible to anyone not watching the queue by |
| 17 | +hand. This document is the runbook for the two situations that matter: |
| 18 | + |
| 19 | +1. The scheduled liveness workflow raised the tracker issue. |
| 20 | +2. You are diagnosing the pool by hand (liveness workflow missing, or |
| 21 | + confirming what it reported). |
| 22 | + |
| 23 | +Not covered here: the liveness workflow itself (see |
| 24 | +`.github/workflows/runner-liveness.yml`) and the option of routing fast |
| 25 | +gates to GitHub-hosted runners (open under #509, tracked separately). |
| 26 | + |
| 27 | +> **Reminder (REQ-051).** Pre-commit hooks are convenience; CI is the |
| 28 | +> gate. Anything that skips CI does not have a traceability claim behind |
| 29 | +> it. If the pool is down, treat merges to `main` with the same caution |
| 30 | +> you would a `--no-verify` commit — the four-gate pipeline |
| 31 | +> (`pre-commit → bazel → CI → verify-matrix`) is running on three legs. |
| 32 | + |
| 33 | +## Runner classes |
| 34 | + |
| 35 | +Three self-hosted classes are pinned by workflows: |
| 36 | + |
| 37 | +| Label | Purpose | Jobs that use it | |
| 38 | +| --- | --- | --- | |
| 39 | +| `light` | Fast, low-RAM checks — `fmt`, `yaml-lint`, `validate`, `clippy`, doc renders. | The bulk of `ci.yml`'s per-PR gates. | |
| 40 | +| `rust-cpu` | CPU-heavy Rust builds — `test`, `Kani`, coverage runs. | Test / proof shards. | |
| 41 | +| `lean-mem` | Memory-constrained shards (host-level `MemoryHigh=32 GiB`, adding `MemoryMax=~48 GiB` per #590). | `mutants-core`, Miri. Never route a job here without a per-process memory cap. | |
| 42 | + |
| 43 | +A single class going offline while the others stay up produces |
| 44 | +**partial** failure: the queued-age liveness alert still fires because |
| 45 | +some workflows can't be scheduled, but a `runner list` may still show |
| 46 | +online runners. Diagnose per-class, not just aggregate. |
| 47 | + |
| 48 | +## The two diagnostic commands |
| 49 | + |
| 50 | +`runner-liveness.yml` runs these on a `*/15 * * * *` cadence; if the |
| 51 | +tracker issue is open, these are also the commands you run by hand. |
| 52 | + |
| 53 | +**Are runners registered and online?** (best-effort — needs the |
| 54 | +`administration:read` scope, which the default `GITHUB_TOKEN` does not |
| 55 | +carry): |
| 56 | + |
| 57 | +```sh |
| 58 | +gh api repos/pulseengine/rivet/actions/runners \ |
| 59 | + --jq '{total: .total_count, |
| 60 | + online: [.runners[] | select(.status=="online")] | length, |
| 61 | + classes: [.runners[] | .labels[].name] | unique}' |
| 62 | +``` |
| 63 | + |
| 64 | +- `total: 0` → pool is not registered with the repo at all (likely |
| 65 | + org-level pool — the repo-scoped API returns empty; check the org's |
| 66 | + runner page instead). |
| 67 | +- `total > 0`, `online: 0` → runners are registered but their agent |
| 68 | + processes are down or offline. Restart on the runner host is usually |
| 69 | + enough; see "Bring the pool back" below. |
| 70 | +- `online > 0` but no jobs picking up → check the queued-age query below; |
| 71 | + runners are online but not accepting the queued job's label set (label |
| 72 | + mismatch or an org-level concurrency cap). |
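The three cases above can be sketched as a small shell helper. This is a minimal illustration, not part of `runner-liveness.yml`: the function name `classify_pool` and its output strings are hypothetical, and it assumes you have already read `total` and `online` from the runner-list query.

```sh
# Hypothetical helper: map the `total` / `online` counts from the
# runner-list query onto the three cases described above.
classify_pool() {
  total=$1
  online=$2
  if [ "$total" -eq 0 ]; then
    # Repo-scoped API returned nothing; check the org's runner page.
    echo "not-registered"
  elif [ "$online" -eq 0 ]; then
    # Runners are registered, but no agent process is online.
    echo "agents-down"
  else
    # Runners are online; if jobs still queue, suspect a label mismatch
    # or a concurrency cap and run the queued-age query.
    echo "online"
  fi
}

classify_pool 4 0   # prints: agents-down
```

The helper only names the case; the follow-up action for each case is still the one given in the list above.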
| 73 | + |
| 74 | +**Are jobs stuck queued?** (authoritative — needs only `actions:read`, |
| 75 | +works even when the runner list returns empty for permission reasons): |
| 76 | + |
| 77 | +```sh |
| 78 | +gh api 'repos/pulseengine/rivet/actions/runs?status=queued' \ |
| 79 | + --jq '.workflow_runs[] | "\(.id) age=\(((now - (.created_at | fromdateiso8601)) / 60 | floor))m \(.name) — \(.head_branch)"' |
| 80 | +``` |
| 81 | + |
| 82 | +- Any run age over the liveness threshold (`QUEUE_THRESHOLD_MINUTES`, |
| 83 | + currently **30**) is the alarm shape. A single stuck run 30+ minutes |
| 84 | + old is enough to fire the tracker; two or more is a durable outage. |
| 85 | +- If the oldest queued run is on `main` (a push run for a merged PR), |
| 86 | + the CI record for that merge is missing — the merge landed with no |
| 87 | + verification. Note it in the tracker issue for the next audit. |
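As a rough sketch, the threshold check can be applied to the output of the queued-runs query with plain shell. The helper names `is_alarm_age` and `stale_runs` are hypothetical. The sketch assumes the `<id> age=<N>m <name> ...` line format produced by the query above, and it treats the boundary as inclusive (following "30+ minutes"); the real workflow may compare differently.

```sh
# Assumption: mirrors the QUEUE_THRESHOLD_MINUTES value quoted above.
QUEUE_THRESHOLD_MINUTES=30

# Hypothetical helper: succeed when an age in whole minutes has reached
# the threshold (inclusive boundary is an assumption).
is_alarm_age() {
  [ "$1" -ge "$QUEUE_THRESHOLD_MINUTES" ]
}

# Hypothetical helper: read query output lines on stdin and print only
# the runs whose age has reached the threshold.
stale_runs() {
  while IFS= read -r line; do
    age=${line#*age=}    # drop everything through the first "age="
    age=${age%%m*}       # keep only the digits before the "m" suffix
    case $age in
      ''|*[!0-9]*) continue ;;   # skip lines that do not parse
    esac
    if is_alarm_age "$age"; then
      printf '%s\n' "$line"
    fi
  done
}
```

Piping the queued-runs query into `stale_runs` leaves only the runs worth noting in the tracker issue; an empty result means nothing has crossed the threshold yet.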
| 88 | + |
| 89 | +## When the liveness tracker issue is open |
| 90 | + |
| 91 | +`runner-liveness.yml` auto-opens the tracker issue, labeled `runner-down` |
| 92 | +and titled `🚨 CI runner pool liveness alert`, updates it every 15 min, |
| 93 | +and auto-closes it on recovery. When it's open: |
| 94 | + |
| 95 | +1. **Confirm the shape** — re-run the two commands above. Match against |
| 96 | + the tracker's most recent probe comment. If they now report the |
| 97 | + pool healthy, the next probe (≤ 15 min) will auto-close; no manual |
| 98 | + action. |
| 99 | +2. **Identify which class is affected** — a `runner list` at `total > 0` |
| 100 | + with `online > 0` but jobs still queued means the class the job |
| 101 | + asked for (`lean-mem` / `rust-cpu` / `light`) has no live members. |
| 102 | +3. **Bring the pool back** — see below. |
| 103 | +4. **Post the outage window** on the tracker issue as one comment before |
| 104 | + it auto-closes: start time, end time, affected class, root cause if |
| 105 | + known. That comment becomes the durable audit record after the |
| 106 | + tracker closes. |
| 107 | +5. **Note affected `main` runs** in the same comment. Any `main` push |
| 108 | + run that queued and expired without running is a merge that reached |
| 109 | + `main` without CI verification. Those need the next PR against them |
| 110 | + to re-exercise the same gates. |
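One possible way to keep the outage comment from steps 4 and 5 consistent is to generate it from a small formatter. This is only a sketch: the function name `outage_comment`, the field layout, and the fallback strings are hypothetical, not a format the tracker requires.

```sh
# Hypothetical helper: format the outage-window comment from steps 4 and 5.
# Arguments: start time, end time, affected class,
#            root cause (optional), affected main runs (optional).
outage_comment() {
  printf '%s\n' "Outage window"
  printf '%s\n' "- Start: $1"
  printf '%s\n' "- End: $2"
  printf '%s\n' "- Affected class: $3"
  printf '%s\n' "- Root cause: ${4:-unknown}"
  printf '%s\n' "- Affected main runs: ${5:-none noted}"
}
```

The result can be posted with `gh issue comment <tracker-number> --repo pulseengine/rivet --body "$(outage_comment ...)"`, where `<tracker-number>` is the number of the open tracker issue.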
| 111 | + |
| 112 | +Do NOT close the tracker manually — let the next probe close it on |
| 113 | +recovery. A manual close on a still-down pool is a false-recovery signal, |
| 114 | +and the alert re-arms only when the pool next transitions. |
| 115 | + |
| 116 | +## Bring the pool back |
| 117 | + |
| 118 | +The fix happens on the runner host, which is outside this repo's write |
| 119 | +surface. The steps every operator needs: |
| 120 | + |
| 121 | +1. `ssh` to the runner host (`pulseengine-ci-<NN>`). |
| 122 | +2. Check the runner-agent service status: |
| 123 | + `systemctl --user status actions.runner.pulseengine-rivet.<host>.service` |
| 124 | + (or `github-runner@<name>` per the host's install layout). |
| 125 | +3. Inspect the last few log entries: |
| 126 | + `journalctl --user -u actions.runner.pulseengine-rivet.<host>.service -n 50 --no-pager`. |
| 127 | + Common failure shapes: |
| 128 | + - `disk full` → clear per-runner `_work/` and the shared |
| 129 | + `$CARGO_HOME/registry/cache/**`; see #567 for the coordination |
| 130 | + rules on the shared cargo state. |
| 131 | + - `token expired` / `Not configured` → re-register the runner via |
| 132 | + `./config.sh --url … --token …` (org owner has to mint the token). |
| 133 | + - Process alive but no jobs picking up → label mismatch (the runner |
| 134 | + lost or never had the class label the job asks for); re-run |
| 135 | + `./config.sh` with the correct `--labels`. |
| 136 | +4. Restart the service: |
| 137 | + `systemctl --user restart actions.runner.pulseengine-rivet.<host>.service`. |
| 138 | +5. Confirm via the two diagnostic commands above that the runner is |
| 139 | + online and jobs drain. |
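The failure shapes in step 3 can be turned into a quick triage aid. The sketch below is hypothetical throughout: `suggest_fix` is an illustrative name, and it assumes the log lines literally contain the phrases used in the list above, which may not match the exact text the runner agent writes.

```sh
# Hypothetical helper: suggest a next step for one log line, using the
# failure shapes listed in step 3. Returns non-zero for unknown lines.
suggest_fix() {
  case $1 in
    *"disk full"*)
      echo "clear _work/ and the shared cargo registry cache (see #567)"
      ;;
    *"token expired"*|*"Not configured"*)
      echo "re-register with ./config.sh (org owner mints the token)"
      ;;
    *)
      # No known shape. If the process is alive but idle, check --labels.
      return 1
      ;;
  esac
}

# Possible usage on the runner host (unit name as in step 3):
#   journalctl --user -u <unit> -n 50 --no-pager |
#     while IFS= read -r line; do suggest_fix "$line"; done
```

Lines that match nothing produce no output, so an empty result points toward the third shape (process alive, labels wrong) rather than meaning the host is healthy.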
| 140 | + |
| 141 | +If more than one host is affected, work them in parallel — the tracker |
| 142 | +issue auto-closes on the first passing probe after all classes recover. |
| 143 | + |
| 144 | +## Escalation |
| 145 | + |
| 146 | +- **Pool has been down > 2 hours** and no operator on the runner host is |
| 147 | + reachable: the maintainer's call is whether to admin-merge without CI |
| 148 | + or wait. Default is to wait; #509 option 2 (route fast gates to |
| 149 | + `ubuntu-latest`) is the durable answer here and is tracked |
| 150 | + separately. |
| 151 | +- **Every host reports full disk**: coordinate the cleanup with #567 |
| 152 | + (the shared-cargo race). Do not `rm -rf` shared registry state on any |
| 153 | + host while jobs are running on another — the "in-flight build |
| 154 | + corruption" failure mode. |
| 155 | +- **Runners online but jobs still queue for a specific class only**: |
| 156 | + check whether the org-level pool has a per-class concurrency cap set |
| 157 | + above the actual live count. |
| 158 | + |
| 159 | +## Related |
| 160 | + |
| 161 | +- `.github/workflows/runner-liveness.yml` — the alerting workflow itself. |
| 162 | +- #509 — parent issue; option 1 (liveness) shipped, option 2 (fallback |
| 163 | + routing) open, option 3 (this runbook). |
| 164 | +- #567 — self-hosted-runner disk exhaustion / shared-`CARGO_HOME` race. |
| 165 | +- #590 — `rivet_core` test-binary memory shape (why `lean-mem` |
| 166 | + exists in the first place). |
| 167 | +- REQ-051 — hooks are convenience; CI is the gate. |