Commit 8f9fac6

Merge pull request #662 from pulseengine/docs/issue-509-runner-runbook
docs: runner pool operations runbook (#509)
2 parents 8d53797 + eeee4c4 commit 8f9fac6

1 file changed

Lines changed: 167 additions & 0 deletions

docs/ci-runners.md

@@ -0,0 +1,167 @@
---
id: DOC-CI-RUNNERS
title: CI runner pool operations runbook
type: reference
status: current
tags: [reference, ci, runners, ops, runbook]
---

# CI runner pool operations runbook

Rivet's gating CI jobs (`fmt`, `clippy`, `test`, `mutation`, `Kani`,
Playwright) all run on the self-hosted pool labeled
`[self-hosted, linux, x64, <class>]`. That pool is a **single point of
failure**: when it goes offline, every gate queues indefinitely with no
fallback. Without the liveness workflow, which runs on GitHub-hosted
runners, the failure is invisible to anyone not watching the queue by
hand. This document is the runbook for the two situations that matter:

1. The scheduled liveness workflow raised the tracker issue.
2. You are diagnosing the pool by hand (liveness workflow missing, or
   confirming what it reported).

Not covered here: the liveness workflow itself (see
`.github/workflows/runner-liveness.yml`), and the option of routing fast
gates to GitHub-hosted runners (open under #509, tracked separately).

> **Reminder (REQ-051).** Pre-commit hooks are convenience; CI is the
> gate. Anything that skips CI does not have a traceability claim behind
> it. If the pool is down, treat merges to `main` with the same caution
> you would a `--no-verify` commit: the four-gate pipeline
> (`pre-commit → bazel → CI → verify-matrix`) is running on three legs.

## Runner classes

Three self-hosted classes are pinned by workflows:

| Label | Purpose | Jobs that use it |
| --- | --- | --- |
| `light` | Fast, low-RAM checks: `fmt`, `yaml-lint`, `validate`, `clippy`, doc renders. | The bulk of `ci.yml`'s per-PR gates. |
| `rust-cpu` | CPU-heavy Rust builds: `test`, `Kani`, coverage runs. | Test / proof shards. |
| `lean-mem` | Memory-constrained shards (host-level `MemoryHigh=32 GiB`, adding `MemoryMax=~48 GiB` per #590). | `mutants-core`, Miri. Never route a job here without a per-process memory cap. |

A single class going offline while the others stay up produces
**partial** failure: the queued-age liveness alert still fires because
some workflows can't be scheduled, but a `runner list` may still show
online runners. Diagnose per-class, not just aggregate.
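
One way to make the per-class check concrete is to count online runners for each of the three labels. This is a minimal sketch, assuming the repo-scoped runners endpoint used elsewhere in this runbook is readable with your token; the function names `runner_label_status` and `online_per_class` are illustrative and not part of the repo.

```sh
# Emit one "<label> <status>" line per runner label.
# Assumption: same repo-scoped endpoint as the runbook's runner query,
# which needs the administration:read scope.
runner_label_status() {
  gh api repos/pulseengine/rivet/actions/runners \
    --jq '.runners[] | .status as $s | .labels[].name | "\(.) \($s)"'
}

# Read "<label> <status>" lines on stdin and print "<class> online=<n>"
# for each of the three classes this runbook names. Other labels
# (self-hosted, linux, x64) are ignored.
online_per_class() {
  awk '
    BEGIN { n["light"] = 0; n["rust-cpu"] = 0; n["lean-mem"] = 0 }
    ($1 in n) && $2 == "online" { n[$1]++ }
    END {
      print "light online=" n["light"]
      print "rust-cpu online=" n["rust-cpu"]
      print "lean-mem online=" n["lean-mem"]
    }'
}

# Usage (needs network access and a suitably scoped token):
#   runner_label_status | online_per_class
```

A class that reports `online=0` while its jobs sit queued is the partial-failure shape described here, even when the aggregate online count is non-zero.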

## The two diagnostic commands

`runner-liveness.yml` runs these on a `*/15 * * * *` cadence; if the
tracker issue is open, these are also the commands you run by hand.

**Are runners registered and online?** (best-effort: needs the
`administration:read` scope, which the default `GITHUB_TOKEN` does not
carry):

```sh
gh api repos/pulseengine/rivet/actions/runners \
  --jq '{total: .total_count,
         online: [.runners[] | select(.status=="online")] | length,
         classes: [.runners[] | .labels[].name] | unique}'
```

- `total: 0` → pool is not registered with the repo at all (likely an
  org-level pool: the repo-scoped API returns empty; check the org's
  runner page instead).
- `total > 0`, `online: 0` → runners are registered but their agent
  processes are down or offline. A restart on the runner host is usually
  enough; see "Bring the pool back" below.
- `online > 0` but no jobs picking up → check the queued-age query below;
  runners are online but not accepting the queued job's label set (label
  mismatch or an org-level concurrency cap).
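
The three cases in this list can be written as a small decision function. This is a sketch only: `diagnose_runner_counts` is a hypothetical helper, and the message wording is illustrative and not output from any existing tool.

```sh
# Map the total/online counts from the runner query to the three cases
# in the list. Arguments: <total> <online>.
diagnose_runner_counts() {
  total=$1
  online=$2
  if [ "$total" -eq 0 ]; then
    echo "not-registered: repo-scoped API is empty; check the org runner page"
  elif [ "$online" -eq 0 ]; then
    echo "agents-down: runners registered but none online; restart on the host"
  else
    echo "online: if jobs still queue, suspect label mismatch or a concurrency cap"
  fi
}

# Usage, with the counts read off the runner query's output:
#   diagnose_runner_counts 4 0
```

The third branch is deliberately not a clean bill of health: online runners with a growing queue still need the queued-age query.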

**Are jobs stuck queued?** (authoritative: needs only `actions:read`,
works even when the runner list returns empty for permission reasons):

```sh
gh api 'repos/pulseengine/rivet/actions/runs?status=queued' \
--jq '.workflow_runs[] | "\(.id) age=\(((now - (.created_at | fromdateiso8601)) / 60 | floor))m \(.name) — \(.head_branch)"'
```

- Any run age over the liveness threshold (`QUEUE_THRESHOLD_MINUTES`,
  currently **30**) is the alarm shape. A single stuck run 30+ minutes
  old is enough to fire the tracker; two or more is a durable outage.
- If the oldest queued run is on `main` (a push run for a merged PR),
  the CI record for that merge is missing: the merge landed with no
  verification. Note it in the tracker issue for the next audit.
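
The threshold rule can be applied mechanically to the queued-runs output. The sketch below assumes input lines of the form `<run-id> <age-in-minutes>` and treats "30+ minutes" as at-or-over the threshold; `stuck_runs` is a hypothetical helper, and the liveness workflow's own logic remains authoritative.

```sh
# Threshold value quoted in this runbook; override via the environment.
QUEUE_THRESHOLD_MINUTES=${QUEUE_THRESHOLD_MINUTES:-30}

# Read "<run-id> <age-in-minutes>" lines on stdin. Print each run at or
# past the threshold, then a one-line summary using the runbook's
# categories: one stuck run is an alarm, two or more a durable outage.
stuck_runs() {
  awk -v limit="$QUEUE_THRESHOLD_MINUTES" '
    $2 + 0 >= limit + 0 { print $1 " age=" $2 "m"; stuck++ }
    END {
      if (stuck >= 2)      print "summary: durable outage (" stuck " runs)"
      else if (stuck == 1) print "summary: alarm (1 run)"
      else                 print "summary: ok"
    }'
}

# Usage (same endpoint and age arithmetic as the queued-runs query):
#   gh api 'repos/pulseengine/rivet/actions/runs?status=queued' \
#     --jq '.workflow_runs[] | "\(.id) \(((now - (.created_at | fromdateiso8601)) / 60 | floor))"' \
#     | stuck_runs
```

Keeping the classification separate from the `gh` call means it can be checked against canned input without network access or a token.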

## When the liveness tracker issue is open

`runner-liveness.yml` auto-opens the tracker labeled `runner-down` and
titled `🚨 CI runner pool liveness alert`, updates it every 15 min, and
auto-closes it on recovery. When it's open:

1. **Confirm the shape**: re-run the two commands above. Match against
   the tracker's most recent probe comment. If they now report the
   pool healthy, the next probe (≤ 15 min) will auto-close; no manual
   action.
2. **Identify which class is affected**: a `runner list` at `total > 0`
   with `online > 0` but jobs still queued means the class the job
   asked for (`lean-mem` / `rust-cpu` / `light`) has no live members.
3. **Bring the pool back**: see below.
4. **Post the outage window** on the tracker issue as one comment before
   it auto-closes: start time, end time, affected class, root cause if
   known. That comment becomes the durable audit record after the
   tracker closes.
5. **Note affected `main` runs** in the same comment. Any `main` push
   run that queued and expired without running is a merge that reached
   `main` without CI verification. Those need the next PR against them
   to re-exercise the same gates.

Do NOT close the tracker manually; let the next probe close it on
recovery. A manual close on a still-down pool is a false-recovery signal,
and the alert re-arms only when the pool next transitions.
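
The single audit comment from steps 4 and 5 can be assembled with a small helper so that no field is forgotten. This is a sketch under assumptions: `outage_comment` and its field labels are hypothetical, the runbook does not prescribe a comment format, and `TRACKER` stands in for whatever issue number the tracker currently has.

```sh
# Build the one-comment outage record described in steps 4 and 5.
# Arguments: <start> <end> <class> <root-cause> [<main-run-id>...]
# All four leading arguments are required; pass "" for an unknown cause.
outage_comment() {
  start=$1; end=$2; class=$3; cause=$4
  shift 4
  printf 'Outage window: %s to %s\n' "$start" "$end"
  printf 'Affected class: %s\n' "$class"
  printf 'Root cause: %s\n' "${cause:-unknown}"
  if [ "$#" -gt 0 ]; then
    printf 'Unverified main runs: %s\n' "$*"
  else
    printf 'Unverified main runs: none\n'
  fi
}

# Usage, run from a checkout of the repo (START, END, and TRACKER are
# placeholders the operator fills in):
#   outage_comment "$START" "$END" lean-mem "disk full" 123 456 \
#     | gh issue comment "$TRACKER" --body-file -
```

Posting one comment that carries both the window and the affected `main` runs keeps the audit record in a single place once the tracker closes.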

## Bring the pool back

The action lives on the runner host, not this repo, so it is outside this
repo's write surface. The steps every operator needs:

1. `ssh` to the runner host (`pulseengine-ci-<NN>`).
2. Check the runner-agent service status:
   `systemctl --user status actions.runner.pulseengine-rivet.<host>.service`
   (or `github-runner@<name>` per the host's install layout).
3. Inspect the last few log entries:
   `journalctl --user -u actions.runner.pulseengine-rivet.<host>.service -n 50 --no-pager`.
   Common failure shapes:
   - `disk full` → clear per-runner `_work/` and the shared
     `$CARGO_HOME/registry/cache/**`; see #567 for the coordination
     rules on the shared cargo state.
   - `token expired` / `Not configured` → re-register the runner via
     `./config.sh --url … --token …` (org owner has to mint the token).
   - Process alive but no jobs picking up → label mismatch (the runner
     lost or never had the class label the job asks for); re-run
     `./config.sh` with the correct `--labels`.
4. Restart the service:
   `systemctl --user restart actions.runner.pulseengine-rivet.<host>.service`.
5. Confirm via the two diagnostic commands above that the runner is
   online and jobs drain.

If more than one host is affected, work them in parallel; the tracker
issue auto-closes on the first passing probe after all classes recover.
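
The log triage in step 3 can be sketched as a filter over the journal output. The match strings below are assumptions about how these failures appear in the log (the runbook names the shapes, not the exact messages), and `runner_unit` and `suggest_remedy` are hypothetical helpers; adjust both to what the runner agent on your host actually prints.

```sh
# Unit name pattern as given in this runbook; the operator supplies <host>.
runner_unit() {
  printf 'actions.runner.pulseengine-rivet.%s.service\n' "$1"
}

# Read journal text on stdin and print the matching remedy from the
# "common failure shapes" list. Match strings are assumptions, not
# confirmed agent output.
suggest_remedy() {
  log=$(cat)
  case $log in
    *"disk full"*|*"No space left on device"*)
      echo "clear _work/ and the shared cargo registry cache (see #567)" ;;
    *"token expired"*|*"Not configured"*)
      echo "re-register with ./config.sh (org owner mints the token)" ;;
    *)
      echo "no known shape matched; if jobs are not picked up, check --labels" ;;
  esac
}

# Usage on the runner host (replace <host> with the unit's host segment):
#   unit=$(runner_unit "<host>")
#   journalctl --user -u "$unit" -n 50 --no-pager | suggest_remedy
```

The helper only suggests a next step; the restart in step 4 and the confirmation in step 5 still apply after any remedy.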

## Escalation

- **Pool has been down > 2 hours** and no operator on the runner host is
  reachable: the maintainer's call is whether to admin-merge without CI
  or wait. Default is to wait; #509 option 2 (route fast gates to
  `ubuntu-latest`) is the durable answer here and is tracked
  separately.
- **Every host reports full disk**: coordinate the cleanup with #567
  (the shared-cargo race). Do not `rm -rf` shared registry state on any
  host while jobs are running on another; that is the "in-flight build
  corruption" failure mode.
- **Runners online but jobs still queue for a specific class only**:
  check whether the org-level pool has a per-class concurrency cap set
  above the actual live count.

## Related

- `.github/workflows/runner-liveness.yml`: the alerting workflow itself.
- #509: parent issue; option 1 (liveness) shipped, option 2 (fallback
  routing) open, option 3 (this runbook).
- #567: self-hosted-runner disk exhaustion / shared-`CARGO_HOME` race.
- #590: `rivet_core` test-binary memory shape (why `lean-mem`
  exists in the first place).
- REQ-051: hooks are convenience; CI is the gate.
