docs: runner pool operations runbook (#509) by avrabe · Pull Request #662 · pulseengine/rivet

avrabe · 2026-07-02T17:53:46Z

Refs #509 (slice 3 of the three suggested fixes in the issue body).

What

New reference doc at docs/ci-runners.md, matching the top-level docs/*.md layout for shipped operational reference material (adopts the same frontmatter shape as docs/pre-commit.md and docs/verification.md). It is the runbook the liveness alert (slice 1, .github/workflows/runner-liveness.yml) points at when it opens the runner-down tracker issue.

Covers:

The three self-hosted classes (light / rust-cpu / lean-mem) and which jobs pin each. A partial outage — one class down while the others stay up — is diagnosable without grepping every workflow file.
The two diagnostic commands — gh api …/runners (best-effort; needs administration:read, which GITHUB_TOKEN cannot carry) and the queued-run-age query (authoritative — needs only actions:read). Both mirror what runner-liveness.yml probes on its 15-min cadence, so an operator running them by hand sees the same shape the tracker issue reports.
What to do when the tracker is open — confirm the shape, identify the affected class, bring the pool back, post the outage window as one comment before auto-close (that comment becomes the durable audit record after the tracker closes), note any main push runs that queued and expired (merges that reached main without CI verification).
Host-side recovery steps: systemd unit paths and common failure shapes (disk-full → coordinate with #567's shared-cargo race; token-expired; label-mismatch). Off-scope for this repo but referenced so an on-call operator can act.
Escalation — > 2 hour outage, every-host-full disk, class-specific queueing.
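
As an illustration of the authoritative queued-run-age check, here is a minimal sketch. It is not the query that docs/ci-runners.md or runner-liveness.yml actually uses, since that query is not reproduced in this PR. It assumes the payload has the shape of GitHub's "list workflow runs" REST response (a `workflow_runs` array whose entries carry `status` and `created_at`), for example as fetched with `gh api "repos/OWNER/REPO/actions/runs?status=queued"`. The function name is hypothetical.

```python
from datetime import datetime, timezone


def oldest_queued_age_minutes(runs_payload, now=None):
    """Return the age in minutes of the oldest queued run, or None if none are queued.

    Assumes `runs_payload` looks like the GitHub REST "list workflow runs"
    response: {"workflow_runs": [{"status": ..., "created_at": ...}, ...]}.
    """
    now = now or datetime.now(timezone.utc)
    ages = []
    for run in runs_payload.get("workflow_runs", []):
        if run.get("status") != "queued":
            continue  # only runs still waiting for a runner are of interest
        # GitHub timestamps are ISO 8601 with a trailing "Z" (UTC).
        created = datetime.fromisoformat(run["created_at"].replace("Z", "+00:00"))
        ages.append((now - created).total_seconds() / 60)
    return max(ages) if ages else None


if __name__ == "__main__":
    sample = {
        "workflow_runs": [
            {"status": "queued", "created_at": "2026-07-02T10:00:00Z"},
            {"status": "completed", "created_at": "2026-07-02T08:00:00Z"},
        ]
    }
    fixed_now = datetime(2026, 7, 2, 11, 30, tzinfo=timezone.utc)
    print(oldest_queued_age_minutes(sample, fixed_now))  # 90.0
```

A steadily growing oldest-queued age while no run is in progress is the shape that suggests a class has no live runner. The age at which that becomes alarming is a judgment call that this sketch deliberately leaves to the operator.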

Also carries the REQ-051 reminder in the callout (hooks are convenience, CI is the gate — if the pool is down, treat merges to main with the same caution as a --no-verify commit) so a reader doesn't need to hunt the constraint down separately.

Direction picked (option (c) from the triage sketch)

Options 1 and 2 from the #509 body are separately tracked:

Option 1 — liveness alert already landed as .github/workflows/runner-liveness.yml. This runbook cross-references it as the source of truth for the tracker-issue lifecycle (auto-open on the first bad probe, auto-update every 15 min, auto-close on recovery).
Option 2 — route fast gates to ubuntu-latest is a policy call with billing implications; kept out of scope per the issue body's own "these are choices for the maintainer" framing.
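
The tracker-issue lifecycle described under Option 1 (auto-open on the first bad probe, auto-update on each later bad probe, auto-close on recovery) can be read as a small decision table. The sketch below is only that reading. The function name and the returned action strings are hypothetical, and the real logic lives in runner-liveness.yml, which this PR does not change.

```python
def tracker_action(probe_ok: bool, tracker_open: bool) -> str:
    """Decide what a single liveness probe should do to the tracker issue.

    Hypothetical helper: the action names are illustrative, not taken from
    the workflow file.
    """
    if not probe_ok and not tracker_open:
        return "open"    # first bad probe: open the runner-down tracker
    if not probe_ok and tracker_open:
        return "update"  # still down: refresh the existing tracker
    if probe_ok and tracker_open:
        return "close"   # recovered: close the tracker
    return "none"        # healthy and no tracker: nothing to do


def replay(probes):
    """Replay a sequence of probe results and collect the resulting actions."""
    tracker_open = False
    actions = []
    for ok in probes:
        action = tracker_action(ok, tracker_open)
        if action == "open":
            tracker_open = True
        elif action == "close":
            tracker_open = False
        actions.append(action)
    return actions
```

For example, `replay([True, False, False, True])` walks through a healthy probe, an outage spanning two probes, and a recovery. This is why the runbook asks for the outage-window comment to be posted before the recovering probe closes the issue.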

Blog process pre-check

Re-fetched https://pulseengine.eu/blog/ at the start of this run. No posts on team workflow / branching / PR flow / testing conventions beyond the spec-driven-development post already reflected here. Nothing overrides this PR.

Test plan

rivet validate on the rivet repo — PASS, 497 warnings (unchanged from main; the new doc carries the standard DOC-CI-RUNNERS frontmatter so the docs scanner picks it up cleanly).
cargo fmt --all --check — clean (docs-only change, no Rust touched).
CI Format / YAML-lint / Docs Check gates green on this branch.
Reviewer smoke-test: read the runbook top-to-bottom on the assumption the pool is down right now — every step should be actionable from the two commands the doc provides.

Not this PR

runner-liveness.yml itself — already shipped.
Any host-side change (systemd unit, _work/ cleanup, CARGO_HOME layout): off-scope; #567 (CI: self-hosted runner disk exhaustion fails all compile gates; post-job cleanup hook frees nothing) tracks the shared-cargo hazard.
Fast-gate routing to ubuntu-latest: option 2 of #509 (CI has a single point of failure: when the self-hosted runner pool goes offline, every gate queues forever with no fallback and no liveness alert), a separate policy call.

Commit trailer: Trace: skip (docs is an exempt type per CLAUDE.md's commit-traceability section), Refs: #509, #567, #590, REQ-051.

Generated by Claude Code — issue-triage agent run 2026-07-02.

Slice 3 of #509: the runbook the liveness alert (slice 1, `runner-liveness.yml`) points at when it opens the `runner-down` tracker issue.

Documents:

- The three self-hosted classes (`light` / `rust-cpu` / `lean-mem`) and which jobs pin each, so a partial outage is diagnosable without reading every workflow file.
- The two diagnostic commands (`gh api …/runners` best-effort, queued-run age authoritative) with what each output shape means, matching the liveness workflow's own probe logic.
- What to do when the tracker issue is open: confirm shape, identify the affected class, bring the pool back, post the outage window as the durable audit record before auto-close, note any `main` runs that queued and expired (merges that reached `main` without CI).
- The host-side recovery steps (systemd unit paths, common failure shapes: disk-full, token-expired, label-mismatch); off-scope for this repo but referenced so an operator can act.
- Escalation: > 2 hour outage, every-host-full disk (coordinate with #567's shared-cargo race), class-specific queueing.

Doesn't ship options 1 or 2 from #509. Option 1 (liveness alert) already landed as `.github/workflows/runner-liveness.yml`; this runbook cross-references it. Option 2 (route fast gates to GitHub-hosted) is a policy call with billing implications, kept out of scope per the issue body's own framing.

Front-matter uses the standard rivet doc-artifact shape (`DOC-CI-RUNNERS`, `type: reference`, `status: current`). `rivet validate` on this repo remains PASS with the same 497 warnings.

Trace: skip
Refs: #509, #567, #590, REQ-051
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Jj8RPdbnK66ZeYgRuCxAdW

avrabe merged commit 8f9fac6 into main Jul 3, 2026
26 of 27 checks passed

avrabe deleted the docs/issue-509-runner-runbook branch July 3, 2026 05:40
