docs: runner pool operations runbook (#509) #662
Merged
Slice 3 of #509: the runbook the liveness alert (slice 1, `runner-liveness.yml`) points at when it opens the `runner-down` tracker issue.

Documents:
- The three self-hosted classes (`light` / `rust-cpu` / `lean-mem`) and which jobs pin each, so a partial outage is diagnosable without reading every workflow file.
- The two diagnostic commands (`gh api …/runners` best-effort, queued-run age authoritative) with what each output shape means, matching the liveness workflow's own probe logic.
- What to do when the tracker issue is open: confirm shape, identify the affected class, bring the pool back, post the outage window as the durable audit record before auto-close, note any `main` runs that queued and expired (merges that reached `main` without CI).
- The host-side recovery steps (systemd unit paths, common failure shapes: disk-full, token-expired, label-mismatch), off-scope for this repo but referenced so an operator can act.
- Escalation: > 2 hour outage, every-host-full disk (coordinate with #567's shared-cargo race), class-specific queueing.

Doesn't ship options 1 or 2 from #509. Option 1 (liveness alert) already landed as `.github/workflows/runner-liveness.yml`; this runbook cross-references it. Option 2 (route fast gates to GitHub-hosted) is a policy call with billing implications, kept out of scope per the issue body's own framing.

Front-matter uses the standard rivet doc-artifact shape (`DOC-CI-RUNNERS`, `type: reference`, `status: current`). `rivet validate` on this repo remains PASS with the same 497 warnings.

Trace: skip
Refs: #509, #567, #590, REQ-051
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Jj8RPdbnK66ZeYgRuCxAdW
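For reviewers who want a concrete picture of the queued-run-age probe that the commit message calls authoritative, here is a minimal sketch. It is not the logic in `runner-liveness.yml`, which is not shown in this PR. It assumes the input is a list of run records shaped like the `workflow_runs` entries of GitHub's "list workflow runs" REST response (`status`, `created_at`). The function names and the 30-minute threshold are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timezone

# Hypothetical threshold: the PR does not state how old a queued run must be
# before the pool is treated as down.
STALE_AFTER_MINUTES = 30


def oldest_queued_age_minutes(runs, now):
    """Return the age in minutes of the oldest queued run, or None if nothing is queued.

    `runs` is assumed to be a list of dicts with `status` and `created_at`
    keys, where `created_at` is a UTC timestamp such as "2026-07-02T10:00:00Z".
    """
    ages = []
    for run in runs:
        if run.get("status") != "queued":
            continue  # only runs still waiting for a runner matter here
        created = datetime.strptime(
            run["created_at"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        ages.append((now - created).total_seconds() / 60)
    return max(ages) if ages else None


def pool_looks_down(runs, now, stale_after=STALE_AFTER_MINUTES):
    """Flag the pool when the oldest queued run has waited at least `stale_after` minutes."""
    age = oldest_queued_age_minutes(runs, now)
    return age is not None and age >= stale_after
```

One way to obtain such records by hand would be a `gh api` call against the repository's `actions/runs` endpoint filtered to `status=queued`, which needs only read access to Actions. Treat that as an assumption about the operator's setup; the runbook's own commands remain the reference.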
Refs #509 (slice 3 of the three suggested fixes in the issue body).
What
New reference doc at `docs/ci-runners.md`, matching the top-level `docs/*.md` layout for shipped operational reference material (adopts the same frontmatter shape as `docs/pre-commit.md` and `docs/verification.md`). It is the runbook the liveness alert (slice 1, `.github/workflows/runner-liveness.yml`) points at when it opens the `runner-down` tracker issue.

Covers:
- The three self-hosted classes (`light` / `rust-cpu` / `lean-mem`) and which jobs pin each. A partial outage, where one class is down while the others stay up, is diagnosable without grepping every workflow file.
- The two diagnostic commands: `gh api …/runners` (best-effort; needs `administration:read`, which `GITHUB_TOKEN` cannot carry) and the queued-run-age query (authoritative; needs only `actions:read`). Both mirror what `runner-liveness.yml` probes on its 15-min cadence, so an operator running them by hand sees the same shape the tracker issue reports.
- What to do when the tracker issue is open, including noting any `main` push runs that queued and expired (merges that reached `main` without CI verification).

Also carries the REQ-051 reminder in the callout (hooks are convenience, CI is the gate; if the pool is down, treat merges to `main` with the same caution as a `--no-verify` commit) so a reader doesn't need to hunt the constraint down separately.

Direction picked (option (c) from the triage sketch)
Options 1 and 2 from the #509 body are separately tracked:
- Option 1 (liveness alert) already landed as `.github/workflows/runner-liveness.yml`. This runbook cross-references it as the source of truth for the tracker-issue lifecycle (auto-open on the first bad probe, auto-update every 15 min, auto-close on recovery).
- Option 2 (routing fast gates to `ubuntu-latest`) is a policy call with billing implications; kept out of scope per the issue body's own "these are choices for the maintainer" framing.

Blog process pre-check
Re-fetched https://pulseengine.eu/blog/ at the start of this run. No posts on team workflow / branching / PR flow / testing conventions beyond the spec-driven-development post already reflected here. Nothing overrides this PR.

Test plan
- `rivet validate` on the rivet repo: PASS, 497 warnings (unchanged from `main`; the new doc carries the standard `DOC-CI-RUNNERS` frontmatter so the docs scanner picks it up cleanly).
- `cargo fmt --all --check`: clean (docs-only change, no Rust touched).

Not this PR
- `runner-liveness.yml` itself: already shipped.
- Host-side work (`_work/` cleanup, `CARGO_HOME` layout): off-scope; #567 (CI: self-hosted runner disk exhaustion fails all compile gates; post-job cleanup hook frees nothing) tracks the shared-cargo hazard.
- Routing fast gates to `ubuntu-latest`: option 2 of #509 (CI has a single point of failure: when the self-hosted runner pool goes offline, every gate queues forever with no fallback and no liveness alert), separate policy call.

Commit trailer: `Trace: skip` (docs is an exempt type per CLAUDE.md's commit-traceability section), `Refs: #509, #567, #590, REQ-051`.

Generated by Claude Code, issue-triage agent run 2026-07-02.