
docs: runner pool operations runbook (#509) #662

Merged: avrabe merged 1 commit into main from docs/issue-509-runner-runbook on Jul 3, 2026


@avrabe avrabe commented Jul 2, 2026


Refs #509 (slice 3 of the three suggested fixes in the issue body).

What

New reference doc at docs/ci-runners.md, matching the top-level docs/*.md layout for shipped operational reference material (adopts the same frontmatter shape as docs/pre-commit.md and docs/verification.md). It is the runbook the liveness alert (slice 1, .github/workflows/runner-liveness.yml) points at when it opens the runner-down tracker issue.

Covers:

  • The three self-hosted classes (light / rust-cpu / lean-mem) and which jobs pin each. A partial outage — one class down while the others stay up — is diagnosable without grepping every workflow file.
  • The two diagnostic commands: gh api …/runners (best-effort; needs administration:read, which GITHUB_TOKEN cannot carry) and the queued-run-age query (authoritative; needs only actions:read). Both mirror what runner-liveness.yml probes on its 15-min cadence, so an operator running them by hand sees the same shape the tracker issue reports.
  • What to do when the tracker is open: confirm the shape, identify the affected class, bring the pool back, post the outage window as one comment before auto-close (that comment becomes the durable audit record after the tracker closes), and note any main push runs that queued and expired (merges that reached main without CI verification).
  • Host-side recovery steps: systemd unit paths and common failure shapes (disk-full → coordinate with #567's shared-cargo race, token-expired, label-mismatch). Off-scope for this repo but referenced so an on-call operator can act.
  • Escalation — > 2 hour outage, every-host-full disk, class-specific queueing.
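
As an illustration of what the two diagnostics report, here is a minimal Python sketch that interprets already-fetched JSON. It assumes the payloads follow the GitHub REST response shapes for listing workflow runs (`workflow_runs[].created_at`) and self-hosted runners (`runners[].status`, `runners[].labels[].name`). The function names, the 30-minute threshold, and the use of `created_at` as the start of the queue clock are assumptions for illustration, not values taken from the runbook or from runner-liveness.yml.

```python
from datetime import datetime, timedelta, timezone

# Class labels named in this PR.
RUNNER_CLASSES = ("light", "rust-cpu", "lean-mem")
# Hypothetical threshold; the real liveness workflow may use a different value.
STALL_THRESHOLD = timedelta(minutes=30)


def parse_ts(value: str) -> datetime:
    """Parse a UTC timestamp such as 2026-07-02T10:00:00Z."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def oldest_queued_age(runs_response: dict, now: datetime) -> timedelta:
    """Age of the oldest queued run, or zero when nothing is queued."""
    created = [
        parse_ts(run["created_at"])
        for run in runs_response.get("workflow_runs", [])
    ]
    if not created:
        return timedelta(0)
    return now - min(created)


def classes_without_online_runner(runners_response: dict) -> list[str]:
    """Runner classes for which no runner reports status 'online'."""
    online_labels = set()
    for runner in runners_response.get("runners", []):
        if runner.get("status") == "online":
            online_labels.update(
                label["name"] for label in runner.get("labels", [])
            )
    return [cls for cls in RUNNER_CLASSES if cls not in online_labels]
```

An operator could feed the parsed output of each command into these helpers: a queue age above the threshold suggests the pool is not picking up work, and the class list narrows a partial outage to the affected labels.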

Also carries the REQ-051 reminder in the callout (hooks are convenience, CI is the gate — if the pool is down, treat merges to main with the same caution as a --no-verify commit) so a reader doesn't need to hunt the constraint down separately.

Direction picked (option (c) from the triage sketch)

Options 1 and 2 from the #509 body are separately tracked:

  • Option 1 — liveness alert already landed as .github/workflows/runner-liveness.yml. This runbook cross-references it as the source of truth for the tracker-issue lifecycle (auto-open on the first bad probe, auto-update every 15 min, auto-close on recovery).
  • Option 2 — route fast gates to ubuntu-latest is a policy call with billing implications; kept out of scope per the issue body's own "these are choices for the maintainer" framing.
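
The tracker-issue lifecycle described under Option 1 (open on the first bad probe, update on each later bad probe, close on recovery) can be pictured as a small state machine. The Python sketch below is only a model of that description. The names `next_action` and `replay` are hypothetical, and the real logic lives in runner-liveness.yml, which may differ.

```python
def next_action(tracker_open: bool, probe_healthy: bool) -> str:
    """Decide what one probe result does to the tracker issue."""
    if not probe_healthy:
        # First bad probe opens the tracker; later bad probes refresh it.
        return "update" if tracker_open else "open"
    # A healthy probe closes an open tracker and is otherwise a no-op.
    return "close" if tracker_open else "none"


def replay(probes: list[bool]) -> list[str]:
    """Apply a sequence of probe results (True = healthy), one per cadence tick."""
    tracker_open = False
    actions = []
    for healthy in probes:
        action = next_action(tracker_open, healthy)
        if action == "open":
            tracker_open = True
        elif action == "close":
            tracker_open = False
        actions.append(action)
    return actions
```

For example, `replay([True, False, False, True])` yields `["none", "open", "update", "close"]`: one outage spanning two probes produces a single tracker issue that is opened, updated once, and then closed.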

Blog process pre-check

Re-fetched https://pulseengine.eu/blog/ at the start of this run. No posts on team workflow / branching / PR flow / testing conventions beyond the spec-driven-development post already reflected here. Nothing overrides this PR.

Test plan

  • rivet validate on the rivet repo — PASS, 497 warnings (unchanged from main; the new doc carries the standard DOC-CI-RUNNERS frontmatter so the docs scanner picks it up cleanly).
  • cargo fmt --all --check — clean (docs-only change, no Rust touched).
  • CI Format / YAML-lint / Docs Check gates green on this branch.
  • Reviewer smoke-test: read the runbook top-to-bottom on the assumption the pool is down right now — every step should be actionable from the two commands the doc provides.

Not this PR

Commit trailer: Trace: skip (docs is an exempt type per CLAUDE.md's commit-traceability section), Refs: #509, #567, #590, REQ-051.


Generated by Claude Code — issue-triage agent run 2026-07-02.



Slice 3 of #509 — the runbook the liveness alert (slice 1,
`runner-liveness.yml`) points at when it opens the `runner-down` tracker
issue. Documents:

- The three self-hosted classes (`light` / `rust-cpu` / `lean-mem`) and
  which jobs pin each — so a partial outage is diagnosable without
  reading every workflow file.
- The two diagnostic commands (`gh api …/runners` best-effort, queued-run
  age authoritative) with what each output shape means, matching the
  liveness workflow's own probe logic.
- What to do when the tracker issue is open: confirm shape, identify the
  affected class, bring the pool back, post the outage window as the
  durable audit record before auto-close, note any `main` runs that
  queued and expired (merges that reached `main` without CI).
- The host-side recovery steps (systemd unit paths, common failure
  shapes: disk-full, token-expired, label-mismatch) — off-scope for this
  repo but referenced so an operator can act.
- Escalation: > 2 hour outage, every-host-full disk (coordinate with
  #567's shared-cargo race), class-specific queueing.

Doesn't ship options 1 or 2 from #509. Option 1 (liveness alert) already
landed as `.github/workflows/runner-liveness.yml`; this runbook cross-
references it. Option 2 (route fast gates to GitHub-hosted) is a
policy call with billing implications, kept out of scope per the issue
body's own framing.

Front-matter uses the standard rivet doc-artifact shape
(`DOC-CI-RUNNERS`, `type: reference`, `status: current`). `rivet
validate` on this repo remains PASS with the same 497 warnings.

Trace: skip
Refs: #509, #567, #590, REQ-051

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Jj8RPdbnK66ZeYgRuCxAdW
@avrabe avrabe merged commit 8f9fac6 into main Jul 3, 2026
26 of 27 checks passed
@avrabe avrabe deleted the docs/issue-509-runner-runbook branch July 3, 2026 05:40