
ci: runner-liveness alert for the self-hosted pool (#509 slice 1) #536

Merged: avrabe merged 1 commit into main from ci/509-runner-liveness-alert on Jun 18, 2026
@avrabe avrabe commented Jun 14, 2026

What

Every gating CI job runs on [self-hosted, …], so when the pool goes offline every gate queues forever with no fallback and no alarm — the multi-day outage in #509 was invisible until someone noticed by hand. This adds a GitHub-hosted (ubuntu-latest) liveness probe that keeps firing even when the self-hosted pool is down, and turns that silent failure into a durable tracking issue rather than a transient red badge.

How

.github/workflows/runner-liveness.yml (schedule: */15 * * * * + workflow_dispatch):

  1. Queued-run age (authoritative). Fails if any run has been queued longer than QUEUE_THRESHOLD_MINUTES (default 30). Needs only actions: read, and works regardless of whether runners are registered at the repo or org level — it measures the actual symptom (jobs not getting picked up).
  2. Runner list (best-effort). GET /repos/{repo}/actions/runners needs the administration scope, which is not grantable to GITHUB_TOKEN (actionlint flagged this), so the lookup self-skips on 403/empty instead of false-alarming. Wire a PAT into GH_TOKEN later if a hard runner count is wanted.
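The two signals above can be sketched in Python. This is a minimal illustration, not the workflow's actual script: the function names are hypothetical, run age is assumed to be measured from each run's `created_at` timestamp, and treating "every listed runner offline" as a problem is likewise an assumption. The `status` and `created_at` fields follow the shape of the GitHub REST API's workflow-run and runner objects.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_THRESHOLD_MINUTES = 30  # mirrors the QUEUE_THRESHOLD_MINUTES default


def parse_ts(value):
    """Parse a UTC timestamp such as '2026-06-14T10:00:00Z'."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def stale_queued_runs(runs, now, threshold_minutes=DEFAULT_THRESHOLD_MINUTES):
    """Signal 1 (authoritative): runs still queued for longer than the threshold."""
    cutoff = timedelta(minutes=threshold_minutes)
    return [
        run for run in runs
        if run["status"] == "queued" and now - parse_ts(run["created_at"]) > cutoff
    ]


def runner_probe(http_status, runners):
    """Signal 2 (best-effort): classify the runner-list lookup.

    Returns 'skip' when the lookup was denied (e.g. 403) or came back empty,
    so a missing permission never raises an alarm by itself.
    """
    if http_status != 200 or not runners:
        return "skip"
    online = [r for r in runners if r.get("status") == "online"]
    # Assumption: an all-offline list counts as a problem.
    return "ok" if online else "down"
```

Keeping the queue-age check as the only signal that can fail without extra permissions means the probe still works when the runner list is unreadable.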

On a problem it opens or updates a single runner-down-labelled tracking issue (idempotent — comment-updates the existing one, never duplicates); on recovery it comments and auto-closes. The run itself also goes red so there's a badge signal too.
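The open/update/close behaviour amounts to a small decision table. The sketch below is illustrative only; `plan_issue_action` and its return values are hypothetical names, not taken from the workflow.

```python
def plan_issue_action(problem_detected, tracker_open):
    """Decide what to do with the single `runner-down` tracking issue.

    problem_detected: True if this probe run found a liveness problem.
    tracker_open:     True if an open `runner-down` issue already exists.
    """
    if problem_detected:
        # Never open a second tracker: update the existing one with a comment.
        return "comment" if tracker_open else "open"
    # Healthy pool: close a leftover tracker, otherwise do nothing.
    return "comment_and_close" if tracker_open else "noop"
```

Because the action depends only on the current probe result and whether a tracker exists, repeated runs in the same state cannot create duplicate issues.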

Verification

  • actionlint clean (includes shellcheck on the run: blocks).
  • Injection-safe: dynamic content flows through env: vars; triggers are schedule/workflow_dispatch with no untrusted event payload.
  • Scheduled workflows only run from the default branch, so I'll smoke-test via workflow_dispatch once this is on main and record the result here.
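Seen from the script's side, the env-var pattern in the second bullet means configuration arrives as data to be validated, never as text spliced into a command. A minimal sketch, assuming the threshold is read from `QUEUE_THRESHOLD_MINUTES`; the helper name `read_threshold_minutes` is hypothetical:

```python
import os


def read_threshold_minutes(env=None, default=30):
    """Read QUEUE_THRESHOLD_MINUTES from the environment as a positive integer."""
    env = os.environ if env is None else env
    raw = env.get("QUEUE_THRESHOLD_MINUTES", "").strip()
    if not raw:
        return default  # unset or blank: fall back to the documented default
    if not raw.isdecimal() or int(raw) == 0:
        # Reject anything that is not a plain positive number.
        raise ValueError(f"QUEUE_THRESHOLD_MINUTES must be a positive integer, got {raw!r}")
    return int(raw)
```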

Out of scope (separate PRs, per the issue)

  • Routing fast core gates (fmt/yaml-lint/validate/clippy) to ubuntu-latest — runner-policy + billing implications.
  • The operational runbook docs.

Refs #509, #436. Trace: skip (ci type, AGENTS.md exempt).

🤖 Generated with Claude Code

codecov Bot commented Jun 14, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


Every gating job runs on `[self-hosted, …]`, so when the pool goes offline every
gate queues forever with no fallback and no alarm — the multi-day outage in #509
was invisible until noticed by hand. This GitHub-hosted workflow (ubuntu-latest,
so it fires even when the pool is down) polls on a 15-min schedule + dispatch and
raises a durable tracking issue instead of a transient red badge.

Signals: (1) queued-run age > QUEUE_THRESHOLD_MINUTES (default 30) is the
authoritative alarm — needs only actions:read and is agnostic to repo-vs-org
runner registration; (2) the runner-list check is best-effort and self-skips,
since listing self-hosted runners needs the `administration` scope that
GITHUB_TOKEN cannot be granted. On a problem it opens or updates an idempotent
`runner-down`-labelled issue (one tracker, comment-updated); on recovery it
comments and auto-closes. Validated with actionlint (incl. shellcheck on the
run blocks). Smoke-test via workflow_dispatch after merge.

Out of scope (separate PRs): routing fast core gates to ubuntu-latest (runner
policy + billing); the operational runbook.

Trace: skip
Refs: #509, #436
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@avrabe avrabe force-pushed the ci/509-runner-liveness-alert branch from 7360bb8 to 5985218 Compare June 18, 2026 05:05

@github-actions github-actions Bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Rivet Criterion Benchmarks'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.20.

| Benchmark suite | Current: 5985218 | Previous: 55339e1 | Ratio |
|---|---|---|---|
| link_graph_build/10000 | 34369263 ns/iter (± 2380276) | 24402470 ns/iter (± 1046285) | 1.41 |

This comment was automatically generated by workflow using github-action-benchmark.
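The reported ratio can be reproduced from the two timings in the alert. This is a quick arithmetic check, assuming the ratio is simply current time divided by previous time; the variable names are illustrative.

```python
ALERT_THRESHOLD = 1.20        # threshold quoted in the alert

current_ns = 34_369_263       # link_graph_build/10000 at 5985218
previous_ns = 24_402_470      # link_graph_build/10000 at 55339e1

ratio = current_ns / previous_ns
exceeds = ratio > ALERT_THRESHOLD

print(f"ratio={ratio:.2f} exceeds_threshold={exceeds}")
```

The quotient rounds to the 1.41 shown in the alert, which is above the 1.20 threshold.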

@avrabe avrabe merged commit a32a02b into main Jun 18, 2026
26 checks passed
@avrabe avrabe deleted the ci/509-runner-liveness-alert branch June 18, 2026 05:39