Commit 720bb84

heavygee and cursoragent committed
feat(runner): systemd resilience templates - Restart=always + watchdog timer
Ships drop-in + watchdog unit templates under cli/systemd/. Operator copies them once to /etc/systemd/system/; they update via the daily-driver rebuild like any other CLI code.

What's in the box:

* hapi-runner-resilience.conf - service drop-in:
  - Restart=always (replaces the base unit's Restart=on-failure, which did NOT recover from the runner's clean exits during mtime-driven self-restart, the proximate cause of today's 22:40 BST outage).
  - StartLimitBurst=10 + StartLimitIntervalSec=300 (5 min) so a genuinely broken runner can't pin CPU.
  - RestartPreventExitStatus=2 to skip restarting on fatal config errors.
  - Carries forward the prior drop-in's HAPI_DISABLE_VERSION_HANDOFF=1 env var and ExecStartPre runner-stop (both still needed; the env var is now ALSO mirrored as a settings.json key for env-leak safety).
* hapi-runner-watchdog.{service,timer,sh} - belt-and-braces probe:
  - Every 60s, queries the hub's /api/machines, restarts the unit if THIS host's machineId is missing OR runner.status != 'running'.
  - 30s grace window (runner.state.json mtime) so brief reconnects don't trigger spurious restarts.
  - Requires NOPASSWD sudo for systemctl restart hapi-runner.service (operator sets up /etc/sudoers.d/hapi-watchdog once).
* README.md - install + verify instructions, failure-mode matrix.

This is the "no-matter-what-path-fails" lever from the plan's friction-mode analysis: even if every CLI-level fix in this branch regresses someday, the watchdog brings the machine back within one tick.

Templates only; nothing installs itself. Soup operators run the README's install block once.

Co-authored-by: Cursor <cursoragent@cursor.com>
1 parent ec8e06a commit 720bb84

5 files changed

Lines changed: 361 additions & 0 deletions

cli/systemd/README.md

Lines changed: 114 additions & 0 deletions
@@ -0,0 +1,114 @@
1+
# cli/systemd/ - systemd integration templates
2+
3+
Operator-facing templates for running `hapi runner start-sync` under systemd
4+
on the daily-driver machine. Carried inside the CLI so the templates stay
5+
in lockstep with the runner code they assume.
6+
7+
These are templates: copy to `/etc/systemd/system/` and adjust paths/users
8+
to taste. Defaults assume the `heavygee/hapi` operator-fork layout (see the
9+
fork's operator setup guide for the full systemd context):
10+
11+
- Active tree symlink: `~/coding/hapi-active` (-> `~/coding/hapi-driver`)
12+
- Runner state: `~/.hapi/`
13+
- Bun: `~/.bun/bin/bun`
14+
- Service user: `heavygee`
15+
16+
## Files
17+
18+
| File | Destination | Purpose |
19+
|------|-------------|---------|
20+
| `hapi-runner-resilience.conf` | `/etc/systemd/system/hapi-runner.service.d/10-resilience.conf` | Drop-in: env var, `ExecStartPre` runner-stop, `Restart=always` + burst limits |
21+
| `hapi-runner-watchdog.service` | `/etc/systemd/system/hapi-runner-watchdog.service` | Oneshot probe that restarts the runner if the hub doesn't see it |
22+
| `hapi-runner-watchdog.timer` | `/etc/systemd/system/hapi-runner-watchdog.timer` | Schedule: 2 min after boot, then every 60s |
23+
| `hapi-runner-watchdog.sh` | (in-place, referenced by `.service`) | The probe itself |
24+
25+
## Base unit
26+
27+
The base `hapi-runner.service` lives outside this repo (it's
28+
operator-installed). A reference copy:
29+
30+
```ini
31+
# /etc/systemd/system/hapi-runner.service
32+
[Unit]
33+
Description=HAPI runner (remote session spawn + workspace browse)
34+
After=network-online.target hapi-hub.service
35+
Wants=network-online.target
36+
Requires=hapi-hub.service
37+
38+
[Service]
39+
Type=simple
40+
User=heavygee
41+
Environment=HAPI_API_URL=http://127.0.0.1:3006
42+
Environment=PATH=/home/heavygee/.local/bin:/home/heavygee/.npm-global/bin:/usr/local/bin:/usr/bin:/bin
43+
ExecStart=/home/heavygee/.local/bin/hapi-runner-from-active
44+
Restart=on-failure
45+
RestartSec=5
46+
47+
[Install]
48+
WantedBy=multi-user.target
49+
```
50+
51+
`Restart=on-failure` in the base is intentionally minimal so npm-installed
52+
HAPI users (who never run the soup rebuild path) don't get aggressive
53+
restart behavior. The resilience drop-in upgrades it to
54+
`Restart=always` for operators who own the supervision side.
55+
56+
## Install (operator)
57+
58+
```bash
59+
# 1. Drop-in (replaces the older 10-disable-version-handoff.conf if present)
60+
sudo mkdir -p /etc/systemd/system/hapi-runner.service.d/
61+
sudo cp ~/coding/hapi-active/cli/systemd/hapi-runner-resilience.conf \
62+
/etc/systemd/system/hapi-runner.service.d/10-resilience.conf
63+
sudo rm -f /etc/systemd/system/hapi-runner.service.d/10-disable-version-handoff.conf
64+
65+
# 2. Watchdog (optional but recommended)
66+
sudo cp ~/coding/hapi-active/cli/systemd/hapi-runner-watchdog.service \
67+
/etc/systemd/system/hapi-runner-watchdog.service
68+
sudo cp ~/coding/hapi-active/cli/systemd/hapi-runner-watchdog.timer \
69+
/etc/systemd/system/hapi-runner-watchdog.timer
70+
71+
# 3. Allow the watchdog to restart the runner without password.
72+
# Create /etc/sudoers.d/hapi-watchdog with:
73+
# heavygee ALL=(root) NOPASSWD: /bin/systemctl restart hapi-runner.service
74+
# (use `sudo visudo -f /etc/sudoers.d/hapi-watchdog` to edit safely; if systemctl lives elsewhere on your distro, use the path from `command -v systemctl`)
75+
76+
# 4. Persist the disable-version-handoff opt-out (belt-and-braces in case the
77+
# env var leaks). Edit ~/.hapi/settings.json to add:
78+
# "runnerDisableVersionHandoff": true
79+
tmp="$(mktemp ~/.hapi/settings.json.XXXXXX)" && \
80+
jq '.runnerDisableVersionHandoff = true' ~/.hapi/settings.json > "$tmp" && mv "$tmp" ~/.hapi/settings.json
81+
82+
# 5. Activate
83+
sudo systemctl daemon-reload
84+
sudo systemctl restart hapi-runner.service # picks up new drop-in
85+
sudo systemctl enable --now hapi-runner-watchdog.timer
86+
```
87+
88+
## Verify
89+
90+
```bash
91+
systemctl show hapi-runner.service -p Restart -p RestartSec -p StartLimitBurst -p StartLimitIntervalSec -p Environment
92+
# Expect: Restart=always, RestartSec=5, StartLimitBurst=10,
93+
# StartLimitIntervalSec=300000000 (microseconds), HAPI_DISABLE_VERSION_HANDOFF=1
94+
95+
systemctl list-timers hapi-runner-watchdog.timer
96+
# Expect: next run scheduled within the next 60s
97+
98+
journalctl -u hapi-runner-watchdog.service --since '5 min ago'
99+
# Expect: "machine <uuid> present + runner running on http://...; no action"
100+
```
101+
102+
## Why each piece exists
103+
104+
See the fork's runner-self-restart bluedeploy plan
105+
(`2026-05-31-runner-self-restart-bluedeploy-fix.md`) for the full
106+
2026-05-31 incident retrospective and the failure-mode-to-fix matrix.
107+
Short version:
108+
109+
| Failure mode | Mitigation |
110+
|--------------|------------|
111+
| Runner self-restarts mid-rebuild, exits 0, `Restart=on-failure` doesn't recover | `Restart=always` in resilience drop-in |
112+
| Terminal `hapi runner start-sync` launched without env var kills live runner | settings.json `runnerDisableVersionHandoff:true` (persisted opt-out) |
113+
| Two `start-sync` invocations race past the kill-old/start-new sequence | `runner.start.lock` in the CLI (cli/src/runner/run.ts) |
114+
| All of the above fail and the machine drops off the hub anyway | watchdog timer hits `/api/machines`, restarts unit |
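
The README's Verify step compares `systemctl show` output against expected values by eye. A minimal sketch of scripting that comparison follows; `check_props` is a hypothetical helper name, not part of this commit, and it only does exact `KEY=VALUE` line matching, so multi-value properties such as `Environment` are out of scope.

```bash
#!/usr/bin/env bash
# check_props (hypothetical helper): reads KEY=VALUE lines on stdin and
# verifies that every KEY=VALUE argument appears as an exact line.
check_props() {
  local actual want rc=0
  actual="$(cat)"
  for want in "$@"; do
    # -x: whole-line match, -F: fixed string, -q: quiet
    if ! grep -qxF -- "$want" <<<"$actual"; then
      echo "MISMATCH: expected $want" >&2
      rc=1
    fi
  done
  return "$rc"
}

# Usage (on a host with the drop-in installed):
#   systemctl show hapi-runner.service -p Restart -p StartLimitBurst \
#     | check_props Restart=always StartLimitBurst=10
```

The function returns non-zero on the first run where any expected pair is missing, which makes it usable as a post-install smoke test.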

cli/systemd/hapi-runner-resilience.conf

Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
1+
# Drop-in for hapi-runner.service: resilience hardening
2+
#
3+
# Place at: /etc/systemd/system/hapi-runner.service.d/10-resilience.conf
4+
# Then: sudo systemctl daemon-reload && sudo systemctl restart hapi-runner.service
5+
#
6+
# The base unit ships with Restart=on-failure, which does NOT recover when the
7+
# runner exits cleanly (process.exit(0)) - and the mtime-driven self-restart
8+
# path is precisely one such case. The 2026-05-31 22:40 BST incident knocked
9+
# the runner offline for 3 minutes because of this exact gap. This drop-in
10+
# converts the policy to Restart=always with sensible burst limiting:
11+
#
12+
# * Restart=always - bring the runner back regardless of exit code
13+
# (does not fire on `systemctl stop`; that's
14+
# explicitly excluded by systemd).
15+
# * RestartSec=5 - 5 second cooldown between restart attempts.
16+
# * StartLimitBurst=10 - allow 10 restarts before the unit gives up.
17+
# * StartLimitIntervalSec=300 - over a 5 minute window.
18+
# A genuinely broken runner won't pin CPU.
19+
# * RestartPreventExitStatus=2 - exit 2 is reserved for "fatal config error";
20+
# don't restart-loop on misconfiguration.
21+
#
22+
# Plus the env var + ExecStartPre that the
23+
# feat/runner-skip-version-handoff-flag + start-lock layers depend on:
24+
#
25+
# 1. HAPI_DISABLE_VERSION_HANDOFF=1
26+
# Skips the runner's mtime-driven self-restart watcher. Source mtimes
27+
# change during soup rebuilds and `hapi-use-driver` swings for reasons
28+
# unrelated to npm upgrades; without this the live runner would commit
29+
# suicide every rebuild. Requires
30+
# cli/src/runner/run.ts >= feat/runner-skip-version-handoff-flag.
31+
#
32+
# 2. ExecStartPre=runner stop
33+
# Belt-and-braces cleanup of any orphaned runner before ExecStart fires.
34+
# The start-lock layer makes this less critical (concurrent invocations
35+
# no longer race) but it remains useful for the case where a prior
36+
# runner detached from systemd's cgroup.
37+
#
38+
# See the fork's `runner-systemd` operator guide and
39+
# `2026-05-31-runner-self-restart-bluedeploy-fix` plan for the
40+
# full rationale.
41+
42+
[Service]
43+
Environment=HAPI_DISABLE_VERSION_HANDOFF=1
44+
45+
ExecStartPre=/bin/bash -lc '/home/heavygee/.bun/bin/bun run --cwd /home/heavygee/coding/hapi-active/cli /home/heavygee/coding/hapi-active/cli/src/index.ts runner stop || true'
46+
47+
Restart=always
48+
RestartSec=5
49+
RestartPreventExitStatus=2
50+
51+
# Burst limiting: at most 10 restarts inside a 5 minute window. These are
52+
# [Unit] settings since systemd 230; [Service] only accepts legacy spellings.
53+
[Unit]
54+
StartLimitBurst=10
55+
StartLimitIntervalSec=300
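
To make the difference between `Restart=on-failure` and `Restart=always` concrete, here is a simplified model of the restart decision for a process that exits on its own. It is a sketch, not systemd's actual logic: `would_restart` is a hypothetical name, and signals, timeouts, start-rate limiting, and `systemctl stop` are deliberately left out.

```bash
#!/usr/bin/env bash
# would_restart POLICY EXIT_CODE [PREVENT_STATUS]
# Prints "yes" or "no". Simplified model covering plain exit codes only.
would_restart() {
  local policy="$1" exit_code="$2" prevent="${3:-}"
  # RestartPreventExitStatus overrides whatever Restart= says.
  if [[ -n "$prevent" && "$exit_code" == "$prevent" ]]; then
    echo no
    return
  fi
  case "$policy" in
    always)     echo yes ;;
    on-failure) if (( exit_code != 0 )); then echo yes; else echo no; fi ;;
    *)          echo no ;;
  esac
}

would_restart on-failure 0    # clean exit under the base unit: no
would_restart always 0 2      # clean exit under the drop-in: yes
would_restart always 2 2      # fatal config error under the drop-in: no
```

The first call is the incident case: the runner's self-restart path exits 0, so the base unit's policy leaves it down.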

cli/systemd/hapi-runner-watchdog.service

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
1+
# Belt-and-braces watchdog for hapi-runner.service
2+
#
3+
# Place at: /etc/systemd/system/hapi-runner-watchdog.service
4+
# Enable via: sudo systemctl enable --now hapi-runner-watchdog.timer
5+
#
6+
# Runs the watchdog probe script which:
7+
# 1. Queries the local hub's /api/machines endpoint.
8+
# 2. If THIS host's machineId is missing OR runner.status != 'running',
9+
# restarts hapi-runner.service.
10+
#
11+
# This catches any path that drops the runner off the hub regardless of
12+
# the cause - even if every other code path we built fails, this brings
13+
# the machine back within ONE timer tick.
14+
#
15+
# This is intentionally a oneshot, not Type=simple. The timer schedules
16+
# it; systemd runs the probe to completion. ConditionPathExists keeps it idle
17+
# on hosts that haven't completed bootstrap.
18+
19+
[Unit]
20+
Description=HAPI runner liveness watchdog (restart service if machine drops off hub)
21+
After=hapi-runner.service
22+
ConditionPathExists=/home/heavygee/.hapi/settings.json
23+
24+
[Service]
25+
Type=oneshot
26+
User=heavygee
27+
EnvironmentFile=-/home/heavygee/.hapi/runner-watchdog.env
28+
ExecStart=/home/heavygee/coding/hapi-active/cli/systemd/hapi-runner-watchdog.sh
29+
# Output goes to journal for `journalctl -u hapi-runner-watchdog.service`
30+
StandardOutput=journal
31+
StandardError=journal
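
The health rule in the unit's header comment (machineId missing OR runner.status != 'running') can be tried against sample responses without a hub. The sketch below wraps the same jq filter the probe script uses in a function; `runner_healthy` is a hypothetical name and the sample JSON is made up for illustration. It assumes `jq` is installed.

```bash
#!/usr/bin/env bash
# runner_healthy MACHINE_ID: reads a machines response on stdin and prints
# "true" only if the machine is listed and its runner status is "running".
# Accepts either {"machines": [...]} or a top-level array.
runner_healthy() {
  local machine_id="$1"
  jq -r --arg mid "$machine_id" '
    (if type == "array" then . else (.machines // []) end)
    | map(select(.id == $mid))
    | if length == 0 then false
      else (.[0].runner.status // .[0].runnerState.status // "unknown") == "running"
      end
  '
}

# Usage with illustrative payloads:
#   echo '{"machines":[{"id":"m1","runner":{"status":"running"}}]}' | runner_healthy m1
#   echo '[{"id":"m1","runnerState":{"status":"stopped"}}]'          | runner_healthy m1
```

Feeding it a saved hub response is a quick way to check that the filter still matches the hub's current response shape before trusting the watchdog.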

cli/systemd/hapi-runner-watchdog.sh

Lines changed: 136 additions & 0 deletions
@@ -0,0 +1,136 @@
1+
#!/usr/bin/env bash
2+
# hapi-runner-watchdog.sh
3+
#
4+
# Belt-and-braces watchdog: if THIS host's machine entry is missing from the
5+
# hub's /api/machines list OR its runner.status is not "running", restart the
6+
# hapi-runner.service systemd unit.
7+
#
8+
# Run by hapi-runner-watchdog.service (oneshot) on the schedule defined by
9+
# hapi-runner-watchdog.timer. Logs to journald.
10+
#
11+
# Configuration via env vars (optionally loaded from
12+
# /home/heavygee/.hapi/runner-watchdog.env by the service unit):
13+
#
14+
# HAPI_API_URL (default: http://127.0.0.1:3006)
15+
# CLI_API_TOKEN (required - Bearer token for /api/machines)
16+
# If missing, falls back to settings.json's cliApiToken.
17+
# HAPI_HOME (default: ~/.hapi) - location of settings.json + runner.state.json
18+
# HAPI_WATCHDOG_DRY_RUN (default: empty) - if "1", log decision but don't actually restart
19+
# HAPI_WATCHDOG_GRACE_SEC (default: 30) - how recently the runner must have
20+
# heartbeated to be considered alive even if absent
21+
# from the hub list (handles brief reconnect windows).
22+
#
23+
# Exit codes:
24+
# 0 - runner healthy OR successfully restarted
25+
# 1 - probe failed AND restart attempt also failed
26+
# 2 - misconfigured (missing token, etc.). The watchdog unit is a
27+
# timer-driven oneshot, so it just fails again on the next tick.
28+
29+
set -euo pipefail
30+
31+
log() {
32+
printf '%s [watchdog] %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*"
33+
}
34+
35+
HAPI_API_URL="${HAPI_API_URL:-http://127.0.0.1:3006}"
36+
HAPI_HOME="${HAPI_HOME:-$HOME/.hapi}"
37+
GRACE_SEC="${HAPI_WATCHDOG_GRACE_SEC:-30}"
38+
SETTINGS_FILE="$HAPI_HOME/settings.json"
39+
STATE_FILE="$HAPI_HOME/runner.state.json"
40+
41+
if [[ ! -f "$SETTINGS_FILE" ]]; then
42+
log "settings.json missing at $SETTINGS_FILE; nothing to do"
43+
exit 0
44+
fi
45+
46+
# Token resolution: env wins, then settings.json.
47+
TOKEN="${CLI_API_TOKEN:-}"
48+
if [[ -z "$TOKEN" ]] && command -v jq >/dev/null; then
49+
TOKEN="$(jq -r '.cliApiToken // empty' "$SETTINGS_FILE" 2>/dev/null || true)"
50+
fi
51+
if [[ -z "$TOKEN" ]]; then
52+
log "ERROR: no CLI_API_TOKEN in env or settings.json (cliApiToken)"
53+
exit 2
54+
fi
55+
56+
MACHINE_ID=""
57+
if command -v jq >/dev/null; then
58+
MACHINE_ID="$(jq -r '.machineId // empty' "$SETTINGS_FILE" 2>/dev/null || true)"
59+
fi
60+
if [[ -z "$MACHINE_ID" ]]; then
61+
log "ERROR: no machineId in settings.json"
62+
exit 2
63+
fi
64+
65+
# Probe the hub. Capture the status code with -w (not --fail) so it can be logged.
66+
TMP_RESP="$(mktemp)"
67+
trap 'rm -f "$TMP_RESP"' EXIT
68+
69+
HTTP_CODE="$(curl -sS \
70+
-o "$TMP_RESP" \
71+
-w '%{http_code}' \
72+
-H "Authorization: Bearer $TOKEN" \
73+
--max-time 5 \
74+
"$HAPI_API_URL/cli/machines/" 2>/dev/null || echo 'curl_fail')"
75+
76+
if [[ "$HTTP_CODE" != "200" ]]; then
77+
log "hub probe failed (HTTP $HTTP_CODE at $HAPI_API_URL/cli/machines/); not restarting on a failed probe"
78+
exit 0
79+
fi
80+
81+
# Look up THIS machine in the response. The hub returns
82+
# { "machines": [ { "id": "...", "runner": { "status": "running", ... }, ... } ] }
83+
# but the shape has shifted historically; accept either { "machines": [] } or top-level array.
84+
HEALTHY="false"
85+
if command -v jq >/dev/null; then
86+
HEALTHY="$(jq --arg mid "$MACHINE_ID" '
87+
(if type == "array" then . else (.machines // []) end)
88+
| map(select(.id == $mid))
89+
| if length == 0 then false
90+
else (.[0].runner.status // .[0].runnerState.status // "unknown") == "running"
91+
end
92+
| tostring
93+
' "$TMP_RESP" 2>/dev/null || echo 'parse_fail')"
94+
fi
95+
96+
if [[ "$HEALTHY" == "true" ]]; then
97+
log "machine $MACHINE_ID present + runner running on $HAPI_API_URL; no action"
98+
exit 0
99+
fi
100+
101+
# Before restarting, give the runner a grace window in case it's mid-reconnect.
102+
if [[ -f "$STATE_FILE" ]] && command -v jq >/dev/null; then
103+
LAST_HEARTBEAT="$(jq -r '.lastHeartbeat // empty' "$STATE_FILE" 2>/dev/null || true)"
104+
if [[ -n "$LAST_HEARTBEAT" ]]; then
105+
# lastHeartbeat is JS toLocaleString format, not ISO; can't trivially parse
106+
# to epoch in pure bash. Use file mtime as a rough proxy.
107+
LAST_MTIME="$(stat -c '%Y' "$STATE_FILE" 2>/dev/null || echo 0)"
108+
NOW="$(date +%s)"
109+
AGE=$((NOW - LAST_MTIME))
110+
if (( AGE < GRACE_SEC )); then
111+
log "machine $MACHINE_ID absent or runner not running, but state file touched ${AGE}s ago (< grace ${GRACE_SEC}s); skipping restart"
112+
exit 0
113+
fi
114+
log "machine $MACHINE_ID absent or runner not running; heartbeat ${AGE}s stale"
115+
fi
116+
fi
117+
118+
if [[ "${HAPI_WATCHDOG_DRY_RUN:-}" == "1" ]]; then
119+
log "DRY_RUN: would run 'sudo systemctl restart hapi-runner.service'"
120+
exit 0
121+
fi
122+
123+
log "restarting hapi-runner.service via systemctl"
124+
if systemctl restart hapi-runner.service 2>/dev/null; then
125+
log "restart OK"
126+
exit 0
127+
fi
128+
129+
# Fallback: try sudo (the unit user is heavygee but may have NOPASSWD for this one)
130+
if sudo -n systemctl restart hapi-runner.service 2>/dev/null; then
131+
log "restart OK (via sudo)"
132+
exit 0
133+
fi
134+
135+
log "ERROR: could not restart hapi-runner.service (permission?)"
136+
exit 1
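
The order of checks in the script, once the hub probe has succeeded, can be summarised as a pure function. This is a sketch of the control flow only: `watchdog_decision` is a hypothetical name, and an empty state age stands for "no usable state file", in which case the grace check is skipped just as in the script.

```bash
#!/usr/bin/env bash
# watchdog_decision HEALTHY STATE_AGE_SEC [GRACE_SEC] [DRY_RUN]
# Mirrors the script's order: healthy check, grace window, dry run, restart.
watchdog_decision() {
  local healthy="$1" state_age="$2" grace="${3:-30}" dry_run="${4:-}"
  if [[ "$healthy" == "true" ]]; then
    echo "no-action"
  elif [[ -n "$state_age" ]] && (( state_age < grace )); then
    echo "skip-grace"      # state file touched recently; assume mid-reconnect
  elif [[ "$dry_run" == "1" ]]; then
    echo "dry-run"         # log the decision, do not restart
  else
    echo "restart"
  fi
}

watchdog_decision true 999        # no-action
watchdog_decision false 10        # skip-grace
watchdog_decision false 45        # restart
watchdog_decision false 45 30 1   # dry-run
```

A failed hub probe never reaches this point: the script exits 0 without restarting, so an unreachable hub alone does not bounce the runner.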

cli/systemd/hapi-runner-watchdog.timer

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
1+
# Schedule for hapi-runner-watchdog.service
2+
#
3+
# Place at: /etc/systemd/system/hapi-runner-watchdog.timer
4+
# Enable: sudo systemctl enable --now hapi-runner-watchdog.timer
5+
# Inspect: systemctl list-timers hapi-runner-watchdog.timer
6+
#
7+
# Fires:
8+
# * OnBootSec=120s - wait 2 min after boot so hapi-hub + runner have
9+
# had a chance to come up before we probe.
10+
# * OnUnitActiveSec=60s - then every 60 seconds.
11+
#
12+
# The probe is cheap (one HTTP GET); 60s is the right tradeoff between
13+
# "operator notices missing machine" and "filling the journal with noise".
14+
15+
[Unit]
16+
Description=HAPI runner liveness watchdog timer
17+
18+
[Timer]
19+
OnBootSec=120
20+
OnUnitActiveSec=60
21+
AccuracySec=5
22+
Unit=hapi-runner-watchdog.service
23+
24+
[Install]
25+
WantedBy=timers.target
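
As a rough back-of-envelope check on how long a dead runner can stay down with these timer values, the sketch below adds up the pieces. It rests on assumptions that the commit does not state: the hub reports the runner as gone immediately, `runner.state.json` stops being touched when the runner dies, the probe's own run time is negligible, and the grace window is shorter than the tick interval. `watchdog_latency_bound` is a hypothetical name.

```bash
#!/usr/bin/env bash
# watchdog_latency_bound INTERVAL_SEC GRACE_SEC ACCURACY_SEC
# Upper bound, in seconds, on time from runner death to restart attempt.
watchdog_latency_bound() {
  local interval="$1" grace="$2" accuracy="$3"
  # A death landing less than GRACE_SEC before a tick is skipped once by the
  # grace check, then caught one interval later; AccuracySec may delay that
  # tick slightly further.
  echo $(( interval + grace + accuracy ))
}

watchdog_latency_bound 60 30 5   # prints 95
```

Under those assumptions, a death that falls outside the grace window is caught on the very next tick (about 60s), and one that falls inside it waits one extra tick.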
