feat(runner): systemd resilience templates - Restart=always + watchdog timer

heavygee · cursoragent · heavygee · commit 720bb8476cf4 · 2026-05-31T23:25:07.000+01:00
Ships drop-in + watchdog unit templates under cli/systemd/. The operator
copies the unit files once to /etc/systemd/system/; the probe script is
referenced in place, so it updates via the daily-driver rebuild like any
other CLI code.

What's in the box:

  * hapi-runner-resilience.conf - service drop-in:
    - Restart=always (replaces the base unit's Restart=on-failure which
      did NOT recover from the runner's clean exits during mtime-driven
      self-restart, the proximate cause of today's 22:40 BST outage).
    - StartLimitBurst=10 + StartLimitIntervalSec=300 (5 min) so a
      genuinely broken runner can't pin CPU.
    - RestartPreventExitStatus=2 to skip restarting on fatal config errors.
    - Carries forward the prior drop-in's HAPI_DISABLE_VERSION_HANDOFF=1
      env var and ExecStartPre runner-stop (both still needed; the env
      var is now ALSO mirrored as a settings.json key for env-leak safety).

  * hapi-runner-watchdog.{service,timer,sh} - belt-and-braces probe:
    - Every 60s, queries the hub's /api/machines, restarts the unit if
      THIS host's machineId is missing OR runner.status != 'running'.
    - 30s grace window (runner.state.json mtime) so brief reconnects
      don't trigger spurious restarts.
    - Requires NOPASSWD sudo for systemctl restart hapi-runner.service
      (operator sets up /etc/sudoers.d/hapi-watchdog once).

  * README.md - install + verify instructions, failure-mode matrix.
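
To make the drop-in's burst limits concrete, here is a rough sketch of the
arithmetic. It assumes the worst case (the runner exits immediately on every
start) and that systemd refuses the start after StartLimitBurst starts fall
inside one StartLimitIntervalSec window. The function name is hypothetical
and not part of this commit.

```shell
#!/usr/bin/env bash
set -euo pipefail

# seconds_until_give_up <burst> <interval_sec> <restart_sec>
# Hypothetical helper: worst-case estimate, assuming every start exits at once.
# Starts happen at t=0, r, 2r, ...; the (burst+1)th attempt lands at burst*r.
# Prints that time if it falls inside the window, else "never".
seconds_until_give_up() {
    local burst="$1" interval="$2" restart_sec="$3"
    local elapsed=$(( burst * restart_sec ))
    if (( elapsed < interval )); then
        echo "$elapsed"
    else
        echo never
    fi
}

# Values from the drop-in: StartLimitBurst=10, StartLimitIntervalSec=300, RestartSec=5
seconds_until_give_up 10 300 5    # prints 50
```

Under these assumptions a crash-looping runner is parked in the failed state
after roughly 50 seconds, and stays there until something (the operator or
the watchdog) restarts the unit.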

This is the "no-matter-what-path-fails" lever from the plan's friction-mode
analysis: even if every CLI-level fix in this branch regresses someday,
the watchdog brings the machine back within one tick.
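
The watchdog's decision order described above can be sketched as a small
pure function. This is a simplified illustration, not the shipped script:
the function name and its three arguments are hypothetical, and the caller
is assumed to pass a large age when the state file is missing.

```shell
#!/usr/bin/env bash
set -euo pipefail

# decide <healthy:true|false> <state_age_sec> <grace_sec>
# Hypothetical helper. Prints one of: ok | skip | restart
decide() {
    local healthy="$1" age="$2" grace="$3"
    if [[ "$healthy" == "true" ]]; then
        echo ok          # hub lists this machine and its runner is running
    elif (( age < grace )); then
        echo skip        # state file touched recently: assume mid-reconnect
    else
        echo restart     # absent or not running, and no recent sign of life
    fi
}

decide true 999 30     # prints ok
decide false 12 30     # prints skip
decide false 95 30     # prints restart
```

Because a skipped tick is re-evaluated on the next timer fire, a runner that
stays absent past the grace window is restarted one tick later rather than
immediately.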

Templates only; nothing installs itself. Soup operators run the README's
install block once.

Co-authored-by: Cursor <cursoragent@cursor.com>
diff --git a/cli/systemd/README.md b/cli/systemd/README.md
@@ -0,0 +1,114 @@
+# cli/systemd/ - systemd integration templates
+
+Operator-facing templates for running `hapi runner start-sync` under systemd
+on the daily-driver machine. Carried inside the CLI so the templates stay
+in lockstep with the runner code they assume.
+
+These are templates: copy to `/etc/systemd/system/` and adjust paths/users
+to taste. Defaults assume the `heavygee/hapi` operator-fork layout (see the
+fork's operator setup guide for the full systemd context):
+
+- Active tree symlink: `~/coding/hapi-active` (-> `~/coding/hapi-driver`)
+- Runner state: `~/.hapi/`
+- Bun: `~/.bun/bin/bun`
+- Service user: `heavygee`
+
+## Files
+
+| File | Destination | Purpose |
+|------|-------------|---------|
+| `hapi-runner-resilience.conf` | `/etc/systemd/system/hapi-runner.service.d/10-resilience.conf` | Drop-in: env var, `ExecStartPre` runner-stop, `Restart=always` + burst limits |
+| `hapi-runner-watchdog.service` | `/etc/systemd/system/hapi-runner-watchdog.service` | Oneshot probe that restarts the runner if the hub doesn't see it |
+| `hapi-runner-watchdog.timer` | `/etc/systemd/system/hapi-runner-watchdog.timer` | Schedule: 2 min after boot, then every 60s |
+| `hapi-runner-watchdog.sh` | (in-place, referenced by `.service`) | The probe itself |
+
+## Base unit
+
+The base `hapi-runner.service` lives outside this repo (it's
+operator-installed). A reference copy:
+
+```ini
+# /etc/systemd/system/hapi-runner.service
+[Unit]
+Description=HAPI runner (remote session spawn + workspace browse)
+After=network-online.target hapi-hub.service
+Wants=network-online.target
+Requires=hapi-hub.service
+
+[Service]
+Type=simple
+User=heavygee
+Environment=HAPI_API_URL=http://127.0.0.1:3006
+Environment=PATH=/home/heavygee/.local/bin:/home/heavygee/.npm-global/bin:/usr/local/bin:/usr/bin:/bin
+ExecStart=/home/heavygee/.local/bin/hapi-runner-from-active
+Restart=on-failure
+RestartSec=5
+
+[Install]
+WantedBy=multi-user.target
+```
+
+`Restart=on-failure` in the base is intentionally minimal so npm-installed
+HAPI users (who never run the soup rebuild path) don't get aggressive
+restart behavior. The resilience drop-in upgrades it to
+`Restart=always` for operators who own the supervision side.
+
+## Install (operator)
+
+```bash
+# 1. Drop-in (replaces the older 10-disable-version-handoff.conf if present)
+sudo mkdir -p /etc/systemd/system/hapi-runner.service.d/
+sudo cp ~/coding/hapi-active/cli/systemd/hapi-runner-resilience.conf \
+       /etc/systemd/system/hapi-runner.service.d/10-resilience.conf
+sudo rm -f /etc/systemd/system/hapi-runner.service.d/10-disable-version-handoff.conf
+
+# 2. Watchdog (optional but recommended)
+sudo cp ~/coding/hapi-active/cli/systemd/hapi-runner-watchdog.service \
+       /etc/systemd/system/hapi-runner-watchdog.service
+sudo cp ~/coding/hapi-active/cli/systemd/hapi-runner-watchdog.timer \
+       /etc/systemd/system/hapi-runner-watchdog.timer
+
+# 3. Allow the watchdog to restart the runner without a password.
+#    Create /etc/sudoers.d/hapi-watchdog with:
+#       heavygee ALL=(root) NOPASSWD: /bin/systemctl restart hapi-runner.service
+#    (use `sudo visudo -f /etc/sudoers.d/hapi-watchdog` to edit safely)
+
+# 4. Persist the disable-version-handoff opt-out (belt-and-braces in case the
+#    env var leaks). Edit ~/.hapi/settings.json to add:
+#       "runnerDisableVersionHandoff": true
+tmp="$(mktemp ~/.hapi/settings.XXXXXX)" && jq '.runnerDisableVersionHandoff = true' \
+    ~/.hapi/settings.json > "$tmp" && mv "$tmp" ~/.hapi/settings.json
+
+# 5. Activate
+sudo systemctl daemon-reload
+sudo systemctl restart hapi-runner.service        # picks up new drop-in
+sudo systemctl enable --now hapi-runner-watchdog.timer
+```
+
+## Verify
+
+```bash
+systemctl show hapi-runner.service -p Restart -p RestartUSec -p StartLimitBurst -p StartLimitIntervalUSec -p Environment
+# Expect: Restart=always, RestartUSec=5s, StartLimitBurst=10,
+#         StartLimitIntervalUSec=5min, Environment including HAPI_DISABLE_VERSION_HANDOFF=1
+
+systemctl list-timers hapi-runner-watchdog.timer
+# Expect: next fire in 60s window
+
+journalctl -u hapi-runner-watchdog.service --since '5 min ago'
+# Expect: "machine <uuid> present + runner running on http://...; no action"
+```
+
+## Why each piece exists
+
+See the fork's runner-self-restart bluedeploy plan
+(`2026-05-31-runner-self-restart-bluedeploy-fix.md`) for the full
+2026-05-31 incident retrospective and the failure-mode-to-fix matrix.
+Short version:
+
+| Failure mode | Mitigation |
+|--------------|------------|
+| Runner self-restarts mid-rebuild, exits 0, `Restart=on-failure` doesn't recover | `Restart=always` in resilience drop-in |
+| Terminal `hapi runner start-sync` launched without env var kills live runner | settings.json `runnerDisableVersionHandoff:true` (persisted opt-out) |
+| Two `start-sync` invocations race past the kill-old/start-new sequence | `runner.start.lock` in the CLI (cli/src/runner/run.ts) |
+| All of the above fail and the machine drops off the hub anyway | watchdog timer hits `/api/machines`, restarts unit |
diff --git a/cli/systemd/hapi-runner-resilience.conf b/cli/systemd/hapi-runner-resilience.conf
@@ -0,0 +1,55 @@
+# Drop-in for hapi-runner.service: resilience hardening
+#
+# Place at: /etc/systemd/system/hapi-runner.service.d/10-resilience.conf
+# Then:    sudo systemctl daemon-reload && sudo systemctl restart hapi-runner.service
+#
+# The base unit ships with Restart=on-failure, which does NOT recover when the
+# runner exits cleanly (process.exit(0)) - and the mtime-driven self-restart
+# path is precisely one such case. The 2026-05-31 22:40 BST incident knocked
+# the runner offline for 3 minutes because of this exact gap. This drop-in
+# converts the policy to Restart=always with sensible burst limiting:
+#
+#   * Restart=always       - bring the runner back regardless of exit code
+#                            (does not fire on `systemctl stop`; that's
+#                            explicitly excluded by systemd).
+#   * RestartSec=5         - 5 second cooldown between restart attempts.
+#   * StartLimitBurst=10   - allow 10 restarts before the unit gives up.
+#   * StartLimitIntervalSec=300 - over a 5 minute window.
+#                            A genuinely broken runner won't pin CPU.
+#   * RestartPreventExitStatus=2 - exit 2 is reserved for "fatal config error";
+#                            don't restart-loop on misconfiguration.
+#
+# Plus the env var + ExecStartPre that the
+# feat/runner-skip-version-handoff-flag + start-lock layers depend on:
+#
+#   1. HAPI_DISABLE_VERSION_HANDOFF=1
+#      Skips the runner's mtime-driven self-restart watcher. Source mtimes
+#      change during soup rebuilds and `hapi-use-driver` swings for reasons
+#      unrelated to npm upgrades; without this the live runner would commit
+#      suicide every rebuild. Requires
+#      cli/src/runner/run.ts >= feat/runner-skip-version-handoff-flag.
+#
+#   2. ExecStartPre=runner stop
+#      Belt-and-braces cleanup of any orphaned runner before ExecStart fires.
+#      The start-lock layer makes this less critical (concurrent invocations
+#      no longer race) but it remains useful for the case where a prior
+#      runner detached from systemd's cgroup.
+#
+# See the fork's `runner-systemd` operator guide and
+# `2026-05-31-runner-self-restart-bluedeploy-fix` plan for the
+# full rationale.
+
+[Service]
+Environment=HAPI_DISABLE_VERSION_HANDOFF=1
+
+ExecStartPre=/bin/bash -lc '/home/heavygee/.bun/bin/bun run --cwd /home/heavygee/coding/hapi-active/cli /home/heavygee/coding/hapi-active/cli/src/index.ts runner stop || true'
+
+Restart=always
+RestartSec=5
+RestartPreventExitStatus=2
+
+# Burst limiting: at most 10 restarts inside a 5 minute window. These are
+# [Unit] directives since systemd 230; [Service] ignores StartLimitIntervalSec.
+[Unit]
+StartLimitBurst=10
+StartLimitIntervalSec=300
diff --git a/cli/systemd/hapi-runner-watchdog.service b/cli/systemd/hapi-runner-watchdog.service
@@ -0,0 +1,31 @@
+# Belt-and-braces watchdog for hapi-runner.service
+#
+# Place at: /etc/systemd/system/hapi-runner-watchdog.service
+# Enable via: sudo systemctl enable --now hapi-runner-watchdog.timer
+#
+# Runs the watchdog probe script which:
+#   1. Queries the local hub's /api/machines endpoint.
+#   2. If THIS host's machineId is missing OR runner.status != 'running',
+#      restarts hapi-runner.service.
+#
+# This catches any path that drops the runner off the hub regardless of
+# the cause - even if every other code path we built fails, this brings
+# the machine back within ONE timer tick.
+#
+# This is intentionally a oneshot, not Type=simple. The timer schedules
+# it; systemd runs it to completion on each tick. ConditionPathExists
+# keeps it idle on hosts that haven't completed bootstrap.
+
+[Unit]
+Description=HAPI runner liveness watchdog (restart service if machine drops off hub)
+After=hapi-runner.service
+ConditionPathExists=/home/heavygee/.hapi/settings.json
+
+[Service]
+Type=oneshot
+User=heavygee
+EnvironmentFile=-/home/heavygee/.hapi/runner-watchdog.env
+ExecStart=/home/heavygee/coding/hapi-active/cli/systemd/hapi-runner-watchdog.sh
+# Output goes to journal for `journalctl -u hapi-runner-watchdog.service`
+StandardOutput=journal
+StandardError=journal
diff --git a/cli/systemd/hapi-runner-watchdog.sh b/cli/systemd/hapi-runner-watchdog.sh
@@ -0,0 +1,136 @@
+#!/usr/bin/env bash
+# hapi-runner-watchdog.sh
+#
+# Belt-and-braces watchdog: if THIS host's machine entry is missing from the
+# hub's /api/machines list OR its runner.status is not "running", restart the
+# hapi-runner.service systemd unit.
+#
+# Run by hapi-runner-watchdog.service (oneshot) on the schedule defined by
+# hapi-runner-watchdog.timer. Logs to journald.
+#
+# Configuration via env vars (optionally loaded from
+# /home/heavygee/.hapi/runner-watchdog.env by the service unit):
+#
+#   HAPI_API_URL          (default: http://127.0.0.1:3006)
+#   CLI_API_TOKEN         (Bearer token for the hub machines probe; if unset,
+#                         falls back to settings.json's cliApiToken. One is required.)
+#   HAPI_HOME             (default: ~/.hapi) - location of settings.json + runner.state.json
+#   HAPI_WATCHDOG_DRY_RUN (default: empty) - if "1", log decision but don't actually restart
+#   HAPI_WATCHDOG_GRACE_SEC (default: 30) - how recently the runner must have
+#                         heartbeated to be considered alive even if absent
+#                         from the hub list (handles brief reconnect windows).
+#
+# Exit codes:
+#   0 - runner healthy, restarted OK, or no action (failed probe, grace, dry run)
+#   1 - runner unhealthy AND the restart attempt failed
+#   2 - misconfigured (missing token, etc.) - the oneshot unit has no
+#       Restart=, so it is simply marked failed until the next timer tick.
+
+set -euo pipefail
+
+log() {
+    printf '%s [watchdog] %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*"
+}
+
+HAPI_API_URL="${HAPI_API_URL:-http://127.0.0.1:3006}"
+HAPI_HOME="${HAPI_HOME:-$HOME/.hapi}"
+GRACE_SEC="${HAPI_WATCHDOG_GRACE_SEC:-30}"
+SETTINGS_FILE="$HAPI_HOME/settings.json"
+STATE_FILE="$HAPI_HOME/runner.state.json"
+
+if [[ ! -f "$SETTINGS_FILE" ]]; then
+    log "settings.json missing at $SETTINGS_FILE; nothing to do"
+    exit 0
+fi
+
+# Token resolution: env wins, then settings.json.
+TOKEN="${CLI_API_TOKEN:-}"
+if [[ -z "$TOKEN" ]] && command -v jq >/dev/null; then
+    TOKEN="$(jq -r '.cliApiToken // empty' "$SETTINGS_FILE" 2>/dev/null || true)"
+fi
+if [[ -z "$TOKEN" ]]; then
+    log "ERROR: no CLI_API_TOKEN in env or settings.json (cliApiToken)"
+    exit 2
+fi
+
+MACHINE_ID=""
+if command -v jq >/dev/null; then
+    MACHINE_ID="$(jq -r '.machineId // empty' "$SETTINGS_FILE" 2>/dev/null || true)"
+fi
+if [[ -z "$MACHINE_ID" ]]; then
+    log "ERROR: no machineId in settings.json"
+    exit 2
+fi
+
+# Probe the hub. Capture the HTTP status via -w so non-200 responses are caught below.
+TMP_RESP="$(mktemp)"
+trap 'rm -f "$TMP_RESP"' EXIT
+
+HTTP_CODE="$(curl -sS \
+    -o "$TMP_RESP" \
+    -w '%{http_code}' \
+    -H "Authorization: Bearer $TOKEN" \
+    --max-time 5 \
+    "$HAPI_API_URL/cli/machines/" 2>/dev/null || true)"
+
+if [[ "$HTTP_CODE" != "200" ]]; then
+    log "hub probe failed (HTTP $HTTP_CODE at $HAPI_API_URL/cli/machines/); not restarting on a failed probe"
+    exit 0
+fi
+
+# Look up THIS machine in the response. The hub returns
+#   { "machines": [ { "id": "...", "runner": { "status": "running", ... }, ... } ] }
+# but the shape has shifted historically; accept either { "machines": [] } or top-level array.
+HEALTHY="false"
+if command -v jq >/dev/null; then
+    HEALTHY="$(jq --arg mid "$MACHINE_ID" '
+        (if type == "array" then . else (.machines // []) end)
+        | map(select(.id == $mid))
+        | if length == 0 then false
+          else (.[0].runner.status // .[0].runnerState.status // "unknown") == "running"
+          end
+        | tostring
+    ' "$TMP_RESP" 2>/dev/null || echo 'parse_fail')"
+fi
+
+if [[ "$HEALTHY" == "true" ]]; then
+    log "machine $MACHINE_ID present + runner running on $HAPI_API_URL; no action"
+    exit 0
+fi
+
+# Before restarting, give the runner a grace window in case it's mid-reconnect.
+if [[ -f "$STATE_FILE" ]] && command -v jq >/dev/null; then
+    LAST_HEARTBEAT="$(jq -r '.lastHeartbeat // empty' "$STATE_FILE" 2>/dev/null || true)"
+    if [[ -n "$LAST_HEARTBEAT" ]]; then
+        # lastHeartbeat is JS toLocaleString format, not ISO; can't trivially parse
+        # to epoch in pure bash. Use file mtime as a rough proxy.
+        LAST_MTIME="$(stat -c '%Y' "$STATE_FILE" 2>/dev/null || echo 0)"
+        NOW="$(date +%s)"
+        AGE=$((NOW - LAST_MTIME))
+        if (( AGE < GRACE_SEC )); then
+            log "machine $MACHINE_ID not in hub list but runner heartbeat ${AGE}s old (< grace ${GRACE_SEC}s); skipping restart"
+            exit 0
+        fi
+        log "machine $MACHINE_ID absent or runner not running; heartbeat ${AGE}s stale"
+    fi
+fi
+
+if [[ "${HAPI_WATCHDOG_DRY_RUN:-}" == "1" ]]; then
+    log "DRY_RUN: would run 'sudo systemctl restart hapi-runner.service'"
+    exit 0
+fi
+
+log "restarting hapi-runner.service via systemctl"
+if systemctl restart hapi-runner.service 2>/dev/null; then
+    log "restart OK"
+    exit 0
+fi
+
+# Fallback: try sudo (the unit user is heavygee but may have NOPASSWD for this one)
+if sudo -n systemctl restart hapi-runner.service 2>/dev/null; then
+    log "restart OK (via sudo)"
+    exit 0
+fi
+
+log "ERROR: could not restart hapi-runner.service (permission?)"
+exit 1
diff --git a/cli/systemd/hapi-runner-watchdog.timer b/cli/systemd/hapi-runner-watchdog.timer
@@ -0,0 +1,25 @@
+# Schedule for hapi-runner-watchdog.service
+#
+# Place at: /etc/systemd/system/hapi-runner-watchdog.timer
+# Enable:   sudo systemctl enable --now hapi-runner-watchdog.timer
+# Inspect:  systemctl list-timers hapi-runner-watchdog.timer
+#
+# Fires:
+#   * OnBootSec=120s   - wait 2 min after boot so hapi-hub + runner have
+#                        had a chance to come up before we probe.
+#   * OnUnitActiveSec=60s - then every 60 seconds.
+#
+# The probe is cheap (one HTTP GET); 60s is the right tradeoff between
+# "operator notices missing machine" and "filling the journal with noise".
+
+[Unit]
+Description=HAPI runner liveness watchdog timer
+
+[Timer]
+OnBootSec=120
+OnUnitActiveSec=60
+AccuracySec=5
+Unit=hapi-runner-watchdog.service
+
+[Install]
+WantedBy=timers.target