Skip to content

stale-runner: identity guard; options: symmetric provider wiring#73

Merged
JAORMX merged 4 commits intomainfrom
jaosorior/stale-runner-and-provider-wiring
Apr 17, 2026
Merged

stale-runner: identity guard; options: symmetric provider wiring#73
JAORMX merged 4 commits intomainfrom
jaosorior/stale-runner-and-provider-wiring

Conversation

@JAORMX
Copy link
Copy Markdown
Contributor

@JAORMX JAORMX commented Apr 17, 2026

Summary

Phase 5 of the hardening series. Two related correctness changes
plus the scaffolding to support them.

  1. Stronger process identity check.
    procutil.IsExpectedProcess previously compared only base
    names of /proc/PID/exe against the expected binary. Two
    unrelated binaries sharing a name on the same host could
    therefore collide. When the caller passes an absolute path
    (the common case — the spawn pipeline resolves runnerPath
    to absolute), the comparison is now against the full
    resolved exe path. A base-name fallback remains for callers
    that pass bare names. Also strips the (deleted) suffix the
    kernel appends when the underlying binary has been unlinked
    post-exec.

  2. Record PIDStartTime in persisted state. A new field on
    state.State alongside PID, populated at spawn time with
    time.Now().UTC(). Backward-compatible: missing field on
    legacy files unmarshals to zero time. Intended for future
    PID-disambiguation on platforms without /proc/PID/exe
    (tracked as Phase 8).

  3. Guard terminateStaleRunner against recycled PIDs.
    Before sending SIGTERM/SIGKILL to -PID (the whole process
    group), verify via a new cfg.processIsExpected hook that
    the PID still belongs to the runner binary. Default calls
    procutil.IsExpectedProcess with the runner base name.
    Without this, a PID recycled onto an unrelated session
    leader would have its entire group signalled.

  4. Auto-wire a network provider for firewall-only configs.
    Previously, a hosted provider was auto-created only when
    EgressPolicy was set. Callers who supplied only
    WithFirewallRules or WithFirewallDefaultAction(Deny)
    got the default runner-side path, which has no filter —
    the caller's deny-default silently degraded to allow-all.
    Extend the trigger: any non-default firewall configuration
    now creates hosted.NewProvider() when no provider is set
    explicitly.

Commits are bisectable; each lands with its own tests.

Test plan

  • task verify green at every commit
  • New unit tests:
    • TestIsExpectedProcess_AbsolutePathMismatch{,Fails} — absolute
      path with differing dir no longer matches even when base names align.
    • TestIsExpectedProcess_SelfBaseName{Fallback} — bare name
      still matches via the fallback path.
    • TestManager_PIDStartTime_RoundTrip — sub-second fidelity
      preserved through Save/Load.
    • TestManager_Load_MissingPIDStartTime_IsZero — legacy state
      files without the field load as zero time.
    • TestTerminateStaleRunner_RecycledPID_Skipped — PID alive
      but not expected → killProcess not called.
    • TestWireDefaultProvider/* — four table entries covering no-
      firewall, egress policy, firewall-rules-only, deny-default-only,
      and explicit-provider-preserved.
  • Existing TestTerminateStaleRunner_* updated to set the
    new cfg.processIsExpected hook where they exercise the
    kill path.
  • End-to-end via brood-box local replace: bbox rebuilds,
    VM boots, SSH + hooks + workspace round-trip clean.

🤖 Generated with Claude Code

JAORMX and others added 4 commits April 17, 2026 09:25
Base-name comparison alone collides when two unrelated binaries
share a name — anywhere /usr/bin/foo and /opt/tools/foo coexist,
a recycled PID landing on the unrelated foo would pass the check.
When the caller passes an absolute path (the common case — the
spawn pipeline resolves runnerPath to absolute) compare against
the full resolved /proc/PID/exe path instead.

Retain a base-name fallback when the caller passes a bare name
so the API remains useful in contexts where the caller only
knows the binary's base name.

Also strip the " (deleted)" suffix the kernel appends when the
underlying binary has been unlinked post-exec, so that runners
still running from a since-cleaned extract directory are
correctly identified.

Updated the two colliding tests that previously relied on the
loose semantics: one covers the base-name fallback path, one
covers the strengthened absolute-path mismatch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a PIDStartTime field alongside PID so terminateStaleRunner —
and future logic on platforms where /proc/PID/exe comparison is
unavailable — can disambiguate a recycled PID from the original
runner. Populated with time.Now().UTC() at the same site that
writes the PID. The JSON tag uses omitempty so state files
written before this field existed still load cleanly with a
zero-time default.

Tests cover both directions: a round-trip at sub-second
precision, and a legacy-format state file missing the field
that must load with IsZero() == true.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Before signalling a PID loaded from the state file, verify the
PID still belongs to the expected runner binary. Between a
crashed runner and the next Run(), the kernel may recycle the
PID onto an unrelated process; sending SIGTERM/SIGKILL to
-PID (the whole process group) at that point would kill the
wrong group.

Adds a cfg.processIsExpected hook alongside the existing
processAlive and killProcess hooks. Default calls
procutil.IsExpectedProcess with the runner binary's base name
(strong match via /proc/PID/exe on Linux, liveness-only on
macOS). Tests that previously exercised the kill path now opt
into cfg.processIsExpected = func(int) bool { return true }
to match their fake-PID scaffolding.

New regression test: PID alive but processIsExpected says false
— must skip termination entirely and not call killProcess.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously, a network provider was auto-created only when an
egress policy was set. A caller who supplied only
WithFirewallRules or WithFirewallDefaultAction(Deny) without a
provider got the default runner-side path, which has no filter —
the caller's deny-default silently degraded to allow-all.

Extend the auto-wire trigger: any non-default firewall
configuration (egress policy, non-empty rules, or a non-Allow
default) now creates hosted.NewProvider() when the caller did
not set one explicitly.

Extract the wiring into wireDefaultProvider so the four cases
can be unit-tested directly on a config instead of indirectly
through Run(). Test introduces a sentinelProvider for the
explicit-provider case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@JAORMX JAORMX merged commit ddbbc01 into main Apr 17, 2026
7 checks passed
@JAORMX JAORMX deleted the jaosorior/stale-runner-and-provider-wiring branch April 17, 2026 06:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant