fix(envd): kill command's whole process tree on signal#2933
Conversation
envd's SendSignal only signaled the single process it manages (the command leader), so child processes the command spawned kept running — and consuming resources — after a kill. Non-PTY commands now start in their own process group (Setpgid) and SendSignal delivers the signal to the whole group, terminating the leader together with its children; command timeouts tear down the group too. PTY processes are left untouched. Bumps the envd version. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
PR SummaryMedium Risk Overview Reviewed by Cursor Bugbot for commit b1d9d30. Bugbot is set up for automated code reviews on this repo. Configure here. |
There was a problem hiding this comment.
Code Review
This pull request ensures that non-PTY commands run in their own process group so that signals and context cancellations terminate the entire process tree instead of just the leader process. A test is added to verify this behavior, and the package version is bumped to 0.6.2. There are no review comments, and I have no feedback to provide.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
❌ 4 Tests Failed:
View the full list of 4 ❄️ flaky test(s)
To view more test analytics, go to the Test Analytics Dashboard |
| // gone group (ESRCH) is treated as success — there is nothing left to kill. | ||
| func signalProcessGroup(cmd *exec.Cmd, signal syscall.Signal) error { | ||
| if cmd.Process == nil { | ||
| return errors.New("process not started") | ||
| } | ||
|
|
||
| err := syscall.Kill(-cmd.Process.Pid, signal) | ||
| if err != nil && !errors.Is(err, syscall.ESRCH) { | ||
| return err | ||
| } | ||
|
|
||
| return nil | ||
| } | ||
|
|
There was a problem hiding this comment.
🔴 After this PR, SendSignal on a non-PTY handler bypasses Go's ErrProcessDone guard: signalProcessGroup calls syscall.Kill(-cmd.Process.Pid, signal) directly. There is a window after cmd.Wait() reaps the leader (freeing its PID) but before the deferred s.processes.Delete(pid) runs in start.go:184-188 — a stale SendSignal arriving in that window can target a recycled PID and silently kill an unrelated command's whole process group. Fix: gate signalProcessGroup on cmd.ProcessState != nil (or a handler-level atomic flag set around cmd.Wait()).
Extended reasoning...
What the bug is
Pre-PR, SendSignal delivered the signal via p.cmd.Process.Signal(signal). Go's os.Process keeps a done atomic (set inside Wait() before/after Wait4), and pidSignal (/usr/local/go/src/os/exec_unix.go) returns os.ErrProcessDone once the process has been reaped. That guard makes signaling a reaped leader a no-op — important because the kernel is free to reuse a reaped PID immediately.
This PR introduces signalProcessGroup:
func signalProcessGroup(cmd *exec.Cmd, signal syscall.Signal) error {
if cmd.Process == nil {
return errors.New("process not started")
}
err := syscall.Kill(-cmd.Process.Pid, signal)
if err != nil && !errors.Is(err, syscall.ESRCH) {
return err
}
return nil
}It bypasses Go's done guard entirely and issues a raw kill(-pid, sig) on whatever PID cmd.Process.Pid still records — which is exactly the PID that may have just been reassigned by the kernel.
The race window
start.go:184-188 reaps and unregisters like this:
go func() {
defer s.processes.Delete(pid)
proc.Wait()
}()handler.Wait() then does, in order: <-p.outCtx.Done() → p.cmd.Wait() (leader is reaped here — kernel can reuse the PID from this point) → p.tty.Close() → build endEvent → p.EndEvent.Source <- event (buffer-0 multiplex channel — blocks until the handleStart consumer reads) → log → p.cancel(). Only after all of that returns does s.processes.Delete(pid) run.
During that window:
- The handler is still discoverable via PID and via tag in
getProcess(service.go:47). p.cmd.Process.Pidstill holds the now-free leader PID.- A new non-PTY command started in this PR sets
SysProcAttr.Setpgid = true(handler.go:151), so if its leader gets the recycled PID it becomes a process-group leader withpgid == pid.
A stale SendSignal(SIGKILL) (client retry, tag lookup, UI race) arriving against the old handler in that window calls signalProcessGroup → kill(-recycled_pid, SIGKILL), which lands on the brand-new, unrelated command's entire process group.
Step-by-step proof
- Client starts command A (non-PTY). Leader gets PID 1234.
Setpgid=truemakes it pgid=1234. - Command A finishes. Inside
handler.Wait(),p.cmd.Wait()returns — kernel marks PID 1234 reusable. Handler is still ins.processesbecause thedefer s.processes.Delete(pid)hasn't fired yet. - The buffer-0 send
p.EndEvent.Source <- eventblocks: the SDK consumer is slow / coalescing reads. Window stretches into ms+ territory. - Concurrently, the same envd starts command B (non-PTY, also
Setpgid=true). The kernel reassigns PID 1234 to B; B becomes pgid=1234 leader. - A stale
SendSignal(SIGKILL)for command A arrives —getProcess(pid=1234)(or by tag) still returns A's handler. A.SendSignalcallssignalProcessGroup(A.cmd, SIGKILL)→syscall.Kill(-1234, SIGKILL)→ entire process group 1234 (= command B's leader and all its children) is killed.
Pre-PR step 6 would have returned ErrProcessDone and made no syscall.
Why existing code doesn't prevent it
- The
if cmd.Process == nilcheck only covers "not started yet"; it does nothing about "already reaped". - The
errors.Is(err, syscall.ESRCH)swallow makes things worse, not better: it only triggers when the group is truly gone — which is exactly not the case when the PID has been recycled. - The PTY branch still goes through
p.cmd.Process.Signaland retains Go's guard, so only non-PTY commands regress. - Linux's large
pid_maxdoesn't save us: PIDs are recycled FIFO within the in-use set, and a sandbox that churns short-lived commands can wrap quickly (much faster in PID namespaces with smaller ranges).
Impact
A stale signal can silently kill a different command's whole process tree — exactly the blast radius this PR was supposed to introduce for legitimate kills, now applied to the wrong target. The error returned to the caller is nil, so the caller has no signal that anything went wrong; the victim command just dies. Likelihood is low (small window × PID reuse), but the failure mode is silent cross-command termination, which is hard to diagnose after the fact.
Fix
Restore the equivalent of ErrProcessDone. Simplest:
func signalProcessGroup(cmd *exec.Cmd, signal syscall.Signal) error {
if cmd.Process == nil {
return errors.New("process not started")
}
if cmd.ProcessState != nil {
return os.ErrProcessDone
}
err := syscall.Kill(-cmd.Process.Pid, signal)
...
}or track a handler-level atomic.Bool set immediately around p.cmd.Wait() and consult it in SendSignal before falling through. The cmd.Cancel callback (also new in this PR) has the same hazard and should use the same guard.
| func processAlive(pid int) bool { | ||
| // Signal 0 performs error checking without actually sending a signal. | ||
| return syscall.Kill(pid, 0) == nil | ||
| } | ||
|
|
There was a problem hiding this comment.
🟡 processAlive uses syscall.Kill(pid, 0) which succeeds for zombies — kill(2) explicitly notes the existence check passes "as long as the process still exists" (i.e. has a /proc entry, even if unreaped). After the test SIGKILLs the process group, the two sleep children are orphaned to PID 1 / nearest subreaper; if the test ever runs in a container where PID 1 is the Go test binary itself (no tini/dumb-init), the orphans stay zombies, processAlive returns true forever, and the 5s Eventually times out. Test-only nit — childrenOf already parses /proc//stat, so reading fields[0] and treating '''Z''' as not-alive is a one-line fix.
Extended reasoning...
What the bug is
processAlive at start_test.go:194-197 uses syscall.Kill(pid, 0) to probe process existence. Per kill(2): "If sig is 0, then no signal is sent, but existence and permission checks are still performed; this can be used to check for the existence of a process ID..." — and a zombie is still an existing process (it has a /proc entry until the parent reaps it). Only when the entry is gone does kill(pid, 0) return ESRCH.
Why the test can flake
TestSendSignal_KillsProcessTree runs /bin/sh -c 'sleep 120 & sleep 120 & wait'. The shell (the leader) is the parent of both sleeps. The test then SIGKILLs the process group, which kills shell + both sleeps simultaneously. envd reaps the shell via cmd.Wait() in Handler.Wait(). But the orphaned sleeps get reparented (by the kernel, at the moment their parent dies) to PID 1 or the nearest PR_SET_CHILD_SUBREAPER. envd does not set itself as a subreaper, so reaping is up to whatever PID 1 happens to be in the test environment.
Why the refutation is partly right but not a reason to skip
The refutation is correct that in the environments this code is normally tested in (dev host, go test as a child of a shell, CI with docker run --init, systemd as PID 1), reaping happens within microseconds and the 5s Eventually window is enormous. The test will pass.
The issue is what happens in the one environment where it doesn't: docker run golang go test ./... (or any image built with ENTRYPOINT ["go","test",...]) — here the Go test binary itself is PID 1, and Go's runtime does not perform automatic SIGCHLD reaping for non-direct children. The zombies persist for the lifetime of the test process, processAlive keeps returning true, and the require.Eventually assertion times out. This is exactly the kind of environment infra-style projects sometimes adopt for sandboxed CI, and it's reasonable to harden against it preemptively — especially since the fix is essentially free.
Step-by-step proof
- Test starts leader
sh -c 'sleep 120 & sleep 120 & wait'(pid L). Kernel creates children C1, C2 withppid=L. childrenOf(L)returns[C1, C2].- Client sends
SIGKILL→ envd callssyscall.Kill(-L, SIGKILL). Kernel marks L, C1, C2 as terminated. - envd's
cmd.Wait()reaps L → L's /proc entry disappears. C1, C2 still have entries (state =Z), now reparented to whatever process is PID 1 in this namespace. - If PID 1 = the Go test binary (no reaper), no
wait4()is ever called for C1, C2. slices.ContainsFunc(childPids, processAlive):syscall.Kill(C1, 0)→ returnsnil(zombie still exists, permission check passes since same uid), soprocessAlive(C1) == true.- The Eventually predicate returns
falsefor 5s, thenrequire.Eventuallyfails: "child processes should be killed with the leader".
How to fix
childrenOf already parses /proc/<pid>/stat and even extracts fields[0] (the state) but discards it. Reuse it in processAlive:
func processAlive(pid int) bool {
data, err := os.ReadFile("/proc/" + strconv.Itoa(pid) + "/stat")
if err != nil {
return false // gone from /proc → definitely not alive
}
idx := strings.LastIndex(string(data), ")")
if idx == -1 {
return false
}
fields := strings.Fields(string(data)[idx+1:])
return len(fields) > 0 && fields[0] != "Z"
}This makes the test robust regardless of who PID 1 is.
envd's
SendSignalonly signaled the single process it manages (the command leader), so child processes the command spawned kept running — and consuming resources — after akill()(see e2b-dev/E2B#1389, which worked around this SDK-side with apgrepshell walk). Non-PTY commands now start in their own process group (Setpgid) andSendSignaldelivers the signal to the whole group, terminating the leader together with its children; command timeouts tear down the group too, and PTY processes are left untouched. This is the same idiom already used for socat ininternal/port/forward.go, and it's transparently backward-compatible — existing SDKs get whole-tree termination for free and can later drop their workaround. Adds a regression test (TestSendSignal_KillsProcessTree) and bumps the envd version to 0.6.2.🤖 Generated with Claude Code