Skip to content

fix(envd): kill command's whole process tree on signal#2933

Open
mishushakov wants to merge 1 commit into
mainfrom
mishushakov/envd-kill-process-tree
Open

fix(envd): kill command's whole process tree on signal#2933
mishushakov wants to merge 1 commit into
mainfrom
mishushakov/envd-kill-process-tree

Conversation

@mishushakov
Copy link
Copy Markdown
Member

envd's SendSignal only signaled the single process it manages (the command leader), so child processes the command spawned kept running — and consuming resources — after a kill() (see e2b-dev/E2B#1389, which worked around this SDK-side with a pgrep shell walk). Non-PTY commands now start in their own process group (Setpgid) and SendSignal delivers the signal to the whole group, terminating the leader together with its children; command timeouts tear down the group too, and PTY processes are left untouched. This is the same idiom already used for socat in internal/port/forward.go, and it's transparently backward-compatible — existing SDKs get whole-tree termination for free and can later drop their workaround. Adds a regression test (TestSendSignal_KillsProcessTree) and bumps the envd version to 0.6.2.

🤖 Generated with Claude Code

envd's SendSignal only signaled the single process it manages (the
command leader), so child processes the command spawned kept running —
and consuming resources — after a kill. Non-PTY commands now start in
their own process group (Setpgid) and SendSignal delivers the signal to
the whole group, terminating the leader together with its children;
command timeouts tear down the group too. PTY processes are left
untouched. Bumps the envd version.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@cursor
Copy link
Copy Markdown

cursor Bot commented Jun 5, 2026

PR Summary

Medium Risk
Changes core process termination and timeout cancellation paths; mis-targeted group signals could have broad impact, though PTY is excluded and the pattern matches existing socat handling.

Overview
Non-PTY user commands now start in a dedicated process group so SendSignal and command timeouts tear down the leader and any children it spawned, instead of leaving orphan processes after a kill. PTY sessions keep the previous single-process signaling behavior. A regression test asserts that SIGKILL clears a spawned child tree; envd is bumped to 0.6.2.

Reviewed by Cursor Bugbot for commit b1d9d30. Bugbot is set up for automated code reviews on this repo. Configure here.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request ensures that non-PTY commands run in their own process group so that signals and context cancellations terminate the entire process tree instead of just the leader process. A test is added to verify this behavior, and the package version is bumped to 0.6.2. There are no review comments, and I have no feedback to provide.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 5, 2026

❌ 4 Tests Failed:

Tests completed Failed Passed Skipped
2734 4 2730 7
View the full list of 4 ❄️ flaky test(s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestSandboxListPaginationRunningLargerLimit

Flake rate in main: 40.04% (Passed 891 times, Failed 595 times)

Stack Traces | 93.2s run time
=== RUN   TestSandboxListPaginationRunningLargerLimit
    sandbox_list_test.go:327: Created sandbox 1/12: ii33u67gzre0tfswcoheg
    sandbox_list_test.go:327: Created sandbox 2/12: irtiqyu4nzzsgrrhzt6b4
    sandbox_list_test.go:327: Created sandbox 3/12: icv5wh3t3r5uo9xxclnh6
    sandbox_list_test.go:327: Created sandbox 4/12: i5ubvybc2weet5q0jvhth
    sandbox_list_test.go:327: Created sandbox 5/12: i34zl58kuji1q4fsyweaf
    sandbox_list_test.go:327: Created sandbox 6/12: i5neymjryjq8avpxll31c
    sandbox_list_test.go:327: Created sandbox 7/12: io16rllow3uzzcvjtdvoe
    sandbox_list_test.go:327: Created sandbox 8/12: iv2a5s3thbily9ldt8t24
    sandbox_list_test.go:327: Created sandbox 9/12: i1xg3s9etzrinyu5krubm
    sandbox_list_test.go:327: Created sandbox 10/12: i6w6kbq8i7r2srz2jylm2
    sandbox_list_test.go:327: Created sandbox 11/12: iww3gu13lcmy4had8d79u
    sandbox_list_test.go:327: Created sandbox 12/12: ie6dp2bn21dv9ku051wsu
    sandbox_list_test.go:330: 
        	Error Trace:	.../api/sandboxes/sandbox_list_test.go:340
        	            				.../hostedtoolcache/go/1.26.3.../src/runtime/asm_amd64.s:1771
        	Error:      	"[]" should have 12 item(s), but has 0
    sandbox_list_test.go:330: 
        	Error Trace:	.../api/sandboxes/sandbox_list_test.go:330
        	Error:      	Condition never satisfied
        	Test:       	TestSandboxListPaginationRunningLargerLimit
--- FAIL: TestSandboxListPaginationRunningLargerLimit (93.23s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity

Flake rate in main: 54.90% (Passed 884 times, Failed 1076 times)

Stack Traces | 62.5s run time
=== RUN   TestSandboxMemoryIntegrity
=== PAUSE TestSandboxMemoryIntegrity
=== CONT  TestSandboxMemoryIntegrity
    sandbox_memory_integrity_test.go:27: Build completed successfully
--- FAIL: TestSandboxMemoryIntegrity (62.55s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity/tmpfs_hash

Flake rate in main: 54.97% (Passed 874 times, Failed 1067 times)

Stack Traces | 191s run time
=== RUN   TestSandboxMemoryIntegrity/tmpfs_hash
=== PAUSE TestSandboxMemoryIntegrity/tmpfs_hash
=== CONT  TestSandboxMemoryIntegrity/tmpfs_hash
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{start:{pid:1252}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Total memory: 985 MB\nUsed memory before tmpfs mount: 191 MB\nFree memory before tmpfs mount: 793 MB\nMemory to use in integrity test (60% of free, min 64MB): 475 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"475+0 records in\n475+0 records out\n498073600 bytes (498 MB, 475 MiB) copied, 2.54369 s, 196 MB/s\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"\t"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"C"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"o"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"m"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"m"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"a"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"d"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:" "}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"b"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"e"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"i"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"g"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:" "}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"t"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"i"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"m"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"e"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"d"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:":"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:" "}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"\""}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"dd"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:" "}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"if=/dev/urandom"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:" "}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"of=/mnt/testfile"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:" "}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"bs=1M"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:" "}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"count=475"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"\""}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"\t"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"U"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"s"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"e"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"r"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:" "}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"t"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"i"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"m"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"e"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:" "}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"("}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"s"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"e"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"c"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"o"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"d"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"s"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:")"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:":"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:" "}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"0.00"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"\t"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"S"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"y"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"s"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"t"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"e"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"m"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:" time (seconds): 2.52\n\tPercent of CPU this job got: 99%\n\tElapsed (wall clock) time (h:mm:ss or m:ss): 0:02.55\n\tAverage shared text size (kbytes): 0\n\tAverage unshared data size (kbytes): 0\n\tAverage stack size (kbytes): 0\n\tAverage total size (kbytes): 0\n\tMaximum resident set size (kbytes): 2628\n\tAverage resident set size (kbytes): 0\n\tMajor (requiring I/O) page faults: 2\n\tMinor (reclaiming a frame) page faults: 346\n\tVoluntary context switches: 3\n\tInvoluntary context switches: 14\n\tSwaps: 0\n\tFile system inputs: 176\n\tFile system outputs: 0\n\tSocket messages sent: 0\n\tSocket messages received: 0\n\tSignals delivered: 0\n\tPage size (bytes): 4096\n\tExit status: 0\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Used memory after tmpfs mount and file fill: 670 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_memory_integrity_test.go:70: Command [bash] completed successfully in sandbox ik01qwcba6b790job39ra
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
    sandbox_memory_integrity_test.go:80: Command [bash] output: event:{start:{pid:1268}}
Executing command bash in sandbox iwhmgb0blvbknpm9wgsdr (user: root)
    sandbox_memory_integrity_test.go:80: Command [bash] output: event:{data:{stdout:"8e0e23a1fceb40ddb2faeec3707110aa2889f35cec34ac4e6092b1070aa4c25c\n"}}
    sandbox_memory_integrity_test.go:80: Command [bash] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_memory_integrity_test.go:80: Command [bash] completed successfully in sandbox ik01qwcba6b790job39ra
Executing command bash in sandbox iwhmgb0blvbknpm9wgsdr (user: root)
    sandbox_memory_integrity_test.go:80: Command [bash] output: event:{start:{pid:1271}}
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
Executing command bash in sandbox ik01qwcba6b790job39ra (user: root)
    sandbox_memory_integrity_test.go:110: 
        	Error Trace:	.../tests/orchestrator/sandbox_memory_integrity_test.go:81
        	            				.../hostedtoolcache/go/1.26.3.../src/runtime/asm_amd64.s:1771
        	Error:      	Received unexpected error:
        	            	failed to execute command bash in sandbox ik01qwcba6b790job39ra: unavailable: HTTP status 502 Bad Gateway
    sandbox_memory_integrity_test.go:110: 
        	Error Trace:	.../tests/orchestrator/sandbox_memory_integrity_test.go:78
        	            				.../tests/orchestrator/sandbox_memory_integrity_test.go:110
        	Error:      	Condition never satisfied
        	Test:       	TestSandboxMemoryIntegrity/tmpfs_hash
--- FAIL: TestSandboxMemoryIntegrity/tmpfs_hash (191.32s)
github.com/e2b-dev/infra/tests/integration/internal/tests/proxies::TestEnvdAccessTokenAutoResumeViaProxy

Flake rate in main: 40.01% (Passed 880 times, Failed 587 times)

Stack Traces | 11.6s run time
=== RUN   TestEnvdAccessTokenAutoResumeViaProxy
=== PAUSE TestEnvdAccessTokenAutoResumeViaProxy
=== CONT  TestEnvdAccessTokenAutoResumeViaProxy
    traffic_access_token_test.go:357: 
        	Error Trace:	.../tests/proxies/traffic_access_token_test.go:357
        	Error:      	Received unexpected error:
        	            	Get "http://localhost:3002/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
        	Test:       	TestEnvdAccessTokenAutoResumeViaProxy
--- FAIL: TestEnvdAccessTokenAutoResumeViaProxy (11.63s)

To view more test analytics, go to the Test Analytics Dashboard
📋 Got 3 mins? Take this short survey to help us improve Test Analytics.

Comment on lines +401 to +414
// gone group (ESRCH) is treated as success — there is nothing left to kill.
func signalProcessGroup(cmd *exec.Cmd, signal syscall.Signal) error {
if cmd.Process == nil {
return errors.New("process not started")
}

err := syscall.Kill(-cmd.Process.Pid, signal)
if err != nil && !errors.Is(err, syscall.ESRCH) {
return err
}

return nil
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 After this PR, SendSignal on a non-PTY handler bypasses Go's ErrProcessDone guard: signalProcessGroup calls syscall.Kill(-cmd.Process.Pid, signal) directly. There is a window after cmd.Wait() reaps the leader (freeing its PID) but before the deferred s.processes.Delete(pid) runs in start.go:184-188 — a stale SendSignal arriving in that window can target a recycled PID and silently kill an unrelated command's whole process group. Fix: gate signalProcessGroup on cmd.ProcessState != nil (or a handler-level atomic flag set around cmd.Wait()).

Extended reasoning...

What the bug is

Pre-PR, SendSignal delivered the signal via p.cmd.Process.Signal(signal). Go's os.Process keeps a done atomic (set inside Wait() before/after Wait4), and pidSignal (/usr/local/go/src/os/exec_unix.go) returns os.ErrProcessDone once the process has been reaped. That guard makes signaling a reaped leader a no-op — important because the kernel is free to reuse a reaped PID immediately.

This PR introduces signalProcessGroup:

func signalProcessGroup(cmd *exec.Cmd, signal syscall.Signal) error {
    if cmd.Process == nil {
        return errors.New("process not started")
    }
    err := syscall.Kill(-cmd.Process.Pid, signal)
    if err != nil && !errors.Is(err, syscall.ESRCH) {
        return err
    }
    return nil
}

It bypasses Go's done guard entirely and issues a raw kill(-pid, sig) on whatever PID cmd.Process.Pid still records — which is exactly the PID that may have just been reassigned by the kernel.

The race window

start.go:184-188 reaps and unregisters like this:

go func() {
    defer s.processes.Delete(pid)
    proc.Wait()
}()

handler.Wait() then does, in order: <-p.outCtx.Done()p.cmd.Wait() (leader is reaped here — kernel can reuse the PID from this point) → p.tty.Close() → build endEvent → p.EndEvent.Source <- event (buffer-0 multiplex channel — blocks until the handleStart consumer reads) → log → p.cancel(). Only after all of that returns does s.processes.Delete(pid) run.

During that window:

  1. The handler is still discoverable via PID and via tag in getProcess (service.go:47).
  2. p.cmd.Process.Pid still holds the now-free leader PID.
  3. A new non-PTY command started in this PR sets SysProcAttr.Setpgid = true (handler.go:151), so if its leader gets the recycled PID it becomes a process-group leader with pgid == pid.

A stale SendSignal(SIGKILL) (client retry, tag lookup, UI race) arriving against the old handler in that window calls signalProcessGroupkill(-recycled_pid, SIGKILL), which lands on the brand-new, unrelated command's entire process group.

Step-by-step proof

  1. Client starts command A (non-PTY). Leader gets PID 1234. Setpgid=true makes it pgid=1234.
  2. Command A finishes. Inside handler.Wait(), p.cmd.Wait() returns — kernel marks PID 1234 reusable. Handler is still in s.processes because the defer s.processes.Delete(pid) hasn't fired yet.
  3. The buffer-0 send p.EndEvent.Source <- event blocks: the SDK consumer is slow / coalescing reads. Window stretches into ms+ territory.
  4. Concurrently, the same envd starts command B (non-PTY, also Setpgid=true). The kernel reassigns PID 1234 to B; B becomes pgid=1234 leader.
  5. A stale SendSignal(SIGKILL) for command A arrives — getProcess(pid=1234) (or by tag) still returns A's handler.
  6. A.SendSignal calls signalProcessGroup(A.cmd, SIGKILL)syscall.Kill(-1234, SIGKILL) → entire process group 1234 (= command B's leader and all its children) is killed.

Pre-PR step 6 would have returned ErrProcessDone and made no syscall.

Why existing code doesn't prevent it

  • The if cmd.Process == nil check only covers "not started yet"; it does nothing about "already reaped".
  • The errors.Is(err, syscall.ESRCH) swallow makes things worse, not better: it only triggers when the group is truly gone — which is exactly not the case when the PID has been recycled.
  • The PTY branch still goes through p.cmd.Process.Signal and retains Go's guard, so only non-PTY commands regress.
  • Linux's large pid_max doesn't save us: PIDs are recycled FIFO within the in-use set, and a sandbox that churns short-lived commands can wrap quickly (much faster in PID namespaces with smaller ranges).

Impact

A stale signal can silently kill a different command's whole process tree — exactly the blast radius this PR was supposed to introduce for legitimate kills, now applied to the wrong target. The error returned to the caller is nil, so the caller has no signal that anything went wrong; the victim command just dies. Likelihood is low (small window × PID reuse), but the failure mode is silent cross-command termination, which is hard to diagnose after the fact.

Fix

Restore the equivalent of ErrProcessDone. Simplest:

func signalProcessGroup(cmd *exec.Cmd, signal syscall.Signal) error {
    if cmd.Process == nil {
        return errors.New("process not started")
    }
    if cmd.ProcessState != nil {
        return os.ErrProcessDone
    }
    err := syscall.Kill(-cmd.Process.Pid, signal)
    ...
}

or track a handler-level atomic.Bool set immediately around p.cmd.Wait() and consult it in SendSignal before falling through. The cmd.Cancel callback (also new in this PR) has the same hazard and should use the same guard.

Comment on lines +193 to +197
func processAlive(pid int) bool {
// Signal 0 performs error checking without actually sending a signal.
return syscall.Kill(pid, 0) == nil
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 processAlive uses syscall.Kill(pid, 0) which succeeds for zombies — kill(2) explicitly notes the existence check passes "as long as the process still exists" (i.e. has a /proc entry, even if unreaped). After the test SIGKILLs the process group, the two sleep children are orphaned to PID 1 / nearest subreaper; if the test ever runs in a container where PID 1 is the Go test binary itself (no tini/dumb-init), the orphans stay zombies, processAlive returns true forever, and the 5s Eventually times out. Test-only nit — childrenOf already parses /proc//stat, so reading fields[0] and treating '''Z''' as not-alive is a one-line fix.

Extended reasoning...

What the bug is

processAlive at start_test.go:194-197 uses syscall.Kill(pid, 0) to probe process existence. Per kill(2): "If sig is 0, then no signal is sent, but existence and permission checks are still performed; this can be used to check for the existence of a process ID..." — and a zombie is still an existing process (it has a /proc entry until the parent reaps it). Only when the entry is gone does kill(pid, 0) return ESRCH.

Why the test can flake

TestSendSignal_KillsProcessTree runs /bin/sh -c 'sleep 120 & sleep 120 & wait'. The shell (the leader) is the parent of both sleeps. The test then SIGKILLs the process group, which kills shell + both sleeps simultaneously. envd reaps the shell via cmd.Wait() in Handler.Wait(). But the orphaned sleeps get reparented (by the kernel, at the moment their parent dies) to PID 1 or the nearest PR_SET_CHILD_SUBREAPER. envd does not set itself as a subreaper, so reaping is up to whatever PID 1 happens to be in the test environment.

Why the refutation is partly right but not a reason to skip

The refutation is correct that in the environments this code is normally tested in (dev host, go test as a child of a shell, CI with docker run --init, systemd as PID 1), reaping happens within microseconds and the 5s Eventually window is enormous. The test will pass.

The issue is what happens in the one environment where it doesn't: docker run golang go test ./... (or any image built with ENTRYPOINT ["go","test",...]) — here the Go test binary itself is PID 1, and Go's runtime does not perform automatic SIGCHLD reaping for non-direct children. The zombies persist for the lifetime of the test process, processAlive keeps returning true, and the require.Eventually assertion times out. This is exactly the kind of environment infra-style projects sometimes adopt for sandboxed CI, and it's reasonable to harden against it preemptively — especially since the fix is essentially free.

Step-by-step proof

  1. Test starts leader sh -c 'sleep 120 & sleep 120 & wait' (pid L). Kernel creates children C1, C2 with ppid=L.
  2. childrenOf(L) returns [C1, C2].
  3. Client sends SIGKILL → envd calls syscall.Kill(-L, SIGKILL). Kernel marks L, C1, C2 as terminated.
  4. envd's cmd.Wait() reaps L → L's /proc entry disappears. C1, C2 still have entries (state = Z), now reparented to whatever process is PID 1 in this namespace.
  5. If PID 1 = the Go test binary (no reaper), no wait4() is ever called for C1, C2.
  6. slices.ContainsFunc(childPids, processAlive): syscall.Kill(C1, 0) → returns nil (zombie still exists, permission check passes since same uid), so processAlive(C1) == true.
  7. The Eventually predicate returns false for 5s, then require.Eventually fails: "child processes should be killed with the leader".

How to fix

childrenOf already parses /proc/<pid>/stat and even extracts fields[0] (the state) but discards it. Reuse it in processAlive:

func processAlive(pid int) bool {
    data, err := os.ReadFile("/proc/" + strconv.Itoa(pid) + "/stat")
    if err != nil {
        return false // gone from /proc → definitely not alive
    }
    idx := strings.LastIndex(string(data), ")")
    if idx == -1 {
        return false
    }
    fields := strings.Fields(string(data)[idx+1:])
    return len(fields) > 0 && fields[0] != "Z"
}

This makes the test robust regardless of who PID 1 is.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant