fix(execd): kill entire process group on command cancel#924
Open
Pangjiping wants to merge 4 commits into
Open
Conversation
When a foreground command was cancelled (client disconnect, timeout, or DELETE /command), only the bash group leader received SIGKILL — child processes spawned via `&` or pipelines kept running as orphans because exec.CommandContext's internal kill targets a single pid, and killPid sent signals to the leader only. Fix runCommand's ctx.Done() branch to send SIGKILL to -pid (the whole group, since the leader is launched with Setpgid: true), mirroring runBackgroundCommand. Rewrite killPid to signal -pid for SIGTERM/SIGKILL and to use kill(-pid, 0) for liveness probing, so Interrupt() also terminates descendants. Adds regression tests covering both cancel and Interrupt paths. Fixes alibaba#922 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5c3e5b3094
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
killPid now signals the whole process group (-pid). Combined with the fact that commandClientMap retains finished sessions, a late or retried Interrupt could otherwise terminate every process in an unrelated process group whose PGID has reused the recorded PID. - markCommandFinished clears kernel.pid alongside kernel.running so the stale PID is no longer accessible. - commandSnapshot now reads under c.mu.RLock for a consistent view of running/pid relative to markCommandFinished's write under c.mu.Lock. - Interrupt() (unix and windows) snapshots the kernel and refuses to signal when the command has already finished. Adds a regression test ensuring Interrupt on a completed session returns an error and that pid is cleared. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
kill(2) on a process group only guarantees delivery to at least one member, and kill(-pid, 0) keeps reporting the group as observable while any unreaped zombie lingers. The previous post-SIGKILL probe ran for only 150ms and then returned a hard error, so Interrupt could surface a 500 even though the kill signal had already been delivered. Likewise on macOS, SIGKILL on a group that has been reduced to zombies returns EPERM, which the previous code reported as a kill failure even though SIGTERM had already taken effect. - After a successful SIGKILL, log a warning when the probe loop still observes the group instead of returning an error. - When SIGTERM was delivered but the SIGKILL syscall fails (commonly EPERM on a zombie-only group), log and return nil — the kill is in flight and the kernel will reap the group once Wait() runs. Adds a regression test that runs killPid against a Setpgid group with no concurrent reaper, exercising the zombie-lingering path. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Execute() defers cancel() for every foreground command, including successful ones, so the signal-forwarding goroutine's ctx.Done() branch also fired on the normal-success path. With the new group-wide SIGKILL on -cmd.Process.Pid, that post-completion signal could hit a recycled pid/pgid and kill an unrelated process group inside the sandbox. Gate the goroutine on the existing `done` channel (closed after cmd.Wait() returns or on start failure): exit cleanly when the command has finished, so only genuine cancellations — timeout, client abort, Interrupt — trigger the group kill. A double-check inside the ctx.Done() branch handles the race where ctx is cancelled at the same instant cmd.Wait() returns. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
When a foreground command was cancelled (client disconnect, timeout, or DELETE /command), only the bash group leader received SIGKILL — child processes spawned via
&or pipelines kept running as orphans because exec.CommandContext's internal kill targets a single pid, and killPid sent signals to the leader only.Fix runCommand's ctx.Done() branch to send SIGKILL to -pid (the whole group, since the leader is launched with Setpgid: true), mirroring runBackgroundCommand. Rewrite killPid to signal -pid for SIGTERM/SIGKILL and to use kill(-pid, 0) for liveness probing, so Interrupt() also terminates descendants.
Adds regression tests covering both cancel and Interrupt paths.
Fixes #922
Testing
Breaking Changes
Checklist