Skip to content

fix(execd): kill entire process group on command cancel#924

Open
Pangjiping wants to merge 4 commits into
alibaba:mainfrom
Pangjiping:fix/execd-child-process-leak-922
Open

fix(execd): kill entire process group on command cancel#924
Pangjiping wants to merge 4 commits into
alibaba:mainfrom
Pangjiping:fix/execd-child-process-leak-922

Conversation

@Pangjiping
Copy link
Copy Markdown
Collaborator

Summary

When a foreground command was cancelled (client disconnect, timeout, or DELETE /command), only the bash group leader received SIGKILL — child processes spawned via & or pipelines kept running as orphans because exec.CommandContext's internal kill targets a single pid, and killPid sent signals to the leader only.

Fix runCommand's ctx.Done() branch to send SIGKILL to -pid (the whole group, since the leader is launched with Setpgid: true), mirroring runBackgroundCommand. Rewrite killPid to signal -pid for SIGTERM/SIGKILL and to use kill(-pid, 0) for liveness probing, so Interrupt() also terminates descendants.

Adds regression tests covering both cancel and Interrupt paths.

Fixes #922

Testing

  • Not run (explain why)
  • Unit tests
  • Integration tests
  • e2e / manual verification

Breaking Changes

  • None
  • Yes (describe impact and migration path)

Checklist

  • Linked Issue or clearly described motivation
  • Added/updated docs (if needed)
  • Added/updated tests (if needed)
  • Security impact considered
  • Backward compatibility considered

When a foreground command was cancelled (client disconnect, timeout, or
DELETE /command), only the bash group leader received SIGKILL — child
processes spawned via `&` or pipelines kept running as orphans because
exec.CommandContext's internal kill targets a single pid, and killPid
sent signals to the leader only.

Fix runCommand's ctx.Done() branch to send SIGKILL to -pid (the whole
group, since the leader is launched with Setpgid: true), mirroring
runBackgroundCommand. Rewrite killPid to signal -pid for SIGTERM/SIGKILL
and to use kill(-pid, 0) for liveness probing, so Interrupt() also
terminates descendants.

Adds regression tests covering both cancel and Interrupt paths.

Fixes alibaba#922

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@Pangjiping Pangjiping added bug Something isn't working component/execd labels May 20, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5c3e5b3094

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread components/execd/pkg/runtime/command.go
Comment thread components/execd/pkg/runtime/interrupt.go
Comment thread components/execd/pkg/runtime/interrupt.go Outdated
Pangjiping and others added 3 commits May 20, 2026 21:42
killPid now signals the whole process group (-pid). Combined with the
fact that commandClientMap retains finished sessions, a late or retried
Interrupt could otherwise terminate every process in an unrelated
process group whose PGID has reused the recorded PID.

- markCommandFinished clears kernel.pid alongside kernel.running so the
  stale PID is no longer accessible.
- commandSnapshot now reads under c.mu.RLock for a consistent view of
  running/pid relative to markCommandFinished's write under c.mu.Lock.
- Interrupt() (unix and windows) snapshots the kernel and refuses to
  signal when the command has already finished.

Adds a regression test ensuring Interrupt on a completed session returns
an error and that pid is cleared.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
kill(2) on a process group only guarantees delivery to at least one
member, and kill(-pid, 0) keeps reporting the group as observable while
any unreaped zombie lingers. The previous post-SIGKILL probe ran for
only 150ms and then returned a hard error, so Interrupt could surface a
500 even though the kill signal had already been delivered. Likewise on
macOS, SIGKILL on a group that has been reduced to zombies returns
EPERM, which the previous code reported as a kill failure even though
SIGTERM had already taken effect.

- After a successful SIGKILL, log a warning when the probe loop still
  observes the group instead of returning an error.
- When SIGTERM was delivered but the SIGKILL syscall fails (commonly
  EPERM on a zombie-only group), log and return nil — the kill is in
  flight and the kernel will reap the group once Wait() runs.

Adds a regression test that runs killPid against a Setpgid group with
no concurrent reaper, exercising the zombie-lingering path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Execute() defers cancel() for every foreground command, including
successful ones, so the signal-forwarding goroutine's ctx.Done() branch
also fired on the normal-success path. With the new group-wide SIGKILL
on -cmd.Process.Pid, that post-completion signal could hit a recycled
pid/pgid and kill an unrelated process group inside the sandbox.

Gate the goroutine on the existing `done` channel (closed after
cmd.Wait() returns or on start failure): exit cleanly when the command
has finished, so only genuine cancellations — timeout, client abort,
Interrupt — trigger the group kill. A double-check inside the
ctx.Done() branch handles the race where ctx is cancelled at the same
instant cmd.Wait() returns.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working component/execd

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Child process leak on command cancellation

1 participant