Skip to content

fix(execd): kill entire process group on command cancel#924

Merged
hittyt merged 4 commits into
opensandbox-group:mainfrom
Pangjiping:fix/execd-child-process-leak-922
May 25, 2026
Merged

fix(execd): kill entire process group on command cancel#924
hittyt merged 4 commits into
opensandbox-group:mainfrom
Pangjiping:fix/execd-child-process-leak-922

Conversation

@Pangjiping
Copy link
Copy Markdown
Collaborator

Summary

When a foreground command was cancelled (client disconnect, timeout, or DELETE /command), only the bash group leader received SIGKILL — child processes spawned via & or pipelines kept running as orphans because exec.CommandContext's internal kill targets a single pid, and killPid sent signals to the leader only.

Fix runCommand's ctx.Done() branch to send SIGKILL to -pid (the whole group, since the leader is launched with Setpgid: true), mirroring runBackgroundCommand. Rewrite killPid to signal -pid for SIGTERM/SIGKILL and to use kill(-pid, 0) for liveness probing, so Interrupt() also terminates descendants.

Adds regression tests covering both cancel and Interrupt paths.

Fixes #922

Testing

  • Not run (explain why)
  • Unit tests
  • Integration tests
  • e2e / manual verification

Breaking Changes

  • None
  • Yes (describe impact and migration path)

Checklist

  • Linked Issue or clearly described motivation
  • Added/updated docs (if needed)
  • Added/updated tests (if needed)
  • Security impact considered
  • Backward compatibility considered

When a foreground command was cancelled (client disconnect, timeout, or
DELETE /command), only the bash group leader received SIGKILL — child
processes spawned via `&` or pipelines kept running as orphans because
exec.CommandContext's internal kill targets a single pid, and killPid
sent signals to the leader only.

Fix runCommand's ctx.Done() branch to send SIGKILL to -pid (the whole
group, since the leader is launched with Setpgid: true), mirroring
runBackgroundCommand. Rewrite killPid to signal -pid for SIGTERM/SIGKILL
and to use kill(-pid, 0) for liveness probing, so Interrupt() also
terminates descendants.

Adds regression tests covering both cancel and Interrupt paths.

Fixes opensandbox-group#922

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@Pangjiping Pangjiping added bug Something isn't working component/execd labels May 20, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5c3e5b3094

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread components/execd/pkg/runtime/command.go
Comment thread components/execd/pkg/runtime/interrupt.go
Comment thread components/execd/pkg/runtime/interrupt.go Outdated
Pangjiping and others added 3 commits May 20, 2026 21:42
killPid now signals the whole process group (-pid). Combined with the
fact that commandClientMap retains finished sessions, a late or retried
Interrupt could otherwise terminate every process in an unrelated
process group whose PGID has reused the recorded PID.

- markCommandFinished clears kernel.pid alongside kernel.running so the
  stale PID is no longer accessible.
- commandSnapshot now reads under c.mu.RLock for a consistent view of
  running/pid relative to markCommandFinished's write under c.mu.Lock.
- Interrupt() (unix and windows) snapshots the kernel and refuses to
  signal when the command has already finished.

Adds a regression test ensuring Interrupt on a completed session returns
an error and that pid is cleared.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
kill(2) on a process group only guarantees delivery to at least one
member, and kill(-pid, 0) keeps reporting the group as observable while
any unreaped zombie lingers. The previous post-SIGKILL probe ran for
only 150ms and then returned a hard error, so Interrupt could surface a
500 even though the kill signal had already been delivered. Likewise on
macOS, SIGKILL on a group that has been reduced to zombies returns
EPERM, which the previous code reported as a kill failure even though
SIGTERM had already taken effect.

- After a successful SIGKILL, log a warning when the probe loop still
  observes the group instead of returning an error.
- When SIGTERM was delivered but the SIGKILL syscall fails (commonly
  EPERM on a zombie-only group), log and return nil — the kill is in
  flight and the kernel will reap the group once Wait() runs.

Adds a regression test that runs killPid against a Setpgid group with
no concurrent reaper, exercising the zombie-lingering path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Execute() defers cancel() for every foreground command, including
successful ones, so the signal-forwarding goroutine's ctx.Done() branch
also fired on the normal-success path. With the new group-wide SIGKILL
on -cmd.Process.Pid, that post-completion signal could hit a recycled
pid/pgid and kill an unrelated process group inside the sandbox.

Gate the goroutine on the existing `done` channel (closed after
cmd.Wait() returns or on start failure): exit cleanly when the command
has finished, so only genuine cancellations — timeout, client abort,
Interrupt — trigger the group kill. A double-check inside the
ctx.Done() branch handles the race where ctx is cancelled at the same
instant cmd.Wait() returns.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown
Collaborator

@hittyt hittyt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@hittyt hittyt merged commit 2772623 into opensandbox-group:main May 25, 2026
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working component/execd

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Child process leak on command cancellation

2 participants