fix(cells): Use process groups and properly handle child process reaping (#599)
* fix(cells): handle exit codes and prevent process orphaning on stop
Stopping a tracked executable previously left its grandchildren alive
on the host. tokio::process::Child::kill signals only the leader PID;
the other members of the leader's process group are never signaled,
even when process_group(0) was set on the Command. The fix makes each
executable the leader of its own process group via process_group(0),
then signals the entire group with killpg(SIGKILL) on stop.
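A minimal sketch of the spawn-then-killpg pattern described above, using only std rather than tokio and nix. The names Tracked, spawn_in_own_group and kill_group are hypothetical and do not appear in auraed; the killpg binding is declared by hand purely to keep the example free of external crates.

```rust
use std::io;
use std::os::unix::process::CommandExt;
use std::process::{Child, Command};

// Hand-written binding so the sketch needs no external crates (std already
// links libc on Unix). The `unsafe extern` form needs Rust 1.82 or newer.
unsafe extern "C" {
    fn killpg(pgrp: i32, sig: i32) -> i32;
}

const SIGKILL: i32 = 9;

/// Hypothetical stand-in for a tracked executable.
struct Tracked {
    child: Child,
    /// Captured at spawn so the group can be signaled later without
    /// asking the Child handle for its PID again.
    pgid: i32,
}

/// Spawn `script` under `sh -c` as the leader of a new process group.
fn spawn_in_own_group(script: &str) -> io::Result<Tracked> {
    let child = Command::new("sh")
        .arg("-c")
        .arg(script)
        .process_group(0) // 0 = use the child's own PID as its PGID
        .spawn()?;
    let pgid = child.id() as i32;
    Ok(Tracked { child, pgid })
}

/// SIGKILL every member of the group, then reap the leader.
fn kill_group(mut tracked: Tracked) -> io::Result<()> {
    let rc = unsafe { killpg(tracked.pgid, SIGKILL) };
    // Read errno before any other call can overwrite it.
    let signaled = if rc == 0 {
        Ok(())
    } else {
        Err(io::Error::last_os_error())
    };
    // Always reap the leader, even if killpg failed; report the killpg
    // error only after cleanup.
    let _ = tracked.child.wait();
    signaled
}
```

Calling spawn_in_own_group("sleep 30 & sleep 30 & wait") followed by kill_group should signal the shell and both background sleeps, since all three share the leader's PGID.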
Adds:
- Executable::start: process_group(0) so the spawned child is its own
PGID leader. Captures the leader PID at spawn so killpg targets the
right group even after Tokio reaps the child internally.
- Executable::kill: replaces child.kill() with killpg(SIGKILL) on the
captured PGID. Always reaps the child and joins the stdout/stderr
reader tasks even if killpg fails, surfacing the killpg error after
cleanup.
- Executable::pid: now infallible (Option<Pid>); pid is read from the
captured field, not from a possibly-reaped Child.
- NestedAuraed::kill: tolerates ESRCH from nix::sys::signal::kill
(process already gone) so cell teardown is idempotent.
- Executables::stop: distinguishes 'never inserted' (ExecutableNotFound)
from 'process already exited' (new ExecutableAlreadyExited variant)
via ESRCH/ECHILD classification on the io::Error. Cache is evicted
in both cases. Other errors still propagate as FailedToStopExecutable.
- Executables::broadcast_stop: logs kill failures instead of dropping
them silently.
- ExecutablesError::ExecutableAlreadyExited: new variant; mapped to
Status::not_found in CellsServiceError -> Status.
- CellService::stop: reads pid from the infallible pid() and decouples
observe-service channel cleanup from the stop result so channels are
unregistered even when kill fails. Translates ExecutableNotFound and
ExecutableAlreadyExited to an Ok response for idempotency.
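The ESRCH/ECHILD classification in Executables::stop could look roughly like the following. This is a sketch under assumptions, not the code from this change: StopOutcome and classify_stop_error are hypothetical names, and the errno values are hard-coded Linux constants where real code would more likely take them from libc or nix.

```rust
use std::io;

// Linux errno values, hard-coded to keep the sketch dependency-free.
const ESRCH: i32 = 3; // no such process
const ECHILD: i32 = 10; // no child processes

/// Hypothetical result of classifying a failed stop.
#[derive(Debug, PartialEq)]
enum StopOutcome {
    /// The process was already gone; callers can treat stop as a no-op.
    AlreadyExited,
    /// Any other failure; callers should propagate it.
    Failed,
}

/// Map an io::Error from kill/wait onto the two outcomes above.
fn classify_stop_error(err: &io::Error) -> StopOutcome {
    match err.raw_os_error() {
        Some(ESRCH) | Some(ECHILD) => StopOutcome::AlreadyExited,
        // Includes errors that carry no OS error code at all.
        _ => StopOutcome::Failed,
    }
}
```

Matching on raw_os_error() rather than ErrorKind keeps the check tied to the exact errno, which matters here because neither ESRCH nor ECHILD has a dedicated stable ErrorKind.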
Tests (auraed/tests/cell_start_stop_delete.rs):
- cells_start_stop_delete: happy-path allocate / start / stop / free.
- cells_stop_kills_entire_process_group: regression test for #534.
Spawns a bash wrapper that forks two background sleeps; after stop,
every PID in the leader's PGID must be gone within 3s.
- cells_double_stop_is_idempotent: pins that a second stop returns Ok.
- cells_stop_after_natural_exit_is_ok: stop on a process that has
already exited on its own returns Ok (drives the ESRCH/ECHILD
classification path).
Unit tests (executables.rs):
- start_should_cache_pid_and_reject_duplicates
- stop_after_natural_exit_returns_ok_and_evicts
- stop_unknown_name_returns_not_found
* fix(vms): remove SIGCHLD SIG_IGN that auto-reaped cells children
VirtualMachines::new called libc::signal(SIGCHLD, SIG_IGN). The
disposition is inherited across execve into every nested auraed, so
the kernel auto-reaped spawned children. That made
Executable::kill's child.wait().await return ECHILD on every cells
stop and caused waitpid in nested_auraed to hang waiting for a
SIGCHLD that the kernel never delivered.
Cloud Hypervisor's Vm::HANDLED_SIGNALS and Vmm::HANDLED_SIGNALS do
not include SIGCHLD, so the block_signal loops below are unaffected.
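The ECHILD symptom can be reproduced in isolation with a small std-only sketch. It assumes Linux on x86_64 or aarch64 (the SIGCHLD number differs on some other architectures) and a `true` binary on PATH; wait_fails_with_echild is a hypothetical name, and the signal binding is declared by hand to avoid external crates.

```rust
use std::process::Command;

// Hand-written binding; sighandler_t is passed as a pointer-sized integer.
// The `unsafe extern` form needs Rust 1.82 or newer.
unsafe extern "C" {
    fn signal(signum: i32, handler: usize) -> usize;
}

const SIGCHLD: i32 = 17; // Linux x86_64 / aarch64
const SIG_IGN: usize = 1;
const ECHILD: i32 = 10;

/// Ignore SIGCHLD, run a short-lived child, and report whether waiting
/// on it failed with ECHILD because the kernel had already reaped it.
fn wait_fails_with_echild() -> bool {
    // Process-wide change: with SIG_IGN the kernel reaps children itself.
    let previous = unsafe { signal(SIGCHLD, SIG_IGN) };
    let result = Command::new("true")
        .spawn()
        .and_then(|mut child| child.wait());
    // Restore the earlier disposition so later waits behave normally.
    unsafe { signal(SIGCHLD, previous) };
    result.err().and_then(|e| e.raw_os_error()) == Some(ECHILD)
}
```

Because the disposition is process-wide and survives execve, a single such call early in startup is enough to affect every later wait in the process and in the programs it execs.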
---------
Co-authored-by: dominic <510002+dmah42@users.noreply.github.com>