fix(infra): add two-phase teardown to prevent TCPStore race at shutdown
Problem:
During teardown, rank-0 (TCPStore server owner) could exit before peer
ranks finished their final NCCL abort, causing noisy "TCPStore.recvValue
failed" / "Broken pipe" warnings on stderr.
Root cause:
All ranks were killed simultaneously, with no distributed barrier on the
CPU (gloo) group beforehand, so the NCCL communicators and the TCPStore
were torn down in no coordinated order.
Solution:
Implement a two-phase teardown protocol:
Phase 1 - Engine destroy: call engine.destroy() on every worker
concurrently. The engine-side destroy() now executes a CPU barrier
(dist.barrier on a gloo process group) followed by
dist.destroy_process_group(), ensuring all ranks leave the NCCL
collective together.
Phase 2 - Process kill: only after the barrier completes, kill the
actual processes (Ray: remove placement groups; Slurm: scancel;
Local: process tree cleanup).
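
A minimal sketch of the two-phase ordering described above, not the code in
this commit. The names two_phase_teardown, destroy_engine and kill_worker
are hypothetical, and continuing to phase 2 after a timeout is an assumption
made so that a hung rank cannot block cleanup forever.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def two_phase_teardown(workers, destroy_engine, kill_worker, timeout_s=60.0):
    """Destroy engines on all workers concurrently, then kill the workers.

    destroy_engine and kill_worker are hypothetical callables that take one
    worker handle. Returns the workers whose destroy failed or timed out.
    """
    workers = list(workers)
    if not workers:
        return []

    # Phase 1: one thread per worker, so every rank can reach the barrier
    # inside its destroy call. A sequential loop would block on the first
    # rank, because the barrier needs all ranks to arrive.
    pool = ThreadPoolExecutor(max_workers=len(workers))
    futures = {pool.submit(destroy_engine, w): w for w in workers}
    done, not_done = wait(futures, timeout=timeout_s)
    pool.shutdown(wait=False, cancel_futures=True)

    failed = [futures[f] for f in not_done]
    failed += [futures[f] for f in done if f.exception() is not None]

    # Phase 2: kill processes only after phase 1 has finished or timed out.
    for w in workers:
        kill_worker(w)
    return failed
```

The key property is that no kill_worker call runs until every destroy_engine
call has returned, raised, or hit the timeout.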
Changes:
- engine (fsdp/megatron/archon): add _cpu_group + pre-destroy barrier
- train_controller: two-phase destroy (engines first, then workers)
- scheduler/ray: _cleanup_workers with ray.wait timeout + PG removal
- scheduler/slurm: _destroy_engines_on_workers via HTTP before scancel
- scheduler/local: graceful engine teardown before SIGKILL
- scheduler_api: add reverse_order param to delete_workers interface
- tests: update test_train_controller, add test_local_scheduler
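
For illustration, the engine-side change in the first item might look roughly
like the sketch below. EngineTeardown, setup() and the dist constructor
argument are hypothetical (the argument exists only so the ordering can be
checked without a real process group); the torch.distributed calls
(new_group, barrier, is_initialized, destroy_process_group) are the real API.

```python
class EngineTeardown:
    """Hypothetical helper showing the barrier-then-destroy ordering."""

    def __init__(self, dist=None):
        if dist is None:
            # Default to the real backend; imported lazily so the class
            # can be exercised with a stand-in object.
            import torch.distributed as dist
        self._dist = dist
        self._cpu_group = None

    def setup(self):
        # Call after init_process_group(): a separate gloo group gives a
        # CPU-side barrier that does not depend on NCCL communicators.
        self._cpu_group = self._dist.new_group(backend="gloo")

    def destroy(self):
        if not self._dist.is_initialized():
            return
        if self._cpu_group is not None:
            # Every rank waits here, so no rank (including the rank-0
            # TCPStore owner) tears down before its peers are ready.
            self._dist.barrier(group=self._cpu_group)
        self._dist.destroy_process_group()
        self._cpu_group = None
```

Because the barrier runs on gloo, it can still complete if an NCCL
communicator is already in a bad state during shutdown.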
Tested: DPO on 4xH20 with the Ray scheduler; clean teardown, no TCPStore warnings.