Checklist
- [×] The error occurs when using our provided Docker image.
- [×] I can consistently reproduce the bug across multiple trials or random seeds.
- [×] If the error causes experiment abortion, I've verified that this error is the root
cause, not a secondary error caused by peer workers.
Detailed Information
Describe the bug
When a training job finishes cleanly under scheduler.type=local, the tail of the
log is flooded with warnings such as:
[W ... TCPStore.cpp:...] TCPStore.recvValue failed ... Connection reset by peer
[W ... ProcessGroupNCCL.cpp:...] ProcessGroupNCCL's watchdog got unexpected signal
[W ... HeartbeatMonitor ...] socket error while reading heartbeat ...
No exception is raised and the run's metrics are correct, but the noise makes real errors hard to spot and suggests that the distributed teardown is racy.
Investigation shows there are two distinct races stacked on top of each other, both specific to LocalScheduler:
- Intra-process race: inside each worker, FSDPEngine.destroy() tears down the NCCL process group before the CPU-side gloo group has synchronised, so ranks can reach destroy_process_group() in different orders.
- Inter-process race: LocalScheduler._cleanup_workers iterates workers serially and calls kill_process_tree(..., timeout=3, graceful=True) on each one. Because that helper blocks up to timeout seconds between SIGTERM and the fallback SIGKILL, a 4-rank job spends ~12 s (4 × 3 s) in cleanup. During that window only a single rank is executing its engine.destroy() path, so any CPU barrier placed there can never actually rendezvous, and rank-0 (which hosts the TCPStore server) can in some cases be killed before all clients have disconnected.
The Ray and Slurm backends are not affected:
- ray.kill / actor.destroy.remote() dispatch is asynchronous, so all actors receive the signal within milliseconds;
- scancel <job_id> broadcasts the signal at the cluster layer.
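To make the inter-process race concrete, here is a small stand-in simulation. It uses threading.Barrier instead of a real gloo barrier, and both the function name simulate_teardown and the scaled-down timings (0.3 s between ranks instead of 3 s) are illustrative assumptions, not taken from the code base. It only shows why a barrier cannot rendezvous when ranks are signalled one after another.

```python
import threading
import time


def simulate_teardown(start_delays, barrier_timeout):
    """Return, per rank, whether its teardown barrier rendezvoused.

    start_delays[i] is a stand-in for when rank i receives SIGTERM;
    barrier_timeout is how long a rank can wait before it is killed.
    """
    barrier = threading.Barrier(len(start_delays))
    results = [None] * len(start_delays)

    def rank(i, delay):
        time.sleep(delay)  # this rank has not been signalled yet
        try:
            barrier.wait(timeout=barrier_timeout)
            results[i] = True
        except threading.BrokenBarrierError:
            # Raised on timeout, and for every rank that arrives afterwards.
            results[i] = False

    threads = [
        threading.Thread(target=rank, args=(i, d))
        for i, d in enumerate(start_delays)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    # Serial cleanup: each rank is signalled only after the previous one is done.
    print("serial:  ", simulate_teardown([0.0, 0.3, 0.6, 0.9], barrier_timeout=0.1))
    # Broadcast cleanup: every rank is signalled at (roughly) the same time.
    print("parallel:", simulate_teardown([0.0, 0.0, 0.0, 0.0], barrier_timeout=2.0))
```

With the staggered delays every rank should report a failed barrier, because the first rank gives up before the second one arrives. With simultaneous delivery all four should pass.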
Expected behavior
On clean shutdown of a local-scheduler job:
- every rank enters engine.destroy() within the same small time window;
- NCCL / gloo groups are destroyed after a successful CPU-side barrier;
- the TCPStore server on rank-0 is the last process to exit, so clients never see Connection reset by peer;
- the log tail is quiet.
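One possible shape for the scheduler side of this, as a rough sketch only: signal every worker first, then wait on a single shared deadline, and fall back to SIGKILL for stragglers. The function name terminate_all is hypothetical, it acts on direct subprocess.Popen children rather than whole process trees (unlike kill_process_tree), and it assumes procs[0] is rank-0.

```python
import subprocess
import sys
import time


def terminate_all(procs, timeout=3.0):
    """Signal all workers, then wait on one shared deadline (hypothetical helper)."""
    # Phase 1: deliver SIGTERM to every worker before waiting on any of them,
    # so all ranks start their teardown path at roughly the same time.
    for p in procs:
        if p.poll() is None:
            p.terminate()

    # Phase 2: one deadline for the whole group instead of one per worker.
    deadline = time.monotonic() + timeout
    for p in procs:
        remaining = max(0.0, deadline - time.monotonic())
        try:
            p.wait(timeout=remaining)
        except subprocess.TimeoutExpired:
            pass  # handled in phase 3

    # Phase 3: SIGKILL only the stragglers. Reverse order so that procs[0]
    # (assumed here to be rank-0, the TCPStore host) is killed last.
    for p in reversed(procs):
        if p.poll() is None:
            p.kill()
            p.wait()
    return [p.returncode for p in procs]


if __name__ == "__main__":
    # Four stand-in "ranks" that just sleep; real workers would run engine.destroy().
    workers = [
        subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
        for _ in range(4)
    ]
    start = time.monotonic()
    codes = terminate_all(workers, timeout=3.0)
    print(f"cleanup took {time.monotonic() - start:.2f}s, return codes: {codes}")
```

With this shape the worst-case cleanup time is bounded by roughly one timeout rather than one timeout per worker. It does not by itself guarantee that rank-0 exits last; that ordering would still need to be enforced inside the workers' own teardown.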
Full logs
If possible, provide logs for more detailed information.
To Reproduce
Commit ID
Please provide your Git commit ID.
Environment
Please provide your software and hardware information if you're not using a
containerized environment.
Script
The bash script or YAML configuration to run: