[BUG] TCPStore.recvValue failed / HeartbeatMonitor warnings on clean shutdown with scheduler.type=local #1245

@HT-Yuan

Checklist

  • [×] The error occurs when using our provided Docker image.
  • [×] I can consistently reproduce the bug across multiple trials or random seeds.
  • [×] If the error causes experiment abortion, I've verified that this error is the root
    cause, not a secondary error caused by peer workers.

Detailed Information

Describe the bug

When a training job finishes cleanly under scheduler.type=local, the tail of the
log is flooded with warnings such as:
[W ... TCPStore.cpp:...] TCPStore.recvValue failed ... Connection reset by peer
[W ... ProcessGroupNCCL.cpp:...] ProcessGroupNCCL's watchdog got unexpected signal
[W ... HeartbeatMonitor ...] socket error while reading heartbeat ...

No exception is raised and the run's metrics are correct, but the noise makes real errors hard to spot and suggests that the distributed teardown is racy.

Investigation shows there are two distinct races stacked on top of each other, both specific to LocalScheduler:

  1. Intra-process race: inside each worker, FSDPEngine.destroy() tears down the NCCL process group before the CPU-side gloo group has synchronised, so ranks can reach destroy_process_group() at different times.
  2. Inter-process race: LocalScheduler._cleanup_workers iterates over workers serially and calls kill_process_tree(..., timeout=3, graceful=True) on each one. Because that helper blocks for up to timeout seconds between SIGTERM and the fallback SIGKILL, a 4-rank job can spend ~12 s (4 × 3 s) in cleanup. During that window only a single rank is executing its engine.destroy() path, so any CPU barrier placed there can never actually rendezvous, and rank-0 (which hosts the TCPStore server) can in some cases be killed before all clients have disconnected.
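
To make the inter-process race concrete, here is a minimal, self-contained sketch that contrasts serial cleanup with a "signal everyone first, then wait" cleanup. It is not the LocalScheduler code: the child script, `spawn`, `cleanup_serial` and `cleanup_broadcast` are hypothetical names made up for illustration, plain `subprocess` calls stand in for kill_process_tree, and child process trees are not handled. It assumes a POSIX system.

```python
import signal
import subprocess
import sys
import time

# Hypothetical stand-in for a worker: on SIGTERM it spends `work` seconds in
# its teardown path (think engine.destroy()) and then exits cleanly.
CHILD = """
import signal, sys, time
def on_term(signum, frame):
    time.sleep(float(sys.argv[1]))
    sys.exit(0)
signal.signal(signal.SIGTERM, on_term)
print("ready", flush=True)
while True:
    time.sleep(60)
"""


def spawn(n, work):
    """Start n fake workers and wait until each has installed its handler."""
    procs = [
        subprocess.Popen(
            [sys.executable, "-c", CHILD, str(work)],
            stdout=subprocess.PIPE,
            text=True,
        )
        for _ in range(n)
    ]
    for p in procs:
        p.stdout.readline()  # blocks until the child prints "ready"
        p.stdout.close()
    return procs


def _wait_or_kill(p, timeout):
    """Wait up to `timeout` seconds, then fall back to SIGKILL."""
    try:
        p.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        p.kill()
        p.wait()


def cleanup_serial(procs, timeout):
    """Signal and wait for one worker at a time (the pattern described above)."""
    start = time.monotonic()
    for p in procs:
        p.send_signal(signal.SIGTERM)
        _wait_or_kill(p, timeout)
    return time.monotonic() - start


def cleanup_broadcast(procs, timeout):
    """Signal every worker first, then wait for all against one shared deadline."""
    start = time.monotonic()
    for p in procs:
        p.send_signal(signal.SIGTERM)
    deadline = start + timeout
    for p in procs:
        _wait_or_kill(p, max(0.0, deadline - time.monotonic()))
    return time.monotonic() - start
```

Calling `cleanup_serial(spawn(4, 0.5), timeout=3)` takes roughly the sum of the per-worker teardown times, because each worker only starts tearing down once the previous one has exited. `cleanup_broadcast` takes roughly the longest single teardown, and all workers are inside their teardown path at the same time, which is the precondition for any barrier placed there to rendezvous.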

The Ray and Slurm backends are not affected:

  • ray.kill / actor.destroy.remote() dispatch is asynchronous, so all actors receive the signal within milliseconds;
  • scancel <job_id> broadcasts the signal at the cluster layer.

Expected behavior

On clean shutdown of a local-scheduler job:

  • every rank enters engine.destroy() within the same small time window;
  • NCCL / gloo groups are destroyed after a successful CPU-side barrier;
  • the TCPStore server on rank-0 is the last process to exit, so clients never see Connection reset by peer;
  • the log tail is quiet.
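
The ordering in the list above can be modelled with a small toy simulation. This is not torch.distributed code: threads stand in for ranks, a `threading.Barrier` stands in for the CPU-side (gloo) barrier, and a semaphore stands in for clients disconnecting from the TCPStore. The function name `run_shutdown` and its parameters are assumptions made for illustration only.

```python
import threading


def run_shutdown(world_size, barrier_timeout=5.0):
    """Simulate a clean teardown and return the order in which ranks exit."""
    barrier = threading.Barrier(world_size)
    disconnected = threading.Semaphore(0)  # released once per departed client
    exit_order = []
    lock = threading.Lock()

    def rank_main(rank):
        # 1. CPU-side rendezvous: no rank tears anything down until all have
        #    arrived. Raises BrokenBarrierError if a rank is missing at timeout.
        barrier.wait(timeout=barrier_timeout)

        # 2. Placeholder for destroying the communication groups.

        if rank != 0:
            # 3a. Clients disconnect from the store and exit.
            with lock:
                exit_order.append(rank)
            disconnected.release()
        else:
            # 3b. Rank 0 hosts the store, so it waits for every client to
            #     disconnect before it exits.
            for _ in range(world_size - 1):
                disconnected.acquire()
            with lock:
                exit_order.append(0)

    threads = [
        threading.Thread(target=rank_main, args=(r,)) for r in range(world_size)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return exit_order
```

In this model `run_shutdown(4)` returns a list whose last element is always 0; the order of the other ranks is not deterministic. The barrier only works if every rank reaches it within the timeout, which is why the scheduler has to signal all workers at roughly the same time.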
