[BUG] TCPStore.recvValue failed / HeartbeatMonitor warnings on clean shutdown with scheduler.type=local #1245

## Checklist

- [x] The error occurs when using our provided Docker image.
- [x] I can consistently reproduce the bug across multiple trials or random seeds.
- [x] If the error causes experiment abortion, I've verified that this error is the root
  cause, not a secondary error caused by peer workers.

## Detailed Information

### Describe the bug

When a training job finishes cleanly under `scheduler.type=local`, the tail of the
log is flooded with warnings such as:

    [W ... TCPStore.cpp:...] TCPStore.recvValue failed ... Connection reset by peer
    [W ... ProcessGroupNCCL.cpp:...] ProcessGroupNCCL's watchdog got unexpected signal
    [W ... HeartbeatMonitor ...] socket error while reading heartbeat ...

No exception is raised and the run's metrics are correct, but the noise makes real errors hard to spot and suggests that the distributed teardown is racy.

Investigation shows there are two distinct races stacked on top of each other, both specific to `LocalScheduler`:

1. **Intra-process race**: inside each worker, `FSDPEngine.destroy()` tears down the NCCL process group before the CPU-side gloo group has synchronised, so ranks can reach `destroy_process_group()` at different times.
2. **Inter-process race**: `LocalScheduler._cleanup_workers` iterates over workers serially and calls `kill_process_tree(..., timeout=3, graceful=True)` on each one. That helper blocks for up to `timeout` seconds between SIGTERM and the fallback SIGKILL, so a 4-rank job can spend up to ~12 s (4 × 3 s) in cleanup. During that window only one rank at a time is executing its `engine.destroy()` path, so a CPU barrier placed there can never rendezvous, and rank-0 (which hosts the TCPStore server) can be killed before all clients have disconnected.
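
To illustrate the inter-process race, here is a minimal sketch of a two-phase cleanup that signals every worker first and then waits on one shared deadline, so the worst case is roughly `timeout` seconds instead of `n_workers * timeout`. It uses only the standard library. `terminate_all` and `spawn_sleepers` are hypothetical names, not part of `LocalScheduler`, and unlike `kill_process_tree` this sketch does not walk child processes.

```python
import subprocess
import sys
import time


def terminate_all(procs, timeout=3.0):
    """Send SIGTERM to every worker first, then wait on one shared deadline."""
    # Phase 1: signal everyone up front so all ranks begin teardown together.
    for p in procs:
        if p.poll() is None:
            p.terminate()  # SIGTERM on POSIX

    # Phase 2: a single deadline for the whole group, not one per worker.
    deadline = time.monotonic() + timeout
    for p in procs:
        remaining = max(0.0, deadline - time.monotonic())
        try:
            p.wait(timeout=remaining)
        except subprocess.TimeoutExpired:
            p.kill()  # SIGKILL fallback, only for stragglers
            p.wait()
    return [p.returncode for p in procs]


def spawn_sleepers(n):
    """Hypothetical stand-ins for rank processes; each one just sleeps."""
    cmd = [sys.executable, "-c", "import time; time.sleep(60)"]
    return [subprocess.Popen(cmd) for _ in range(n)]
```

A call such as `terminate_all(spawn_sleepers(4), timeout=3.0)` returns once every process has exited. Whether rank-0 should additionally be signalled last, so the TCPStore server outlives its clients, is a separate design decision that this sketch does not address.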

The `Ray` and `Slurm` backends are not affected:
- `ray.kill` / `actor.destroy.remote()` dispatch is asynchronous, so all actors receive the signal within milliseconds of each other;
- `scancel <job_id>` broadcasts the signal at the cluster layer.

### Expected behavior

On clean shutdown of a local-scheduler job:
- every rank enters `engine.destroy()` within the same small time window;
- NCCL / gloo groups are destroyed after a successful CPU-side barrier;
- the TCPStore server on rank-0 is the last process to exit, so clients never see `Connection reset by peer`;
- the log tail is quiet.
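
The worker-side ordering described above could look roughly like the following sketch. It is not the actual `FSDPEngine.destroy()` code. `ordered_teardown` is a hypothetical helper, `dist` is assumed to be the `torch.distributed` module (passed in so the call order can be exercised without GPUs), and `cpu_group` is assumed to be a gloo group created earlier, for example with `dist.new_group(backend="gloo")`.

```python
def ordered_teardown(dist, cpu_group):
    """Barrier on the CPU (gloo) group, then destroy all process groups.

    `dist` is expected to be the `torch.distributed` module.
    Returns True if teardown ran, False if nothing was initialised.
    """
    if not dist.is_initialized():
        return False
    # Rendezvous on the CPU-side group first so no rank tears down NCCL early.
    dist.barrier(group=cpu_group)
    # Only after every rank has arrived: destroy the default group and subgroups.
    dist.destroy_process_group()
    return True
```

In a worker this would be called as `ordered_teardown(torch.distributed, cpu_group)`. The barrier blocks until every rank arrives, so it only helps if the scheduler signals all ranks at about the same time; with serial cleanup it would simply hang until the fallback SIGKILL.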


### Full logs

If possible, provide logs for more detailed information.

<img width="2625" height="1070" alt="Image" src="https://github.com/user-attachments/assets/bf8a9e8b-1748-4e01-9f8c-1154a6061af7" />

## To Reproduce

### Commit ID

Please provide your Git commit ID.

### Environment

Please provide your software and hardware information if you're not using a
containerized environment.

### Script

The bash script or YAML configuration to run:

