fix(controller): tear down workers in reverse rank order with concurrent dispatch
When ``TrainController.destroy()`` asks the scheduler to kill its workers,
rank-0 is now signalled last instead of first. Rank-0 hosts the global
TCPStore server that all other ranks' ``ProcessGroupNCCL::HeartbeatMonitor``
threads still poll during their final cleanup; killing it first leaves
peers observing a closed socket, which surfaces in stderr at the very
end of a successful run as::

    [W TCPStore.cpp] recvValue failed ... no error
    [W ProcessGroupNCCL.cpp] ... HeartbeatMonitor::runLoop()

Reverse rank order alone is necessary but not sufficient: the original
``LocalScheduler._cleanup_workers`` iterated workers serially and blocked
on ``kill_process_tree(..., timeout=3, graceful=True)`` for each one. A
4-rank job therefore spent ~12s in cleanup, with only one rank inside its
``engine.destroy()`` path at a time. The CPU ``dist.barrier`` added in the
companion FSDP commit could never rendezvous -- every rank timed out and
the NCCL/TCPStore teardown race still fired. The local scheduler now
dispatches SIGTERM to all workers concurrently via daemon threads and
joins them in parallel, so every rank enters ``engine.destroy()`` within
the same small window while rank-0 still receives its signal last (by a
few milliseconds).
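
In rough outline, the concurrent dispatch phase can be sketched as below,
assuming each worker handle exposes a ``pid`` and taking
``kill_process_tree`` as a parameter to keep the snippet self-contained;
the real names and signatures in the scheduler may differ::

    import threading

    def terminate_workers_concurrently(workers, kill_process_tree, timeout=3):
        # ``workers`` is assumed to arrive already ordered rank N-1 .. rank 0,
        # so rank-0 is signalled last even though all signals land within
        # milliseconds of each other.
        threads = []
        for worker in workers:
            # One daemon thread per worker; the SIGTERM -> wait -> SIGKILL
            # escalation still happens inside kill_process_tree, just no
            # longer serially.
            t = threading.Thread(
                target=kill_process_tree,
                args=(worker.pid,),
                kwargs={"timeout": timeout, "graceful": True},
                daemon=True,
            )
            t.start()
            threads.append(t)
        # Parallel join with a bounded wait instead of a per-worker timeout.
        for t in threads:
            t.join(timeout=timeout + 1)
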
Changes
-------
* ``Scheduler.delete_workers`` grows an optional ``reverse_order: bool``
keyword, documented in the abstract API. Existing callers stay
source-compatible.
* ``LocalScheduler._cleanup_workers`` is restructured into three phases:
synchronous port release, concurrent SIGTERM dispatch (one daemon
thread per worker, started in caller-provided order), and parallel
join with a bounded timeout. Per-worker SIGTERM -> wait -> SIGKILL
escalation is preserved via the existing ``kill_process_tree`` helper.
* ``RayScheduler`` honours the flag by iterating workers in reverse
before invoking ``actor.destroy.remote()``. No concurrency change is
needed because Ray's ``.remote()`` is already async-dispatched.
* ``SlurmScheduler`` accepts the keyword for API parity but ignores it,
since ``scancel`` tears down the whole job step atomically.
* ``TrainController.destroy()`` now (sketched after this list):
- passes ``reverse_order=True`` with a ``TypeError`` fallback so
third-party schedulers keep working;
- inspects the ``asyncio.gather(..., return_exceptions=True)`` result
and logs per-rank engine-destroy failures as warnings instead of
silently discarding them;
- documents the new two-phase teardown invariant in its docstring.
* Mock schedulers in ``tests/test_train_controller.py`` and
``tests/test_rollout_controller.py`` accept the new kwarg.
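
A sketch of the controller-side fallback and error logging follows; the
attribute names (``self.engines``, ``self.scheduler``), the phase
ordering, and the synchronous ``delete_workers`` call are illustrative
assumptions rather than the exact ``TrainController`` implementation::

    import asyncio
    import logging

    logger = logging.getLogger(__name__)

    async def destroy(self):
        # Phase 1: tear down every rank's engine concurrently and surface
        # per-rank failures instead of discarding them.
        results = await asyncio.gather(
            *(engine.destroy() for engine in self.engines),
            return_exceptions=True,
        )
        for rank, result in enumerate(results):
            if isinstance(result, Exception):
                logger.warning("rank %d engine.destroy() failed: %s", rank, result)

        # Phase 2: delete workers, rank-0 last; fall back for schedulers
        # that predate the keyword.
        try:
            self.scheduler.delete_workers(reverse_order=True)
        except TypeError:
            self.scheduler.delete_workers()
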
Tests
-----
* ``tests/test_train_controller.py`` asserts ``delete_workers`` is called
  with ``reverse_order=True`` and verifies the ``TypeError`` fallback
  path for legacy schedulers (see the sketch after this list).
* ``tests/test_local_scheduler.py`` verifies that ``reverse_order=True``
produces the expected reverse iteration over workers.
* Verified end-to-end with the HH-RLHF DPO example under
``scheduler.type=local``: the previously reproducible
``TCPStore.recvValue failed`` / HeartbeatMonitor warnings no longer
appear on clean shutdown.
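
A sketch of how that assertion and the fallback path can be covered with
stub schedulers; the stub classes and the ``delete_workers_compat``
helper are illustrative stand-ins, not the actual fixtures in the test
suite::

    class LegacyScheduler:
        # Stub that predates the ``reverse_order`` keyword.
        def __init__(self):
            self.calls = []

        def delete_workers(self):
            self.calls.append("delete_workers")

    class ModernScheduler(LegacyScheduler):
        def delete_workers(self, reverse_order=False):
            self.calls.append(("delete_workers", reverse_order))

    def delete_workers_compat(scheduler):
        # Mirrors the controller's TypeError fallback.
        try:
            scheduler.delete_workers(reverse_order=True)
        except TypeError:
            scheduler.delete_workers()

    def test_reverse_order_forwarded():
        scheduler = ModernScheduler()
        delete_workers_compat(scheduler)
        assert scheduler.calls == [("delete_workers", True)]

    def test_legacy_scheduler_falls_back():
        scheduler = LegacyScheduler()
        delete_workers_compat(scheduler)
        assert scheduler.calls == ["delete_workers"]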