
Enforce per-op timeouts on init and coordination collectives#64

Open
mmshad wants to merge 3 commits into main from collective-timeouts

Conversation

Collaborator

@mmshad mmshad commented Apr 22, 2026

Closes #63.

What

Make the declared per-op timeouts on init and coordination collectives actually enforced. Today they fall through to the 1800s process-group default, so a wedged peer hangs the job for 30 minutes regardless of the caller's intent.

  • kempnerforge/resilience/health.py::check_nccl_health switches to async_op=True + Work.wait(timeout=timedelta(seconds=timeout_sec)) and returns False on either timeout signaling path (RuntimeError on modern PyTorch, False return on legacy backends).
  • kempnerforge/distributed/setup.py adds _barrier_with_timeout(seconds, reason) and uses it for the DeviceMesh-construction barrier with a 60s budget. Timeouts raise RuntimeError with the reason string attached so the failure points at the right phase of init.
  • kempnerforge/training/eval.py wraps the PP eval loss broadcast in a 300s budget with a nan fallback so a wedged PP stage surfaces as a clear bad-iteration signal instead of a half-hour hang.
  • kempnerforge/distributed/setup.py::_set_nccl_env setdefaults TORCH_NCCL_ASYNC_ERROR_HANDLING=1 and logs a warning when a user has explicitly set it to 0. Without this, every per-op bound in this patch is advisory.
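The wait logic behind the first bullet can be sketched as a small helper that tolerates both timeout signaling paths. This is illustrative: `wait_with_timeout` and the handle classes are not names from the patch, only the `Work.wait(timeout=timedelta(...))` pattern is.

```python
from datetime import timedelta


def wait_with_timeout(work, timeout_sec: float) -> bool:
    """Wait on an async collective handle with a per-op bound.

    Returns False on either timeout signaling path: modern PyTorch
    raises RuntimeError from Work.wait, while legacy backends return
    False. A None return (older wait() signatures) counts as success.
    """
    try:
        result = work.wait(timeout=timedelta(seconds=timeout_sec))
    except RuntimeError:
        return False
    return result is not False
```

In the real `check_nccl_health`, `work` would be the handle returned by `dist.all_reduce(..., async_op=True)`.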

Known limit, train_state broadcast

The original commit also wired a 600s async_op=True budget around the train_state broadcast in CheckpointManager.load. PyTorch 2.11's dist.broadcast_object_list does not accept async_op (TypeError at runtime), so the second commit on this branch reverts that one site to the plain synchronous form and leaves an explanatory comment. The other three call sites still get fast-fail; a wedged rank surfaces via those before the broadcast hits the 1800s default.
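One defensive way to handle this kind of API gap, instead of hardcoding the PyTorch version, is to feature-detect the kwarg before choosing the async path. This is a sketch of an alternative, not what the branch does; the branch simply reverts the site to the synchronous call.

```python
import inspect


def supports_async_op(collective) -> bool:
    """Return True if a collective accepts an async_op kwarg.

    Sketch of a guard for sites like dist.broadcast_object_list,
    which lacks async_op in PyTorch 2.11; a caller could fall back
    to the synchronous form when this returns False.
    """
    try:
        params = inspect.signature(collective).parameters
    except (TypeError, ValueError):
        # Extension functions may not expose a signature; assume the
        # kwarg is absent and stay synchronous.
        return False
    return "async_op" in params
```

The trade-off is that signature inspection says nothing about behavior, so the explanatory comment in the reverted site is still the clearer record of why the call is synchronous.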

Tests

tests/unit/test_collective_timeouts.py (10 new tests, all passing locally):

  • TestCheckNcclHealthTimeout covers Work.wait(timedelta(...)) forwarding, RuntimeError and False-return timeout paths, and the not-initialized early return.
  • TestBarrierWithTimeout covers the reason-string error message, the False-return path, and the success path.
  • TestNcclAsyncErrorHandlingEnvGuard covers the unset/explicit-1/explicit-0 cases for TORCH_NCCL_ASYNC_ERROR_HANDLING.

The test file documents the absence of a TestCheckpointBroadcastTimeout class with a comment pointing at the PyTorch 2.11 limitation.
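A minimal version of the timeout-forwarding tests looks like the following, using a mock handle and a stand-in for the health-check wait logic (names here are illustrative, not copied from `tests/unit/test_collective_timeouts.py`):

```python
from datetime import timedelta
from unittest import mock


def check_health(launch, timeout_sec):
    # Stand-in for check_nccl_health's wait logic: launch() returns an
    # async Work-like handle and the caller's bound is forwarded to it.
    work = launch()
    try:
        return work.wait(timeout=timedelta(seconds=timeout_sec)) is not False
    except RuntimeError:
        return False


def test_timeout_is_forwarded_as_timedelta():
    work = mock.Mock()
    work.wait.return_value = True
    assert check_health(lambda: work, 10) is True
    work.wait.assert_called_once_with(timeout=timedelta(seconds=10))


def test_runtime_error_path_reports_unhealthy():
    work = mock.Mock()
    work.wait.side_effect = RuntimeError("NCCL timeout")
    assert check_health(lambda: work, 10) is False
```

The mock keeps the tests free of any process-group setup while still pinning down the `timedelta` forwarding contract.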

mmshad added 2 commits April 21, 2026 21:22
check_nccl_health accepted a timeout_sec parameter and then called
dist.all_reduce synchronously, inheriting the 1800s PG default
instead. Every caller asking for a 10s liveness probe got a 30-minute
one. Switch to async_op=True + Work.wait(timeout=timedelta(...)) so
the caller's bound is actually enforced, and handle both the
RuntimeError and legacy False-return timeout paths.

Add _barrier_with_timeout for the DeviceMesh-construction barrier
(60s) so a stuck rank during init fails fast instead of after 30min.
Wrap the PP eval loss broadcast in a 300s budget that falls back to
nan loss, and the train_state broadcast_object_list in a 600s budget
that raises a diagnostic RuntimeError — both previously inherited the
PG default.

Force TORCH_NCCL_ASYNC_ERROR_HANDLING=1 in _set_nccl_env when unset,
and warn loudly when a user has explicitly set it to 0 (at which
point NCCL timeouts become advisory and none of the other bounds
matter).
dist.broadcast_object_list does not accept async_op in PyTorch 2.11, so
the sub-fix added in 330e158 raised TypeError on every distributed load.
The init-path barrier, NCCL health, and eval-loss paths in that patch all
remain fast-fail; only the object broadcast inherits the 1800s PG default
now, which matches the pre-patch behavior.
@mmshad mmshad self-assigned this Apr 22, 2026
@mmshad mmshad requested a review from Naeemkh April 22, 2026 01:42
PyTorch's distributed stubs type dist.barrier/all_reduce/broadcast as
returning Work | None because the sync path (async_op=False) returns
None. Pyright cannot narrow the return when async_op=True is a literal
kwarg, so every .wait(...) on the async handle flags
reportOptionalMemberAccess. Add the same # type: ignore the rest of
the codebase uses at three sites: _barrier_with_timeout, the eval-loss
broadcast, and check_nccl_health.


Successfully merging this pull request may close these issues.

Init and coordination collectives ignore declared per-op timeouts
