Commit e7230e2
fix(engine): synchronize on CPU group before destroying NCCL PG

Before calling ``dist.destroy_process_group()``, FSDPEngine and
MegatronEngine now perform a barrier on the gloo CPU subgroup. This
ensures every rank finishes its final NCCL abort together instead of
having rank-0 tear down the global TCPStore while non-zero ranks'
HeartbeatMonitor threads are still polling it.

Symptom this fixes: a noisy stderr backtrace at the very end of a
successful training run, e.g.

    [W TCPStore.cpp] recvValue failed on SocketImpl ... no error
    [W ProcessGroupNCCL.cpp] ... HeartbeatMonitor::runLoop()

emitted from the NCCL HeartbeatMonitor C++ thread on non-zero ranks
after rank-0's TCPStore server has already been shut down.

Also make ``FSDPEngine.destroy()`` idempotent by flipping
``own_global_group`` to False after tearing the group down, matching
``MegatronEngine.destroy()``. This protects against double-destroy
from future cleanup hooks.
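
The idempotency guard described above can be sketched roughly as follows. This is a minimal illustration, not the actual ``FSDPEngine`` code: the class ``EngineSketch`` and the injected ``destroy_group`` callable are hypothetical stand-ins, and only the ``own_global_group`` flag and the flip-after-teardown order are taken from the commit message.

```python
from typing import Callable


class EngineSketch:
    """Hypothetical stand-in for an engine that owns the global process group."""

    def __init__(self, destroy_group: Callable[[], None]) -> None:
        # In a real engine this callable would presumably wrap
        # dist.destroy_process_group(); it is injected here (an assumption)
        # so the sketch runs without a distributed setup.
        self._destroy_group = destroy_group
        self.own_global_group = True

    def destroy(self) -> None:
        # A second call finds the flag already cleared and does nothing.
        if not self.own_global_group:
            return
        self._destroy_group()
        # Flip the flag only after the group has been torn down.
        self.own_global_group = False
```

With this guard, a later cleanup hook that calls ``destroy()`` again returns immediately instead of trying to destroy a group that no longer exists.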

The barrier is wrapped in try/except so a half-dead process group at
teardown never turns a warning into a hard failure.

1 parent ae8c792 commit e7230e2
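
The teardown order described in the commit message (barrier on the CPU subgroup, then destroy, with the barrier guarded by try/except) might look roughly like the sketch below. The helper name ``destroy_with_cpu_barrier`` and the ``cpu_group`` parameter are assumptions made for illustration. ``dist`` is meant to be the ``torch.distributed`` module, passed in as an argument so the sketch has no import-time dependency. The real engine code may be structured differently.

```python
import logging
from typing import Any, Optional

logger = logging.getLogger(__name__)


def destroy_with_cpu_barrier(dist: Any, cpu_group: Optional[Any]) -> bool:
    """Synchronize ranks on a CPU (gloo) subgroup, then destroy the process group.

    `dist` is expected to behave like the torch.distributed module;
    `cpu_group` is assumed to be a gloo-backed subgroup created earlier.
    Returns True if the barrier completed, False otherwise.
    """
    if not dist.is_initialized():
        return False

    synced = False
    try:
        # Every rank waits here, so no rank starts teardown while others
        # are still finishing their final collective work.
        dist.barrier(group=cpu_group)
        synced = True
    except Exception as exc:  # deliberately broad: teardown must not raise
        logger.warning("CPU barrier before teardown failed: %s", exc)

    # Destroy the group whether or not the barrier succeeded.
    dist.destroy_process_group()
    return synced
```

A caller would pass the real module and its own subgroup, for example ``destroy_with_cpu_barrier(torch.distributed, cpu_group)``. Swallowing the barrier exception matches the stated intent that a half-dead group at teardown should produce at most a warning.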
2 files changed
Lines changed: 31 additions & 0 deletions
| Changed file | Added lines (new-file numbering) | Additions |
|---|---|---|
| 1 of 2 | 422-435, 437-440 | 18 |
| 2 of 2 | 513-525 | 13 |