Commit 65914f9
fix(configs): size dense multi-node NCCL world by inference GPU count (#2707)
The dense multi-node external-LB weight-broadcast world size was computed as total_infer_nodes * api_server_count * tp. However, api_server_count can resolve to the global DP size (e.g. when parallel.dp is set, or via validator ordering), which already includes the node dimension. Multiplying by total_infer_nodes then double-counts nodes, so the trainer's NCCL broadcast waits for more ranks than exist and init deadlocks.
Every allocated inference GPU is one NCCL rank, and the external-LB launcher starts dp_per_node TP-sharded servers per node (gpus_per_node workers/node), so size the world directly as total_infer_nodes * gpus_per_node. This matches the disaggregated path.
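The arithmetic behind the fix can be sketched as follows. This is a minimal illustration and not the repository's code: the function names and the example cluster shape (2 nodes, 8 GPUs per node, tp=2) are hypothetical, and only the two formulas come from the commit message. It counts inference-side ranks only.

```python
# Hypothetical helpers; only the two formulas are taken from the commit message.

def old_world_size(total_infer_nodes: int, api_server_count: int, tp: int) -> int:
    """Previous formula: correct only if api_server_count is a per-node count."""
    return total_infer_nodes * api_server_count * tp


def new_world_size(total_infer_nodes: int, gpus_per_node: int) -> int:
    """New formula: one NCCL rank per allocated inference GPU."""
    return total_infer_nodes * gpus_per_node


# Assumed example cluster shape (not from the source).
total_infer_nodes = 2
gpus_per_node = 8
tp = 2

dp_per_node = gpus_per_node // tp              # TP-sharded servers per node
global_dp = total_infer_nodes * dp_per_node    # DP size across all nodes

actual_ranks = new_world_size(total_infer_nodes, gpus_per_node)

# If api_server_count resolves to the per-node value, the old formula agrees.
per_node_result = old_world_size(total_infer_nodes, dp_per_node, tp)

# If it resolves to the global DP size, the node dimension is counted twice.
global_result = old_world_size(total_infer_nodes, global_dp, tp)

print(actual_ranks, per_node_result, global_result)
```

With these assumed numbers, the old formula overshoots by a factor of total_infer_nodes whenever api_server_count carries the global DP size, which is the case where the broadcast would wait on ranks that never join.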
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent b67fd12 commit 65914f9
1 file changed
Lines changed: 8 additions & 5 deletions
@@ -548,11 +548,14 @@
(Original lines 551-555 removed; new lines 551-558 added. Original lines 548-550 and 556-558 are unchanged context. The diff text itself is not present here.)