Commit ea9466e
[None][feat] WideEP FT: add active_rank_mask to NVLink AlltoAll kernels
Eliminates the infinite-spin AlltoAll hang that turns a single GPU failure in a Wide-EP group into a 5-minute HangDetector fire + full restart. The dispatch and combine kernels now take a uint64[2] bitmask of currently-alive EP ranks; dead ranks are skipped on every completion-flag write/wait, peer recv_counter store, EPLB stats write, and per-token routing decision (dead-targeted slots collapse to the same -1 sentinel combine already uses for duplicates).
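The skip-dead-ranks idea can be sketched on the host side as follows. This is a minimal illustration, not the kernel code. It assumes rank `r` maps to bit `r % 64` of word `r // 64`, which the commit message does not state, and the helper names `is_rank_active` and `alive_ranks` are hypothetical.

```python
BITS_PER_WORD = 64
ALL_ONES = (1 << BITS_PER_WORD) - 1  # one fully-set uint64 word


def is_rank_active(mask, rank):
    """Return True if `rank` is marked alive in a two-word bitmask.

    Assumed layout (not confirmed by the commit): rank r lives in
    word r // 64 at bit position r % 64.
    """
    word, bit = divmod(rank, BITS_PER_WORD)
    return (mask[word] >> bit) & 1 == 1


def alive_ranks(mask, ep_size):
    """Ranks a completion-flag wait would still poll; dead ranks are skipped."""
    return [r for r in range(ep_size) if is_rank_active(mask, r)]


# Mark rank 3 dead in an 8-rank EP group.
mask = [ALL_ONES, ALL_ONES]
mask[0] &= ~(1 << 3)
print(alive_ranks(mask, 8))
```

A dead rank never writes its completion flag, so a wait loop that iterates only over `alive_ranks` can terminate where an unmasked loop would spin indefinitely.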
The mask is optional on both torch ops; omitting it (or passing all-ones) produces bit-identical output to the pre-change kernel. kMaxRanks is bumped 64 -> 128 to cover NVL72 with headroom; kRankMaskWords = 2 names the kernel ABI explicitly.
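The sizing arithmetic and the all-ones default can be illustrated with a short sketch. The constants mirror `kMaxRanks` and `kRankMaskWords` from the message above; `words_needed` and `make_rank_mask` are hypothetical helpers, and the bit layout (rank `r` in word `r // 64`, bit `r % 64`) is an assumption.

```python
K_MAX_RANKS = 128        # mirrors kMaxRanks after the bump
K_RANK_MASK_WORDS = 2    # mirrors kRankMaskWords
WORD_ALL_ONES = (1 << 64) - 1


def words_needed(num_ranks):
    """Number of 64-bit words required to hold one bit per rank."""
    return (num_ranks + 63) // 64


def make_rank_mask(dead_ranks=()):
    """Build a two-word mask: all ones, with the bits of dead ranks cleared.

    With no dead ranks this is the all-ones mask, which the commit
    describes as equivalent to omitting the mask entirely.
    """
    mask = [WORD_ALL_ONES] * K_RANK_MASK_WORDS
    for rank in dead_ranks:
        if not 0 <= rank < K_MAX_RANKS:
            raise ValueError(f"rank {rank} is outside [0, {K_MAX_RANKS})")
        word, bit = divmod(rank, 64)
        mask[word] &= ~(1 << bit) & WORD_ALL_ONES
    return mask


print(words_needed(64), words_needed(72))   # one word no longer covers 72 ranks
print([hex(w) for w in make_rank_mask([70])])
```

A 72-rank NVL72 group needs two 64-bit words, which is why a single-word mask with the previous limit of 64 would not have been enough.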
Tests in tests/unittest/_torch/multi_gpu/test_moe_a2a_rank_mask.py cover (a) an all-ones mask matching the no-mask output bit-for-bit, and (b) one rank masked dead: the surviving ranks complete dispatch+combine without hanging, and dead-targeted topk slots are dropped.
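The two properties under test can be expressed as a CPU-only reference check. This is a sketch of the expected routing behaviour, not the multi-GPU test itself. `drop_dead_targets` is a hypothetical helper, the bit layout is assumed, and the `-1` sentinel is the one the commit message mentions.

```python
ALL_ONES_MASK = [(1 << 64) - 1, (1 << 64) - 1]


def drop_dead_targets(topk_target_ranks, mask):
    """Replace every topk slot that targets a dead rank with the -1 sentinel.

    `topk_target_ranks` is a list of per-token lists of target EP ranks.
    Assumed layout: rank r is bit r % 64 of word r // 64.
    """
    def alive(rank):
        return (mask[rank // 64] >> (rank % 64)) & 1 == 1

    return [[r if alive(r) else -1 for r in token] for token in topk_target_ranks]


routing = [[0, 2], [2, 3], [1, 3]]

# (a) An all-ones mask leaves routing untouched.
assert drop_dead_targets(routing, ALL_ONES_MASK) == routing

# (b) With rank 2 masked dead, only the slots targeting rank 2 are dropped.
one_dead = [ALL_ONES_MASK[0] & ~(1 << 2), ALL_ONES_MASK[1]]
assert drop_dead_targets(routing, one_dead) == [[0, -1], [-1, 3], [1, 3]]
```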
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
1 parent eddaa3a commit ea9466e
6 files changed
Lines changed: 543 additions & 10 deletions
File tree
- cpp/tensorrt_llm
- kernels/communicationKernels
- thop
- tensorrt_llm/_torch
- custom_ops
- modules/fused_moe/communication
- tests/unittest/_torch/multi_gpu
Lines changed: 57 additions & 2 deletions
Lines changed: 27 additions & 3 deletions
Lines changed: 1 addition & 1 deletion