Commit dea5cbd
net-ib: fix fault injection API issues from review
- ncclIbCastFaultSetQpDelay/SetQpError: use comm->base.nqps for bounds
check instead of NCCL_IB_MAX_QPS
- ncclIbCastFaultClear: atomically reset fatalErrorCount to 0
- move net_ib_fault_inject.h into transport/net_ib_cast/; drop local
#define NCCL_IB_MAX_QPS 128, include net_ib_cast_inspect.h and add
static_assert so a size mismatch becomes a compile error
- fault hook in IbCastMultiSend: use IbCastStatsFatalError (renamed from
ncclIbStatsFatalError in asanniko's split)
- FaultInjCastQpErrorClearRecovers: ASSERT_EQ on SetQpError return value
(was silently ignored); drain recvReq before DeregisterMemory
- FaultInjCastSingleQpErrorIsFatal: EXPECT_EQ on ncclIbCastSetTokens and
ncclIbCastFaultClear return values (Copilot review)
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
1 parent 80a1ad3 commit dea5cbd
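The first two bullets of the commit message (bounding the QP index by the communicator's live QP count rather than the compile-time maximum, and resetting the fatal error counter atomically on clear) can be illustrated with a minimal sketch. Everything below is a hypothetical stand-in: `FaultState`, `faultSetQpError`, `faultClear`, `kMaxQps`, and the `Result` enum are invented for illustration and are not the RCCL types or signatures, which this page does not show.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for NCCL_IB_MAX_QPS (the compile-time upper bound).
constexpr int kMaxQps = 128;

// Hypothetical result codes; the real API presumably returns ncclResult_t.
enum Result { kSuccess = 0, kInvalidArgument = 1 };

// Hypothetical fault-injection state attached to one communicator.
struct FaultState {
  int nqps = 0;                          // QPs actually in use (cf. comm->base.nqps)
  std::uint64_t delayUs[kMaxQps] = {};   // injected per-QP delay
  bool forceError[kMaxQps] = {};         // injected per-QP error flag
  std::atomic<int> fatalErrorCount{0};   // bumped by the send path on fatal errors
};

// Reject indices at or beyond the live QP count, not just beyond the array size.
Result faultSetQpError(FaultState* s, int qp, bool enable) {
  if (s == nullptr || qp < 0 || qp >= s->nqps) return kInvalidArgument;
  s->forceError[qp] = enable;
  return kSuccess;
}

// Clear all injected faults and atomically reset the fatal error counter.
Result faultClear(FaultState* s) {
  if (s == nullptr) return kInvalidArgument;
  for (int i = 0; i < kMaxQps; ++i) {
    s->delayUs[i] = 0;
    s->forceError[i] = false;
  }
  s->fatalErrorCount.store(0);
  return kSuccess;
}

// Small self-check exercising both behaviors.
void selfCheck() {
  FaultState s;
  s.nqps = 4;
  assert(faultSetQpError(&s, 3, true) == kSuccess);
  assert(faultSetQpError(&s, 4, true) == kInvalidArgument);  // inside array, outside nqps
  s.fatalErrorCount.fetch_add(2);
  assert(faultClear(&s) == kSuccess);
  assert(s.fatalErrorCount.load() == 0);
  assert(!s.forceError[3]);
}
```

In this sketch, index 4 fits in the 128-slot array but is rejected because only four QPs are active. A check against the compile-time maximum alone would accept it.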
5 files changed
Lines changed: 52 additions & 21 deletions
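The `static_assert` change described in the commit message replaces a duplicated local `#define` with a single shared constant and a compile-time size check. A minimal sketch of that pattern follows; `kIbMaxQps`, `FaultTable`, and `kFaultSlots` are hypothetical names, and the real constant is assumed to come from `net_ib_cast_inspect.h`.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical: stands in for the shared limit that the real header would provide.
constexpr std::size_t kIbMaxQps = 128;

// Hypothetical fault table sized by the shared constant instead of a local #define.
struct FaultTable {
  std::uint64_t delayUs[kIbMaxQps];
};

// Number of per-QP slots the table actually holds, computed from its type.
constexpr std::size_t kFaultSlots =
    std::extent<decltype(FaultTable::delayUs)>::value;

// If the table and the shared limit ever diverge, the build fails here
// instead of the code indexing out of bounds at run time.
static_assert(kFaultSlots == kIbMaxQps,
              "fault table must have one slot per possible QP");
```

The benefit of this design is that a mismatch between two headers shows up as a build failure rather than as silent memory corruption in a test run.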
File tree
- projects/rccl
- src
- transport/net_ib_cast
- test
- transport/NetIbMPI
Lines changed: 5 additions & 7 deletions
Lines changed: 41 additions & 10 deletions