Commit 5f29b91
net-ib: fix fault injection API issues from review
- Validate qpIndex against comm->base.nqps (actual QP count) instead of
  NCCL_IB_MAX_QPS (128) in FaultSetQpDelay and FaultSetQpError, so
  arming a non-existent QP returns ncclInvalidArgument immediately.
- Reset fatalErrorCount in ncclIbCastFaultClear so the connection is
  truly recoverable after clearing fault state.
- Add phase 1 verification in FaultInjCastQpErrorClearRecovers: rank 1
  forwards sendRet + fatalCount to rank 0 via MPI, rank 0 asserts that
  the fault was actually observed before testing recovery in phase 2.
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

1 parent 7da9b83 commit 5f29b91
2 files changed
Lines changed: 25 additions & 5 deletions
| Original lines | New lines | Change |
|---|---|---|
| 2942-2956 | 2942-2956 | Line 2945 replaced; line 2953 replaced |
| 2960-2965 | 2960-2966 | One line added (new line 2963) |
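The first two items of the commit message can be illustrated with a minimal sketch. This is not the NCCL source: `Comm`, `QpFault`, `Result`, `faultSetQpError`, and `faultClear` are hypothetical stand-ins, and the sketch assumes that fault state lives in a fixed array of 128 slots while only `nqps` of them correspond to live QPs.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins; the real NCCL types are not shown on this page.
enum Result { Success = 0, InvalidArgument = 1 };

constexpr int kMaxQps = 128;  // mirrors NCCL_IB_MAX_QPS (128) from the commit message

struct QpFault {
  bool errorArmed = false;
  uint64_t delayUs = 0;
};

struct Comm {
  int nqps = 0;             // QPs actually created (assumed <= kMaxQps)
  int fatalErrorCount = 0;  // bumped when an injected fault is hit
  QpFault faults[kMaxQps];
};

// Arm an error on one QP. The bound is the live QP count, not the array
// capacity, so an index that names no real QP is rejected right away.
Result faultSetQpError(Comm* comm, int qpIndex) {
  assert(comm != nullptr);
  if (qpIndex < 0 || qpIndex >= comm->nqps) return InvalidArgument;
  comm->faults[qpIndex].errorArmed = true;
  return Success;
}

// Clear every fault slot and also reset the fatal error counter, so a
// later operation does not see a stale error from before the clear.
Result faultClear(Comm* comm) {
  assert(comm != nullptr);
  for (int i = 0; i < kMaxQps; ++i) comm->faults[i] = QpFault{};
  comm->fatalErrorCount = 0;
  return Success;
}
```

With `nqps = 4`, index 3 is accepted while indices 4 and 127 are rejected, even though both would fit inside the 128-slot array.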
Lines changed: 22 additions & 3 deletions
| Original lines | New lines | Change |
|---|---|---|
| 607-612 | 607-616 | Four lines added (new lines 610-613) |
| 628-644 | 632-663 | Original lines 634, 636, 637 removed; new lines 635, 640-641, 644-645, 647-648, 650-660 added |
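The phase 1 check described in the commit message can be sketched without the MPI transport. The names `Phase1Report`, `packReport`, `unpackReport`, `faultObserved`, and `requireFaultObserved` are hypothetical. The rule that a fault counts as observed when the send returned non-zero or the fatal count is positive is an assumption, not something this page confirms. In the real test the two integers travel from rank 1 to rank 0 over MPI; here they are only packed into a two-element buffer.

```cpp
#include <array>
#include <cassert>

// What rank 1 reports after phase 1 (hypothetical layout).
struct Phase1Report {
  int sendRet;     // return code of the send that hit the armed fault
  int fatalCount;  // fatal error counter read after the send
};

// Rank 1 side: flatten the report into the two ints that would be sent.
std::array<int, 2> packReport(const Phase1Report& r) {
  return {r.sendRet, r.fatalCount};
}

// Rank 0 side: rebuild the report from the received buffer.
Phase1Report unpackReport(const std::array<int, 2>& buf) {
  return {buf[0], buf[1]};
}

// Assumed criterion: the fault was seen if the send failed or the
// fatal error counter moved off zero.
bool faultObserved(const Phase1Report& r) {
  return r.sendRet != 0 || r.fatalCount > 0;
}

// Rank 0 side: stop before phase 2 if phase 1 never saw the fault,
// since a passing recovery check would then prove nothing.
void requireFaultObserved(const Phase1Report& r) {
  assert(faultObserved(r));
}
```

The point of the extra step is that recovery in phase 2 is only meaningful if phase 1 actually failed; without the check, a fault that was never triggered would make the test pass vacuously.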