[RCCL] net-ib: per-QP fault injection API, hooks, and tests for IB and CAST paths#5515
[RCCL] net-ib: per-QP fault injection API, hooks, and tests for IB and CAST paths#5515i-kosarev wants to merge 8 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
Adds a test-only per-QP fault injection facility for the regular IB and CAST paths, along with extensive new MPI-based transport tests and corresponding test-runner suites.
Changes:
- Added CAST WRR scheduler “inspect” APIs and per-QP fault injection APIs/hooks (guarded by
ENABLE_FAULT_INJECTION) innet_ib.ccandnet_ib_cast.cc. - Added new NetIb MPI test infrastructure plus large sets of new tests for NIC fusion/vNIC, CAST WRR behavior, and fault injection.
- Extended test-runner JSON configs to add CAST and fault-injection suites; updated ROCm net-ib symbol renaming to avoid duplicate symbols.
Reviewed changes
Copilot reviewed 17 out of 18 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| projects/rccl/tools/scripts/test_runner/configs/net_ib_transport.json | Adds CAST and fault-injection configs/suites for transport CI. |
| projects/rccl/tools/scripts/test_runner/configs/mi300x_mellanox_ib.json | Mirrors CAST and fault-injection suites for MI300X Mellanox IB coverage. |
| projects/rccl/test/transport/NetIbMPI/NicFusionTests.cpp | New NIC fusion / vNIC functional & stress tests. |
| projects/rccl/test/transport/NetIbMPI/NetIbMPITestBase.hpp | New shared MPI test fixture/helpers for net-ib and CAST testing. |
| projects/rccl/test/transport/NetIbMPI/NetIbFaultInject.hpp | Test wrapper header re-exporting the shared fault-injection API. |
| projects/rccl/test/transport/NetIbMPI/NetIbCastInspect.hpp | Test wrapper header re-exporting CAST scheduler inspection API. |
| projects/rccl/test/transport/NetIbMPI/GeneralTests.cpp | New general net-ib MPI tests (init, properties, send/recv, stress). |
| projects/rccl/test/transport/NetIbMPI/FaultInjectTests.cpp | New per-QP delay/error fault-injection tests for IB and CAST. |
| projects/rccl/test/transport/NetIbMPI/CastTests.cpp | New CAST WRR scheduler white-box tests + stress tests. |
| projects/rccl/test/CMakeLists.txt | Splits prior NetIb MPI test file into multiple new compilation units. |
| projects/rccl/src/transport/net_ib_cast.cc | Adds inspect + fault-injection APIs and hook points in CAST send path. |
| projects/rccl/src/transport/net_ib.cc | Adds fault-injection API and hook points in regular IB send path. |
| projects/rccl/src/include/net_ib_fault_inject.h | New shared test-only fault injection API header (guarded by ENABLE_FAULT_INJECTION). |
| projects/rccl/src/include/net_ib_cast_inspect.h | New shared test-only CAST scheduler inspection API header. |
| projects/rccl/src/CMakeLists.txt | Adds the new headers to the build’s source list. |
| projects/rccl/cmake/rocmIb.cmake | Adds a sed rename rule to avoid ncclIbFault* symbol duplication in ROCm net-ib generation. |
| projects/rccl/.gitignore | Broadens build directory ignore rule from build/ to build*/. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
regression-detection smoke (single-node) on commit be7bc6bc83fdb2740a7bab15033f8b3cc7adf453Artifacts - No regressions preview generated. |
be7bc6b to
2cd755e
Compare
regression-detection smoke (single-node) on commit 2cd755e70202a7b0c1be838d83e927d7486a5696Artifacts - No regressions preview generated. |
2cd755e to
c222f53
Compare
regression-detection smoke (single-node) on commit c222f5391a2db1f28d275dd8cce9af34da7cf85aArtifacts - No regressions preview generated. |
d5d1d3b to
9640bb3
Compare
9640bb3 to
5cede29
Compare
29251a4 to
81bf5d1
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 8 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
f49a77e to
5f29b91
Compare
Add per-QP fault injection to ncclIbNetCommBase: faultQpDelayUs (usleep before post_send) and faultQpError (return ncclSystemError and increment fatalErrorCount). Hook fires in IbCastMultiSend (CAST path only). Public API in net_ib_fault_inject.h: ncclIbCastFault* for the CAST path. All fault code is guarded by ENABLE_FAULT_INJECTION; production builds are zero-cost. rocmIb.cmake renamed ncclIbFault → rocmIbFault to avoid duplicate symbols in the generated net_ib_rocm.cc.
CAST path: arm faultQpError on QP 0 of the CAST sender. Verifies that isend returns an error or fatalErrorCount > 0, confirming the fault hook in IbCastMultiSend propagates correctly.
CAST WRR path: arm 10 ms delay on QP 0, run 500 sends. Verifies that the RTT timer reduces initQpTokens[0] below the equal-share baseline, and that WRR token accounting invariants hold throughout.
CAST path: arm 2 ms per-QP delay, run 50 send/recv pairs. Verifies that artificial delays do not corrupt data and fatalErrorCount stays 0.
Arm error injection on exactly one QP (QP 0) using ncclIbCastSetTokens to steer all WRR tokens there before the fault send. This makes the chosen QP deterministic regardless of where the WRR cursor landed after warmup, modeling the 'one link failed' scenario. Verifies: isend returns error OR fatalErrorCount > 0. Skipped when actualNqps < 2 (FaultInjCastQpErrorIsFatal covers that).
Arm error injection on all QPs of connection 1, trigger one failed send, call FaultClear, then close connection 1. Open a fresh connection 2 and transfer 20x4096-byte messages. Verifies data integrity (memcmp) and that fatalErrorCount on connection 2 is 0. A new connection is required because a faulted connection accumulates fatalErrorCount and IbCastIsend checks that counter when NCCL_IB_RETURN_ASYNC_EVENTS=1 is set.
- Validate qpIndex against comm->base.nqps (actual QP count) instead of NCCL_IB_MAX_QPS (128) in FaultSetQpDelay and FaultSetQpError, so arming a non-existent QP returns ncclInvalidArgument immediately. - Reset fatalErrorCount in ncclIbCastFaultClear so the connection is truly recoverable after clearing fault state. - Add phase 1 verification in FaultInjCastQpErrorClearRecovers: rank 1 forwards sendRet + fatalCount to rank 0 via MPI, rank 0 asserts that the fault was actually observed before testing recovery in phase 2. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
5f29b91 to
41407a5
Compare
nikitaxgusev
left a comment
There was a problem hiding this comment.
In general looks good.
| * double-definition when both headers are included together. | ||
| */ | ||
| #ifndef NCCL_IB_MAX_QPS | ||
| #define NCCL_IB_MAX_QPS 128 |
There was a problem hiding this comment.
If net_ib_cast_inspect.h defines NCCL_IB_MAX_QPS to a different value lets say it changes to 256 in the future , the fault arrays will silently be with different size, may be adding static_assert(NCCL_IB_MAX_QPS == 128)
There was a problem hiding this comment.
|
%RCCL_TEST_CONFIG |
|
Moved to #5931 |
Motivation
Adds a test-only per-QP fault injection facility to the net-ib CAST path, guarded by
ENABLE_FAULT_INJECTIONso production builds are zero-cost.Technical Details
Infrastructure (
net-ib: fault injection API and hooks...)ncclIbNetCommBase: two new fieldsfaultQpDelayUs[]/faultQpError[]IbCastMultiSend(net_ib_cast.cc): inject artificialusleepor returnncclSystemError+ incrementfatalErrorCountbeforewrap_ibv_post_sendsrc/include/net_ib_fault_inject.h:ncclIbCastFault{SetQpDelay,SetQpError,Clear,GetFatalCount}for the CAST pathcmake/rocmIb.cmake: addedncclIbFault → rocmIbFaultrename rule to fix duplicate symbol linker errors innet_ib_rocm.ccTest Plan
FaultInjCastQpErrorIsFatal— CAST: QP error propagates as a fatal event; fault is armed on all active QPs, results forwarded from rank 1 to rank 0 for assertionFaultInjCastSlowQpRebalances— CAST WRR: 10 ms delay on QP 0 causes the RTT timer to rebalance tokens below equal share; scheduler state forwarded from rank 1 to rank 0FaultInjCastDelayDataIntegrity— CAST: 2 ms per-QP delay across 50 sends, full data integrity verified on rank 0FaultInjCastSingleQpErrorIsFatal— CAST WRR "one link failed" scenario: all WRR tokens steered to QP 0 viancclIbCastSetTokensbefore arming the fault on that QP only; verifiesisendreturns error orfatalErrorCount > 0; skipped whenactualNqps < 2FaultInjCastQpErrorClearRecovers— CAST recovery: after arming error on all QPs, triggering one failed send, and callingFaultClear, a fresh connection completes 20 × 4096-byte sends with data integrity andfatalErrorCount == 0JIRA ID
AICOMNET-133
Submission Checklist