[RCCL] net-ib: per-QP fault injection API, hooks, and tests for IB and CAST paths#5931
[RCCL] net-ib: per-QP fault injection API, hooks, and tests for IB and CAST paths#5931i-kosarev wants to merge 8 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR introduces a test-only, per-QP fault injection facility for the RCCL net-ib CAST transport (guarded by ENABLE_FAULT_INJECTION), and adds MPI-based gtests plus CI runner configurations to validate error propagation, WRR rebalancing under artificial delay, and data integrity.
Changes:
- Added a CAST per-QP fault-injection API header and CAST transport hooks/state to support QP delay/error injection in test builds.
- Added new MPI gtests for fault injection scenarios and enabled them via new test-runner suites/configs.
- Refactored/build-integrated net_ib_cast sources via CMake and extended ibverbs/mlx5dv wrapper infrastructure (dynamic symbol loading + helpers).
Reviewed changes
Copilot reviewed 34 out of 35 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| projects/rccl/tools/scripts/test_runner/configs/net_ib_transport.json | Adds a new CAST fault-injection test config and suite to the test runner. |
| projects/rccl/tools/scripts/test_runner/configs/mi300x_mellanox_ib.json | Adds a MI300X-specific CAST fault-injection test config and suite. |
| projects/rccl/test/transport/NetIbMPI/NetIbFaultInject.hpp | Test-side wrapper header to include the shared fault injection API. |
| projects/rccl/test/transport/NetIbMPI/FaultInjectTests.cpp | Adds MPI gtests covering CAST per-QP error/delay injection, WRR behavior, and recovery. |
| projects/rccl/test/CMakeLists.txt | Wires new fault-injection tests and net_ib_cast include paths into the test build. |
| projects/rccl/src/transport/net_ib_cast/scheduler.cc | Adds/updates CAST WRR scheduler logic and exposes test introspection helpers. |
| projects/rccl/src/transport/net_ib_cast/reg.cc | Adds CAST-side MR registration/deregistration implementation. |
| projects/rccl/src/transport/net_ib_cast/p2p_resiliency_recovery_cast.h | Adds CAST resiliency recovery header declarations. |
| projects/rccl/src/transport/net_ib_cast/p2p_resiliency_cast.h | Adds CAST resiliency types and APIs header. |
| projects/rccl/src/transport/net_ib_cast/p2p_cast.h | Adds CAST P2P header utilities (including request index retrieval helper). |
| projects/rccl/src/transport/net_ib_cast/net_ib_fault_inject.h | Introduces the test-only CAST per-QP fault injection API header. |
| projects/rccl/src/transport/net_ib_cast/init.cc | CAST transport init and device discovery logic (including NUMA/root-complex helpers). |
| projects/rccl/src/transport/net_ib_cast/gin.cc | CAST GIN backend implementation updates (proxy/GDAKI handling, CQ polling, etc.). |
| projects/rccl/src/transport/net_ib_cast/gin_cast.h | Adds CAST GIN interface declarations. |
| projects/rccl/src/transport/net_ib_cast/gdr.cc | CAST-side GDR / DMA-BUF / peer-mem detection logic. |
| projects/rccl/src/transport/net_ib_cast/connect_cast.h | Adds CAST connection metadata and QP creation API declarations. |
| projects/rccl/src/transport/net_ib_cast/common.cc | CAST common globals and async event thread + net plugin vtable. |
| projects/rccl/src/transport/net_ib_cast/common_cast.h | CAST common header: types, constants, base comm/request structures; adds fault-injection state fields under ENABLE_FAULT_INJECTION. |
| projects/rccl/src/transport/net_ib_cast/CMakeLists.txt | Introduces a dedicated CMake subdir exporting CAST sources and include dirs. |
| projects/rccl/src/transport/CMakeLists.txt | Updates transport CMake to add the net_ib_cast subdirectory. |
| projects/rccl/src/misc/mlx5dvwrap.cc | Adds mlx5dv wrapper helpers (query_device/create_qp) and error-check macro. |
| projects/rccl/src/misc/mlx5dvsymbols.cc | Extends mlx5dv dynamic symbol loading for query_device/create_qp. |
| projects/rccl/src/misc/ibvwrap.cc | Updates ibverbs wrappers (query_port handling, sleep_for usage) and adds WC/opcode-to-string helpers. |
| projects/rccl/src/misc/ibvsymbols.cc | Adds dynamic loading via NCCL_IBVERBS_LIB env override and additional logging. |
| projects/rccl/src/include/mlx5/mlx5dvwrap.h | Declares new mlx5dv wrapper APIs. |
| projects/rccl/src/include/mlx5/mlx5dvsymbols.h | Adds new mlx5dv symbol pointers for query_device/create_qp. |
| projects/rccl/src/include/mlx5/mlx5dvcore.h | Expands mlx5 direct verbs core structs/enums needed for querying device capabilities. |
| projects/rccl/src/include/ibvwrap.h | Declares new WC/opcode-to-string helper functions. |
| projects/rccl/src/CMakeLists.txt | Integrates net_ib_cast CMake subdir outputs into the main RCCL build (sources + include dirs). |
| projects/rccl/cmake/rocmIb.cmake | Adds a sed rename rule to avoid duplicate-symbol issues for ncclIbFault → rocmIbFault. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
ca74a5d to
dea5cbd
Compare
|
@nikitaxgusev your comments from #5515 are addressed here |
73d334d to
3173cc1
Compare
Add per-QP fault injection to ncclIbNetCommBase: faultQpDelayUs (usleep before post_send) and faultQpError (return ncclSystemError and increment fatalErrorCount). Hook fires in IbCastMultiSend (CAST path only). Public API in net_ib_fault_inject.h: ncclIbCastFault* for the CAST path. All fault code is guarded by ENABLE_FAULT_INJECTION; production builds are zero-cost. rocmIb.cmake renamed ncclIbFault → rocmIbFault to avoid duplicate symbols in the generated net_ib_rocm.cc.
CAST path: arm faultQpError on QP 0 of the CAST sender. Verifies that isend returns an error or fatalErrorCount > 0, confirming the fault hook in IbCastMultiSend propagates correctly.
CAST WRR path: arm 10 ms delay on QP 0, run 500 sends. Verifies that the RTT timer reduces initQpTokens[0] below the equal-share baseline, and that WRR token accounting invariants hold throughout.
CAST path: arm 2 ms per-QP delay, run 50 send/recv pairs. Verifies that artificial delays do not corrupt data and fatalErrorCount stays 0.
Arm error injection on exactly one QP (QP 0) using ncclIbCastSetTokens to steer all WRR tokens there before the fault send. This makes the chosen QP deterministic regardless of where the WRR cursor landed after warmup, modeling the 'one link failed' scenario. Verifies: isend returns error OR fatalErrorCount > 0. Skipped when actualNqps < 2 (FaultInjCastQpErrorIsFatal covers that).
Arm error injection on all QPs of connection 1, trigger one failed send, call FaultClear, then close connection 1. Open a fresh connection 2 and transfer 20x4096-byte messages. Verifies data integrity (memcmp) and that fatalErrorCount on connection 2 is 0. A new connection is required because a faulted connection accumulates fatalErrorCount and IbCastIsend checks that counter when NCCL_IB_RETURN_ASYNC_EVENTS=1 is set.
357805f to
46cd912
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 17 out of 17 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
projects/rccl/src/transport/net_ib_cast/net_ib_cast_inspect.h:18
net_ib_cast_inspect.huses thebooltype inncclIbCastSchedState, but in the non-C++ include path it never includes<stdbool.h>. If this header is ever included from a C translation unit (the file already has a C branch), it will fail to compile. Add#include <stdbool.h>in the#elsebranch (or otherwise ensureboolis defined for C).
ddurnov
left a comment
There was a problem hiding this comment.
Please clarify why we need extra post processing step here?
|
%RCCL_TEST_CONFIG |
Fix fault injection API issues from review: - ncclIbCastFaultSetQpDelay/SetQpError: use comm->base.nqps for bounds check instead of NCCL_IB_MAX_QPS - ncclIbCastFaultClear: atomically reset fatalErrorCount to 0 - move net_ib_fault_inject.h into transport/net_ib_cast/; drop local #define NCCL_IB_MAX_QPS 128, include net_ib_cast_inspect.h and add static_assert so a size mismatch becomes a compile error - fault hook in IbCastMultiSend: use IbCastStatsFatalError (renamed from ncclIbStatsFatalError in asanniko's split) - FaultInjCastQpErrorClearRecovers: ASSERT_EQ on SetQpError return value; drain recvReq before DeregisterMemory - FaultInjCastSingleQpErrorIsFatal: EXPECT_EQ on ncclIbCastSetTokens and ncclIbCastFaultClear return values Consolidate NCCL_IB_MAX_QPS definition and reorganize headers: - add src/transport/net_ib_limits.h as single source of truth for NCCL_IB_MAX_QPS 128, removing duplicate defines from net_ib.cc, common_cast.h, and net_ib_cast_inspect.h - move net_ib_cast_inspect.h from src/include/ to src/transport/net_ib_cast/ alongside net_ib_fault_inject.h; update CMakeLists include paths - group net_ib_limits.h, net_ib_cast_inspect.h, net_ib_fault_inject.h together next to transport/net_ib.cc in src/CMakeLists.txt Reduce test code duplication: - add helpers to NetIbMPITestBase: DrainRecvRequest, TeardownConnection, ExpectEqualWeightInitTokens, ExpectActiveTokenSumInvariant - use them in FaultInjectTests.cpp and CastTests.cpp to eliminate ~300 lines of repeated teardown and sched-state assertion code Address review comments: - wrap_ibv_free_device_list: add nullptr early-return to avoid NULL dereference on error paths (e.g. net_ib.cc fail:975) - IbCastRequestRetrieveAsIndex: drop dead `reqIndex < 0` check (reqIndex is uint32_t, so always >= 0); use %u format specifier to match type - ibvsymbols.cc: fix typo "Count not open" -> "Could not open" - net_ib_fault_inject.h: fix comment referencing non-existent net_ib_cast.cc; implementation lives in net_ib_cast/p2p.cc - FaultInjCastSlowQpRebalances: skip when GetSplitDataMin()==0 to avoid uint32_t underflow in kMsgSz = GetSplitDataMin() - 1 - drop ncclIbFault sed rename from rocmIb.cmake; fault injection symbols (ncclIbCastFault*) live in our own source files, not the vendored plugin
46cd912 to
d023c86
Compare
|
%RCCL_TEST_CONFIG |
Generally we do this with functions to write nccl* prefix for code compatibility and then have rocm* prefix in binary builds. But "renaming" is not really needed for test functions. Removed |
There was a problem hiding this comment.
LGTM. Minor comment (can be ignored):
FaultInjCastQpErrorIsFatal, FaultInjCastSingleQpErrorIsFatal and Phase 1 of FaultInjCastQpErrorClearRecovers repeat the same ~50-line flow (arm → barrier → post+poll → forward FaultInjectResult → assert), only the arming differs. Worth extracting into one RunFaultInjectErrorScenario(..., armFn) helper?
Thanks for the idea! However the arming logic, message sizes, tags, and assertion messages all differ between the three tests, so a shared helper would need lot of params and it will cancel out the readability gain. The duplication is incidental rather than conceptual — each test tells its own story end-to-end, which I'd prefer to preserve. |
|
%RCCL_TEST_CONFIG |
Motivation
Adds a test-only per-QP fault injection facility to the net-ib CAST path, guarded by
ENABLE_FAULT_INJECTIONso production builds are zero-cost.Technical Details
Infrastructure (
net-ib: fault injection API and hooks...)ncclIbNetCommBase: two new fieldsfaultQpDelayUs[]/faultQpError[]IbCastMultiSend(net_ib_cast.cc): inject artificialusleepor returnncclSystemError+ incrementfatalErrorCountbeforewrap_ibv_post_sendsrc/include/net_ib_fault_inject.h:ncclIbCastFault{SetQpDelay,SetQpError,Clear,GetFatalCount}for the CAST pathcmake/rocmIb.cmake: addedncclIbFault → rocmIbFaultrename rule to fix duplicate symbol linker errors innet_ib_rocm.ccTest Plan
FaultInjCastQpErrorIsFatal— CAST: QP error propagates as a fatal event; fault is armed on all active QPs, results forwarded from rank 1 to rank 0 for assertionFaultInjCastSlowQpRebalances— CAST WRR: 10 ms delay on QP 0 causes the RTT timer to rebalance tokens below equal share; scheduler state forwarded from rank 1 to rank 0FaultInjCastDelayDataIntegrity— CAST: 2 ms per-QP delay across 50 sends, full data integrity verified on rank 0FaultInjCastSingleQpErrorIsFatal— CAST WRR "one link failed" scenario: all WRR tokens steered to QP 0 viancclIbCastSetTokensbefore arming the fault on that QP only; verifiesisendreturns error orfatalErrorCount > 0; skipped whenactualNqps < 2FaultInjCastQpErrorClearRecovers— CAST recovery: after arming error on all QPs, triggering one failed send, and callingFaultClear, a fresh connection completes 20 × 4096-byte sends with data integrity andfatalErrorCount == 0JIRA ID
AICOMNET-133
Submission Checklist