Skip to content

[RCCL] net-ib: per-QP fault injection API, hooks, and tests for IB and CAST paths#5931

Open
i-kosarev wants to merge 8 commits into
developfrom
users/ilkosare/net-ib-fault-injection-tests-for-new-CAST
Open

[RCCL] net-ib: per-QP fault injection API, hooks, and tests for IB and CAST paths#5931
i-kosarev wants to merge 8 commits into
developfrom
users/ilkosare/net-ib-fault-injection-tests-for-new-CAST

Conversation

@i-kosarev
Copy link
Copy Markdown
Contributor

Motivation

Adds a test-only per-QP fault injection facility to the net-ib CAST path, guarded by ENABLE_FAULT_INJECTION so production builds are zero-cost.

Technical Details

Infrastructure (net-ib: fault injection API and hooks...)

  • ncclIbNetCommBase: two new fields faultQpDelayUs[] / faultQpError[]
  • Hook in IbCastMultiSend (net_ib_cast.cc): inject artificial usleep or return ncclSystemError + increment fatalErrorCount before wrap_ibv_post_send
  • Public API in src/include/net_ib_fault_inject.h: ncclIbCastFault{SetQpDelay,SetQpError,Clear,GetFatalCount} for the CAST path
  • cmake/rocmIb.cmake: added ncclIbFault → rocmIbFault rename rule to fix duplicate symbol linker errors in net_ib_rocm.cc

Test Plan

  • FaultInjCastQpErrorIsFatal — CAST: QP error propagates as a fatal event; fault is armed on all active QPs, results forwarded from rank 1 to rank 0 for assertion
  • FaultInjCastSlowQpRebalances — CAST WRR: 10 ms delay on QP 0 causes the RTT timer to rebalance tokens below equal share; scheduler state forwarded from rank 1 to rank 0
  • FaultInjCastDelayDataIntegrity — CAST: 2 ms per-QP delay across 50 sends, full data integrity verified on rank 0
  • FaultInjCastSingleQpErrorIsFatal — CAST WRR "one link failed" scenario: all WRR tokens steered to QP 0 via ncclIbCastSetTokens before arming the fault on that QP only; verifies isend returns error or fatalErrorCount > 0; skipped when actualNqps < 2
  • FaultInjCastQpErrorClearRecovers — CAST recovery: after arming error on all QPs, triggering one failed send, and calling FaultClear, a fresh connection completes 20 × 4096-byte sends with data integrity and fatalErrorCount == 0

JIRA ID

AICOMNET-133

Submission Checklist

Copilot AI review requested due to automatic review settings May 8, 2026 20:10
@i-kosarev i-kosarev requested a review from a team as a code owner May 8, 2026 20:10
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a test-only, per-QP fault injection facility for the RCCL net-ib CAST transport (guarded by ENABLE_FAULT_INJECTION), and adds MPI-based gtests plus CI runner configurations to validate error propagation, WRR rebalancing under artificial delay, and data integrity.

Changes:

  • Added a CAST per-QP fault-injection API header and CAST transport hooks/state to support QP delay/error injection in test builds.
  • Added new MPI gtests for fault injection scenarios and enabled them via new test-runner suites/configs.
  • Refactored/build-integrated net_ib_cast sources via CMake and extended ibverbs/mlx5dv wrapper infrastructure (dynamic symbol loading + helpers).

Reviewed changes

Copilot reviewed 34 out of 35 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
projects/rccl/tools/scripts/test_runner/configs/net_ib_transport.json Adds a new CAST fault-injection test config and suite to the test runner.
projects/rccl/tools/scripts/test_runner/configs/mi300x_mellanox_ib.json Adds a MI300X-specific CAST fault-injection test config and suite.
projects/rccl/test/transport/NetIbMPI/NetIbFaultInject.hpp Test-side wrapper header to include the shared fault injection API.
projects/rccl/test/transport/NetIbMPI/FaultInjectTests.cpp Adds MPI gtests covering CAST per-QP error/delay injection, WRR behavior, and recovery.
projects/rccl/test/CMakeLists.txt Wires new fault-injection tests and net_ib_cast include paths into the test build.
projects/rccl/src/transport/net_ib_cast/scheduler.cc Adds/updates CAST WRR scheduler logic and exposes test introspection helpers.
projects/rccl/src/transport/net_ib_cast/reg.cc Adds CAST-side MR registration/deregistration implementation.
projects/rccl/src/transport/net_ib_cast/p2p_resiliency_recovery_cast.h Adds CAST resiliency recovery header declarations.
projects/rccl/src/transport/net_ib_cast/p2p_resiliency_cast.h Adds CAST resiliency types and APIs header.
projects/rccl/src/transport/net_ib_cast/p2p_cast.h Adds CAST P2P header utilities (including request index retrieval helper).
projects/rccl/src/transport/net_ib_cast/net_ib_fault_inject.h Introduces the test-only CAST per-QP fault injection API header.
projects/rccl/src/transport/net_ib_cast/init.cc CAST transport init and device discovery logic (including NUMA/root-complex helpers).
projects/rccl/src/transport/net_ib_cast/gin.cc CAST GIN backend implementation updates (proxy/GDAKI handling, CQ polling, etc.).
projects/rccl/src/transport/net_ib_cast/gin_cast.h Adds CAST GIN interface declarations.
projects/rccl/src/transport/net_ib_cast/gdr.cc CAST-side GDR / DMA-BUF / peer-mem detection logic.
projects/rccl/src/transport/net_ib_cast/connect_cast.h Adds CAST connection metadata and QP creation API declarations.
projects/rccl/src/transport/net_ib_cast/common.cc CAST common globals and async event thread + net plugin vtable.
projects/rccl/src/transport/net_ib_cast/common_cast.h CAST common header: types, constants, base comm/request structures; adds fault-injection state fields under ENABLE_FAULT_INJECTION.
projects/rccl/src/transport/net_ib_cast/CMakeLists.txt Introduces a dedicated CMake subdir exporting CAST sources and include dirs.
projects/rccl/src/transport/CMakeLists.txt Updates transport CMake to add the net_ib_cast subdirectory.
projects/rccl/src/misc/mlx5dvwrap.cc Adds mlx5dv wrapper helpers (query_device/create_qp) and error-check macro.
projects/rccl/src/misc/mlx5dvsymbols.cc Extends mlx5dv dynamic symbol loading for query_device/create_qp.
projects/rccl/src/misc/ibvwrap.cc Updates ibverbs wrappers (query_port handling, sleep_for usage) and adds WC/opcode-to-string helpers.
projects/rccl/src/misc/ibvsymbols.cc Adds dynamic loading via NCCL_IBVERBS_LIB env override and additional logging.
projects/rccl/src/include/mlx5/mlx5dvwrap.h Declares new mlx5dv wrapper APIs.
projects/rccl/src/include/mlx5/mlx5dvsymbols.h Adds new mlx5dv symbol pointers for query_device/create_qp.
projects/rccl/src/include/mlx5/mlx5dvcore.h Expands mlx5 direct verbs core structs/enums needed for querying device capabilities.
projects/rccl/src/include/ibvwrap.h Declares new WC/opcode-to-string helper functions.
projects/rccl/src/CMakeLists.txt Integrates net_ib_cast CMake subdir outputs into the main RCCL build (sources + include dirs).
projects/rccl/cmake/rocmIb.cmake Adds a sed rename rule to avoid duplicate-symbol issues for ncclIbFaultrocmIbFault.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread projects/rccl/src/misc/ibvwrap.cc
Comment thread projects/rccl/src/transport/net_ib_cast/p2p_cast.h Outdated
Comment thread projects/rccl/src/transport/net_ib_cast/common_cast.h Outdated
Comment thread projects/rccl/test/transport/NetIbMPI/FaultInjectTests.cpp
Comment thread projects/rccl/test/transport/NetIbMPI/FaultInjectTests.cpp
Comment thread projects/rccl/src/misc/ibvsymbols.cc Outdated
@i-kosarev i-kosarev force-pushed the users/ilkosare/net-ib-fault-injection-tests-for-new-CAST branch 2 times, most recently from ca74a5d to dea5cbd Compare May 8, 2026 21:39
@i-kosarev
Copy link
Copy Markdown
Contributor Author

@nikitaxgusev your comments from #5515 are addressed here

@i-kosarev i-kosarev force-pushed the users/ilkosare/net-ib-fault-injection-tests-for-new-CAST branch from 73d334d to 3173cc1 Compare May 11, 2026 12:53
Ilia Kosarev and others added 7 commits May 12, 2026 10:03
Add per-QP fault injection to ncclIbNetCommBase: faultQpDelayUs (usleep
before post_send) and faultQpError (return ncclSystemError and increment
fatalErrorCount). Hook fires in IbCastMultiSend (CAST path only).

Public API in net_ib_fault_inject.h: ncclIbCastFault* for the CAST path.

All fault code is guarded by ENABLE_FAULT_INJECTION; production builds
are zero-cost. rocmIb.cmake renamed ncclIbFault → rocmIbFault to avoid
duplicate symbols in the generated net_ib_rocm.cc.
CAST path: arm faultQpError on QP 0 of the CAST sender. Verifies that
isend returns an error or fatalErrorCount > 0, confirming the fault hook
in IbCastMultiSend propagates correctly.
CAST WRR path: arm 10 ms delay on QP 0, run 500 sends. Verifies that
the RTT timer reduces initQpTokens[0] below the equal-share baseline,
and that WRR token accounting invariants hold throughout.
CAST path: arm 2 ms per-QP delay, run 50 send/recv pairs. Verifies that
artificial delays do not corrupt data and fatalErrorCount stays 0.
Arm error injection on exactly one QP (QP 0) using ncclIbCastSetTokens to steer all WRR tokens there before the fault send. This makes the chosen QP deterministic regardless of where the WRR cursor landed after warmup, modeling the 'one link failed' scenario. Verifies: isend returns error OR fatalErrorCount > 0. Skipped when actualNqps < 2 (FaultInjCastQpErrorIsFatal covers that).
Arm error injection on all QPs of connection 1, trigger one failed send, call FaultClear, then close connection 1. Open a fresh connection 2 and transfer 20x4096-byte messages. Verifies data integrity (memcmp) and that fatalErrorCount on connection 2 is 0. A new connection is required because a faulted connection accumulates fatalErrorCount and IbCastIsend checks that counter when NCCL_IB_RETURN_ASYNC_EVENTS=1 is set.
@i-kosarev i-kosarev force-pushed the users/ilkosare/net-ib-fault-injection-tests-for-new-CAST branch 2 times, most recently from 357805f to 46cd912 Compare May 12, 2026 10:24
@i-kosarev i-kosarev requested a review from Copilot May 12, 2026 10:26
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

projects/rccl/src/transport/net_ib_cast/net_ib_cast_inspect.h:18

  • net_ib_cast_inspect.h uses the bool type in ncclIbCastSchedState, but in the non-C++ include path it never includes <stdbool.h>. If this header is ever included from a C translation unit (the file already has a C branch), it will fail to compile. Add #include <stdbool.h> in the #else branch (or otherwise ensure bool is defined for C).

Comment thread projects/rccl/test/transport/NetIbMPI/FaultInjectTests.cpp
Comment thread projects/rccl/src/transport/net_ib_cast/net_ib_fault_inject.h Outdated
Comment thread projects/rccl/cmake/rocmIb.cmake Outdated
Copy link
Copy Markdown
Contributor

@ddurnov ddurnov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please clarify why we need extra post processing step here?

@atulkulk
Copy link
Copy Markdown
Contributor

%RCCL_TEST_CONFIG
net_ib_transport.json

Fix fault injection API issues from review:
- ncclIbCastFaultSetQpDelay/SetQpError: use comm->base.nqps for bounds
  check instead of NCCL_IB_MAX_QPS
- ncclIbCastFaultClear: atomically reset fatalErrorCount to 0
- move net_ib_fault_inject.h into transport/net_ib_cast/; drop local
  #define NCCL_IB_MAX_QPS 128, include net_ib_cast_inspect.h and add
  static_assert so a size mismatch becomes a compile error
- fault hook in IbCastMultiSend: use IbCastStatsFatalError (renamed from
  ncclIbStatsFatalError in asanniko's split)
- FaultInjCastQpErrorClearRecovers: ASSERT_EQ on SetQpError return value;
  drain recvReq before DeregisterMemory
- FaultInjCastSingleQpErrorIsFatal: EXPECT_EQ on ncclIbCastSetTokens and
  ncclIbCastFaultClear return values

Consolidate NCCL_IB_MAX_QPS definition and reorganize headers:
- add src/transport/net_ib_limits.h as single source of truth for
  NCCL_IB_MAX_QPS 128, removing duplicate defines from net_ib.cc,
  common_cast.h, and net_ib_cast_inspect.h
- move net_ib_cast_inspect.h from src/include/ to src/transport/net_ib_cast/
  alongside net_ib_fault_inject.h; update CMakeLists include paths
- group net_ib_limits.h, net_ib_cast_inspect.h, net_ib_fault_inject.h
  together next to transport/net_ib.cc in src/CMakeLists.txt

Reduce test code duplication:
- add helpers to NetIbMPITestBase: DrainRecvRequest, TeardownConnection,
  ExpectEqualWeightInitTokens, ExpectActiveTokenSumInvariant
- use them in FaultInjectTests.cpp and CastTests.cpp to eliminate ~300
  lines of repeated teardown and sched-state assertion code

Address review comments:
- wrap_ibv_free_device_list: add nullptr early-return to avoid NULL
  dereference on error paths (e.g. net_ib.cc fail:975)
- IbCastRequestRetrieveAsIndex: drop dead `reqIndex < 0` check (reqIndex
  is uint32_t, so always >= 0); use %u format specifier to match type
- ibvsymbols.cc: fix typo "Count not open" -> "Could not open"
- net_ib_fault_inject.h: fix comment referencing non-existent
  net_ib_cast.cc; implementation lives in net_ib_cast/p2p.cc
- FaultInjCastSlowQpRebalances: skip when GetSplitDataMin()==0 to
  avoid uint32_t underflow in kMsgSz = GetSplitDataMin() - 1
- drop ncclIbFault sed rename from rocmIb.cmake; fault injection symbols
  (ncclIbCastFault*) live in our own source files, not the vendored plugin
@i-kosarev i-kosarev force-pushed the users/ilkosare/net-ib-fault-injection-tests-for-new-CAST branch from 46cd912 to d023c86 Compare May 12, 2026 13:08
@i-kosarev
Copy link
Copy Markdown
Contributor Author

%RCCL_TEST_CONFIG
net_ib_transport.json

@i-kosarev
Copy link
Copy Markdown
Contributor Author

Please clarify why we need extra post processing step here?

Generally we do this with functions to write nccl* prefix for code compatibility and then have rocm* prefix in binary builds. But "renaming" is not really needed for test functions. Removed

@i-kosarev i-kosarev requested a review from ddurnov May 12, 2026 13:16
Copy link
Copy Markdown
Contributor

@amd-arozanov amd-arozanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Minor comment (can be ignored):

FaultInjCastQpErrorIsFatal, FaultInjCastSingleQpErrorIsFatal and Phase 1 of FaultInjCastQpErrorClearRecovers repeat the same ~50-line flow (arm → barrier → post+poll → forward FaultInjectResult → assert), only the arming differs. Worth extracting into one RunFaultInjectErrorScenario(..., armFn) helper?

Comment thread projects/rccl/test/transport/NetIbMPI/FaultInjectTests.cpp
@i-kosarev
Copy link
Copy Markdown
Contributor Author

LGTM. Minor comment (can be ignored):

FaultInjCastQpErrorIsFatal, FaultInjCastSingleQpErrorIsFatal and Phase 1 of FaultInjCastQpErrorClearRecovers repeat the same ~50-line flow (arm → barrier → post+poll → forward FaultInjectResult → assert), only the arming differs. Worth extracting into one RunFaultInjectErrorScenario(..., armFn) helper?

Thanks for the idea! However the arming logic, message sizes, tags, and assertion messages all differ between the three tests, so a shared helper would need lot of params and it will cancel out the readability gain. The duplication is incidental rather than conceptual — each test tells its own story end-to-end, which I'd prefer to preserve.

@i-kosarev i-kosarev requested a review from amd-arozanov May 12, 2026 15:44
@i-kosarev
Copy link
Copy Markdown
Contributor Author

%RCCL_TEST_CONFIG
net_ib_transport.json

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants