Skip to content

[RCCL] net-ib: per-QP fault injection API, hooks, and tests for IB and CAST paths#5515

Closed
i-kosarev wants to merge 8 commits into
developfrom
users/ilkosare/net-ib-fault-injection-tests
Closed

[RCCL] net-ib: per-QP fault injection API, hooks, and tests for IB and CAST paths#5515
i-kosarev wants to merge 8 commits into
developfrom
users/ilkosare/net-ib-fault-injection-tests

Conversation

@i-kosarev
Copy link
Copy Markdown
Contributor

@i-kosarev i-kosarev commented Apr 27, 2026

Motivation

Adds a test-only per-QP fault injection facility to the net-ib CAST path, guarded by ENABLE_FAULT_INJECTION so production builds are zero-cost.

Technical Details

Infrastructure (net-ib: fault injection API and hooks...)

  • ncclIbNetCommBase: two new fields faultQpDelayUs[] / faultQpError[]
  • Hook in IbCastMultiSend (net_ib_cast.cc): inject artificial usleep or return ncclSystemError + increment fatalErrorCount before wrap_ibv_post_send
  • Public API in src/include/net_ib_fault_inject.h: ncclIbCastFault{SetQpDelay,SetQpError,Clear,GetFatalCount} for the CAST path
  • cmake/rocmIb.cmake: added ncclIbFault → rocmIbFault rename rule to fix duplicate symbol linker errors in net_ib_rocm.cc

Test Plan

  • FaultInjCastQpErrorIsFatal — CAST: QP error propagates as a fatal event; fault is armed on all active QPs, results forwarded from rank 1 to rank 0 for assertion
  • FaultInjCastSlowQpRebalances — CAST WRR: 10 ms delay on QP 0 causes the RTT timer to rebalance tokens below equal share; scheduler state forwarded from rank 1 to rank 0
  • FaultInjCastDelayDataIntegrity — CAST: 2 ms per-QP delay across 50 sends, full data integrity verified on rank 0
  • FaultInjCastSingleQpErrorIsFatal — CAST WRR "one link failed" scenario: all WRR tokens steered to QP 0 via ncclIbCastSetTokens before arming the fault on that QP only; verifies isend returns error or fatalErrorCount > 0; skipped when actualNqps < 2
  • FaultInjCastQpErrorClearRecovers — CAST recovery: after arming error on all QPs, triggering one failed send, and calling FaultClear, a fresh connection completes 20 × 4096-byte sends with data integrity and fatalErrorCount == 0

JIRA ID

AICOMNET-133

Submission Checklist

Copilot AI review requested due to automatic review settings April 27, 2026 22:47
@i-kosarev i-kosarev requested a review from a team as a code owner April 27, 2026 22:47
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a test-only per-QP fault injection facility for the regular IB and CAST paths, along with extensive new MPI-based transport tests and corresponding test-runner suites.

Changes:

  • Added CAST WRR scheduler “inspect” APIs and per-QP fault injection APIs/hooks (guarded by ENABLE_FAULT_INJECTION) in net_ib.cc and net_ib_cast.cc.
  • Added new NetIb MPI test infrastructure plus large sets of new tests for NIC fusion/vNIC, CAST WRR behavior, and fault injection.
  • Extended test-runner JSON configs to add CAST and fault-injection suites; updated ROCm net-ib symbol renaming to avoid duplicate symbols.

Reviewed changes

Copilot reviewed 17 out of 18 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
projects/rccl/tools/scripts/test_runner/configs/net_ib_transport.json Adds CAST and fault-injection configs/suites for transport CI.
projects/rccl/tools/scripts/test_runner/configs/mi300x_mellanox_ib.json Mirrors CAST and fault-injection suites for MI300X Mellanox IB coverage.
projects/rccl/test/transport/NetIbMPI/NicFusionTests.cpp New NIC fusion / vNIC functional & stress tests.
projects/rccl/test/transport/NetIbMPI/NetIbMPITestBase.hpp New shared MPI test fixture/helpers for net-ib and CAST testing.
projects/rccl/test/transport/NetIbMPI/NetIbFaultInject.hpp Test wrapper header re-exporting the shared fault-injection API.
projects/rccl/test/transport/NetIbMPI/NetIbCastInspect.hpp Test wrapper header re-exporting CAST scheduler inspection API.
projects/rccl/test/transport/NetIbMPI/GeneralTests.cpp New general net-ib MPI tests (init, properties, send/recv, stress).
projects/rccl/test/transport/NetIbMPI/FaultInjectTests.cpp New per-QP delay/error fault-injection tests for IB and CAST.
projects/rccl/test/transport/NetIbMPI/CastTests.cpp New CAST WRR scheduler white-box tests + stress tests.
projects/rccl/test/CMakeLists.txt Splits prior NetIb MPI test file into multiple new compilation units.
projects/rccl/src/transport/net_ib_cast.cc Adds inspect + fault-injection APIs and hook points in CAST send path.
projects/rccl/src/transport/net_ib.cc Adds fault-injection API and hook points in regular IB send path.
projects/rccl/src/include/net_ib_fault_inject.h New shared test-only fault injection API header (guarded by ENABLE_FAULT_INJECTION).
projects/rccl/src/include/net_ib_cast_inspect.h New shared test-only CAST scheduler inspection API header.
projects/rccl/src/CMakeLists.txt Adds the new headers to the build’s source list.
projects/rccl/cmake/rocmIb.cmake Adds a sed rename rule to avoid ncclIbFault* symbol duplication in ROCm net-ib generation.
projects/rccl/.gitignore Broadens build directory ignore rule from build/ to build*/.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread projects/rccl/tools/scripts/test_runner/configs/net_ib_transport.json Outdated
Comment thread projects/rccl/tools/scripts/test_runner/configs/mi300x_mellanox_ib.json Outdated
Comment thread projects/rccl/test/transport/NetIbMPI/NetIbMPITestBase.hpp
Comment thread projects/rccl/test/transport/NetIbMPI/FaultInjectTests.cpp Outdated
Comment thread projects/rccl/test/transport/NetIbMPI/FaultInjectTests.cpp Outdated
@math-ci-jobs
Copy link
Copy Markdown

math-ci-jobs Bot commented Apr 27, 2026

regression-detection smoke (multi-node) on commit be7bc6bc83fdb2740a7bab15033f8b3cc7adf453

Artifacts - Results

@math-ci-jobs
Copy link
Copy Markdown

math-ci-jobs Bot commented Apr 27, 2026

regression-detection smoke (single-node) on commit be7bc6bc83fdb2740a7bab15033f8b3cc7adf453

Artifacts - No regressions preview generated.

@i-kosarev i-kosarev force-pushed the users/ilkosare/net-ib-fault-injection-tests branch from be7bc6b to 2cd755e Compare April 28, 2026 07:20
@math-ci-jobs
Copy link
Copy Markdown

math-ci-jobs Bot commented Apr 28, 2026

regression-detection smoke (multi-node) on commit 2cd755e70202a7b0c1be838d83e927d7486a5696

Artifacts - Results

@math-ci-jobs
Copy link
Copy Markdown

math-ci-jobs Bot commented Apr 28, 2026

regression-detection smoke (single-node) on commit 2cd755e70202a7b0c1be838d83e927d7486a5696

Artifacts - No regressions preview generated.

@i-kosarev i-kosarev force-pushed the users/ilkosare/net-ib-fault-injection-tests branch from 2cd755e to c222f53 Compare April 29, 2026 18:22
@math-ci-jobs
Copy link
Copy Markdown

math-ci-jobs Bot commented Apr 29, 2026

regression-detection smoke (single-node) on commit c222f5391a2db1f28d275dd8cce9af34da7cf85a

Artifacts - No regressions preview generated.

@i-kosarev i-kosarev force-pushed the users/ilkosare/net-ib-fault-injection-tests branch 3 times, most recently from d5d1d3b to 9640bb3 Compare April 30, 2026 08:02
@math-ci-jobs
Copy link
Copy Markdown

math-ci-jobs Bot commented Apr 30, 2026

regression-detection smoke (multi-node) on commit 9640bb3

Artifacts - No regressions preview generated.

@i-kosarev i-kosarev force-pushed the users/ilkosare/net-ib-fault-injection-tests branch from 9640bb3 to 5cede29 Compare April 30, 2026 18:23
@math-ci-jobs
Copy link
Copy Markdown

math-ci-jobs Bot commented Apr 30, 2026

regression-detection smoke (single-node) on commit 9640bb3

Artifacts - No regressions preview generated.

@math-ci-jobs
Copy link
Copy Markdown

math-ci-jobs Bot commented Apr 30, 2026

regression-detection smoke (multi-node) on commit 5cede29

Artifacts - No regressions preview generated.

@math-ci-jobs
Copy link
Copy Markdown

math-ci-jobs Bot commented Apr 30, 2026

regression-detection smoke (single-node) on commit 5cede29

Artifacts - No regressions preview generated.

@i-kosarev i-kosarev force-pushed the users/ilkosare/net-ib-fault-injection-tests branch 4 times, most recently from 29251a4 to 81bf5d1 Compare May 5, 2026 14:59
@i-kosarev i-kosarev requested a review from Copilot May 5, 2026 23:21
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 8 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread projects/rccl/src/transport/net_ib_cast.cc
Comment thread projects/rccl/src/transport/net_ib_cast.cc
Comment thread projects/rccl/src/transport/net_ib_cast.cc
Comment thread projects/rccl/test/transport/NetIbMPI/FaultInjectTests.cpp
@i-kosarev i-kosarev force-pushed the users/ilkosare/net-ib-fault-injection-tests branch 3 times, most recently from f49a77e to 5f29b91 Compare May 6, 2026 13:03
Ilia Kosarev and others added 8 commits May 6, 2026 16:39
Add per-QP fault injection to ncclIbNetCommBase: faultQpDelayUs (usleep
before post_send) and faultQpError (return ncclSystemError and increment
fatalErrorCount). Hook fires in IbCastMultiSend (CAST path only).

Public API in net_ib_fault_inject.h: ncclIbCastFault* for the CAST path.

All fault code is guarded by ENABLE_FAULT_INJECTION; production builds
are zero-cost. rocmIb.cmake renamed ncclIbFault → rocmIbFault to avoid
duplicate symbols in the generated net_ib_rocm.cc.
CAST path: arm faultQpError on QP 0 of the CAST sender. Verifies that
isend returns an error or fatalErrorCount > 0, confirming the fault hook
in IbCastMultiSend propagates correctly.
CAST WRR path: arm 10 ms delay on QP 0, run 500 sends. Verifies that
the RTT timer reduces initQpTokens[0] below the equal-share baseline,
and that WRR token accounting invariants hold throughout.
CAST path: arm 2 ms per-QP delay, run 50 send/recv pairs. Verifies that
artificial delays do not corrupt data and fatalErrorCount stays 0.
Arm error injection on exactly one QP (QP 0) using ncclIbCastSetTokens to steer all WRR tokens there before the fault send. This makes the chosen QP deterministic regardless of where the WRR cursor landed after warmup, modeling the 'one link failed' scenario. Verifies: isend returns error OR fatalErrorCount > 0. Skipped when actualNqps < 2 (FaultInjCastQpErrorIsFatal covers that).
Arm error injection on all QPs of connection 1, trigger one failed send, call FaultClear, then close connection 1. Open a fresh connection 2 and transfer 20x4096-byte messages. Verifies data integrity (memcmp) and that fatalErrorCount on connection 2 is 0. A new connection is required because a faulted connection accumulates fatalErrorCount and IbCastIsend checks that counter when NCCL_IB_RETURN_ASYNC_EVENTS=1 is set.
- Validate qpIndex against comm->base.nqps (actual QP count) instead of
  NCCL_IB_MAX_QPS (128) in FaultSetQpDelay and FaultSetQpError, so
  arming a non-existent QP returns ncclInvalidArgument immediately.
- Reset fatalErrorCount in ncclIbCastFaultClear so the connection is
  truly recoverable after clearing fault state.
- Add phase 1 verification in FaultInjCastQpErrorClearRecovers: rank 1
  forwards sendRet + fatalCount to rank 0 via MPI, rank 0 asserts that
  the fault was actually observed before testing recovery in phase 2.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
@i-kosarev i-kosarev force-pushed the users/ilkosare/net-ib-fault-injection-tests branch from 5f29b91 to 41407a5 Compare May 6, 2026 14:39
Copy link
Copy Markdown
Contributor

@nikitaxgusev nikitaxgusev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general looks good.

Comment thread projects/rccl/test/transport/NetIbMPI/FaultInjectTests.cpp
Comment thread projects/rccl/test/transport/NetIbMPI/FaultInjectTests.cpp
* double-definition when both headers are included together.
*/
#ifndef NCCL_IB_MAX_QPS
#define NCCL_IB_MAX_QPS 128
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If net_ib_cast_inspect.h defines NCCL_IB_MAX_QPS to a different value lets say it changes to 256 in the future , the fault arrays will silently be with different size, may be adding static_assert(NCCL_IB_MAX_QPS == 128)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Fixed in #5931 . Actually, there is still some duplication for different reasons, (net_ib.cc and net_ib_cast.cc duplicate this between each other based on #5619 , but I will make sure that this will be fully fixed in #5931 and there will be no duplication for NCCL_IB_MAX_QPS defines)

@atulkulk
Copy link
Copy Markdown
Contributor

%RCCL_TEST_CONFIG
net_ib_transport.json

@i-kosarev i-kosarev closed this May 12, 2026
@i-kosarev
Copy link
Copy Markdown
Contributor Author

Moved to #5931

@i-kosarev i-kosarev deleted the users/ilkosare/net-ib-fault-injection-tests branch May 13, 2026 19:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants