Fix fault injection API issues from review:
- ncclIbCastFaultSetQpDelay/SetQpError: use comm->base.nqps for bounds
  check instead of NCCL_IB_MAX_QPS
- ncclIbCastFaultClear: atomically reset fatalErrorCount to 0
- move net_ib_fault_inject.h into transport/net_ib_cast/; drop local
  #define NCCL_IB_MAX_QPS 128, include net_ib_cast_inspect.h and add
  static_assert so a size mismatch becomes a compile error
- fault hook in IbCastMultiSend: use IbCastStatsFatalError (renamed from
  ncclIbStatsFatalError in asanniko's split)
- FaultInjCastQpErrorClearRecovers: ASSERT_EQ on SetQpError return value;
  drain recvReq before DeregisterMemory
- FaultInjCastSingleQpErrorIsFatal: EXPECT_EQ on ncclIbCastSetTokens and
  ncclIbCastFaultClear return values
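The first two bullets above (bounds check against the active QP count, atomic reset of the fatal-error counter) can be illustrated with a minimal sketch. Every name prefixed with `Sketch` or `kSketch` is a hypothetical stand-in: the real NCCL structs, return codes, and function signatures are not shown in this message, so this is not the actual implementation.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins; the real NCCL types and result codes are assumed,
// not copied from the source tree.
constexpr int kSketchMaxQps = 128;  // array capacity, mirrors NCCL_IB_MAX_QPS

enum SketchResult { kSketchSuccess = 0, kSketchInvalidArgument = 1 };

struct SketchCommBase {
  int nqps;  // number of QPs actually created for this connection
};

struct SketchComm {
  SketchCommBase base;
  uint32_t qpDelayUs[kSketchMaxQps];         // injected per-QP delay
  std::atomic<uint64_t> fatalErrorCount{0};  // bumped by the fault hook
};

// Rejects indices beyond the active QP count, not merely beyond the array
// capacity: slots in [nqps, kSketchMaxQps) exist in memory but name no QP.
SketchResult SketchSetQpDelay(SketchComm* comm, int qpIndex, uint32_t delayUs) {
  if (comm == nullptr) return kSketchInvalidArgument;
  assert(comm->base.nqps <= kSketchMaxQps);
  if (qpIndex < 0 || qpIndex >= comm->base.nqps) return kSketchInvalidArgument;
  comm->qpDelayUs[qpIndex] = delayUs;
  return kSketchSuccess;
}

// Clears injected faults and resets the counter with an atomic store, so a
// concurrent reader never observes a torn value.
SketchResult SketchFaultClear(SketchComm* comm) {
  if (comm == nullptr) return kSketchInvalidArgument;
  for (int i = 0; i < comm->base.nqps; ++i) comm->qpDelayUs[i] = 0;
  comm->fatalErrorCount.store(0, std::memory_order_relaxed);
  return kSketchSuccess;
}
```

With `nqps = 4`, index 100 is inside the 128-slot array but is still rejected, which is the case a capacity-only check would miss.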

Consolidate NCCL_IB_MAX_QPS definition and reorganize headers:
- add src/transport/net_ib_limits.h as single source of truth for
  NCCL_IB_MAX_QPS 128, removing duplicate defines from net_ib.cc,
  common_cast.h, and net_ib_cast_inspect.h
- move net_ib_cast_inspect.h from src/include/ to src/transport/net_ib_cast/
  alongside net_ib_fault_inject.h; update CMakeLists include paths
- group net_ib_limits.h, net_ib_cast_inspect.h, net_ib_fault_inject.h
  together next to transport/net_ib.cc in src/CMakeLists.txt
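The single-definition and static_assert points above can be sketched in one translation unit. The struct names and fields below are hypothetical; only the macro name and its value 128 come from this message, and the real headers may lay things out differently.

```cpp
#include <cstddef>
#include <type_traits>

// Would live in the one shared limits header (net_ib_limits.h in this commit).
#ifndef NCCL_IB_MAX_QPS
#define NCCL_IB_MAX_QPS 128
#endif

// Hypothetical inspection-side snapshot, sized by the shared limit.
struct SketchInspectState {
  unsigned tokens[NCCL_IB_MAX_QPS];
};

// Hypothetical fault-injection table, also sized by the shared limit.
struct SketchFaultTable {
  unsigned delayUs[NCCL_IB_MAX_QPS];
  bool forceError[NCCL_IB_MAX_QPS];
};

// If either array were ever sized by a stale local define, the build fails
// here instead of corrupting memory at run time.
static_assert(std::extent<decltype(SketchInspectState::tokens)>::value ==
                  std::extent<decltype(SketchFaultTable::delayUs)>::value,
              "inspect and fault-injection tables must share one QP limit");

// Exposes the compiled-in capacity so callers do not repeat the macro.
constexpr std::size_t SketchFaultTableCapacity() {
  return std::extent<decltype(SketchFaultTable::delayUs)>::value;
}
```

The assertion compares the two array extents directly instead of comparing each against the macro, so it still catches a mismatch if one header were to shadow the macro with its own value.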

Reduce test code duplication:
- add helpers to NetIbMPITestBase: DrainRecvRequest, TeardownConnection,
  ExpectEqualWeightInitTokens, ExpectActiveTokenSumInvariant
- use them in FaultInjectTests.cpp and CastTests.cpp to eliminate ~300
  lines of repeated teardown and sched-state assertion code
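The drain-then-teardown ordering that these helpers centralize (and that the recvReq fix relies on) can be sketched without GoogleTest or MPI. All types and functions here are hypothetical models: the real DrainRecvRequest and TeardownConnection signatures are not shown in this message, and the real helpers presumably call into the net plugin instead of a fake poll.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical in-flight receive that completes after a fixed number of polls.
struct SketchRequest {
  int pollsRemaining;
  bool done;
};

// Hypothetical stand-in for polling a request for completion.
bool SketchPoll(SketchRequest* req) {
  if (!req->done && --req->pollsRemaining <= 0) req->done = true;
  return req->done;
}

// Polls until the request completes or the poll budget runs out.
bool SketchDrainRecvRequest(SketchRequest* req, int maxPolls) {
  for (int i = 0; i < maxPolls; ++i) {
    if (SketchPoll(req)) return true;
  }
  return false;
}

struct SketchConnection {
  SketchRequest* pendingRecv;    // may be null when nothing is in flight
  bool memoryRegistered;
  bool open;
  std::vector<std::string> log;  // records teardown order for inspection
};

// Drains any pending receive first, then deregisters memory, then closes.
// On a drain timeout it returns false and leaves the buffer registered, so
// nothing in flight ever targets deregistered memory.
bool SketchTeardownConnection(SketchConnection* conn, int maxPolls) {
  assert(conn != nullptr);
  if (conn->pendingRecv != nullptr) {
    if (!SketchDrainRecvRequest(conn->pendingRecv, maxPolls)) return false;
    conn->log.push_back("drain");
    conn->pendingRecv = nullptr;
  }
  conn->memoryRegistered = false;
  conn->log.push_back("deregister");
  conn->open = false;
  conn->log.push_back("close");
  return true;
}
```

Keeping the ordering in one helper means each test states only what it checks, and a future change to the teardown sequence is made in a single place.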
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>