How is this issue impacting you?
Data corruption
Share Your Debug Logs
How is this issue impacting you?
Data corruption
Share Your Debug Logs
ncclNvlsDeregBuffer silently swallows cuMulticastUnbind failures — poisons CUDA context across ncclCommDestroy, breaks cuda-checkpoint
Summary
ncclNvlsDeregBuffer in src/transport/nvls.cc wraps cuMulticastUnbind in the CUCALL macro, which discards the CUresult. When the unbind fails, the multicast binding stays live in the CUDA driver context, but the caller — regCleanup → ncclCommDeregister / ncclRegCleanup (inside ncclCommDestroy) — sees ncclSuccess and continues teardown. The resulting "successful" destroy leaves driver-level state orphaned.
For checkpoint/restore integrations (cuda-checkpoint, live-migration frameworks), this is fatal: the stale binding is dumped verbatim into the process image, and every CUDA op after restore — NCCL or not — fails with cudaGetErrorString = "invalid argument" / "invalid device ordinal".
The bug is present in both v2.27.5-1 and master (b3aeda4, NCCL v2.30 dev). The two files involved are unchanged between those tags in the relevant region.
Code path
ncclCommDestroy src/init.cc
→ commReclaim → commCleanup → commFree
→ if (comm->nvlsSupport) ncclNvlsFree(comm) (shared NVLS — CUCHECK, clean)
→ ncclRegCleanup(comm) ← (a)
for each regCache slot:
regCleanup(comm, reg) src/register/register.cc
if (reg->state & NVLS_REG_COMPLETE)
ncclNvlsDeregBuffer(comm, ®->mcHandle, ...) ← (b), src/transport/nvls.cc
CUCALL(cuMulticastUnbind(*mcHandler, ...)) ← (c) silently swallowed
CUCHECK(cuMemUnmap(...))
CUCHECK(cuMemAddressFree(...))
CUCHECK(cuMemRelease(*mcHandler)) (refcount-- only; does not reverse binding)
ncclCommDeregister (synchronous user path) reaches (b) directly via commDeregister → regCleanup.
The silent swallow is at (c). Behavior at (b) and (a) is compounded:
CUCALL body is literally pfn_##cmd; — no CUresult capture, no log, not even a WARN.
- Original
regCleanup turns any ncclNvlsDeregBuffer error into a WARN and returns ncclSuccess.
Combined, a failed unbind is invisible to every caller up to and including ncclCommDestroy / ncclCommDeregister.
The other NVLS teardown path — ncclNvlsFree → nvlsGroupUnbind (shared credit/buffer MC handles) — uses CUCHECK and propagates cleanly. That path is not affected.
Who triggers the broken path
NCCL auto-registers user buffers for NVLS whenever:
NCCL_LOCAL_REGISTER=1 (default) and a collective runs the NVLS algorithm (NCCL_ALGO=NVLS or auto-selected), or
NCCL_GRAPH_REGISTER=1 (default) and the collective is captured in a CUDA graph, or
- User code calls
ncclCommRegister explicitly.
Every registration goes through tryRegisterBuffer → cuMulticastCreate + cuMulticastAddDevice + cuMulticastBindAddr, producing a per-buffer MC handle in the regCache. At teardown, every such slot is dereg'd through ncclNvlsDeregBuffer. PyTorch DDP under torchrun — the most common multi-rank deployment — hits this path on every destroy.
Note: in single-process multi-GPU setups (ncclCommInitAll), comm->nvlsRegSupport is set to 0 by src/init.cc:771-779 when two ranks share a process (hostHash + pidHash match), so the per-buffer path is skipped. One-process-per-rank deployments (torchrun + DDP, MPI + NCCL, ncclCommInitRank) trigger it normally.
Proposed fix:
#2116
Would appreciate guidance on whether the team prefers:
- This minimal two-file change (only surface the failure; caller decides how to react), or
- A larger change that also adds retry / forced-unbind logic inside NCCL, or
- A new
ncclCommDestroyStrict entry point that runs destroy synchronously and distinguishes "multicast still bound" from generic unhandled CUDA errors.
Design (1) is enough to unblock our cuda-checkpoint integrations; (2)/(3) are bigger surface changes that require API discussion. But the underlying issue is that right now NVLS on destroy pollutes cuda contexts as the cuda-checkpoint somehow captures multicast leftover on the cuda-contexts.
Environment
- NCCL: v2.27.5-1 (also verified present in
master at b3aeda4)
- CUDA driver: 580.105.08
- CUDA toolkit: 12.8.r12.8
- GPU: NVIDIA B200 (sm_100), 8 per node, full NVSwitch (NV18 topology)
- OS: Linux 6.6.121.1 (Azure Linux 3.0)
Related
Triggers a complete blocker for checkpoint/restore of GPU training jobs. Current workaround is to disable NVLS.
Observed (test output on NVIDIA B200, driver 580.105.08, CUDA 12.8, NCCL v2.27.5-1)
Vanilla NCCL + injection ON
[rank 0] ncclCommDeregister send=0 recv=0
[rank 1] ncclCommDeregister send=0 recv=0
[rank 2] ncclCommDeregister send=0 recv=0
[rank 3] ncclCommDeregister send=0 recv=0
[rank 0..3] ncclCommDestroy=0
deregErr=0 destroyErr=0
8 injected cuMulticastUnbind failures, all 8 invisible to caller.
Patched NCCL (this issue's fix) + injection ON
[rank 0] ncclCommDeregister send=1 recv=1 (1 = ncclUnhandledCudaError)
[rank 1] ncclCommDeregister send=1 recv=1
[rank 2] ncclCommDeregister send=1 recv=1
[rank 3] ncclCommDeregister send=1 recv=1
[rank 0..3] ncclCommDestroy=0 (dereg caught them all)
deregErr=4 destroyErr=0
All 8 failures visible. any caller can now react.
Patched NCCL + injection OFF — regression check
[rank 0..3] ncclCommDeregister send=0 recv=0
[rank 0..3] ncclCommDestroy=0
deregErr=0 destroyErr=0
Identical to original happy path. Zero regression.
Upstream NCCL real behavior without injection, on same B200 hardware
All 8 dereg calls return cuMulticastUnbind = CUDA_SUCCESS. The failure mode does not surface in isolated NCCL workloads; it requires an external driver-state perturbation. In our deployment that perturbation is context freeze between ncclCommDestroy and the dump. Without such a perturbation, the bug is dormant — which we believe explains why it hasn't been reported before.
NCCL Version
v2.27.5-1
Your platform details
No response
Error Message & Behavior
No response
Steps to Reproduce the Issue
Steps to Reproduce the Issue
Reproducer
A 170-line standalone C++ program (forks one process per GPU) plus an opt-in env-gated failure injection inside ncclNvlsDeregBuffer. The same binary + injection demonstrates the bug against vanilla NCCL and the fix against the proposed patch.
Program: nvls_destroy_repro.cu
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <vector>
#include <sys/wait.h>
#include <unistd.h>
#include <cuda_runtime.h>
#include <nccl.h>
#define CUDACHECK(cmd) do { cudaError_t e = (cmd); if (e != cudaSuccess) { \
fprintf(stderr, "[rank %d] CUDA err %s:%d: %s\n", g_rank, __FILE__, __LINE__, cudaGetErrorString(e)); _exit(1); } } while(0)
#define NCCLCHECK(cmd) do { ncclResult_t r = (cmd); if (r != ncclSuccess) { \
fprintf(stderr, "[rank %d] NCCL err %s:%d: %s\n", g_rank, __FILE__, __LINE__, ncclGetErrorString(r)); _exit(1); } } while(0)
static int g_rank = -1;
static void child_work(int rank, int nDev, ncclUniqueId id) {
g_rank = rank;
CUDACHECK(cudaSetDevice(rank));
const size_t bytes = 64u << 20; // 64 MB
float *sendbuf, *recvbuf;
NCCLCHECK(ncclMemAlloc((void**)&sendbuf, bytes));
NCCLCHECK(ncclMemAlloc((void**)&recvbuf, bytes));
CUDACHECK(cudaMemset(sendbuf, 1, bytes));
cudaStream_t stream; CUDACHECK(cudaStreamCreate(&stream));
ncclComm_t comm;
NCCLCHECK(ncclCommInitRank(&comm, nDev, id, rank));
void *sh, *rh;
NCCLCHECK(ncclCommRegister(comm, sendbuf, bytes, &sh));
NCCLCHECK(ncclCommRegister(comm, recvbuf, bytes, &rh));
for (int it = 0; it < 5; it++)
NCCLCHECK(ncclAllReduce(sendbuf, recvbuf, bytes/sizeof(float), ncclFloat, ncclSum, comm, stream));
CUDACHECK(cudaStreamSynchronize(stream));
ncclResult_t dS = ncclCommDeregister(comm, sh);
ncclResult_t dR = ncclCommDeregister(comm, rh);
fprintf(stderr, "[rank %d] ncclCommDeregister send=%d recv=%d\n", rank, dS, dR);
ncclResult_t destroyRes = ncclCommDestroy(comm);
fprintf(stderr, "[rank %d] ncclCommDestroy=%d\n", rank, destroyRes);
sleep(2);
cudaStreamDestroy(stream);
ncclMemFree(sendbuf); ncclMemFree(recvbuf);
_exit((dS || dR || destroyRes) ? 3 : 0);
}
int main(int argc, char** argv) {
int nDev = argc > 1 ? atoi(argv[1]) : 4;
ncclUniqueId id; ncclGetUniqueId(&id);
std::vector<pid_t> kids(nDev);
for (int i = 0; i < nDev; i++) {
pid_t p = fork();
if (p == 0) child_work(i, nDev, id);
kids[i] = p;
}
int deregErr = 0, destroyErr = 0;
for (int i = 0; i < nDev; i++) {
int st; waitpid(kids[i], &st, 0);
int ec = WIFEXITED(st) ? WEXITSTATUS(st) : -1;
if (ec & 1) deregErr++; if (ec & 2) destroyErr++;
}
printf("deregErr=%d destroyErr=%d\n", deregErr, destroyErr);
return 0;
}
Build on any host with CUDA 12.1+ and an NVLS-capable GPU (H100, B200, GB200):
nvcc -O2 -std=c++17 -gencode=arch=compute_90,code=sm_90 nvls_destroy_repro.cu \
-o nvls_destroy_repro -lnccl -lcudart -lcuda
Opt-in injection (5-line diff applied alongside any NCCL build)
--- a/src/transport/nvls.cc
+++ b/src/transport/nvls.cc
@@ -110 +110,5 @@ ncclResult_t ncclNvlsDeregBuffer(struct ncclComm* comm, CUmemGenericAllocationHa
- CUCALL(cuMulticastUnbind(*mcHandler, dev, 0/*mcOffset*/, ucsize));
+ CUresult _res;
+ const char* _inj = getenv("NCCL_NVLS_REPRO_INJECT_UNBIND_FAIL");
+ if (_inj && _inj[0] == '1') _res = CUDA_ERROR_UNKNOWN;
+ else _res = CUPFN(cuMulticastUnbind(*mcHandler, dev, 0/*mcOffset*/, ucsize));
+ (void)_res; /* preserve original behavior: swallow */
This adds the env-gated injection while preserving the bug (the CUresult is still discarded). Runs against either the unpatched or patched NCCL below.
NCCL Version
v2.27.5-1
Your platform details
No response
Error Message & Behavior
No response
How is this issue impacting you?
Data corruption
Share Your Debug Logs
How is this issue impacting you?
Data corruption
Share Your Debug Logs
ncclNvlsDeregBuffersilently swallowscuMulticastUnbindfailures — poisons CUDA context acrossncclCommDestroy, breaks cuda-checkpointSummary
ncclNvlsDeregBufferinsrc/transport/nvls.ccwrapscuMulticastUnbindin theCUCALLmacro, which discards theCUresult. When the unbind fails, the multicast binding stays live in the CUDA driver context, but the caller —regCleanup→ncclCommDeregister/ncclRegCleanup(insidencclCommDestroy) — seesncclSuccessand continues teardown. The resulting "successful" destroy leaves driver-level state orphaned.For checkpoint/restore integrations (
cuda-checkpoint, live-migration frameworks), this is fatal: the stale binding is dumped verbatim into the process image, and every CUDA op after restore — NCCL or not — fails withcudaGetErrorString="invalid argument"/"invalid device ordinal".The bug is present in both
v2.27.5-1andmaster(b3aeda4, NCCL v2.30 dev). The two files involved are unchanged between those tags in the relevant region.Code path
The silent swallow is at (c). Behavior at (b) and (a) is compounded:
CUCALLbody is literallypfn_##cmd;— no CUresult capture, no log, not even aWARN.regCleanupturns anyncclNvlsDeregBuffererror into aWARNand returnsncclSuccess.Combined, a failed unbind is invisible to every caller up to and including
ncclCommDestroy/ncclCommDeregister.The other NVLS teardown path —
ncclNvlsFree→nvlsGroupUnbind(shared credit/buffer MC handles) — usesCUCHECKand propagates cleanly. That path is not affected.Who triggers the broken path
NCCL auto-registers user buffers for NVLS whenever:
NCCL_LOCAL_REGISTER=1(default) and a collective runs the NVLS algorithm (NCCL_ALGO=NVLSor auto-selected), orNCCL_GRAPH_REGISTER=1(default) and the collective is captured in a CUDA graph, orncclCommRegisterexplicitly.Every registration goes through
tryRegisterBuffer→cuMulticastCreate+cuMulticastAddDevice+cuMulticastBindAddr, producing a per-buffer MC handle in the regCache. At teardown, every such slot is dereg'd throughncclNvlsDeregBuffer. PyTorch DDP undertorchrun— the most common multi-rank deployment — hits this path on every destroy.Note: in single-process multi-GPU setups (
ncclCommInitAll),comm->nvlsRegSupportis set to 0 bysrc/init.cc:771-779when two ranks share a process (hostHash+pidHashmatch), so the per-buffer path is skipped. One-process-per-rank deployments (torchrun + DDP, MPI + NCCL,ncclCommInitRank) trigger it normally.Proposed fix:
#2116
Would appreciate guidance on whether the team prefers:
ncclCommDestroyStrictentry point that runs destroy synchronously and distinguishes "multicast still bound" from generic unhandled CUDA errors.Design (1) is enough to unblock our cuda-checkpoint integrations; (2)/(3) are bigger surface changes that require API discussion. But the underlying issue is that right now NVLS on destroy pollutes cuda contexts as the cuda-checkpoint somehow captures multicast leftover on the cuda-contexts.
Environment
masteratb3aeda4)Related
Triggers a complete blocker for checkpoint/restore of GPU training jobs. Current workaround is to disable NVLS.
Observed (test output on NVIDIA B200, driver 580.105.08, CUDA 12.8, NCCL v2.27.5-1)
Vanilla NCCL + injection ON
8 injected
cuMulticastUnbindfailures, all 8 invisible to caller.Patched NCCL (this issue's fix) + injection ON
All 8 failures visible. any caller can now react.
Patched NCCL + injection OFF — regression check
Identical to original happy path. Zero regression.
Upstream NCCL real behavior without injection, on same B200 hardware
All 8 dereg calls return
cuMulticastUnbind = CUDA_SUCCESS. The failure mode does not surface in isolated NCCL workloads; it requires an external driver-state perturbation. In our deployment that perturbation is context freeze betweenncclCommDestroyand the dump. Without such a perturbation, the bug is dormant — which we believe explains why it hasn't been reported before.NCCL Version
v2.27.5-1
Your platform details
No response
Error Message & Behavior
No response
Steps to Reproduce the Issue
Steps to Reproduce the Issue
Reproducer
A 170-line standalone C++ program (forks one process per GPU) plus an opt-in env-gated failure injection inside
ncclNvlsDeregBuffer. The same binary + injection demonstrates the bug against vanilla NCCL and the fix against the proposed patch.Program:
nvls_destroy_repro.cuBuild on any host with CUDA 12.1+ and an NVLS-capable GPU (H100, B200, GB200):
Opt-in injection (5-line diff applied alongside any NCCL build)
This adds the env-gated injection while preserving the bug (the CUresult is still discarded). Runs against either the unpatched or patched NCCL below.
NCCL Version
v2.27.5-1
Your platform details
No response
Error Message & Behavior
No response