Skip to content

[Issue]: ncclNvlsDeregBuffer silently swallows cuMulticastUnbind failures — poisons CUDA context across ncclCommDestroy, breaks cuda-checkpoint #2117

@shubh3794

Description

@shubh3794

How is this issue impacting you?

Data corruption

Share Your Debug Logs

How is this issue impacting you?

Data corruption

Share Your Debug Logs

ncclNvlsDeregBuffer silently swallows cuMulticastUnbind failures — poisons CUDA context across ncclCommDestroy, breaks cuda-checkpoint

Summary

ncclNvlsDeregBuffer in src/transport/nvls.cc wraps cuMulticastUnbind in the CUCALL macro, which discards the CUresult. When the unbind fails, the multicast binding stays live in the CUDA driver context, but the caller — regCleanupncclCommDeregister / ncclRegCleanup (inside ncclCommDestroy) — sees ncclSuccess and continues teardown. The resulting "successful" destroy leaves driver-level state orphaned.

For checkpoint/restore integrations (cuda-checkpoint, live-migration frameworks), this is fatal: the stale binding is dumped verbatim into the process image, and every CUDA op after restore — NCCL or not — fails with cudaGetErrorString = "invalid argument" / "invalid device ordinal".

The bug is present in both v2.27.5-1 and master (b3aeda4, NCCL v2.30 dev). The two files involved are unchanged between those tags in the relevant region.

Code path

ncclCommDestroy                         src/init.cc
  → commReclaim → commCleanup → commFree
    → if (comm->nvlsSupport) ncclNvlsFree(comm)                   (shared NVLS — CUCHECK, clean)
    → ncclRegCleanup(comm)                                        ← (a)
         for each regCache slot:
           regCleanup(comm, reg)                                   src/register/register.cc
             if (reg->state & NVLS_REG_COMPLETE)
               ncclNvlsDeregBuffer(comm, &reg->mcHandle, ...)      ← (b), src/transport/nvls.cc
                 CUCALL(cuMulticastUnbind(*mcHandler, ...))        ← (c)  silently swallowed
                 CUCHECK(cuMemUnmap(...))
                 CUCHECK(cuMemAddressFree(...))
                 CUCHECK(cuMemRelease(*mcHandler))                  (refcount-- only; does not reverse binding)

ncclCommDeregister (synchronous user path) reaches (b) directly via commDeregister → regCleanup.

The silent swallow is at (c). Behavior at (b) and (a) is compounded:

  • CUCALL body is literally pfn_##cmd; — no CUresult capture, no log, not even a WARN.
  • Original regCleanup turns any ncclNvlsDeregBuffer error into a WARN and returns ncclSuccess.

Combined, a failed unbind is invisible to every caller up to and including ncclCommDestroy / ncclCommDeregister.

The other NVLS teardown path — ncclNvlsFreenvlsGroupUnbind (shared credit/buffer MC handles) — uses CUCHECK and propagates cleanly. That path is not affected.

Who triggers the broken path

NCCL auto-registers user buffers for NVLS whenever:

  • NCCL_LOCAL_REGISTER=1 (default) and a collective runs the NVLS algorithm (NCCL_ALGO=NVLS or auto-selected), or
  • NCCL_GRAPH_REGISTER=1 (default) and the collective is captured in a CUDA graph, or
  • User code calls ncclCommRegister explicitly.

Every registration goes through tryRegisterBuffercuMulticastCreate + cuMulticastAddDevice + cuMulticastBindAddr, producing a per-buffer MC handle in the regCache. At teardown, every such slot is dereg'd through ncclNvlsDeregBuffer. PyTorch DDP under torchrun — the most common multi-rank deployment — hits this path on every destroy.

Note: in single-process multi-GPU setups (ncclCommInitAll), comm->nvlsRegSupport is set to 0 by src/init.cc:771-779 when two ranks share a process (hostHash + pidHash match), so the per-buffer path is skipped. One-process-per-rank deployments (torchrun + DDP, MPI + NCCL, ncclCommInitRank) trigger it normally.

Proposed fix:

#2116

Would appreciate guidance on whether the team prefers:

  1. This minimal two-file change (only surface the failure; caller decides how to react), or
  2. A larger change that also adds retry / forced-unbind logic inside NCCL, or
  3. A new ncclCommDestroyStrict entry point that runs destroy synchronously and distinguishes "multicast still bound" from generic unhandled CUDA errors.

Design (1) is enough to unblock our cuda-checkpoint integrations; (2)/(3) are bigger surface changes that require API discussion. But the underlying issue is that right now NVLS on destroy pollutes cuda contexts as the cuda-checkpoint somehow captures multicast leftover on the cuda-contexts.

Environment

  • NCCL: v2.27.5-1 (also verified present in master at b3aeda4)
  • CUDA driver: 580.105.08
  • CUDA toolkit: 12.8.r12.8
  • GPU: NVIDIA B200 (sm_100), 8 per node, full NVSwitch (NV18 topology)
  • OS: Linux 6.6.121.1 (Azure Linux 3.0)

Related

Triggers a complete blocker for checkpoint/restore of GPU training jobs. Current workaround is to disable NVLS.

Observed (test output on NVIDIA B200, driver 580.105.08, CUDA 12.8, NCCL v2.27.5-1)

Vanilla NCCL + injection ON

[rank 0] ncclCommDeregister send=0 recv=0
[rank 1] ncclCommDeregister send=0 recv=0
[rank 2] ncclCommDeregister send=0 recv=0
[rank 3] ncclCommDeregister send=0 recv=0
[rank 0..3] ncclCommDestroy=0
deregErr=0 destroyErr=0

8 injected cuMulticastUnbind failures, all 8 invisible to caller.

Patched NCCL (this issue's fix) + injection ON

[rank 0] ncclCommDeregister send=1 recv=1     (1 = ncclUnhandledCudaError)
[rank 1] ncclCommDeregister send=1 recv=1
[rank 2] ncclCommDeregister send=1 recv=1
[rank 3] ncclCommDeregister send=1 recv=1
[rank 0..3] ncclCommDestroy=0                 (dereg caught them all)
deregErr=4 destroyErr=0

All 8 failures visible. any caller can now react.

Patched NCCL + injection OFF — regression check

[rank 0..3] ncclCommDeregister send=0 recv=0
[rank 0..3] ncclCommDestroy=0
deregErr=0 destroyErr=0

Identical to original happy path. Zero regression.

Upstream NCCL real behavior without injection, on same B200 hardware

All 8 dereg calls return cuMulticastUnbind = CUDA_SUCCESS. The failure mode does not surface in isolated NCCL workloads; it requires an external driver-state perturbation. In our deployment that perturbation is context freeze between ncclCommDestroy and the dump. Without such a perturbation, the bug is dormant — which we believe explains why it hasn't been reported before.

NCCL Version

v2.27.5-1

Your platform details

No response

Error Message & Behavior

No response

Steps to Reproduce the Issue

Steps to Reproduce the Issue

Reproducer

A 170-line standalone C++ program (forks one process per GPU) plus an opt-in env-gated failure injection inside ncclNvlsDeregBuffer. The same binary + injection demonstrates the bug against vanilla NCCL and the fix against the proposed patch.

Program: nvls_destroy_repro.cu

#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <vector>
#include <sys/wait.h>
#include <unistd.h>
#include <cuda_runtime.h>
#include <nccl.h>

#define CUDACHECK(cmd) do { cudaError_t e = (cmd); if (e != cudaSuccess) { \
  fprintf(stderr, "[rank %d] CUDA err %s:%d: %s\n", g_rank, __FILE__, __LINE__, cudaGetErrorString(e)); _exit(1); } } while(0)
#define NCCLCHECK(cmd) do { ncclResult_t r = (cmd); if (r != ncclSuccess) { \
  fprintf(stderr, "[rank %d] NCCL err %s:%d: %s\n", g_rank, __FILE__, __LINE__, ncclGetErrorString(r)); _exit(1); } } while(0)

static int g_rank = -1;

static void child_work(int rank, int nDev, ncclUniqueId id) {
  g_rank = rank;
  CUDACHECK(cudaSetDevice(rank));
  const size_t bytes = 64u << 20;              // 64 MB
  float *sendbuf, *recvbuf;
  NCCLCHECK(ncclMemAlloc((void**)&sendbuf, bytes));
  NCCLCHECK(ncclMemAlloc((void**)&recvbuf, bytes));
  CUDACHECK(cudaMemset(sendbuf, 1, bytes));
  cudaStream_t stream; CUDACHECK(cudaStreamCreate(&stream));
  ncclComm_t comm;
  NCCLCHECK(ncclCommInitRank(&comm, nDev, id, rank));
  void *sh, *rh;
  NCCLCHECK(ncclCommRegister(comm, sendbuf, bytes, &sh));
  NCCLCHECK(ncclCommRegister(comm, recvbuf, bytes, &rh));
  for (int it = 0; it < 5; it++)
    NCCLCHECK(ncclAllReduce(sendbuf, recvbuf, bytes/sizeof(float), ncclFloat, ncclSum, comm, stream));
  CUDACHECK(cudaStreamSynchronize(stream));
  ncclResult_t dS = ncclCommDeregister(comm, sh);
  ncclResult_t dR = ncclCommDeregister(comm, rh);
  fprintf(stderr, "[rank %d] ncclCommDeregister send=%d recv=%d\n", rank, dS, dR);
  ncclResult_t destroyRes = ncclCommDestroy(comm);
  fprintf(stderr, "[rank %d] ncclCommDestroy=%d\n", rank, destroyRes);
  sleep(2);
  cudaStreamDestroy(stream);
  ncclMemFree(sendbuf); ncclMemFree(recvbuf);
  _exit((dS || dR || destroyRes) ? 3 : 0);
}

int main(int argc, char** argv) {
  int nDev = argc > 1 ? atoi(argv[1]) : 4;
  ncclUniqueId id; ncclGetUniqueId(&id);
  std::vector<pid_t> kids(nDev);
  for (int i = 0; i < nDev; i++) {
    pid_t p = fork();
    if (p == 0) child_work(i, nDev, id);
    kids[i] = p;
  }
  int deregErr = 0, destroyErr = 0;
  for (int i = 0; i < nDev; i++) {
    int st; waitpid(kids[i], &st, 0);
    int ec = WIFEXITED(st) ? WEXITSTATUS(st) : -1;
    if (ec & 1) deregErr++; if (ec & 2) destroyErr++;
  }
  printf("deregErr=%d destroyErr=%d\n", deregErr, destroyErr);
  return 0;
}

Build on any host with CUDA 12.1+ and an NVLS-capable GPU (H100, B200, GB200):

nvcc -O2 -std=c++17 -gencode=arch=compute_90,code=sm_90 nvls_destroy_repro.cu \
     -o nvls_destroy_repro -lnccl -lcudart -lcuda

Opt-in injection (5-line diff applied alongside any NCCL build)

--- a/src/transport/nvls.cc
+++ b/src/transport/nvls.cc
@@ -110 +110,5 @@ ncclResult_t ncclNvlsDeregBuffer(struct ncclComm* comm, CUmemGenericAllocationHa
-  CUCALL(cuMulticastUnbind(*mcHandler, dev, 0/*mcOffset*/, ucsize));
+  CUresult _res;
+  const char* _inj = getenv("NCCL_NVLS_REPRO_INJECT_UNBIND_FAIL");
+  if (_inj && _inj[0] == '1') _res = CUDA_ERROR_UNKNOWN;
+  else _res = CUPFN(cuMulticastUnbind(*mcHandler, dev, 0/*mcOffset*/, ucsize));
+  (void)_res; /* preserve original behavior: swallow */

This adds the env-gated injection while preserving the bug (the CUresult is still discarded). Runs against either the unpatched or patched NCCL below.

NCCL Version

v2.27.5-1

Your platform details

No response

Error Message & Behavior

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions