
Nccl sync v2 28 9 clean #5934

Open
thomas-huber wants to merge 7 commits into develop from nccl-sync-v2-28-9-clean

Conversation

@thomas-huber
Contributor

Motivation

Technical Details

JIRA ID

Test Plan

Test Result

Submission Checklist

marksantesson and others added 7 commits September 24, 2025 13:00
The NCCL examples directory provides users and developers with
practical code samples that highlight NCCL’s core features. It covers
basic operations like communicator initialization, point-to-point
communication, and collective operations, as well as advanced features
such as User Buffer (UB), symmetric memory, and the device API.

GPU-Initiated Networking (GIN):
 * Provides device-side API for integrating GPU-Initiated Networking
   capability into application kernels.
 * New transport layer called DOCA GPUNetIO.
 * New ncclGin construct to create, destroy and manipulate GIN contexts.
 * New ncclGinBarrierSession to provide synchronization functionality.
 * New put, signal, counter operations for data movement and signaling.
 * GIN API signatures and functionalities are subject to change.
 * GIN Support Requirements
   * CUDA 12.2 or later when compiling the GPU code
   * NVIDIA GPUs: Volta or newer. NVIDIA GPU drivers >= 510.40.3
   * NVIDIA NICs: CX4 or newer. rdma-core >= 44.0
   * Requires nvidia-peermem or DMABUF support. When using DMABUF, Linux
     kernel >= 6.1 is required.
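
The DMABUF kernel requirement above can be checked up front by an application. The following is a minimal sketch, not part of NCCL or RCCL: the helper names `kernelAtLeast` and `hostKernelSupportsGinDmabuf` are hypothetical, and the check only mirrors the "Linux kernel >= 6.1" line from the notes. It says nothing about whether DMABUF is actually enabled on the host.

```cpp
#include <cstdio>
#include <sys/utsname.h>

// Hypothetical helper: parses "major.minor" from a kernel release string
// such as "6.1.0-13-amd64" and compares it against a minimum version.
bool kernelAtLeast(const char* release, int wantMajor, int wantMinor) {
  int major = 0, minor = 0;
  if (std::sscanf(release, "%d.%d", &major, &minor) != 2) return false;
  return major > wantMajor || (major == wantMajor && minor >= wantMinor);
}

// Hypothetical preflight check for the DMABUF path (Linux kernel >= 6.1).
// Returns false if the kernel release cannot be queried or parsed.
bool hostKernelSupportsGinDmabuf() {
  struct utsname u;
  if (uname(&u) != 0) return false;
  return kernelAtLeast(u.release, 6, 1);
}
```

An application might call `hostKernelSupportsGinDmabuf()` once at startup and fall back to nvidia-peermem, or report a clear error, when it returns false.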

New ncclCommRevoke API for fault tolerance:
 * Introduces ncclCommRevoke to quiesce ongoing NCCL work on a
   communicator without freeing resources.
 * This answers the need for a lightweight way to cancel in-flight
   collectives and bring a communicator to a safe state before
   split/shrink/finalize/destroy.
 * Includes optional cross-rank coordination (global barrier) and
   supports blocking/non-blocking usage.

New NCCL Environment Plugin:
 * The env plugin allows users to set NCCL environment variables, for
   example, after loading them from a centralized database.
 * The NCCL_ENV_PLUGIN variable can be used to let NCCL load an external
   environment plugin.

New NCCL Examples on GitHub:
 * The NCCL examples directory provides users and developers with
   practical code samples that highlight NCCL’s core features.
 * It covers basic operations like communicator initialization,
   point-to-point communication, and collective operations, as well as
   advanced features such as user buffer registration, symmetric memory,
   and the device API.

Device API improvements:
 * Adds ncclFindWindow API.
 * Adds new ncclBarrierSession to provide hybrid synchronization
   functionality.
 * Makes multimem available with as few as two ranks.
 * Removes distance (NCCL_P2P_LEVEL) considerations from determining the
   availability of symmetric memory.

Enhanced NCCL RAS output:
 * Extends RAS subsystem with JSON format to support machine-parsable
   metrics collection.
 * Enables structured data export for monitoring tools, dashboards, and
   automated analysis systems.

GitHub Pull Requests resolved:
 * Fast Init - CPU Optimizations for NCCL Initialization Large Scale.
   (PR #1789)
 * Fast Init - Improve Bootstrap AllGather by 2x at large scale by
   sending bootstrap information bidirectionally. (PR #1791)
 * Fixes spurious failures when PyTorch is statically linked with
   NCCL-2.28.3: the error was not drained and instead propagated into
   the next CUDA kernel invocation. (PR #1864)
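
The roughly 2x gain from sending bootstrap information bidirectionally (PR #1791) can be seen in a toy model of a ring AllGather. The sketch below is not NCCL's bootstrap code; `ringAllGatherSteps` is a hypothetical function that only counts ring steps, assuming each rank forwards one piece per neighbor per step.

```cpp
#include <vector>

// Toy model: counts ring steps until every rank holds every rank's piece.
// At step s, rank r receives the piece that originated s hops behind it;
// in the bidirectional case it also receives the piece from s hops ahead.
int ringAllGatherSteps(int nranks, bool bidirectional) {
  std::vector<std::vector<bool>> have(nranks, std::vector<bool>(nranks, false));
  for (int r = 0; r < nranks; ++r) have[r][r] = true;  // each rank starts with its own piece

  auto done = [&] {
    for (auto& row : have)
      for (bool b : row)
        if (!b) return false;
    return true;
  };

  int steps = 0;
  while (!done()) {
    ++steps;
    for (int r = 0; r < nranks; ++r) {
      int fromBehind = ((r - steps) % nranks + nranks) % nranks;
      have[r][fromBehind] = true;
      if (bidirectional) {
        int fromAhead = (r + steps) % nranks;
        have[r][fromAhead] = true;
      }
    }
  }
  return steps;
}
```

With 8 ranks, `ringAllGatherSteps(8, false)` returns 7 and `ringAllGatherSteps(8, true)` returns 4, which matches the N-1 versus ceil((N-1)/2) step counts expected for one-way and two-way rings.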

Other notable improvements:
 * Fixes multicast object leaks in case of failed NVLS user buffer
   registrations, which could lead to crashes. Avoids such registration
   attempts in case of the use of incompatible memory allocators.
 * Fixes potential data corruption with built-in symmetric kernels for
   small messages with size granularity under 8 bytes or when multiple
   symmetric operations were aggregated in a group.
 * Generalizes the existing point-to-point scheduling to the case of
   uneven GPU count per node.
 * Fixes a crash when network plugin assignment fails.
 * Fixes a large performance issue with NCCL_CROSS_NIC=0 and certain
   split mask settings, where NCCL cannot find a viable ring.
 * Fixes crash when NCCL is compiled with recent CUDA versions but
   running on hosts with certain specific older CUDA drivers.

Added a pure GIN all2all example and a hybrid GIN/LSA one.

Fix operation ordering between main thread and proxy thread to prevent hangs at large scale.

Fix Issue 1893 (NVIDIA/nccl#1893), a bug fix in GIN.

Subtree-merged upstream NVIDIA/nccl into projects/rccl/.

Upstream commit:

    commit dbc86fd
    Author: Mark Santesson <msantesson@nvidia.com>
    Date:   Mon Nov 10 17:57:32 2025 +0000

        NCCL 2.28.9-1

        Fix operation ordering between main thread and proxy thread to
        prevent hangs at large scale.

        Fix Issue 1893 (NVIDIA/nccl#1893), a
        bug fix in GIN.

Brings RCCL up to NCCL 2.28.9-1 (previous baseline was 2.28.3-1). This
includes the CMake build-system prep, the root CMakeLists.txt and examples
additions, the GIN examples, and the operation-ordering fix in 2.28.9.

3 participants