NCCL sync v2.28.9 clean #5934
Open
thomas-huber wants to merge 7 commits into develop from
Conversation
The NCCL examples directory provides users and developers with practical code samples that highlight NCCL’s core features. It covers basic operations like communicator initialization, point-to-point communication, and collective operations, as well as advanced features such as User Buffer (UB), symmetric memory, and the device API.
GPU-Initiated Networking (GIN):
* Provides device-side API for integrating GPU-Initiated Networking
capability into application kernels.
* New transport layer called DOCA GPUNetIO.
* New ncclGin construct to create, destroy and manipulate GIN contexts.
* New ncclGinBarrierSession to provide synchronization functionality.
* New put, signal, counter operations for data movement and signaling.
* GIN API signatures and functionalities are subject to change.
* GIN support requirements:
  * CUDA 12.2 or later when compiling the GPU code.
  * NVIDIA GPUs: Volta or newer. NVIDIA GPU drivers >= 510.40.3.
  * NVIDIA NICs: CX4 or newer. rdma-core >= 44.0.
  * Requires nvidia-peermem or DMABUF support. When using DMABUF, Linux
    kernel >= 6.1 is required.
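The put/signal/counter operations listed above describe a one-sided pattern: a sender writes directly into a peer's buffer, then signals a counter that the receiver waits on. A toy Python model of those semantics (class and method names are illustrative only; the real ncclGin device API is CUDA-side and, as noted above, subject to change):

```python
import threading

class ToyGinContext:
    """Toy model of one-sided put/signal/counter semantics.
    Illustrative only; not the real ncclGin device API."""
    def __init__(self, nranks, bufsize):
        self.buffers = [bytearray(bufsize) for _ in range(nranks)]
        self.counters = [0] * nranks
        self.cond = threading.Condition()

    def put(self, peer, offset, data):
        # One-sided write into the peer's buffer; the peer takes no action.
        self.buffers[peer][offset:offset + len(data)] = data

    def signal(self, peer):
        # Bump the peer's counter so its wait() can observe completion.
        with self.cond:
            self.counters[peer] += 1
            self.cond.notify_all()

    def wait(self, rank, value):
        # Block until this rank's counter reaches the expected value.
        with self.cond:
            self.cond.wait_for(lambda: self.counters[rank] >= value)

ctx = ToyGinContext(nranks=2, bufsize=8)

def sender():
    ctx.put(peer=1, offset=0, data=b"payload!")
    ctx.signal(peer=1)

t = threading.Thread(target=sender)
t.start()
ctx.wait(rank=1, value=1)     # receiver blocks until the put is signaled
t.join()
print(bytes(ctx.buffers[1]))  # b'payload!'
```

The key property the model captures is that data movement (put) and completion notification (signal/counter) are decoupled, which is what lets application kernels overlap communication with compute.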
New ncclCommRevoke API for fault tolerance:
* Introduces ncclCommRevoke to quiesce ongoing NCCL work on a
communicator without freeing resources.
* Provides a lightweight way to cancel in-flight collectives and bring
  a communicator to a safe state before split/shrink/finalize/destroy.
* Includes optional cross-rank coordination (global barrier) and
supports blocking/non-blocking usage.
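The intended lifecycle is revoke first, then clean up. Since this changelog does not give the ncclCommRevoke signature, the sketch below is a toy Python model of the ordering constraint only, with invented names:

```python
class ToyComm:
    """Toy model of revoke-then-destroy ordering; illustrative only,
    not the NCCL API."""
    def __init__(self):
        self.pending = ["allreduce", "broadcast"]  # in-flight collectives
        self.revoked = False
        self.alive = True

    def revoke(self):
        # Quiesce in-flight work; resources stay allocated.
        self.pending.clear()
        self.revoked = True

    def destroy(self):
        # Only safe once the communicator has been quiesced.
        assert self.revoked or not self.pending
        self.alive = False

comm = ToyComm()
comm.revoke()   # cancel in-flight collectives, reach a safe state
comm.destroy()  # then split/shrink/finalize/destroy as needed
```

The point of the model: revoke does not free anything, it only guarantees that subsequent teardown or restructuring calls observe a quiesced communicator.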
New NCCL Environment Plugin:
* The env plugin allows users to set NCCL environment variables, for
example, after loading them from a centralized database.
* The NCCL_ENV_PLUGIN variable can be used to let NCCL load an external
environment plugin.
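A real env plugin is a shared library that NCCL loads when NCCL_ENV_PLUGIN is set; its observable effect is that NCCL environment variables are populated before NCCL reads them. A minimal Python model of that effect (the specific variable values below are invented for illustration, not recommendations):

```python
import os

def apply_env_plugin(settings):
    """Model of an env plugin's effect: inject NCCL_* variables before
    initialization, e.g. after fetching them from a central database.
    Real plugins are shared libraries loaded via NCCL_ENV_PLUGIN."""
    for key, value in settings.items():
        # setdefault: explicit user-set environment variables still win.
        os.environ.setdefault(key, value)

apply_env_plugin({"NCCL_DEBUG": "WARN", "NCCL_IB_TIMEOUT": "22"})
```

Letting already-set variables take precedence mirrors the usual expectation that a user's explicit configuration overrides centrally distributed defaults.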
New NCCL Examples on GitHub:
* The NCCL examples directory provides users and developers with
practical code samples that highlight NCCL’s core features.
* It covers basic operations like communicator initialization,
point-to-point communication, and collective operations, as well as
advanced features such as user buffer registration, symmetric memory,
and the device API.
Device API improvements:
* Adds ncclFindWindow API.
* Adds new ncclBarrierSession to provide hybrid synchronization
functionality.
* Makes multimem available with as few as two ranks.
* Removes distance (NCCL_P2P_LEVEL) considerations from determining the
availability of symmetric memory.
Enhanced NCCL RAS output:
* Extends RAS subsystem with JSON format to support machine-parsable
metrics collection.
* Enables structured data export for monitoring tools, dashboards, and
automated analysis systems.
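Machine-parsable output means RAS metrics can feed monitoring pipelines directly with standard JSON tooling. The record schema below is invented for illustration (the changelog does not specify field names); the parsing pattern is the point:

```python
import json

# Hypothetical RAS JSON record; the real schema may differ.
ras_output = '''
{"version": 1, "ranks": [
  {"rank": 0, "status": "ok"},
  {"rank": 1, "status": "stalled"}
]}
'''

record = json.loads(ras_output)
# Flag any rank a dashboard or alerting system should surface.
unhealthy = [r["rank"] for r in record["ranks"] if r["status"] != "ok"]
print(unhealthy)  # [1]
```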
GitHub pull requests resolved:
* Fast Init - CPU Optimizations for NCCL Initialization at Large Scale.
(PR #1789)
* Fast Init - Improve Bootstrap AllGather by 2x at large scale by
sending bootstrap information bidirectionally. (PR #1791)
* Fixes spurious failures when PyTorch is statically linked with
NCCL-2.28.3, where the error was not drained and instead propagated
into the next CUDA kernel invocation. (PR #1864)
Other notable improvements:
* Fixes multicast object leaks on failed NVLS user buffer
registrations, which could lead to crashes. Avoids such registration
attempts when incompatible memory allocators are used.
* Fixes potential data corruption with built-in symmetric kernels for
small messages with size granularity under 8 bytes or when multiple
symmetric operations were aggregated in a group.
* Generalizes the existing point-to-point scheduling to the case of
uneven GPU counts per node.
* Fixes a crash when network plugin assignment fails.
* Fixes a large performance issue with NCCL_CROSS_NIC=0 and certain
split mask settings, where NCCL cannot find a viable ring.
* Fixes crash when NCCL is compiled with recent CUDA versions but
running on hosts with certain specific older CUDA drivers.
Added a pure GIN all2all example and a hybrid GIN/LSA one.
Fix operation ordering between the main thread and the proxy thread to prevent hangs at large scale. Fix Issue 1893 (NVIDIA/nccl#1893), a bug in GIN.
Subtree-merged upstream NVIDIA/nccl into projects/rccl/.
Upstream commit:
commit dbc86fd
Author: Mark Santesson <msantesson@nvidia.com>
Date: Mon Nov 10 17:57:32 2025 +0000
NCCL 2.28.9-1
Fix operation ordering between main thread and proxy thread to
prevent hangs at large scale.
Fix Issue 1893 (NVIDIA/nccl#1893), a
bug fix in GIN.
Brings RCCL up to NCCL 2.28.9-1 (previous baseline was 2.28.3-1). This
includes the CMake build-system prep, the root CMakeLists.txt and examples
additions, the GIN examples, and the operation-ordering fix from 2.28.9.