Nccl sync v2 28 9 clean by thomas-huber · Pull Request #5934 · ROCm/rocm-systems

thomas-huber · 2026-05-08T21:33:07Z

Motivation

Technical Details

JIRA ID

Test Plan

Test Result

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

The NCCL examples directory provides users and developers with practical code samples that highlight NCCL’s core features. It covers basic operations like communicator initialization, point-to-point communication, and collective operations, as well as advanced features such as User Buffer (UB), symmetric memory, and the device API.

GPU-Initiated Networking (GIN):
* Provides device-side API for integrating GPU-Initiated Networking capability into application kernels.
* New transport layer called DOCA GPUNetIO.
* New ncclGin construct to create, destroy and manipulate GIN contexts.
* New ncclGinBarrierSession to provide synchronization functionality.
* New put, signal, counter operations for data movement and signaling.
* GIN API signatures and functionalities are subject to change.
* GIN Support Requirements
  * CUDA 12.2 or later when compiling the GPU code
  * NVIDIA GPUs: Volta or newer. NVIDIA GPU drivers >= 510.40.3
  * NVIDIA NICs: CX4 or newer. rdma-core >= 44.0
  * Requires nvidia-peermem or DMABUF support. When using DMABUF, linux kernel >= 6.1 is required.

New ncclCommRevoke API for fault tolerance:
* Introduces ncclCommRevoke to quiesce ongoing NCCL work on a communicator without freeing resources.
* This answers the need for a lightweight way to cancel in-flight collectives and bring a communicator to a safe state before split/shrink/finalize/destroy.
* Includes optional cross-rank coordination (global barrier) and supports blocking/non-blocking usage.

New NCCL Environment Plugin:
* The env plugin allows users to set NCCL environment variables, for example, after loading them from a centralized database.
* The NCCL_ENV_PLUGIN variable can be used to let NCCL load an external environment plugin.

New NCCL Examples on GitHub:
* The NCCL examples directory provides users and developers with practical code samples that highlight NCCL’s core features.
* It covers basic operations like communicator initialization, point-to-point communication, and collective operations, as well as advanced features such as user buffer registration, symmetric memory, and the device API.

Device API improvements:
* Adds ncclFindWindow API.
* Adds new ncclBarrierSession to provide hybrid synchronization functionality.
* Makes multimem available with as few as two ranks.
* Removes distance (NCCL_P2P_LEVEL) considerations from determining the availability of symmetric memory.

Enhanced NCCL RAS output:
* Extends RAS subsystem with JSON format to support machine-parsable metrics collection.
* Enables structured data export for monitoring tools, dashboards, and automated analysis systems.

Github Pull Requests resolved:
* Fast Init - CPU Optimizations for NCCL Initialization Large Scale. (PR #1789)
* Fast Init - Improve Bootstrap AllGather by 2x at large scale by sending bootstrap information bidirectionally. (PR #1791)
* Fixes spurious failures when PyTorch is statically linked with NCCL-2.28.3 because error is not drained, but rather gets propagated into the next CUDA kernel invocation. (PR #1864)

Other notable improvements:
* Fixes multicast object leaks in case of failed NVLS user buffer registrations, which could lead to crashes. Avoids such registration attempts in case of the use of incompatible memory allocators.
* Fixes potential data corruption with built-in symmetric kernels for small messages with size granularity under 8 bytes or when multiple symmetric operations were aggregated in a group.
* Generalizes the existing point-to-point scheduling to the case of un-even GPU count per node.
* Fixes a crash when network plugin assignment fails.
* Fixes a large performance issue with NCCL_CROSS_NIC=0 and certain split mask settings, where NCCL cannot find a viable ring.
* Fixes crash when NCCL is compiled with recent CUDA versions but running on hosts with certain specific older CUDA drivers.
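
For reviewers who want intuition for the "Bootstrap AllGather by 2x" item (PR #1791), the following is a toy model only, not NCCL's bootstrap code. It assumes a plain ring in which each rank forwards one item per step, and counts the steps until every rank holds every item, with and without a second, opposite direction. The function name `ring_allgather_steps` and the ring model itself are illustrative assumptions; the actual implementation in PR #1791 may differ.

```python
def ring_allgather_steps(n, bidirectional):
    """Count steps for a toy ring all-gather among n ranks.

    Hypothetical model: in each step every rank forwards the item it most
    recently received to its right neighbour and, if bidirectional is True,
    also forwards one item to its left neighbour.
    """
    have = [{r} for r in range(n)]   # items currently held by each rank
    fwd = list(range(n))             # item each rank will send to rank+1 next
    bwd = list(range(n))             # item each rank will send to rank-1 next
    steps = 0
    while any(len(h) < n for h in have):
        # Rank r receives from its left neighbour (r - 1).
        fwd = [fwd[(r - 1) % n] for r in range(n)]
        for r in range(n):
            have[r].add(fwd[r])
        if bidirectional:
            # Rank r also receives from its right neighbour (r + 1).
            bwd = [bwd[(r + 1) % n] for r in range(n)]
            for r in range(n):
                have[r].add(bwd[r])
        steps += 1
    return steps


if __name__ == "__main__":
    for n in (8, 64, 1024):
        print(n, ring_allgather_steps(n, False), ring_allgather_steps(n, True))
```

In this model the one-directional ring needs n - 1 steps and the bidirectional ring needs about half as many, which matches the roughly 2x figure quoted in the release notes.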

Subtree-merged upstream NVIDIA/nccl into projects/rccl/.

Upstream commit:

    commit dbc86fd
    Author: Mark Santesson <msantesson@nvidia.com>
    Date: Mon Nov 10 17:57:32 2025 +0000

    NCCL 2.28.9-1

    Fix operation ordering between main thread and proxy thread to prevent hangs at large scale. Fix Issue 1893 (NVIDIA/nccl#1893), a bug fix in GIN.

Brings RCCL up to NCCL 2.28.9-1 (previous baseline was 2.28.3-1). This includes the cmake build-system prep, root CMakeLists.txt and examples additions, the GIN examples, and the operation-ordering fix in 2.28.9.

marksantesson and others added 7 commits September 24, 2025 13:00

Remove the github actions to auto-close older issues

834ef72

Added GIN examples

b17addf

Added a pure GIN all2all example and a Hybrid GIN/LSA one

Add Contribution Guide to GitHub

dd8446f

NCCL 2.28.9-1

dbc86fd

Fix operation ordering between main thread and proxy thread to prevent hangs at large scale. Fix Issue 1893 (NVIDIA/nccl#1893), a bug fix in GIN.

thomas-huber requested a review from a team as a code owner May 8, 2026 21:33

github-actions Bot added the project: rccl label May 8, 2026

systems-assistant Bot added the organization: ROCm label May 8, 2026

thomas-huber wants to merge 7 commits into develop from nccl-sync-v2-28-9-clean

3 participants