TL/CUDA: fix NVLS on fabric cluster UUID #1308

Open
ikryukov wants to merge 1 commit into openucx:master from ikryukov:fix/nvls-cluster-uuid-topo-check

ikryukov commented May 4, 2026

What

Fix a late multinode NVLS crash on standalone DGX nodes that silently pass the single-NVLink-domain check.

Why

NVML cliqueId/partitionId are unique only within a fabric cluster, so unrelated DGX nodes can collide on these values, falsely pass ucc_topo_is_single_nvlink_domain, and crash later in cuMulticastAddDevice.

How

Read clusterUuid (globally unique) from nvmlGpuFabricInfo_v2_t into ucc_gpu_info_t.fabric_cluster_uuid and require it to be non-all-zero and matching across ranks.
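The check described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `gpu_fabric_id_t`, `uuid_is_zero` and `is_single_fabric_domain` are hypothetical names, the 16-byte UUID length is assumed to match NVML's `NVML_GPU_FABRIC_UUID_LEN`, and the real `ucc_gpu_info_t` carries more fields than shown here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumed to match NVML_GPU_FABRIC_UUID_LEN; redefined so the sketch
 * compiles without nvml.h. */
#define FABRIC_UUID_LEN 16

/* Hypothetical per-rank record; stands in for the fabric-related
 * fields of ucc_gpu_info_t. */
typedef struct gpu_fabric_id {
    unsigned char cluster_uuid[FABRIC_UUID_LEN]; /* globally unique */
    unsigned int  clique_id;                     /* unique only within a cluster */
} gpu_fabric_id_t;

/* Returns true when every byte of the UUID is zero, i.e. NVML did not
 * report a real fabric cluster. */
static bool uuid_is_zero(const unsigned char *uuid)
{
    for (size_t i = 0; i < FABRIC_UUID_LEN; i++) {
        if (uuid[i] != 0) {
            return false;
        }
    }
    return true;
}

/* Approves a single NVLink domain only if the first rank has a
 * non-zero cluster UUID and every other rank reports the same UUID
 * and the same clique ID. */
static bool is_single_fabric_domain(const gpu_fabric_id_t *ranks,
                                    size_t n_ranks)
{
    if (n_ranks == 0 || uuid_is_zero(ranks[0].cluster_uuid)) {
        return false;
    }
    for (size_t i = 1; i < n_ranks; i++) {
        if (memcmp(ranks[i].cluster_uuid, ranks[0].cluster_uuid,
                   FABRIC_UUID_LEN) != 0 ||
            ranks[i].clique_id != ranks[0].clique_id) {
            return false;
        }
    }
    return true;
}
```

Comparing the UUID first means two standalone nodes that happen to share a clique ID are rejected either because their UUIDs are all zeros or because the UUIDs differ.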

NVML cliqueId/partitionId are unique only within a fabric cluster, so
unrelated standalone DGX nodes on different fabrics can report the same
cliqueId/partitionId and silently pass ucc_topo_is_single_nvlink_domain.
Multinode NVLS init then crashes late at cuMulticastAddDevice and wedges
team creation.

Read clusterUuid (nvmlGpuFabricInfo_v2_t, the globally-unique fabric
identifier) into ucc_gpu_info_t.fabric_cluster_uuid and require it to be
non-all-zero and identical across ranks before approving a single NVLink
domain. Verified on gaia DGX H100 where NVML reports all-zero clusterUuid
with state=COMPLETED: the topo guard now correctly disables multinode
NVLS instead of failing late.

Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>
@ikryukov ikryukov self-assigned this May 4, 2026
@ikryukov ikryukov requested review from Sergei-Lebedev, janjust and nsarka and removed request for Sergei-Lebedev May 4, 2026 12:49
ikryukov commented May 4, 2026

/build
