TOPO: resolve NVML fabric symbols lazily #1305

Open
janjust wants to merge 2 commits into
openucx:master from
janjust:master-fix-cuda12-topo-dependency

@janjust (Collaborator) commented Apr 30, 2026

What

This PR avoids a hard runtime dependency on newer NVML fabric-info symbols in the CUDA sysinfo component.

Specifically, nvmlDeviceGetGpuFabricInfoV and the legacy nvmlDeviceGetGpuFabricInfo path are now resolved lazily with dlsym() before use.

Why?

HPC SDK users can build UCC/HPC-X with newer CUDA/NVML headers but run on systems with older NVIDIA drivers. In that case, libucc_sysinfo_cuda.so could fail to load with:

undefined symbol: nvmlDeviceGetGpuFabricInfoV

That symbol is only available in newer libnvidia-ml.so.1 versions, so directly referencing it raises the effective runtime driver requirement even when fabric metadata is optional.

How?

The CUDA sysinfo fabric query now uses a small runtime resolver for optional NVML fabric APIs. If the versioned fabric-info symbol is available, UCC uses it and preserves partition ID support. If it is unavailable, UCC falls back to the legacy fabric-info API when present. If neither symbol exists, fabric metadata remains unset and UCC continues without failing library load.
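
The fallback order described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: resolve_first_symbol and resolve_fabric_api are hypothetical helper names, the handle is assumed to come from dlopen() or an equivalent lookup scope, and only the two NVML symbol names are taken from the description. The real prototypes live in nvml.h and differ between the versioned and legacy calls, so the sketch returns an untyped address.

```c
#include <dlfcn.h>
#include <stddef.h>

/* Symbol names from the PR description, newest first. */
static const char *const fabric_symbols[] = {
    "nvmlDeviceGetGpuFabricInfoV", /* versioned API, carries partition ID */
    "nvmlDeviceGetGpuFabricInfo"   /* legacy API */
};

/*
 * Hypothetical helper: return the index of the first name that dlsym()
 * finds through `handle`, or -1 if none is exported. *sym_out receives
 * the resolved address, or NULL when nothing matched.
 */
static int resolve_first_symbol(void *handle, const char *const *names,
                                size_t count, void **sym_out)
{
    size_t i;

    *sym_out = NULL;
    for (i = 0; i < count; i++) {
        void *sym = dlsym(handle, names[i]);
        if (sym != NULL) {
            *sym_out = sym;
            return (int)i;
        }
    }
    return -1;
}

/*
 * Hypothetical wrapper for the fabric query:
 *   0  -> versioned API available
 *   1  -> only the legacy API available
 *  -1  -> neither present, so fabric metadata stays unset
 */
static int resolve_fabric_api(void *nvml_handle, void **sym_out)
{
    return resolve_first_symbol(nvml_handle, fabric_symbols,
                                sizeof(fabric_symbols) /
                                    sizeof(fabric_symbols[0]),
                                sym_out);
}
```

Because the symbol names appear only as strings, the shared object carries no undefined reference to them, which is the property the readelf and ldd -r checks in the Testing section look for. A caller would cast the returned address to the matching nvml.h function pointer type before invoking it.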

Testing

Tested before/after with CUDA 12.9.1 and NVML 555 at build time, then switched runtime NVML to R525.60.13.

Before this PR:

  • config.h defined HAVE_NVML_GPU_FABRIC_INFO_V.
  • readelf -Ws libucc_sysinfo_cuda.so | grep nvmlDeviceGetGpuFabricInfo showed an undefined relocation for nvmlDeviceGetGpuFabricInfoV.
  • With R525 NVML loaded, ldd -r libucc_sysinfo_cuda.so failed with:
    undefined symbol: nvmlDeviceGetGpuFabricInfoV

After this PR:

  • readelf and nm -D show no dynamic dependency on nvmlDeviceGetGpuFabricInfo*.
  • With R525 NVML loaded, ldd -r libucc_sysinfo_cuda.so reports no undefined fabric-info symbols.

@janjust (Collaborator, Author) commented Apr 30, 2026

@Sergei-Lebedev @ikryukov this is a Codex-generated solution. I checked it with the tests above, before and after the PR. If you get a chance and have pointers on how to guide it toward a better fix, let me know.

@janjust janjust requested review from Juee14Desai and nsarka April 30, 2026 21:17
@janjust (Collaborator, Author) commented Apr 30, 2026

/build

}
#endif

return handle ? handle : RTLD_DEFAULT;

If we fall through to RTLD_DEFAULT, it might be worth logging a ucc_debug() message so it's visible in debug output that we couldn't obtain a dedicated NVML handle.
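
One way the suggestion could look is sketched below. It makes several assumptions: the PR's handle-lookup code is not shown in this conversation, so the dlopen() call, the nvml_lookup_handle name, and the used_default out-parameter are all hypothetical, and sketch_debug is a stand-in for ucc_debug() so that the fragment compiles outside the UCC tree.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* glibc exposes RTLD_DEFAULT only with this defined */
#endif
#include <dlfcn.h>
#include <stdio.h>

/* Stand-in for ucc_debug(); used only so the sketch is self-contained. */
#define sketch_debug(...) fprintf(stderr, __VA_ARGS__)

/*
 * Hypothetical helper: try to open a dedicated handle for `libname`.
 * On failure, log the reason and fall back to RTLD_DEFAULT so that a
 * later dlsym() searches the global symbol scope instead.
 * *used_default is set to 1 when the fallback was taken, 0 otherwise.
 */
static void *nvml_lookup_handle(const char *libname, int *used_default)
{
    void *handle = dlopen(libname, RTLD_LAZY | RTLD_LOCAL);

    *used_default = (handle == NULL);
    if (handle == NULL) {
        const char *err = dlerror();
        sketch_debug("could not open %s (%s), using RTLD_DEFAULT\n",
                     libname, err != NULL ? err : "unknown error");
        return RTLD_DEFAULT;
    }
    return handle;
}
```

Logging at the point where the fallback is chosen keeps the message next to the decision, so debug output shows which lookup scope the later dlsym() calls will use.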

@janjust (Collaborator, Author) commented May 6, 2026

/build

@janjust janjust requested a review from Juee14Desai May 6, 2026 17:07
janjust added 2 commits May 6, 2026 12:15
Signed-off-by: Tomislav Janjusic <tomislavj@nvidia.com>
Signed-off-by: Tomislav Janjusic <tomislavj@nvidia.com>
@janjust janjust force-pushed the master-fix-cuda12-topo-dependency branch from 642553f to 2189389 Compare May 6, 2026 17:15
@janjust (Collaborator, Author) commented May 6, 2026

/build
