TOPO: resolve NVML fabric symbols lazily #1305
Open
janjust wants to merge 2 commits into
Collaborator
Author
@Sergei-Lebedev @ikryukov this is a Codex solution. I checked it with the tests above, before and after the PR. If you get a chance and have pointers on how to guide it toward a better fix, let me know.
Collaborator
Author
/build
Juee14Desai
reviewed
Apr 30, 2026
    }
#endif

    return handle ? handle : RTLD_DEFAULT;
Collaborator
If we fall through to RTLD_DEFAULT, it might be worth logging a ucc_debug() message so it's visible in debug output that we couldn't obtain a dedicated NVML handle.
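A minimal sketch of what this suggestion could look like, assuming a helper that tries `dlopen()` first and only then falls back to `RTLD_DEFAULT`. The names `sketch_debug` and `sketch_get_lib_handle` are hypothetical; `sketch_debug` is a stand-in for `ucc_debug()` so the snippet compiles outside the UCC tree, and the actual helper in this PR may be structured differently.

```c
#define _GNU_SOURCE /* glibc exposes RTLD_DEFAULT only with this defined */
#include <assert.h>
#include <dlfcn.h>
#include <stdio.h>

/* Defensive fallback: glibc defines RTLD_DEFAULT as a null handle, but only
 * if _GNU_SOURCE took effect before the first libc header was included. */
#ifndef RTLD_DEFAULT
#define RTLD_DEFAULT ((void *)0)
#endif

/* Hypothetical stand-in for ucc_debug(); prints one line to stderr. */
#define sketch_debug(...)                        \
    do {                                         \
        fprintf(stderr, "DEBUG: " __VA_ARGS__);  \
        fputc('\n', stderr);                     \
    } while (0)

/* Hypothetical helper: return a dedicated handle for soname, or log and
 * fall back to the global symbol scope when the library cannot be opened. */
static void *sketch_get_lib_handle(const char *soname)
{
    void *handle;

    assert(soname != NULL);

    handle = dlopen(soname, RTLD_LAZY | RTLD_LOCAL);
    if (handle == NULL) {
        const char *err = dlerror();

        sketch_debug("could not dlopen %s (%s), falling back to RTLD_DEFAULT",
                     soname, err ? err : "unknown error");
        return RTLD_DEFAULT;
    }
    return handle;
}
```

In the PR's context the argument would presumably be `libnvidia-ml.so.1`. Logging in the failure branch keeps the message tied to the `dlerror()` text, which is usually the most useful detail when debugging driver mismatches.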
Collaborator
Author
/build
Signed-off-by: Tomislav Janjusic <tomislavj@nvidia.com>
Signed-off-by: Tomislav Janjusic <tomislavj@nvidia.com>
642553f to 2189389
Collaborator
Author
/build
What
This PR avoids a hard runtime dependency on newer NVML fabric-info symbols in the CUDA sysinfo component.
Specifically, `nvmlDeviceGetGpuFabricInfoV` and the legacy `nvmlDeviceGetGpuFabricInfo` path are now resolved lazily with `dlsym()` before use.
Why?
HPC SDK users can build UCC/HPC-X with newer CUDA/NVML headers but run on systems with older NVIDIA drivers. In that case, `libucc_sysinfo_cuda.so` could fail to load with:
    undefined symbol: nvmlDeviceGetGpuFabricInfoV
That symbol is only available in newer `libnvidia-ml.so.1` versions, so directly referencing it raises the effective runtime driver requirement even when fabric metadata is optional.
How?
The CUDA sysinfo fabric query now uses a small runtime resolver for optional NVML fabric APIs. If the versioned fabric-info symbol is available, UCC uses it and preserves partition ID support. If it is unavailable, UCC falls back to the legacy fabric-info API when present. If neither symbol exists, fabric metadata remains unset and UCC continues without failing library load.
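The fallback order described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the `sketch_*` types and function are hypothetical, and the symbol names are passed in as strings so the snippet builds without NVML headers.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/* Which optional fabric-info API was found at runtime (hypothetical enum). */
typedef enum {
    SKETCH_FABRIC_API_NONE,   /* neither symbol exists: metadata stays unset */
    SKETCH_FABRIC_API_LEGACY, /* only the legacy symbol exists */
    SKETCH_FABRIC_API_V       /* versioned symbol exists */
} sketch_fabric_api_t;

typedef struct {
    sketch_fabric_api_t kind;
    void               *fn; /* raw address; cast to the real prototype before calling */
} sketch_fabric_resolver_t;

/* Look up the versioned symbol first, then the legacy one. Nothing here
 * references either symbol at link time, so a missing symbol cannot
 * prevent the enclosing library from loading. */
static sketch_fabric_resolver_t
sketch_resolve_fabric_api(void *handle, const char *versioned_name,
                          const char *legacy_name)
{
    sketch_fabric_resolver_t r = { SKETCH_FABRIC_API_NONE, NULL };

    assert(versioned_name != NULL && legacy_name != NULL);

    r.fn = dlsym(handle, versioned_name);
    if (r.fn != NULL) {
        r.kind = SKETCH_FABRIC_API_V;
        return r;
    }

    r.fn = dlsym(handle, legacy_name);
    if (r.fn != NULL) {
        r.kind = SKETCH_FABRIC_API_LEGACY;
    }
    return r;
}
```

In this PR's setting the two names would be `"nvmlDeviceGetGpuFabricInfoV"` and `"nvmlDeviceGetGpuFabricInfo"`. A caller would branch on `kind`, and on `SKETCH_FABRIC_API_NONE` simply skip the fabric query instead of treating it as an error.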
Testing
Tested before/after with CUDA 12.9.1 and NVML 555 at build time, then switched runtime NVML to R525.60.13.
Before this PR:
- `config.h` defined `HAVE_NVML_GPU_FABRIC_INFO_V`.
- `readelf -Ws libucc_sysinfo_cuda.so | grep nvmlDeviceGetGpuFabricInfo` showed an undefined relocation for `nvmlDeviceGetGpuFabricInfoV`.
- `ldd -r libucc_sysinfo_cuda.so` failed with:
      undefined symbol: nvmlDeviceGetGpuFabricInfoV
After this PR:
- `readelf` and `nm -D` show no dynamic dependency on `nvmlDeviceGetGpuFabricInfo*`.
- `ldd -r libucc_sysinfo_cuda.so` reports no undefined fabric-info symbols.
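The load-time failure checked above can also be reproduced programmatically. The sketch below, similar in spirit to the `ldd -r` check, opens a library with `RTLD_NOW` so that all undefined symbols must resolve before `dlopen()` returns. The function name `sketch_check_loadable` is hypothetical and not part of this PR.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stdio.h>

/* Returns 0 if the library loads with every symbol bound, -1 otherwise.
 * RTLD_NOW forces immediate binding, so an unresolved symbol makes
 * dlopen() fail here instead of at first call. */
static int sketch_check_loadable(const char *path)
{
    void *handle;

    assert(path != NULL);

    dlerror(); /* clear any stale error string */
    handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL) {
        const char *err = dlerror();

        fprintf(stderr, "load failed: %s\n", err ? err : "unknown error");
        return -1;
    }

    dlclose(handle);
    return 0;
}
```

Called on the path to `libucc_sysinfo_cuda.so` against an older driver, a check like this would be expected to fail before the PR and succeed after it, assuming the library's other dependencies are present.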