[FEA]: Add support for pathfinder.find_bitcode_lib("nccl_device")#1975
[FEA]: Add support for pathfinder.find_bitcode_lib("nccl_device")#1975rwgk merged 4 commits intoNVIDIA:mainfrom
Conversation
Signed-off-by: Xiakun Lu <xiakunl@nvidia.com>
This comment has been minimized.
This comment has been minimized.
|
lgtm, but I defer to @rwgk for the final approval. |
|
Q: is this for #1831? |
rwgk
left a comment
There was a problem hiding this comment.
Thanks! My main concern is nccl vs nccl_device, it'd be inconsistent as-is.
The other suggestion is minor, but also very easy.
As a sanity check I looked at the CI test logs. It looks good, e.g.:
INFO test_locate_bitcode_lib[nccl]: lib_path='/opt/hostedtoolcache/Python/3.10.20/x64/lib/python3.10/site-packages/nvidia/nccl/lib/libnccl_device.bc'
| "site_packages_dirs": ("nvidia/nvshmem/lib",), | ||
| "available_on_windows": False, | ||
| }, | ||
| "nccl": { |
There was a problem hiding this comment.
Could you please make this nccl_device to match the existing nvshmem_device convention, and move the new entry before nvshmem_device to keep this list in lexicographic order?
There was a problem hiding this comment.
I still see nccl, could you please make it nccl_device for internal consistency?
There was a problem hiding this comment.
In list DESCRIPTOR_CATALOG of dynamic libs, the name of nvshmem is "nvshmem_host" and the name of nccl is "nccl", which reflect the .so filenames libnvshmem_host.so.3 and libnccl.so.2. nccl as a key in _SUPPORTED_BITCODE_LIBS_INFO doesn't match nvshmem, but it's consistent with nccl's name in dynamic lib list DESCRIPTOR_CATALOG. I'd like to keep the key simple for nccl, so I don't have to think about it should end with _device or _host when I call these pathfinder APIs.
There was a problem hiding this comment.
so I don't have to think
It's gonna be surprising/confusing for anyone else, especially because the filename is libnccl_device.bc. Unless there is more than a personal preference (?), nccl_device is clearly the more consistent name here.
There was a problem hiding this comment.
matching filename libnccl_device.bc is a good point.
|
|
||
|
|
||
| def _make_bitcode_lib_file(dir_path: Path) -> str: | ||
| def _make_bitcode_lib_file(dir_path: Path, libname: str = "device") -> str: |
There was a problem hiding this comment.
It's safer and more readable to avoid the defaults here and below. I tried out the change below. Could you please apply?
@@ -36,14 +36,14 @@ def clear_find_bitcode_lib_cache():
get_cuda_path_or_home.cache_clear()
-def _make_bitcode_lib_file(dir_path: Path, libname: str = "device") -> str:
+def _make_bitcode_lib_file(dir_path: Path, libname: str) -> str:
dir_path.mkdir(parents=True, exist_ok=True)
file_path = dir_path / _bitcode_lib_filename(libname)
file_path.touch()
return str(file_path)
-def _bitcode_lib_dir_under(anchor_dir: Path, libname: str = "device") -> Path:
+def _bitcode_lib_dir_under(anchor_dir: Path, libname: str) -> Path:
return anchor_dir / _bitcode_lib_info(libname)["rel_path"]
@@ -138,7 +138,7 @@ def test_locate_bitcode_lib_search_order(monkeypatch, tmp_path, libname):
@pytest.mark.usefixtures("clear_find_bitcode_lib_cache")
def test_find_bitcode_lib_not_found_error_includes_cuda_home_directory_listing(monkeypatch, tmp_path):
cuda_home = tmp_path / "cuda-home"
- lib_dir = _bitcode_lib_dir_under(cuda_home)
+ lib_dir = _bitcode_lib_dir_under(cuda_home, "device")
lib_dir.mkdir(parents=True, exist_ok=True)
extra_file = lib_dir / "README.txt"
extra_file.write_text("placeholder", encoding="utf-8")…ode_lib_dir_under Signed-off-by: Xiakun Lu <xiakunl@nvidia.com>
Signed-off-by: Xiakun Lu <xiakunl@nvidia.com>
Signed-off-by: Xiakun Lu <xiakunl@nvidia.com>
rwgk
left a comment
There was a problem hiding this comment.
Thanks!
Will it be helpful if this is released soon? (It's easy, <30 minutes effort.)
definitely helpful! would appreciate it. |
This comment has been minimized.
This comment has been minimized.
1 similar comment
|
Closes #1831
Description
Adds support for
cuda.pathfinder.find_bitcode_lib("nccl")The new bitcode descriptor resolves NCCL device bitcode as
libnccl_device.bcundernvidia/nccl/lib, matching the NCCL wheel package layout. The API name isnccl, consistent with the existing pathfinder naming used for NCCL dynamic library discovery.The bitcode search-order test now covers all entries in
SUPPORTED_BITCODE_LIBS, includingnccl, while accounting for descriptors that define multiple site-packages candidate directories.Checklist