Add wheel support for Newton-Schulz method via cuSolverMp #3004
Greptile Summary
This PR enables the Newton-Schulz matrix orthogonalization feature (backed by cuSolverMp) in published PyPI wheels by installing cuSolverMp from the NVIDIA dnf repo inside the build containers, exporting
Confidence Score: 3/5
The build-container and build-script changes are sound, but two functional bugs remain open in setup.py and __init__.py that together make the core feature (Newton-Schulz via PyPI wheels) non-operational for end users. The unconditional cusolvermp install requirement in setup.py forces the library on every downstream TE user regardless of whether their wheel was compiled with the feature, and the mixed-case path lookup ('cusolverMp' vs the actual lowercase install path) means the runtime loader silently finds nothing, leaving Newton-Schulz calls with unresolved symbols on a standard PyPI installation. Both issues were raised in prior review rounds and are still present in the current code.
setup.py (unconditional install requirement) and transformer_engine/common/__init__.py (mixed-case library path lookup) both need corrections before the feature works end-to-end for wheel users.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["TE __init__.py import"] --> B{NVTE_PROJECT_BUILDING\nor NVTE_RELEASE_BUILD?}
B -- "yes (release build)" --> C["_is_cusolvermp_installed_in_system()\n(runs ldconfig -p)"]
B -- "no" --> Z["skip cuSolverMp loading"]
C -- "True\n(system lib found)" --> D["_CUSOLVERMP_LIB_CTYPES = None\n(rely on dynamic linker)"]
C -- "False\n(not in ldconfig cache)" --> E{"nvidia-cusolvermp-cu12\nor cu13 installed?"}
E -- "No" --> F["_CUSOLVERMP_LIB_CTYPES = None\n(Newton-Schulz unavailable)"]
E -- "Yes" --> G["_load_cuda_library_from_python\n('cusolverMp')"]
G --> H{"Path lookup in\nnvidia/cusolverMp/\n(mixed case)"}
H -- "Found" --> I["ctypes.CDLL(so, RTLD_GLOBAL)\nNewton-Schulz available"]
H -- "Not found\n(strict=False)" --> J["_CUSOLVERMP_LIB_CTYPES = []\nsilent failure"]
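The decision flow in the chart can be modeled as a small pure function, which makes the silent-failure branch easy to see and to test. This is only a sketch: `Outcome` and `resolve_cusolvermp` are hypothetical names, and the four booleans stand in for the checks named in the chart (the build flags, the `ldconfig -p` lookup, the wheel check, and the path lookup). It is not the actual Transformer Engine loader.

```python
from enum import Enum


class Outcome(Enum):
    """Terminal nodes of the flowchart (hypothetical labels)."""
    SKIPPED = "skip cuSolverMp loading"
    SYSTEM_LINKER = "rely on dynamic linker"
    UNAVAILABLE = "Newton-Schulz unavailable"
    LOADED = "Newton-Schulz available"
    SILENT_FAILURE = "empty handle list, silent failure"


def resolve_cusolvermp(release_build, in_ldconfig_cache, wheel_installed, lib_path_found):
    """Walk the flowchart top to bottom and return the node that is reached.

    All arguments are plain booleans standing in for the real checks;
    this function does not touch the filesystem or load any library.
    """
    if not release_build:
        return Outcome.SKIPPED
    if in_ldconfig_cache:
        return Outcome.SYSTEM_LINKER
    if not wheel_installed:
        return Outcome.UNAVAILABLE
    if lib_path_found:
        return Outcome.LOADED
    # Wheel is installed but the path lookup missed: nothing is loaded
    # and, with strict=False, nothing is reported either.
    return Outcome.SILENT_FAILURE
```

In this model the case the review flags is `resolve_cusolvermp(True, False, True, False)`: the wheel is present, yet the lookup fails and the caller is never told.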
Reviews (11): Last reviewed commit: "Merge branch 'main' into expand_wheel_bu..."
SITE_PACKAGES=$(/opt/python/cp310-cp310/bin/python -c "import sysconfig; print(sysconfig.get_paths()['purelib'])")
export CUBLASMP_HOME="${SITE_PACKAGES}/nvidia/cublasmp/cu${CUDA_MAJOR}"
export CUSOLVERMP_HOME="${SITE_PACKAGES}/nvidia/cu${CUDA_MAJOR}"
Likely incorrect CUSOLVERMP_HOME path
The path ${SITE_PACKAGES}/nvidia/cu${CUDA_MAJOR} is missing the package-name segment. Every other NVIDIA Python package follows the layout site-packages/nvidia/<package-name>/cu<ver>/. For example, nvidia-cublasmp-cu12 installs under nvidia/cublasmp/cu12/, so nvidia-cusolvermp-cu12 should install under nvidia/cusolvermp/cu12/. With the current path, the .so symlink loop silently skips cuSolverMp's lib/ directory ([ -d "$lib_dir" ] || continue), no unversioned .so stubs are created, and the linker will not find cuSolverMp at build time even though NVTE_WITH_CUSOLVERMP=1 is exported.
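To illustrate the point about the missing segment, here is a minimal sketch that builds the layout the comment describes in a temporary directory and checks both candidate paths. It assumes the site-packages/nvidia/<package-name>/cu<ver>/lib layout stated above; `nvidia_lib_dir` and `demo` are hypothetical helpers, and the tree is a stand-in, not an actual wheel install.

```python
import tempfile
from pathlib import Path


def nvidia_lib_dir(site_packages, package, cuda_major):
    """Return <site-packages>/nvidia/<package>/cu<major>/lib (assumed layout)."""
    return Path(site_packages) / "nvidia" / package / f"cu{cuda_major}" / "lib"


def demo():
    """Create a fake install tree and test the correct and the truncated path."""
    with tempfile.TemporaryDirectory() as tmp:
        # Path with the package-name segment, as the comment expects it.
        with_segment = nvidia_lib_dir(tmp, "cusolvermp", 12)
        with_segment.mkdir(parents=True)
        # Path as currently exported: no package-name segment.
        without_segment = Path(tmp) / "nvidia" / "cu12" / "lib"
        return with_segment.is_dir(), without_segment.is_dir()
```

Under that assumption, a guard like [ -d "$lib_dir" ] || continue passes only for the first path, which is why the truncated CUSOLVERMP_HOME would cause the symlink loop to skip the directory without any error.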
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
522c631 to df140b3
On second thought, we need the nvidia-cusolvermp-cu12/cu13 dependency at runtime, not just when building TE/Common.
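One possible way to express a runtime dependency without forcing it on every user is to derive the requirement from the build configuration. The sketch below is an assumption, not the PR's setup.py: `cusolvermp_requirements` is a hypothetical helper, and the only names taken from this thread are the nvidia-cusolvermp-cu12/cu13 package names.

```python
def cusolvermp_requirements(with_cusolvermp, cuda_major):
    """Return the runtime requirement list for a given build configuration.

    with_cusolvermp: whether the wheel was compiled with the feature
                     (hypothetically derived from NVTE_WITH_CUSOLVERMP).
    cuda_major:      CUDA major version the wheel targets (12 or 13).
    """
    if not with_cusolvermp:
        # Wheels built without the feature pull in nothing extra.
        return []
    if cuda_major not in (12, 13):
        raise ValueError(f"unsupported CUDA major version: {cuda_major}")
    return [f"nvidia-cusolvermp-cu{cuda_major}"]
```

The result could be appended to the install requirements at build time, so a wheel compiled with the feature always declares the library it needs at runtime.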
Signed-off-by: ksivamani <ksivamani@nvidia.com>
/te-ci
/te-ci
Signed-off-by: ksivamani <ksivamani@nvidia.com>
/te-ci
Description
#2706 added a distributed Newton-Schulz matrix orthogonalization API via cuSolverMp; this PR brings support for it to the published wheels.
Type of change
Changes
NVTE_WITH_CUSOLVERMP TE build via PyPI wheel.
Checklist: