Add wheel support for Newton-Schulz method via cuSolverMp #3004

Merged
ksivaman merged 13 commits into NVIDIA:main from ksivaman:expand_wheel_builds
Jun 10, 2026

Conversation

@ksivaman ksivaman commented May 17, 2026

Member

Description

#2706 added a distributed Newton-Schulz matrix orthogonalization API via cuSolverMp; this PR brings the same support to the published wheels.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Enable NVTE_WITH_CUSOLVERMP TE build via PyPI wheel.

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@ksivaman ksivaman requested review from cyanguwa, denera and mk-61 May 17, 2026 01:37
@ksivaman ksivaman marked this pull request as draft May 17, 2026 01:38
greptile-apps Bot commented May 17, 2026

Contributor

Greptile Summary

This PR enables the Newton-Schulz matrix orthogonalization feature (backed by cuSolverMp) in published PyPI wheels by installing cuSolverMp from the NVIDIA dnf repo inside the build containers, exporting NVTE_WITH_CUSOLVERMP=1 in the wheel build script, and adding a PyPI-package runtime loader in __init__.py.

  • Dockerfiles (x86 + aarch): libcusolvermp0-cuda-${CUDA_MAJOR} and its devel package are installed via dnf; symlinks are created under /opt/nvidia/cusolvermp; ldconfig is called in the same RUN layer; CUSOLVERMP_HOME and an updated LD_LIBRARY_PATH are exported — build-time linking should work correctly.
  • build_wheels.sh: Adds export NVTE_WITH_CUSOLVERMP=1 before all build stages so the common library, PyTorch, and JAX extensions all pick up the flag.
  • setup.py / __init__.py: Introduces cusolvermp_pypi_package_name() for runtime dependency declaration and a conditional ctypes loader guarded by an ldconfig-cache check. Two previously noted issues remain open: the unconditional addition of nvidia-cusolvermp-cuXX to install_reqs and a mixed-case path lookup ("cusolverMp" vs the actual lowercase nvidia/cusolvermp/ directory) that would silently prevent the library from loading at runtime for PyPI-only users.
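
The first open issue above (the unconditional install requirement) could be addressed by gating the dependency on the build flag. The following is only a minimal sketch, not the code in this PR: `cusolvermp_requirement` and `runtime_requirements` are hypothetical names, and treating any value other than empty or `0` as "enabled" is an assumption about how `NVTE_WITH_CUSOLVERMP` is interpreted.

```python
import os
from typing import List, Mapping, Optional


def cusolvermp_requirement(cuda_major: int) -> str:
    """Hypothetical stand-in for a helper that names the cuSolverMp wheel."""
    return f"nvidia-cusolvermp-cu{cuda_major}"


def runtime_requirements(cuda_major: int,
                         env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Build the list of runtime requirements for a wheel.

    The cuSolverMp wheel is only required when the build flag is set.
    Assumption: any value other than "" or "0" counts as enabled.
    """
    env = os.environ if env is None else env
    reqs: List[str] = []
    if env.get("NVTE_WITH_CUSOLVERMP", "0") not in ("", "0"):
        reqs.append(cusolvermp_requirement(cuda_major))
    return reqs
```

With a guard like this, a wheel built without the flag would not pull in `nvidia-cusolvermp-cuXX`, while a wheel built with it would still declare the runtime dependency.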

Confidence Score: 3/5

The build-container and build-script changes are sound, but two functional bugs remain open in setup.py and __init__.py that together make the core feature (Newton-Schulz via PyPI wheels) non-operational for end users.

The unconditional cusolvermp install requirement in setup.py forces the library on every downstream TE user regardless of whether their wheel was compiled with the feature, and the mixed-case path lookup ('cusolverMp' vs the actual lowercase install path) means the runtime loader silently finds nothing — leaving Newton-Schulz calls with unresolved symbols on a standard PyPI installation. Both issues were raised in prior review rounds and are still present in the current code.

setup.py (unconditional install requirement) and transformer_engine/common/__init__.py (mixed-case library path lookup) both need corrections before the feature works end-to-end for wheel users.
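
To illustrate why a mixed-case name can miss a lowercase install directory, here is a small self-contained sketch. It does not reproduce TE's loader; `find_package_dir` and `demo` are hypothetical helpers, and the directory tree is a throwaway one shaped like the `nvidia/cusolvermp/` layout described above.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_package_dir(nvidia_root: Path, name: str,
                     ignore_case: bool = False) -> Optional[Path]:
    """Return the child directory of nvidia_root whose name matches `name`, or None."""
    if not nvidia_root.is_dir():
        return None
    wanted = name.lower() if ignore_case else name
    for child in nvidia_root.iterdir():
        seen = child.name.lower() if ignore_case else child.name
        if child.is_dir() and seen == wanted:
            return child
    return None


def demo() -> tuple:
    """Create a temporary nvidia/cusolvermp/cu12/lib tree and query it twice."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "nvidia"
        (root / "cusolvermp" / "cu12" / "lib").mkdir(parents=True)
        exact = find_package_dir(root, "cusolverMp")
        relaxed = find_package_dir(root, "cusolverMp", ignore_case=True)
        return exact is not None, relaxed.name if relaxed else None
```

Here `demo()` returns `(False, "cusolvermp")`: the exact-name lookup finds nothing, while the case-insensitive lookup resolves to the lowercase directory. Passing the lowercase name to the loader would be the simpler fix; the case-insensitive match is shown only to make the mismatch visible.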

Important Files Changed

| Filename | Overview |
| --- | --- |
| build_tools/utils.py | Adds cusolvermp_pypi_package_name() helper and cleans up PackageNotFoundError import; distribution is imported but unused. |
| build_tools/wheel_utils/Dockerfile.x86 | Installs cuSolverMp system packages via dnf, creates symlinks under /opt/nvidia/cusolvermp, runs ldconfig in the same layer, and exports CUSOLVERMP_HOME. Build-time setup looks correct. |
| build_tools/wheel_utils/Dockerfile.aarch | Mirrors Dockerfile.x86 changes for aarch64; same cuSolverMp system install, symlink, and ldconfig setup. |
| build_tools/wheel_utils/build_wheels.sh | Exports NVTE_WITH_CUSOLVERMP=1 before all build stages; CUSOLVERMP_HOME is inherited from the Docker ENV. Change is minimal and correct for the declared scope. |
| setup.py | Adds cusolvermp_pypi_package_name() unconditionally to install_reqs with no guard on NVTE_WITH_CUSOLVERMP, making nvidia-cusolvermp-cuXX a mandatory runtime dependency for all TE wheel users regardless of build configuration. |
| transformer_engine/common/__init__.py | Adds _is_cusolvermp_installed_in_system() check and conditional PyPI load path; the load call uses 'cusolverMp' (mixed case) which does not match the lowercase nvidia/cusolvermp/ install directory, causing silent runtime failure for PyPI-installed users. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["TE __init__.py import"] --> B{NVTE_PROJECT_BUILDING\nor NVTE_RELEASE_BUILD?}
    B -- "yes (release build)" --> C["_is_cusolvermp_installed_in_system()\n(runs ldconfig -p)"]
    B -- "no" --> Z["skip cuSolverMp loading"]
    C -- "True\n(system lib found)" --> D["_CUSOLVERMP_LIB_CTYPES = None\n(rely on dynamic linker)"]
    C -- "False\n(not in ldconfig cache)" --> E{"nvidia-cusolvermp-cu12\nor cu13 installed?"}
    E -- "No" --> F["_CUSOLVERMP_LIB_CTYPES = None\n(Newton-Schulz unavailable)"]
    E -- "Yes" --> G["_load_cuda_library_from_python\n('cusolverMp')"]
    G --> H{"Path lookup in\nnvidia/cusolverMp/\n(mixed case)"}
    H -- "Found" --> I["ctypes.CDLL(so, RTLD_GLOBAL)\nNewton-Schulz available"]
    H -- "Not found\n(strict=False)" --> J["_CUSOLVERMP_LIB_CTYPES = []\nsilent failure"]
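
The inner branches of the flowchart can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code in `__init__.py`: the function names are hypothetical, the shared-library file name `libcusolverMp.so` is an assumption, and the outer release-build check is left out.

```python
import subprocess


def ldconfig_lists(library_stem: str, ldconfig_output: str) -> bool:
    """True if any `ldconfig -p` entry starts with lib<stem>.so (case-sensitive)."""
    prefix = f"lib{library_stem}.so"
    return any(line.strip().startswith(prefix)
               for line in ldconfig_output.splitlines())


def read_ldconfig_cache() -> str:
    """Return the output of `ldconfig -p`, or an empty string if it cannot be run."""
    try:
        result = subprocess.run(["ldconfig", "-p"], capture_output=True,
                                text=True, check=False)
    except OSError:
        return ""
    return result.stdout


def choose_load_path(in_system: bool, pypi_installed: bool) -> str:
    """Mirror the flowchart: system library first, then the PyPI wheel, else nothing."""
    if in_system:
        return "dynamic-linker"          # rely on the system-wide library
    if not pypi_installed:
        return "unavailable"             # Newton-Schulz cannot be used
    return "ctypes-from-site-packages"   # load the wheel's .so explicitly
```

A caller would combine these as `choose_load_path(ldconfig_lists("cusolverMp", read_ldconfig_cache()), pypi_installed)`. Keeping the parsing separate from the subprocess call makes the check testable on machines without `ldconfig`.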

Reviews (11): Last reviewed commit: "Merge branch 'main' into expand_wheel_bu..." | Re-trigger Greptile

Comment thread build_tools/wheel_utils/build_wheels.sh Outdated

SITE_PACKAGES=$(/opt/python/cp310-cp310/bin/python -c "import sysconfig; print(sysconfig.get_paths()['purelib'])")
export CUBLASMP_HOME="${SITE_PACKAGES}/nvidia/cublasmp/cu${CUDA_MAJOR}"
export CUSOLVERMP_HOME="${SITE_PACKAGES}/nvidia/cu${CUDA_MAJOR}"

Contributor

P1 Likely incorrect CUSOLVERMP_HOME path

The path ${SITE_PACKAGES}/nvidia/cu${CUDA_MAJOR} is missing the package-name segment. Every other NVIDIA Python package follows the layout site-packages/nvidia/<package-name>/cu<ver>/; for example, nvidia-cublasmp-cu12 installs under nvidia/cublasmp/cu12/, so nvidia-cusolvermp-cu12 should install under nvidia/cusolvermp/cu12/. With the current path the .so symlink loop silently skips cuSolverMp's lib/ directory ([ -d "$lib_dir" ] || continue), no unversioned .so stubs are created, and the linker will not find cuSolverMp at build time even though NVTE_WITH_CUSOLVERMP=1 is exported.
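
For reference, the layout described in this comment can be expressed as a small Python helper. This is a sketch only: `nvidia_package_home` is a hypothetical name, and the `nvidia/<package-name>/cu<ver>/` layout is taken from the comment above rather than verified against the published wheel.

```python
import sysconfig
from pathlib import Path
from typing import Optional


def nvidia_package_home(package: str, cuda_major: int,
                        site_packages: Optional[str] = None) -> Path:
    """Build <site-packages>/nvidia/<package>/cu<major>.

    When site_packages is omitted, the interpreter's purelib directory is used,
    matching the sysconfig call in the build script.
    """
    base = Path(site_packages) if site_packages else Path(sysconfig.get_paths()["purelib"])
    return base / "nvidia" / package / f"cu{cuda_major}"
```

Under that layout, `nvidia_package_home("cusolvermp", 12)` would be the value to export as CUSOLVERMP_HOME, parallel to the existing CUBLASMP_HOME line.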

Comment thread build_tools/wheel_utils/Dockerfile.x86 Outdated
Comment thread build_tools/wheel_utils/Dockerfile.aarch Outdated
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
@ksivaman ksivaman force-pushed the expand_wheel_builds branch from 522c631 to df140b3 Compare May 19, 2026 23:33
@ksivaman ksivaman marked this pull request as ready for review May 19, 2026 23:34
Comment thread build_tools/wheel_utils/build_wheels.sh
@ksivaman ksivaman changed the title Add optional core lib features to wheel build Add wheel support for Newton-Schulz method via cuSolverMp May 19, 2026
vcherepanov-nv previously approved these changes May 20, 2026

@vcherepanov-nv vcherepanov-nv left a comment

Collaborator

LGTM

@vcherepanov-nv vcherepanov-nv dismissed their stale review May 20, 2026 00:39

On second thought: we need the nvidia-cusolvermp-cu12/cu13 dependency at runtime, not just when building TE/Common.
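
A runtime check for that dependency might look like the sketch below. It uses `importlib.metadata` from the standard library; `installed_cusolvermp_wheel` is a hypothetical helper, and the `lookup` parameter exists only so the function can be exercised without the wheels installed.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Callable, Optional, Sequence


def installed_cusolvermp_wheel(
    candidates: Sequence[str] = ("nvidia-cusolvermp-cu12", "nvidia-cusolvermp-cu13"),
    lookup: Callable[[str], str] = version,
) -> Optional[str]:
    """Return the first candidate distribution that is installed, or None."""
    for name in candidates:
        try:
            lookup(name)  # raises PackageNotFoundError when the wheel is absent
        except PackageNotFoundError:
            continue
        return name
    return None
```

If this returns `None` on a machine that also lacks a system-wide cuSolverMp, the Newton-Schulz path has no library to load, which is the situation the runtime dependency is meant to prevent.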

Comment thread transformer_engine/common/__init__.py Outdated
ksivaman commented Jun 2, 2026

Member Author

/te-ci

vcherepanov-nv previously approved these changes Jun 2, 2026
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
ksivaman commented Jun 3, 2026

Member Author

/te-ci

vcherepanov-nv previously approved these changes Jun 3, 2026
Comment thread setup.py
ksivaman added 2 commits June 4, 2026 12:54
Signed-off-by: ksivamani <ksivamani@nvidia.com>
ksivaman added 3 commits June 9, 2026 16:50
Signed-off-by: ksivamani <ksivamani@nvidia.com>
Signed-off-by: ksivamani <ksivamani@nvidia.com>
@ksivaman

Member Author

/te-ci

@ksivaman ksivaman merged commit 20e185c into NVIDIA:main Jun 10, 2026
37 of 43 checks passed

3 participants