
WIP: pathfinder_compatibility_guard_rails #1977

Draft

rwgk wants to merge 10 commits into NVIDIA:main from rwgk:pathfinder_compatibility_guard_rails

Conversation

@rwgk
Contributor

@rwgk rwgk commented Apr 25, 2026

Resolves #1038

Continuation of #1936

WIP — CI testing

rwgk added 6 commits April 24, 2026 15:15
Introduce CompatibilityGuardRails plus related errors and tests so callers can opt into CTK and driver compatibility checks while reusing the existing pathfinder lookup APIs.

Made-with: Cursor
Expose process_wide_compatibility_guard_rails at import time so follow-up changes can route the default cuda.pathfinder APIs through a stable public instance. Document the singleton and pin its public availability with a small regression test.

Made-with: Cursor
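A minimal sketch of what that availability regression test could look like, assuming the singleton is re-exported from the cuda.pathfinder package under the name given above (an assumption; the actual test in this PR may differ):

```python
# Hedged sketch: assumes process_wide_compatibility_guard_rails is
# re-exported from cuda.pathfinder at import time.
import cuda.pathfinder


def test_process_wide_compatibility_guard_rails_is_public():
    # Import-time availability is the contract being pinned down.
    assert hasattr(cuda.pathfinder, "process_wide_compatibility_guard_rails")
    assert cuda.pathfinder.process_wide_compatibility_guard_rails is not None
```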
Make the process-wide CompatibilityGuardRails instance the default path for the public load/find/locate APIs so top-level calls share compatibility state. Factor the routing/fallback/cache-reset glue into a dedicated internal module to keep `cuda.pathfinder.__init__` focused on the public surface, and fall back to the existing raw resolvers when the v1 guard rails report insufficient metadata.

Made-with: Cursor
Allow CUDA_PATHFINDER_COMPATIBILITY_GUARD_RAILS to select strict, best_effort, or off behavior so we can experiment with stricter compatibility checks without changing the public API shape.

Made-with: Cursor
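As a rough illustration of the mode switch (not this PR's implementation), reading and validating the environment variable could look like the sketch below; the helper name and the best_effort default are assumptions:

```python
import os

_VALID_MODES = ("strict", "best_effort", "off")


def _read_guard_rails_mode(default="best_effort"):
    # Hypothetical helper: map the environment variable named in the commit
    # message to one of the three documented modes, rejecting anything else.
    raw = os.environ.get("CUDA_PATHFINDER_COMPATIBILITY_GUARD_RAILS", default)
    mode = raw.strip().lower()
    if mode not in _VALID_MODES:
        raise ValueError(
            f"CUDA_PATHFINDER_COMPATIBILITY_GUARD_RAILS={raw!r}: "
            f"expected one of {_VALID_MODES}"
        )
    return mode
```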
Treat driver-packaged libraries as compatibility-neutral so strict mode can load NVML and other driver libs without a raw fallback, while CTK-backed artifacts remain the only items that establish and enforce the process-wide CTK anchor.

Made-with: Cursor
Infer the CUDA Toolkit line from both wildcard-pinned and range-based cuda-toolkit requirements so strict process-wide guard rails keep working for editable wheel installs used by nvrtc and nvJitLink.

Made-with: Cursor
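The shape of that inference, sketched under the assumption that the requirement strings are parsed with the packaging library; the helper name and the example requirement spellings are illustrative, not this PR's code:

```python
import re

from packaging.requirements import Requirement


def infer_ctk_line(requirement_string):
    # Hypothetical helper: extract a "major.minor" CTK line from a
    # cuda-toolkit requirement, whether wildcard-pinned or range-based.
    # Returns None if no usable lower bound is present.
    req = Requirement(requirement_string)
    for spec in req.specifier:
        if spec.operator in ("==", "~=", ">="):
            match = re.match(r"(\d+)\.(\d+)", spec.version)
            if match:
                return int(match.group(1)), int(match.group(2))
    return None


# Illustrative inputs only; the real requirement strings may differ.
assert infer_ctk_line("cuda-toolkit==13.2.*") == (13, 2)
assert infer_ctk_line("cuda-toolkit>=13.2,<13.3") == (13, 2)
```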
@copy-pr-bot
Contributor

copy-pr-bot Bot commented Apr 25, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@rwgk rwgk self-assigned this Apr 25, 2026
@github-actions github-actions Bot added the cuda.pathfinder Everything related to the cuda.pathfinder module label Apr 25, 2026
@rwgk rwgk added the P0 High priority - Must do! label Apr 25, 2026
@rwgk rwgk added this to the cuda.pathfinder next milestone Apr 25, 2026
@rwgk
Contributor Author

rwgk commented Apr 25, 2026

/ok to test

@rwgk rwgk added the feature New feature or request label Apr 25, 2026
@rwgk
Contributor Author

rwgk commented Apr 25, 2026

Analysis of CI failures for workflow run

Cursor GPT-5.4 Extra High Fast


Findings

  • The failures are not universal. All 34 failing Test_*.txt files I found are local matrix jobs; I did not find failing wheels jobs in this archive.
  • There is one real failure family: strict process-wide guard rails reject CTK dynamic libraries loaded from the checked-out local toolkit tree because pathfinder cannot infer a CTK line for those paths. Representative Linux log: /wrk/logs_24924024509/Test_linux-64___py3.11__13.0.2__local__l4.txt (lines 7237, 7307, 7369, 7442). Representative Windows log: /wrk/logs_24924024509/Test_win-64___py3.10__13.0.2__local__rtxpro6000__TCC_.txt (lines 10323, 10387, 10443, 10510).
  • The 12.9 local jobs hit the same root cause later and therefore fail in cuda_core rather than cuda_bindings, because nvJitLink is the first affected library touched there. Examples: /wrk/logs_24924024509/Test_linux-64___py3.10__12.9.1__local__v100.txt:7389 and /wrk/logs_24924024509/Test_win-64___py3.13__12.9.1__local__l4__TCC_.txt:10289.
  • This matches current intended behavior exactly. In the same failing jobs, tests/test_compatibility_guard_rails.py::test_missing_version_json_raises_insufficient_metadata passes, for example /wrk/logs_24924024509/Test_linux-64___py3.11__13.0.2__local__l4.txt:6414 and /wrk/logs_24924024509/Test_win-64___py3.10__13.0.2__local__rtxpro6000__TCC_.txt:8971.
  • The `##[error]WARNING: Running pip as the 'root' user ...` lines are red herrings from GitHub Actions log formatting, not the cause of the job failures. The real failures are the exit-code-2 jobs listed above.

Why

  • The current logic in cuda_pathfinder/cuda/pathfinder/_compatibility_guard_rails.py only accepts CTK metadata from an enclosing version.json or from wheel ownership / cuda-toolkit metadata.
  • The local CI toolkit layout under cuda_toolkit/ appears to provide neither. I also checked the current tree and there is no version.json under cuda_toolkit/.
  • A filename-based guess would not be a proper fix. The local 13.0.2 and 13.2.1 jobs use the same artifact names (libnvrtc.so.13 on Linux, nvrtc64_130_0.dll / nvJitLink_130_0.dll on Windows), so the minor CTK line is not recoverable from the library filename alone.

Proper fix

  • The clean fix is to add a third authoritative CTK metadata source for local-toolkit jobs.
  • Best option: make the local CI cuda_toolkit/ tree include version.json at its root, so the existing pathfinder logic works unchanged (a minimal reading sketch follows this list).
  • Second-best option: teach pathfinder to read another authoritative manifest already present in that local toolkit tree, if CI already stages one.
  • Least attractive option: add an explicit env-var metadata source for local-toolkit test jobs. That would work, but it is more CI-specific and less clean than shipping real toolkit metadata.
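For the first option, CUDA toolkits ship a version.json at the toolkit root whose cuda entry carries the full version string; a minimal reading sketch (the field names are my assumption about that format, not taken from this PR) could be:

```python
import json
from pathlib import Path


def read_ctk_line_from_version_json(toolkit_root):
    # Sketch: read <toolkit_root>/version.json and return (major, minor).
    # Assumes a top-level "cuda" entry whose "version" field is a full
    # "major.minor.patch" string.
    data = json.loads(Path(toolkit_root, "version.json").read_text())
    major, minor = data["cuda"]["version"].split(".")[:2]
    return int(major), int(minor)
```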

rwgk added 3 commits April 25, 2026 12:12
Introduce a small toolkit-info utility that reads the CUDA_VERSION macro from cuda.h so follow-up guard-rails changes can infer CTK major.minor from toolkit headers without depending on version.json.

Made-with: Cursor
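For context, cuda.h defines an integer CUDA_VERSION macro encoded as major * 1000 + minor * 10, so 13020 corresponds to CTK 13.2. A rough sketch of such a utility (not this PR's code; the function name and error handling are illustrative):

```python
import re
from pathlib import Path

_CUDA_VERSION_MACRO = re.compile(r"^\s*#\s*define\s+CUDA_VERSION\s+(\d+)", re.MULTILINE)


def ctk_line_from_cuda_h(toolkit_root):
    # Sketch: parse the CUDA_VERSION macro out of <toolkit_root>/include/cuda.h
    # and decode it into (major, minor), e.g. 13020 -> (13, 2).
    text = Path(toolkit_root, "include", "cuda.h").read_text(errors="replace")
    match = _CUDA_VERSION_MACRO.search(text)
    if match is None:
        raise RuntimeError(f"CUDA_VERSION not found under {toolkit_root}")
    encoded = int(match.group(1))
    return encoded // 1000, (encoded % 1000) // 10
```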
Centralize encoded CUDA version parsing and validation so toolkit and driver version helpers stay aligned and cuda.h parsing gets consistent string conversion and error reporting.

Made-with: Cursor
Replace version.json-based CTK root metadata with cuda.h parsing so compatibility checks use a simpler, more universal toolkit source while preserving wheel-based metadata inference.

Made-with: Cursor
@rwgk
Contributor Author

rwgk commented Apr 25, 2026

/ok to test

@rwgk
Contributor Author

rwgk commented Apr 26, 2026

At commit c6c38e3, the CI has a single failure, in Test_linux-aarch64___py3.14t__13.2.1__local__l4.txt.

That failure does not look like a cuda.h / guard-rails regression. The failing test is tests/test_compatibility_guard_rails.py::test_real_wheel_ctk_items_are_compatible, and the relevant lines are:

  • INFO test_find_binary_utilities[nvcc]: bin_path=None
  • INFO test_real_wheel_ctk_items_are_compatible: nvcc=None
  • FAILED tests/test_compatibility_guard_rails.py::test_real_wheel_ctk_items_are_compatible - assert None is not None

Spot-checking sibling logs shows that the underlying nvcc lookup behavior is inconsistent across closely related aarch64 local jobs:

  • Test_linux-aarch64___py3.11__13.2.1__local__l4.txt resolves nvcc successfully in both the local-toolkit phase ('/__w/cuda-python/cuda-python/cuda_toolkit/bin/nvcc') and the wheel phase ('/opt/hostedtoolcache/.../site-packages/nvidia/cu13/bin/nvcc').
  • Test_linux-aarch64___py3.14__13.2.1__local__a100.txt also resolves nvcc successfully in both phases and passes the same guard-rails test.
  • Test_linux-aarch64___py3.13__13.0.2__local__a100.txt already shows the same local-phase symptom as the failing job (INFO test_find_binary_utilities[nvcc]: bin_path=None), but the guard-rails test does not fail there because it exits earlier with CompatibilityCheckError: ... resolves to CTK 13.0, which does not satisfy ctk_minor==2.
  • Test_linux-aarch64___py3.14t__12.9.1__local__l4.txt resolves nvcc successfully in the local phase, but the guard-rails test also exits earlier because the local toolkit resolves to CTK 12.9, which does not satisfy the test's hard-wired ctk_major==13.

So the most important takeaway from the logs is that the single red test is the combination of two conditions occurring in the same job:

  1. the local mini-CTK nvcc lookup returns None, and
  2. the job happens to use the one CTK line (13.2) that drives test_real_wheel_ctk_items_are_compatible all the way to assert nvcc is not None.

That explains why this shows up as only one visible failure even though the broader nvcc lookup behavior is already inconsistent elsewhere in the same aarch64 local family.

Issues to look into next:

  • Why does nvcc lookup flip between None and '/__w/.../cuda_toolkit/bin/nvcc' across closely related local jobs?
  • Replace shutil.which() with a two-stage check (see the sketch after this list): first find the file, then assert that the executable bit is set.
  • CUDA_PATHFINDER_TEST_FIND_NVIDIA_BINARY_UTILITY_STRICTNESS is missing an all_must_work setting.
  • Hard-wired CTK version numbers in guard rail unit tests.
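A minimal sketch of the two-stage check mentioned above, separating file discovery from the executable-bit assertion so a found-but-not-executable nvcc fails loudly instead of silently yielding None (the function name, search interface, and error type are assumptions, not this PR's code):

```python
import os
from pathlib import Path


def find_executable(name, search_dirs):
    # Stage 1: locate the file by name in the candidate directories.
    for directory in search_dirs:
        candidate = Path(directory, name)
        if candidate.is_file():
            # Stage 2: assert the executable bit explicitly, rather than
            # letting a which()-style lookup silently skip the file.
            if not os.access(candidate, os.X_OK):
                raise PermissionError(f"{candidate} exists but is not executable")
            return str(candidate)
    return None
```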

