
Fix tensor bridge DLL import failure on Windows #1988

Merged
leofang merged 3 commits into NVIDIA:main from leofang:fix-tensor-bridge-win-dll
Apr 30, 2026

Conversation

@leofang (Member) commented Apr 29, 2026

Follow-up of #1894.

aoti_torch_get_current_cuda_stream lives in torch_cuda.dll / libtorch_cuda.so, not torch_cpu.dll. Linking against it at build time breaks CPU-only PyTorch installs and caused "ImportError: DLL load failed" on Windows.

Root cause

The .def file declared all AOTI symbols under LIBRARY torch_cpu.dll, but aoti_torch_get_current_cuda_stream is CUDA-specific. On Linux this wasn't visible because RTLD_GLOBAL makes all symbols visible across shared libraries, but Windows DLL imports are strict.
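To illustrate the root cause, here is a hypothetical sketch of the offending module-definition (.def) layout used to generate the Windows stub import library. The symbol names besides aoti_torch_get_current_cuda_stream are illustrative, not copied from the actual aoti_shim.def:

```
LIBRARY torch_cpu.dll
EXPORTS
    aoti_torch_get_current_cuda_stream  ; wrong home: this symbol lives in torch_cuda.dll
    aoti_torch_get_device_type          ; (illustrative CPU-shim symbol)
```

An import library built from this .def tells the Windows loader to look for every listed symbol in torch_cpu.dll at module load time, so a single misplaced CUDA symbol fails the whole import.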

Fix

Resolve aoti_torch_get_current_cuda_stream lazily at runtime instead of link-time:

  • Inline C helper using LoadLibrary + GetProcAddress (Windows) / dlsym(RTLD_DEFAULT, ...) (Linux)
  • Cached function pointer — resolved once on first CUDA tensor stream sync, then pure C on subsequent calls
  • CPU-only PyTorch works fine: the resolution only triggers when a CUDA tensor actually needs stream ordering

Changes

  • _tensor_bridge.pyx: Remove static extern declaration, add inline C resolver + cached function pointer in sync_torch_stream()
  • aoti_shim.h: Remove aoti_torch_get_current_cuda_stream declaration, add explanatory comments
  • aoti_shim.def: Remove the symbol (already done in previous commit)
  • build_hooks.py: Revert to single-def-file (no second stub lib needed)

Testing

Verified on Windows with both CUDA-enabled and CPU-only PyTorch:

  • CUDA tensors: StridedMemoryView.from_dlpack(cuda_tensor) works correctly
  • CPU tensors: module imports and works without triggering CUDA resolution
  • CPU-only PyTorch: no import errors

-- Leo's bot

aoti_torch_get_current_cuda_stream lives in torch_cuda.dll, not
torch_cpu.dll. The stub import library pointed at the wrong DLL,
causing "The specified procedure could not be found" on Windows.

- Move aoti_torch_get_current_cuda_stream from aoti_shim.def
  (torch_cpu.dll) to new aoti_shim_cuda.def (torch_cuda.dll)
- Update build_hooks.py to generate stub libs for both DLLs
  via a loop
- Add torch_cuda.dll to delvewheel exclude list

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@copy-pr-bot (Bot) commented Apr 29, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the cuda.core Everything related to the cuda.core module label Apr 29, 2026
@leofang leofang added bug Something isn't working P0 High priority - Must do! labels Apr 29, 2026
@leofang leofang added this to the cuda.core v1.0.0 milestone Apr 29, 2026
@leofang leofang self-assigned this Apr 29, 2026
@leofang leofang marked this pull request as ready for review April 29, 2026 07:26
@leofang (Member, Author) commented Apr 29, 2026

/ok to test 77c0b7b

@leofang leofang requested a review from rwgk April 29, 2026 15:28
@github-actions

This comment has been minimized.

@rwgk rwgk left a comment

Generated with Cursor GPT-5.4 Extra High Fast with a few rounds of prompting:


Is the intent that this Windows fast path only supports CUDA-enabled PyTorch?

As written, _tensor_bridge now links against torch_cuda.dll, but the fast path is still taken for any supported torch.Tensor, including CPU tensors. On Windows, that looks like it could make the module fail to load under CPU-only PyTorch, or even for CPU-tensor-only use, before the code can reach the CPU branch. Could this be avoided by keeping the base bridge linked only to torch_cpu.dll and loading or resolving the CUDA-specific symbol lazily only when a CUDA tensor actually needs stream synchronization?

Comment thread on cuda_core/build_hooks.py (outdated)
The symbol lives in torch_cuda (not torch_cpu), so linking against it
at build time breaks CPU-only PyTorch installs and requires a second
stub import library on Windows.

Instead, resolve it lazily on first use via dlsym (Linux) /
LoadLibrary+GetProcAddress (Windows).  The cached function pointer
keeps subsequent calls fully in C with zero Python overhead.

This reverts the two-def-file approach from the previous commit and
replaces it with a self-contained inline C helper that handles both
platforms.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@leofang (Member, Author) commented Apr 29, 2026

Is the intent that this Windows fast path only supports CUDA-enabled PyTorch?

Good idea. Fixed in commit 50294b8.

@leofang (Member, Author) commented Apr 29, 2026

/ok to test 50294b8

@rwgk (Contributor) left a comment

Approving.

For peace of mind, I'd fix the race, but it'd seem defendable to assume that the C-level data race under free-threading is benign.


check_aoti(aoti_torch_get_current_cuda_stream(device_index, &producer_s),
if _cached_get_cuda_stream == NULL:
    _cached_get_cuda_stream = <_get_cuda_stream_fn_t>_resolve_cuda_stream_fn()
Contributor comment:

We have "freethreading_compatible": True in cuda_core/build_hooks.py, therefore this can be a C-level data race.

Could we leave it as-is?

This is what Cursor GPT-5.4 Extra High Fast said when I asked about "worst-case consequences":


If both threads ultimately resolve the same correct function address, there are two answers.

The formal C answer is still: undefined behavior.

C does not say "same value is okay." Unsynchronized concurrent read/write or write/write to the same object is a data race regardless of whether the bits happen to match. Once you have that, the compiler is allowed to assume the race does not exist.

For this exact snippet in cuda_core/cuda/core/_tensor_bridge.pyx, the practical risk is probably much smaller than the formal one:

  • most likely, both threads just call _resolve_cuda_stream_fn(), both get the same pointer, and both store it;
  • on Windows that may mean an extra LoadLibraryA("torch_cuda.dll") refcount bump;
  • after that, things keep working.

The worst credible consequences are:

  • a thread observes a stale or torn value and calls through a bad pointer or NULL, leading to a crash;
  • compiler optimization based on the "no races exist" assumption produces behavior that is not obviously tied to the source;
  • a thread sees the uninitialized state longer than expected and does duplicate initialization or raises spuriously.

That said, in this particular pattern, on mainstream 64-bit Linux/Windows builds, I would rate the real-world risk as "probably benign most of the time," not "likely to explode." It is still a real bug for a module that claims free-threading compatibility, but it is more of a correctness-contract violation / sanitizer-finding than an obvious imminent crash bug.


I also asked Cursor for a fix, this is what it generated (untested):

diff --git a/cuda_core/cuda/core/_tensor_bridge.pyx b/cuda_core/cuda/core/_tensor_bridge.pyx
index 07eec56537b..THREADSAFE000 100644
--- a/cuda_core/cuda/core/_tensor_bridge.pyx
+++ b/cuda_core/cuda/core/_tensor_bridge.pyx
@@
 ctypedef AOTITorchError (*_get_cuda_stream_fn_t)(int32_t, void**) nogil
 
 cdef extern from *:
     """
     #ifdef _WIN32
     #include <windows.h>
-    static void* _resolve_cuda_stream_fn(void) {
-        HMODULE h = LoadLibraryA("torch_cuda.dll");
-        if (!h) return NULL;
-        return (void*)GetProcAddress(h, "aoti_torch_get_current_cuda_stream");
-    }
+    static INIT_ONCE _cuda_stream_init_once = INIT_ONCE_STATIC_INIT;
+    static void* _cached_cuda_stream_fn = NULL;
+
+    static BOOL CALLBACK _init_cuda_stream_fn(
+            PINIT_ONCE init_once, PVOID param, PVOID* context) {
+        HMODULE h = LoadLibraryA("torch_cuda.dll");
+        if (h) {
+            _cached_cuda_stream_fn =
+                (void*)GetProcAddress(h, "aoti_torch_get_current_cuda_stream");
+        }
+        return TRUE;
+    }
+
+    static void* _resolve_cuda_stream_fn(void) {
+        InitOnceExecuteOnce(&_cuda_stream_init_once, _init_cuda_stream_fn, NULL, NULL);
+        return _cached_cuda_stream_fn;
+    }
     #else
     #include <dlfcn.h>
+    #include <pthread.h>
     #ifndef RTLD_DEFAULT
     #define RTLD_DEFAULT ((void*)0)
     #endif
+    static pthread_once_t _cuda_stream_once = PTHREAD_ONCE_INIT;
+    static void* _cached_cuda_stream_fn = NULL;
+
+    static void _init_cuda_stream_fn(void) {
+        _cached_cuda_stream_fn =
+            dlsym(RTLD_DEFAULT, "aoti_torch_get_current_cuda_stream");
+    }
+
     static void* _resolve_cuda_stream_fn(void) {
-        return dlsym(RTLD_DEFAULT, "aoti_torch_get_current_cuda_stream");
+        pthread_once(&_cuda_stream_once, _init_cuda_stream_fn);
+        return _cached_cuda_stream_fn;
     }
     #endif
     """
     void* _resolve_cuda_stream_fn() nogil
-
-cdef _get_cuda_stream_fn_t _cached_get_cuda_stream = NULL
 
 @@
 cpdef int sync_torch_stream(int32_t device_index,
                             intptr_t consumer_s) except? -1:
 @@
-    global _cached_get_cuda_stream
     cdef void* producer_s
     cdef EventHandle h_event
+    cdef _get_cuda_stream_fn_t get_cuda_stream
 
-    if _cached_get_cuda_stream == NULL:
-        _cached_get_cuda_stream = <_get_cuda_stream_fn_t>_resolve_cuda_stream_fn()
-        if _cached_get_cuda_stream == NULL:
-            raise RuntimeError(
-                "Cannot resolve aoti_torch_get_current_cuda_stream from "
-                "torch_cuda — is CUDA-enabled PyTorch installed?")
-    check_aoti(_cached_get_cuda_stream(device_index, &producer_s),
+    get_cuda_stream = <_get_cuda_stream_fn_t>_resolve_cuda_stream_fn()
+    if get_cuda_stream == NULL:
+        raise RuntimeError(
+            "Cannot resolve aoti_torch_get_current_cuda_stream from "
+            "torch_cuda — is CUDA-enabled PyTorch installed?")
+    check_aoti(get_cuda_stream(device_index, &producer_s),
                b"aoti_torch_get_current_cuda_stream")

@leofang (Member, Author) commented Apr 30, 2026

Thanks, Ralf!

For peace of mind, I'd fix the race, but it'd seem defendable to assume that the C-level data race under free-threading is benign.

Yeah, I don't think it would cause issues in practice. Let's see who complains... This would have been a non-issue if we fed the shim header to cybind and generated Cython/internal bindings, where we take good care of protecting module-level variables.

@leofang leofang merged commit edb1901 into NVIDIA:main Apr 30, 2026
181 of 183 checks passed
@leofang leofang deleted the fix-tensor-bridge-win-dll branch April 30, 2026 03:03

mdboom pushed a commit to mdboom/cuda-python that referenced this pull request Apr 30, 2026
* Fix tensor bridge DLL import failure on Windows
* Add SPDX headers to aoti_shim_cuda.def
* Resolve aoti_torch_get_current_cuda_stream lazily at runtime

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>