
Fix tensor bridge DLL import failure on Windows #1988

Merged
leofang merged 3 commits into NVIDIA:main from leofang:fix-tensor-bridge-win-dll
Apr 30, 2026

Conversation

@leofang (Member) commented Apr 29, 2026

Follow-up of #1894.

aoti_torch_get_current_cuda_stream lives in torch_cuda.dll / libtorch_cuda.so, not torch_cpu.dll. Linking against it at build time breaks CPU-only PyTorch installs and caused "ImportError: DLL load failed" on Windows.

Root cause

The .def file declared all AOTI symbols under LIBRARY torch_cpu.dll, but aoti_torch_get_current_cuda_stream is CUDA-specific. On Linux this wasn't visible because RTLD_GLOBAL makes all symbols visible across shared libraries, but Windows DLL imports are strict.
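To illustrate the root cause, here is a hypothetical sketch of the offending module-definition (.def) layout used to generate the Windows stub import library. The symbol names besides aoti_torch_get_current_cuda_stream are illustrative, not copied from the actual aoti_shim.def:

```
LIBRARY torch_cpu.dll
EXPORTS
    aoti_torch_get_current_cuda_stream  ; wrong home: this symbol lives in torch_cuda.dll
    aoti_torch_get_device_type          ; (illustrative CPU-shim symbol)
```

An import library built from this .def tells the Windows loader to look for every listed symbol in torch_cpu.dll at module load time, so a single misplaced CUDA symbol fails the whole import.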

Fix

Resolve aoti_torch_get_current_cuda_stream lazily at runtime instead of link-time:

  • Inline C helper using LoadLibrary + GetProcAddress (Windows) / dlsym(RTLD_DEFAULT, ...) (Linux)
  • Cached function pointer — resolved once on first CUDA tensor stream sync, then pure C on subsequent calls
  • CPU-only PyTorch works fine: the resolution only triggers when a CUDA tensor actually needs stream ordering

Changes

  • _tensor_bridge.pyx: Remove static extern declaration, add inline C resolver + cached function pointer in sync_torch_stream()
  • aoti_shim.h: Remove aoti_torch_get_current_cuda_stream declaration, add explanatory comments
  • aoti_shim.def: Remove the symbol (already done in previous commit)
  • build_hooks.py: Revert to single-def-file (no second stub lib needed)

Testing

Verified on Windows with both CUDA-enabled and CPU-only PyTorch:

  • CUDA tensors: StridedMemoryView.from_dlpack(cuda_tensor) works correctly
  • CPU tensors: module imports and works without triggering CUDA resolution
  • CPU-only PyTorch: no import errors

-- Leo's bot

aoti_torch_get_current_cuda_stream lives in torch_cuda.dll, not
torch_cpu.dll. The stub import library pointed at the wrong DLL,
causing "The specified procedure could not be found" on Windows.

- Move aoti_torch_get_current_cuda_stream from aoti_shim.def
  (torch_cpu.dll) to new aoti_shim_cuda.def (torch_cuda.dll)
- Update build_hooks.py to generate stub libs for both DLLs
  via a loop
- Add torch_cuda.dll to delvewheel exclude list

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@copy-pr-bot (Bot) commented Apr 29, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the cuda.core Everything related to the cuda.core module label Apr 29, 2026
@leofang leofang added bug Something isn't working P0 High priority - Must do! labels Apr 29, 2026
@leofang leofang added this to the cuda.core v1.0.0 milestone Apr 29, 2026
@leofang leofang self-assigned this Apr 29, 2026
@leofang leofang marked this pull request as ready for review April 29, 2026 07:26
@leofang (Member, Author) commented Apr 29, 2026

/ok to test 77c0b7b

@leofang leofang requested a review from rwgk April 29, 2026 15:28
@github-actions

This comment has been minimized.

@rwgk rwgk left a comment

Generated with Cursor GPT-5.4 Extra High Fast with a few rounds of prompting:


Is the intent that this Windows fast path only supports CUDA-enabled PyTorch?

As written, _tensor_bridge now links against torch_cuda.dll, but the fast path is still taken for any supported torch.Tensor, including CPU tensors. On Windows, that looks like it could make the module fail to load under CPU-only PyTorch, or even for CPU-tensor-only use, before the code can reach the CPU branch. Could this be avoided by keeping the base bridge linked only to torch_cpu.dll and loading or resolving the CUDA-specific symbol lazily only when a CUDA tensor actually needs stream synchronization?

Comment thread on cuda_core/build_hooks.py (outdated)
The symbol lives in torch_cuda (not torch_cpu), so linking against it
at build time breaks CPU-only PyTorch installs and requires a second
stub import library on Windows.

Instead, resolve it lazily on first use via dlsym (Linux) /
LoadLibrary+GetProcAddress (Windows).  The cached function pointer
keeps subsequent calls fully in C with zero Python overhead.

This reverts the two-def-file approach from the previous commit and
replaces it with a self-contained inline C helper that handles both
platforms.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@leofang (Member, Author) commented Apr 29, 2026

Is the intent that this Windows fast path only supports CUDA-enabled PyTorch?

Good idea. Fixed in commit 50294b8.

@leofang (Member, Author) commented Apr 29, 2026

/ok to test 50294b8

@rwgk (Contributor) left a comment

Approving.

For peace of mind, I'd fix the race, but it'd seem defendable to assume that the C-level data race under free-threading is benign.


check_aoti(aoti_torch_get_current_cuda_stream(device_index, &producer_s),
if _cached_get_cuda_stream == NULL:
    _cached_get_cuda_stream = <_get_cuda_stream_fn_t>_resolve_cuda_stream_fn()
Contributor comment:

We have "freethreading_compatible": True in cuda_core/build_hooks.py, therefore this can be a C-level data race.

Could we leave it as-is?

This is what Cursor GPT-5.4 Extra High Fast said when I asked about "worst-case consequences":


If both threads ultimately resolve the same correct function address, there are two answers.

The formal C answer is still: undefined behavior.

C does not say "same value is okay." Unsynchronized concurrent read/write or write/write to the same object is a data race regardless of whether the bits happen to match. Once you have that, the compiler is allowed to assume the race does not exist.

For this exact snippet in cuda_core/cuda/core/_tensor_bridge.pyx, the practical risk is probably much smaller than the formal one:

  • most likely, both threads just call _resolve_cuda_stream_fn(), both get the same pointer, and both store it;
  • on Windows that may mean an extra LoadLibraryA("torch_cuda.dll") refcount bump;
  • after that, things keep working.

The worst credible consequences are:

  • a thread observes a stale or torn value and calls through a bad pointer or NULL, leading to a crash;
  • compiler optimization based on the "no races exist" assumption produces behavior that is not obviously tied to the source;
  • a thread sees the uninitialized state longer than expected and does duplicate initialization or raises spuriously.

That said, in this particular pattern, on mainstream 64-bit Linux/Windows builds, I would rate the real-world risk as "probably benign most of the time," not "likely to explode." It is still a real bug for a module that claims free-threading compatibility, but it is more of a correctness-contract violation / sanitizer-finding than an obvious imminent crash bug.


I also asked Cursor for a fix, this is what it generated (untested):

diff --git a/cuda_core/cuda/core/_tensor_bridge.pyx b/cuda_core/cuda/core/_tensor_bridge.pyx
index 07eec56537b..THREADSAFE000 100644
--- a/cuda_core/cuda/core/_tensor_bridge.pyx
+++ b/cuda_core/cuda/core/_tensor_bridge.pyx
@@
 ctypedef AOTITorchError (*_get_cuda_stream_fn_t)(int32_t, void**) nogil
 
 cdef extern from *:
     """
     #ifdef _WIN32
     #include <windows.h>
-    static void* _resolve_cuda_stream_fn(void) {
-        HMODULE h = LoadLibraryA("torch_cuda.dll");
-        if (!h) return NULL;
-        return (void*)GetProcAddress(h, "aoti_torch_get_current_cuda_stream");
-    }
+    static INIT_ONCE _cuda_stream_init_once = INIT_ONCE_STATIC_INIT;
+    static void* _cached_cuda_stream_fn = NULL;
+
+    static BOOL CALLBACK _init_cuda_stream_fn(
+            PINIT_ONCE init_once, PVOID param, PVOID* context) {
+        HMODULE h = LoadLibraryA("torch_cuda.dll");
+        if (h) {
+            _cached_cuda_stream_fn =
+                (void*)GetProcAddress(h, "aoti_torch_get_current_cuda_stream");
+        }
+        return TRUE;
+    }
+
+    static void* _resolve_cuda_stream_fn(void) {
+        InitOnceExecuteOnce(&_cuda_stream_init_once, _init_cuda_stream_fn, NULL, NULL);
+        return _cached_cuda_stream_fn;
+    }
     #else
     #include <dlfcn.h>
+    #include <pthread.h>
     #ifndef RTLD_DEFAULT
     #define RTLD_DEFAULT ((void*)0)
     #endif
+    static pthread_once_t _cuda_stream_once = PTHREAD_ONCE_INIT;
+    static void* _cached_cuda_stream_fn = NULL;
+
+    static void _init_cuda_stream_fn(void) {
+        _cached_cuda_stream_fn =
+            dlsym(RTLD_DEFAULT, "aoti_torch_get_current_cuda_stream");
+    }
+
     static void* _resolve_cuda_stream_fn(void) {
-        return dlsym(RTLD_DEFAULT, "aoti_torch_get_current_cuda_stream");
+        pthread_once(&_cuda_stream_once, _init_cuda_stream_fn);
+        return _cached_cuda_stream_fn;
     }
     #endif
     """
     void* _resolve_cuda_stream_fn() nogil
-
-cdef _get_cuda_stream_fn_t _cached_get_cuda_stream = NULL
 
 @@
 cpdef int sync_torch_stream(int32_t device_index,
                             intptr_t consumer_s) except? -1:
 @@
-    global _cached_get_cuda_stream
     cdef void* producer_s
     cdef EventHandle h_event
+    cdef _get_cuda_stream_fn_t get_cuda_stream
 
-    if _cached_get_cuda_stream == NULL:
-        _cached_get_cuda_stream = <_get_cuda_stream_fn_t>_resolve_cuda_stream_fn()
-        if _cached_get_cuda_stream == NULL:
-            raise RuntimeError(
-                "Cannot resolve aoti_torch_get_current_cuda_stream from "
-                "torch_cuda — is CUDA-enabled PyTorch installed?")
-    check_aoti(_cached_get_cuda_stream(device_index, &producer_s),
+    get_cuda_stream = <_get_cuda_stream_fn_t>_resolve_cuda_stream_fn()
+    if get_cuda_stream == NULL:
+        raise RuntimeError(
+            "Cannot resolve aoti_torch_get_current_cuda_stream from "
+            "torch_cuda — is CUDA-enabled PyTorch installed?")
+    check_aoti(get_cuda_stream(device_index, &producer_s),
                b"aoti_torch_get_current_cuda_stream")

@leofang (Member, Author) commented Apr 30, 2026

Thanks, Ralf!

For peace of mind, I'd fix the race, but it'd seem defendable to assume that the C-level data race under free-threading is benign.

Yeah, I don't think it would cause issues in practice. Let's see who complains... This would have been a non-issue if we fed the shim header to cybind and generated Cython/internal bindings, where we take good care of protecting module-level variables.

@leofang leofang merged commit edb1901 into NVIDIA:main Apr 30, 2026
181 of 183 checks passed
@leofang leofang deleted the fix-tensor-bridge-win-dll branch April 30, 2026 03:03

mdboom pushed a commit to mdboom/cuda-python that referenced this pull request Apr 30, 2026
* Fix tensor bridge DLL import failure on Windows
* Add SPDX headers to aoti_shim_cuda.def
* Resolve aoti_torch_get_current_cuda_stream lazily at runtime

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>