Fix tensor bridge DLL import failure on Windows #1988
Conversation
aoti_torch_get_current_cuda_stream lives in torch_cuda.dll, not torch_cpu.dll. The stub import library pointed at the wrong DLL, causing "The specified procedure could not be found" on Windows.

- Move aoti_torch_get_current_cuda_stream from aoti_shim.def (torch_cpu.dll) to new aoti_shim_cuda.def (torch_cuda.dll)
- Update build_hooks.py to generate stub libs for both DLLs via a loop
- Add torch_cuda.dll to delvewheel exclude list

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/ok to test 77c0b7b
This bug was discovered by #1987.
rwgk left a comment
Generated with Cursor GPT-5.4 Extra High Fast with a few rounds of prompting:
Is the intent that this Windows fast path only supports CUDA-enabled PyTorch?
As written, _tensor_bridge now links against torch_cuda.dll, but the fast path is still taken for any supported torch.Tensor, including CPU tensors. On Windows, that looks like it could make the module fail to load under CPU-only PyTorch, or even for CPU-tensor-only use, before the code can reach the CPU branch. Could this be avoided by keeping the base bridge linked only to torch_cpu.dll and loading or resolving the CUDA-specific symbol lazily only when a CUDA tensor actually needs stream synchronization?
The symbol lives in torch_cuda (not torch_cpu), so linking against it at build time breaks CPU-only PyTorch installs and requires a second stub import library on Windows. Instead, resolve it lazily on first use via dlsym (Linux) / LoadLibrary+GetProcAddress (Windows). The cached function pointer keeps subsequent calls fully in C with zero Python overhead. This reverts the two-def-file approach from the previous commit and replaces it with a self-contained inline C helper that handles both platforms. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
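For context, a minimal C sketch of the kind of lazy, cached resolver this commit message describes. The symbol and library names come from the discussion above; the helper name, the int32_t error type, and the error handling are illustrative assumptions, not the exact code in the PR:

```c
/* Sketch of lazy symbol resolution with a cached function pointer.
 * Assumes aoti_torch_get_current_cuda_stream returns an int32-compatible
 * error code (AOTITorchError in the real shim header). */
#include <stdint.h>
#include <stdio.h>

#ifdef _WIN32
#include <windows.h>
static void* resolve_cuda_stream_fn(void) {
    /* torch_cuda.dll is already loaded by `import torch`; LoadLibraryA
     * only bumps its reference count and returns the existing handle. */
    HMODULE h = LoadLibraryA("torch_cuda.dll");
    if (!h) return NULL;
    return (void*)GetProcAddress(h, "aoti_torch_get_current_cuda_stream");
}
#else
#include <dlfcn.h>
#ifndef RTLD_DEFAULT
#define RTLD_DEFAULT ((void*)0)
#endif
static void* resolve_cuda_stream_fn(void) {
    /* RTLD_DEFAULT searches all shared objects already loaded into the
     * process, so libtorch_cuda.so never has to be named explicitly. */
    return dlsym(RTLD_DEFAULT, "aoti_torch_get_current_cuda_stream");
}
#endif

typedef int32_t (*get_cuda_stream_fn_t)(int32_t device_index, void** stream);

/* Module-level cache: resolved once, every later call stays fully in C. */
static get_cuda_stream_fn_t cached_get_cuda_stream = NULL;

static int32_t get_current_cuda_stream(int32_t device_index, void** stream) {
    if (cached_get_cuda_stream == NULL) {
        cached_get_cuda_stream =
            (get_cuda_stream_fn_t)resolve_cuda_stream_fn();
        if (cached_get_cuda_stream == NULL) {
            fprintf(stderr,
                    "cannot resolve aoti_torch_get_current_cuda_stream; "
                    "is CUDA-enabled PyTorch installed?\n");
            return -1;  /* illustrative error value, not the shim's */
        }
    }
    return cached_get_cuda_stream(device_index, stream);
}
```

The check-then-store on the cached pointer is exactly the spot the free-threading discussion below is about.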
Good idea. Fixed in commit 50294b8.
/ok to test 50294b8
rwgk left a comment
Approving.
For peace of mind, I'd fix the race, but it seems defensible to assume that the C-level data race under free-threading is benign.
check_aoti(aoti_torch_get_current_cuda_stream(device_index, &producer_s),
if _cached_get_cuda_stream == NULL:
    _cached_get_cuda_stream = <_get_cuda_stream_fn_t>_resolve_cuda_stream_fn()
We have "freethreading_compatible": True in cuda_core/build_hooks.py, therefore this can be a C-level data race.
Could we leave it as-is?
This is what Cursor GPT-5.4 Extra High Fast said when I asked about "worst-case consequences":
If both threads ultimately resolve the same correct function address, there are two answers.
The formal C answer is still: undefined behavior.
C does not say "same value is okay." Unsynchronized concurrent read/write or write/write to the same object is a data race regardless of whether the bits happen to match. Once you have that, the compiler is allowed to assume the race does not exist.
For this exact snippet in cuda_core/cuda/core/_tensor_bridge.pyx, the practical risk is probably much smaller than the formal one:
- most likely, both threads just call _resolve_cuda_stream_fn(), both get the same pointer, and both store it;
- on Windows that may mean an extra LoadLibraryA("torch_cuda.dll") refcount bump;
- after that, things keep working.
The worst credible consequences are:
- a thread observes a stale or torn value and calls through a bad pointer or NULL, leading to a crash;
- compiler optimization based on the "no races exist" assumption produces behavior that is not obviously tied to the source;
- a thread sees the uninitialized state longer than expected and does duplicate initialization or raises spuriously.
That said, in this particular pattern, on mainstream 64-bit Linux/Windows builds, I would rate the real-world risk as "probably benign most of the time," not "likely to explode." It is still a real bug for a module that claims free-threading compatibility, but it is more of a correctness-contract violation / sanitizer-finding than an obvious imminent crash bug.
I also asked Cursor for a fix; this is what it generated (untested):
diff --git a/cuda_core/cuda/core/_tensor_bridge.pyx b/cuda_core/cuda/core/_tensor_bridge.pyx
index 07eec56537b..THREADSAFE000 100644
--- a/cuda_core/cuda/core/_tensor_bridge.pyx
+++ b/cuda_core/cuda/core/_tensor_bridge.pyx
@@
ctypedef AOTITorchError (*_get_cuda_stream_fn_t)(int32_t, void**) nogil
cdef extern from *:
"""
#ifdef _WIN32
#include <windows.h>
- static void* _resolve_cuda_stream_fn(void) {
- HMODULE h = LoadLibraryA("torch_cuda.dll");
- if (!h) return NULL;
- return (void*)GetProcAddress(h, "aoti_torch_get_current_cuda_stream");
- }
+ static INIT_ONCE _cuda_stream_init_once = INIT_ONCE_STATIC_INIT;
+ static void* _cached_cuda_stream_fn = NULL;
+
+ static BOOL CALLBACK _init_cuda_stream_fn(
+ PINIT_ONCE init_once, PVOID param, PVOID* context) {
+ HMODULE h = LoadLibraryA("torch_cuda.dll");
+ if (h) {
+ _cached_cuda_stream_fn =
+ (void*)GetProcAddress(h, "aoti_torch_get_current_cuda_stream");
+ }
+ return TRUE;
+ }
+
+ static void* _resolve_cuda_stream_fn(void) {
+ InitOnceExecuteOnce(&_cuda_stream_init_once, _init_cuda_stream_fn, NULL, NULL);
+ return _cached_cuda_stream_fn;
+ }
#else
#include <dlfcn.h>
+ #include <pthread.h>
#ifndef RTLD_DEFAULT
#define RTLD_DEFAULT ((void*)0)
#endif
+ static pthread_once_t _cuda_stream_once = PTHREAD_ONCE_INIT;
+ static void* _cached_cuda_stream_fn = NULL;
+
+ static void _init_cuda_stream_fn(void) {
+ _cached_cuda_stream_fn =
+ dlsym(RTLD_DEFAULT, "aoti_torch_get_current_cuda_stream");
+ }
+
static void* _resolve_cuda_stream_fn(void) {
- return dlsym(RTLD_DEFAULT, "aoti_torch_get_current_cuda_stream");
+ pthread_once(&_cuda_stream_once, _init_cuda_stream_fn);
+ return _cached_cuda_stream_fn;
}
#endif
"""
void* _resolve_cuda_stream_fn() nogil
-
-cdef _get_cuda_stream_fn_t _cached_get_cuda_stream = NULL
@@
cpdef int sync_torch_stream(int32_t device_index,
intptr_t consumer_s) except? -1:
@@
- global _cached_get_cuda_stream
cdef void* producer_s
cdef EventHandle h_event
+ cdef _get_cuda_stream_fn_t get_cuda_stream
- if _cached_get_cuda_stream == NULL:
- _cached_get_cuda_stream = <_get_cuda_stream_fn_t>_resolve_cuda_stream_fn()
- if _cached_get_cuda_stream == NULL:
- raise RuntimeError(
- "Cannot resolve aoti_torch_get_current_cuda_stream from "
- "torch_cuda — is CUDA-enabled PyTorch installed?")
- check_aoti(_cached_get_cuda_stream(device_index, &producer_s),
+ get_cuda_stream = <_get_cuda_stream_fn_t>_resolve_cuda_stream_fn()
+ if get_cuda_stream == NULL:
+ raise RuntimeError(
+ "Cannot resolve aoti_torch_get_current_cuda_stream from "
+ "torch_cuda — is CUDA-enabled PyTorch installed?")
+ check_aoti(get_cuda_stream(device_index, &producer_s),
b"aoti_torch_get_current_cuda_stream")|
Thanks, Ralf!
Yeah, I don't think it would cause issues in practice. Let's see who complains... This would have been a non-issue if we fed the shim header to cybind and generated cython/internal bindings, where we take good care of protecting the module-level variables.
* Fix tensor bridge DLL import failure on Windows

  aoti_torch_get_current_cuda_stream lives in torch_cuda.dll, not torch_cpu.dll. The stub import library pointed at the wrong DLL, causing "The specified procedure could not be found" on Windows.

  - Move aoti_torch_get_current_cuda_stream from aoti_shim.def (torch_cpu.dll) to new aoti_shim_cuda.def (torch_cuda.dll)
  - Update build_hooks.py to generate stub libs for both DLLs via a loop
  - Add torch_cuda.dll to delvewheel exclude list

* Add SPDX headers to aoti_shim_cuda.def

* Resolve aoti_torch_get_current_cuda_stream lazily at runtime

  The symbol lives in torch_cuda (not torch_cpu), so linking against it at build time breaks CPU-only PyTorch installs and requires a second stub import library on Windows. Instead, resolve it lazily on first use via dlsym (Linux) / LoadLibrary+GetProcAddress (Windows). The cached function pointer keeps subsequent calls fully in C with zero Python overhead. This reverts the two-def-file approach from the previous commit and replaces it with a self-contained inline C helper that handles both platforms.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Follow-up of #1894.
aoti_torch_get_current_cuda_stream lives in torch_cuda.dll / libtorch_cuda.so, not torch_cpu.dll. Linking against it at build time breaks CPU-only PyTorch installs and caused ImportError: DLL load failed on Windows.

Root cause

The .def file declared all AOTI symbols under LIBRARY torch_cpu.dll, but aoti_torch_get_current_cuda_stream is CUDA-specific. On Linux this wasn't visible because RTLD_GLOBAL makes all symbols visible across shared libraries, but Windows DLL imports are strict.

Fix

Resolve aoti_torch_get_current_cuda_stream lazily at runtime instead of at link time: LoadLibrary + GetProcAddress (Windows) / dlsym(RTLD_DEFAULT, ...) (Linux).

Changes

- _tensor_bridge.pyx: Remove static extern declaration, add inline C resolver + cached function pointer in sync_torch_stream()
- aoti_shim.h: Remove aoti_torch_get_current_cuda_stream declaration, add explanatory comments
- aoti_shim.def: Remove the symbol (already done in previous commit)
- build_hooks.py: Revert to single .def file (no second stub lib needed)

Testing

Verified on Windows with both CUDA-enabled and CPU-only PyTorch:

- StridedMemoryView.from_dlpack(cuda_tensor) works correctly

-- Leo's bot