Commit acc78f7
RAII resource handles for CUDA objects (#1368)
* Centralize context management code
- Create _context.pxd with Context class declaration
- Move get_primary_context from _device.pyx to _context.pyx
- Add Cython-level context functions: get_current_context, set_current_context, get_stream_context
- Update _device.pyx to use context module functions
- Update _stream.pyx to use context module functions
- Remove push_context and pop_context (unused, replaced with direct CUDA API calls)
- Reorganize _context.pyx according to style guide (principal class first)
* Integrate resource handles into Context class
- Replace Context._handle (object) with ContextHandle (shared_ptr) resource handle
- Add handle property to Context returning driver.CUcontext
- Update Context._from_ctx to create ContextHandle using create_context_handle()
- Update Context.__eq__ to compare actual CUcontext values (not shared_ptr addresses)
- Update Context.__hash__ to include type(self) and handle value with NULL safety
- Update _device.pyx to use ctx._handle.get()[0] for direct access
- Update _graph.pyx to use context.handle property
- Update C++ implementation to use default deleter (simplifies code)
- Rename _resource_handles_impl.cpp to _context_impl.cpp
- Remove test_dev.py development script
- Update .gitignore to allow *_impl.cpp files
- Fix all test files to use context.handle instead of context._handle
* Refactor context helpers to use ContextHandle and TLS cache
- Switch context helper APIs to return ContextHandle instead of raw CUcontext
- Add TLS wrapper for primary context caching using handles
- Update device/stream code to consume ContextHandle-based helpers
- Expose create_context_handle_ref as nogil-safe in the pxd
* Add helper functions to extract raw resources from ContextHandle
Introduce three helper functions for ContextHandle resource extraction:
- native(h): Returns cydriver.CUcontext for use with cydriver API calls
- py(h): Returns driver.CUcontext for use with Python driver API
- intptr(h): Returns uintptr_t for internal APIs expecting integer addresses
These helpers replace direct h_context.get()[0] calls, providing:
- Cleaner, more semantic code
- Consistent extraction pattern across all handle types
- Type-safe conversions with clear intent
Implementation details:
- native() and intptr() are inline nogil functions in .pxd
- py() requires Python module access, implemented in new _resource_handles.pyx
- Updated all call sites in _context, _device, and _stream modules
* Refactor context acquisition to C++ handle helpers
Move get_primary_context/get_current_context into C++ with thread-local caching and conditional GIL release; inline create_context_handle_ref in the header; update Cython modules and build hooks to link handle users (including _device) against resource_handles and libcuda.
* Fix link error by loading _resource_handles with RTLD_GLOBAL
The C++ implementation in _resource_handles_impl.cpp is now compiled
only into _resource_handles.so. Other modules that depend on these
symbols (_context, _device, etc.) resolve them at runtime via the
global symbol table.
This ensures a single shared instance of thread-local caches and
avoids setuptools issues with shared source files across extensions.
* Move helper functions to C++ for overloading support
Move native(), intptr(), and py() from Cython inline functions to
inline C++ functions in resource_handles.hpp. This enables function
overloading when additional handle types (e.g., StreamHandle) are added.
- native(): extract raw CUDA handle from ContextHandle
- intptr(): extract handle as uintptr_t for Python interop
- py(): convert handle to Python driver wrapper object
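As a rough illustration of why overloading helps here, the sketch below defines the same helper names for two distinct handle types. The Fake* types are hypothetical stand-ins for the driver's opaque CUcontext and CUstream; the real helpers live in resource_handles.hpp and may differ in detail.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Hypothetical stand-ins for the driver's opaque handle types
// (the real CUcontext and CUstream come from cuda.h).
struct FakeCtx_st;
struct FakeStream_st;
using FakeCUcontext = FakeCtx_st*;
using FakeCUstream = FakeStream_st*;

using ContextHandle = std::shared_ptr<const FakeCUcontext>;
using StreamHandle = std::shared_ptr<const FakeCUstream>;

// One name, one overload per handle type. An empty handle maps to a
// null raw handle instead of dereferencing an empty shared_ptr.
inline FakeCUcontext native(const ContextHandle& h) noexcept {
    return h ? *h : nullptr;
}
inline FakeCUstream native(const StreamHandle& h) noexcept {
    return h ? *h : nullptr;
}

// Integer view of the raw handle, built on top of native().
inline std::uintptr_t intptr(const ContextHandle& h) noexcept {
    return reinterpret_cast<std::uintptr_t>(native(h));
}
inline std::uintptr_t intptr(const StreamHandle& h) noexcept {
    return reinterpret_cast<std::uintptr_t>(native(h));
}
```

Call sites can then write native(h) or intptr(h) without caring which handle type h is; the compiler picks the overload.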
* Extend resource handle paradigm to Stream
Add StreamHandle for automatic stream lifetime management using the same
shared_ptr-based pattern established for ContextHandle.
Changes:
- Add StreamHandle type and create_stream_handle/create_stream_handle_ref
functions in C++ with implementations in _resource_handles_impl.cpp
- Add overloaded native(), intptr(), py() helpers for StreamHandle
- Update Stream class to use _h_stream (StreamHandle) instead of raw _handle
- Owned streams are automatically destroyed when last reference is released
- Borrowed streams (from __cuda_stream__ protocol) hold _owner reference
- Update memory resource files to use native(stream._h_stream)
- Simplify Context using intptr() and py() helpers
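The owned/borrowed distinction above can be sketched with a shared_ptr whose deleter decides what happens on release. FakeStream, destroy_count, make_owned_stream and make_stream_ref are hypothetical names for illustration; the real functions are create_stream_handle and create_stream_handle_ref and call into the CUDA driver.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for a driver stream; the real code wraps CUstream.
using FakeStream = int;
using StreamHandle = std::shared_ptr<const FakeStream>;

// Counts simulated destroy calls so the behaviour can be observed.
inline int& destroy_count() {
    static int n = 0;
    return n;
}

// Owning handle: the custom deleter "destroys" the stream when the last
// reference goes away, then frees the heap box holding the raw handle.
inline StreamHandle make_owned_stream(FakeStream s) {
    return StreamHandle(new FakeStream(s), [](const FakeStream* p) {
        ++destroy_count();  // stands in for a driver destroy call
        delete p;
    });
}

// Non-owning handle: the default deleter only frees the heap box; the
// stream itself is left alone because someone else owns it.
inline StreamHandle make_stream_ref(FakeStream s) {
    return StreamHandle(new FakeStream(s));
}
```

Both kinds share one handle type, so code that consumes a StreamHandle does not need to know whether the stream is owned or borrowed.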
* Simplify Stream by moving more logic to C++
- Move stream creation to C++ (create_stream_handle now calls
cuStreamCreateWithPriority internally)
- Add get_legacy_stream/get_per_thread_stream for built-in streams
- Add create_stream_handle_with_owner for borrowed streams that
prevents Python owner from being GC'd via captured PyObject*
- Add GILAcquireGuard (symmetric to GILReleaseGuard) for safely
acquiring GIL in C++ destructors
- Simplify Stream class: remove __cinit__, _owner, _builtin,
_legacy_default, _per_thread_default
- Use _from_handle as single initialization point for Stream
- Remove obsolete subclassing tests for removed methods
* Refactor Stream to use ContextHandle and simplify initialization
- Replace raw CUcontext _ctx_handle with ContextHandle _h_context
for consistent handle paradigm and cleaner code
- Replace CUdevice _device_id with int using -1 sentinel
- Use intptr() helper instead of <uintptr_t>() casts throughout
- Add _from_handle(type cls, ...) factory with subclass support
- Add _legacy_default and _per_thread_default classmethods
- Eliminate duplicated initialization code in _init
* Extend ContextHandle to Event and standardize naming
- Event now uses ContextHandle for _h_context instead of raw object
- Event._init is now a cdef staticmethod accepting ContextHandle
- Context._from_ctx renamed to Context._from_handle (cdef staticmethod)
- Moved get_device_from_ctx to Stream module as Stream_ensure_ctx_device
- Inlined get_stream_context into Stream_ensure_ctx
- Simplified context push/pop logic in Stream_ensure_ctx_device
Naming standardization:
- Device._id -> Device._device_id
- _dev_id -> _device_id throughout codebase
- dev_id -> device_id for local variables
- Updated tests to use public APIs instead of internal _init methods
* Store owning context handle in Device
Device now stores its Context in _context slot, set during set_current().
This ensures Device holds an owning reference to its context, enabling
proper lifetime management when passed to Stream and Event creation.
Changes:
- Add _context to Device.__slots__
- Store Context in set_current() for both primary and explicit context paths
- Simplify context property to return stored _context
- Update create_event() to use self._context._h_context
- Remove get_current_context import (no longer needed in _device.pyx)
* Add structural context dependency to owned streams
StreamBox now holds ContextHandle to ensure context outlives the stream.
This structural dependency is only for owned streams - borrowed streams
delegate context lifetime management to their owners.
C++ changes:
- StreamBox gains h_context member
- create_stream_handle(h_ctx, flags, priority) takes owning context
- create_stream_handle_ref(stream) - caller manages context
- create_stream_handle_with_owner(stream, owner) - Python owner manages context
Cython/Python changes:
- Stream._init() accepts optional ctx parameter
- Device.create_stream() passes self._context to Stream._init()
- Owned streams get context handle embedded in C++ handle
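One way to express such a structural dependency is a box that stores the stream next to its ContextHandle, exposed through shared_ptr's aliasing constructor. This is a sketch under that assumption, with hypothetical Fake* types and a destroy_log helper; the actual StreamBox layout and handle construction may differ.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-ins; the real code wraps CUcontext and CUstream.
using FakeContext = int;
using FakeStream = int;
using ContextHandle = std::shared_ptr<const FakeContext>;
using StreamHandle = std::shared_ptr<const FakeStream>;

// Records the order of simulated destroy calls.
inline std::vector<std::string>& destroy_log() {
    static std::vector<std::string> log;
    return log;
}

inline ContextHandle make_context(FakeContext c) {
    return ContextHandle(new FakeContext(c), [](const FakeContext* p) {
        destroy_log().push_back("context");
        delete p;
    });
}

// The box keeps the stream next to the context handle it depends on.
struct StreamBox {
    FakeStream resource;
    ContextHandle h_context;
};

inline StreamHandle make_owned_stream(const ContextHandle& h_ctx, FakeStream s) {
    std::shared_ptr<StreamBox> box(new StreamBox{s, h_ctx}, [](StreamBox* b) {
        destroy_log().push_back("stream");  // the stream is destroyed first
        delete b;                           // then the box releases h_context
    });
    // Aliasing constructor: shares ownership of the box, points at the stream.
    return StreamHandle(box, &box->resource);
}
```

With this layout the context cannot be destroyed while any copy of the stream handle is alive, even if the caller drops its own ContextHandle first.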
* Convert Event to use resource handles
Event now uses EventHandle (shared_ptr) for RAII-based lifetime management,
following the same pattern as Stream.
C++ changes:
- Add EventHandle type alias and EventBox struct
- Add create_event_handle(h_ctx, flags) with context captured in deleter
- Add create_event_handle_ipc(ipc_handle) for IPC events (no context dep)
- Add native(), intptr(), py() overloads for EventHandle
Cython changes:
- Event._h_event replaces raw CUevent _handle
- _init() uses create_event_handle()
- from_ipc_descriptor() uses create_event_handle_ipc()
- close() uses _h_event.reset()
- Keep _h_context for cached fast access
* Clean up Stream.wait() to use EventHandle for temporary events
- Simplified branch structure: early return for Event, single path for Stream
- Use native() helper for handle access instead of casting via handle property
- Temporary events now use EventHandle with RAII cleanup (no explicit cuEventDestroy)
- Added create_event_handle import
* Add create_event_handle overload for temporary events
- New overload takes only flags (no ContextHandle) for temporary events
- Delegates to existing overload with empty ContextHandle
- Updated _stream.pyx and _memoryview.pyx to use simpler overload
- Removed unnecessary get_current_context import from _memoryview.pyx
- Removed unnecessary Stream_ensure_ctx call from Stream.wait()
* Convert DeviceMemoryResource to use MemoryPoolHandle
C++ layer (resource_handles.hpp/cpp):
- Add MemoryPoolHandle = std::shared_ptr<const CUmemoryPool>
- Add create_mempool_handle(props) - owning, calls cuMemPoolDestroy on release
- Add create_mempool_handle_ref(pool) - non-owning reference
- Add create_mempool_handle_ipc(fd, handle_type) - owning from IPC import
- Add get_device_mempool(device_id) - get current pool for device (non-owning)
- Add native(), intptr(), py() overloads for MemoryPoolHandle
Cython layer:
- Update _resource_handles.pxd with new types and functions
- Update _device_memory_resource.pxd: replace raw handle with MemoryPoolHandle
- Reorder members: _h_pool first (matches Stream/Event pattern)
- Update _device_memory_resource.pyx to use new handle functions
- Update _ipc.pyx to use create_mempool_handle_ipc for IPC imports
- DMR_close now uses RAII (_h_pool.reset()) instead of explicit cuMemPoolDestroy
- Consistent member initialization order across __cinit__, init functions, and close
* Add DevicePtrHandle for RAII device pointer management
Introduce DevicePtrHandle (std::shared_ptr<const CUdeviceptr>) to manage
device pointer lifetimes with automatic deallocation. Key features:
- Allocation functions: deviceptr_alloc_from_pool, deviceptr_alloc_async,
deviceptr_alloc, deviceptr_alloc_host, deviceptr_create_ref
- IPC import via deviceptr_import_ipc with error output parameter
- Deallocation stream stored in mutable DevicePtrBox, accessible via
deallocation_stream() and set_deallocation_stream()
- cuMemFreeAsync used for deallocation (NULL stream = legacy default)
- Buffer class updated to use DevicePtrHandle instead of raw pointers
- Buffer.handle returns integer for backward compatibility with ctypes
- IPCBufferDescriptor.payload_ptr() helper to simplify casting
Note: IPC-imported pointers do not yet implement reference counting
workaround for nvbug 5570902.
* Use intptr_t for all handle integer conversions
Change all intptr() overloads to return std::intptr_t (signed) instead
of std::uintptr_t per C standard convention for pointer-to-integer
conversion. This addresses issue #1342 which requires Buffer.handle
to return a signed integer.
Fixes #1342
* Add thread-local error handling for resource handle functions
Implement a systematic error handling approach for C++ resource handle
functions using thread-local storage, similar to cudaGetLastError().
API:
- get_last_error(): Returns and clears the last CUDA error
- peek_last_error(): Returns without clearing
- clear_last_error(): Explicitly clears the error
All functions that can fail now set the thread-local error before
returning an empty handle. This allows callers to retrieve specific
CUDA error codes for proper exception propagation.
Updated deviceptr_import_ipc to use this pattern instead of an
output parameter.
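A minimal sketch of this pattern, assuming a thread_local slot behind the three accessors. FakeResult, FakeHandle and create_fake_handle are hypothetical; the real functions store CUDA driver error codes.

```cpp
#include <cassert>
#include <memory>

// Hypothetical error codes; the real functions store CUresult values.
enum FakeResult { FAKE_SUCCESS = 0, FAKE_ERROR_INVALID_VALUE = 1 };

namespace detail {
inline FakeResult& last_error() {
    thread_local FakeResult err = FAKE_SUCCESS;  // one slot per thread
    return err;
}
}  // namespace detail

// Returns the last error without clearing it.
inline FakeResult peek_last_error() noexcept { return detail::last_error(); }

// Explicitly clears the last error.
inline void clear_last_error() noexcept { detail::last_error() = FAKE_SUCCESS; }

// Returns the last error and clears it.
inline FakeResult get_last_error() noexcept {
    FakeResult err = detail::last_error();
    detail::last_error() = FAKE_SUCCESS;
    return err;
}

using FakeHandle = std::shared_ptr<const int>;

// A fallible factory: on failure it records the error for this thread
// and returns an empty handle.
inline FakeHandle create_fake_handle(int value) {
    if (value < 0) {
        detail::last_error() = FAKE_ERROR_INVALID_VALUE;
        return FakeHandle();
    }
    return FakeHandle(new int(value));
}
```

A caller checks for an empty handle and then calls get_last_error() to turn the stored code into an exception.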
* Add IPC pointer cache to fix duplicate import issue (nvbug 5570902)
IPC-imported device pointers are not correctly reference counted by the
driver - the first cuMemFreeAsync incorrectly unmaps the memory even when
the pointer was imported multiple times.
Work around this by caching imported pointers and returning the same
handle for duplicate imports. The cache uses weak_ptr so entries are
automatically cleaned up when all references are released.
The workaround can be easily bypassed via use_ipc_ptr_cache() when a
driver fix becomes available.
* Fix lint issues: remove unused imports and variables
* Add deviceptr_create_with_owner for handle-based owner tracking
Implements handle-based owner tracking for device pointers, consistent
with the pattern used for streams (create_stream_handle_with_owner).
Changes:
- Add deviceptr_create_with_owner() - creates non-owning handle that
keeps a Python owner alive via Py_INCREF/DECREF (lambda capture)
- If owner is nullptr, delegates to deviceptr_create_ref
- Buffer._owner field tracks owner in Python for property access
- Buffer._init() simplified to always call deviceptr_create_with_owner
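The owner-tracking idea can be sketched as a deleter that captures the owner and drops one reference on release. FakeOwner, fake_incref and fake_decref are hypothetical stand-ins for PyObject*, Py_INCREF and Py_DECREF; the real code must also hold the GIL around the reference-count changes, which this sketch omits.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Hypothetical stand-in for a reference-counted Python object.
struct FakeOwner {
    int refcount = 1;
};
inline void fake_incref(FakeOwner* o) { ++o->refcount; }
inline void fake_decref(FakeOwner* o) { --o->refcount; }

using FakeDevicePtr = std::uint64_t;  // stands in for CUdeviceptr
using DevicePtrHandle = std::shared_ptr<const FakeDevicePtr>;

// Non-owning handle with no owner to keep alive.
inline DevicePtrHandle create_ref(FakeDevicePtr ptr) {
    return DevicePtrHandle(new FakeDevicePtr(ptr));
}

// Non-owning handle that keeps `owner` alive for as long as any copy of
// the handle exists. The memory itself is never freed here.
inline DevicePtrHandle create_with_owner(FakeDevicePtr ptr, FakeOwner* owner) {
    if (owner == nullptr) {
        return create_ref(ptr);
    }
    FakeDevicePtr* box = new FakeDevicePtr(ptr);
    fake_incref(owner);  // one reference, taken once per handle family
    return DevicePtrHandle(box, [owner](const FakeDevicePtr* p) {
        delete p;
        fake_decref(owner);  // released when the last copy goes away
    });
}
```

Copying the handle does not touch the owner's reference count; only creating the first handle and releasing the last copy do.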
* Add resource handles _CXX_API capsule and lazy driver loading
Expose a full C++ handles function table via PyCapsule so extensions can dispatch without RTLD_GLOBAL, and switch resource_handles.cpp to load libcuda symbols at runtime to support CPU-only imports.
* Resolve CUDA driver entrypoints via cuda-bindings cuGetProcAddress
Use a lazy PyCapsule in _resource_handles to resolve and cache required CUDA driver entrypoints via cuda.bindings.driver.cuGetProcAddress, and have resource_handles.cpp consume that table on first use. This avoids duplicating driver pathfinding logic and removes dlopen/dlsym linkage requirements.
* Centralize resource handles capsule dispatch in _resource_handles.pxd
Hide _CXX_API capsule import/version checks behind inline pxd wrappers so call sites stay clean, and remove redundant ensure_driver_loaded() checks in C++ deleters.
* Drop RTLD_GLOBAL import for _resource_handles
Resource handle consumers now dispatch through the exported PyCapsule table, so _resource_handles no longer needs to be loaded with RTLD_GLOBAL.
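The dispatch scheme can be pictured as a versioned table of function pointers that consumers validate once and then call through. The sketch below leaves out the PyCapsule itself and uses a hypothetical HandlesApi layout with a placeholder entry point; the real table contents and version scheme are not reproduced here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical table layout for illustration only.
struct HandlesApi {
    std::uint32_t abi_version;
    int (*add_one)(int);  // placeholder entry point
};

constexpr std::uint32_t kExpectedAbiVersion = 1;

// Provider side: a single static table owned by one extension module.
namespace provider {
inline int add_one_impl(int x) { return x + 1; }
inline const HandlesApi* get_table() {
    static const HandlesApi table{kExpectedAbiVersion, &add_one_impl};
    return &table;
}
}  // namespace provider

// Consumer side: receives an opaque pointer (as it would from a capsule),
// checks the version, and returns nullptr on mismatch.
inline const HandlesApi* bind_table(const void* opaque) {
    const HandlesApi* api = static_cast<const HandlesApi*>(opaque);
    if (api == nullptr || api->abi_version != kExpectedAbiVersion) {
        return nullptr;
    }
    return api;
}
```

Because every consumer calls through the provider's table, there is still a single instance of any state behind those functions, without relying on global symbol resolution.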
* Fix Python 3.13 finalization check
Use public Py_IsFinalizing() API instead of removed _Py_IsFinalizing().
* Fix finalization check across Python versions
Use Py_IsFinalizing() on Python 3.13+ and fall back to _Py_IsFinalizing() on older versions.
* Fix circular import for _resource_handles
Use a relative import in cuda.core.experimental.__init__ to avoid failing imports
from partially-initialized packages during test collection.
* Fix circular import in _resource_handles module
Use relative cimports instead of fully-qualified cimports to prevent
Cython from generating code that imports the parent package during
module initialization, which caused circular import errors.
* Fix circular import by using importlib.import_module
Replace relative import `from . import _resource_handles` with
`importlib.import_module("cuda.core.experimental._resource_handles")`
to avoid circular import issues during package initialization.
The relative import can fail with "partially initialized module" errors
on some Python versions (e.g., Python 3.10) when the package is still
being initialized. Using importlib.import_module with an absolute path
bypasses the relative import machinery and avoids this issue.
* Fix wheel merge script to keep _resource_handles module
The wheel merge script was removing _resource_handles.cpython-*.so
during the merge process because it only kept a small set of files
at the cuda/core/ top level. However, _resource_handles is shared
code (not CUDA-version-specific) and must remain at the top level
because it's imported early in __init__.py before versioned code.
Also keep _cpp/ directory for Cython development headers.
* Fix IPC pointer cache to use export data as key
The cache used the returned pointer as its key, so it could only be
checked after calling cuMemPoolImportPointer. As a result, duplicate
imports failed with CUDA_ERROR_ALREADY_MAPPED before the cache check ran.
Fix by using the export_data bytes (CUmemPoolPtrExportData) as the
cache key and checking the cache BEFORE calling cuMemPoolImportPointer.
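A single-threaded sketch of such a cache, keyed by the export bytes and consulted before the import. The names FakeDevicePtr, IpcPtrCache and import_ipc are hypothetical, the driver import is simulated with a counter, and the locking used by the real code is omitted.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

using FakeDevicePtr = unsigned long long;  // stands in for CUdeviceptr
using DevicePtrHandle = std::shared_ptr<const FakeDevicePtr>;

struct IpcPtrCache {
    // weak_ptr entries do not keep the imported pointer alive.
    std::map<std::string, std::weak_ptr<const FakeDevicePtr>> entries;
    int import_calls = 0;  // counts simulated driver imports
};

inline IpcPtrCache& cache() {
    static IpcPtrCache c;
    return c;
}

// `export_data` plays the role of the raw export bytes used as the key.
inline DevicePtrHandle import_ipc(const std::string& export_data) {
    IpcPtrCache& c = cache();
    // Check the cache BEFORE importing, so duplicates never reach the driver.
    auto it = c.entries.find(export_data);
    if (it != c.entries.end()) {
        if (DevicePtrHandle live = it->second.lock()) {
            return live;
        }
    }
    ++c.import_calls;  // stands in for the driver import call
    FakeDevicePtr* box = new FakeDevicePtr(0x1000ull + c.import_calls);
    DevicePtrHandle h(box, [export_data](const FakeDevicePtr* p) {
        IpcPtrCache& c = cache();
        auto it = c.entries.find(export_data);
        // Erase only an expired entry, never a live one that replaced it.
        if (it != c.entries.end() && it->second.expired()) {
            c.entries.erase(it);
        }
        delete p;  // stands in for freeing the imported pointer
    });
    c.entries[export_data] = h;
    return h;
}
```

Importing the same export data twice returns the same handle, and the entry disappears once the last copy of that handle is released.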
* Improve IPC pointer cache comments and fix race condition
- Clarify that the cache handles two different memory type behaviors:
memory pool allocations (nvbug 5570902) and pinned memory (ALREADY_MAPPED)
- Fix race condition in deleter: only erase cache entry if expired,
avoiding erasure of a new entry created by another thread
- Move GILReleaseGuard before mutex acquisition in deleter
* Refactor load_driver_api to use RAII GIL guard
Replace raw PyGILState_Ensure/Release calls with a simple GILGuard
class, eliminating manual release on each early return path.
* Add DESIGN.md and optimize GIL usage in resource handle wrappers and CUDA operations
- Change resource handle wrapper functions from `except * nogil` to
`noexcept nogil` to avoid GIL acquisition on every call
- Add `_init_handles_table()` for consumers to initialize at module level
- Move CUDA operations into nogil blocks: cuMemcpyAsync, deviceptr_alloc_*,
create_event_handle_noctx
- Add Buffer._clear() to properly reset the handle shared_ptr
- Add DESIGN.md documenting the resource handles architecture
* linter fix
* Consolidate GIL helper classes at top of resource_handles.cpp
Move GILReleaseGuard and GILAcquireGuard to the top of the file before
first use, and remove redundant GILGuard class that duplicated
GILAcquireGuard functionality.
* refactor: deduplicate py() overloads in resource_handles.hpp
Extract common class lookup logic into detail::py_class_for() helper.
Use thread-safe static initialization and reuse existing intptr()
functions to reduce code duplication from ~55 lines to ~25 lines.
* refactor: rename native() to cu() for extracting raw CUDA handles
Shorter name that clearly indicates the return type is a CU* driver handle.
Updates all definitions, declarations, cimports, and call sites across
15 files.
* fix: prevent GIL/mutex deadlock hazards in resource handles
- Refactor py() overloads to use detail::make_py() helper, eliminating
code duplication
- Remove static caching of Python class objects in py() to avoid
deadlock between static initialization guard and GIL
- Fix ensure_driver_loaded() by releasing GIL before call_once to
ensure consistent lock order (guard -> GIL)
- Add "Static Initialization and Deadlock Hazards" section to DESIGN.md
with link to pybind11 documentation
* address review comments
- Add note to GILReleaseGuard about behavior when GIL isn't released
- Add comment for mutable h_stream in DevicePtrBox
- Add comment to get_box() explaining pointer recovery technique
- Add _init_handles_table() comments to all .pyx files referencing DESIGN.md
- Remove redundant static from items inside anonymous namespace
- Wrap *Box structs in anonymous namespace for encapsulation
* Address review comments
- MANIFEST.in: Be specific about C++ files to avoid Cython-generated .cpp
- _event.pxd: Remove inaccurate comment about caching for fast access
- Remove unnecessary experimental/_context.pxd stub
* Rename handle accessor functions for clarity
Rename cu() -> as_cu(), py() -> as_py(), intptr() -> as_intptr()
to avoid overly short two-letter function names per review feedback.
1 parent 3722ea4 commit acc78f7
File tree
41 files changed, +2846 -622 lines changed
- ci/tools
- cuda_core
- cuda/core
- _cpp
- _memory
- _utils
- tests
- memory_ipc