Commit acc78f7

RAII resource handles for CUDA objects (#1368)
* Centralize context management code

  - Create _context.pxd with Context class declaration
  - Move get_primary_context from _device.pyx to _context.pyx
  - Add Cython-level context functions: get_current_context, set_current_context, get_stream_context
  - Update _device.pyx to use context module functions
  - Update _stream.pyx to use context module functions
  - Remove push_context and pop_context (unused, replaced with direct CUDA API calls)
  - Reorganize _context.pyx according to style guide (principal class first)

* Integrate resource handles into Context class

  - Replace Context._handle (object) with ContextHandle (shared_ptr) resource handle
  - Add handle property to Context returning driver.CUcontext
  - Update Context._from_ctx to create ContextHandle using create_context_handle()
  - Update Context.__eq__ to compare actual CUcontext values (not shared_ptr addresses)
  - Update Context.__hash__ to include type(self) and handle value with NULL safety
  - Update _device.pyx to use ctx._handle.get()[0] for direct access
  - Update _graph.pyx to use context.handle property
  - Update C++ implementation to use default deleter (simplifies code)
  - Rename _resource_handles_impl.cpp to _context_impl.cpp
  - Remove test_dev.py development script
  - Update .gitignore to allow *_impl.cpp files
  - Fix all test files to use context.handle instead of context._handle

* Refactor context helpers to use ContextHandle and TLS cache

  - switch context helper APIs to return ContextHandle instead of raw CUcontext
  - add TLS wrapper for primary context caching using handles
  - update device/stream code to consume ContextHandle-based helpers
  - expose create_context_handle_ref as nogil-safe in the pxd

* Add helper functions to extract raw resources from ContextHandle

  Introduce three helper functions for ContextHandle resource extraction:
  - native(h): Returns cydriver.CUcontext for use with cydriver API calls
  - py(h): Returns driver.CUcontext for use with Python driver API
  - intptr(h): Returns uintptr_t for internal APIs expecting integer addresses

  These helpers replace direct h_context.get()[0] calls, providing:
  - Cleaner, more semantic code
  - Consistent extraction pattern across all handle types
  - Type-safe conversions with clear intent

  Implementation details:
  - native() and intptr() are inline nogil functions in .pxd
  - py() requires Python module access, implemented in new _resource_handles.pyx
  - Updated all call sites in _context, _device, and _stream modules

* Refactor context acquisition to C++ handle helpers

  Move get_primary_context/get_current_context into C++ with thread-local caching and conditional GIL release; inline create_context_handle_ref in the header; update Cython modules and build hooks to link handle users (including _device) against resource_handles and libcuda.

* Fix link error by loading _resource_handles with RTLD_GLOBAL

  The C++ implementation in _resource_handles_impl.cpp is now compiled only into _resource_handles.so. Other modules that depend on these symbols (_context, _device, etc.) resolve them at runtime via the global symbol table. This ensures a single shared instance of thread-local caches and avoids setuptools issues with shared source files across extensions.

* Move helper functions to C++ for overloading support

  Move native(), intptr(), and py() from Cython inline functions to inline C++ functions in resource_handles.hpp. This enables function overloading when additional handle types (e.g., StreamHandle) are added.
  - native(): extract raw CUDA handle from ContextHandle
  - intptr(): extract handle as uintptr_t for Python interop
  - py(): convert handle to Python driver wrapper object

* Extend resource handle paradigm to Stream

  Add StreamHandle for automatic stream lifetime management using the same shared_ptr-based pattern established for ContextHandle.
  Changes:
  - Add StreamHandle type and create_stream_handle/create_stream_handle_ref functions in C++ with implementations in _resource_handles_impl.cpp
  - Add overloaded native(), intptr(), py() helpers for StreamHandle
  - Update Stream class to use _h_stream (StreamHandle) instead of raw _handle
  - Owned streams are automatically destroyed when last reference is released
  - Borrowed streams (from __cuda_stream__ protocol) hold _owner reference
  - Update memory resource files to use native(stream._h_stream)
  - Simplify Context using intptr() and py() helpers

* Simplify Stream by moving more logic to C++

  - Move stream creation to C++ (create_stream_handle now calls cuStreamCreateWithPriority internally)
  - Add get_legacy_stream/get_per_thread_stream for built-in streams
  - Add create_stream_handle_with_owner for borrowed streams that prevents Python owner from being GC'd via captured PyObject*
  - Add GILAcquireGuard (symmetric to GILReleaseGuard) for safely acquiring GIL in C++ destructors
  - Simplify Stream class: remove __cinit__, _owner, _builtin, _legacy_default, _per_thread_default
  - Use _from_handle as single initialization point for Stream
  - Remove obsolete subclassing tests for removed methods

* Refactor Stream to use ContextHandle and simplify initialization

  - Replace raw CUcontext _ctx_handle with ContextHandle _h_context for consistent handle paradigm and cleaner code
  - Replace CUdevice _device_id with int using -1 sentinel
  - Use intptr() helper instead of <uintptr_t>() casts throughout
  - Add _from_handle(type cls, ...) factory with subclass support
  - Add _legacy_default and _per_thread_default classmethods
  - Eliminate duplicated initialization code in _init

* Extend ContextHandle to Event and standardize naming

  - Event now uses ContextHandle for _h_context instead of raw object
  - Event._init is now a cdef staticmethod accepting ContextHandle
  - Context._from_ctx renamed to Context._from_handle (cdef staticmethod)
  - Moved get_device_from_ctx to Stream module as Stream_ensure_ctx_device
  - Inlined get_stream_context into Stream_ensure_ctx
  - Simplified context push/pop logic in Stream_ensure_ctx_device

  Naming standardization:
  - Device._id -> Device._device_id
  - _dev_id -> _device_id throughout codebase
  - dev_id -> device_id for local variables
  - Updated tests to use public APIs instead of internal _init methods

* Store owning context handle in Device

  Device now stores its Context in _context slot, set during set_current(). This ensures Device holds an owning reference to its context, enabling proper lifetime management when passed to Stream and Event creation.

  Changes:
  - Add _context to Device.__slots__
  - Store Context in set_current() for both primary and explicit context paths
  - Simplify context property to return stored _context
  - Update create_event() to use self._context._h_context
  - Remove get_current_context import (no longer needed in _device.pyx)

  Add structural context dependency to owned streams

  StreamBox now holds ContextHandle to ensure context outlives the stream. This structural dependency is only for owned streams - borrowed streams delegate context lifetime management to their owners.
  C++ changes:
  - StreamBox gains h_context member
  - create_stream_handle(h_ctx, flags, priority) takes owning context
  - create_stream_handle_ref(stream) - caller manages context
  - create_stream_handle_with_owner(stream, owner) - Python owner manages context

  Cython/Python changes:
  - Stream._init() accepts optional ctx parameter
  - Device.create_stream() passes self._context to Stream._init()
  - Owned streams get context handle embedded in C++ handle

* Convert Event to use resource handles

  Event now uses EventHandle (shared_ptr) for RAII-based lifetime management, following the same pattern as Stream.

  C++ changes:
  - Add EventHandle type alias and EventBox struct
  - Add create_event_handle(h_ctx, flags) with context captured in deleter
  - Add create_event_handle_ipc(ipc_handle) for IPC events (no context dep)
  - Add native(), intptr(), py() overloads for EventHandle

  Cython changes:
  - Event._h_event replaces raw CUevent _handle
  - _init() uses create_event_handle()
  - from_ipc_descriptor() uses create_event_handle_ipc()
  - close() uses _h_event.reset()
  - Keep _h_context for cached fast access

* Clean up Stream.wait() to use EventHandle for temporary events

  - Simplified branch structure: early return for Event, single path for Stream
  - Use native() helper for handle access instead of casting via handle property
  - Temporary events now use EventHandle with RAII cleanup (no explicit cuEventDestroy)
  - Added create_event_handle import

* Add create_event_handle overload for temporary events

  - New overload takes only flags (no ContextHandle) for temporary events
  - Delegates to existing overload with empty ContextHandle
  - Updated _stream.pyx and _memoryview.pyx to use simpler overload
  - Removed unnecessary get_current_context import from _memoryview.pyx
  - Removed unnecessary Stream_ensure_ctx call from Stream.wait()

* Convert DeviceMemoryResource to use MemoryPoolHandle

  C++ layer (resource_handles.hpp/cpp):
  - Add MemoryPoolHandle = std::shared_ptr<const CUmemoryPool>
  - Add create_mempool_handle(props) - owning, calls cuMemPoolDestroy on release
  - Add create_mempool_handle_ref(pool) - non-owning reference
  - Add create_mempool_handle_ipc(fd, handle_type) - owning from IPC import
  - Add get_device_mempool(device_id) - get current pool for device (non-owning)
  - Add native(), intptr(), py() overloads for MemoryPoolHandle

  Cython layer:
  - Update _resource_handles.pxd with new types and functions
  - Update _device_memory_resource.pxd: replace raw handle with MemoryPoolHandle
  - Reorder members: _h_pool first (matches Stream/Event pattern)
  - Update _device_memory_resource.pyx to use new handle functions
  - Update _ipc.pyx to use create_mempool_handle_ipc for IPC imports
  - DMR_close now uses RAII (_h_pool.reset()) instead of explicit cuMemPoolDestroy
  - Consistent member initialization order across __cinit__, init functions, and close

* Add DevicePtrHandle for RAII device pointer management

  Introduce DevicePtrHandle (std::shared_ptr<const CUdeviceptr>) to manage device pointer lifetimes with automatic deallocation.

  Key features:
  - Allocation functions: deviceptr_alloc_from_pool, deviceptr_alloc_async, deviceptr_alloc, deviceptr_alloc_host, deviceptr_create_ref
  - IPC import via deviceptr_import_ipc with error output parameter
  - Deallocation stream stored in mutable DevicePtrBox, accessible via deallocation_stream() and set_deallocation_stream()
  - cuMemFreeAsync used for deallocation (NULL stream = legacy default)
  - Buffer class updated to use DevicePtrHandle instead of raw pointers
  - Buffer.handle returns integer for backward compatibility with ctypes
  - IPCBufferDescriptor.payload_ptr() helper to simplify casting

  Note: IPC-imported pointers do not yet implement reference counting workaround for nvbug 5570902.

* Use intptr_t for all handle integer conversions

  Change all intptr() overloads to return std::intptr_t (signed) instead of std::uintptr_t per C standard convention for pointer-to-integer conversion.
  This addresses issue #1342 which requires Buffer.handle to return a signed integer.

  Fixes #1342

* Add thread-local error handling for resource handle functions

  Implement a systematic error handling approach for C++ resource handle functions using thread-local storage, similar to cudaGetLastError().

  API:
  - get_last_error(): Returns and clears the last CUDA error
  - peek_last_error(): Returns without clearing
  - clear_last_error(): Explicitly clears the error

  All functions that can fail now set the thread-local error before returning an empty handle. This allows callers to retrieve specific CUDA error codes for proper exception propagation.

  Updated deviceptr_import_ipc to use this pattern instead of an output parameter.

* Add IPC pointer cache to fix duplicate import issue (nvbug 5570902)

  IPC-imported device pointers are not correctly reference counted by the driver - the first cuMemFreeAsync incorrectly unmaps the memory even when the pointer was imported multiple times.

  Work around this by caching imported pointers and returning the same handle for duplicate imports. The cache uses weak_ptr so entries are automatically cleaned up when all references are released.

  The workaround can be easily bypassed via use_ipc_ptr_cache() when a driver fix becomes available.

* Fix lint issues: remove unused imports and variables

* Add deviceptr_create_with_owner for handle-based owner tracking

  Implements handle-based owner tracking for device pointers, consistent with the pattern used for streams (create_stream_handle_with_owner).
  Changes:
  - Add deviceptr_create_with_owner() - creates non-owning handle that keeps a Python owner alive via Py_INCREF/DECREF (lambda capture)
  - If owner is nullptr, delegates to deviceptr_create_ref
  - Buffer._owner field tracks owner in Python for property access
  - Buffer._init() simplified to always call deviceptr_create_with_owner

* Add resource handles _CXX_API capsule and lazy driver loading

  Expose a full C++ handles function table via PyCapsule so extensions can dispatch without RTLD_GLOBAL, and switch resource_handles.cpp to load libcuda symbols at runtime to support CPU-only imports.

* Resolve CUDA driver entrypoints via cuda-bindings cuGetProcAddress

  Use a lazy PyCapsule in _resource_handles to resolve and cache required CUDA driver entrypoints via cuda.bindings.driver.cuGetProcAddress, and have resource_handles.cpp consume that table on first use. This avoids duplicating driver pathfinding logic and removes dlopen/dlsym linkage requirements.

* Centralize resource handles capsule dispatch in _resource_handles.pxd

  Hide _CXX_API capsule import/version checks behind inline pxd wrappers so call sites stay clean, and remove redundant ensure_driver_loaded() checks in C++ deleters.

* Drop RTLD_GLOBAL import for _resource_handles

  Resource handle consumers now dispatch through the exported PyCapsule table, so _resource_handles no longer needs to be loaded with RTLD_GLOBAL.

* Fix Python 3.13 finalization check

  Use public Py_IsFinalizing() API instead of removed _Py_IsFinalizing().

* Fix finalization check across Python versions

  Use Py_IsFinalizing() on Python 3.13+ and fall back to _Py_IsFinalizing() on older versions.

* Fix circular import for _resource_handles

  Use a relative import in cuda.core.experimental.__init__ to avoid failing imports from partially-initialized packages during test collection.
* Fix circular import in _resource_handles module

  Use relative cimports instead of fully-qualified cimports to prevent Cython from generating code that imports the parent package during module initialization, which caused circular import errors.

* Fix circular import by using importlib.import_module

  Replace relative import `from . import _resource_handles` with `importlib.import_module("cuda.core.experimental._resource_handles")` to avoid circular import issues during package initialization.

  The relative import can fail with "partially initialized module" errors on some Python versions (e.g., Python 3.10) when the package is still being initialized. Using importlib.import_module with an absolute path bypasses the relative import machinery and avoids this issue.

* Fix wheel merge script to keep _resource_handles module

  The wheel merge script was removing _resource_handles.cpython-*.so during the merge process because it only kept a small set of files at the cuda/core/ top level. However, _resource_handles is shared code (not CUDA-version-specific) and must remain at the top level because it's imported early in __init__.py before versioned code.

  Also keep _cpp/ directory for Cython development headers.

* Fix IPC pointer cache to use export data as key

  The cache was using the returned pointer as the key, but checking the cache after calling cuMemPoolImportPointer. This caused duplicate imports to fail with CUDA_ERROR_ALREADY_MAPPED before the cache check.

  Fix by using the export_data bytes (CUmemPoolPtrExportData) as the cache key and checking the cache BEFORE calling cuMemPoolImportPointer.
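
The cache fix described in the last entry above (key on the export data, look up before importing, hold entries weakly) can be pictured with a small Python analogue. This is only a sketch under stated assumptions: the real cache lives in C++ and uses weak_ptr, and every name below (ImportedPointer, IpcPointerCache, fake_import) is hypothetical and not part of cuda.core.

```python
import threading
import weakref


class ImportedPointer:
    """Hypothetical stand-in for an imported device pointer handle."""

    def __init__(self, address: int) -> None:
        self.address = address


class IpcPointerCache:
    """Illustrative cache of live imports, keyed by export data bytes."""

    def __init__(self, import_fn) -> None:
        # import_fn is a hypothetical stand-in for the driver import call.
        self._import_fn = import_fn
        self._lock = threading.Lock()
        # Weak values: an entry disappears once no caller holds the handle.
        self._live = weakref.WeakValueDictionary()

    def import_pointer(self, export_data: bytes) -> ImportedPointer:
        with self._lock:
            # Check the cache BEFORE importing, so a duplicate request
            # never reaches the (simulated) driver call.
            cached = self._live.get(export_data)
            if cached is not None:
                return cached
            handle = ImportedPointer(self._import_fn(export_data))
            self._live[export_data] = handle
            return handle


calls = []


def fake_import(export_data: bytes) -> int:
    """Simulated import that records each call and returns a fake address."""
    calls.append(export_data)
    return 0x1000 + len(calls)


cache = IpcPointerCache(fake_import)
first = cache.import_pointer(b"export-blob")
second = cache.import_pointer(b"export-blob")  # served from the cache
```

Keying on the returned pointer would not work in this sketch either, because the key would only be known after the import call had already run a second time.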
* Improve IPC pointer cache comments and fix race condition

  - Clarify that the cache handles two different memory type behaviors: memory pool allocations (nvbug 5570902) and pinned memory (ALREADY_MAPPED)
  - Fix race condition in deleter: only erase cache entry if expired, avoiding erasure of a new entry created by another thread
  - Move GILReleaseGuard before mutex acquisition in deleter

* Refactor load_driver_api to use RAII GIL guard

  Replace raw PyGILState_Ensure/Release calls with a simple GILGuard class, eliminating manual release on each early return path.

* Add DESIGN.md and optimize GIL usage in resource handle wrappers and CUDA operations

  - Change resource handle wrapper functions from `except * nogil` to `noexcept nogil` to avoid GIL acquisition on every call
  - Add `_init_handles_table()` for consumers to initialize at module level
  - Move CUDA operations into nogil blocks: cuMemcpyAsync, deviceptr_alloc_*, create_event_handle_noctx
  - Add Buffer._clear() to properly reset the handle shared_ptr
  - Add DESIGN.md documenting the resource handles architecture

* linter fix

* Consolidate GIL helper classes at top of resource_handles.cpp

  Move GILReleaseGuard and GILAcquireGuard to the top of the file before first use, and remove redundant GILGuard class that duplicated GILAcquireGuard functionality.

* refactor: deduplicate py() overloads in resource_handles.hpp

  Extract common class lookup logic into detail::py_class_for() helper. Use thread-safe static initialization and reuse existing intptr() functions to reduce code duplication from ~55 lines to ~25 lines.

* refactor: rename native() to cu() for extracting raw CUDA handles

  Shorter name that clearly indicates the return type is a CU* driver handle. Updates all definitions, declarations, cimports, and call sites across 15 files.
* fix: prevent GIL/mutex deadlock hazards in resource handles

  - Refactor py() overloads to use detail::make_py() helper, eliminating code duplication
  - Remove static caching of Python class objects in py() to avoid deadlock between static initialization guard and GIL
  - Fix ensure_driver_loaded() by releasing GIL before call_once to ensure consistent lock order (guard -> GIL)
  - Add "Static Initialization and Deadlock Hazards" section to DESIGN.md with link to pybind11 documentation

* address review comments

  - Add note to GILReleaseGuard about behavior when GIL isn't released
  - Add comment for mutable h_stream in DevicePtrBox
  - Add comment to get_box() explaining pointer recovery technique
  - Add _init_handles_table() comments to all .pyx files referencing DESIGN.md
  - Remove redundant static from items inside anonymous namespace
  - Wrap *Box structs in anonymous namespace for encapsulation

* Address review comments

  - MANIFEST.in: Be specific about C++ files to avoid Cython-generated .cpp
  - _event.pxd: Remove inaccurate comment about caching for fast access
  - Remove unnecessary experimental/_context.pxd stub

* Rename handle accessor functions for clarity

  Rename cu() -> as_cu(), py() -> as_py(), intptr() -> as_intptr() to avoid overly short two-letter function names per review feedback.
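
The thread-local error reporting named in the commit message (get_last_error, peek_last_error, clear_last_error) can be illustrated with a rough Python analogue. It is a sketch only: the actual functions are C++ and their signatures are not shown on this page, so SUCCESS, _set_last_error, and create_handle are hypothetical names introduced for illustration.

```python
import threading

SUCCESS = 0  # hypothetical "no error" code, standing in for CUDA_SUCCESS

_state = threading.local()  # each thread sees its own last_error attribute


def _set_last_error(code: int) -> None:
    """Record an error code for the calling thread only."""
    _state.last_error = code


def peek_last_error() -> int:
    """Return the calling thread's last error without clearing it."""
    return getattr(_state, "last_error", SUCCESS)


def get_last_error() -> int:
    """Return the calling thread's last error and reset it to SUCCESS."""
    code = peek_last_error()
    _state.last_error = SUCCESS
    return code


def clear_last_error() -> None:
    """Explicitly reset the calling thread's last error."""
    _state.last_error = SUCCESS


def create_handle(fail_with: int = SUCCESS):
    """Hypothetical creator: on failure, set the error and return None
    (the analogue of returning an empty handle)."""
    if fail_with != SUCCESS:
        _set_last_error(fail_with)
        return None
    return object()
```

A caller that receives an empty handle would then call get_last_error() on the same thread to decide which exception to raise. Errors recorded on one thread are not visible from another.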
1 parent 3722ea4 commit acc78f7

41 files changed, +2846 -622 lines changed

.gitattributes

Lines changed: 3 additions & 0 deletions
@@ -6,6 +6,9 @@ cuda/_version.py export-subst
 # we do not own any headers checked in, don't touch them
 *.h binary
 *.hpp binary
+# Exception: headers we own (cuda_core C++ implementation)
+cuda_core/cuda/core/_cpp/*.h -binary text diff
+cuda_core/cuda/core/_cpp/*.hpp -binary text diff
 # git should not convert line endings in PNG files
 *.png binary
 *.svg binary

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -11,6 +11,7 @@ __pycache__/
 .pytest_cache/
 .benchmarks/
 *.cpp
+!*_impl.cpp
 !cuda_bindings/cuda/bindings/_lib/param_packer.cpp
 !cuda_bindings/cuda/bindings/_bindings/loader.cpp
 cache_driver

ci/tools/merge_cuda_core_wheels.py

Lines changed: 6 additions & 0 deletions
@@ -150,15 +150,21 @@ def merge_wheels(wheels: List[Path], output_dir: Path, show_wheel_contents: bool
             "__init__.py",
             "_version.py",
             "_include",
+            "_cpp",  # Headers for Cython development
             "cu12",
             "cu13",
         )
+        # _resource_handles is shared (not CUDA-version-specific) and must stay
+        # at top level. It's imported early in __init__.py before versioned code.
+        items_to_keep_prefix = ("_resource_handles",)
         all_items = os.scandir(base_wheel / base_dir)
         removed_count = 0
         for f in all_items:
             f_abspath = f.path
             if f.name in items_to_keep:
                 continue
+            if any(f.name.startswith(prefix) for prefix in items_to_keep_prefix):
+                continue
             if f.is_dir():
                 print(f" Removing directory: {f.name}", file=sys.stderr)
                 shutil.rmtree(f_abspath)
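
As a minimal sketch of the keep/remove decision this hunk introduces, the two checks can be folded into one predicate. The function name should_keep and the sample file names are hypothetical. The tuples mirror the values visible in the diff.

```python
# Values copied from the hunk above; the function itself is illustrative.
ITEMS_TO_KEEP = ("__init__.py", "_version.py", "_include", "_cpp", "cu12", "cu13")
ITEMS_TO_KEEP_PREFIX = ("_resource_handles",)


def should_keep(name: str) -> bool:
    """Return True if a top-level entry survives the wheel merge cleanup."""
    if name in ITEMS_TO_KEEP:
        return True
    # str.startswith accepts a tuple of prefixes, which is equivalent to
    # the any(...) generator used in the script.
    return name.startswith(ITEMS_TO_KEEP_PREFIX)
```

The prefix match is what lets the platform-specific extension file name (which varies by Python version and platform tag) be kept without listing each variant.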

cuda_core/MANIFEST.in

Lines changed: 1 addition & 0 deletions
@@ -3,3 +3,4 @@
 # SPDX-License-Identifier: Apache-2.0
 
 recursive-include cuda/core *.pyx *.pxd
+recursive-include cuda/core/_cpp *.cpp *.hpp

cuda_core/build_hooks.py

Lines changed: 17 additions & 2 deletions
@@ -100,6 +100,17 @@ def module_names():
         for filename in glob.glob(f"{root_path}/**/*.pyx", recursive=True):
             yield filename[len(root_path) : -4]
 
+    def get_sources(mod_name):
+        """Get source files for a module, including any .cpp files."""
+        sources = [f"cuda/core/{mod_name}.pyx"]
+
+        # Add module-specific .cpp file from _cpp/ directory if it exists
+        cpp_file = f"cuda/core/_cpp/{mod_name.lstrip('_')}.cpp"
+        if os.path.exists(cpp_file):
+            sources.append(cpp_file)
+
+        return sources
+
     all_include_dirs = list(os.path.join(root, "include") for root in _get_cuda_paths())
     extra_compile_args = []
     if COMPILE_FOR_COVERAGE:
@@ -110,8 +121,12 @@ def module_names():
     ext_modules = tuple(
         Extension(
             f"cuda.core.{mod.replace(os.path.sep, '.')}",
-            sources=[f"cuda/core/{mod}.pyx"],
-            include_dirs=all_include_dirs,
+            sources=get_sources(mod),
+            include_dirs=[
+                "cuda/core/_include",
+                "cuda/core/_cpp",
+            ]
+            + all_include_dirs,
             language="c++",
             extra_compile_args=extra_compile_args,
         )
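
To see which companion .cpp path get_sources looks for, the path computation can be isolated from the filesystem check. This is a sketch, with cpp_candidate as a hypothetical helper name. Only the f-string is taken from the diff.

```python
def cpp_candidate(mod_name: str) -> str:
    """Return the companion .cpp path that get_sources() would probe."""
    # str.lstrip('_') removes every leading underscore, not just one, and
    # it only touches the start of the whole string.
    return f"cuda/core/_cpp/{mod_name.lstrip('_')}.cpp"


# Illustrative module names (the second is a made-up nested example).
flat = cpp_candidate("_resource_handles")
nested = cpp_candidate("_pkg/_mod")
```

For a module in a subdirectory, only the directory's leading underscore is stripped, so under this naming rule a companion source would be picked up only for modules whose stripped path actually exists under _cpp/.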

cuda_core/cuda/core/__init__.py

Lines changed: 9 additions & 0 deletions
@@ -15,6 +15,15 @@
 
 import importlib
 
+# The _resource_handles module exports a PyCapsule dispatch table that other
+# extension modules access via PyCapsule_Import. We import it here to ensure
+# it's loaded before other modules try to use it.
+#
+# We use importlib.import_module with the full path to avoid triggering
+# circular import issues that can occur with relative imports during
+# package initialization.
+_resource_handles = importlib.import_module("cuda.core._resource_handles")
+
 subdir = f"cu{cuda_major}"
 try:
     versioned_mod = importlib.import_module(f".{subdir}", __package__)
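
The hunk uses both calling conventions of importlib.import_module: an absolute dotted name for _resource_handles and a relative name plus package for the versioned subpackage. A small stdlib-only illustration of the two forms follows. It uses the json package as a stand-in, since cuda.core may not be installed where this is read.

```python
import importlib
import sys

# Absolute form: the full dotted path, no package argument needed.
absolute = importlib.import_module("json.decoder")

# Relative form: a leading dot plus the anchoring package name.
relative = importlib.import_module(".decoder", "json")

# Both calls resolve to the same entry in sys.modules.
same_module = absolute is relative and sys.modules["json.decoder"] is absolute
```

Both forms return the same module object once it is loaded. The difference the commit message relies on concerns how the name is resolved while the parent package is still initializing, which this snippet does not attempt to reproduce.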

cuda_core/cuda/core/_context.pxd

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
+# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# SPDX-License-Identifier: Apache-2.0
+
+from cuda.core._resource_handles cimport ContextHandle
+
+cdef class Context:
+    """Cython declaration for Context class.
+
+    This class provides access to CUDA contexts. Context objects cannot be
+    instantiated directly - use factory methods or Device/Stream APIs.
+    """
+
+    cdef:
+        ContextHandle _h_context
+        int _device_id
+
+    @staticmethod
+    cdef Context _from_handle(type cls, ContextHandle h_context, int device_id)

cuda_core/cuda/core/_context.pyx

Lines changed: 33 additions & 13 deletions
@@ -4,35 +4,55 @@
 
 from dataclasses import dataclass
 
-from cuda.core._utils.cuda_utils import driver
+from cuda.core._resource_handles cimport (
+    ContextHandle,
+    as_intptr,
+    as_py,
+)
 
 
-@dataclass
-class ContextOptions:
-    pass  # TODO
+__all__ = ['Context', 'ContextOptions']
 
 
 cdef class Context:
+    """CUDA context wrapper.
 
-    cdef:
-        readonly object _handle
-        int _device_id
+    Context objects represent CUDA contexts and cannot be instantiated directly.
+    Use Device or Stream APIs to obtain context objects.
+    """
 
     def __init__(self, *args, **kwargs):
         raise RuntimeError("Context objects cannot be instantiated directly. Please use Device or Stream APIs.")
 
-    @classmethod
-    def _from_ctx(cls, handle: driver.CUcontext, int device_id):
-        cdef Context ctx = Context.__new__(Context)
-        ctx._handle = handle
+    @staticmethod
+    cdef Context _from_handle(type cls, ContextHandle h_context, int device_id):
+        """Create Context from existing ContextHandle (cdef-only factory)."""
+        cdef Context ctx = cls.__new__(cls)
+        ctx._h_context = h_context
         ctx._device_id = device_id
         return ctx
 
+    @property
+    def handle(self):
+        """Return the underlying CUcontext handle."""
+        if self._h_context.get() == NULL:
+            return None
+        return as_py(self._h_context)
+
     def __eq__(self, other):
         if not isinstance(other, Context):
             return NotImplemented
         cdef Context _other = <Context>other
-        return int(self._handle) == int(_other._handle)
+        return as_intptr(self._h_context) == as_intptr(_other._h_context)
 
     def __hash__(self) -> int:
-        return hash(int(self._handle))
+        return hash((type(self), as_intptr(self._h_context)))
+
+
+@dataclass
+class ContextOptions:
+    """Options for context creation.
+
+    Currently unused, reserved for future use.
+    """
+    pass  # TODO
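
The equality and hashing scheme in this hunk (compare the underlying handle value, hash the pair of type and value) can be sketched in plain Python. HandleWrapper and its integer _address are hypothetical stand-ins for Context and as_intptr(self._h_context). The Cython class itself is not reproduced here.

```python
class HandleWrapper:
    """Illustrative wrapper whose identity is the wrapped address value."""

    def __init__(self, address: int) -> None:
        self._address = address  # stands in for as_intptr(handle)

    def __eq__(self, other):
        # Defer to the other operand for unrelated types.
        if not isinstance(other, HandleWrapper):
            return NotImplemented
        return self._address == other._address

    def __hash__(self) -> int:
        # Including the type keeps the hash from being computed the same
        # way as for the bare integer address.
        return hash((type(self), self._address))


a = HandleWrapper(0x10)
b = HandleWrapper(0x10)  # distinct wrapper object, same underlying value
c = HandleWrapper(0x20)
unique = {a, b, c}
```

Two wrappers around the same underlying value compare equal and collapse to one set entry, which is the property the commit message describes as comparing actual CUcontext values rather than shared_ptr addresses.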