.. currentmodule:: cuda.core
- Introduced support for explicit graph construction. CUDA graphs can now be built programmatically by adding nodes and edges, and their topology can be modified after construction.
- Added CUDA-OpenGL interoperability support, enabling zero-copy sharing of GPU memory between CUDA compute kernels and OpenGL renderers.
- Added :class:`TensorMapDescriptor` for Hopper+ TMA (Tensor Memory Accelerator) bulk data movement, with automatic kernel argument integration.
- :class:`~utils.StridedMemoryView` now supports DLPack export via the ``from_dlpack()`` array API.
- Added the :mod:`cuda.core.graph` public module containing :class:`~graph.GraphDef` for explicit graph construction, typed node subclasses, and supporting types. :class:`~graph.GraphBuilder` (stream capture) also moves into this module.
- Added :meth:`~graph.GraphBuilder.callback` for CPU callbacks during stream capture, mirroring the existing :meth:`~graph.GraphDef.callback` API.
- Added :class:`GraphicsResource` for CUDA-OpenGL interoperability. Factory classmethods :meth:`~GraphicsResource.from_gl_buffer` and :meth:`~GraphicsResource.from_gl_image` register OpenGL objects for CUDA access, and mapping returns a :class:`Buffer` for zero-copy kernel use.
- Added :class:`TensorMapDescriptor` wrapping the CUDA driver's ``CUtensorMap`` for Hopper+ TMA (Tensor Memory Accelerator) bulk data movement. :class:`~utils.StridedMemoryView` gains an :meth:`~utils.StridedMemoryView.as_tensor_map` method for convenient descriptor creation, with automatic dtype inference, stride computation, and first-class kernel argument integration.
- Added DLPack export support to :class:`~utils.StridedMemoryView` via ``__dlpack__`` and ``__dlpack_device__``, complementing the existing import path.
- Added the DLPack C exchange API (``__dlpack_c_exchange_api__``) to :class:`~utils.StridedMemoryView`.
- Added NVRTC precompiled header (PCH) support (CUDA 12.8+). :class:`ProgramOptions` gains ``pch``, ``create_pch``, ``use_pch``, ``pch_dir``, and related options. :attr:`Program.pch_status` reports the PCH creation outcome, and :meth:`~Program.compile` automatically resizes the NVRTC PCH heap and retries when PCH creation fails due to heap exhaustion.
- Added NUMA-aware managed memory pool placement. :class:`ManagedMemoryResourceOptions` gains a ``preferred_location_type`` option (``"device"``, ``"host"``, or ``"host_numa"``), and :attr:`ManagedMemoryResource.preferred_location` queries the resolved location. The existing ``preferred_location`` parameter retains full backwards compatibility.
- Added NUMA-aware pinned memory pool placement. :class:`PinnedMemoryResourceOptions` gains a ``numa_id`` option, and :attr:`PinnedMemoryResource.numa_id` queries the host NUMA node ID used for pool placement. When ``ipc_enabled=True`` and ``numa_id`` is not set, the NUMA node is automatically derived from the current CUDA device.
- Added support for CUDA 13.2.
- ``gl_interop_plasma.py``: Real-time plasma effect demonstrating CUDA-OpenGL interoperability via :class:`GraphicsResource`.
- ``tma_tensor_map.py``: TMA bulk data movement using :class:`TensorMapDescriptor` on Hopper+ GPUs.
- Fixed managed memory buffers being misclassified as ``kDLCUDAHost`` in DLPack device mapping. They are now correctly reported as ``kDLCUDAManaged``. (#1863)
- Fixed IPC-enabled pinned memory pools using a hardcoded NUMA node ID of ``0`` instead of the NUMA node closest to the active CUDA device. On multi-NUMA systems where the device is attached to a non-zero host NUMA node, this could cause pool creation or allocation failures. (#1603)
- Fixed :attr:`DeviceMemoryResource.peer_accessible_by` returning stale results when wrapping a non-owned (default) memory pool. The property now always queries the CUDA driver for non-owned pools, so multiple wrappers around the same pool see consistent state. (#1720)
- Fixed a bare ``except`` clause in stream acceptance that silently swallowed all exceptions, including ``KeyboardInterrupt`` and ``SystemExit``. Only the expected "protocol not supported" case is now caught. (#1631)
- :class:`~utils.StridedMemoryView` now validates strides at construction time so unsupported layouts fail immediately instead of on first metadata access. (#1429)
- IPC file descriptor cleanup now uses a C++ ``shared_ptr`` with a POSIX deleter, avoiding cryptic errors when a :class:`DeviceMemoryResource` is destroyed during Python shutdown.
- Improved the error message when :class:`ManagedMemoryResource` is called without options on platforms that lack a default managed memory pool (e.g. WSL2). (#1617)
- Handle properties on core API objects now return ``None`` during Python shutdown instead of crashing.
- Reduced Python overhead in :class:`Program` and :class:`Linker` by moving compilation and linking operations to the C level and releasing the GIL during backend calls. This benefits workloads that create many programs or linkers, and enables concurrent compilation in multithreaded applications.
- Error enum explanations are now derived from ``cuda-bindings`` docstrings when available (bindings 12.9.6+ or 13.2.0+), with frozen tables as fallback for older versions.
- Improved optional dependency handling for NVVM and nvJitLink imports so that only genuinely missing optional modules are treated as unavailable; unrelated import failures now surface normally, and ``cuda.core`` now depends directly on ``cuda-pathfinder``.
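
The optional-dependency behavior described in the last item can be illustrated with a small, standard-library-only sketch. This is not the code used inside ``cuda.core``; ``import_optional`` is a hypothetical helper, and the pattern shown (inspecting ``ModuleNotFoundError.name``) is only one plausible way to tell "the optional module is not installed" apart from "the optional module is installed but one of its own imports is broken".

.. code-block:: python

   import importlib


   def import_optional(module_name):
       """Hypothetical helper: return the module, or None if it is absent.

       Only a missing ``module_name`` (or a missing parent package of it)
       counts as "unavailable". Any other import failure is re-raised.
       """
       try:
           return importlib.import_module(module_name)
       except ModuleNotFoundError as exc:
           # exc.name holds the name of the module that could not be found.
           missing = exc.name or ""
           if module_name == missing or module_name.startswith(missing + "."):
               return None
           # A different module is missing, so the optional module exists
           # but is broken; let the error surface normally.
           raise


   # An installed module is returned; an absent one yields None.
   json_module = import_optional("json")
   absent = import_optional("hypothetical_missing_module_xyz")

Under these assumptions, a caller can test the result against ``None`` to decide whether an optional backend is usable, while a broken installation still raises its original error instead of being reported as "not installed".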