diff --git a/cuda_core/README.md b/cuda_core/README.md
index d7dfe83bfa9..7ea41966017 100644
--- a/cuda_core/README.md
+++ b/cuda_core/README.md
@@ -1,10 +1,10 @@
-# `cuda.core`: (experimental) Pythonic CUDA module
+# `cuda.core`: Pythonic CUDA module
 
 Currently under active development; see [the documentation](https://nvidia.github.io/cuda-python/cuda-core/latest/) for more details.
 
 ## Installing
 
-Please refer to the [Installation page](https://nvidia.github.io/cuda-python/cuda-bindings/latest/install.html) for instructions and required/optional dependencies.
+Please refer to the [Installation page](https://nvidia.github.io/cuda-python/cuda-core/latest/install.html) for instructions and required/optional dependencies.
 
 ## Developing
 
diff --git a/cuda_core/docs/nv-versions.json b/cuda_core/docs/nv-versions.json
index 80f9de3e69a..d55ec26f53f 100644
--- a/cuda_core/docs/nv-versions.json
+++ b/cuda_core/docs/nv-versions.json
@@ -3,6 +3,10 @@
         "version": "latest",
         "url": "https://nvidia.github.io/cuda-python/cuda-core/latest/"
     },
+    {
+        "version": "0.7.0",
+        "url": "https://nvidia.github.io/cuda-python/cuda-core/0.7.0/"
+    },
     {
         "version": "0.6.0",
         "url": "https://nvidia.github.io/cuda-python/cuda-core/0.6.0/"
diff --git a/cuda_core/docs/source/api.rst b/cuda_core/docs/source/api.rst
index 0c877dcc81e..005866ddb2d 100644
--- a/cuda_core/docs/source/api.rst
+++ b/cuda_core/docs/source/api.rst
@@ -129,12 +129,40 @@ Each subclass exposes attributes unique to its operation type.
    graph.SwitchNode
 
 
+Graphics interoperability
+-------------------------
+
+.. autosummary::
+   :toctree: generated/
+
+   :template: autosummary/cyclass.rst
+
+   GraphicsResource
+
+
+Tensor Memory Accelerator (TMA)
+-------------------------------
+
+.. autosummary::
+   :toctree: generated/
+
+   :template: autosummary/cyclass.rst
+
+   TensorMapDescriptor
+
+   :template: dataclass.rst
+
+   TensorMapDescriptorOptions
+
+
 CUDA compilation toolchain
 --------------------------
 
 .. autosummary::
    :toctree: generated/
 
+   :template: autosummary/cyclass.rst
+
    Program
    Linker
    ObjectCode
diff --git a/cuda_core/docs/source/release/0.6.0-notes.rst b/cuda_core/docs/source/release/0.6.0-notes.rst
index 654eb7641bf..b7d6188cc25 100644
--- a/cuda_core/docs/source/release/0.6.0-notes.rst
+++ b/cuda_core/docs/source/release/0.6.0-notes.rst
@@ -54,11 +54,6 @@ New features
 - Added CUDA version compatibility check at import time to detect mismatches between
   ``cuda.core`` and the installed ``cuda-bindings`` version.
 
-- ``Program.compile()`` now automatically resizes the NVRTC PCH heap and
-  retries when precompiled header creation fails due to heap exhaustion.
-  The ``pch_status`` property reports the PCH creation outcome
-  (``"created"``, ``"not_attempted"``, ``"failed"``, or ``None``).
-
 
 Fixes and enhancements
 ----------------------
diff --git a/cuda_core/docs/source/release/0.7.0-notes.rst b/cuda_core/docs/source/release/0.7.0-notes.rst
new file mode 100644
index 00000000000..3946c8804bf
--- /dev/null
+++ b/cuda_core/docs/source/release/0.7.0-notes.rst
@@ -0,0 +1,116 @@
+.. SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+.. SPDX-License-Identifier: Apache-2.0
+
+.. currentmodule:: cuda.core
+
+``cuda.core`` 0.7.0 Release Notes
+=================================
+
+
+Highlights
+----------
+
+- Introduced support for explicit graph construction. CUDA graphs can now be
+  built programmatically by adding nodes and edges, and their topology can be
+  modified after construction.
+- Added CUDA-OpenGL interoperability support, enabling zero-copy sharing of
+  GPU memory between CUDA compute kernels and OpenGL renderers.
+- Added :class:`TensorMapDescriptor` for Hopper+ TMA (Tensor Memory Accelerator)
+  bulk data movement, with automatic kernel argument integration.
+- :class:`~utils.StridedMemoryView` now supports DLPack export via
+  ``from_dlpack()`` array API.
+
+
+New features
+------------
+
+- Added the :mod:`cuda.core.graph` public module containing
+  :class:`~graph.GraphDef` for explicit graph construction, typed node
+  subclasses, and supporting types. :class:`~graph.GraphBuilder` (stream
+  capture) also moves into this module.
+
+- Added :meth:`~graph.GraphBuilder.callback` for CPU callbacks during stream
+  capture, mirroring the existing :meth:`~graph.GraphDef.callback` API.
+
+- Added :class:`GraphicsResource` for CUDA-OpenGL interoperability.
+  Factory classmethods :meth:`~GraphicsResource.from_gl_buffer` and
+  :meth:`~GraphicsResource.from_gl_image` register OpenGL objects for CUDA
+  access, and mapping returns a :class:`Buffer` for zero-copy kernel use.
+
+- Added :class:`TensorMapDescriptor` wrapping the CUDA driver's ``CUtensorMap``
+  for Hopper+ TMA (Tensor Memory Accelerator) bulk data movement.
+  :class:`~utils.StridedMemoryView` gains an :meth:`~utils.StridedMemoryView.as_tensor_map`
+  method for convenient descriptor creation, with automatic dtype inference, stride
+  computation, and first-class kernel argument integration.
+
+- Added DLPack export support to :class:`~utils.StridedMemoryView` via
+  ``__dlpack__`` and ``__dlpack_device__``, complementing the existing import
+  path.
+
+- Added the DLPack C exchange API (``__dlpack_c_exchange_api__``) to
+  :class:`~utils.StridedMemoryView`.
+
+- Added NVRTC precompiled header (PCH) support (CUDA 12.8+).
+  :class:`ProgramOptions` gains ``pch``, ``create_pch``, ``use_pch``,
+  ``pch_dir``, and related options. :attr:`Program.pch_status` reports the
+  PCH creation outcome, and :meth:`~Program.compile` automatically resizes the NVRTC
+  PCH heap and retries when PCH creation fails due to heap exhaustion.
+
+- Added NUMA-aware managed memory pool placement.
+  :class:`ManagedMemoryResourceOptions` gains a ``preferred_location_type``
+  option (``"device"``, ``"host"``, or ``"host_numa"``), and
+  :attr:`ManagedMemoryResource.preferred_location` queries the resolved
+  location. The existing ``preferred_location`` parameter retains full
+  backwards compatibility.
+
+- Added NUMA-aware pinned memory pool placement.
+  :class:`PinnedMemoryResourceOptions` gains a ``numa_id`` option, and
+  :attr:`PinnedMemoryResource.numa_id` queries the host NUMA node ID used for
+  pool placement. When ``ipc_enabled=True`` and ``numa_id`` is not set, the
+  NUMA node is automatically derived from the current CUDA device.
+
+- Added support for CUDA 13.2.
+
+
+New examples
+------------
+
+- ``gl_interop_plasma.py``: Real-time plasma effect demonstrating CUDA-OpenGL
+  interoperability via :class:`GraphicsResource`.
+- ``tma_tensor_map.py``: TMA bulk data movement using
+  :class:`TensorMapDescriptor` on Hopper+ GPUs.
+
+
+Fixes and enhancements
+----------------------
+
+- Fixed managed memory buffers being misclassified as ``kDLCUDAHost`` in DLPack
+  device mapping. They are now correctly reported as ``kDLCUDAManaged``.
+  (`#1863 `__)
+- Fixed IPC-enabled pinned memory pools using a hardcoded NUMA node ID of ``0``
+  instead of the NUMA node closest to the active CUDA device. On multi-NUMA
+  systems where the device is attached to a non-zero host NUMA node, this could
+  cause pool creation or allocation failures. (`#1603 `__)
+- Fixed :attr:`DeviceMemoryResource.peer_accessible_by` returning stale results when wrapping
+  a non-owned (default) memory pool. The property now always queries the CUDA driver for
+  non-owned pools, so multiple wrappers around the same pool see consistent state. (`#1720 `__)
+- Fixed a bare ``except`` clause in stream acceptance that silently swallowed all exceptions,
+  including ``KeyboardInterrupt`` and ``SystemExit``. Only the expected "protocol not
+  supported" case is now caught. (`#1631 `__)
+- :class:`~utils.StridedMemoryView` now validates strides at construction time so unsupported
+  layouts fail immediately instead of on first metadata access. (`#1429 `__)
+- IPC file descriptor cleanup now uses a C++ ``shared_ptr`` with a POSIX deleter, avoiding
+  cryptic errors when a :class:`DeviceMemoryResource` is destroyed during Python shutdown.
+- Improved error message when :class:`ManagedMemoryResource` is called without options on platforms
+  that lack a default managed memory pool (e.g. WSL2). (`#1617 `__)
+- Handle properties on core API objects now return ``None`` during Python shutdown instead of
+  crashing.
+- Reduced Python overhead in :class:`Program` and :class:`Linker` by moving compilation and
+  linking operations to the C level and releasing the GIL during backend calls. This benefits
+  workloads that create many programs or linkers, and enables concurrent compilation in
+  multithreaded applications.
+- Error enum explanations are now derived from ``cuda-bindings`` docstrings when available
+  (bindings 12.9.6+ or 13.2.0+), with frozen tables as fallback for older versions.
+- Improved optional dependency handling for NVVM and nvJitLink imports so that only genuinely
+  missing optional modules are treated as unavailable; unrelated import failures now surface
+  normally, and ``cuda.core`` now depends directly on ``cuda-pathfinder``.
diff --git a/cuda_core/docs/source/release/0.7.x-notes.rst b/cuda_core/docs/source/release/0.7.x-notes.rst
deleted file mode 100644
index 20e6738987a..00000000000
--- a/cuda_core/docs/source/release/0.7.x-notes.rst
+++ /dev/null
@@ -1,76 +0,0 @@
-.. SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-.. SPDX-License-Identifier: Apache-2.0
-
-.. currentmodule:: cuda.core
-
-``cuda.core`` 0.7.x Release Notes
-=================================
-
-
-Highlights
-----------
-
-- Introduced support for explicit graph construction. CUDA graphs can now be
-  built programmatically by adding nodes and edges, and their topology can be
-  modified after construction.
-
-
-Breaking Changes
-----------------
-
-- Building ``cuda.core`` from source now requires ``cuda-bindings`` >= 12.9.0, due to Cython-level
-  dependencies on the NVVM and nvJitLink bindings (``cynvvm``, ``cynvjitlink``). Pre-built wheels
-  are unaffected. The previous minimum was 12.8.0.
-
-
-New features
-------------
-
-- Added the :mod:`cuda.core.graph` public module containing
-  :class:`~graph.GraphDef` for explicit graph construction, typed node
-  subclasses, and supporting types. :class:`~graph.GraphBuilder` (stream
-  capture) also moves into this module.
-
-- Added ``preferred_location_type`` option to :class:`ManagedMemoryResourceOptions`
-  for explicit control over the preferred location kind (``"device"``,
-  ``"host"``, or ``"host_numa"``). This enables NUMA-aware managed memory
-  pool placement. The existing ``preferred_location`` parameter retains full
-  backwards compatibility when ``preferred_location_type`` is not set.
-
-- Added :attr:`ManagedMemoryResource.preferred_location` property to query the
-  resolved preferred location of a managed memory pool. Returns ``None`` for no
-  preference, or a tuple such as ``("device", 0)``, ``("host", None)``, or
-  ``("host_numa", 3)``.
-
-- Added ``numa_id`` option to :class:`PinnedMemoryResourceOptions` for explicit
-  control over host NUMA node placement. When ``ipc_enabled=True`` and
-  ``numa_id`` is not set, the NUMA node is automatically derived from the
-  current CUDA device.
-
-- Added :attr:`PinnedMemoryResource.numa_id` property to query the host NUMA
-  node ID used for pool placement. Returns ``-1`` for OS-managed placement.
-
-
-New examples
-------------
-
-None.
-
-
-Fixes and enhancements
-----------------------
-
-- Fixed IPC-enabled pinned memory pools using a hardcoded NUMA node ID of ``0``
-  instead of the NUMA node closest to the active CUDA device. On multi-NUMA
-  systems where the device is attached to a non-zero host NUMA node, this could
-  cause pool creation or allocation failures. (:issue:`1603`)
-- Fixed :attr:`DeviceMemoryResource.peer_accessible_by` returning stale results when wrapping
-  a non-owned (default) memory pool. The property now always queries the CUDA driver for
-  non-owned pools, so multiple wrappers around the same pool see consistent state. (:issue:`1720`)
-- Reduced Python overhead in :class:`Program` and :class:`Linker` by moving compilation and
-  linking operations to the C level and releasing the GIL during backend calls. This benefits
-  workloads that create many programs or linkers, and enables concurrent compilation in
-  multithreaded applications.
-- Improved optional dependency handling for NVVM and nvJitLink imports so that only genuinely
-  missing optional modules are treated as unavailable; unrelated import failures now surface
-  normally, and ``cuda.core`` now depends directly on ``cuda-pathfinder``.
diff --git a/cuda_core/pixi.toml b/cuda_core/pixi.toml
index 913472c07e1..1696a4a4c57 100644
--- a/cuda_core/pixi.toml
+++ b/cuda_core/pixi.toml
@@ -107,7 +107,7 @@ examples = { features = ["cu13", "examples", "local-deps"], solve-group = "examp
 # TODO: check if these can be extracted from pyproject.toml
 [package]
 name = "cuda-core"
-version = "0.6.0"
+version = "0.7.0"
 
 [package.build]
 backend = { name = "pixi-build-python", version = "*" }
diff --git a/cuda_core/pyproject.toml b/cuda_core/pyproject.toml
index aacbe4f4c59..80711f39ede 100644
--- a/cuda_core/pyproject.toml
+++ b/cuda_core/pyproject.toml
@@ -19,7 +19,7 @@ dynamic = [
     "readme",
 ]
 requires-python = '>=3.10'
-description = "cuda.core: (experimental) pythonic CUDA module"
+description = "cuda.core: pythonic CUDA module"
 authors = [
     { name = "NVIDIA Corporation" }
 ]