Commit a56ff12

leofang and claude authored
Prepare cuda.core v0.7.0 release (#1877)
* Prepare cuda.core v0.7.0 release

  Finalize release notes with all changes since v0.6.0:
  - Explicit graph construction (GraphDef, GraphBuilder, typed nodes)
  - CUDA-Graphics (OpenGL) interop via GraphicsResource
  - TensorMapDescriptor for Hopper+ TMA
  - StridedMemoryView DLPack export and C exchange API
  - NVRTC PCH runtime APIs on Program
  - CPU callbacks for stream capture (GraphBuilder.callback)
  - CUDA 13.2 support
  - Multiple bug fixes and enhancements

  Also:
  - Add 0.7.0 to nv-versions.json
  - Bump pixi.toml version to 0.7.0
  - Add GraphicsResource, TensorMapDescriptor to api.rst
  - Remove "(experimental)" from pyproject.toml and README.md

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix install page URL in cuda_core README

  Point to cuda-core's own install page instead of cuda-bindings.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Address review feedback on release notes

  - Use consistent "CUDA-OpenGL" naming (not "CUDA-Graphics")
  - Highlight DLPack export via from_dlpack() array API; move C exchange API detail to New features section
  - TensorMapDescriptor: reference public StridedMemoryView.as_tensor_map() instead of private _from_tiled/_from_im2col methods

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Update cuda_core/docs/source/release/0.7.0-notes.rst

* Trim DLPack export bullet per review suggestion

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Update cuda_core/docs/source/release/0.7.0-notes.rst

* Address second round of release note review feedback

  - Fix PCH entry: reference actual public API (ProgramOptions fields, Program.pch_status property) instead of non-existent methods
  - Combine ManagedMemoryResource NUMA entries into single bullet
  - Combine PinnedMemoryResource NUMA entries into single bullet
  - Replace :issue: role (not configured) with explicit GitHub links
  - Use :class: cross-ref for ManagedMemoryResource in fixes section

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Use :meth: cross-ref for Program.compile in PCH entry

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Use cyclass.rst template for Program, Linker, ObjectCode, Kernel

  These Cython classes were using the default autosummary template, which does not expand methods and properties. Switch to cyclass.rst so that properties like Program.pch_status and methods like Program.compile appear in the generated docs and can be cross-referenced.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* remove incorrect entry slipping for

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent a40fd94 commit a56ff12

8 files changed: +152 -85 lines changed

cuda_core/README.md

Lines changed: 2 additions & 2 deletions
@@ -1,10 +1,10 @@
-# `cuda.core`: (experimental) Pythonic CUDA module
+# `cuda.core`: Pythonic CUDA module
 
 Currently under active development; see [the documentation](https://nvidia.github.io/cuda-python/cuda-core/latest/) for more details.
 
 ## Installing
 
-Please refer to the [Installation page](https://nvidia.github.io/cuda-python/cuda-bindings/latest/install.html) for instructions and required/optional dependencies.
+Please refer to the [Installation page](https://nvidia.github.io/cuda-python/cuda-core/latest/install.html) for instructions and required/optional dependencies.
 
 ## Developing

cuda_core/docs/nv-versions.json

Lines changed: 4 additions & 0 deletions
@@ -3,6 +3,10 @@
         "version": "latest",
         "url": "https://nvidia.github.io/cuda-python/cuda-core/latest/"
     },
+    {
+        "version": "0.7.0",
+        "url": "https://nvidia.github.io/cuda-python/cuda-core/0.7.0/"
+    },
     {
         "version": "0.6.0",
         "url": "https://nvidia.github.io/cuda-python/cuda-core/0.6.0/"
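
The hunk above places the new release entry directly after ``latest``. A minimal sketch of making the same edit programmatically follows, assuming the file is a JSON list of ``{"version", "url"}`` objects as the visible lines suggest. The helper ``add_version`` is hypothetical and is not part of the repository's tooling.

```python
import json

BASE_URL = "https://nvidia.github.io/cuda-python/cuda-core"


def add_version(entries, version, base_url):
    """Return a new list with `version` inserted right after the "latest" entry.

    Hypothetical helper for illustration only.
    """
    if any(e["version"] == version for e in entries):
        return list(entries)  # already listed, so the operation is idempotent
    new_entry = {"version": version, "url": f"{base_url}/{version}/"}
    # Index of the "latest" entry, or -1 so the new entry goes to the front.
    idx = next((i for i, e in enumerate(entries) if e["version"] == "latest"), -1)
    return entries[: idx + 1] + [new_entry] + entries[idx + 1 :]


# Only the two entries visible in the hunk; the real file may hold more.
versions = [
    {"version": "latest", "url": f"{BASE_URL}/latest/"},
    {"version": "0.6.0", "url": f"{BASE_URL}/0.6.0/"},
]

updated = add_version(versions, "0.7.0", BASE_URL)
print(json.dumps(updated, indent=4))
```

Inserting after ``latest`` rather than appending keeps the list in newest-first order, which is the order the hunk preserves.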

cuda_core/docs/source/api.rst

Lines changed: 28 additions & 0 deletions
@@ -129,12 +129,40 @@ Each subclass exposes attributes unique to its operation type.
    graph.SwitchNode
 
 
+Graphics interoperability
+-------------------------
+
+.. autosummary::
+   :toctree: generated/
+
+   :template: autosummary/cyclass.rst
+
+   GraphicsResource
+
+
+Tensor Memory Accelerator (TMA)
+-------------------------------
+
+.. autosummary::
+   :toctree: generated/
+
+   :template: autosummary/cyclass.rst
+
+   TensorMapDescriptor
+
+   :template: dataclass.rst
+
+   TensorMapDescriptorOptions
+
+
 CUDA compilation toolchain
 --------------------------
 
 .. autosummary::
    :toctree: generated/
 
+   :template: autosummary/cyclass.rst
+
    Program
    Linker
    ObjectCode

cuda_core/docs/source/release/0.6.0-notes.rst

Lines changed: 0 additions & 5 deletions
@@ -54,11 +54,6 @@ New features
 - Added CUDA version compatibility check at import time to detect mismatches between
   ``cuda.core`` and the installed ``cuda-bindings`` version.
 
-- ``Program.compile()`` now automatically resizes the NVRTC PCH heap and
-  retries when precompiled header creation fails due to heap exhaustion.
-  The ``pch_status`` property reports the PCH creation outcome
-  (``"created"``, ``"not_attempted"``, ``"failed"``, or ``None``).
-
 
 Fixes and enhancements
 ----------------------

cuda_core/docs/source/release/0.7.0-notes.rst

Lines changed: 116 additions & 0 deletions
@@ -0,0 +1,116 @@
+.. SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+.. SPDX-License-Identifier: Apache-2.0
+
+.. currentmodule:: cuda.core
+
+``cuda.core`` 0.7.0 Release Notes
+=================================
+
+
+Highlights
+----------
+
+- Introduced support for explicit graph construction. CUDA graphs can now be
+  built programmatically by adding nodes and edges, and their topology can be
+  modified after construction.
+- Added CUDA-OpenGL interoperability support, enabling zero-copy sharing of
+  GPU memory between CUDA compute kernels and OpenGL renderers.
+- Added :class:`TensorMapDescriptor` for Hopper+ TMA (Tensor Memory Accelerator)
+  bulk data movement, with automatic kernel argument integration.
+- :class:`~utils.StridedMemoryView` now supports DLPack export via
+  ``from_dlpack()`` array API.
+
+
+New features
+------------
+
+- Added the :mod:`cuda.core.graph` public module containing
+  :class:`~graph.GraphDef` for explicit graph construction, typed node
+  subclasses, and supporting types. :class:`~graph.GraphBuilder` (stream
+  capture) also moves into this module.
+
+- Added :meth:`~graph.GraphBuilder.callback` for CPU callbacks during stream
+  capture, mirroring the existing :meth:`~graph.GraphDef.callback` API.
+
+- Added :class:`GraphicsResource` for CUDA-OpenGL interoperability.
+  Factory classmethods :meth:`~GraphicsResource.from_gl_buffer` and
+  :meth:`~GraphicsResource.from_gl_image` register OpenGL objects for CUDA
+  access, and mapping returns a :class:`Buffer` for zero-copy kernel use.
+
+- Added :class:`TensorMapDescriptor` wrapping the CUDA driver's ``CUtensorMap``
+  for Hopper+ TMA (Tensor Memory Accelerator) bulk data movement.
+  :class:`~utils.StridedMemoryView` gains an :meth:`~utils.StridedMemoryView.as_tensor_map`
+  method for convenient descriptor creation, with automatic dtype inference, stride
+  computation, and first-class kernel argument integration.
+
+- Added DLPack export support to :class:`~utils.StridedMemoryView` via
+  ``__dlpack__`` and ``__dlpack_device__``, complementing the existing import
+  path.
+
+- Added the DLPack C exchange API (``__dlpack_c_exchange_api__``) to
+  :class:`~utils.StridedMemoryView`.
+
+- Added NVRTC precompiled header (PCH) support (CUDA 12.8+).
+  :class:`ProgramOptions` gains ``pch``, ``create_pch``, ``use_pch``,
+  ``pch_dir``, and related options. :attr:`Program.pch_status` reports the
+  PCH creation outcome, and :meth:`~Program.compile` automatically resizes the NVRTC
+  PCH heap and retries when PCH creation fails due to heap exhaustion.
+
+- Added NUMA-aware managed memory pool placement.
+  :class:`ManagedMemoryResourceOptions` gains a ``preferred_location_type``
+  option (``"device"``, ``"host"``, or ``"host_numa"``), and
+  :attr:`ManagedMemoryResource.preferred_location` queries the resolved
+  location. The existing ``preferred_location`` parameter retains full
+  backwards compatibility.
+
+- Added NUMA-aware pinned memory pool placement.
+  :class:`PinnedMemoryResourceOptions` gains a ``numa_id`` option, and
+  :attr:`PinnedMemoryResource.numa_id` queries the host NUMA node ID used for
+  pool placement. When ``ipc_enabled=True`` and ``numa_id`` is not set, the
+  NUMA node is automatically derived from the current CUDA device.
+
+- Added support for CUDA 13.2.
+
+
+New examples
+------------
+
+- ``gl_interop_plasma.py``: Real-time plasma effect demonstrating CUDA-OpenGL
+  interoperability via :class:`GraphicsResource`.
+- ``tma_tensor_map.py``: TMA bulk data movement using
+  :class:`TensorMapDescriptor` on Hopper+ GPUs.
+
+
+Fixes and enhancements
+----------------------
+
+- Fixed managed memory buffers being misclassified as ``kDLCUDAHost`` in DLPack
+  device mapping. They are now correctly reported as ``kDLCUDAManaged``.
+  (`#1863 <https://github.com/NVIDIA/cuda-python/pull/1863>`__)
+- Fixed IPC-enabled pinned memory pools using a hardcoded NUMA node ID of ``0``
+  instead of the NUMA node closest to the active CUDA device. On multi-NUMA
+  systems where the device is attached to a non-zero host NUMA node, this could
+  cause pool creation or allocation failures. (`#1603 <https://github.com/NVIDIA/cuda-python/issues/1603>`__)
+- Fixed :attr:`DeviceMemoryResource.peer_accessible_by` returning stale results when wrapping
+  a non-owned (default) memory pool. The property now always queries the CUDA driver for
+  non-owned pools, so multiple wrappers around the same pool see consistent state. (`#1720 <https://github.com/NVIDIA/cuda-python/issues/1720>`__)
+- Fixed a bare ``except`` clause in stream acceptance that silently swallowed all exceptions,
+  including ``KeyboardInterrupt`` and ``SystemExit``. Only the expected "protocol not
+  supported" case is now caught. (`#1631 <https://github.com/NVIDIA/cuda-python/issues/1631>`__)
+- :class:`~utils.StridedMemoryView` now validates strides at construction time so unsupported
+  layouts fail immediately instead of on first metadata access. (`#1429 <https://github.com/NVIDIA/cuda-python/issues/1429>`__)
+- IPC file descriptor cleanup now uses a C++ ``shared_ptr`` with a POSIX deleter, avoiding
+  cryptic errors when a :class:`DeviceMemoryResource` is destroyed during Python shutdown.
+- Improved error message when :class:`ManagedMemoryResource` is called without options on platforms
+  that lack a default managed memory pool (e.g. WSL2). (`#1617 <https://github.com/NVIDIA/cuda-python/issues/1617>`__)
+- Handle properties on core API objects now return ``None`` during Python shutdown instead of
+  crashing.
+- Reduced Python overhead in :class:`Program` and :class:`Linker` by moving compilation and
+  linking operations to the C level and releasing the GIL during backend calls. This benefits
+  workloads that create many programs or linkers, and enables concurrent compilation in
+  multithreaded applications.
+- Error enum explanations are now derived from ``cuda-bindings`` docstrings when available
+  (bindings 12.9.6+ or 13.2.0+), with frozen tables as fallback for older versions.
+- Improved optional dependency handling for NVVM and nvJitLink imports so that only genuinely
+  missing optional modules are treated as unavailable; unrelated import failures now surface
+  normally, and ``cuda.core`` now depends directly on ``cuda-pathfinder``.
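
The bare ``except`` fix listed above (#1631) is easier to follow with a small standalone illustration. The sketch below is not the ``cuda.core`` implementation: the function names are hypothetical, and ``AttributeError`` merely stands in for the "protocol not supported" case, whose real exception type is not shown in this diff.

```python
def accept_bare(get_stream):
    """Anti-pattern: a bare ``except`` also catches KeyboardInterrupt and SystemExit."""
    try:
        return get_stream()
    except:  # noqa: E722 - deliberately wrong, for illustration
        return None


def accept_narrow(get_stream):
    """Catch only the expected failure (assumed here to be AttributeError)."""
    try:
        return get_stream()
    except AttributeError:
        return None


def interrupted():
    # Simulates Ctrl-C arriving while the stream is being resolved.
    raise KeyboardInterrupt


def unsupported():
    # Simulates an object that does not implement the expected protocol.
    raise AttributeError("object does not implement the stream protocol")


# The bare clause hides the interrupt and returns None as if nothing happened.
swallowed = accept_bare(interrupted) is None

# The narrow clause lets the interrupt propagate to the caller.
try:
    accept_narrow(interrupted)
    propagated = False
except KeyboardInterrupt:
    propagated = True

# The expected failure is still handled gracefully.
fallback = accept_narrow(unsupported) is None

print(swallowed, propagated, fallback)
```

``KeyboardInterrupt`` and ``SystemExit`` derive from ``BaseException`` rather than ``Exception``, so any clause naming a specific ``Exception`` subclass leaves them alone.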

cuda_core/docs/source/release/0.7.x-notes.rst

Lines changed: 0 additions & 76 deletions
This file was deleted.

cuda_core/pixi.toml

Lines changed: 1 addition & 1 deletion
@@ -107,7 +107,7 @@ examples = { features = ["cu13", "examples", "local-deps"], solve-group = "examp
 # TODO: check if these can be extracted from pyproject.toml
 [package]
 name = "cuda-core"
-version = "0.6.0"
+version = "0.7.0"
 
 [package.build]
 backend = { name = "pixi-build-python", version = "*" }

cuda_core/pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ dynamic = [
     "readme",
 ]
 requires-python = '>=3.10'
-description = "cuda.core: (experimental) pythonic CUDA module"
+description = "cuda.core: pythonic CUDA module"
 authors = [
     { name = "NVIDIA Corporation" }
 ]
