Prepare cuda.core v0.7.0 release (#1877)

leofang · claude · web-flow · commit a56ff12da046 · 2026-04-07T22:45:49.000-04:00
* Prepare cuda.core v0.7.0 release

Finalize release notes with all changes since v0.6.0:
- Explicit graph construction (GraphDef, GraphBuilder, typed nodes)
- CUDA-Graphics (OpenGL) interop via GraphicsResource
- TensorMapDescriptor for Hopper+ TMA
- StridedMemoryView DLPack export and C exchange API
- NVRTC PCH runtime APIs on Program
- CPU callbacks for stream capture (GraphBuilder.callback)
- CUDA 13.2 support
- Multiple bug fixes and enhancements

Also:
- Add 0.7.0 to nv-versions.json
- Bump pixi.toml version to 0.7.0
- Add GraphicsResource, TensorMapDescriptor to api.rst
- Remove "(experimental)" from pyproject.toml and README.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix install page URL in cuda_core README

Point to cuda-core's own install page instead of cuda-bindings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Address review feedback on release notes

- Use consistent "CUDA-OpenGL" naming (not "CUDA-Graphics")
- Highlight DLPack export via from_dlpack() array API; move C exchange
  API detail to New features section
- TensorMapDescriptor: reference public StridedMemoryView.as_tensor_map()
  instead of private _from_tiled/_from_im2col methods

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Update cuda_core/docs/source/release/0.7.0-notes.rst

* Trim DLPack export bullet per review suggestion

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Update cuda_core/docs/source/release/0.7.0-notes.rst

* Address second round of release note review feedback

- Fix PCH entry: reference actual public API (ProgramOptions fields,
  Program.pch_status property) instead of non-existent methods
- Combine ManagedMemoryResource NUMA entries into single bullet
- Combine PinnedMemoryResource NUMA entries into single bullet
- Replace :issue: role (not configured) with explicit GitHub links
- Use :class: cross-ref for ManagedMemoryResource in fixes section

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Use :meth: cross-ref for Program.compile in PCH entry

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Use cyclass.rst template for Program, Linker, ObjectCode, Kernel

These Cython classes were using the default autosummary template, which
does not expand methods and properties. Switch to cyclass.rst so that
properties like Program.pch_status and methods like Program.compile
appear in the generated docs and can be cross-referenced.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* remove incorrect entry slipping for

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
diff --git a/cuda_core/README.md b/cuda_core/README.md
@@ -1,10 +1,10 @@
-# `cuda.core`: (experimental) Pythonic CUDA module
+# `cuda.core`: Pythonic CUDA module
 
 Currently under active development; see [the documentation](https://nvidia.github.io/cuda-python/cuda-core/latest/) for more details.
 
 ## Installing
 
-Please refer to the [Installation page](https://nvidia.github.io/cuda-python/cuda-bindings/latest/install.html) for instructions and required/optional dependencies.
+Please refer to the [Installation page](https://nvidia.github.io/cuda-python/cuda-core/latest/install.html) for instructions and required/optional dependencies.
 
 ## Developing
 
diff --git a/cuda_core/docs/nv-versions.json b/cuda_core/docs/nv-versions.json
@@ -3,6 +3,10 @@
         "version": "latest",
         "url": "https://nvidia.github.io/cuda-python/cuda-core/latest/"
     },
+    {
+        "version": "0.7.0",
+        "url": "https://nvidia.github.io/cuda-python/cuda-core/0.7.0/"
+    },
     {
         "version": "0.6.0",
         "url": "https://nvidia.github.io/cuda-python/cuda-core/0.6.0/"
diff --git a/cuda_core/docs/source/api.rst b/cuda_core/docs/source/api.rst
@@ -129,12 +129,40 @@ Each subclass exposes attributes unique to its operation type.
    graph.SwitchNode
 
 
+Graphics interoperability
+-------------------------
+
+.. autosummary::
+   :toctree: generated/
+
+   :template: autosummary/cyclass.rst
+
+   GraphicsResource
+
+
+Tensor Memory Accelerator (TMA)
+-------------------------------
+
+.. autosummary::
+   :toctree: generated/
+
+   :template: autosummary/cyclass.rst
+
+   TensorMapDescriptor
+
+   :template: dataclass.rst
+
+   TensorMapDescriptorOptions
+
+
 CUDA compilation toolchain
 --------------------------
 
 .. autosummary::
    :toctree: generated/
 
+   :template: autosummary/cyclass.rst
+
    Program
    Linker
    ObjectCode
diff --git a/cuda_core/docs/source/release/0.6.0-notes.rst b/cuda_core/docs/source/release/0.6.0-notes.rst
@@ -54,11 +54,6 @@ New features
 - Added CUDA version compatibility check at import time to detect mismatches between
   ``cuda.core`` and the installed ``cuda-bindings`` version.
 
-- ``Program.compile()`` now automatically resizes the NVRTC PCH heap and
-  retries when precompiled header creation fails due to heap exhaustion.
-  The ``pch_status`` property reports the PCH creation outcome
-  (``"created"``, ``"not_attempted"``, ``"failed"``, or ``None``).
-
 
 Fixes and enhancements
 ----------------------
diff --git a/cuda_core/docs/source/release/0.7.0-notes.rst b/cuda_core/docs/source/release/0.7.0-notes.rst
@@ -0,0 +1,116 @@
+.. SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+.. SPDX-License-Identifier: Apache-2.0
+
+.. currentmodule:: cuda.core
+
+``cuda.core`` 0.7.0 Release Notes
+=================================
+
+
+Highlights
+----------
+
+- Introduced support for explicit graph construction. CUDA graphs can now be
+  built programmatically by adding nodes and edges, and their topology can be
+  modified after construction.
+- Added CUDA-OpenGL interoperability support, enabling zero-copy sharing of
+  GPU memory between CUDA compute kernels and OpenGL renderers.
+- Added :class:`TensorMapDescriptor` for Hopper+ TMA (Tensor Memory Accelerator)
+  bulk data movement, with automatic kernel argument integration.
+- :class:`~utils.StridedMemoryView` now supports DLPack export via
+  ``from_dlpack()`` array API.
+
+
+New features
+------------
+
+- Added the :mod:`cuda.core.graph` public module containing
+  :class:`~graph.GraphDef` for explicit graph construction, typed node
+  subclasses, and supporting types. :class:`~graph.GraphBuilder` (stream
+  capture) also moves into this module.
+
+- Added :meth:`~graph.GraphBuilder.callback` for CPU callbacks during stream
+  capture, mirroring the existing :meth:`~graph.GraphDef.callback` API.
+
+- Added :class:`GraphicsResource` for CUDA-OpenGL interoperability.
+  Factory classmethods :meth:`~GraphicsResource.from_gl_buffer` and
+  :meth:`~GraphicsResource.from_gl_image` register OpenGL objects for CUDA
+  access, and mapping returns a :class:`Buffer` for zero-copy kernel use.
+
+- Added :class:`TensorMapDescriptor` wrapping the CUDA driver's ``CUtensorMap``
+  for Hopper+ TMA (Tensor Memory Accelerator) bulk data movement.
+  :class:`~utils.StridedMemoryView` gains an :meth:`~utils.StridedMemoryView.as_tensor_map`
+  method for convenient descriptor creation, with automatic dtype inference, stride
+  computation, and first-class kernel argument integration.
+
+- Added DLPack export support to :class:`~utils.StridedMemoryView` via
+  ``__dlpack__`` and ``__dlpack_device__``, complementing the existing import
+  path.
+
+- Added the DLPack C exchange API (``__dlpack_c_exchange_api__``) to
+  :class:`~utils.StridedMemoryView`.
+
+- Added NVRTC precompiled header (PCH) support (CUDA 12.8+).
+  :class:`ProgramOptions` gains ``pch``, ``create_pch``, ``use_pch``,
+  ``pch_dir``, and related options. :attr:`Program.pch_status` reports the
+  PCH creation outcome, and :meth:`~Program.compile` automatically resizes the NVRTC
+  PCH heap and retries when PCH creation fails due to heap exhaustion.
+
+- Added NUMA-aware managed memory pool placement.
+  :class:`ManagedMemoryResourceOptions` gains a ``preferred_location_type``
+  option (``"device"``, ``"host"``, or ``"host_numa"``), and
+  :attr:`ManagedMemoryResource.preferred_location` queries the resolved
+  location. The existing ``preferred_location`` parameter retains full
+  backwards compatibility.
+
+- Added NUMA-aware pinned memory pool placement.
+  :class:`PinnedMemoryResourceOptions` gains a ``numa_id`` option, and
+  :attr:`PinnedMemoryResource.numa_id` queries the host NUMA node ID used for
+  pool placement. When ``ipc_enabled=True`` and ``numa_id`` is not set, the
+  NUMA node is automatically derived from the current CUDA device.
+
+- Added support for CUDA 13.2.
+
+
+New examples
+------------
+
+- ``gl_interop_plasma.py``: Real-time plasma effect demonstrating CUDA-OpenGL
+  interoperability via :class:`GraphicsResource`.
+- ``tma_tensor_map.py``: TMA bulk data movement using
+  :class:`TensorMapDescriptor` on Hopper+ GPUs.
+
+
+Fixes and enhancements
+----------------------
+
+- Fixed managed memory buffers being misclassified as ``kDLCUDAHost`` in DLPack
+  device mapping. They are now correctly reported as ``kDLCUDAManaged``.
+  (`#1863 <https://github.com/NVIDIA/cuda-python/pull/1863>`__)
+- Fixed IPC-enabled pinned memory pools using a hardcoded NUMA node ID of ``0``
+  instead of the NUMA node closest to the active CUDA device. On multi-NUMA
+  systems where the device is attached to a non-zero host NUMA node, this could
+  cause pool creation or allocation failures. (`#1603 <https://github.com/NVIDIA/cuda-python/issues/1603>`__)
+- Fixed :attr:`DeviceMemoryResource.peer_accessible_by` returning stale results when wrapping
+  a non-owned (default) memory pool. The property now always queries the CUDA driver for
+  non-owned pools, so multiple wrappers around the same pool see consistent state. (`#1720 <https://github.com/NVIDIA/cuda-python/issues/1720>`__)
+- Fixed a bare ``except`` clause in stream acceptance that silently swallowed all exceptions,
+  including ``KeyboardInterrupt`` and ``SystemExit``. Only the expected "protocol not
+  supported" case is now caught. (`#1631 <https://github.com/NVIDIA/cuda-python/issues/1631>`__)
+- :class:`~utils.StridedMemoryView` now validates strides at construction time so unsupported
+  layouts fail immediately instead of on first metadata access. (`#1429 <https://github.com/NVIDIA/cuda-python/issues/1429>`__)
+- IPC file descriptor cleanup now uses a C++ ``shared_ptr`` with a POSIX deleter, avoiding
+  cryptic errors when a :class:`DeviceMemoryResource` is destroyed during Python shutdown.
+- Improved error message when :class:`ManagedMemoryResource` is called without options on platforms
+  that lack a default managed memory pool (e.g. WSL2). (`#1617 <https://github.com/NVIDIA/cuda-python/issues/1617>`__)
+- Handle properties on core API objects now return ``None`` during Python shutdown instead of
+  crashing.
+- Reduced Python overhead in :class:`Program` and :class:`Linker` by moving compilation and
+  linking operations to the C level and releasing the GIL during backend calls. This benefits
+  workloads that create many programs or linkers, and enables concurrent compilation in
+  multithreaded applications.
+- Error enum explanations are now derived from ``cuda-bindings`` docstrings when available
+  (bindings 12.9.6+ or 13.2.0+), with frozen tables as fallback for older versions.
+- Improved optional dependency handling for NVVM and nvJitLink imports so that only genuinely
+  missing optional modules are treated as unavailable; unrelated import failures now surface
+  normally, and ``cuda.core`` now depends directly on ``cuda-pathfinder``.
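
Note on the bare ``except`` fix listed in the notes above: the sketch below is not the cuda.core code. It is a minimal, pure-Python illustration of why a bare ``except`` hides ``KeyboardInterrupt`` while a narrow clause lets it propagate. ``ProtocolNotSupported``, ``get_stream``, and both ``accept_stream_*`` helpers are hypothetical names invented for this example.

```python
class ProtocolNotSupported(Exception):
    """Hypothetical stand-in for the expected 'protocol not supported' case."""


def accept_stream_bare(obj):
    # Anti-pattern: a bare except also traps KeyboardInterrupt and SystemExit.
    try:
        return obj.get_stream()
    except:  # noqa: E722
        return None


def accept_stream_narrow(obj):
    # Only the expected failure is handled; everything else propagates.
    try:
        return obj.get_stream()
    except ProtocolNotSupported:
        return None


class Interrupted:
    """Simulates the user pressing Ctrl-C while the stream is being queried."""

    def get_stream(self):
        raise KeyboardInterrupt


class Unsupported:
    """Simulates an object that does not implement the stream protocol."""

    def get_stream(self):
        raise ProtocolNotSupported


def propagates_interrupt(accept):
    # True if KeyboardInterrupt escapes the given accept function.
    try:
        accept(Interrupted())
    except KeyboardInterrupt:
        return True
    return False
```

With the bare clause the interrupt is silently converted into ``None``; with the narrow clause it reaches the caller, while the unsupported-protocol case is still handled.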
diff --git a/cuda_core/docs/source/release/0.7.x-notes.rst b/cuda_core/docs/source/release/0.7.x-notes.rst
diff --git a/cuda_core/pixi.toml b/cuda_core/pixi.toml
@@ -107,7 +107,7 @@ examples = { features = ["cu13", "examples", "local-deps"], solve-group = "examp
 # TODO: check if these can be extracted from pyproject.toml
 [package]
 name = "cuda-core"
-version = "0.6.0"
+version = "0.7.0"
 
 [package.build]
 backend = { name = "pixi-build-python", version = "*" }
diff --git a/cuda_core/pyproject.toml b/cuda_core/pyproject.toml
@@ -19,7 +19,7 @@ dynamic = [
     "readme",
 ]
 requires-python = '>=3.10'
-description = "cuda.core: (experimental) pythonic CUDA module"
+description = "cuda.core: pythonic CUDA module"
 authors = [
     { name = "NVIDIA Corporation" }
 ]
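
Note on the nv-versions.json change in this commit: the new ``0.7.0`` entry is placed directly after ``latest`` and ahead of ``0.6.0``. The sketch below shows one way that step could be scripted for future releases. It assumes the file is a top-level JSON array of ``{"version", "url"}`` objects, which the hunk suggests but does not show in full. ``add_version`` is a hypothetical helper, not something in the repository.

```python
import json


def add_version(entries, version, base_url):
    """Return a new list with `version` inserted right after the "latest" entry."""
    if any(e["version"] == version for e in entries):
        return list(entries)  # already listed, nothing to do
    new_entry = {"version": version, "url": f"{base_url}/{version}/"}
    # Insert after "latest" if present, otherwise at the front of the list.
    idx = next(
        (i + 1 for i, e in enumerate(entries) if e["version"] == "latest"), 0
    )
    return entries[:idx] + [new_entry] + entries[idx:]


BASE = "https://nvidia.github.io/cuda-python/cuda-core"
before = [
    {"version": "latest", "url": f"{BASE}/latest/"},
    {"version": "0.6.0", "url": f"{BASE}/0.6.0/"},
]
after = add_version(before, "0.7.0", BASE)
print(json.dumps(after, indent=4))
```

The helper is idempotent, so running it twice for the same release leaves the list unchanged.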
