Merged
Changes from 2 commits
4 changes: 2 additions & 2 deletions cuda_core/README.md
@@ -1,10 +1,10 @@
-# `cuda.core`: (experimental) Pythonic CUDA module
+# `cuda.core`: Pythonic CUDA module

Currently under active development; see [the documentation](https://nvidia.github.io/cuda-python/cuda-core/latest/) for more details.

## Installing

-Please refer to the [Installation page](https://nvidia.github.io/cuda-python/cuda-bindings/latest/install.html) for instructions and required/optional dependencies.
+Please refer to the [Installation page](https://nvidia.github.io/cuda-python/cuda-core/latest/install.html) for instructions and required/optional dependencies.

## Developing

4 changes: 4 additions & 0 deletions cuda_core/docs/nv-versions.json
@@ -3,6 +3,10 @@
"version": "latest",
"url": "https://nvidia.github.io/cuda-python/cuda-core/latest/"
},
{
"version": "0.7.0",
"url": "https://nvidia.github.io/cuda-python/cuda-core/0.7.0/"
},
{
"version": "0.6.0",
"url": "https://nvidia.github.io/cuda-python/cuda-core/0.6.0/"
26 changes: 26 additions & 0 deletions cuda_core/docs/source/api.rst
@@ -129,6 +129,32 @@ Each subclass exposes attributes unique to its operation type.
graph.SwitchNode


Graphics interoperability
-------------------------

.. autosummary::
   :toctree: generated/

   :template: autosummary/cyclass.rst

   GraphicsResource


Tensor Memory Accelerator (TMA)
-------------------------------

.. autosummary::
   :toctree: generated/

   :template: autosummary/cyclass.rst

   TensorMapDescriptor

   :template: dataclass.rst

   TensorMapDescriptorOptions


CUDA compilation toolchain
--------------------------

127 changes: 127 additions & 0 deletions cuda_core/docs/source/release/0.7.0-notes.rst
@@ -0,0 +1,127 @@
.. SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
.. SPDX-License-Identifier: Apache-2.0

.. currentmodule:: cuda.core

``cuda.core`` 0.7.0 Release Notes
=================================


Highlights
----------

- Introduced support for explicit graph construction. CUDA graphs can now be
  built programmatically by adding nodes and edges, and their topology can be
  modified after construction.
- Added CUDA-Graphics (OpenGL) interoperability support, enabling zero-copy
  sharing of GPU memory between CUDA compute kernels and OpenGL renderers.
- Added :class:`TensorMapDescriptor` for Hopper+ TMA (Tensor Memory Accelerator)
  bulk data movement, with automatic kernel argument integration.
- :class:`~utils.StridedMemoryView` now supports DLPack export via
  ``__dlpack__`` / ``__dlpack_device__`` and the C exchange API.
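
The idea behind explicit graph construction (add nodes, add edges, change the topology later) can be sketched in plain Python. This is only a conceptual illustration: ``ToyGraph`` and its methods are hypothetical names invented here, they are not the ``cuda.core`` API, and no GPU work is performed. The standard-library ``graphlib`` module stands in for dependency-ordered execution.

```python
from graphlib import TopologicalSorter


class ToyGraph:
    """Hypothetical stand-in for an explicitly built graph (not cuda.core)."""

    def __init__(self):
        self._work = {}  # node name -> callable to run
        self._deps = {}  # node name -> set of predecessor names

    def add_node(self, name, fn):
        if name in self._work:
            raise ValueError(f"duplicate node: {name}")
        self._work[name] = fn
        self._deps[name] = set()

    def add_edge(self, src, dst):
        # dst may only run after src has finished.
        if src not in self._work or dst not in self._work:
            raise KeyError("both endpoints must be added as nodes first")
        self._deps[dst].add(src)

    def remove_edge(self, src, dst):
        # Topology can still be changed after construction.
        self._deps[dst].discard(src)

    def run(self):
        # Execute every node in an order that respects all edges.
        order = TopologicalSorter(self._deps).static_order()
        return [(name, self._work[name]()) for name in order]


graph = ToyGraph()
graph.add_node("copy_in", lambda: "h2d")
graph.add_node("kernel", lambda: "compute")
graph.add_node("copy_out", lambda: "d2h")
graph.add_edge("copy_in", "kernel")
graph.add_edge("kernel", "copy_out")
order = [name for name, _ in graph.run()]
```

With the two edges above, the only valid execution order is ``copy_in``, ``kernel``, ``copy_out``. For the real node types and methods, refer to the :mod:`cuda.core.graph` API reference.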


Breaking changes
----------------

- Building ``cuda.core`` from source now requires ``cuda-bindings`` >= 12.9.0, due to Cython-level
  dependencies on the NVVM and nvJitLink bindings (``cynvvm``, ``cynvjitlink``). Pre-built wheels
  are unaffected. The previous minimum was 12.8.0.
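
Before building from source, it may help to confirm that the installed ``cuda-bindings`` meets the new minimum. The sketch below uses only the standard library. The helper names are hypothetical, and the version parsing is deliberately simplified: it compares the leading numeric components only and ignores pre-release or local suffixes.

```python
import re
from importlib.metadata import PackageNotFoundError, version

MIN_BINDINGS = (12, 9, 0)  # minimum stated in these release notes


def parse_release(text):
    """Return the leading numeric components, e.g. '12.9.0' -> (12, 9, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"not a version string: {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed, minimum=MIN_BINDINGS):
    """Simplified check: numeric components only, suffixes are ignored."""
    release = parse_release(installed)
    # Pad short versions such as '13.0' so tuples compare component-wise.
    release += (0,) * (len(minimum) - len(release))
    return release >= minimum


def installed_bindings_version():
    """Return the installed cuda-bindings version string, or None if absent."""
    try:
        return version("cuda-bindings")
    except PackageNotFoundError:
        return None


found = installed_bindings_version()
if found is None:
    print("cuda-bindings is not installed")
elif not meets_minimum(found):
    print(f"cuda-bindings {found} is older than the required 12.9.0")
```

A real build would more likely rely on the packaging metadata to enforce this constraint; the check above is only a convenience for diagnosing an environment by hand.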


New features
------------

- Added the :mod:`cuda.core.graph` public module containing
  :class:`~graph.GraphDef` for explicit graph construction, typed node
  subclasses, and supporting types. :class:`~graph.GraphBuilder` (stream
  capture) also moves into this module.

- Added :meth:`~graph.GraphBuilder.callback` for CPU callbacks during stream
  capture, mirroring the existing :meth:`~graph.GraphDef.callback` API.

- Added :class:`GraphicsResource` for CUDA-OpenGL interoperability.
  Factory classmethods :meth:`~GraphicsResource.from_gl_buffer` and
  :meth:`~GraphicsResource.from_gl_image` register OpenGL objects for CUDA
  access, and mapping returns a :class:`Buffer` for zero-copy kernel use.

- Added :class:`TensorMapDescriptor` wrapping the CUDA driver's ``CUtensorMap``
  for Hopper+ TMA (Tensor Memory Accelerator) bulk data movement. Supports tiled
  and im2col descriptor creation via :meth:`~TensorMapDescriptor.from_tiled` and
  :meth:`~TensorMapDescriptor.from_im2col`, with automatic dtype inference, stride
  computation, and first-class kernel argument integration.

- Added DLPack export support to :class:`~utils.StridedMemoryView` via
  ``__dlpack__`` and ``__dlpack_device__``, complementing the existing import
  path. The DLPack C exchange API (``__dlpack_c_exchange_api__``) is also
  exposed. The vendored ``dlpack.h`` has been updated to DLPack v1.3.

- Added NVRTC precompiled header (PCH) runtime APIs to :class:`Program`:
  :meth:`~Program.get_pch_create_status`, :meth:`~Program.get_pch_heap_size_required`,
  :meth:`~Program.get_pch_heap_size` (static), and :meth:`~Program.set_pch_heap_size`
  (static). Requires NVRTC 12.8+.

- Added ``preferred_location_type`` option to :class:`ManagedMemoryResourceOptions`
  for explicit control over the preferred location kind (``"device"``,
  ``"host"``, or ``"host_numa"``). This enables NUMA-aware managed memory
  pool placement. The existing ``preferred_location`` parameter retains full
  backwards compatibility when ``preferred_location_type`` is not set.

- Added :attr:`ManagedMemoryResource.preferred_location` property to query the
  resolved preferred location of a managed memory pool. Returns ``None`` for no
  preference, or a tuple such as ``("device", 0)``, ``("host", None)``, or
  ``("host_numa", 3)``.

- Added ``numa_id`` option to :class:`PinnedMemoryResourceOptions` for explicit
  control over host NUMA node placement. When ``ipc_enabled=True`` and
  ``numa_id`` is not set, the NUMA node is automatically derived from the
  current CUDA device.

- Added :attr:`PinnedMemoryResource.numa_id` property to query the host NUMA
  node ID used for pool placement. Returns ``-1`` for OS-managed placement.

- Added support for CUDA 13.2.
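
The return values documented above for :attr:`ManagedMemoryResource.preferred_location` and :attr:`PinnedMemoryResource.numa_id` can be interpreted as in the following sketch. The two helper functions are hypothetical and exist only for illustration. They operate on plain values shaped like the documented results, so nothing here calls into ``cuda.core`` or requires a GPU.

```python
def describe_preferred_location(location):
    """Turn a preferred-location value (None or a (kind, id) tuple) into text."""
    if location is None:
        return "no preference"
    kind, ident = location
    if kind == "device":
        return f"device {ident}"
    if kind == "host":
        return "host"  # the id slot is None for plain host placement
    if kind == "host_numa":
        return f"host NUMA node {ident}"
    raise ValueError(f"unexpected preferred location kind: {kind!r}")


def describe_pinned_numa_id(numa_id):
    """Interpret a pinned-pool NUMA id, where -1 means OS-managed placement."""
    if numa_id == -1:
        return "OS-managed placement"
    return f"host NUMA node {numa_id}"


# Values shaped like the examples given in these notes.
for value in (None, ("device", 0), ("host", None), ("host_numa", 3)):
    print(describe_preferred_location(value))
print(describe_pinned_numa_id(-1))
```

Checking for ``None`` before unpacking matters because a pool with no preference does not return a tuple.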


New examples
------------

- ``gl_interop_plasma.py``: Real-time plasma effect demonstrating CUDA-OpenGL
  interoperability via :class:`GraphicsResource`.
- ``tma_tensor_map.py``: TMA bulk data movement using
  :class:`TensorMapDescriptor` on Hopper+ GPUs.


Fixes and enhancements
----------------------

- Fixed managed memory buffers being misclassified as ``kDLCUDAHost`` in DLPack
  device mapping. They are now correctly reported as ``kDLCUDAManaged``.
  (:issue:`1863`)
- Fixed IPC-enabled pinned memory pools using a hardcoded NUMA node ID of ``0``
  instead of the NUMA node closest to the active CUDA device. On multi-NUMA
  systems where the device is attached to a non-zero host NUMA node, this could
  cause pool creation or allocation failures. (:issue:`1603`)
- Fixed :attr:`DeviceMemoryResource.peer_accessible_by` returning stale results when wrapping
  a non-owned (default) memory pool. The property now always queries the CUDA driver for
  non-owned pools, so multiple wrappers around the same pool see consistent state. (:issue:`1720`)
- Fixed a bare ``except`` clause in stream acceptance that silently swallowed all exceptions,
  including ``KeyboardInterrupt`` and ``SystemExit``. Only the expected "protocol not
  supported" case is now caught. (:issue:`1631`)
- :class:`~utils.StridedMemoryView` now validates strides at construction time so unsupported
  layouts fail immediately instead of on first metadata access. (:issue:`1429`)
- IPC file descriptor cleanup now uses a C++ ``shared_ptr`` with a POSIX deleter, avoiding
  cryptic errors when a :class:`DeviceMemoryResource` is destroyed during Python shutdown.
- Improved error message when ``ManagedMemoryResource()`` is called without options on platforms
  that lack a default managed memory pool (e.g. WSL2). (:issue:`1617`)
- Handle properties on core API objects now return ``None`` during Python shutdown instead of
  crashing.
- Reduced Python overhead in :class:`Program` and :class:`Linker` by moving compilation and
  linking operations to the C level and releasing the GIL during backend calls. This benefits
  workloads that create many programs or linkers, and enables concurrent compilation in
  multithreaded applications.
- Error enum explanations are now derived from ``cuda-bindings`` docstrings when available
  (bindings 12.9.6+ or 13.2.0+), with frozen tables as fallback for older versions.
- Improved optional dependency handling for NVVM and nvJitLink imports: only genuinely
  missing optional modules are treated as unavailable, and unrelated import failures now
  surface normally. ``cuda.core`` also now depends directly on ``cuda-pathfinder``.
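
The optional-dependency behavior described in the last item follows a common Python pattern, sketched below. This is not the ``cuda.core`` implementation: ``import_optional`` is a hypothetical helper, and the module names used in the demonstration are placeholders. The point is to swallow ``ModuleNotFoundError`` only when the module that is missing is the optional one itself, and to re-raise everything else.

```python
import importlib


def import_optional(module_name):
    """Import an optional module; return None only if that module is absent."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        missing = exc.name or ""
        # Treat as "unavailable" only when the optional module itself, or one
        # of its parent packages, is what could not be found.
        if missing == module_name or module_name.startswith(missing + "."):
            return None
        # The optional module exists but failed on an unrelated missing
        # import, so let the error surface instead of hiding it.
        raise


json_module = import_optional("json")  # present in the standard library
absent = import_optional("definitely_not_installed_module_xyz")  # placeholder name
```

Catching the narrow ``ModuleNotFoundError`` rather than using a bare ``except`` also avoids masking ``KeyboardInterrupt`` and ``SystemExit``, which is the same concern addressed by the stream acceptance fix listed above.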
76 changes: 0 additions & 76 deletions cuda_core/docs/source/release/0.7.x-notes.rst

This file was deleted.

2 changes: 1 addition & 1 deletion cuda_core/pixi.toml
@@ -107,7 +107,7 @@ examples = { features = ["cu13", "examples", "local-deps"], solve-group = "examp
# TODO: check if these can be extracted from pyproject.toml
[package]
name = "cuda-core"
-version = "0.6.0"
+version = "0.7.0"

[package.build]
backend = { name = "pixi-build-python", version = "*" }
2 changes: 1 addition & 1 deletion cuda_core/pyproject.toml
@@ -19,7 +19,7 @@ dynamic = [
"readme",
]
requires-python = '>=3.10'
-description = "cuda.core: (experimental) pythonic CUDA module"
+description = "cuda.core: pythonic CUDA module"
authors = [
{ name = "NVIDIA Corporation" }
]