Commit 974ed1d
Add recent cuda.core examples and docs pixi workflow (#1865)
Cover newer graph, memory resource, and StridedMemoryView APIs with runnable examples, and make the Sphinx docs environment reproducible so examples and docs are easier to iterate locally. Made-with: Cursor
1 parent a56ff12 commit 974ed1d

File tree

9 files changed: +3943 additions, -109 deletions

cuda_core/docs/README.md

Lines changed: 7 additions & 0 deletions

@@ -5,6 +5,13 @@
 3. Build the docs with `./build_docs.sh`.
 4. The html artifacts should be available under both `./build/html/latest` and `./build/html/<version>`.
 
+For local development, `cuda_core/pixi.toml` now includes a dedicated `docs`
+environment that mirrors the CI Sphinx dependencies:
+
+- From `cuda_core/`, run `pixi run docs-build` to build the full versioned docs output.
+- Run `pixi run docs-build-latest` to iterate on just the `latest` docs.
+- Run `pixi run docs-debug` for a serial, verbose Sphinx build that is easier to debug.
+
 Alternatively, we can build all the docs at once by running [`cuda_python/docs/build_all_docs.sh`](../../cuda_python/docs/build_all_docs.sh).
 
 To publish the docs with the built version, it is important to note that the html files of older versions

cuda_core/docs/build_docs.sh

Lines changed: 8 additions & 7 deletions

@@ -5,6 +5,9 @@
 
 set -ex
 
+SCRIPT_DIR=$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)
+cd "${SCRIPT_DIR}"
+
 if [[ "$#" == "0" ]]; then
     LATEST_ONLY="0"
 elif [[ "$#" == "1" && "$1" == "latest-only" ]]; then
@@ -21,13 +24,11 @@ if [[ -z "${SPHINX_CUDA_CORE_VER}" ]]; then
         | awk -F'+' '{print $1}')
 fi
 
-# build the docs (in parallel)
-SPHINXOPTS="-j 4 -d build/.doctrees" make html
-
-# for debugging/developing (conf.py), please comment out the above line and
-# use the line below instead, as we must build in serial to avoid getting
-# obsecure Sphinx errors
-#SPHINXOPTS="-v" make html
+# build the docs. Allow callers to override SPHINXOPTS for serial/debug runs.
+if [[ -z "${SPHINXOPTS:-}" ]]; then
+    SPHINXOPTS="-j 4 -d build/.doctrees"
+fi
+make html
 
 # to support version dropdown menu
 cp ./versions.json build/html
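The default-if-unset pattern the patch introduces can be exercised on its own. A minimal standalone sketch (not the actual build script; the `SPHINXOPTS` default value is the one the script uses):

```shell
# Callers may pre-set SPHINXOPTS (e.g. SPHINXOPTS="-v" for a serial, verbose
# debug run); otherwise the parallel default kicks in.
unset SPHINXOPTS
if [[ -z "${SPHINXOPTS:-}" ]]; then
  SPHINXOPTS="-j 4 -d build/.doctrees"   # parallel default
fi
DEFAULT_OPTS="${SPHINXOPTS}"
echo "default: ${DEFAULT_OPTS}"

SPHINXOPTS="-v"                          # simulate a caller override
if [[ -z "${SPHINXOPTS:-}" ]]; then
  SPHINXOPTS="-j 4 -d build/.doctrees"
fi
echo "override: ${SPHINXOPTS}"
```

With this pattern, something like `SPHINXOPTS="-v" ./build_docs.sh latest-only` reproduces the serial debug build that previously required editing the script by hand.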

cuda_core/docs/source/interoperability.rst

Lines changed: 10 additions & 7 deletions

@@ -68,13 +68,16 @@ a few iterations to ensure correctness.
 
 ``cuda.core`` offers a :func:`~utils.args_viewable_as_strided_memory` decorator for
 extracting the metadata (such as pointer address, shape, strides, and dtype) from any
-Python objects supporting either CAI or DLPack and returning a :class:`~utils.StridedMemoryView` object, see the
-`strided_memory_view.py <https://github.com/NVIDIA/cuda-python/blob/main/cuda_core/examples/strided_memory_view.py>`_
-example. Alternatively, a :class:`~utils.StridedMemoryView` object can be explicitly
-constructed without using the decorator. This provides a *concrete implementation* to both
-protocols that is **array-library-agnostic**, so that all Python projects can just rely on this
-without either re-implementing (the consumer-side of) the protocols or tying to any particular
-array libraries.
+Python objects supporting either CAI or DLPack and returning a :class:`~utils.StridedMemoryView`
+object. See the
+`strided_memory_view_constructors.py <https://github.com/NVIDIA/cuda-python/blob/main/cuda_core/examples/strided_memory_view_constructors.py>`_
+example for the explicit constructors, or
+`strided_memory_view_cpu.py <https://github.com/NVIDIA/cuda-python/blob/main/cuda_core/examples/strided_memory_view_cpu.py>`_
+and
+`strided_memory_view_gpu.py <https://github.com/NVIDIA/cuda-python/blob/main/cuda_core/examples/strided_memory_view_gpu.py>`_
+for decorator-based workflows. This provides a *concrete implementation* to both protocols that is
+**array-library-agnostic**, so that all Python projects can just rely on this without either
+re-implementing (the consumer-side of) the protocols or tying to any particular array libraries.
 
 The :attr:`~utils.StridedMemoryView.is_device_accessible` attribute can be used to check
 whether or not the underlying buffer can be accessed on GPU.
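For readers unfamiliar with what the consumer side of these protocols actually reads, here is a small pure-NumPy sketch (no ``cuda.core`` required; ``__array_interface__`` is the host-side analogue of CAI) of the metadata that a view type like ``StridedMemoryView`` extracts: pointer address, shape, strides, and dtype.

```python
import numpy as np

# A host array advertising the (CPU) array interface; CAI is the CUDA analogue.
a = np.arange(12, dtype=np.int16).reshape(3, 4)
iface = a.__array_interface__

ptr, readonly = iface["data"]        # raw pointer address and a read-only flag
print("shape:  ", iface["shape"])    # (3, 4)
print("typestr:", iface["typestr"])  # e.g. '<i2' on little-endian platforms
print("strides:", a.strides)         # byte strides; (8, 2) for C-contiguous int16
print("pointer:", hex(ptr))
```

Note that ``iface["strides"]`` is ``None`` for C-contiguous arrays, so the sketch reads ``a.strides`` instead; a real consumer must handle both cases.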

cuda_core/examples/graph_update.py

Lines changed: 100 additions & 0 deletions (new file)

# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# SPDX-License-Identifier: Apache-2.0

# ################################################################################
#
# This example demonstrates Graph.update() by reusing the same executable graph
# with a new capture that has the same topology but different kernel arguments.
#
# ################################################################################

# /// script
# dependencies = ["cuda_bindings", "cuda_core", "nvidia-cuda-nvrtc", "numpy>=2.1"]
# ///

import sys

import numpy as np

from cuda.core import (
    Device,
    LaunchConfig,
    LegacyPinnedMemoryResource,
    Program,
    ProgramOptions,
    launch,
)

code = """
extern "C" __global__ void add_one(int* value) {
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        *value += 1;
    }
}
"""


def build_increment_graph(device, kernel, target_ptr):
    builder = device.create_graph_builder().begin_building()
    config = LaunchConfig(grid=1, block=1)
    launch(builder, config, kernel, target_ptr)
    launch(builder, config, kernel, target_ptr)
    return builder.end_building()


def main():
    if np.lib.NumpyVersion(np.__version__) < "2.1.0":
        print("This example requires NumPy 2.1.0 or later", file=sys.stderr)
        sys.exit(1)

    device = Device()
    device.set_current()
    stream = device.create_stream()
    pinned_mr = LegacyPinnedMemoryResource()
    buffer = None
    initial_capture = None
    update_capture = None
    graph = None

    try:
        options = ProgramOptions(std="c++17", arch=f"sm_{device.arch}")
        program = Program(code, code_type="c++", options=options)
        module = program.compile("cubin")
        kernel = module.get_kernel("add_one")

        buffer = pinned_mr.allocate(2 * np.dtype(np.int32).itemsize)
        values = np.from_dlpack(buffer).view(np.int32)
        values[:] = 0

        initial_capture = build_increment_graph(device, kernel, values[0:].ctypes.data)
        update_capture = build_increment_graph(device, kernel, values[1:].ctypes.data)
        graph = initial_capture.complete()

        graph.upload(stream)
        graph.launch(stream)
        stream.sync()
        assert tuple(values) == (2, 0)

        graph.update(update_capture)
        graph.upload(stream)
        graph.launch(stream)
        stream.sync()
        assert tuple(values) == (2, 2)

        print("Graph.update() reused the executable graph with a new target pointer.")
        print(f"Final host values: {tuple(values)}")
    finally:
        if graph is not None:
            graph.close()
        if update_capture is not None:
            update_capture.close()
        if initial_capture is not None:
            initial_capture.close()
        if buffer is not None:
            buffer.close()
        stream.close()


if __name__ == "__main__":
    main()
Lines changed: 141 additions & 0 deletions (new file)

# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# SPDX-License-Identifier: Apache-2.0

# ################################################################################
#
# This example demonstrates the newer memory-pool APIs by combining
# PinnedMemoryResource, ManagedMemoryResource, and GraphMemoryResource in one
# workflow.
#
# ################################################################################

# /// script
# dependencies = ["cuda_bindings", "cuda_core", "nvidia-cuda-nvrtc", "numpy>=2.1"]
# ///

import sys

import numpy as np

from cuda.core import (
    Device,
    GraphMemoryResource,
    LaunchConfig,
    ManagedMemoryResource,
    ManagedMemoryResourceOptions,
    PinnedMemoryResource,
    PinnedMemoryResourceOptions,
    Program,
    ProgramOptions,
    launch,
)

code = """
extern "C" __global__ void scale_and_bias(float* data, size_t size, float scale, float bias) {
    const unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
    const unsigned int stride = blockDim.x * gridDim.x;
    for (size_t i = tid; i < size; i += stride) {
        data[i] = data[i] * scale + bias;
    }
}
"""


def main():
    if np.lib.NumpyVersion(np.__version__) < "2.1.0":
        print("This example requires NumPy 2.1.0 or later", file=sys.stderr)
        sys.exit(1)

    device = Device()
    device.set_current()
    stream = device.create_stream()

    managed_mr = None
    pinned_mr = None
    graph_mr = None
    managed_buffer = None
    pinned_buffer = None
    graph_capture = None
    graph = None

    try:
        options = ProgramOptions(std="c++17", arch=f"sm_{device.arch}")
        program = Program(code, code_type="c++", options=options)
        module = program.compile("cubin")
        kernel = module.get_kernel("scale_and_bias")

        size = 256
        dtype = np.float32
        nbytes = size * dtype().itemsize
        config = LaunchConfig(grid=(size + 127) // 128, block=128)

        managed_options = ManagedMemoryResourceOptions(
            preferred_location=device.device_id,
            preferred_location_type="device",
        )
        managed_mr = ManagedMemoryResource(options=managed_options)

        pinned_options = {"ipc_enabled": False}
        host_numa_id = getattr(device.properties, "host_numa_id", -1)
        if host_numa_id >= 0:
            pinned_options["numa_id"] = host_numa_id
        pinned_mr = PinnedMemoryResource(options=PinnedMemoryResourceOptions(**pinned_options))

        graph_mr = GraphMemoryResource(device)

        managed_buffer = managed_mr.allocate(nbytes, stream=stream)
        pinned_buffer = pinned_mr.allocate(nbytes, stream=stream)

        managed_array = np.from_dlpack(managed_buffer).view(np.float32)
        pinned_array = np.from_dlpack(pinned_buffer).view(np.float32)

        managed_array[:] = np.arange(size, dtype=dtype)
        managed_original = managed_array.copy()
        stream.sync()

        managed_buffer.copy_to(pinned_buffer, stream=stream)
        stream.sync()
        assert np.array_equal(pinned_array, managed_original)

        graph_builder = device.create_graph_builder().begin_building("relaxed")
        scratch_buffer = graph_mr.allocate(nbytes, stream=graph_builder)
        scratch_buffer.copy_from(managed_buffer, stream=graph_builder)
        launch(graph_builder, config, kernel, scratch_buffer, np.uint64(size), np.float32(2.0), np.float32(1.0))
        managed_buffer.copy_from(scratch_buffer, stream=graph_builder)
        scratch_buffer.close()

        graph_capture = graph_builder.end_building()
        graph = graph_capture.complete()
        graph.upload(stream)
        graph.launch(stream)
        stream.sync()

        np.testing.assert_allclose(managed_array, managed_original * 2 + 1)
        managed_buffer.copy_to(pinned_buffer, stream=stream)
        stream.sync()
        np.testing.assert_allclose(pinned_array, managed_original * 2 + 1)

        print(f"PinnedMemoryResource numa_id: {pinned_mr.numa_id}")
        print(f"ManagedMemoryResource preferred_location: {managed_mr.preferred_location}")
        print(f"GraphMemoryResource reserved high watermark: {graph_mr.attributes.reserved_mem_high}")
    finally:
        if graph is not None:
            graph.close()
        if graph_capture is not None:
            graph_capture.close()
        if pinned_buffer is not None:
            pinned_buffer.close(stream)
        if managed_buffer is not None:
            managed_buffer.close(stream)
        if graph_mr is not None:
            graph_mr.close()
        if pinned_mr is not None:
            pinned_mr.close()
        if managed_mr is not None:
            managed_mr.close()
        stream.close()


if __name__ == "__main__":
    main()
Lines changed: 84 additions & 0 deletions (new file)

# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# SPDX-License-Identifier: Apache-2.0

# ################################################################################
#
# This example demonstrates the explicit StridedMemoryView constructors for
# __array_interface__, DLPack, __cuda_array_interface__, and Buffer objects.
#
# ################################################################################

# /// script
# dependencies = ["cuda_bindings", "cuda_core", "cupy-cuda13x", "numpy>=2.1"]
# ///

import sys

import cupy as cp
import numpy as np

from cuda.core import Device
from cuda.core.utils import StridedMemoryView


def dense_c_strides(shape):
    if not shape:
        return ()

    strides = [1] * len(shape)
    for index in range(len(shape) - 2, -1, -1):
        strides[index] = strides[index + 1] * shape[index + 1]
    return tuple(strides)


def main():
    if np.lib.NumpyVersion(np.__version__) < "2.1.0":
        print("This example requires NumPy 2.1.0 or later", file=sys.stderr)
        sys.exit(1)

    device = Device()
    device.set_current()
    stream = device.create_stream()
    buffer = None

    try:
        host_array = np.arange(12, dtype=np.int16).reshape(3, 4)
        host_view = StridedMemoryView.from_array_interface(host_array)
        host_dlpack_view = StridedMemoryView.from_dlpack(host_array, stream_ptr=-1)

        assert host_view.shape == host_array.shape
        assert host_view.size == host_array.size
        assert not host_view.is_device_accessible
        assert np.array_equal(np.from_dlpack(host_view), host_array)
        assert np.array_equal(np.from_dlpack(host_dlpack_view), host_array)

        gpu_array = cp.arange(12, dtype=cp.float32).reshape(3, 4)
        dlpack_view = StridedMemoryView.from_dlpack(gpu_array, stream_ptr=stream.handle)
        cai_view = StridedMemoryView.from_cuda_array_interface(gpu_array, stream_ptr=stream.handle)

        cp.testing.assert_array_equal(cp.from_dlpack(dlpack_view), gpu_array)
        cp.testing.assert_array_equal(cp.from_dlpack(cai_view), gpu_array)

        buffer = device.memory_resource.allocate(gpu_array.nbytes, stream=stream)
        buffer_array = cp.from_dlpack(buffer).view(dtype=cp.float32).reshape(gpu_array.shape)
        buffer_array[...] = gpu_array
        device.sync()

        buffer_view = StridedMemoryView.from_buffer(
            buffer,
            shape=gpu_array.shape,
            strides=dense_c_strides(gpu_array.shape),
            dtype=np.dtype(np.float32),
        )
        cp.testing.assert_array_equal(cp.from_dlpack(buffer_view), gpu_array)

        print("Constructed StridedMemoryView objects from array, DLPack, CAI, and Buffer inputs.")
    finally:
        if buffer is not None:
            buffer.close(stream)
        stream.close()


if __name__ == "__main__":
    main()
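The `dense_c_strides` helper in this file returns element strides, not byte strides. A quick standalone check of the same logic against NumPy (which reports byte strides, so we divide by the itemsize to compare):

```python
import numpy as np


def dense_c_strides(shape):
    # Same logic as the helper in the example: C-contiguous strides in elements,
    # computed right-to-left (innermost dimension has stride 1).
    if not shape:
        return ()
    strides = [1] * len(shape)
    for index in range(len(shape) - 2, -1, -1):
        strides[index] = strides[index + 1] * shape[index + 1]
    return tuple(strides)


a = np.arange(24, dtype=np.float32).reshape(2, 3, 4)
# NumPy reports byte strides; divide by itemsize to get element strides.
assert dense_c_strides(a.shape) == tuple(s // a.itemsize for s in a.strides)
print(dense_c_strides((2, 3, 4)))  # (12, 4, 1)
```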
