Commit 7d5297f
Add recent cuda.core examples and docs pixi workflow
Cover newer graph, memory resource, and StridedMemoryView APIs with runnable examples, and make the Sphinx docs environment reproducible so examples and docs are easier to iterate locally.

Made-with: Cursor

1 parent c2f79a1

File tree: 9 files changed, +3943 −109 lines

cuda_core/docs/README.md (7 additions, 0 deletions)

@@ -5,6 +5,13 @@
 3. Build the docs with `./build_docs.sh`.
 4. The html artifacts should be available under both `./build/html/latest` and `./build/html/<version>`.
+
+For local development, `cuda_core/pixi.toml` now includes a dedicated `docs`
+environment that mirrors the CI Sphinx dependencies:
+
+- From `cuda_core/`, run `pixi run docs-build` to build the full versioned docs output.
+- Run `pixi run docs-build-latest` to iterate on just the `latest` docs.
+- Run `pixi run docs-debug` for a serial, verbose Sphinx build that is easier to debug.
 
 Alternatively, we can build all the docs at once by running [`cuda_python/docs/build_all_docs.sh`](../../cuda_python/docs/build_all_docs.sh).
 
 To publish the docs with the built version, it is important to note that the html files of older versions

cuda_core/docs/build_docs.sh (8 additions, 7 deletions)

@@ -5,6 +5,9 @@
 
 set -ex
 
+SCRIPT_DIR=$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)
+cd "${SCRIPT_DIR}"
+
 if [[ "$#" == "0" ]]; then
     LATEST_ONLY="0"
 elif [[ "$#" == "1" && "$1" == "latest-only" ]]; then
@@ -21,13 +24,11 @@ if [[ -z "${SPHINX_CUDA_CORE_VER}" ]]; then
         | awk -F'+' '{print $1}')
 fi
 
-# build the docs (in parallel)
-SPHINXOPTS="-j 4 -d build/.doctrees" make html
-
-# for debugging/developing (conf.py), please comment out the above line and
-# use the line below instead, as we must build in serial to avoid getting
-# obscure Sphinx errors
-#SPHINXOPTS="-v" make html
+# build the docs. Allow callers to override SPHINXOPTS for serial/debug runs.
+if [[ -z "${SPHINXOPTS:-}" ]]; then
+    SPHINXOPTS="-j 4 -d build/.doctrees"
+fi
+make html
 
 # to support version dropdown menu
 cp ./versions.json build/html
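The defaulting pattern this change introduces can be sketched in isolation (the snippet below is a standalone illustration, not part of the commit): a caller-provided `SPHINXOPTS` survives untouched, while an unset one falls back to the parallel default.

```shell
# Keep a caller-provided SPHINXOPTS; otherwise fall back to the
# parallel default. Same pattern as in build_docs.sh, shown standalone.
if [[ -z "${SPHINXOPTS:-}" ]]; then
  SPHINXOPTS="-j 4 -d build/.doctrees"
fi
echo "${SPHINXOPTS}"
```

With this in place, `SPHINXOPTS="-v" ./build_docs.sh` reproduces the old serial debug build without editing the script.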

cuda_core/docs/source/interoperability.rst (10 additions, 7 deletions)

@@ -68,13 +68,16 @@ a few iterations to ensure correctness.
 
 ``cuda.core`` offers a :func:`~utils.args_viewable_as_strided_memory` decorator for
 extracting the metadata (such as pointer address, shape, strides, and dtype) from any
-Python objects supporting either CAI or DLPack and returning a :class:`~utils.StridedMemoryView` object, see the
-`strided_memory_view.py <https://github.com/NVIDIA/cuda-python/blob/main/cuda_core/examples/strided_memory_view.py>`_
-example. Alternatively, a :class:`~utils.StridedMemoryView` object can be explicitly
-constructed without using the decorator. This provides a *concrete implementation* to both
-protocols that is **array-library-agnostic**, so that all Python projects can just rely on this
-without either re-implementing (the consumer-side of) the protocols or tying to any particular
-array libraries.
+Python objects supporting either CAI or DLPack and returning a :class:`~utils.StridedMemoryView`
+object. See the
+`strided_memory_view_constructors.py <https://github.com/NVIDIA/cuda-python/blob/main/cuda_core/examples/strided_memory_view_constructors.py>`_
+example for the explicit constructors, or
+`strided_memory_view_cpu.py <https://github.com/NVIDIA/cuda-python/blob/main/cuda_core/examples/strided_memory_view_cpu.py>`_
+and
+`strided_memory_view_gpu.py <https://github.com/NVIDIA/cuda-python/blob/main/cuda_core/examples/strided_memory_view_gpu.py>`_
+for decorator-based workflows. This provides a *concrete implementation* of both protocols that is
+**array-library-agnostic**, so that all Python projects can just rely on this without either
+re-implementing (the consumer side of) the protocols or tying to any particular array libraries.
 
 The :attr:`~utils.StridedMemoryView.is_device_accessible` attribute can be used to check
 whether or not the underlying buffer can be accessed on GPU.
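As background for the passage above (a standalone NumPy-only sketch, not part of the commit): the metadata that `StridedMemoryView` consumes — pointer, shape, strides, dtype — is exactly what the producer-side protocols expose. NumPy implements `__array_interface__` (the CPU analogue of CAI) as well as DLPack:

```python
import numpy as np

arr = np.arange(6, dtype=np.float32).reshape(2, 3)

# Metadata exposed by the host-side __array_interface__ protocol:
# data pointer, shape, typestring, and strides (None means C-contiguous).
ai = arr.__array_interface__
print(ai["shape"])    # (2, 3)
print(ai["strides"])  # None (C-contiguous)

# The same array is also a DLPack producer, the other protocol
# StridedMemoryView understands.
assert hasattr(arr, "__dlpack__") and hasattr(arr, "__dlpack_device__")
```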

cuda_core/examples/graph_update.py (new file, 100 additions)

# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# SPDX-License-Identifier: Apache-2.0

# ################################################################################
#
# This example demonstrates Graph.update() by reusing the same executable graph
# with a new capture that has the same topology but different kernel arguments.
#
# ################################################################################

# /// script
# dependencies = ["cuda_bindings", "cuda_core", "nvidia-cuda-nvrtc", "numpy>=2.1"]
# ///

import sys

import numpy as np

from cuda.core import (
    Device,
    LaunchConfig,
    LegacyPinnedMemoryResource,
    Program,
    ProgramOptions,
    launch,
)

code = """
extern "C" __global__ void add_one(int* value) {
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        *value += 1;
    }
}
"""


def build_increment_graph(device, kernel, target_ptr):
    builder = device.create_graph_builder().begin_building()
    config = LaunchConfig(grid=1, block=1)
    launch(builder, config, kernel, target_ptr)
    launch(builder, config, kernel, target_ptr)
    return builder.end_building()


def main():
    if np.lib.NumpyVersion(np.__version__) < "2.1.0":
        print("This example requires NumPy 2.1.0 or later", file=sys.stderr)
        sys.exit(1)

    device = Device()
    device.set_current()
    stream = device.create_stream()
    pinned_mr = LegacyPinnedMemoryResource()
    buffer = None
    initial_capture = None
    update_capture = None
    graph = None

    try:
        options = ProgramOptions(std="c++17", arch=f"sm_{device.arch}")
        program = Program(code, code_type="c++", options=options)
        module = program.compile("cubin")
        kernel = module.get_kernel("add_one")

        buffer = pinned_mr.allocate(2 * np.dtype(np.int32).itemsize)
        values = np.from_dlpack(buffer).view(np.int32)
        values[:] = 0

        initial_capture = build_increment_graph(device, kernel, values[0:].ctypes.data)
        update_capture = build_increment_graph(device, kernel, values[1:].ctypes.data)
        graph = initial_capture.complete()

        graph.upload(stream)
        graph.launch(stream)
        stream.sync()
        assert tuple(values) == (2, 0)

        graph.update(update_capture)
        graph.upload(stream)
        graph.launch(stream)
        stream.sync()
        assert tuple(values) == (2, 2)

        print("Graph.update() reused the executable graph with a new target pointer.")
        print(f"Final host values: {tuple(values)}")
    finally:
        if graph is not None:
            graph.close()
        if update_capture is not None:
            update_capture.close()
        if initial_capture is not None:
            initial_capture.close()
        if buffer is not None:
            buffer.close()
        stream.close()


if __name__ == "__main__":
    main()
(new file, 141 additions, 0 deletions)

# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# SPDX-License-Identifier: Apache-2.0

# ################################################################################
#
# This example demonstrates the newer memory-pool APIs by combining
# PinnedMemoryResource, ManagedMemoryResource, and GraphMemoryResource in one
# workflow.
#
# ################################################################################

# /// script
# dependencies = ["cuda_bindings", "cuda_core", "nvidia-cuda-nvrtc", "numpy>=2.1"]
# ///

import sys

import numpy as np

from cuda.core import (
    Device,
    GraphMemoryResource,
    LaunchConfig,
    ManagedMemoryResource,
    ManagedMemoryResourceOptions,
    PinnedMemoryResource,
    PinnedMemoryResourceOptions,
    Program,
    ProgramOptions,
    launch,
)

code = """
extern "C" __global__ void scale_and_bias(float* data, size_t size, float scale, float bias) {
    const unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
    const unsigned int stride = blockDim.x * gridDim.x;
    for (size_t i = tid; i < size; i += stride) {
        data[i] = data[i] * scale + bias;
    }
}
"""


def main():
    if np.lib.NumpyVersion(np.__version__) < "2.1.0":
        print("This example requires NumPy 2.1.0 or later", file=sys.stderr)
        sys.exit(1)

    device = Device()
    device.set_current()
    stream = device.create_stream()

    managed_mr = None
    pinned_mr = None
    graph_mr = None
    managed_buffer = None
    pinned_buffer = None
    graph_capture = None
    graph = None

    try:
        options = ProgramOptions(std="c++17", arch=f"sm_{device.arch}")
        program = Program(code, code_type="c++", options=options)
        module = program.compile("cubin")
        kernel = module.get_kernel("scale_and_bias")

        size = 256
        dtype = np.float32
        nbytes = size * dtype().itemsize
        config = LaunchConfig(grid=(size + 127) // 128, block=128)

        managed_options = ManagedMemoryResourceOptions(
            preferred_location=device.device_id,
            preferred_location_type="device",
        )
        managed_mr = ManagedMemoryResource(options=managed_options)

        pinned_options = {"ipc_enabled": False}
        host_numa_id = getattr(device.properties, "host_numa_id", -1)
        if host_numa_id >= 0:
            pinned_options["numa_id"] = host_numa_id
        pinned_mr = PinnedMemoryResource(options=PinnedMemoryResourceOptions(**pinned_options))

        graph_mr = GraphMemoryResource(device)

        managed_buffer = managed_mr.allocate(nbytes, stream=stream)
        pinned_buffer = pinned_mr.allocate(nbytes, stream=stream)

        managed_array = np.from_dlpack(managed_buffer).view(np.float32)
        pinned_array = np.from_dlpack(pinned_buffer).view(np.float32)

        managed_array[:] = np.arange(size, dtype=dtype)
        managed_original = managed_array.copy()
        stream.sync()

        managed_buffer.copy_to(pinned_buffer, stream=stream)
        stream.sync()
        assert np.array_equal(pinned_array, managed_original)

        graph_builder = device.create_graph_builder().begin_building("relaxed")
        scratch_buffer = graph_mr.allocate(nbytes, stream=graph_builder)
        scratch_buffer.copy_from(managed_buffer, stream=graph_builder)
        launch(graph_builder, config, kernel, scratch_buffer, np.uint64(size), np.float32(2.0), np.float32(1.0))
        managed_buffer.copy_from(scratch_buffer, stream=graph_builder)
        scratch_buffer.close()

        graph_capture = graph_builder.end_building()
        graph = graph_capture.complete()
        graph.upload(stream)
        graph.launch(stream)
        stream.sync()

        np.testing.assert_allclose(managed_array, managed_original * 2 + 1)
        managed_buffer.copy_to(pinned_buffer, stream=stream)
        stream.sync()
        np.testing.assert_allclose(pinned_array, managed_original * 2 + 1)

        print(f"PinnedMemoryResource numa_id: {pinned_mr.numa_id}")
        print(f"ManagedMemoryResource preferred_location: {managed_mr.preferred_location}")
        print(f"GraphMemoryResource reserved high watermark: {graph_mr.attributes.reserved_mem_high}")
    finally:
        if graph is not None:
            graph.close()
        if graph_capture is not None:
            graph_capture.close()
        if pinned_buffer is not None:
            pinned_buffer.close(stream)
        if managed_buffer is not None:
            managed_buffer.close(stream)
        if graph_mr is not None:
            graph_mr.close()
        if pinned_mr is not None:
            pinned_mr.close()
        if managed_mr is not None:
            managed_mr.close()
        stream.close()


if __name__ == "__main__":
    main()
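One detail in the example above worth calling out: `LaunchConfig(grid=(size + 127) // 128, block=128)` uses the usual ceiling-division idiom to size the grid so that every element is covered. A generic standalone sketch (not part of the commit):

```python
def blocks_needed(size, block):
    # Ceiling division: the smallest grid such that grid * block >= size.
    return (size + block - 1) // block

print(blocks_needed(256, 128))  # 2
print(blocks_needed(257, 128))  # 3
```

Combined with the grid-stride loop in the kernel, this covers sizes that are not a multiple of the block size without reading out of bounds.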
(new file, 84 additions, 0 deletions)

# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# SPDX-License-Identifier: Apache-2.0

# ################################################################################
#
# This example demonstrates the explicit StridedMemoryView constructors for
# __array_interface__, DLPack, __cuda_array_interface__, and Buffer objects.
#
# ################################################################################

# /// script
# dependencies = ["cuda_bindings", "cuda_core", "cupy-cuda13x", "numpy>=2.1"]
# ///

import sys

import cupy as cp
import numpy as np

from cuda.core import Device
from cuda.core.utils import StridedMemoryView


def dense_c_strides(shape):
    if not shape:
        return ()

    strides = [1] * len(shape)
    for index in range(len(shape) - 2, -1, -1):
        strides[index] = strides[index + 1] * shape[index + 1]
    return tuple(strides)


def main():
    if np.lib.NumpyVersion(np.__version__) < "2.1.0":
        print("This example requires NumPy 2.1.0 or later", file=sys.stderr)
        sys.exit(1)

    device = Device()
    device.set_current()
    stream = device.create_stream()
    buffer = None

    try:
        host_array = np.arange(12, dtype=np.int16).reshape(3, 4)
        host_view = StridedMemoryView.from_array_interface(host_array)
        host_dlpack_view = StridedMemoryView.from_dlpack(host_array, stream_ptr=-1)

        assert host_view.shape == host_array.shape
        assert host_view.size == host_array.size
        assert not host_view.is_device_accessible
        assert np.array_equal(np.from_dlpack(host_view), host_array)
        assert np.array_equal(np.from_dlpack(host_dlpack_view), host_array)

        gpu_array = cp.arange(12, dtype=cp.float32).reshape(3, 4)
        dlpack_view = StridedMemoryView.from_dlpack(gpu_array, stream_ptr=stream.handle)
        cai_view = StridedMemoryView.from_cuda_array_interface(gpu_array, stream_ptr=stream.handle)

        cp.testing.assert_array_equal(cp.from_dlpack(dlpack_view), gpu_array)
        cp.testing.assert_array_equal(cp.from_dlpack(cai_view), gpu_array)

        buffer = device.memory_resource.allocate(gpu_array.nbytes, stream=stream)
        buffer_array = cp.from_dlpack(buffer).view(dtype=cp.float32).reshape(gpu_array.shape)
        buffer_array[...] = gpu_array
        device.sync()

        buffer_view = StridedMemoryView.from_buffer(
            buffer,
            shape=gpu_array.shape,
            strides=dense_c_strides(gpu_array.shape),
            dtype=np.dtype(np.float32),
        )
        cp.testing.assert_array_equal(cp.from_dlpack(buffer_view), gpu_array)

        print("Constructed StridedMemoryView objects from array, DLPack, CAI, and Buffer inputs.")
    finally:
        if buffer is not None:
            buffer.close(stream)
        stream.close()


if __name__ == "__main__":
    main()
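A quick cross-check of the `dense_c_strides` helper used above (the cross-check itself is not part of the commit): the helper returns C-order strides measured in elements, while NumPy reports strides in bytes, so dividing NumPy's strides by the itemsize should reproduce its output.

```python
import numpy as np

def dense_c_strides(shape):
    # Same helper as in the example: C-order strides in units of elements.
    if not shape:
        return ()
    strides = [1] * len(shape)
    for index in range(len(shape) - 2, -1, -1):
        strides[index] = strides[index + 1] * shape[index + 1]
    return tuple(strides)

arr = np.empty((3, 4, 5), dtype=np.int16)
element_strides = tuple(s // arr.itemsize for s in arr.strides)
print(dense_c_strides(arr.shape))  # (20, 5, 1)
assert dense_c_strides(arr.shape) == element_strides
```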
