
Commit f80b085

docs: audit example coverage in published docs (#1866)
* docs: audit example coverage and add pixi docs builds (#1680)

  Make every current cuda.core and cuda.bindings example discoverable from the
  published docs, and add a reproducible pixi-based docs workflow so
  documentation changes can be built and debugged locally.

  Made-with: Cursor

* docs: address example link review feedback (#1680)

  Point the new example links at package-aware refs instead of always targeting
  main, and drop the overlapping pixi docs workflow changes so this PR stays
  scoped to the documentation coverage audit.

  Made-with: Cursor

* chore: refresh pixi lockfiles after rebasing

  Keep the pixi lockfiles aligned with the current manifests after rebasing
  onto the merged docs workflow and verifying the docs tasks.

  Made-with: Cursor
1 parent 974ed1d commit f80b085

File tree

13 files changed (+204, -30 lines)

cuda_bindings/docs/build_docs.sh

Lines changed: 4 additions & 0 deletions
@@ -25,6 +25,10 @@ if [[ -z "${SPHINX_CUDA_BINDINGS_VER}" ]]; then
       | awk -F'+' '{print $1}')
 fi
 
+if [[ "${LATEST_ONLY}" == "1" && -z "${BUILD_PREVIEW:-}" && -z "${BUILD_LATEST:-}" ]]; then
+  export BUILD_LATEST=1
+fi
+
 # build the docs (in parallel)
 SPHINXOPTS="-j 4 -d build/.doctrees" make html

cuda_bindings/docs/source/conf.py

Lines changed: 13 additions & 0 deletions
@@ -26,6 +26,15 @@
 release = os.environ["SPHINX_CUDA_BINDINGS_VER"]
 
 
+def _github_examples_ref():
+    if int(os.environ.get("BUILD_PREVIEW", 0)) or int(os.environ.get("BUILD_LATEST", 0)):
+        return "main"
+    return f"v{release}"
+
+
+GITHUB_EXAMPLES_REF = _github_examples_ref()
+
+
 # -- General configuration ---------------------------------------------------
 
 # Add any Sphinx extension module names here, as strings. They can be

@@ -99,6 +108,10 @@
 # skip cmdline prompts
 copybutton_exclude = ".linenos, .gp"
 
+rst_epilog = f"""
+.. |cuda_bindings_github_ref| replace:: {GITHUB_EXAMPLES_REF}
+"""
+
 intersphinx_mapping = {
     "python": ("https://docs.python.org/3/", None),
     "numpy": ("https://numpy.org/doc/stable/", None),
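The ref-selection logic above is small enough to sketch standalone. The following is an illustrative re-implementation, not the shipped conf.py: the `env` parameter and the `example_url` helper are added here purely for testability.

```python
import os


def github_examples_ref(release, env=None):
    """Pick the git ref that example links should target.

    Preview/latest doc builds link to ``main``; versioned builds pin
    links to the matching release tag (``v<release>`` for cuda.bindings).
    """
    env = os.environ if env is None else env
    # BUILD_PREVIEW / BUILD_LATEST are "0"/"1"-style flags in the build scripts.
    if int(env.get("BUILD_PREVIEW", 0)) or int(env.get("BUILD_LATEST", 0)):
        return "main"
    return f"v{release}"


def example_url(release, env, path):
    # Example links in the published docs resolve to URLs of this shape.
    ref = github_examples_ref(release, env)
    return f"https://github.com/NVIDIA/cuda-python/blob/{ref}/{path}"
```

This keeps versioned docs pointing at the examples that actually shipped with that release, while preview/latest docs track `main`.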
cuda_bindings/docs/source/examples.rst

Lines changed: 68 additions & 0 deletions

@@ -0,0 +1,68 @@
+.. SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+.. SPDX-License-Identifier: LicenseRef-NVIDIA-SOFTWARE-LICENSE
+
+Examples
+========
+
+This page links to the ``cuda.bindings`` examples shipped in the
+`cuda-python repository <https://github.com/NVIDIA/cuda-python/tree/|cuda_bindings_github_ref|/cuda_bindings/examples>`_.
+Use it as a quick index when you want a runnable sample for a specific API area
+or CUDA feature.
+
+Introduction
+------------
+
+- `clock_nvrtc.py <https://github.com/NVIDIA/cuda-python/blob/|cuda_bindings_github_ref|/cuda_bindings/examples/0_Introduction/clock_nvrtc.py>`_
+  uses NVRTC-compiled CUDA code and the device clock to time a reduction
+  kernel.
+- `simple_cubemap_texture.py <https://github.com/NVIDIA/cuda-python/blob/|cuda_bindings_github_ref|/cuda_bindings/examples/0_Introduction/simple_cubemap_texture.py>`_
+  demonstrates cubemap texture sampling and transformation.
+- `simple_p2p.py <https://github.com/NVIDIA/cuda-python/blob/|cuda_bindings_github_ref|/cuda_bindings/examples/0_Introduction/simple_p2p.py>`_
+  shows peer-to-peer memory access and transfers between multiple GPUs.
+- `simple_zero_copy.py <https://github.com/NVIDIA/cuda-python/blob/|cuda_bindings_github_ref|/cuda_bindings/examples/0_Introduction/simple_zero_copy.py>`_
+  uses zero-copy mapped host memory for vector addition.
+- `system_wide_atomics.py <https://github.com/NVIDIA/cuda-python/blob/|cuda_bindings_github_ref|/cuda_bindings/examples/0_Introduction/system_wide_atomics.py>`_
+  demonstrates system-wide atomic operations on managed memory.
+- `vector_add_drv.py <https://github.com/NVIDIA/cuda-python/blob/|cuda_bindings_github_ref|/cuda_bindings/examples/0_Introduction/vector_add_drv.py>`_
+  uses the CUDA Driver API and unified virtual addressing for vector addition.
+- `vector_add_mmap.py <https://github.com/NVIDIA/cuda-python/blob/|cuda_bindings_github_ref|/cuda_bindings/examples/0_Introduction/vector_add_mmap.py>`_
+  uses virtual memory management APIs such as ``cuMemCreate`` and
+  ``cuMemMap`` for vector addition.
+
+Concepts and techniques
+-----------------------
+
+- `stream_ordered_allocation.py <https://github.com/NVIDIA/cuda-python/blob/|cuda_bindings_github_ref|/cuda_bindings/examples/2_Concepts_and_Techniques/stream_ordered_allocation.py>`_
+  demonstrates ``cudaMallocAsync`` and ``cudaFreeAsync`` together with
+  memory-pool release thresholds.
+
+CUDA features
+-------------
+
+- `global_to_shmem_async_copy.py <https://github.com/NVIDIA/cuda-python/blob/|cuda_bindings_github_ref|/cuda_bindings/examples/3_CUDA_Features/global_to_shmem_async_copy.py>`_
+  compares asynchronous global-to-shared-memory copy strategies in matrix
+  multiplication kernels.
+- `simple_cuda_graphs.py <https://github.com/NVIDIA/cuda-python/blob/|cuda_bindings_github_ref|/cuda_bindings/examples/3_CUDA_Features/simple_cuda_graphs.py>`_
+  shows both manual CUDA graph construction and stream-capture-based replay.
+
+Libraries and tools
+-------------------
+
+- `conjugate_gradient_multi_block_cg.py <https://github.com/NVIDIA/cuda-python/blob/|cuda_bindings_github_ref|/cuda_bindings/examples/4_CUDA_Libraries/conjugate_gradient_multi_block_cg.py>`_
+  implements a conjugate-gradient solver with cooperative groups and
+  multi-block synchronization.
+- `nvidia_smi.py <https://github.com/NVIDIA/cuda-python/blob/|cuda_bindings_github_ref|/cuda_bindings/examples/4_CUDA_Libraries/nvidia_smi.py>`_
+  uses NVML to implement a Python subset of ``nvidia-smi``.
+
+Advanced and interoperability
+-----------------------------
+
+- `iso_fd_modelling.py <https://github.com/NVIDIA/cuda-python/blob/|cuda_bindings_github_ref|/cuda_bindings/examples/extra/iso_fd_modelling.py>`_
+  runs isotropic finite-difference wave propagation across multiple GPUs with
+  peer-to-peer halo exchange.
+- `jit_program.py <https://github.com/NVIDIA/cuda-python/blob/|cuda_bindings_github_ref|/cuda_bindings/examples/extra/jit_program.py>`_
+  JIT-compiles a SAXPY kernel with NVRTC and launches it through the Driver
+  API.
+- `numba_emm_plugin.py <https://github.com/NVIDIA/cuda-python/blob/|cuda_bindings_github_ref|/cuda_bindings/examples/extra/numba_emm_plugin.py>`_
+  shows how to back Numba's EMM interface with the NVIDIA CUDA Python Driver
+  API.

cuda_bindings/docs/source/index.rst

Lines changed: 1 addition & 0 deletions
@@ -11,6 +11,7 @@
    release
    install
    overview
+   examples
    motivation
    environment_variables
    api

cuda_bindings/docs/source/overview.rst

Lines changed: 6 additions & 2 deletions
@@ -31,7 +31,8 @@ API <http://docs.nvidia.com/cuda/cuda-driver-api/index.html>`_, manually create
 CUDA context and all required resources on the GPU, then launch the compiled
 CUDA C++ code and retrieve the results from the GPU. Now that you have an
 overview, jump into a commonly used example for parallel programming:
-`SAXPY <https://developer.nvidia.com/blog/six-ways-saxpy/>`_.
+`SAXPY <https://developer.nvidia.com/blog/six-ways-saxpy/>`_. For more
+end-to-end samples, see the :doc:`examples` page.
 
 The first thing to do is import the `Driver
 API <https://docs.nvidia.com/cuda/cuda-driver-api/index.html>`_ and

@@ -520,7 +521,10 @@ CUDA objects
 
 Certain CUDA kernels use native CUDA types as their parameters such as ``cudaTextureObject_t``. These types require special handling since they're neither a primitive ctype nor a custom user type. Since ``cuda.bindings`` exposes each of them as Python classes, they each implement ``getPtr()`` and ``__int__()``. These two callables used to support the NumPy and ctypes approach. The difference between each call is further described under `Tips and Tricks <https://nvidia.github.io/cuda-python/cuda-bindings/latest/tips_and_tricks.html#>`_.
 
-For this example, lets use the ``transformKernel`` from `examples/0_Introduction/simpleCubemapTexture_test.py <https://github.com/NVIDIA/cuda-python/blob/main/cuda_bindings/examples/0_Introduction/simpleCubemapTexture_test.py>`_:
+For this example, lets use the ``transformKernel`` from
+`simple_cubemap_texture.py <https://github.com/NVIDIA/cuda-python/blob/|cuda_bindings_github_ref|/cuda_bindings/examples/0_Introduction/simple_cubemap_texture.py>`_.
+The :doc:`examples` page links to more samples covering textures, graphs,
+memory mapping, and multi-GPU workflows.
 
 .. code-block:: python
cuda_bindings/pixi.lock

Lines changed: 6 additions & 6 deletions
Some generated files are not rendered by default.

cuda_core/docs/build_docs.sh

Lines changed: 4 additions & 0 deletions
@@ -24,6 +24,10 @@ if [[ -z "${SPHINX_CUDA_CORE_VER}" ]]; then
       | awk -F'+' '{print $1}')
 fi
 
+if [[ "${LATEST_ONLY}" == "1" && -z "${BUILD_PREVIEW:-}" && -z "${BUILD_LATEST:-}" ]]; then
+  export BUILD_LATEST=1
+fi
+
 # build the docs. Allow callers to override SPHINXOPTS for serial/debug runs.
 if [[ -z "${SPHINXOPTS:-}" ]]; then
     SPHINXOPTS="-j 4 -d build/.doctrees"
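The `LATEST_ONLY` fallback added to both build_docs.sh scripts is compact shell logic; the following Python sketch of the same decision may help when reasoning about it (the function name is ours, not the script's):

```python
def resolve_build_flags(env):
    """Mimic the shell fallback:

        if [[ "${LATEST_ONLY}" == "1" && -z "${BUILD_PREVIEW:-}" && -z "${BUILD_LATEST:-}" ]]; then
            export BUILD_LATEST=1
        fi

    i.e. a latest-only invocation that sets neither BUILD_PREVIEW nor
    BUILD_LATEST defaults to building the "latest" docs.
    """
    env = dict(env)  # don't mutate the caller's mapping
    if (
        env.get("LATEST_ONLY") == "1"
        and not env.get("BUILD_PREVIEW")
        and not env.get("BUILD_LATEST")
    ):
        env["BUILD_LATEST"] = "1"
    return env
```

Explicitly set flags always win; the fallback only fires when both are unset or empty, matching bash's `-z "${VAR:-}"` test.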

cuda_core/docs/source/conf.py

Lines changed: 13 additions & 0 deletions
@@ -26,6 +26,15 @@
 release = os.environ["SPHINX_CUDA_CORE_VER"]
 
 
+def _github_examples_ref():
+    if int(os.environ.get("BUILD_PREVIEW", 0)) or int(os.environ.get("BUILD_LATEST", 0)):
+        return "main"
+    return f"cuda-core-v{release}"
+
+
+GITHUB_EXAMPLES_REF = _github_examples_ref()
+
+
 # -- General configuration ---------------------------------------------------
 
 # Add any Sphinx extension module names here, as strings. They can be

@@ -97,6 +106,10 @@
 # skip cmdline prompts
 copybutton_exclude = ".linenos, .gp"
 
+rst_epilog = f"""
+.. |cuda_core_github_ref| replace:: {GITHUB_EXAMPLES_REF}
+"""
+
 intersphinx_mapping = {
     "python": ("https://docs.python.org/3/", None),
     "numpy": ("https://numpy.org/doc/stable/", None),
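The `rst_epilog` additions are what make the `|cuda_core_github_ref|` and `|cuda_bindings_github_ref|` links work: Sphinx appends the epilog to every source file, so the substitution is defined on every page. Conceptually the effect on a link target is a string replacement, sketched below as an approximation (docutils performs real substitution-node resolution; the version tag used here is illustrative):

```python
def expand_substitutions(rst_url, substitutions):
    # Approximate how a |name| substitution inside an example link target
    # is resolved once rst_epilog has defined it on the page.
    for name, value in substitutions.items():
        rst_url = rst_url.replace(f"|{name}|", value)
    return rst_url


url = expand_substitutions(
    "https://github.com/NVIDIA/cuda-python/blob/|cuda_core_github_ref|/cuda_core/examples/saxpy.py",
    {"cuda_core_github_ref": "cuda-core-v0.3.2"},  # hypothetical release tag
)
```

For a preview or latest build the same substitution would expand to `main`, so one set of `.rst` sources serves both versioned and rolling docs.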

cuda_core/docs/source/examples.rst

Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
+.. SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+.. SPDX-License-Identifier: Apache-2.0
+
+Examples
+========
+
+This page links to the ``cuda.core`` examples shipped in the
+`cuda-python repository <https://github.com/NVIDIA/cuda-python/tree/|cuda_core_github_ref|/cuda_core/examples>`_.
+Use it as a quick index when you want a runnable starting point for a specific
+workflow.
+
+Compilation and kernel launch
+-----------------------------
+
+- `vector_add.py <https://github.com/NVIDIA/cuda-python/blob/|cuda_core_github_ref|/cuda_core/examples/vector_add.py>`_
+  compiles and launches a simple vector-add kernel with CuPy arrays.
+- `saxpy.py <https://github.com/NVIDIA/cuda-python/blob/|cuda_core_github_ref|/cuda_core/examples/saxpy.py>`_
+  JIT-compiles a templated SAXPY kernel and launches both float and double
+  instantiations.
+- `pytorch_example.py <https://github.com/NVIDIA/cuda-python/blob/|cuda_core_github_ref|/cuda_core/examples/pytorch_example.py>`_
+  launches a CUDA kernel with PyTorch tensors and a wrapped PyTorch stream.
+
+Multi-device and advanced launch configuration
+----------------------------------------------
+
+- `simple_multi_gpu_example.py <https://github.com/NVIDIA/cuda-python/blob/|cuda_core_github_ref|/cuda_core/examples/simple_multi_gpu_example.py>`_
+  compiles and launches kernels across multiple GPUs.
+- `thread_block_cluster.py <https://github.com/NVIDIA/cuda-python/blob/|cuda_core_github_ref|/cuda_core/examples/thread_block_cluster.py>`_
+  demonstrates thread block cluster launch configuration on Hopper-class GPUs.
+- `tma_tensor_map.py <https://github.com/NVIDIA/cuda-python/blob/|cuda_core_github_ref|/cuda_core/examples/tma_tensor_map.py>`_
+  demonstrates Tensor Memory Accelerator descriptors and TMA-based bulk copies.
+
+Linking and graphs
+------------------
+
+- `jit_lto_fractal.py <https://github.com/NVIDIA/cuda-python/blob/|cuda_core_github_ref|/cuda_core/examples/jit_lto_fractal.py>`_
+  uses JIT link-time optimization to link user-provided device code into a
+  fractal workflow at runtime.
+- `cuda_graphs.py <https://github.com/NVIDIA/cuda-python/blob/|cuda_core_github_ref|/cuda_core/examples/cuda_graphs.py>`_
+  captures and replays a multi-kernel CUDA graph to reduce launch overhead.
+
+Interoperability and memory access
+----------------------------------
+
+- `memory_ops.py <https://github.com/NVIDIA/cuda-python/blob/|cuda_core_github_ref|/cuda_core/examples/memory_ops.py>`_
+  covers memory resources, pinned memory, device transfers, and DLPack interop.
+- `strided_memory_view_cpu.py <https://github.com/NVIDIA/cuda-python/blob/|cuda_core_github_ref|/cuda_core/examples/strided_memory_view_cpu.py>`_
+  uses ``StridedMemoryView`` with JIT-compiled CPU code via ``cffi``.
+- `strided_memory_view_gpu.py <https://github.com/NVIDIA/cuda-python/blob/|cuda_core_github_ref|/cuda_core/examples/strided_memory_view_gpu.py>`_
+  uses ``StridedMemoryView`` with JIT-compiled GPU code and foreign GPU buffers.
+- `gl_interop_plasma.py <https://github.com/NVIDIA/cuda-python/blob/|cuda_core_github_ref|/cuda_core/examples/gl_interop_plasma.py>`_
+  renders a CUDA-generated plasma effect through OpenGL interop without CPU
+  copies.
+
+System inspection
+-----------------
+
+- `show_device_properties.py <https://github.com/NVIDIA/cuda-python/blob/|cuda_core_github_ref|/cuda_core/examples/show_device_properties.py>`_
+  prints a detailed report of the CUDA devices available on the system.

cuda_core/docs/source/getting-started.rst

Lines changed: 12 additions & 6 deletions
@@ -32,7 +32,9 @@ Example: Compiling and Launching a CUDA kernel
 ----------------------------------------------
 
 To get a taste for ``cuda.core``, let's walk through a simple example that compiles and launches a vector addition kernel.
-You can find the complete example in `vector_add.py <https://github.com/NVIDIA/cuda-python/tree/main/cuda_core/examples/vector_add.py>`_.
+You can find the complete example in `vector_add.py <https://github.com/NVIDIA/cuda-python/blob/|cuda_core_github_ref|/cuda_core/examples/vector_add.py>`_
+and browse the :doc:`examples page <examples>` for the rest of the shipped
+workflows.
 
 First, we define a string containing the CUDA C++ kernel. Note that this is a templated kernel:
 

@@ -76,8 +78,10 @@ Note the use of the ``name_expressions`` parameter to the :meth:`Program.compile
 
     mod = prog.compile("cubin", name_expressions=("vector_add<float>",))
 
 Next, we retrieve the compiled kernel from the CUBIN and prepare the arguments and kernel configuration.
-We're using `CuPy <https://cupy.dev/>`_ arrays as inputs for this example, but you can use PyTorch tensors too
-(we show how to do this in one of our `examples <https://github.com/NVIDIA/cuda-python/tree/main/cuda_core/examples>`_).
+We're using `CuPy <https://cupy.dev/>`_ arrays as inputs for this example, but
+you can use PyTorch tensors too (see
+`pytorch_example.py <https://github.com/NVIDIA/cuda-python/blob/|cuda_core_github_ref|/cuda_core/examples/pytorch_example.py>`_
+and the :doc:`examples page <examples>`).
 
 .. code-block:: python
 

@@ -108,7 +112,9 @@ Note the clean, Pythonic interface, and absence of any direct calls to the CUDA
 Examples and Recipes
 --------------------
 
-As we mentioned before, ``cuda.core`` can do much more than just compile and launch kernels.
+As we mentioned before, ``cuda.core`` can do much more than just compile and
+launch kernels.
 
-The best way to explore and learn the different features ``cuda.core`` is through
-our `examples <https://github.com/NVIDIA/cuda-python/tree/main/cuda_core/examples>`_. Find one that matches your use-case, and modify it to fit your needs!
+Browse the :doc:`examples page <examples>` for direct links to every shipped
+example, including multi-GPU workflows, CUDA graphs, memory utilities, and
+interop-focused recipes.
