Commit 3b0730e

Merge remote-tracking branch 'origin/main' into ci/suppress-pip-warnings

2 parents: e7b7766 + c93623b


46 files changed (+13087, −569 lines)

.github/workflows/test-wheel-linux.yml

Lines changed: 2 additions & 2 deletions
@@ -70,12 +70,12 @@ jobs:
       echo "OLD_BRANCH=${OLD_BRANCH}" >> "$GITHUB_OUTPUT"

   test:
-    name: py${{ matrix.PY_VER }}, ${{ matrix.CUDA_VER }}, ${{ (matrix.LOCAL_CTK == '1' && 'local') || 'wheels' }}, ${{ matrix.GPU }}${{ matrix.GPU_COUNT != '1' && format('(x{0})', matrix.GPU_COUNT) || '' }}
+    name: py${{ matrix.PY_VER }}, ${{ matrix.CUDA_VER }}, ${{ (matrix.LOCAL_CTK == '1' && 'local') || 'wheels' }}, ${{ matrix.GPU }}${{ matrix.GPU_COUNT != '1' && format('(x{0})', matrix.GPU_COUNT) || '' }}${{ matrix.FLAVOR && format(', {0}', matrix.FLAVOR) || '' }}
     needs: compute-matrix
     strategy:
       fail-fast: false
       matrix: ${{ fromJSON(needs.compute-matrix.outputs.MATRIX) }}
-    runs-on: "linux-${{ matrix.ARCH }}-gpu-${{ matrix.GPU }}-${{ matrix.DRIVER }}-${{ matrix.GPU_COUNT }}"
+    runs-on: "${{ matrix.FLAVOR || 'linux' }}-${{ matrix.ARCH }}-gpu-${{ matrix.GPU }}-${{ matrix.DRIVER }}-${{ matrix.GPU_COUNT }}"
     # The build stage could fail but we want the CI to keep moving.
     if: ${{ github.repository_owner == 'nvidia' && !cancelled() }}
     # Our self-hosted runners require a container
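The `runs-on` change above prepends the new `FLAVOR` matrix key to the runner label, falling back to `linux` when the key is absent. A minimal sketch of that fallback, emulating GitHub Actions' `||` operator with Python's `or` (the matrix dicts below are illustrative entries in the shape of `ci/test-matrix.yml`, not exact CI data):

```python
def runner_label(matrix):
    # GitHub's `${{ matrix.FLAVOR || 'linux' }}` falls back when FLAVOR is
    # empty or unset; Python's `or` behaves the same way for missing keys.
    flavor = matrix.get("FLAVOR") or "linux"
    return (
        f"{flavor}-{matrix['ARCH']}-gpu-{matrix['GPU']}"
        f"-{matrix['DRIVER']}-{matrix['GPU_COUNT']}"
    )

# An entry without FLAVOR keeps the pre-existing label shape...
print(runner_label({"ARCH": "amd64", "GPU": "t4", "DRIVER": "latest", "GPU_COUNT": "1"}))
# → linux-amd64-gpu-t4-latest-1

# ...while a 'wsl' entry is routed to a WSL-flavored runner pool.
print(runner_label({"ARCH": "amd64", "GPU": "t4", "DRIVER": "latest", "GPU_COUNT": "1", "FLAVOR": "wsl"}))
# → wsl-amd64-gpu-t4-latest-1
```

Because unspecified `FLAVOR` resolves to the old `linux-...` label, existing matrix entries are unaffected by the change.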

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -120,6 +120,7 @@ instance/
 # Sphinx documentation
 docs_src/_build/
 */docs/source/generated/
+*/docs/source/module/generated/

 # PyBuilder
 .pybuilder/

ci/test-matrix.yml

Lines changed: 2 additions & 0 deletions
@@ -60,6 +60,8 @@ linux:
   - { ARCH: 'amd64', PY_VER: '3.13', CUDA_VER: '13.2.0', LOCAL_CTK: '1', GPU: 'h100', GPU_COUNT: '1', DRIVER: 'latest' }
   - { ARCH: 'amd64', PY_VER: '3.14', CUDA_VER: '13.2.0', LOCAL_CTK: '1', GPU: 't4', GPU_COUNT: '2', DRIVER: 'latest' }
   - { ARCH: 'amd64', PY_VER: '3.14t', CUDA_VER: '13.2.0', LOCAL_CTK: '1', GPU: 'h100', GPU_COUNT: '2', DRIVER: 'latest' }
+  - { ARCH: 'amd64', PY_VER: '3.11', CUDA_VER: '12.9.1', LOCAL_CTK: '0', GPU: 't4', GPU_COUNT: '1', DRIVER: 'latest', FLAVOR: 'wsl' }
+  - { ARCH: 'amd64', PY_VER: '3.12', CUDA_VER: '13.2.0', LOCAL_CTK: '0', GPU: 'rtx4090', GPU_COUNT: '1', DRIVER: 'latest', FLAVOR: 'wsl' }
 nightly: []

 windows:

cuda_bindings/docs/build_docs.sh

Lines changed: 4 additions & 0 deletions
@@ -25,6 +25,10 @@ if [[ -z "${SPHINX_CUDA_BINDINGS_VER}" ]]; then
     | awk -F'+' '{print $1}')
 fi

+if [[ "${LATEST_ONLY}" == "1" && -z "${BUILD_PREVIEW:-}" && -z "${BUILD_LATEST:-}" ]]; then
+    export BUILD_LATEST=1
+fi
+
 # build the docs (in parallel)
 SPHINXOPTS="-j 4 -d build/.doctrees" make html
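This shell hunk defaults `BUILD_LATEST=1` for latest-only doc builds, which in turn drives the git ref that `conf.py`'s `_github_examples_ref()` embeds in example links. A sketch of the combined logic in Python (the `"12.9.0"` release string is a made-up example value, not a real cuda-bindings version):

```python
def effective_examples_ref(release, env):
    """Which git ref will example links point at, given the build env?"""
    env = dict(env)  # don't mutate the caller's mapping
    # Shell hunk: LATEST_ONLY=1 implies BUILD_LATEST=1 unless the caller
    # already set BUILD_PREVIEW or BUILD_LATEST explicitly.
    if env.get("LATEST_ONLY") == "1" and not env.get("BUILD_PREVIEW") and not env.get("BUILD_LATEST"):
        env["BUILD_LATEST"] = "1"
    # conf.py's _github_examples_ref(): preview/latest builds track main,
    # versioned builds link to the matching release tag.
    if int(env.get("BUILD_PREVIEW", 0)) or int(env.get("BUILD_LATEST", 0)):
        return "main"
    return f"v{release}"

print(effective_examples_ref("12.9.0", {"LATEST_ONLY": "1"}))  # → main
print(effective_examples_ref("12.9.0", {}))                    # → v12.9.0
```

The upshot: docs built for a tagged release link to examples frozen at that tag, while latest/preview builds always link to `main`.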

cuda_bindings/docs/source/conf.py

Lines changed: 53 additions & 1 deletion
@@ -9,6 +9,7 @@

 # -- Path setup --------------------------------------------------------------

+import inspect
 import os
 import sys
 from pathlib import Path
@@ -26,6 +27,15 @@
 release = os.environ["SPHINX_CUDA_BINDINGS_VER"]


+def _github_examples_ref():
+    if int(os.environ.get("BUILD_PREVIEW", 0)) or int(os.environ.get("BUILD_LATEST", 0)):
+        return "main"
+    return f"v{release}"
+
+
+GITHUB_EXAMPLES_REF = _github_examples_ref()
+
+
 # -- General configuration ---------------------------------------------------

 # Add any Sphinx extension module names here, as strings. They can be
@@ -94,11 +104,15 @@
 # Add any paths that contain custom static files (such as style sheets) here,
 # relative to this directory. They are copied after the builtin static files,
 # so a file named "default.css" will overwrite the builtin "default.css".
-html_static_path = ["_static"]
+html_static_path = []  # ["_static"] does not exist in our environment

 # skip cmdline prompts
 copybutton_exclude = ".linenos, .gp"

+rst_epilog = f"""
+.. |cuda_bindings_github_ref| replace:: {GITHUB_EXAMPLES_REF}
+"""
+
 intersphinx_mapping = {
     "python": ("https://docs.python.org/3/", None),
     "numpy": ("https://numpy.org/doc/stable/", None),
@@ -107,7 +121,45 @@
     "cufile": ("https://docs.nvidia.com/gpudirect-storage/api-reference-guide/", None),
 }

+
+def _sanitize_generated_docstring(lines):
+    doc_lines = inspect.cleandoc("\n".join(lines)).splitlines()
+    if not doc_lines:
+        return
+
+    if "(" in doc_lines[0] and ")" in doc_lines[0]:
+        doc_lines = doc_lines[1:]
+        while doc_lines and not doc_lines[0].strip():
+            doc_lines.pop(0)
+
+    if not doc_lines:
+        lines[:] = []
+        return
+
+    lines[:] = [".. code-block:: text", ""]
+    lines.extend(f" {line}" if line else " " for line in doc_lines)
+
+
+def autodoc_process_docstring(app, what, name, obj, options, lines):
+    if name.startswith("cuda.bindings."):
+        _sanitize_generated_docstring(lines)
+
+
+def rewrite_source(app, docname, source):
+    text = source[0]
+
+    if docname.startswith("release/"):
+        text = text.replace(".. module:: cuda.bindings\n\n", "", 1)
+
+    source[0] = text
+
+
 suppress_warnings = [
     # for warnings about multiple possible targets, see NVIDIA/cuda-python#152
     "ref.python",
 ]
+
+
+def setup(app):
+    app.connect("autodoc-process-docstring", autodoc_process_docstring)
+    app.connect("source-read", rewrite_source)
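The `_sanitize_generated_docstring` hook above drops a leading signature line and re-emits the rest of an auto-generated docstring as a literal `code-block` so Sphinx does not parse it as reST. A standalone sketch of that behavior, reimplemented from the hunk with a made-up docstring (`myFunc` is hypothetical):

```python
import inspect

def sanitize(lines):
    # Reimplementation of _sanitize_generated_docstring from the hunk above.
    doc_lines = inspect.cleandoc("\n".join(lines)).splitlines()
    if not doc_lines:
        return
    # Drop a leading C-style signature line such as "myFunc(int x)".
    if "(" in doc_lines[0] and ")" in doc_lines[0]:
        doc_lines = doc_lines[1:]
        while doc_lines and not doc_lines[0].strip():
            doc_lines.pop(0)
    if not doc_lines:
        lines[:] = []  # nothing left: blank the docstring entirely
        return
    # Wrap the remaining text in a literal block, indenting each line
    # so it becomes directive content.
    lines[:] = [".. code-block:: text", ""]
    lines.extend(f" {line}" if line else " " for line in doc_lines)

doc = ["myFunc(int x)", "", "Does a thing.", "Returns an error code."]
sanitize(doc)
print(doc)
# → ['.. code-block:: text', '', ' Does a thing.', ' Returns an error code.']
```

Mutating `lines` in place (via `lines[:] = ...`) matters here: Sphinx's `autodoc-process-docstring` event passes the docstring as a list and only in-place edits are seen by the builder.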

cuda_bindings/docs/source/contribute.rst

Lines changed: 14 additions & 9 deletions
@@ -4,12 +4,17 @@
 Contributing
 ============

-Thank you for your interest in contributing to ``cuda-bindings``! Based on the type of contribution, it will fall into two categories:
-
-1. You want to report a bug, feature request, or documentation issue
-   - File an `issue <https://github.com/NVIDIA/cuda-python/issues/new/choose>`_ describing what you encountered or what you want to see changed.
-   - The NVIDIA team will evaluate the issues and triage them, scheduling
-     them for a release. If you believe the issue needs priority attention
-     comment on the issue to notify the team.
-2. You want to implement a feature, improvement, or bug fix:
-   - At this time we do not accept code contributions.
+Thank you for your interest in contributing to ``cuda-bindings``! Based on the
+type of contribution, it will fall into two categories:
+
+1. You want to report a bug, feature request, or documentation issue.
+
+   File an `issue <https://github.com/NVIDIA/cuda-python/issues/new/choose>`_
+   describing what you encountered or what you want to see changed. The NVIDIA
+   team will evaluate the issue, triage it, and schedule it for a release. If
+   you believe the issue needs priority attention, comment on the issue to
+   notify the team.
+
+2. You want to implement a feature, improvement, or bug fix.
+
+   At this time we do not accept code contributions.
cuda_bindings/docs/source/examples.rst (new file)

Lines changed: 68 additions & 0 deletions

.. SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
.. SPDX-License-Identifier: LicenseRef-NVIDIA-SOFTWARE-LICENSE

Examples
========

This page links to the ``cuda.bindings`` examples shipped in the
`cuda-python repository <https://github.com/NVIDIA/cuda-python/tree/|cuda_bindings_github_ref|/cuda_bindings/examples>`_.
Use it as a quick index when you want a runnable sample for a specific API area
or CUDA feature.

Introduction
------------

- `clock_nvrtc.py <https://github.com/NVIDIA/cuda-python/blob/|cuda_bindings_github_ref|/cuda_bindings/examples/0_Introduction/clock_nvrtc.py>`_
  uses NVRTC-compiled CUDA code and the device clock to time a reduction
  kernel.
- `simple_cubemap_texture.py <https://github.com/NVIDIA/cuda-python/blob/|cuda_bindings_github_ref|/cuda_bindings/examples/0_Introduction/simple_cubemap_texture.py>`_
  demonstrates cubemap texture sampling and transformation.
- `simple_p2p.py <https://github.com/NVIDIA/cuda-python/blob/|cuda_bindings_github_ref|/cuda_bindings/examples/0_Introduction/simple_p2p.py>`_
  shows peer-to-peer memory access and transfers between multiple GPUs.
- `simple_zero_copy.py <https://github.com/NVIDIA/cuda-python/blob/|cuda_bindings_github_ref|/cuda_bindings/examples/0_Introduction/simple_zero_copy.py>`_
  uses zero-copy mapped host memory for vector addition.
- `system_wide_atomics.py <https://github.com/NVIDIA/cuda-python/blob/|cuda_bindings_github_ref|/cuda_bindings/examples/0_Introduction/system_wide_atomics.py>`_
  demonstrates system-wide atomic operations on managed memory.
- `vector_add_drv.py <https://github.com/NVIDIA/cuda-python/blob/|cuda_bindings_github_ref|/cuda_bindings/examples/0_Introduction/vector_add_drv.py>`_
  uses the CUDA Driver API and unified virtual addressing for vector addition.
- `vector_add_mmap.py <https://github.com/NVIDIA/cuda-python/blob/|cuda_bindings_github_ref|/cuda_bindings/examples/0_Introduction/vector_add_mmap.py>`_
  uses virtual memory management APIs such as ``cuMemCreate`` and
  ``cuMemMap`` for vector addition.

Concepts and techniques
-----------------------

- `stream_ordered_allocation.py <https://github.com/NVIDIA/cuda-python/blob/|cuda_bindings_github_ref|/cuda_bindings/examples/2_Concepts_and_Techniques/stream_ordered_allocation.py>`_
  demonstrates ``cudaMallocAsync`` and ``cudaFreeAsync`` together with
  memory-pool release thresholds.

CUDA features
-------------

- `global_to_shmem_async_copy.py <https://github.com/NVIDIA/cuda-python/blob/|cuda_bindings_github_ref|/cuda_bindings/examples/3_CUDA_Features/global_to_shmem_async_copy.py>`_
  compares asynchronous global-to-shared-memory copy strategies in matrix
  multiplication kernels.
- `simple_cuda_graphs.py <https://github.com/NVIDIA/cuda-python/blob/|cuda_bindings_github_ref|/cuda_bindings/examples/3_CUDA_Features/simple_cuda_graphs.py>`_
  shows both manual CUDA graph construction and stream-capture-based replay.

Libraries and tools
-------------------

- `conjugate_gradient_multi_block_cg.py <https://github.com/NVIDIA/cuda-python/blob/|cuda_bindings_github_ref|/cuda_bindings/examples/4_CUDA_Libraries/conjugate_gradient_multi_block_cg.py>`_
  implements a conjugate-gradient solver with cooperative groups and
  multi-block synchronization.
- `nvidia_smi.py <https://github.com/NVIDIA/cuda-python/blob/|cuda_bindings_github_ref|/cuda_bindings/examples/4_CUDA_Libraries/nvidia_smi.py>`_
  uses NVML to implement a Python subset of ``nvidia-smi``.

Advanced and interoperability
-----------------------------

- `iso_fd_modelling.py <https://github.com/NVIDIA/cuda-python/blob/|cuda_bindings_github_ref|/cuda_bindings/examples/extra/iso_fd_modelling.py>`_
  runs isotropic finite-difference wave propagation across multiple GPUs with
  peer-to-peer halo exchange.
- `jit_program.py <https://github.com/NVIDIA/cuda-python/blob/|cuda_bindings_github_ref|/cuda_bindings/examples/extra/jit_program.py>`_
  JIT-compiles a SAXPY kernel with NVRTC and launches it through the Driver
  API.
- `numba_emm_plugin.py <https://github.com/NVIDIA/cuda-python/blob/|cuda_bindings_github_ref|/cuda_bindings/examples/extra/numba_emm_plugin.py>`_
  shows how to back Numba's EMM interface with the NVIDIA CUDA Python Driver
  API.

cuda_bindings/docs/source/index.rst

Lines changed: 1 addition & 0 deletions
@@ -11,6 +11,7 @@
    release
    install
    overview
+   examples
    motivation
    environment_variables
    api

cuda_bindings/docs/source/install.rst

Lines changed: 2 additions & 2 deletions
@@ -78,7 +78,7 @@ Installing from Source
 ----------------------

 Requirements
-^^^^^^^^^^^^
+~~~~~~~~~~~~

 * CUDA Toolkit headers[^1]
 * CUDA Runtime static library[^2]
@@ -100,7 +100,7 @@ See `Environment Variables <environment_variables.rst>`_ for a description of ot
 Only ``cydriver``, ``cyruntime`` and ``cynvrtc`` are impacted by the header requirement.

 Editable Install
-^^^^^^^^^^^^^^^^
+~~~~~~~~~~~~~~~~

 You can use:

cuda_bindings/docs/source/overview.rst

Lines changed: 9 additions & 5 deletions
@@ -25,13 +25,14 @@ code into
 `PTX <https://docs.nvidia.com/cuda/parallel-thread-execution/index.html>`_ and
 then extract the function to be called at a later point in the application. You
 construct your device code in the form of a string and compile it with
-`NVRTC <http://docs.nvidia.com/cuda/nvrtc/index.html>`_, a runtime compilation
+`NVRTC <https://docs.nvidia.com/cuda/nvrtc/index.html>`_, a runtime compilation
 library for CUDA C++. Using the NVIDIA `Driver
-API <http://docs.nvidia.com/cuda/cuda-driver-api/index.html>`_, manually create a
+API <https://docs.nvidia.com/cuda/cuda-driver-api/index.html>`_, manually create a
 CUDA context and all required resources on the GPU, then launch the compiled
 CUDA C++ code and retrieve the results from the GPU. Now that you have an
 overview, jump into a commonly used example for parallel programming:
-`SAXPY <https://developer.nvidia.com/blog/six-ways-saxpy/>`_.
+`SAXPY <https://developer.nvidia.com/blog/six-ways-saxpy/>`_. For more
+end-to-end samples, see the :doc:`examples` page.

 The first thing to do is import the `Driver
 API <https://docs.nvidia.com/cuda/cuda-driver-api/index.html>`_ and
@@ -427,7 +428,7 @@ Putting it all together:
 )

 The final step is to construct a ``kernelParams`` argument that fulfills all of the launch API conditions. This is made easy because each array object comes
-with a `ctypes <https://numpy.org/doc/stable/reference/generated/numpy.ndarray.ctypes.html#numpy.ndarray.ctypes>`_ data attribute that returns the underlying ``void*`` pointer value.
+with NumPy's `ctypes data attribute <https://numpy.org/doc/stable/reference/generated/numpy.ndarray.ctypes.html#numpy.ndarray.ctypes>`_ that returns the underlying ``void*`` pointer value.

 By having the final array object contain all pointers, we fulfill the contiguous array requirement:
@@ -520,7 +521,10 @@ CUDA objects

 Certain CUDA kernels use native CUDA types as their parameters such as ``cudaTextureObject_t``. These types require special handling since they're neither a primitive ctype nor a custom user type. Since ``cuda.bindings`` exposes each of them as Python classes, they each implement ``getPtr()`` and ``__int__()``. These two callables used to support the NumPy and ctypes approach. The difference between each call is further described under `Tips and Tricks <https://nvidia.github.io/cuda-python/cuda-bindings/latest/tips_and_tricks.html#>`_.

-For this example, lets use the ``transformKernel`` from `examples/0_Introduction/simpleCubemapTexture_test.py <https://github.com/NVIDIA/cuda-python/blob/main/cuda_bindings/examples/0_Introduction/simpleCubemapTexture_test.py>`_:
+For this example, lets use the ``transformKernel`` from
+`simple_cubemap_texture.py <https://github.com/NVIDIA/cuda-python/blob/|cuda_bindings_github_ref|/cuda_bindings/examples/0_Introduction/simple_cubemap_texture.py>`_.
+The :doc:`examples` page links to more samples covering textures, graphs,
+memory mapping, and multi-GPU workflows.

 .. code-block:: python
