cuda_example

An example project built with pybind11, CUDA, and scikit-build-core. Python 3.9+.

The extension renders the Mandelbrot set two ways — once on the CPU and once on the GPU — so you can read both side by side and compare their performance. The two implementations are written the same way on purpose:

  • src/mandelbrot_cpu.cpp — a plain nested loop over every pixel
  • src/mandelbrot.cu — the same logic as a CUDA kernel, one thread per pixel

Both return a (height, width) int32 NumPy array of escape counts. Building requires the CUDA Toolkit (nvcc): the CMake project declares CUDA as a required language, so configuration fails without it. The CUDA runtime is linked statically, so the resulting wheels do not depend on libcudart and stay importable on machines without a GPU. Calling mandelbrot_gpu on such a machine raises an error, so check cuda_available() first.
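For readers who want to see the escape-count logic without opening the C++ or CUDA sources, here is a minimal pure-Python sketch of the "nested loop over every pixel" idea. It is not the code from src/mandelbrot_cpu.cpp: the function name mandelbrot_reference, the pixel-to-plane mapping (taken from the plot extent used later in this README), and the escape radius of 2 are all assumptions made for illustration.

```python
import numpy as np


def mandelbrot_reference(width, height, max_iterations,
                         x_range=(-2.0, 1.0), y_range=(-1.5, 1.5)):
    """Hypothetical reference: one escape count per pixel, shape (height, width)."""
    image = np.zeros((height, width), dtype=np.int32)
    for row in range(height):          # the CUDA version would assign one thread
        for col in range(width):       # to each (row, col) pair instead of looping
            # Assumed mapping from pixel indices to a point c in the complex plane.
            cx = x_range[0] + (x_range[1] - x_range[0]) * col / width
            cy = y_range[0] + (y_range[1] - y_range[0]) * row / height
            zx = zy = 0.0
            count = 0
            # Iterate z -> z*z + c until |z| exceeds 2 or the budget runs out.
            while count < max_iterations and zx * zx + zy * zy <= 4.0:
                zx, zy = zx * zx - zy * zy + cx, 2.0 * zx * zy + cy
                count += 1
            image[row, col] = count
    return image
```

This is far slower than either compiled implementation and is only meant for reading; its counts may not match the extension's output if the real mapping or escape test differs.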

Installation

  • Clone this repository
  • pip install ./cuda_example

The CUDA Toolkit (nvcc) must be installed and discoverable by CMake.

Test call

import cuda_example

# (height, width) int32 array of escape counts
image = cuda_example.mandelbrot_cpu(width=800, height=600, max_iterations=100)

if cuda_example.cuda_available():
    image = cuda_example.mandelbrot_gpu(width=800, height=600, max_iterations=100)

You can view the result with any plotting library, e.g.:

import matplotlib.pyplot as plt

plt.imshow(image, extent=(-2, 1, -1.5, 1.5), cmap="twilight_shifted")
plt.show()
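The cuda_available() guard from the test call can also be wrapped in a small helper that picks the GPU when one is present and falls back to the CPU otherwise. This is a sketch, not part of the package: the name render is hypothetical, and the module is passed in as an argument so the helper only relies on the three functions this README documents.

```python
def render(backend, width, height, max_iterations):
    """Hypothetical helper: use the GPU implementation if a CUDA device exists.

    `backend` is expected to be the imported cuda_example module (or any object
    exposing cuda_available, mandelbrot_gpu and mandelbrot_cpu).
    """
    args = {"width": width, "height": height, "max_iterations": max_iterations}
    if backend.cuda_available():
        return backend.mandelbrot_gpu(**args)
    # No device: the CPU path takes the same arguments and returns the same shape.
    return backend.mandelbrot_cpu(**args)
```

With the package installed, usage would look like image = render(cuda_example, 800, 600, 100).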

Comparing CPU and GPU

Because both functions take the same arguments and return identical arrays, you can run them back to back and time them (on a machine with a GPU):

import time
import cuda_example

size = {"width": 2000, "height": 1500, "max_iterations": 200}

start = time.perf_counter()
cpu = cuda_example.mandelbrot_cpu(**size)
print(f"CPU: {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
gpu = cuda_example.mandelbrot_gpu(**size)
print(f"GPU: {time.perf_counter() - start:.3f}s")

assert (cpu == gpu).all()  # identical results, very different runtimes
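A single timed call can be misleading, because the first CUDA call in a process typically also pays for one-time runtime initialization. One way to get steadier numbers is to run a warm-up call and then keep the best of several repeats. The helper below is a sketch with a hypothetical name (best_of); it is not part of cuda_example and makes no assumptions about the functions it times beyond accepting keyword arguments.

```python
import time


def best_of(func, repeats=5, warmup=1, **kwargs):
    """Hypothetical timing helper: return the fastest of `repeats` timed calls."""
    for _ in range(warmup):
        func(**kwargs)  # untimed, absorbs one-time setup costs
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(**kwargs)
        timings.append(time.perf_counter() - start)
    return min(timings)
```

Usage would be along the lines of best_of(cuda_example.mandelbrot_gpu, **size), compared against the same call with mandelbrot_cpu.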

Building CUDA wheels

The Wheels workflow builds CUDA-enabled Linux wheels with cibuildwheel, using the custom manylinux images that ship the CUDA Toolkit (see pypa/cibuildwheel#2896). The images are configured in pyproject.toml:

[tool.cibuildwheel]
manylinux-x86_64-image = "quay.io/manylinux_cuda/manylinux_2_28_x86_64_cuda13_1:latest"
manylinux-aarch64-image = "quay.io/manylinux_cuda/manylinux_2_28_aarch64_cuda13_1:latest"

To target a different CUDA version without editing pyproject.toml (e.g. the older cuda12_9, to support older drivers), override the images with environment variables when running cibuildwheel:

export CIBW_MANYLINUX_X86_64_IMAGE=quay.io/manylinux_cuda/manylinux_2_28_x86_64_cuda12_9:latest
export CIBW_MANYLINUX_AARCH64_IMAGE=quay.io/manylinux_cuda/manylinux_2_28_aarch64_cuda12_9:latest
cibuildwheel

The available images are listed in the cibuildwheel docs; the manylinux_2_28/manylinux_2_34 base and cuda12_9/cuda13_1 version can be mixed and matched.
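The mix-and-match naming can be sketched as a small helper that assembles image names and the matching cibuildwheel environment variables. The pattern is inferred only from the image names shown in this README, and the function names cuda_image and cibw_env are hypothetical; check the published image list before relying on a particular combination.

```python
# Values mentioned in this README; other bases or CUDA versions are not assumed.
BASES = ("manylinux_2_28", "manylinux_2_34")
ARCHES = ("x86_64", "aarch64")
CUDA_VERSIONS = ("cuda12_9", "cuda13_1")


def cuda_image(base, arch, cuda):
    """Hypothetical helper: build an image name following the README's pattern."""
    if base not in BASES or arch not in ARCHES or cuda not in CUDA_VERSIONS:
        raise ValueError(f"unknown combination: {base}, {arch}, {cuda}")
    return f"quay.io/manylinux_cuda/{base}_{arch}_{cuda}:latest"


def cibw_env(base, cuda):
    """Environment variables that override both architecture images at once."""
    return {
        f"CIBW_MANYLINUX_{arch.upper()}_IMAGE": cuda_image(base, arch, cuda)
        for arch in ARCHES
    }
```

The resulting dictionary can be merged into os.environ before invoking cibuildwheel from a script.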

The CUDA runtime is linked statically (CUDA_RUNTIME_LIBRARY Static), so the resulting wheels do not depend on libcudart. GitHub-hosted runners have no GPU, so the wheels are compiled and imported, but the kernels themselves only run on a machine with a CUDA device.

Testing the CUDA build locally with Docker

You don't need a GPU (or even a Linux machine) to compile and import the CUDA build — the manylinux images ship the CUDA Toolkit, so nvcc runs inside the container. The kernels are compiled and the wheel is imported; they just can't execute on the GPU without a device (those tests are skipped).

Pick the image matching your host architecture (the aarch64 image runs natively on Apple Silicon; on x86_64 use the x86_64 image):

# Apple Silicon / arm64 host:
IMAGE=quay.io/manylinux_cuda/manylinux_2_28_aarch64_cuda13_1:latest
# x86_64 host:
# IMAGE=quay.io/manylinux_cuda/manylinux_2_28_x86_64_cuda13_1:latest

mkdir -p wheelhouse
docker run --rm \
  -v "$PWD":/io:ro \
  -v "$PWD/wheelhouse":/wheelhouse \
  "$IMAGE" bash -lc '
    PY=/opt/python/cp312-cp312/bin/python
    cp -r /io /tmp/src && cd /tmp/src
    $PY -m pip install --upgrade pip build pytest
    $PY -m build --wheel --outdir /wheelhouse .   # compiles src/mandelbrot.cu with nvcc
    $PY -m pip install /wheelhouse/*.whl
    $PY -m pytest                                  # GPU tests skip (no device)
  '

The compiled wheel is written to ./wheelhouse/ on the host, so you can inspect or install it afterwards. Because the container has no GPU, cuda_available() returns False and the mandelbrot_gpu test is skipped (the mandelbrot_cpu tests still run). The same flow runs in CI in the cuda job of .github/workflows/pip.yml.
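The skip behaviour described above can be sketched as follows. This is not the repository's actual test file: it uses the standard library's unittest (which pytest also collects) to stay dependency-free, the class and test names are hypothetical, and the guarded import is only there so the sketch loads even where the wheel is not installed.

```python
import unittest

try:
    import cuda_example  # the extension built from this repository
    HAVE_GPU = bool(cuda_example.cuda_available())
except ImportError:
    cuda_example = None
    HAVE_GPU = False


class MandelbrotGpuTest(unittest.TestCase):
    # Skipped in containers and on CI runners, where no CUDA device is present.
    @unittest.skipUnless(HAVE_GPU, "no CUDA device available")
    def test_gpu_matches_cpu(self):
        args = {"width": 64, "height": 48, "max_iterations": 50}
        cpu = cuda_example.mandelbrot_cpu(**args)
        gpu = cuda_example.mandelbrot_gpu(**args)
        self.assertEqual(gpu.shape, (48, 64))  # (height, width)
        self.assertTrue((cpu == gpu).all())
```

On a machine with a GPU the test runs and compares both implementations; elsewhere it is reported as skipped rather than failed.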

Files

This example has several files that are a good idea, but aren't strictly necessary. The necessary files are:

  • pyproject.toml: The Python project file
  • CMakeLists.txt: The CMake configuration file, which requires the CUDA language
  • src/main.cpp: The pybind11 bindings (turns the results into NumPy arrays)
  • src/mandelbrot_cpu.cpp: The CPU implementation
  • src/mandelbrot.cu: The CUDA kernel and runtime device query
  • src/mandelbrot.h: The shared declarations
  • src/cuda_example/__init__.py: The Python portion of the module. The root of the module needs to be <package_name>, src/<package_name>, or python/<package_name> to be auto-discovered.

These files are also expected and highly recommended:

  • .gitignore: Git's ignore list, also used by scikit-build-core to select files for the SDist
  • README.md: The source for the PyPI description
  • LICENSE: The license file

There are also some optional files:

  • .pre-commit-config.yaml: Configuration for the fantastic static-check runner pre-commit.
  • noxfile.py: Configuration for the nox task runner, which helps make setup easier for contributors.

This is a simplified version of the recommendations in the Scientific-Python Development Guide, which is a highly recommended read for anyone interested in Python package development (Scientific or not). The guide also has a cookiecutter that includes scikit-build-core and pybind11 as a backend choice.

CI Examples

There are examples for CI in .github/workflows. The "wheels.yml" file builds CUDA-enabled binary "wheels" for Linux (x86_64 and aarch64) using cibuildwheel, and "pip.yml" does a quick build-and-import check in the CUDA containers.

License

pybind11 is provided under a BSD-style license that can be found in the LICENSE file. By using, distributing, or contributing to this project, you agree to the terms and conditions of this license.