Commit 9076f80 ("update formating"), 1 parent 071c85d

content/blog/2026-rsc-goes-nanobind.md: 65 additions & 23 deletions
@@ -1,27 +1,48 @@
+++
title = "rapids-singlecell 0.15.0: Prebuilt CUDA Wheels and Compiled Kernels"
date = 2026-04-30T00:00:05+01:00
description = "rapids-singlecell 0.15.0 ships GPU kernels as precompiled wheels — no more runtime compilation."
author = "Severin Dicks"
draft = false
+++
# rapids-singlecell 0.15.0: Prebuilt CUDA Wheels and Compiled Kernels

*rapids-singlecell 0.15.0 now ships GPU kernels as precompiled extensions instead of being compiled at runtime.
Here's what that means for you.*

---

## Why the packaging changed

In earlier versions of rapids-singlecell, all GPU kernels were written as CuPy RawKernels.
These were compiled the first time you called them — in your environment, on your machine.
That worked, but it came with friction:

- **First-call latency.**
The initial invocation of a kernel-backed function could take several seconds while nvrtc compiled the CUDA source.
- **Silent dtype/layout mismatches.**
A RawKernel receives raw pointers.
If the input array had the wrong dtype or wasn't C-contiguous, the kernel might silently produce garbage rather than raising an error.
- **CUDA code trapped in Python strings.**
RawKernels are defined as CUDA source inside Python string literals.
That means no syntax highlighting, no autocomplete, and no compiler warnings in your editor — debugging C++ code buried in a Python string is nobody's idea of a good time.
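To picture the old workflow, here is an illustrative RawKernel-style snippet. The kernel name and CUDA source are made up for this sketch and are not taken from rapids-singlecell:

```python
# Pre-0.15.0 pattern described above: the CUDA source lives in a Python
# string literal, invisible to editors and compilers until runtime.
AXPY_SRC = r"""
extern "C" __global__
void axpy(const float* x, float* y, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}
"""

# With CuPy this string would become a launchable kernel, compiled by
# nvrtc on first use:
#
#   import cupy as cp
#   axpy = cp.RawKernel(AXPY_SRC, "axpy")
#   axpy((grid,), (block,), (x, y, cp.float32(2.0), n))
#
# Note the hazard: if x were float64, the kernel would reinterpret the
# bytes as float32 and silently produce garbage values.
```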

Starting with 0.15.0, these kernels are compiled once at build time and shipped as nanobind/CUDA C++ extension modules inside the wheel.
The result is a more conventional compiled-extension workflow: you `pip install` the package and every kernel is ready immediately.

## What changed

The GPU kernels that were previously CuPy RawKernels are now nanobind C++ extensions built with `scikit-build-core` and CMake.
This gives us:

- **No runtime compilation** for any migrated kernel — the compiled code is in the wheel.
- **Typed bindings at the Python/C++ boundary.**
nanobind enforces dtype (e.g. float32 vs float64) and memory layout (C-contiguous vs F-contiguous) before the kernel launches, so mismatches raise a clear `TypeError` instead of producing wrong results.
- **A conventional C++/CUDA project structure** with headers, shared helpers, and room for larger fused or fully C++ GPU routines.
Harmony2, shipping in this release, is the first example of a more complex function built on this foundation.
- **CUDA-versioned wheel packaging.**
CI builds separate wheels for each CUDA major version — `rapids-singlecell-cu12` and `rapids-singlecell-cu13` — each with a `[rapids]` dependency extra that pulls in the matching RAPIDS and CuPy packages.
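That boundary check can be pictured with a small NumPy stand-in. This is a sketch of the kind of validation a typed nanobind signature performs, not the library's actual binding code:

```python
import numpy as np

def check_kernel_arg(arr: np.ndarray, *, dtype=np.float32) -> np.ndarray:
    """Mimic the check a typed binding performs at the Python/C++ boundary:
    reject wrong dtypes and non-C-contiguous layouts up front, instead of
    letting a raw-pointer kernel read garbage."""
    if arr.dtype != dtype:
        raise TypeError(f"expected {np.dtype(dtype).name}, got {arr.dtype.name}")
    if not arr.flags["C_CONTIGUOUS"]:
        raise TypeError("expected a C-contiguous array")
    return arr

x = np.zeros((4, 4), dtype=np.float32)
check_kernel_arg(x)          # passes: float32, C-contiguous
try:
    check_kernel_arg(x.T)    # F-contiguous view of a square array: rejected
except TypeError as e:
    print("rejected:", e)
```

A RawKernel would have accepted the transposed view's pointer and read the elements in the wrong order; here the mismatch surfaces immediately as a `TypeError`.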

The Python API and import name are unchanged:

@@ -33,7 +54,8 @@ Your existing analysis scripts should work without modification.

## CUDA-specific wheels

Because the kernels are now compiled binaries, we need to ship one wheel per CUDA major version.
(Python wheel tags don't encode CUDA version, so we encode it in the package name — the same approach used by CuPy, PyTorch, and other CUDA-dependent packages.)

| Package name | Compiled with | Runtime CUDA support | Blackwell GPUs |
|---|---|---|---|
@@ -42,7 +64,8 @@ Because the kernels are now compiled binaries, we need to ship one wheel per CUD

Both wheels are available for **x86_64** and **aarch64** on Linux.

If you have a Blackwell GPU (B200, GB200) and want the best out-of-the-box performance, the CUDA 13 wheel includes native binaries for Blackwell architectures.
The CUDA 12 wheel still supports Blackwell through PTX just-in-time compilation, so it will work, but the first kernel launch on Blackwell will be slightly slower while the driver JIT-compiles the PTX.

## How to install

@@ -55,7 +78,8 @@ pip install rapids-singlecell-cu13 # CUDA 13
pip install rapids-singlecell-cu12 # CUDA 12
```

This installs rapids-singlecell with precompiled kernels, but does **not** pull in the RAPIDS stack (cupy, cuml, cudf, etc.).
If you manage those dependencies separately — for example, through conda — this is all you need.

### Prebuilt wheel with RAPIDS dependencies

@@ -66,7 +90,9 @@ pip install 'rapids-singlecell-cu13[rapids]' --extra-index-url=https://pypi.nvid
pip install 'rapids-singlecell-cu12[rapids]' --extra-index-url=https://pypi.nvidia.com
```

Note: on the prebuilt wheels, the dependency extra is always `[rapids]`.
The CUDA version is determined by which package name you install — `rapids-singlecell-cu12` or `rapids-singlecell-cu13`.
If you're building from source instead, the extras are `[rapids-cu12]` and `[rapids-cu13]`.

### Conda / Mamba

@@ -99,13 +125,19 @@ apptainer run --nv rsc.sif

For most users, upgrading is straightforward:

1. **Change your pip install command.**
Replace `pip install rapids-singlecell` with `pip install rapids-singlecell-cu12` or `rapids-singlecell-cu13`, depending on your CUDA version.
2. **No code changes needed.**
The `import rapids_singlecell as rsc` import and all public APIs remain the same.
3. **Check your CUDA version.**
Run `nvidia-smi` or `nvcc --version` to confirm whether you're on CUDA 12.x or CUDA 13.x, and install the matching wheel.
If you're using conda, make sure the CUDA runtime library version in your environment matches the wheel you install — e.g., `cuda-cudart` from the `nvidia` channel should be 12.x for the cu12 wheel or 13.x for the cu13 wheel.
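Step 3 can also be scripted. This hypothetical helper (not part of the package) maps a CUDA version string, as reported by `nvidia-smi` or `nvcc --version`, to the matching wheel name:

```python
def wheel_for_cuda(version: str) -> str:
    """Map a CUDA version string (e.g. "12.4" or "13.0") to the matching
    prebuilt wheel name. Illustrative helper, not part of the package."""
    major = int(version.split(".")[0])
    if major == 12:
        return "rapids-singlecell-cu12"
    if major == 13:
        return "rapids-singlecell-cu13"
    raise ValueError(f"no prebuilt wheel for CUDA {version}; build from source")

print(wheel_for_cuda("12.4"))   # rapids-singlecell-cu12
print(wheel_for_cuda("13.0"))   # rapids-singlecell-cu13
```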

## What about `pip install rapids-singlecell`?

The plain install — `pip install rapids-singlecell`, without the `-cu12` or `-cu13` suffix — still works.
It will compile the CUDA extensions from source during installation.
This is perfectly functional, but please be aware of what that means: you need a CUDA toolkit with nvcc, CMake ≥ 3.24, and a compatible C++ compiler already present in your environment, and the build will take longer than downloading a prebuilt wheel.

When building from source, you can install the matching RAPIDS dependencies with the `[rapids-cu12]` or `[rapids-cu13]` extra:

@@ -115,7 +147,9 @@ pip install 'rapids-singlecell[rapids-cu12]' --extra-index-url=https://pypi.nvid

Or install the RAPIDS stack separately before or after the build.

For most users, we recommend the prebuilt CUDA wheels.
They're faster to install and don't require a local compiler toolchain.
For more details on source builds — including how to target custom GPU architectures — see the [installation docs](https://rapids-singlecell.readthedocs.io/en/latest/installation.html).

Source builds are the right choice if you are:

@@ -132,11 +166,17 @@ Beyond packaging, this release includes a substantial set of algorithmic and per

### Harmony2 and C++ harmony

Harmony was rewritten as a C++ nanobind kernel ([#578](https://github.com/scverse/rapids-singlecell/pull/578)), making it significantly faster and more memory-efficient.
On top of that, we implemented three algorithmic improvements from the Harmony2 paper (Patikas et al. 2026): a stabilized diversity penalty, dynamic per-cluster-per-batch ridge regularization, and automatic batch pruning to prevent overintegration in biologically heterogeneous datasets ([#625](https://github.com/scverse/rapids-singlecell/pull/625)).
This is also the first example of a more complex routine built on the new compiled-kernel infrastructure.

### Contrast-based energy distance

Perturbation experiments typically don't need a full k×k distance matrix between all groups — you want to compare each perturbation against one or two controls, possibly stratified by cell type.
The new `contrast_distances()` API ([#603](https://github.com/scverse/rapids-singlecell/pull/603)) lets you express exactly that.
You build a contrasts DataFrame — either with the `Distance.create_contrasts()` helper or by hand — where each row is a (target, reference) comparison, optionally stratified by `split_by` columns (e.g., cell type).
Under the hood, the kernel deduplicates shared distance pairs across contrasts, subsets the embedding to only the referenced cells before transferring to GPU, and launches a single kernel call for all unique pairs.
The result is a copy of your contrasts DataFrame with an `edistance` column appended.

```python
from rapids_singlecell.pertpy_gpu import Distance
@@ -154,7 +194,8 @@ contrasts = Distance.create_contrasts(
result = dist.contrast_distances(adata, contrasts=contrasts)
```
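The pair deduplication described above can be pictured in a few lines of plain Python. This is a conceptual sketch with made-up group names, not the library's implementation:

```python
# Each contrast is a (target, reference) comparison within a stratum
# (e.g. a cell type). Distinct contrast rows can share the same
# underlying pair, which only needs to be computed once.
contrasts = [
    ("KO_geneA", "control", "T cells"),
    ("KO_geneB", "control", "T cells"),
    ("KO_geneA", "control", "B cells"),
    ("KO_geneA", "control", "T cells"),   # duplicate of the first row
]

unique_pairs = sorted(set(contrasts))
print(len(contrasts), "contrast rows ->", len(unique_pairs), "unique pairs")
# Each unique pair is then handled once in the batched kernel launch,
# and the results are mapped back onto the contrasts table.
```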
156196

`onesided_distances()` also now accepts a sequence of control group names via `selected_group`, returning a DataFrame with one column per control ([#601](https://github.com/scverse/rapids-singlecell/pull/601)).
Both energy distance and co-occurrence kernels gained multi-GPU support ([#545](https://github.com/scverse/rapids-singlecell/pull/545), [#546](https://github.com/scverse/rapids-singlecell/pull/546)).

### More highlights

@@ -172,4 +213,5 @@ For questions and bug reports, visit the [GitHub issue tracker](https://github.c

---

*rapids-singlecell is part of the [scverse](https://scverse.org) ecosystem.
If you use it in your research, please cite the project.*
