+++
title = "rapids-singlecell 0.15.0: Prebuilt CUDA Wheels and Compiled Kernels"
date = 2026-04-30T00:00:05+01:00
description = "Why we moved from CuPy RawKernels to nanobind C++ extensions and other release highlights."
author = "Severin Dicks, Lukas Heumos"
draft = false
+++

# Rapids-singlecell release 0.15.0

We are proud to announce rapids-singlecell 0.15.0, which comes with many new features as well as changes to the installation process.

## Why the packaging changes

In earlier versions of rapids-singlecell, all GPU kernels were written as CuPy RawKernels.
These were compiled the first time you called them — in your environment, on your machine.
That worked, but it came with friction:

- **First-call latency.**
  The initial invocation of a kernel-backed function could take several seconds while nvrtc compiled the CUDA source.
- **Silent dtype/layout mismatches.**
  A RawKernel receives raw pointers.
  If the input array had the wrong dtype or wasn't C-contiguous, the kernel could silently produce garbage rather than raising an error.
- **CUDA code trapped in Python strings.**
  RawKernels are defined as CUDA source inside Python string literals.
  That means no syntax highlighting, no autocomplete, and no compiler warnings in your editor — debugging C++ code buried in a Python string is nobody's idea of a good time.

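To make the dtype hazard concrete, here is a small NumPy-only sketch (no GPU required, and not rapids-singlecell code) of what reading a buffer under the wrong dtype does. This is effectively what a kernel compiled for `float32` sees when handed a `float64` array's raw pointer:

```python
import numpy as np

# The same bytes reinterpreted under the wrong dtype: this mirrors what a
# kernel compiled for float32 reads from a float64 array's raw pointer.
x = np.arange(4, dtype=np.float64)
as_f32 = x.view(np.float32)  # no copy, no error -- just a reinterpretation

print(x)       # [0. 1. 2. 3.]
print(as_f32)  # eight garbage float32 values instead of the four inputs
```

Nothing raises; the computation simply proceeds with the wrong numbers. That is the failure mode the new typed bindings eliminate.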
Starting with 0.15.0, these kernels are compiled once at build time and shipped as nanobind/CUDA C++ extension modules inside the wheel.
The result is a more conventional compiled-extension workflow: you `pip install` the package and every kernel is ready immediately.

### Packaging changes in detail

The GPU kernels that were previously CuPy RawKernels are now nanobind C++ extensions built with `scikit-build-core` and CMake.
This gives us:

- **No runtime compilation** for any migrated kernel — the compiled code is in the wheel.
- **Typed bindings at the Python/C++ boundary.**
  nanobind enforces dtype (e.g. float32 vs float64) and memory layout (C-contiguous vs F-contiguous) before the kernel launches, so mismatches raise a clear `TypeError` instead of producing wrong results.
- **A conventional C++/CUDA project structure** with headers, shared helpers, and room for larger fused or fully C++ GPU routines.
  Harmony2, shipping in this release, is the first example of a more complex function built on this foundation.
- **CUDA-versioned wheel packaging.**
  CI builds separate wheels for each CUDA major version — `rapids-singlecell-cu12` and `rapids-singlecell-cu13` — each with a `[rapids]` dependency extra that pulls in the matching RAPIDS and CuPy packages.

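The boundary check itself is conceptually simple. Here is a rough Python sketch of what such a typed boundary rejects (with NumPy standing in for the actual nanobind signature, and `check_kernel_input` a hypothetical name):

```python
import numpy as np

def check_kernel_input(x, dtype=np.float32):
    """Sketch (not the real binding) of the checks a typed signature enforces."""
    if x.dtype != dtype:
        raise TypeError(f"expected {np.dtype(dtype).name}, got {x.dtype.name}")
    if not x.flags["C_CONTIGUOUS"]:
        raise TypeError("expected a C-contiguous array")
    return x

check_kernel_input(np.zeros((4, 4), dtype=np.float32))  # passes silently

try:
    check_kernel_input(np.zeros((4, 4), dtype=np.float64))
except TypeError as err:
    print(err)  # expected float32, got float64
```

The point is that the failure happens before any kernel launches, with an error message naming the mismatch, rather than after the fact in corrupted results.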
The Python API and import name are unchanged:

```python
import rapids_singlecell as rsc
```

Your existing analysis scripts should work without modification.

### CUDA-specific wheels

Because the kernels are now compiled binaries, we need to ship one wheel per CUDA major version.
(Python wheel tags don't encode the CUDA version, so we encode it in the package name — the same approach used by CuPy, PyTorch, and other CUDA-dependent packages.)

| Package | Build CUDA | Runtime CUDA | Blackwell (B200, GB200) |
| :----------------------- | :--------: | :----------: | :---------------------- |
| `rapids-singlecell-cu12` | 12.2 | 12.2 – 12.9+ | Supported via PTX JIT |
| `rapids-singlecell-cu13` | 13.0 | 13.0+ | Native binaries |

Both wheels are available for **x86_64** and **aarch64** on Linux.

If you have a Blackwell GPU (B200, GB200) and want the best out-of-the-box performance, the CUDA 13 wheel includes native binaries for Blackwell architectures.
The CUDA 12 wheel still supports Blackwell through PTX just-in-time compilation, so it will work, but the first kernel launch on Blackwell will be slightly slower while the driver JIT-compiles the PTX.

### How to install

#### Prebuilt wheel (recommended)

Pick the wheel that matches your CUDA version:

```bash
pip install rapids-singlecell-cu13  # CUDA 13
pip install rapids-singlecell-cu12  # CUDA 12
```

This installs rapids-singlecell with precompiled kernels, but does **not** pull in the RAPIDS stack (cupy, cuml, cudf, etc.).
If you manage those dependencies separately — for example, through conda — this is all you need.

#### Prebuilt wheel with RAPIDS dependencies

If you want pip to also install the matching RAPIDS and CuPy packages:

```bash
pip install 'rapids-singlecell-cu13[rapids]' --extra-index-url=https://pypi.nvidia.com
pip install 'rapids-singlecell-cu12[rapids]' --extra-index-url=https://pypi.nvidia.com
```

Note: on the prebuilt wheels, the dependency extra is always `[rapids]`.
The CUDA version is determined by which package name you install — `rapids-singlecell-cu12` or `rapids-singlecell-cu13`.
If you're building from source instead, the extras are `[rapids-cu12]` and `[rapids-cu13]`.

#### Conda / Mamba

Environment files are provided in the repository:

```bash
conda env create -f conda/rsc_rapids_26.04_cuda13.yml  # Python 3.14, CUDA 13
conda env create -f conda/rsc_rapids_26.04_cuda12.yml  # Python 3.14, CUDA 12
```

> **Note:** RAPIDS currently does not support `channel_priority: strict`. Use `channel_priority: flexible` instead.

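If the solver complains, the setting lives in your conda configuration. The relevant `.condarc` fragment (a config sketch; `channel_priority` is a standard conda setting) is:

```yaml
# ~/.condarc
channel_priority: flexible
```
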
#### Docker / Apptainer

Prebuilt containers are available for both CUDA versions:

```bash
docker pull ghcr.io/scverse/rapids-singlecell-cu13:latest
docker run --rm --gpus all ghcr.io/scverse/rapids-singlecell-cu13:latest
```

For HPC clusters using Apptainer/Singularity:

```bash
apptainer pull rsc.sif docker://ghcr.io/scverse/rapids-singlecell-cu13:latest
apptainer run --nv rsc.sif
```

### Migration from 0.14.x

For most users, upgrading is straightforward:

1. **Change your pip install command.**
   Replace `pip install rapids-singlecell` with `pip install rapids-singlecell-cu12` or `rapids-singlecell-cu13`, depending on your CUDA version.
2. **No code changes needed.**
   The `import rapids_singlecell as rsc` import and all public APIs remain the same.
3. **Check your CUDA version.**
   Run `nvidia-smi` or `nvcc --version` to confirm whether you're on CUDA 12.x or 13.x, and install the matching wheel.
   If you're using conda, make sure the CUDA runtime library version in your environment matches the wheel you install — e.g., `cuda-cudart` from the `nvidia` channel should be 12.x for the cu12 wheel or 13.x for the cu13 wheel.
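If you're unsure which variant ended up in a given environment, the distribution metadata can tell you. A small sketch using only the standard library (the distribution names are the ones introduced in this release):

```python
from importlib import metadata

# Report which rapids-singlecell distribution, if any, is installed.
for name in ("rapids-singlecell-cu13", "rapids-singlecell-cu12", "rapids-singlecell"):
    try:
        print(name, metadata.version(name))
        break
    except metadata.PackageNotFoundError:
        continue
else:
    print("no rapids-singlecell distribution found")
```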

### What about `pip install rapids-singlecell`?

The plain install — `pip install rapids-singlecell`, without the `-cu12` or `-cu13` suffix — still works.
It compiles the CUDA extensions from source during installation.
This is perfectly functional, but be aware of what it entails: you need a CUDA toolkit with nvcc, CMake ≥ 3.24, and a compatible C++ compiler already present in your environment, and the build takes longer than downloading a prebuilt wheel.

When building from source, you can install the matching RAPIDS dependencies with the `[rapids-cu12]` or `[rapids-cu13]` extra:

```bash
pip install 'rapids-singlecell[rapids-cu12]' --extra-index-url=https://pypi.nvidia.com
```

Or install the RAPIDS stack separately, before or after the build.

For most users, we recommend the prebuilt CUDA wheels: they're faster to install and don't require a local compiler toolchain.
For more details on source builds — including how to target custom GPU architectures — see the [installation docs](https://rapids-singlecell.readthedocs.io/en/latest/installation.html).

Source builds are the right choice if you are:

- **Contributing to rapids-singlecell** and need to iterate on C++ kernel code.
- **Debugging CUDA extensions** and want to compile with debug flags or sanitizers.
- **Targeting a custom GPU architecture** not covered by the prebuilt wheels (e.g. a future compute capability).
- **On a platform we don't publish wheels for** (we currently cover x86_64 and aarch64 Linux).

If none of those apply to you, use the prebuilt wheel.

## Other highlights in 0.15.0

Beyond packaging, this release includes a substantial set of algorithmic and performance improvements built up across the 0.15.0 development cycle:

### Harmony2 and C++ Harmony

Harmony was rewritten as a C++ nanobind kernel ([#578](https://github.com/scverse/rapids-singlecell/pull/578)), making it significantly faster and more memory-efficient.
On top of that, we implemented three algorithmic improvements from the Harmony2 paper (Patikas et al. 2026): a stabilized diversity penalty, dynamic per-cluster-per-batch ridge regularization, and automatic batch pruning to prevent overintegration in biologically heterogeneous datasets ([#625](https://github.com/scverse/rapids-singlecell/pull/625)).
This is also the first example of a more complex routine built on the new compiled-kernel infrastructure.

### Contrast-based energy distance

Perturbation experiments typically don't need a full k×k distance matrix between all groups — you want to compare each perturbation against one or two controls, possibly stratified by cell type.
The new `contrast_distances()` API ([#603](https://github.com/scverse/rapids-singlecell/pull/603)) lets you express exactly that.
You build a contrasts DataFrame — either with the `Distance.create_contrasts()` helper or by hand — where each row is a (target, reference) comparison, optionally stratified by `split_by` columns (e.g., cell type).
Under the hood, the kernel deduplicates shared distance pairs across contrasts, subsets the embedding to only the referenced cells before transferring to the GPU, and launches a single kernel call for all unique pairs.
The result is a copy of your contrasts DataFrame with an `edistance` column appended.

```python
from rapids_singlecell.pertpy_gpu import Distance

dist = Distance("edistance")

# Compare each perturbation against two controls, stratified by cell type
contrasts = Distance.create_contrasts(
    adata,
    groupby="target_gene",
    selected_group=["Non_target", "Scramble"],
    split_by="cell_type",
)

result = dist.contrast_distances(adata, contrasts=contrasts)
```
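
For intuition, the energy distance itself has a compact textbook definition: twice the mean cross-group pairwise distance, minus each group's mean within-group pairwise distance. A naive NumPy reference (fine for small groups and for intuition only; it is not the tiled GPU kernel):

```python
import numpy as np

def energy_distance(X, Y):
    """Naive O(n*m) energy distance between two embeddings (rows = cells)."""
    d_xy = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1).mean()
    d_xx = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1).mean()
    d_yy = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1).mean()
    return 2.0 * d_xy - d_xx - d_yy

rng = np.random.default_rng(0)
ctrl = rng.normal(size=(200, 10))           # stand-in for control cells in PCA space
pert = rng.normal(loc=0.5, size=(200, 10))  # stand-in for a shifted perturbation

print(energy_distance(ctrl, ctrl))  # 0.0 for identical groups
print(energy_distance(ctrl, pert))  # positive when the groups differ
```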

`onesided_distances()` also now accepts a sequence of control group names via `selected_group`, returning a DataFrame with one column per control ([#601](https://github.com/scverse/rapids-singlecell/pull/601)).
Both the energy distance and co-occurrence kernels gained multi-GPU support ([#545](https://github.com/scverse/rapids-singlecell/pull/545), [#546](https://github.com/scverse/rapids-singlecell/pull/546)).

### More highlights

- **RAPIDS 26.04 and Python 3.14 support** across all CI and conda environments.
- **Dask support for `highly_variable_genes`** with the Seurat v3 flavor ([#616](https://github.com/scverse/rapids-singlecell/pull/616)).
- **CUDA kernel error surfacing** — launch errors are now raised instead of being silently ignored ([#619](https://github.com/scverse/rapids-singlecell/pull/619)).
- **Additional tutorials**, such as a Pertpy-GPU tutorial ([#645](https://github.com/scverse/rapids-singlecell/pull/645)).

A big thank you to everyone who tested the pre-releases and helped surface issues before this release went out.

For questions and bug reports, visit the [GitHub issue tracker](https://github.com/scverse/rapids_singlecell/issues).

---

*rapids-singlecell is part of the [scverse](https://scverse.org) ecosystem.
If you use it in your research, please cite the project.*