scverse · Intron7 · Aug 13, 2025
diff --git a/README.md b/README.md
@@ -7,14 +7,11 @@
 
 # rapids-singlecell: GPU-Accelerated Single-Cell Analysis within scverse®
 
-rapids-singlecell offers enhanced single-cell data analysis as a near drop-in replacement predominantly for scanpy, while also incorporating select functionalities from squidpy and decoupler. Utilizing GPU computing with CuPy and NVIDIA RAPIDS, it emphasizes high computational efficiency. As part of the scverse ecosystem, rapids-singlecell continuously aims to maintain compatibility, adapting and growing through community collaboration.
+rapids-singlecell provides GPU-accelerated single-cell analysis with an AnnData-first API. It is largely compatible with Scanpy and includes selected functionality from Squidpy and decoupler. Computations use CuPy and NVIDIA RAPIDS for performance on large datasets.
 
-* **Broad GPU Optimization:** Facilitates accelerated processing of large datasets, with GPU-enabled AnnData objects.
-* **Selective scverse Library Integration:** Incorporates key functionalities from scanpy, with additional features from squidpy and decoupler.
-* **Easy Installation Process:** Available via Conda and PyPI, with detailed setup guidelines.
-* **Accessible Documentation:** Provides comprehensive guides and examples tailored for efficient application.
-
-Our commitment with rapids-singlecell is to deliver a powerful, user-centric tool that significantly enhances single-cell data analysis capabilities in bioinformatics.
+- **GPU acceleration**: Common single-cell workflows on `AnnData` run on the GPU.
+- **Ecosystem compatibility**: Works with Scanpy APIs; includes pieces from Squidpy and decoupler.
+- **Simple installation**: Available via Conda and PyPI.
 
 ## Documentation
 

diff --git a/docs/Installation.md b/docs/Installation.md
@@ -57,9 +57,8 @@ apptainer run --nv rsc.sif
 ```
 
 
-# GPU-Memory and System Requirements
+# System requirements
 
-*rapids-singlecell* relays for most computation on the GPU. A GPU with sufficient VRAM is therefore required to handle large datasets.
-With a RTX 3090 it's possible to analyze 200000 cells without any issues. With an A100 80GB it is even possible to analyze more than 1000000. For even larger datasets, {mod}`~rmm` is required to oversubscribe GPU memory into host memory, similar to SWAP memory. However, using `managed_memory` can result in a performance penalty, but this is still preferable to CPU runtimes.
+Most computations run on the GPU. See the Memory Management page for hardware guidance, managed memory, and known limits:
 
-The upper limit for GPU-based {class}`~anndata.AnnData` is a `.nnz` (non-zero elements) value of 2**31-1 (2147483647). This constraint is due to the maximum `indptr` (index pointer array for compressed sparse format) size that {mod}`~cupy` currently supports for sparse matrices.
+- {doc}`MM`
diff --git a/docs/MM.md b/docs/MM.md
@@ -2,28 +2,43 @@
 
 In rapids-singlecell, efficient memory management is crucial for handling large-scale datasets. This is facilitated by the integration of the RAPIDS Memory Manager ({mod}`rmm`). {mod}`rmm` is automatically invoked upon importing `rapids-singlecell`, modifying the default allocator for cupy. Integrating {mod}`rmm` with `rapids-singlecell` slightly modifies the execution speed of {mod}`cupy`. This change typically results in a minimal performance trade-off. However, it's crucial to be aware that certain specific functions, like {func}`~.pp.harmony_integrate`, might experience a more significant impact on performance efficiency due to this integration. Users can overwrite the default behavior with {func}`rmm.reinitialize`.
 
+## Quick start
+
+Pick one mode based on your dataset and hardware:
+
+- If your data fits in GPU VRAM: use the pool allocator for speed → see [Pool Allocator](#pool-allocator).
+- If your data is larger than VRAM: use managed memory to spill to host RAM → see [Managed Memory](#managed-memory).
+
+Why not both? Pool allocation and managed memory target different trade-offs. Pooling assumes you keep data in VRAM and optimizes allocation speed. Managed memory assumes you will exceed VRAM and optimizes for capacity by spilling to host RAM. Combining both can negate benefits and increase fragmentation, so choose one.
+
 ## Managed Memory
 
-In {mod}`rmm`, the `managed_memory` feature facilitates VRAM oversubscription, allowing for the processing of data structures larger than the default VRAM capacity. This effectively extends the memory limit up to twice the VRAM size. Leveraging managed memory will introduce a performance overhead. This is particularly evident with substantial oversubscription, as it necessitates increased dependency on the comparatively slower system memory, leading to slowdowns in data processing tasks.
+- Purpose: use datasets larger than GPU VRAM by spilling to host RAM.
+- How it works: VRAM oversubscription; data migrates between GPU and host as needed.
+- Trade-off: slower than fully-in-VRAM; slowdown grows with how much you spill.
+- Good for: very large datasets that otherwise OOM; exploratory or batch runs where finishing the run matters more than peak speed.
 
 ```
 # Enable `managed_memory`
 import rmm
+import cupy as cp
 from rmm.allocators.cupy import rmm_cupy_allocator
-rmm.reinitialize(
-    managed_memory=True,
-    pool_allocator=False,
-)
+
+rmm.reinitialize(managed_memory=True, pool_allocator=False)
 cp.cuda.set_allocator(rmm_cupy_allocator)
 ```
 
 ## Pool Allocator
 
-The `pool_allocator` functionality in {mod}`rmm` optimizes memory handling by pre-allocating a pool of memory, which can be swiftly accessed for GPU-related tasks. This approach, while being more memory-intensive, significantly boosts performance. It is particularly beneficial for operations that are heavy on memory usage, such as {func}`~.pp.harmony_integrate`, by minimizing the time spent on dynamic memory allocation during runtime.
+- Purpose: speed up allocations and reduce fragmentation when data fits in VRAM.
+- How it works: pre-allocates a pool; subsequent allocations come from the pool.
+- Trade-off: keeps memory reserved; needs sufficient VRAM.
+- Good for: allocation-heavy steps (e.g., neighbor graphs, harmony integration) and repeated runs.
 
 ```
 # Enable `pool_allocator`
 import rmm
+import cupy as cp
 from rmm.allocators.cupy import rmm_cupy_allocator
 rmm.reinitialize(
     managed_memory=False,
@@ -37,6 +52,24 @@ To achieve optimal memory management in rapids-singlecell, consider the followin
 
 * **Large-scale Data Analysis:** Utilize `managed_memory` for datasets exceeding your VRAM's capacity, keeping in mind the potential performance penalties.
 * **Performance-Critical Operations:** Choose `pool_allocator` when speed is critical and sufficient VRAM is available.
+* **Do not enable both together:** Pooling prefers staying in VRAM; managed memory prefers spilling when needed. Mixing them can lead to unexpected performance and memory fragmentation.
+
+### Troubleshooting
+
+- CUDA out-of-memory (OOM) while using the pool allocator → switch to [Managed Memory](#managed-memory) or reduce dataset size.
+- Very slow runtime with managed memory → reduce oversubscription or switch back to [Pool Allocator](#pool-allocator) if VRAM allows.
 
 ## Further Reading
 For a more in-depth understanding of rmm and its functionalities, refer to the [RAPIDS Memory Manager documentation](https://docs.rapids.ai/api/rmm/stable/python/).
+
+
+## System requirements and limits
+
+rapids-singlecell performs most computations on the GPU. Ensure your system has a CUDA-capable GPU with sufficient VRAM for your datasets.
+
+- With an RTX 3090, analyzing around 200,000 cells is typically feasible.
+- With an A100 80GB, analyses with 1,000,000+ cells are possible.
+
+For larger datasets, use {mod}`~rmm` managed memory to oversubscribe GPU memory into host RAM (similar to swap space). This may introduce a performance penalty but can still outperform CPU-only runs. See the Managed Memory section above for how to enable it.
+
+Limit note: For GPU-backed {class}`~anndata.AnnData`, a sparse matrix can hold at most 2**31-1 (2,147,483,647) stored elements (`.nnz`). This is due to the maximum `indptr` size currently supported by {mod}`~cupy` for sparse matrices.
diff --git a/docs/Out_of_core.md b/docs/Out_of_core.md
@@ -0,0 +1,147 @@
+# Out-of-core with Dask (GPU)
+
+Process datasets larger than GPU memory by chunking work with Dask while keeping arrays on the GPU via CuPy. Chunking also mitigates the CuPy sparse limit of `.nnz ≤ 2**31-1` by operating on smaller blocks.
+
+
+## Start a Dask CUDA cluster
+
+```python
+from dask.distributed import Client
+from dask_cuda import LocalCUDACluster
+
+# Single or multi-GPU: set CUDA_VISIBLE_DEVICES accordingly (e.g., "0,1")
+# Use one thread per worker for GPU tasks to avoid contention and VRAM spikes
+cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES="0", threads_per_worker=1)
+client = Client(cluster)
+```
+
+Notes:
+- `threads_per_worker=1` is recommended for GPU workloads. More threads can be faster but often increase temporary allocations, causing VRAM spikes or overflows; some dask-cuda releases have also shown leaks with multi-threaded workers. With row chunks of about 20,000 rows, 4–5 threads can still work on many GPUs.
+- For capacity over speed, enable RMM managed memory (see {doc}`MM`). For highest peer‑to‑peer (NVLink) performance, prefer the RMM pool allocator and avoid managed memory.
+- Multi‑GPU transport: use UCX (`protocol="ucx"`) to enable NVLink. UCX typically uses more memory and can appear leaky; TCP is more stable but slower.
+- UCX is not compatible with CUDA managed memory. For UCX/NVLink, disable managed memory. TCP can be used with managed memory.
+
+```python
+# Configure RMM on all workers
+import cupy as cp
+import rmm
+from rmm.allocators.cupy import rmm_cupy_allocator
+
+def set_mem_pool():
+    # Prefer pool allocator for performance and NVLink (managed memory can degrade P2P)
+    rmm.reinitialize(managed_memory=False, pool_allocator=True)
+    cp.cuda.set_allocator(rmm_cupy_allocator)
+
+client.run(set_mem_pool)
+```
+
+UCX example (optional):
+
+```python
+# Use UCX transport (NVLink capable) instead of TCP
+cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES="0,1", threads_per_worker=1, protocol="ucx")
+client = Client(cluster)
+```
+
+## Loading AnnData lazily from Zarr (from the multi-GPU notebook)
+
+Load `AnnData` from a Zarr store with `X` as a Dask array, and `obs/var` read eagerly. Chunk by rows.
+
+```python
+import anndata as ad
+from packaging.version import parse as parse_version
+
+if parse_version(ad.__version__) < parse_version("0.12.0rc1"):
+    from anndata.experimental import read_elem_as_dask as read_dask
+else:
+    from anndata.experimental import read_elem_lazy as read_dask
+
+import zarr
+
+SPARSE_CHUNK_SIZE = 20_000
+data_pth = "zarr/cell_atlas.zarr"  # example zarr path
+
+f = zarr.open(data_pth)
+X = f["X"]
+shape = X.attrs["shape"]
+
+adata = ad.AnnData(
+    X=read_dask(X, (SPARSE_CHUNK_SIZE, shape[1])),
+    obs=ad.io.read_elem(f["obs"]),
+    var=ad.io.read_elem(f["var"]),
+)
+```
+
+## Example: out-of-core preprocessing pipeline
+
+```python
+import rapids_singlecell as rsc
+rsc.get.anndata_to_GPU(adata)
+# Normalize and transform
+rsc.pp.normalize_total(adata)
+rsc.pp.log1p(adata)
+
+# HVG selection
+rsc.pp.highly_variable_genes(adata)
+adata = adata[:, adata.var["highly_variable"]].copy()
+
+# Scale and PCA
+rsc.pp.scale(adata, zero_center=True, max_value=10)
+rsc.pp.pca(adata, n_comps=50)
+```
+
+Most functions operate lazily; use `.compute()` only when you need concrete values on the client. Operations with reductions (e.g., scaling, HVG selection, PCA) synchronize and may call `compute()` internally.
+
+## Computing results explicitly
+
+```python
+# Dense dask+cupy matrix → cupy
+X_gpu = adata.X.compute()
+
+```
+
+## Persist and chunk sizes
+
+- Persist after major transformations or filtering to materialize results in worker memory and shorten later graphs.
+- Recompute chunk sizes to help Dask plan evenly across workers.
+
+```python
+# After filtering or transformations
+adata.X = adata.X.persist()
+adata.X.compute_chunk_sizes()
+```
+
+Persisting loads data into GPU memory across workers. This can quickly cause OOM if the dataset does not fit. On sufficiently large clusters, persisting can be extremely fast and effective.
+
+
+## Multi-GPU notes
+
+- Use `LocalCUDACluster(CUDA_VISIBLE_DEVICES="0,1,2,3")` to scale across GPUs.
+- Ensure chunks are large enough to amortize scheduling but small enough to fit per-worker VRAM.
+- Combine with RMM pool allocator for speed, or managed memory for capacity (see {doc}`MM`).
+- NVLink: peer-to-peer performance is best with the RMM pool allocator. Managed memory can reduce or prevent effective NVLink use.
+
+## Functions that support Dask
+
+The functions below are implemented to run on Dask‑backed `AnnData` with GPU arrays. Most steps are lazy; reduction steps may synchronize internally. This covers the most common out‑of‑core workflows and will expand over time.
+
+- `pp.calculate_qc_metrics`
+- `pp.normalize_total`
+- `pp.log1p`
+- `pp.highly_variable_genes` (flavors: `cell_ranger`, `seurat_v3`)
+- `pp.scale`
+- `pp.pca`
+- `tl.score_genes`
+- `tl.rank_genes_groups_logreg`
+- `get.aggregate`
+
+## Troubleshooting
+
+- CUDA OOM while running: reduce chunk size, enable RMM managed memory, or filter earlier.
+- VRAM spikes or leaks: set `threads_per_worker=1`; limit task concurrency; consider TCP instead of UCX; restart workers to clear allocator state if needed.
+
+## References
+
+- Dask-CUDA docs: https://docs.rapids.ai/api/dask-cuda/stable/
+- Dask Array: https://docs.dask.org/en/stable/array.html
+- CuPy Sparse: https://docs.cupy.dev/en/stable/reference/sparse.html
diff --git a/docs/basic.md b/docs/basic.md
@@ -1,17 +1,15 @@
 # Welcome to the rapids-singlecell documentation
 
-rapids-singlecell offers enhanced single-cell data analysis as a near drop-in replacement predominantly for scanpy, while also incorporating select functionalities from squidpy and decoupler. Utilizing GPU computing with CuPy and NVIDIA’s RAPIDS, it emphasizes high computational efficiency. As part of the scverse ecosystem, rapids-singlecell continuously aims to maintain compatibility, adapting and growing through community collaboration.
+rapids-singlecell provides GPU-accelerated single-cell analysis with an AnnData-first API. It is largely compatible with Scanpy and includes selected functionality from Squidpy and decoupler. Computations use CuPy and NVIDIA RAPIDS for performance on large datasets.
 
-* **Broad GPU Optimization:** Facilitates accelerated processing of large datasets, with GPU-enabled AnnData objects.
-* **Selective scverse Library Integration:** Incorporates key functionalities from scanpy, with additional features from squidpy and decoupler.
-* **Easy Installation Process:** Available via Conda and PyPI, with detailed setup guidelines.
-* **Accessible Documentation:** Provides comprehensive guides and examples tailored for efficient application.
+- **GPU acceleration**: Common single-cell workflows on `AnnData` run on the GPU.
+- **Ecosystem compatibility**: Works with Scanpy APIs; includes pieces from Squidpy and decoupler.
+- **Simple installation**: Available via Conda and PyPI.
 
-Our commitment with rapids-singlecell is to deliver a powerful, user-centric tool that significantly enhances single-cell data analysis capabilities in bioinformatics.
 
 [//]: # (numfocus-fiscal-sponsor-attribution)
 
-Rapids-singlecell is part of the scverse® project ([website](https://scverse.org), [governance](https://scverse.org/about/roles)) and is fiscally sponsored by [NumFOCUS](https://numfocus.org/).
+rapids-singlecell is part of the scverse® project ([website](https://scverse.org), [governance](https://scverse.org/about/roles)) and is fiscally sponsored by [NumFOCUS](https://numfocus.org/).
 If you like scverse® and want to support our mission, please consider making a tax-deductible [donation](https://numfocus.org/donate-to-scverse) to help the project pay for developer time, professional services, travel, workshops, and a variety of other needs.
 
 <div align="center">
@@ -23,8 +21,47 @@ If you like scverse® and want to support our mission, please consider making a
 </a>
 </div>
 
+```{div}
+:style: height: 0.5rem
+```
+
+::::{grid} 1 2 2 2
+:gutter: 2
+
+:::{grid-item-card} Installation {octicon}`plug;1em;`
+:link: Installation
+:link-type: doc
+
+New to *rapids-singlecell*? Check out the installation guide.
+:::
+
+:::{grid-item-card} Tutorials {octicon}`play;1em;`
+:link: notebooks
+:link-type: doc
+
+The tutorials walk you through real-world applications of rapids-singlecell.
+:::
+
+:::{grid-item-card} API reference {octicon}`book;1em;`
+:link: api/index
+:link-type: doc
+
+The API reference contains a detailed description of
+the rapids-singlecell API.
+:::
+
+:::{grid-item-card} GitHub {octicon}`mark-github;1em;`
+:link: https://github.com/scverse/rapids_singlecell
+
+Find a bug? Interested in improving rapids-singlecell? Check out our GitHub for the latest developments.
+:::
+::::
+
+
 ## News
 
+* 28.07.25 *rapids-singlecell* is now an [scverse® core package](https://scverse.org/blog/2025-core-expansion/)
+* 01.07.25 *rapids-singlecell* was highlighted in another NVIDIA [technical blog post](https://developer.nvidia.com/blog/driving-toward-billion-cell-analysis-and-biological-breakthroughs-with-rapids-singlecell/)
 * 07.08.23 *rapids-singlecell* is now part of scverse® ecosystem.
 * 04.08.23 Thanks to the great team at [scverse](https://www.scverse.org)® *rapids-singlecell* is now automatically tested with CI
 * 27.06.23 I'm very honored to announce that I was invited to co-author a [technical blog post](https://developer.nvidia.com/blog/gpu-accelerated-single-cell-rna-analysis-with-rapids-singlecell/) that demonstrates the capabilities and performance of *rapids-singlecell* for NVIDIA.
diff --git a/docs/conf.py b/docs/conf.py
@@ -41,6 +41,7 @@
 # They can be extensions coming with Sphinx (named 'sphinx.ext.*') or your custom ones.
 extensions = [
     "myst_nb",
+    "sphinx_design",
     "sphinx.ext.autodoc",
     "sphinx.ext.doctest",
     "sphinx.ext.coverage",

diff --git a/docs/index.md b/docs/index.md
@@ -23,6 +23,7 @@ Installation.md
 Usage_Principles.md
 api/index.md
 MM.md
+Out_of_core.md
 release-notes/index.md
 references.md
 notebooks.rst

diff --git a/docs/release-notes/0.13.1.md b/docs/release-notes/0.13.1.md
@@ -1,4 +1,4 @@
-### 0.13.1 {small}`the-future`
+### 0.13.1 {small}`2025-08-13`
 
 ```{rubric} Features
 ```
@@ -18,3 +18,4 @@
 ```
 * refactors `testing_utils` {pr}`427` {smaller}`S Dicks`
 * updates to `RAPIDS-25.08` {pr}`428` {smaller}`S Dicks`
+* expands documentation on out-of-core workloads and Dask {pr}`432` {smaller}`S Dicks`
diff --git a/pyproject.toml b/pyproject.toml
@@ -34,6 +34,7 @@ doc = [
     "sphinx-copybutton",
     "nbsphinx>=0.8.12",
     "myst-nb",
+    "sphinx-design",
     "scanpydoc[typehints,theme]>=0.9.4",
     "readthedocs-sphinx-ext",
     "sphinx_copybutton",