Commit 8c6c9bc

Authored by ilan-gold, pre-commit-ci[bot], flying-sheep, and Intron7

(feat): add dask documentation (#216)

* (feat): dask docs
* (chore): ad chunk size docs
* (chore): note about sparse
* (chore): remove unnecessary link + formatting
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* (fix): remove array api from intersphinx mapping
* (fix): remove warning
* (fix): increase cache number
* (fix): try busting cache back
* (chore): small clean ups
* remove duplication
* Replace fake headers with real headers
* fix link
* fix typos
* Ilan’s remaining hatch.toml change
* (feat): more tutorial info
* Update docs/how-to-dask.md (co-authored by Severin Dicks)
* Update docs/how-to-dask.md (co-authored by Severin Dicks)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Philipp A. <flying-sheep@web.de>
Co-authored-by: Severin Dicks <37635888+Intron7@users.noreply.github.com>

1 parent 39a14f1 · commit 8c6c9bc

5 files changed: 114 additions & 3 deletions

.github/workflows/execute-nbs.yaml (1 addition, 1 deletion)

```diff
@@ -48,7 +48,7 @@ jobs:
           activate-environment: tutorials
           channel-priority: flexible
           environment-file: environment.yml
-          miniforge-variant: Mambaforge
+          miniforge-variant: Miniforge3
           miniforge-version: latest
           use-mamba: true
       # some important packages are not available as .tar.bz2 anymore
```

docs/conf.py (6 additions, 1 deletion)

```diff
@@ -53,6 +53,7 @@
     "sphinx.ext.intersphinx",
     "sphinx.ext.autosummary",
     "sphinx.ext.napoleon",
+    "sphinx_issues",
     "sphinxcontrib.bibtex",
     "sphinx_autodoc_typehints",
     "sphinx.ext.mathjax",
@@ -88,9 +89,13 @@
 
 intersphinx_mapping = {
     "python": ("https://docs.python.org/3", None),
-    "anndata": ("https://anndata.readthedocs.io/en/stable/", None),
+    "anndata": ("https://anndata.readthedocs.io/en/latest/", None),  # TODO: change back to stable after 0.12 release
     "numpy": ("https://numpy.org/doc/stable/", None),
     "scanpy": ("https://scanpy.readthedocs.io/en/stable/", None),
+    "fast-array-utils": ("https://icb-fast-array-utils.readthedocs-hosted.com/en/stable", None),
+    "dask": ("https://docs.dask.org/en/stable", None),
+    "scipy": ("https://docs.scipy.org/doc/scipy", None),
+    "rapids-singlecell": ("https://rapids-singlecell.readthedocs.io/en/stable/", None),
 }
 
 # List of patterns, relative to source directory, that match files and
```

docs/how-to-dask.md (new file, 102 additions)
# Dask Q&A

Here we will go through some common questions and answers about `dask`, with a special focus on its integration with `scanpy` and `anndata`. For more comprehensive tutorials, or other topics like {doc}`launching a cluster <dask:deploying>`, head over to their documentation.

## Quickstart

### How do I monitor the {doc}`dask dashboard <dask:dashboard>`?

If you are in a Jupyter notebook, rendering the `repr` of your `client` will show a link, usually something like `http://localhost:8787/status`.
If you are working locally, this link alone should suffice.

If you are working on some sort of remote notebook from a web browser, you will need to replace `http://localhost` with the root URL of the notebook.

If you are in VS Code, there is a [`dask` extension] that lets you monitor the dashboard there.

### How do I know how to allocate resources?

In `dask`, every worker receives an equal share of the available memory.
So if you request, e.g., a slurm job with 256 GB of RAM and then start 8 workers, each worker will have 32 GB of memory.

`dask` generally distributes jobs to workers based on the chunking of the array.
So if you have dense chunks of `(30_000, 30_000)` with 32-bit integers, each worker will need at least 3.6 GB just to load the data.
An operation like matrix multiplication can then require double that, or even more.
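The arithmetic behind that estimate is simple; the shapes and dtype below just mirror the example above:

```python
# Per-chunk memory for a dense (30_000, 30_000) array of 32-bit integers.
rows, cols = 30_000, 30_000
itemsize = 4  # bytes per 32-bit integer

chunk_bytes = rows * cols * itemsize
chunk_gb = chunk_bytes / 1e9
print(chunk_gb)  # → 3.6, the floor each worker needs just to hold one chunk
```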

### How do I read my data into a `dask` array?

{func}`anndata.experimental.read_elem_lazy` or {func}`anndata.experimental.read_lazy` can help if you already have data on disk that was written in the `anndata` file format.
If you use {func}`dask.array.to_zarr`, the data _cannot_ be read back using `anndata`'s functionality, as `anndata` will look for its {doc}`specified file format metadata <anndata:fileformat-prose>`.

If you need to implement custom I/O, we have generally found that {func}`dask.array.map_blocks` provides a clean way to do so.
See [our custom h5 io code] for an example.

## Advanced use and how to contribute

### How do `scanpy` and `anndata` handle sparse matrices?

While `dask` has some support for {class}`scipy.sparse.csr_matrix` and {class}`scipy.sparse.csc_matrix`, it is not comprehensive and is missing key functions like summation and mean.
We have implemented custom functionality, much of which lives in {mod}`fast_array_utils`, although we have also had to implement custom algorithms like `pca` for sparse-in-dask.
In the future, an [`array-api`]-compatible sparse matrix like [`finch`] would help us considerably, as `dask` supports the [`array-api`].

Therefore, if you run into a puzzling error after trying to run a function like {func}`numpy.sum` (or similar) on a sparse-in-dask array, consider checking {mod}`fast_array_utils`.
If you need to implement the function yourself, see the next point.
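To make the block-wise pattern behind such custom reductions concrete, here is a hedged sketch in plain `scipy`/`numpy` (no `dask`): reduce each row-block separately, then combine the partial results, which is what a chunk-wise implementation does under the hood:

```python
import numpy as np
from scipy import sparse

# A small sparse matrix standing in for one that would be chunked by dask.
x = sparse.random(100, 10, density=0.1, format="csr", random_state=0)

# Reduce each row-block separately, as would happen per chunk ...
blocks = [x[i : i + 25] for i in range(0, 100, 25)]
partial = [np.asarray(b.sum(axis=0)).ravel() for b in blocks]

# ... then combine the per-block results into the full reduction.
col_sums = np.sum(partial, axis=0)
assert np.allclose(col_sums, np.asarray(x.sum(axis=0)).ravel())
```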

### Custom block-wise array operations

Sometimes you may want to perform an operation on an array that is not implemented anywhere.
Generally, we have found {func}`dask.array.map_blocks` to be versatile enough that most operations can be expressed with it. Click the link to see `dask`'s own tutorial on the function.
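As a minimal warm-up (assuming `dask` is installed), `map_blocks` applies a function to each chunk independently:

```python
import dask.array as da

x = da.ones((4, 4), chunks=(2, 2))  # four 2x2 chunks
y = da.map_blocks(lambda block: block * 2, x)  # the lambda runs once per chunk
total = float(y.sum().compute())
print(total)  # → 32.0
```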

Take this (simplified) example of calculating a gram matrix from {func}`scanpy.pp.pca` for sparse-in-dask:

```python
import dask.array as da
import numpy as np

# x: a sparse-in-dask array chunked as (chunk_size, n_vars)


def gram_block(x_part):
    gram_matrix = x_part.T @ x_part
    return gram_matrix[None, ...]


gram_matrix_dask = da.map_blocks(
    gram_block,
    x,
    new_axis=(1,),
    chunks=((1,) * x.blocks.size, (x.shape[1],), (x.shape[1],)),
    meta=np.array([], dtype=x.dtype),
    dtype=x.dtype,
).sum(axis=0)
```

This algorithm goes through the rows `chunk_size` at a time and calculates the gram matrix for each block of rows, producing a collection of `(n_vars, n_vars)`-sized matrices.
These are then summed together to produce a single `(n_vars, n_vars)` matrix: the gram matrix.

Because `dask` does not implement matrix multiplication for sparse-in-dask, we do it ourselves.
We use `map_blocks` over a CSR sparse-in-dask array whose chunking looks something like `(chunk_size, n_vars)`.
When we compute an individual block's gram matrix, we add an axis via `[None, ...]` so that we can sum over that axis, i.e., the `da.map_blocks` call produces an array of size `(n_obs // chunk_size, n_vars, n_vars)`, which is summed over the first dimension.
However, to make this work, we need to be very specific about what the result of `da.map_blocks` should look like, which is done via `new_axis` and `chunks`.
`new_axis` indicates that we are adding a single new axis.
The `chunks` argument specifies that the output of `da.map_blocks` should consist of `x.blocks.size` chunks of shape `(1, n_vars, n_vars)`.
This `chunks` argument thus allows the shape of the output to be inferred.

While this example is a bit complicated, it shows how you can go from a matrix of one shape and chunking to another by operating cleanly over blocks.
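The identity the algorithm relies on, namely that the full gram matrix equals the sum of the per-block gram matrices, can be checked with plain `numpy` (the shapes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((100, 5))

# Per-block gram matrices over row-blocks of 25 rows each ...
blocks = [x[i : i + 25] for i in range(0, 100, 25)]
stacked = np.stack([b.T @ b for b in blocks])  # shape (4, 5, 5)

# ... summed over the block axis give the full gram matrix.
assert np.allclose(stacked.sum(axis=0), x.T @ x)
```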

## FAQ

### What is `persist` used for in RSC notebooks?

In the {doc}`multi-gpu showcase notebook for rapids-singlecell <rapids-singlecell:notebooks/06-multi_gpu_show>`, {meth}`dask.array.Array.persist` appears throughout the notebook.
It loads the entire dataset into memory while keeping the representation as a `dask` array.
Thus, lazy computation still works, but only a single read into memory is needed.
The catch is that you need enough memory to use `persist`, but if you have it, `persist` greatly speeds up the computation.
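In code, the pattern is simply to call `persist` once after loading and reuse the result. A sketch assuming `dask` is installed; with real data the array would come from something like `read_elem_lazy` rather than `da.ones`:

```python
import dask.array as da

x = da.ones((1_000, 10), chunks=(100, 10))  # stand-in for a lazily read dataset

x = x.persist()  # materialize all chunks in memory; x remains a dask array

# Subsequent computations reuse the in-memory chunks instead of re-reading.
col_means = x.mean(axis=0).compute()
row_total = float(x.sum().compute())
```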

### I'm out of memory, what now?

You can always reduce the number of workers, which will give each remaining worker more memory.
Some algorithms may be limited by needing to load all data onto a single node; see {issue}`dask/dask-ml#985` for an example.
92+
93+
### How do I choose chunk sizes?
94+
95+
Have a look at the {doc}`dask docs for chunking <dask:array-chunks>`, however the general rule of thumb there is to use larger chunks in memory than on disk.
96+
In this sense, it is probably a good idea to use the largest chunk size in memory allowable by your memory limits (and the algorithms you use) in order to maximize any thread-level parallelization in algorithms to its fullest.
97+
For sparse data, where the chunks in-memory do not map to those on disk, maxing out the memory available by choosing a large chunk size becomes more imperative.
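One hedged way to turn that rule of thumb into a number; the budget and safety factor below are assumptions for illustration, not recommendations:

```python
# Back-of-the-envelope row-chunk sizing from a per-worker memory budget.
budget_bytes = 8 * 1024**3  # assume 8 GiB available per worker
n_vars = 30_000             # columns of the matrix
itemsize = 4                # bytes per float32 element
headroom = 4                # leave room for intermediates (assumption)

rows_per_chunk = budget_bytes // (n_vars * itemsize * headroom)
print(rows_per_chunk)  # → 17895
```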

[`dask` extension]: https://marketplace.visualstudio.com/items?itemName=joyceerhl.vscode-das
[our custom h5 io code]: https://github.com/scverse/anndata/blob/089ed929393a02200b389395f278b7c920e5bc4a/src/anndata/_io/specs/lazy_methods.py#L179-L205
[`array-api`]: https://data-apis.org/array-api/latest/index.html
[`finch`]: https://github.com/finch-tensor/finch-tensor-python

docs/index.md (1 addition, 1 deletion)

````diff
@@ -12,6 +12,6 @@ notebooks/tutorial_axes_anndata_mudata
 notebooks/scverse_data_backed
 notebooks/scverse_data_interoperability
 notebooks/tutorial_concatenation_anndata_mudata
-
+how-to-dask.md
 references.md
 ```
````

pyproject.toml (4 additions, 0 deletions)

```diff
@@ -33,6 +33,7 @@ registry = [
 docs = [
     "sphinx>=7",
     "sphinx-book-theme>=1.1.0",
+    "sphinx-issues>=5.0.1",
     "myst-nb>=1.1.0",
     "sphinxcontrib-bibtex>=1.0.0",
     "sphinx-autodoc-typehints",
@@ -45,6 +46,7 @@ docs = [
 
 [tool.hatch.envs.default]
 installer = "uv"
+features = ["dev"]
 
 [tool.hatch.envs.registry]
 features = ["registry"]
@@ -60,6 +62,8 @@ extra-dependencies = [
 ]
 [tool.hatch.envs.docs.scripts]
 build = "sphinx-build -M html docs docs/_build {args}"
+open = "python3 -m webbrowser -t docs/_build/html/index.html"
+clean = "git clean -fdX -- {args:docs}"
 
 [tool.hatch.build.targets.wheel]
 bypass-selection = true  # This is not a package
```
