-
Notifications
You must be signed in to change notification settings - Fork 16
(feat): add dask documentation #216
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from 9 commits
Commits
Show all changes
20 commits
Select commit
Hold shift + click to select a range
6c65320
(feat): dask docs
ilan-gold 4411544
(chore): ad chunk size docs
ilan-gold c54449a
(chore): note about sparse
ilan-gold ac35835
(chore): remove unnecessary link + formatting
ilan-gold 7cc14db
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] 2d8144b
(fix): remove array api from intersphinx mapping
ilan-gold e0c4562
(fix): remove warning
ilan-gold 1accc43
(fix): increase cache number
ilan-gold ec17fc5
(fix): try busting cache back
ilan-gold 6781eb2
(chore): small clean ups
ilan-gold 40e0b0d
remove duplication
flying-sheep f94130d
Replace fake headers with real headers
flying-sheep c3f449f
fix link
flying-sheep 11ab7a3
fix typos
flying-sheep badc862
Ilan’s remaining hatch.toml change
flying-sheep 6e2ee8b
Merge branch 'main' into ig/dask-docs
ilan-gold f1bbe89
(feat): more tutorial info
ilan-gold 7e709d6
Merge branch 'main' into ig/dask-docs
ilan-gold e15814d
Update docs/how-to-dask.md
ilan-gold 7d3804b
Update docs/how-to-dask.md
ilan-gold File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -17,9 +17,6 @@ __pycache__/ | |
| /docs/generated/ | ||
| /docs/_build/ | ||
|
|
||
| # User configuration | ||
| /hatch.toml | ||
|
|
||
| # IDEs | ||
| /.idea/ | ||
| /.vscode/ | ||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,100 @@ | ||
| # Dask Q&A | ||
|
|
||
| Here we will go through some common questions and answers about `dask`, with a special focus on its integration with `scanpy` and `anndata`. | ||
|
|
||
| ## Quickstart | ||
|
|
||
| **How do I monitor the {doc}`dask dashboard <dask:dashboard>`?** | ||
|
|
||
| If you are in a jupyter notebook, when you render the `repr` of your `client`, you will see a link, usually something like `http://localhost:8787/status`. | ||
| If you are working locally, this link alone should suffice. | ||
|
|
||
| If you are working on some sort of remote notebook from a web browser, you will need to replace `http://localhost:8787` by the root url of the notebook. | ||
|
|
||
| If you are in vscode, there is an [`dask` extension] which will allow you to monitor there. | ||
|
|
||
| **How do I know how to allocate resources?** | ||
|
|
||
| In `dask`, every worker will receive an equal share of the memory available. | ||
| So if you request e.g., a slurm job with 256GB of RAM, and then start 8 workers, each will have 32 GB of memory. | ||
|
|
||
| `dask` distributes jobs to each worker generally based on the chunking of the array. | ||
| So if you have dense chunks of `(30_000, 30_000)` with 32 bit integers, you will need to be have 3.6 GB for each worker, at the minimum to even load the data. | ||
| Then if you do something like matrix multiplication, you will need double or even more, as an example. | ||
|
|
||
| **How do I read my data into a `dask` array?** | ||
|
|
||
| {func}`anndata.experimental.read_elem_lazy` or {func}`anndata.experimental.read_lazy` can help you if you already have data on-disk that was written to the `anndata` file format. | ||
| If you use {func}`dask.array.to_zarr`, the data _cannot_ be read in using `anndata`'s functionality as `anndata` will look for its {doc}`specified file format metadata <anndata:fileformat-prose>`. | ||
|
|
||
| If you need to implement custom io, generally we found that using {func}`dask.array.map_blocks` provides a nice way. | ||
| See [our custom h5 io code] for an example. | ||
|
|
||
| ## Advanced use and how-to-contribute | ||
|
|
||
| **How do `scanpy` and `anndata` handle sparse matrices?** | ||
|
|
||
| While there is some {class}`scipy.sparse.csr_matrix` and {class}`scipy.sparse.csc_matrix` support for `dask`, it is not comprehensive and missing key functions like summation, mean etc. | ||
| We have implemented custom functionality, much of which lives in {mod}`fast_array_utils`, although we have also had to implement custom algorithms like `pca` for sparse-in-dask. | ||
| In the future, an [`array-api`] compatible sparse matrix like [`finch`] would help us considerably as `dask` supports the [`array-api`]. | ||
|
|
||
| Therefore, if you run into a puzzling error after trying to run a function like {func}`numpy.sum` (or similar) on a sparse-in-dask array, consider checking {mod}`fast_array_utils`. | ||
| If you need to implement the function yourself, see the next point. | ||
|
|
||
| **Custom block-wise array operations** | ||
|
|
||
| Sometimes you may want to do an operation on a an array that is implemented nowhere. | ||
| Generally, we have found {func}`dask.array.map_blocks` to be versatile enough that most operations can be expressed on it. | ||
| Take this (simplified) example of calculating a gram matrix from {func}`scanpy.pp.pca` for sparse-in-dask: | ||
|
|
||
| ```python | ||
| def gram_block(x_part): | ||
| gram_matrix = x_part.T @ x_part | ||
| return gram_matrix[None, ...] | ||
|
|
||
| gram_matrix_dask = da.map_blocks( | ||
| gram_block, | ||
| x, | ||
| new_axis=(1,), | ||
| chunks=((1,) * x.blocks.size, (x.shape[1],), (x.shape[1],)), | ||
| meta=np.array([], dtype=x.dtype), | ||
| dtype=x.dtype, | ||
| ).sum(axis=0) | ||
| ``` | ||
|
|
||
| This algorithm goes through every `chunk_size` number of rows and calculates the gram matrix for those rows producing a collection of `(n_vars,n_vars)` size matrix. | ||
| These are the summed together to produce a single `(n_vars,n_vars)` matrix, which is the gram amtrix. | ||
|
|
||
| Because `dask` does not implement matrix multiplication for sparse-in-dask, we do it ourselves. | ||
| We use `map_blocks` over a CSR sparse-in-dask array where the chunking looks something like `(chunk_size, n_vars)`. | ||
| When we compute the invdividual block's gram matrix, we add an axis via `[None, ...]` so that we can sum over that axis i.e., the `da.map_blocks` call produces a `(n_obs // chunk_size, n_vars, n_vars)` sized-matrix which is summed over the first dimension. | ||
| However, to make this work, we need to be very specific about how `da.map_blocks` expects its result to look like, done via `new_axis` and `chunks` `new_axis` indicates that we are adding a single new axis at the front. | ||
| The `chunks` argument specifices that the output of `da.map_blocks` should have `x.blocks.size` number of `(1, n_vars, n_vars)` matrixes. | ||
| This `chunks` argument thus allows the inferral of the shape of the output. | ||
|
|
||
| While this example is a bit complicated it shows how you can go from a matrix of one shape and chunking to another by operating in a clean way over blocks. | ||
|
|
||
| ## FAQ | ||
|
|
||
| **What is `persist` for in RSC noteboooks?** | ||
|
|
||
| In the {doc}`multi-gpu showcase notebook for rapids-singlecell <rapids-singlecell:notebooks/06-multi_gpu_show>`, {meth}`dask.array.Array.persist` appears across the notebook. | ||
| This loads the entire dataset into memory while keeping the representation as a dask array. | ||
| Thus, lazy computation still works but only necessitates a single read into memory. | ||
| The catch is that you have enough memory to use `persist`. | ||
|
ilan-gold marked this conversation as resolved.
Outdated
|
||
|
|
||
| **I'm out of memory, what now?** | ||
|
|
||
| You can alawys reduce the number of workers you use, which will cause more memory to be allocated per worker. | ||
| Some algorithms may have limitations with loading all data onto a single node; see {issue}`dask/dask-ml#985` for an example. | ||
|
|
||
| **How do I choose chunk sizes?** | ||
|
|
||
| Have a look at the {doc}`dask docs for chunking <dask:array-chunks>`, however the general rule of thumb there is to use larger chunks in memory than on disk. | ||
| In this sense, it is probably a good idea to use the largest chunk size in memory allowable by your memory limits (and the algorithms you use) in order to maximize any thread-level parallelization in algorithms to its fullest. | ||
| For sparse data, where the chunks in-memory do not map to those on disk, maxing out the memory available by choosing a large chunk size becomes more imperative. | ||
|
|
||
| [`dask` extension]: https://marketplace.visualstudio.com/items?itemName=joyceerhl.vscode-das | ||
| [our custom h5 io code]: https://github.com/scverse/anndata/blob/089ed929393a02200b389395f278b7c920e5bc4a/src/anndata/_io/specs/lazy_methods.py#L179-L20 | ||
| [`array-api`]: https://data-apis.org/array-api/latest/index.html | ||
| [`finch`]: https://github.com/finch-tensor/finch-tensor-python | ||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,9 @@ | ||
| [envs.default] | ||
| installer = "uv" | ||
| features = [ "dev" ] | ||
|
|
||
| [envs.docs] | ||
| features = [ "docs" ] | ||
| scripts.build = "sphinx-build -M html docs docs/_build -W --keep-going {args}" | ||
| scripts.open = "python3 -m webbrowser -t docs/_build/html/index.html" | ||
| scripts.clean = "git clean -fdX -- {args:docs}" |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Intron7 can you review this section specifically? feel free to do more, but this one would be great if you only have a bit of time