## Quickstart
### How do I monitor the {doc}`dask dashboard <dask:dashboard>`?
If you are in a jupyter notebook, when you render the `repr` of your `client`, you will see a link, usually something like `http://localhost:8787/status`.
If you are working locally, this link alone should suffice.
If you are in vscode, there is a [`dask` extension] which allows you to monitor the dashboard from within the editor.
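
If you are not sure where to find the link, a minimal sketch is to start a local client and print its dashboard address (the worker counts here are illustrative, not a recommendation):

```python
from dask.distributed import Client

# Start a small local cluster; with no arguments, sensible defaults are chosen.
client = Client(n_workers=4, threads_per_worker=2)

# The dashboard address, usually something like http://localhost:8787/status
print(client.dashboard_link)
```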
### How do I know how to allocate resources?
In `dask`, every worker will receive an equal share of the memory available.
So if you request, e.g., a slurm job with 256 GB of RAM and then start 8 workers, each will have 32 GB of memory.
So if you have dense chunks of `(30_000, 30_000)` with 32-bit integers, each worker will need at least 3.6 GB just to be able to load the data.
If you then do something like matrix multiplication, you will need double that amount or even more.
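
As a rough sketch of how this plays out (the numbers mirror the example above and are hypothetical, not a recommendation):

```python
from dask.distributed import Client, LocalCluster

# 256 GB split evenly across 8 workers, i.e. 32 GB per worker.
cluster = LocalCluster(n_workers=8, memory_limit="32GB")
client = Client(cluster)

# Rough per-chunk memory for a dense (30_000, 30_000) chunk of 32-bit integers,
# before any intermediates from downstream operations:
chunk_bytes = 30_000 * 30_000 * 4  # ~3.6 GB
```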
### How do I read my data into a `dask` array?
{func}`anndata.experimental.read_elem_lazy` or {func}`anndata.experimental.read_lazy` can help you if you already have data on-disk that was written to the `anndata` file format.
If you use {func}`dask.array.to_zarr`, the data _cannot_ be read in using `anndata`'s functionality, as `anndata` will look for its {doc}`specified file format metadata <anndata:fileformat-prose>`.
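
For instance, a minimal, hypothetical sketch for an `anndata`-written zarr store (the path and key are made up, and both functions are experimental, so check the current API):

```python
import zarr
import anndata.experimental as ade

# Lazily load the whole AnnData object, backed by dask/lazy arrays.
adata = ade.read_lazy("data.zarr")

# Or pull a single element (e.g. X) out of the store as a lazy array.
group = zarr.open("data.zarr", mode="r")
X = ade.read_elem_lazy(group["X"])
```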
## Advanced use and how to contribute
### How do `scanpy` and `anndata` handle sparse matrices?
While there is some {class}`scipy.sparse.csr_matrix` and {class}`scipy.sparse.csc_matrix` support for `dask`, it is not comprehensive and is missing key functions like summation and mean.
We have implemented custom functionality, much of which lives in {mod}`fast_array_utils`, although we have also had to implement custom algorithms like `pca` for sparse-in-dask.
Therefore, if you run into a puzzling error after trying to run a function like {func}`numpy.sum` (or similar) on a sparse-in-dask array, consider checking {mod}`fast_array_utils`.
If you need to implement the function yourself, see the next point.
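
As an illustration, here is a minimal, hypothetical sketch of building a sparse-in-dask array and summing over it; the exact `fast_array_utils.stats.sum` entry point is assumed here, so check the {mod}`fast_array_utils` docs for the current API:

```python
import numpy as np
import scipy.sparse as sp
import dask.array as da
from fast_array_utils import stats  # assumed import path

# Build a small "sparse-in-dask" array: each chunk is a scipy CSR matrix.
X_dense = da.random.random((4_000, 200), chunks=(1_000, 200)).astype(np.float32)
X = X_dense.map_blocks(sp.csr_matrix, meta=sp.csr_matrix((0, 0), dtype=np.float32))

# Calling numpy reductions directly on sparse chunks can fail in puzzling ways;
# a sparse-aware reduction works instead (still lazy until you call .compute()).
gene_totals = stats.sum(X, axis=0)
```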
### Custom block-wise array operations
Sometimes you may want to do an operation on an array that is not implemented anywhere.
Generally, we have found {func}`dask.array.map_blocks` to be versatile enough that most operations can be expressed with it.
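
As a minimal, hypothetical sketch of the pattern (the per-block operation here is a toy one):

```python
import numpy as np
import dask.array as da

def clip_log1p(block: np.ndarray) -> np.ndarray:
    """A toy operation applied independently to each chunk."""
    return np.log1p(np.clip(block, 0, 10))

X = da.random.random((10_000, 2_000), chunks=(2_500, 2_000))
Y = X.map_blocks(clip_log1p, dtype=X.dtype)  # same shape and chunking as X
result = Y.compute()  # nothing runs until you compute (or persist)
```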
## FAQ
### What is `persist` for in RSC notebooks?
{meth}`dask.array.Array.persist` appears throughout the {doc}`multi-gpu showcase notebook for rapids-singlecell <rapids-singlecell:notebooks/06-multi_gpu_show>`.
This loads the entire dataset into memory while keeping the representation as a dask array.
Thus, lazy computation still works, but the data only needs to be read into memory once.
The catch is that you need enough memory to hold the entire dataset for `persist` to work.
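
A minimal sketch of the idea, with made-up random data standing in for a real on-disk dataset:

```python
import dask.array as da

# Hypothetical data; in practice this would come from e.g. read_elem_lazy.
X = da.random.random((50_000, 2_000), chunks=(5_000, 2_000))

# Materialize all chunks in (distributed) memory while keeping the dask interface.
X = X.persist()

# Later operations reuse the in-memory chunks instead of re-reading or recomputing them.
per_gene_mean = X.mean(axis=0).compute()
```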
### I'm out of memory, what now?
You can always reduce the number of workers you use, which will cause more memory to be allocated per worker.
Some algorithms are limited by needing to load all of the data onto a single node; see {issue}`dask/dask-ml#985` for an example.
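
For example, a hypothetical sketch on the same 256 GB node as above, using fewer workers so that each one gets a larger share:

```python
from dask.distributed import Client, LocalCluster

# 4 workers instead of 8: each worker now gets roughly 64 GB instead of 32 GB.
cluster = LocalCluster(n_workers=4, memory_limit="64GB")
client = Client(cluster)
```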
### How do I choose chunk sizes?
Have a look at the {doc}`dask docs for chunking <dask:array-chunks>`; the general rule of thumb there is to use larger chunks in memory than on disk.
In practice, it is probably a good idea to use the largest in-memory chunk size that your memory limits (and the algorithms you use) allow, in order to make the most of any thread-level parallelization within algorithms.
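
As a small, hypothetical sketch (the store path, component name, and chunk sizes are made up):

```python
import dask.array as da

# A store written with small on-disk chunks, e.g. (1_000, 2_000).
X = da.from_zarr("data.zarr", component="X")

# Use larger chunks in memory than on disk, within your memory budget.
X = X.rechunk((10_000, X.shape[1]))
```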