Commit 785de91

Correct Scaling docs. (#890)
* Add ProcessPoolExecutor to check if it fails in the tests
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* strict xfail the ProcessPoolExecutor
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Add warnings, recommendations and references
* Remove ProcessPoolExec from main text

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent ba1f464 commit 785de91
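
The warning this commit adds to `docs/scaling.md` says that a parser relying on a C library with a process-level lock will effectively run in serial under `ThreadPoolExecutor`. A minimal sketch of that effect follows, using only the standard library. The `threading.Lock` named `LIBRARY_LOCK` and the two `parse_*` functions are hypothetical stand-ins for illustration; this is not VirtualiZarr or HDF5 code.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a C library's process-level lock (not HDF5 itself).
LIBRARY_LOCK = threading.Lock()


def parse_with_global_lock(delay: float) -> None:
    """Simulate a parser that does all its work while holding a shared lock."""
    with LIBRARY_LOCK:
        time.sleep(delay)


def parse_without_lock(delay: float) -> None:
    """Simulate I/O-bound work (e.g. an HTTP GET) that needs no shared lock."""
    time.sleep(delay)


def timed_run(func, n_tasks: int = 4, delay: float = 0.1) -> float:
    """Run n_tasks copies of func on a thread pool and return elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_tasks) as pool:
        # Consume the iterator so any exception in a worker is raised here.
        list(pool.map(func, [delay] * n_tasks))
    return time.perf_counter() - start


locked_seconds = timed_run(parse_with_global_lock)
unlocked_seconds = timed_run(parse_without_lock)
print(f"with lock: {locked_seconds:.2f}s, without lock: {unlocked_seconds:.2f}s")
```

With the shared lock, the four tasks queue behind one another, so the elapsed time grows with the number of tasks; without it, the sleeps overlap. Real parsers will differ in how much of their work happens under the lock.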

2 files changed

Lines changed: 18 additions & 4 deletions

docs/scaling.md

Lines changed: 10 additions & 3 deletions
@@ -97,12 +97,15 @@ VirtualiZarr comes with a small selection of executors you can choose from when
 The simplest executor is the [`SerialExecutor`][virtualizarr.parallel.SerialExecutor], which executes all the [`open_virtual_dataset`][virtualizarr.open_virtual_dataset] calls in serial, not in parallel.
 It is the default executor.

-### Threads or Processes
+### Threads

-One way to parallelize creating virtual references from a single machine is to across multiple threads or processes.
-For this you can use the [`ThreadPoolExecutor`][concurrent.futures.ThreadPoolExecutor] or [`ProcessPoolExecutor`][concurrent.futures.ProcessPoolExecutor] class from the [`concurrent.futures`][] module in the python standard library.
+One way to parallelize creating virtual references from a single machine is to use multiple threads.
+For this you can use the [`ThreadPoolExecutor`][concurrent.futures.ThreadPoolExecutor] class from the [`concurrent.futures`][] module in the python standard library.
 You simply pass the executor class directly via the `parallel` kwarg to [`open_virtual_mfdataset`][virtualizarr.open_virtual_mfdataset].

+!!! note
+    We are also working on adding support for [`ProcessPoolExecutor`][concurrent.futures.ProcessPoolExecutor], see [PR #889](https://github.com/zarr-developers/VirtualiZarr/pull/889).
+
 ```python
 from concurrent.futures import ThreadPoolExecutor

@@ -111,6 +114,10 @@ combined_vds = vz.open_virtual_mfdataset(urls, registry=registry, parallel=Threa

 This can work well when virtualizing files in remote object storage because it parallelizes the issuing of HTTP GET requests for each file.

+!!! warning
+    Some file parsers, such as the [`HDFParser`][virtualizarr.parsers.HDFParser], rely on C libraries (e.g. HDF5) that hold a process-level lock, which means `ThreadPoolExecutor` will effectively run in serial despite using multiple threads.
+    If you need true parallelism with such parsers, consider using `parallel='lithops'` or `parallel='dask'` instead. If no lithops config file is present (see the [Lithops](#lithops) section), lithops will default to using the [localhost executor](https://lithops-cloud.github.io/docs/source/api_futures.html#lithops.executors.LocalhostExecutor) on the current host, which spawns separate processes that bypass the GIL limitation. These are currently your best options when the file parser is not thread-safe.
+
 ### Dask Delayed

 You can parallelize using `dask.delayed` automatically by passing `parallel='dask'`.

virtualizarr/tests/test_xarray.py

Lines changed: 8 additions & 1 deletion
@@ -1,6 +1,6 @@
 import functools
 from collections.abc import Mapping
-from concurrent.futures import ThreadPoolExecutor
+from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
 from pathlib import Path
 from typing import Callable

@@ -891,6 +891,13 @@ def test_invalid_parallel_kwarg(
     [
         False,
         ThreadPoolExecutor,
+        pytest.param(
+            ProcessPoolExecutor,
+            marks=pytest.mark.xfail(
+                reason="See https://github.com/zarr-developers/VirtualiZarr/pull/889",
+                strict=True,
+            ),
+        ),
         pytest.param("dask", marks=requires_dask),
         pytest.param("lithops", marks=requires_lithops),
     ],
