* Add ProcessPoolExecutor to check if it fails in the tests
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* strict xfail the ProcessPoolExecutor
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add warnings, recommendations and references
* Remove ProcessPoolExec from main text
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
File changed: docs/scaling.md (10 additions & 3 deletions)
@@ -97,12 +97,15 @@ VirtualiZarr comes with a small selection of executors you can choose from when
The simplest executor is the [`SerialExecutor`][virtualizarr.parallel.SerialExecutor], which executes all the [`open_virtual_dataset`][virtualizarr.open_virtual_dataset] calls in serial, not in parallel.
It is the default executor.
-### Threads or Processes
+### Threads
-One way to parallelize creating virtual references from a single machine is to across multiple threads or processes.
-For this you can use the [`ThreadPoolExecutor`][concurrent.futures.ThreadPoolExecutor]or [`ProcessPoolExecutor`][concurrent.futures.ProcessPoolExecutor]class from the [`concurrent.futures`][] module in the python standard library.
+One way to parallelize creating virtual references from a single machine is to use multiple threads.
+For this you can use the [`ThreadPoolExecutor`][concurrent.futures.ThreadPoolExecutor] class from the [`concurrent.futures`][] module in the python standard library.
You simply pass the executor class directly via the `parallel` kwarg to [`open_virtual_mfdataset`][virtualizarr.open_virtual_mfdataset].
+!!! note
+    We are also working on adding support for [`ProcessPoolExecutor`][concurrent.futures.ProcessPoolExecutor], see [PR #889](https://github.com/zarr-developers/VirtualiZarr/pull/889).
This can work well when virtualizing files in remote object storage because it parallelizes the issuing of HTTP GET requests for each file.
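As a rough illustration of the pattern described above, the following self-contained sketch shows a function that accepts an executor class through a `parallel` argument and uses it to map a per-file function over a list of URLs. `fake_open` and `open_many` are hypothetical stand-ins, not VirtualiZarr functions, and the URLs are made up. The real `open_virtual_mfdataset` may need further arguments that this diff does not show.

```python
from concurrent.futures import Executor, ThreadPoolExecutor


def fake_open(url: str) -> str:
    # Hypothetical stand-in for a single per-file open call.
    return f"virtual:{url}"


def open_many(urls: list[str], parallel: type[Executor]) -> list[str]:
    # Hypothetical stand-in for a multi-file opener: it receives the
    # executor *class*, instantiates it, and maps the per-file function.
    with parallel() as executor:
        # Executor.map yields results in the same order as the inputs.
        return list(executor.map(fake_open, urls))


# Made-up URLs, for illustration only.
urls = ["s3://bucket/a.nc", "s3://bucket/b.nc", "s3://bucket/c.nc"]
results = open_many(urls, parallel=ThreadPoolExecutor)
```

Passing the class rather than an instance leaves the callee in charge of creating and shutting down the pool, which is the shape the text describes for the `parallel` kwarg.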
+!!! warning
+    Some file parsers, such as the [`HDFParser`][virtualizarr.parsers.HDFParser], rely on C libraries (e.g. HDF5) that hold a process-level lock, which means `ThreadPoolExecutor` will effectively run in serial despite using multiple threads.
+    If you need true parallelism with such parsers, consider using `parallel='lithops'` or `parallel='dask'` instead. If no lithops config file is present (see the [Lithops](#lithops) section), lithops will default to using the [localhost executor](https://lithops-cloud.github.io/docs/source/api_futures.html#lithops.executors.LocalhostExecutor) on the current host, which spawns separate processes that bypass the GIL limitation. These are currently your best options when the file parser is not thread-safe.
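To make the warning above concrete, here is a minimal sketch, using only the standard library, of why a process-wide lock makes a thread pool behave serially. `library_lock` and `parse_with_locked_library` are hypothetical stand-ins for a C library's internal lock and a parser call. Nothing here touches HDF5 or VirtualiZarr.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a C library's process-level lock.
library_lock = threading.Lock()

# Bookkeeping to record how many threads are inside the "library" at once.
state_lock = threading.Lock()
active = 0
peak_active = 0


def parse_with_locked_library(path: str) -> str:
    # Hypothetical parser call: all real work happens under the shared lock.
    global active, peak_active
    with library_lock:
        with state_lock:
            active += 1
            peak_active = max(peak_active, active)
        # ... the actual parsing work would happen here ...
        with state_lock:
            active -= 1
    return path


# Made-up file names, for illustration only.
paths = [f"file_{i}.h5" for i in range(8)]
with ThreadPoolExecutor(max_workers=4) as executor:
    parsed = list(executor.map(parse_with_locked_library, paths))
```

`peak_active` stays at 1 whatever `max_workers` is set to, because only one thread can hold the lock at a time. Separate processes each hold their own copy of such a lock, which is why process-based executors avoid the bottleneck.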
+
### Dask Delayed
You can parallelize using `dask.delayed` automatically by passing `parallel='dask'`.