Our `ZarrParser` unfortunately has to list every chunk in the object store (see #850 (comment)). But I think we can make this a lot faster and less memory-intensive.

Currently we use a vendored function from zarr-python
VirtualiZarr/virtualizarr/vendor/zarr/core/common.py
Line 17 in 785de91
and use it to call `store.getsize` concurrently on all the (possible) keys of a zarr array.

VirtualiZarr/virtualizarr/parsers/zarr.py
Line 118 in 785de91
Then the results go into a python dict
VirtualiZarr/virtualizarr/parsers/zarr.py
Line 125 in 785de91
which becomes the chunk manifest for that array.
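
For illustration, here is a minimal sketch of the shape of the current approach described above: one size request per possible chunk key, run concurrently from Python, with the results collected into a dict. It is not the actual VirtualiZarr code. `FakeStore`, `sizes_for_all_keys`, the `limit` parameter, and the zarr v3 style `c/0/1` key layout are all assumptions made for the example.

```python
import asyncio
from itertools import product


class FakeStore:
    """Hypothetical stand-in for a store exposing an async getsize(key)."""

    def __init__(self, sizes):
        self._sizes = sizes

    async def getsize(self, key):
        return self._sizes[key]  # raises KeyError for missing chunks


async def sizes_for_all_keys(store, prefix, grid_shape, limit=10):
    """Call store.getsize on every possible chunk key, at most `limit` at a time."""
    sem = asyncio.Semaphore(limit)

    async def one(idx):
        # Assumed key layout: "<prefix>/c/<i>/<j>/..."
        key = f"{prefix}/c/" + "/".join(map(str, idx))
        async with sem:
            try:
                return idx, key, await store.getsize(key)
            except (KeyError, FileNotFoundError):
                return idx, key, None  # chunk was never written

    all_indices = product(*(range(n) for n in grid_shape))
    results = await asyncio.gather(*(one(idx) for idx in all_indices))

    # One Python dict entry per existing chunk.
    return {
        ".".join(map(str, idx)): {"path": key, "offset": 0, "length": size}
        for idx, key, size in results
        if size is not None
    }


store = FakeStore({"arr/c/0/0": 10, "arr/c/0/1": 20, "arr/c/1/1": 30})
manifest = asyncio.run(sizes_for_all_keys(store, "arr", (2, 2)))
```

Every key costs a separate request and a separate Python object, which is where both the time and the memory go for arrays with many chunks.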
Instead what we could do is:
- use `obstore.list` to do the concurrent loop in rust instead of python
- pass `return_arrow=True` to get a stream of PyArrow RecordBatches back
- construct the python `ChunkManifest` object's numpy arrays[^1] directly from the Arrow arrays, minimizing memory copies (i.e. the opposite of what I did in Pass manifests to icechunk as pyarrow arrays #861)

I haven't benchmarked the current approach but I'm pretty sure this would be waaay faster.
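
The proposal above could be sketched roughly as follows. This is untested against a real object store and rests on several assumptions: that each batch from `obstore.list(..., return_arrow=True)` can be converted with `pyarrow.record_batch` and has `path` and `size` columns, that chunk keys follow the zarr v3 `c/0/1` layout, and that every chunk object is referenced whole (offset 0). `list_chunk_batches` and `manifest_arrays` are hypothetical helper names, not existing VirtualiZarr functions.

```python
import numpy as np


def list_chunk_batches(store, prefix):
    """Yield (paths, sizes) numpy arrays, one pair per Arrow listing batch."""
    # Imported lazily so the numpy-only part below runs without these packages.
    import obstore
    import pyarrow as pa

    for batch in obstore.list(store, prefix=prefix, return_arrow=True):
        batch = pa.record_batch(batch)  # assumed Arrow-compatible batch
        yield (
            batch.column("path").to_numpy(zero_copy_only=False),
            batch.column("size").to_numpy(zero_copy_only=False),
        )


def manifest_arrays(batches, array_prefix, grid_shape):
    """Fill paths/offsets/lengths arrays shaped like the chunk grid."""
    paths = np.full(grid_shape, "", dtype=object)  # object dtype for simplicity
    offsets = np.zeros(grid_shape, dtype=np.uint64)  # whole objects: offset 0
    lengths = np.zeros(grid_shape, dtype=np.uint64)

    stripped = array_prefix.strip("/")
    key_prefix = f"{stripped}/c/" if stripped else "c/"

    for batch_paths, batch_sizes in batches:
        for path, size in zip(batch_paths, batch_sizes):
            if not path.startswith(key_prefix):
                continue  # metadata such as zarr.json, not a chunk
            idx = tuple(int(i) for i in path[len(key_prefix):].split("/"))
            paths[idx] = path
            lengths[idx] = size

    return paths, offsets, lengths


# Hand-written batches standing in for the output of list_chunk_batches.
example_batches = [
    (["arr/zarr.json", "arr/c/0/0", "arr/c/0/1"], [100, 10, 20]),
    (["arr/c/1/1"], [30]),
]
paths, offsets, lengths = manifest_arrays(example_batches, "arr", (2, 2))
```

The listing paths here are relative to the store root, so a real manifest would still need them turned into full URLs. The sketch also keeps a Python loop for parsing chunk keys into grid indices; vectorizing that step would be needed to get the full benefit of staying in Arrow and numpy.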
[^1]: Unfortunately we can't just keep the manifests as arrow arrays the whole way through because of the potential need to concatenate manifests along arbitrary dims.