Merged
26 commits
aa93b8b
updates zarr-parser to use obstore list_async instead of concurrent_map
norlandrhagen Feb 26, 2026
37dff68
removes the zarr vendor code
norlandrhagen Feb 26, 2026
2fa25a7
adds arro3-core to zarr group
norlandrhagen Feb 26, 2026
626d0b9
adds _from_arrow method
norlandrhagen Feb 27, 2026
9d6a312
adds type_checking for pa type hint + import in _from_arrow
norlandrhagen Feb 27, 2026
bab147d
extra import removed
norlandrhagen Feb 27, 2026
17e35cc
adds zarr to test-py31* test group
norlandrhagen Feb 27, 2026
6cbb7c0
Update virtualizarr/manifests/manifest.py
norlandrhagen Feb 27, 2026
b400a34
updates _from_arrow method to have paths, offsets, lengths and opt[sh…
norlandrhagen Feb 27, 2026
19122a7
merge w/ main
norlandrhagen Mar 6, 2026
e22981f
update releases.md
norlandrhagen Mar 6, 2026
fda8ce6
mypy
norlandrhagen Mar 6, 2026
bbd6a1f
mypy-2
norlandrhagen Mar 6, 2026
9cba9e8
update pyproj
norlandrhagen Mar 6, 2026
f50b724
adds new zarr parser deps and fix to accessor
norlandrhagen Mar 6, 2026
1be91cc
Merge branch 'kerchunk_parquet_writer_pyarrow_fx' into zarr-parser-ob…
norlandrhagen Mar 6, 2026
4ed8295
fix double pyproj def
norlandrhagen Mar 6, 2026
9114613
adds requires pyarrow decorator to the test_zarr so mins deps are ok
norlandrhagen Mar 6, 2026
31c8ed0
add strange pyarrow pandas context override to more test_kerchunk.py …
norlandrhagen Mar 6, 2026
e0ddfc2
mypy again
norlandrhagen Mar 6, 2026
d96d5c5
incorporate feedback
norlandrhagen Mar 6, 2026
716a0bb
removed separator normalization and added a method to get chunk seper…
norlandrhagen Mar 9, 2026
7e76088
Merge branch 'main' into zarr-parser-obstore-list
norlandrhagen Mar 9, 2026
5df7705
de-dup pyproj
norlandrhagen Mar 9, 2026
08232a8
mypy
norlandrhagen Mar 9, 2026
3d7ebfc
Merge branch 'main' into zarr-parser-obstore-list
norlandrhagen Mar 9, 2026
4 changes: 4 additions & 0 deletions docs/releases.md
@@ -41,6 +41,10 @@ This release moves the `ObjectStoreRegistry` to a separate package `obspec_utils

### New Features

- Improved `ZarrParser` performance.
([#892](https://github.com/zarr-developers/VirtualiZarr/pull/892)).
By [Raphael Hagen](https://github.com/norlandrhagen).

- Added `reader_factory` parameter to `HDFParser` to allow customizing how files are read
([#844](https://github.com/zarr-developers/VirtualiZarr/pull/844)).
By [Max Jones](https://github.com/maxrjones).
12 changes: 9 additions & 3 deletions pyproject.toml
@@ -50,6 +50,8 @@ hdf = [
"imagecodecs-numcodecs==2024.6.1",
]

zarr = ["arro3-core", "pyarrow"]
Contributor (kylebarron):
Feel free to disregard, but ideally you shouldn't need to depend on both dependencies. Pyarrow is very large. I looked recently and it looks like it's gotten even larger

[image: pyarrow package size chart]

50MB compressed is huge.

Collaborator (author):

Ah thanks for the feedback @kylebarron. Good to know about pyarrow.


# kerchunk-based parsers
netcdf3 = [
"virtualizarr[remote]",
@@ -76,13 +78,14 @@ all_parsers = [
"virtualizarr[fits]",
"virtualizarr[kerchunk_parquet]",
"virtualizarr[tiff]",
"virtualizarr[zarr]"
]

# writers
icechunk = [
"icechunk>=1.1.2",
]
zarr = ["arro3-core", "pyarrow"]


kerchunk = ["fastparquet", "pandas"]

@@ -203,14 +206,17 @@ run-tests-html-cov = { cmd = "pytest -n auto --run-network-tests --verbose --cov
min-deps = ["dev", "test", "hdf", "hdf5-lib"] # VirtualiZarr/conftest.py using h5py, so the minimum set of dependencies for testing still includes hdf libs
# Inherit from min-deps to get all the test commands, along with optional dependencies
test = ["dev", "test", "remote", "hdf", "netcdf3", "fits", "icechunk", "kerchunk", "kerchunk_parquet", "hdf5-lib", "tiff", "py313"]
test-py311 = ["dev", "test", "remote", "hdf", "netcdf3", "fits", "icechunk", "kerchunk", "kerchunk_parquet", "hdf5-lib", "tiff", "py311"] # test against python 3.11
test-py312 = ["dev", "test", "remote", "hdf", "netcdf3", "fits", "icechunk", "kerchunk", "kerchunk_parquet", "hdf5-lib", "tiff", "py312"] # test against python 3.12
test-py311 = ["dev", "test", "remote", "hdf", "netcdf3", "fits", "icechunk", "kerchunk", "kerchunk_parquet", "hdf5-lib", "tiff", "zarr", "py311"] # test against python 3.11
test-py312 = ["dev", "test", "remote", "hdf", "netcdf3", "fits", "icechunk", "kerchunk", "kerchunk_parquet", "hdf5-lib", "tiff", "zarr", "py312"] # test against python 3.12
minio = ["dev", "remote", "hdf", "netcdf3", "fits", "icechunk", "kerchunk", "hdf5-lib", "tiff", "py312", "minio"]
minimum-versions = ["dev", "test", "remote", "hdf", "netcdf3", "fits", "icechunk", "kerchunk", "kerchunk_parquet", "tiff", "hdf5-lib", "minimum-versions"]
upstream = ["dev", "test", "hdf", "hdf5-lib", "netcdf3", "upstream", "icechunk-dev", "py313"]
all = ["dev", "test", "remote", "hdf", "netcdf3", "fits", "icechunk", "kerchunk", "kerchunk_parquet", "hdf5-lib", "tiff", "all_parsers", "all_writers", "py313"]
docs = ["docs", "dev", "remote", "hdf", "netcdf3", "fits", "icechunk", "kerchunk", "kerchunk_parquet", "hdf5-lib", "tiff", "py313"]

[tool.pixi.dependencies]
pytest = "*"

# Define commands to run within the docs environment
[tool.pixi.feature.docs.tasks]
serve-docs = { cmd = "mkdocs serve" }
56 changes: 55 additions & 1 deletion virtualizarr/manifests/manifest.py
@@ -1,3 +1,5 @@
from __future__ import annotations

import re
from collections.abc import (
Callable,
@@ -8,13 +10,16 @@
ValuesView,
)
from pathlib import PosixPath
from typing import Any, NewType, TypedDict, cast
from typing import TYPE_CHECKING, Any, NewType, TypedDict, cast

import numpy as np

from virtualizarr.manifests.utils import construct_chunk_pattern, parse_manifest_index
from virtualizarr.types import ChunkKey

if TYPE_CHECKING:
import pyarrow as pa # type: ignore[import-untyped,import-not-found]
Contributor (kylebarron), Mar 11, 2026:
I'd strongly suggest not tying this to pyarrow

1. You should very easily be able to make your code generic and not tied to pyarrow.
2. pyarrow doesn't have any internal type checking, so typing as `pa.StringArray` or `pa.UInt64Array` means absolutely nothing to the user (it might mean something to the developer).


# doesn't guarantee that writers actually handle these
VALID_URI_PREFIXES = {
"s3://",
@@ -322,6 +327,55 @@ def from_arrays(

return obj

@classmethod
def _from_arrow(
cls,
*,
paths: "pa.StringArray",
offsets: "pa.UInt64Array",
lengths: "pa.UInt64Array",
Contributor:
I'd suggest typing these as ArrowArrayExportable, and then using an arrow library of choice to import the data, such as passing input to pyarrow.array or arro3.core.Array.from_arrow().

Then this API will automatically support any arrow input, including polars, duckdb, arro3, etc apache/arrow#39195 (comment)

shape: tuple[int, ...],
) -> "ChunkManifest":
"""
Create a ChunkManifest from flat 1D PyArrow arrays.

Avoids intermediate Python dicts by converting Arrow arrays directly
to the numpy arrays used internally by ChunkManifest.

Parameters
----------
paths
Full paths to chunks, as a PyArrow StringArray. Nulls represent missing chunks.
offsets
Byte offsets of chunks, as a PyArrow UInt64Array. Nulls represent missing chunks.
lengths
Byte lengths of chunks, as a PyArrow UInt64Array. Nulls represent missing chunks.
shape
Shape to reshape the flat arrays into.
"""
import pyarrow as pa # type: ignore[import-untyped,import-not-found]
import pyarrow.compute as pc # type: ignore[import-untyped,import-not-found]

arrow_paths = pc.if_else(pc.is_null(paths), "", paths)
arrow_offsets = pc.if_else(
pc.is_null(offsets), pa.scalar(0, pa.uint64()), offsets
)
arrow_lengths = pc.if_else(
pc.is_null(lengths), pa.scalar(0, pa.uint64()), lengths
)
Comment on lines +359 to +365
Contributor:
Requiring a pyarrow dependency just for these three lines is not worth it IMO. Much better to just document that the users must remove any null values before passing in arguments.

Contributor:
And then you can probably use arro3-core for all your needs and save the big pyarrow dependency.
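If the `pyarrow.compute` null-filling were dropped as suggested, the equivalent masking could happen on the numpy side instead (a sketch under the assumption that the caller can supply a boolean null mask; `fill_missing` is an illustrative name, not part of the codebase):

```python
import numpy as np


def fill_missing(paths, offsets, lengths, null_mask):
    # Entries flagged in null_mask become the sentinel values ChunkManifest
    # uses for missing chunks: empty path, zero offset, zero length.
    paths = paths.copy()
    offsets = offsets.copy()
    lengths = lengths.copy()
    paths[null_mask] = ""
    offsets[null_mask] = 0
    lengths[null_mask] = 0
    return paths, offsets, lengths
```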


np_paths = arrow_paths.to_numpy(zero_copy_only=False).astype(
np.dtypes.StringDType()
)
np_offsets = arrow_offsets.to_numpy(zero_copy_only=False)
np_lengths = arrow_lengths.to_numpy(zero_copy_only=False)

return cls.from_arrays(
paths=np_paths.reshape(shape),
offsets=np_offsets.reshape(shape),
lengths=np_lengths.reshape(shape),
)
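As a standalone illustration of the final flat-to-grid step (hypothetical chunk values, assuming C-order flattening of the chunk grid; the real method passes the reshaped arrays to `ChunkManifest.from_arrays`):

```python
import numpy as np

# Flat 1D arrays, one entry per chunk key; "" marks a missing chunk.
paths = np.array(["s3://b/a/0.0", "", "s3://b/a/1.0", "s3://b/a/1.1"])
offsets = np.array([0, 0, 100, 200], dtype=np.uint64)
lengths = np.array([100, 0, 100, 100], dtype=np.uint64)

shape = (2, 2)  # chunk-grid shape
grid_paths = paths.reshape(shape)
# Chunk index (1, 1) now indexes directly into the grid.
```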

@property
def ndim_chunk_grid(self) -> int:
"""