Commit 1d4e3cb

ilan-gold and LDeakin authored
(feat): full v2 compat via python fallback (#84)
* chore(deps): bump zarr to 3.0.0rc1
* fmt
* (feat): python fallback
* (fix): dtypes
* (fix): `object` dtypes + `v2` tests
* (fix): `object` dtypes + `v2` tests
* (fix): `object` dtypes + `v2` tests
* (fix): `object` dtypes in rust
* (fix): blosc support
* (refactor): handle `None` fill-value more gracefully
* fix: V2 codec pipeline creation
* fix: zfpy/pcodec metadata handling
* (fix): fall back for unsupported codecs
* (fix): our decode codec pipeline does not support vlen
* (fix): string dtype test to match zarr-python
* (chore): add note
* (fix): ruff
* (fix): rustfmt
* (fix): `pyi`
* (fix): try removing zarr main branch dep
* fix: use upstream implicit fill values
* fix: use upstream metadata handling

  There is a lot of additional logic already taken care of by `zarrs`, like handling multiple versions of codec metadata.
* fix: cleanup fill value handling for string dtype
* Revert "fix: cleanup fill value handling for string dtype"

  This reverts commit 6ff6c2b.
* fix: cleanup fill value handling for string dtype
* fix: fmt and clippy warnings

---------

Co-authored-by: Lachlan Deakin <ljdgit@gmail.com>
1 parent 5558c5e commit 1d4e3cb

14 files changed

Lines changed: 694 additions & 250 deletions
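For context on what this commit enables end to end: per the project README, the Rust pipeline is opted into through `zarr`'s config, and with this change unsupported codecs, dtypes, and indexing now fall back to `zarr-python` transparently instead of erroring. A sketch, assuming `zarrs` and `zarr` v3 are installed:

```python
from zarr import config
import zarrs  # noqa: F401  # makes the ZarrsCodecPipeline importable by zarr

# Route codec work through the Rust pipeline; with this commit, anything
# it cannot handle (v2 codecs like delta/zlib, object dtypes, exotic
# indexing) silently uses zarr-python's BatchedCodecPipeline instead.
config.set({"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"})
```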


Cargo.toml

Lines changed: 1 addition & 0 deletions
```diff
@@ -22,6 +22,7 @@ pyo3-stub-gen = "0.6.2"
 opendal = { version = "0.51.0", features = ["services-http"] }
 tokio = { version = "1.41.1", features = ["rt-multi-thread"] }
 zarrs_opendal = "0.5.0"
+zarrs_metadata = "0.3.3" # require recent zarr-python compatibility fixes (remove with zarrs 0.20)

 [profile.release]
 lto = true
```

README.md

Lines changed: 7 additions & 4 deletions
````diff
@@ -21,7 +21,7 @@ You can then use your `zarr` as normal (with some caveats)!

 ## API

-We export a `ZarrsCodecPipeline` class so that `zarr-python` can use the class but it is not meant to be instantiated and we do not guarantee the stability of its API beyond what is required so that `zarr-python` can use it. Therefore, it is not documented here. We also export two errors, `DiscontiguousArrayError` and `CollapsedDimensionError` that can be thrown in the process of converting to indexers that `zarrs` can understand (see below for more details).
+We export a `ZarrsCodecPipeline` class so that `zarr-python` can use it, but it is not meant to be instantiated directly, and we do not guarantee the stability of its API beyond what `zarr-python` requires. Therefore, it is not documented here.

 At the moment, we only support a subset of the `zarr-python` stores:

@@ -86,7 +86,7 @@ Chunk concurrency is typically favored because:

 ## Supported Indexing Methods

-We **do not** officially support the following indexing methods. Some of these methods may error out, others may not:
+The following indexing methods will trigger a fall back to the old `zarr-python` pipeline:

 1. Any `oindex` or `vindex` integer `np.ndarray` indexing with dimensionality >=3 i.e.,

@@ -116,6 +116,9 @@ We **do not** officially support the following indexing methods. Some of these
 arr[0:10, ..., 0:5]
 ```

-Otherwise, we believe that we support your indexing case: slices, ints, and all integer `np.ndarray` indices in 2D for reading, contiguous integer `np.ndarray` indices along one axis for writing etc. Please file an issue if you believe we have more holes in our coverage than we are aware of or you wish to contribute! For example, we have an [issue in zarrs for integer-array indexing](https://github.com/LDeakin/zarrs/issues/52) that would unblock a lot of these issues!
-That being said, using non-contiguous integer `np.ndarray` indexing for reads may not be as fast as expected given the performance of other supported methods. Until `zarrs` supports integer indexing, only fetching chunks is done in `rust` while indexing then occurs in `python`.
+Furthermore, using anything except contiguous indexing (i.e., slices or consecutive-integer `np.ndarray`s) for numeric data will fall back to the default `zarr-python` implementation.
+
+Please file an issue if you believe we have more holes in our coverage than we are aware of or if you wish to contribute! For example, we have an [issue in zarrs for integer-array indexing](https://github.com/LDeakin/zarrs/issues/52) that would unblock the use of the rust pipeline for that use-case (very useful for mini-batch training, perhaps!).
+
+Further, any codecs not supported by `zarrs` will also automatically fall back to the python implementation.
````
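The contiguity rule in the new README text can be made concrete with a small hypothetical helper (`is_contiguous_index` is an illustrative name, not part of `zarrs`) that mirrors the distinction between selections kept on the Rust path and those that fall back:

```python
import numpy as np

def is_contiguous_index(sel) -> bool:
    """True for selections the rule above keeps on the Rust path:
    ints, step-1 slices, or consecutive integer arrays."""
    if isinstance(sel, (int, np.integer)):
        return True
    if isinstance(sel, slice):
        return sel.step in (None, 1)
    arr = np.asarray(sel)
    # a 1-D run of consecutive integers (diff of 1 everywhere)
    return arr.ndim == 1 and arr.size > 0 and bool(np.all(np.diff(arr) == 1))

# slices and consecutive integers stay in rust; strided arrays fall back
assert is_contiguous_index(slice(0, 10))
assert is_contiguous_index([3, 4, 5])
assert not is_contiguous_index([0, 2, 4])
```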

docs/api.md

Lines changed: 0 additions & 13 deletions
This file was deleted.

docs/index.md

Lines changed: 0 additions & 1 deletion
````diff
@@ -5,6 +5,5 @@
 :hidden: true
 :maxdepth: 1

-api
 contributing
 ```
````

pyproject.toml

Lines changed: 1 addition & 1 deletion
```diff
@@ -30,7 +30,7 @@ dependencies = [
     'donfig',
     'pytest',
     'universal_pathlib>=0.2.0',
-    'zarr>=3.0.0',
+    "zarr",
 ]

 [project.optional-dependencies]
```

python/zarrs/_internal.pyi

Lines changed: 0 additions & 4 deletions
```diff
@@ -4,7 +4,6 @@
 import typing
 from enum import Enum, auto

-import numpy
 import numpy.typing

 class Basic:
@@ -27,9 +26,6 @@ class CodecPipelineImpl:
         chunk_descriptions: typing.Sequence[WithSubset],
         value: numpy.typing.NDArray[typing.Any],
     ) -> None: ...
-    def retrieve_chunks(
-        self, chunk_descriptions: typing.Sequence[Basic]
-    ) -> list[numpy.typing.NDArray[numpy.uint8]]: ...
     def store_chunks_with_indices(
         self,
         chunk_descriptions: typing.Sequence[WithSubset],
```

python/zarrs/pipeline.py

Lines changed: 115 additions & 39 deletions
```diff
@@ -2,15 +2,17 @@

 import asyncio
 import json
+import re
 from dataclasses import dataclass
 from typing import TYPE_CHECKING, TypedDict

 import numpy as np
 from zarr.abc.codec import Codec, CodecPipeline
+from zarr.core import BatchedCodecPipeline
 from zarr.core.config import config

 if TYPE_CHECKING:
-    from collections.abc import Iterable, Iterator
+    from collections.abc import Generator, Iterable, Iterator
     from typing import Any, Self

     from zarr.abc.store import ByteGetter, ByteSetter
@@ -20,28 +22,64 @@
     from zarr.core.common import ChunkCoords
     from zarr.core.indexing import SelectorTuple

-from ._internal import CodecPipelineImpl
+from ._internal import CodecPipelineImpl, codec_metadata_v2_to_v3
 from .utils import (
     CollapsedDimensionError,
     DiscontiguousArrayError,
-    make_chunk_info_for_rust,
+    FillValueNoneError,
     make_chunk_info_for_rust_with_indices,
 )


-def get_codec_pipeline_impl(codec_metadata_json: str) -> CodecPipelineImpl:
-    return CodecPipelineImpl(
-        codec_metadata_json,
-        validate_checksums=config.get("codec_pipeline.validate_checksums", None),
-        store_empty_chunks=config.get("array.write_empty_chunks", None),
-        chunk_concurrent_minimum=config.get(
-            "codec_pipeline.chunk_concurrent_minimum", None
-        ),
-        chunk_concurrent_maximum=config.get(
-            "codec_pipeline.chunk_concurrent_maximum", None
-        ),
-        num_threads=config.get("threading.max_workers", None),
-    )
+class UnsupportedDataTypeError(Exception):
+    pass
+
+
+class UnsupportedMetadataError(Exception):
+    pass
+
+
+def get_codec_pipeline_impl(codec_metadata_json: str) -> CodecPipelineImpl | None:
+    try:
+        return CodecPipelineImpl(
+            codec_metadata_json,
+            validate_checksums=config.get("codec_pipeline.validate_checksums", None),
+            store_empty_chunks=config.get("array.write_empty_chunks", None),
+            chunk_concurrent_minimum=config.get(
+                "codec_pipeline.chunk_concurrent_minimum", None
+            ),
+            chunk_concurrent_maximum=config.get(
+                "codec_pipeline.chunk_concurrent_maximum", None
+            ),
+            num_threads=config.get("threading.max_workers", None),
+        )
+    except TypeError as e:
+        if re.match(r"codec (delta|zlib) is not supported", str(e)):
+            return None
+        else:
+            raise e
+
+
+def codecs_to_dict(codecs: Iterable[Codec]) -> Generator[dict[str, Any], None, None]:
+    for codec in codecs:
+        if codec.__class__.__name__ == "V2Codec":
+            codec_dict = codec.to_dict()
+            if codec_dict.get("filters", None) is not None:
+                filters = [
+                    json.dumps(filter.get_config())
+                    for filter in codec_dict.get("filters")
+                ]
+            else:
+                filters = None
+            if codec_dict.get("compressor", None) is not None:
+                compressor = json.dumps(codec_dict.get("compressor").get_config())
+            else:
+                compressor = None
+            codecs_v3 = codec_metadata_v2_to_v3(filters, compressor)
+            for codec in codecs_v3:
+                yield json.loads(codec)
+        else:
+            yield codec.to_dict()


 class ZarrsCodecPipelineState(TypedDict):
@@ -52,8 +90,9 @@ class ZarrsCodecPipelineState(TypedDict):
 @dataclass
 class ZarrsCodecPipeline(CodecPipeline):
     codecs: tuple[Codec, ...]
-    impl: CodecPipelineImpl
+    impl: CodecPipelineImpl | None
     codec_metadata_json: str
+    python_impl: BatchedCodecPipeline

     def __getstate__(self) -> ZarrsCodecPipelineState:
         return {"codec_metadata_json": self.codec_metadata_json, "codecs": self.codecs}
@@ -62,13 +101,14 @@ def __setstate__(self, state: ZarrsCodecPipelineState):
         self.codecs = state["codecs"]
         self.codec_metadata_json = state["codec_metadata_json"]
         self.impl = get_codec_pipeline_impl(self.codec_metadata_json)
+        self.python_impl = BatchedCodecPipeline.from_codecs(self.codecs)

     def evolve_from_array_spec(self, array_spec: ArraySpec) -> Self:
         raise NotImplementedError("evolve_from_array_spec")

     @classmethod
     def from_codecs(cls, codecs: Iterable[Codec]) -> Self:
-        codec_metadata = [codec.to_dict() for codec in codecs]
+        codec_metadata = list(codecs_to_dict(codecs))
         codec_metadata_json = json.dumps(codec_metadata)
         # TODO: upstream zarr-python has not settled on how to deal with configs yet
         # Should they be checked when an array is created, or when an operation is performed?
@@ -78,6 +118,7 @@ def from_codecs(cls, codecs: Iterable[Codec]) -> Self:
             codec_metadata_json=codec_metadata_json,
             codecs=tuple(codecs),
             impl=get_codec_pipeline_impl(codec_metadata_json),
+            python_impl=BatchedCodecPipeline.from_codecs(codecs),
         )

     @property
@@ -120,29 +161,32 @@ async def read(
         drop_axes: tuple[int, ...] = (),  # FIXME: unused
     ) -> None:
         # FIXME: Error if array is not in host memory
-        out: NDArrayLike = out.as_ndarray_like()
         if not out.dtype.isnative:
             raise RuntimeError("Non-native byte order not supported")
         try:
+            if self.impl is None:
+                raise UnsupportedMetadataError()
+            self._raise_error_on_unsupported_batch_dtype(batch_info)
             chunks_desc = make_chunk_info_for_rust_with_indices(
                 batch_info, drop_axes, out.shape
             )
-        except (DiscontiguousArrayError, CollapsedDimensionError):
-            chunks_desc = make_chunk_info_for_rust(batch_info)
+        except (
+            UnsupportedMetadataError,
+            DiscontiguousArrayError,
+            CollapsedDimensionError,
+            UnsupportedDataTypeError,
+            FillValueNoneError,
+        ):
+            await self.python_impl.read(batch_info, out, drop_axes)
+            return None
         else:
+            out: NDArrayLike = out.as_ndarray_like()
             await asyncio.to_thread(
                 self.impl.retrieve_chunks_and_apply_index,
                 chunks_desc,
                 out,
             )
             return None
-        chunks = await asyncio.to_thread(self.impl.retrieve_chunks, chunks_desc)
-        for chunk, (_, spec, selection, out_selection) in zip(chunks, batch_info):
-            chunk_reshaped = chunk.view(spec.dtype).reshape(spec.shape)
-            chunk_selected = chunk_reshaped[selection]
-            if drop_axes:
-                chunk_selected = np.squeeze(chunk_selected, axis=drop_axes)
-            out[out_selection] = chunk_selected

     async def write(
         self,
@@ -152,14 +196,46 @@ async def write(
         value: NDBuffer,  # type: ignore
         drop_axes: tuple[int, ...] = (),
     ) -> None:
-        # FIXME: Error if array is not in host memory
-        value: NDArrayLike | np.ndarray = value.as_ndarray_like()
-        if not value.dtype.isnative:
-            value = np.ascontiguousarray(value, dtype=value.dtype.newbyteorder("="))
-        elif not value.flags.c_contiguous:
-            value = np.ascontiguousarray(value)
-        chunks_desc = make_chunk_info_for_rust_with_indices(
-            batch_info, drop_axes, value.shape
-        )
-        await asyncio.to_thread(self.impl.store_chunks_with_indices, chunks_desc, value)
-        return None
+        try:
+            if self.impl is None:
+                raise UnsupportedMetadataError()
+            self._raise_error_on_unsupported_batch_dtype(batch_info)
+            chunks_desc = make_chunk_info_for_rust_with_indices(
+                batch_info, drop_axes, value.shape
+            )
+        except (
+            UnsupportedMetadataError,
+            DiscontiguousArrayError,
+            CollapsedDimensionError,
+            UnsupportedDataTypeError,
+            FillValueNoneError,
+        ):
+            await self.python_impl.write(batch_info, value, drop_axes)
+            return None
+        else:
+            # FIXME: Error if array is not in host memory
+            value_np: NDArrayLike | np.ndarray = value.as_ndarray_like()
+            if not value_np.dtype.isnative:
+                value_np = np.ascontiguousarray(
+                    value_np, dtype=value_np.dtype.newbyteorder("=")
+                )
+            elif not value_np.flags.c_contiguous:
+                value_np = np.ascontiguousarray(value_np)
+            await asyncio.to_thread(
+                self.impl.store_chunks_with_indices, chunks_desc, value_np
+            )
+            return None
+
+    def _raise_error_on_unsupported_batch_dtype(
+        self,
+        batch_info: Iterable[
+            tuple[ByteSetter, ArraySpec, SelectorTuple, SelectorTuple]
+        ],
+    ):
+        # https://github.com/LDeakin/zarrs/blob/0532fe983b7b42b59dbf84e50a2fe5e6f7bad4ce/zarrs_metadata/src/v2_to_v3.rs#L289-L293 for VSUMm
+        # Further, our pipeline does not support variable-length objects due to limitations on decode_into, so object is also out
+        if any(
+            info.dtype.kind in {"V", "S", "U", "M", "m", "O"}
+            for (_, info, _, _) in batch_info
+        ):
+            raise UnsupportedDataTypeError()
```
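The control flow this diff adds to `read`/`write` is "probe the Rust path, and on any known unsupported-feature signal delegate to the stored `BatchedCodecPipeline`". A self-contained, hypothetical mock of that dispatch (the `Pipeline`/`py_fallback` names are illustrative, not the real classes):

```python
import asyncio

class UnsupportedDataTypeError(Exception):
    pass

class Pipeline:
    def __init__(self, impl, fallback):
        self.impl = impl          # None mimics rejected codec metadata
        self.fallback = fallback  # always-working pure-python pipeline

    async def read(self, dtype_kind: str) -> str:
        try:
            if self.impl is None:
                raise UnsupportedDataTypeError()
            if dtype_kind in set("VSUMmO"):  # same kinds the real pipeline rejects
                raise UnsupportedDataTypeError()
            # happy path: run the (blocking) native impl off the event loop
            return await asyncio.to_thread(self.impl)
        except UnsupportedDataTypeError:
            return await self.fallback()

async def py_fallback() -> str:
    return "python"

# unsupported metadata, supported dtype, unsupported dtype:
assert asyncio.run(Pipeline(None, py_fallback).read("f")) == "python"
assert asyncio.run(Pipeline(lambda: "rust", py_fallback).read("f")) == "rust"
assert asyncio.run(Pipeline(lambda: "rust", py_fallback).read("O")) == "python"
```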

python/zarrs/utils.py

Lines changed: 21 additions & 13 deletions
```diff
@@ -3,10 +3,12 @@
 import operator
 import os
 from functools import reduce
-from typing import TYPE_CHECKING
+from typing import TYPE_CHECKING, Any

 import numpy as np
+from zarr.core.array_spec import ArraySpec
 from zarr.core.indexing import SelectorTuple, is_integer
+from zarr.core.metadata.v2 import _default_fill_value

 from zarrs._internal import Basic, WithSubset

@@ -15,7 +17,6 @@
     from types import EllipsisType

     from zarr.abc.store import ByteGetter, ByteSetter
-    from zarr.core.array_spec import ArraySpec


 # adapted from https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.ThreadPoolExecutor
@@ -31,6 +32,10 @@ class CollapsedDimensionError(Exception):
     pass


+class FillValueNoneError(Exception):
+    pass
+
+
 # This is a (mostly) copy of the function from zarr.core.indexing that fixes:
 # DeprecationWarning: Conversion of an array with ndim > 0 to a scalar is deprecated
 # TODO: Upstream this fix
@@ -134,6 +139,12 @@ def get_shape_for_selector(
     return resulting_shape_from_index(shape, selector_tuple, drop_axes, pad=pad)


+def get_implicit_fill_value(dtype: np.dtype, fill_value: Any) -> Any:
+    if fill_value is None:
+        fill_value = _default_fill_value(dtype)
+    return fill_value
+
+
 def make_chunk_info_for_rust_with_indices(
     batch_info: Iterable[
         tuple[ByteGetter | ByteSetter, ArraySpec, SelectorTuple, SelectorTuple]
@@ -144,6 +155,14 @@ def make_chunk_info_for_rust_with_indices(
     shape = shape if shape else (1,)  # constant array
     chunk_info_with_indices: list[WithSubset] = []
     for byte_getter, chunk_spec, chunk_selection, out_selection in batch_info:
+        if chunk_spec.fill_value is None:
+            chunk_spec = ArraySpec(
+                chunk_spec.shape,
+                chunk_spec.dtype,
+                get_implicit_fill_value(chunk_spec.dtype, chunk_spec.fill_value),
+                chunk_spec.config,
+                chunk_spec.prototype,
+            )
         chunk_info = Basic(byte_getter, chunk_spec)
         out_selection_as_slices = selector_tuple_to_slice_selection(out_selection)
         chunk_selection_as_slices = selector_tuple_to_slice_selection(chunk_selection)
@@ -169,14 +188,3 @@ def make_chunk_info_for_rust_with_indices(
             )
         )
     return chunk_info_with_indices
-
-
-def make_chunk_info_for_rust(
-    batch_info: Iterable[
-        tuple[ByteGetter | ByteSetter, ArraySpec, SelectorTuple, SelectorTuple]
-    ],
-) -> list[Basic]:
-    return [
-        Basic(byte_interface, chunk_spec)
-        for (byte_interface, chunk_spec, _, _) in batch_info
-    ]
```
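The `get_implicit_fill_value` helper above resolves zarr v2's `fill_value: null` to a dtype-dependent default via zarr-python's `_default_fill_value`. A standalone, simplified illustration of the idea (`implicit_fill_value` and its defaults are a sketch; the real defaults live in zarr-python and are more complete):

```python
import numpy as np

def implicit_fill_value(dtype: np.dtype, fill_value):
    """Resolve a v2-style `None` fill value to a dtype-dependent default
    (illustration only; zarr-python's `_default_fill_value` is the authority)."""
    if fill_value is not None:
        return fill_value
    if dtype.kind in ("S", "V"):
        return b""
    if dtype.kind == "U":
        return ""
    return dtype.type(0)  # numeric kinds default to zero

assert implicit_fill_value(np.dtype("float64"), None) == 0.0
assert implicit_fill_value(np.dtype("U4"), None) == ""
assert implicit_fill_value(np.dtype("int32"), 7) == 7
```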
