Commit bb7befc

Merge branch 'main' into doc/threading-max-workers-docs
2 parents: b55dd0f + 8f14d67

15 files changed: 443 additions & 18 deletions

changes/2720.doc.md
Lines changed: 1 addition & 0 deletions

@@ -0,0 +1 @@
+Document removal of `zarr.storage.init_group` in v3 migration guide, with replacement using `zarr.open_group`/`zarr.create_group`.

changes/3748.feature.md
Lines changed: 1 addition & 0 deletions

@@ -0,0 +1 @@
+Added `array.read_missing_chunks` configuration option. When set to `False`, reading missing chunks raises a `ChunkNotFoundError` instead of filling them with the array's fill value.

docs/user-guide/arrays.md
Lines changed: 13 additions & 1 deletion

@@ -158,13 +158,25 @@ print(f"Shape after second append: {z.shape}")
 Zarr arrays are parametrized with a configuration that determines certain aspects of array behavior.
 
-We currently support two configuration options for arrays: `write_empty_chunks` and `order`.
+We currently support three configuration options for arrays: `write_empty_chunks`, `read_missing_chunks`, and `order`.
 
 | field | type | default | description |
 | - | - | - | - |
 | `write_empty_chunks` | `bool` | `False` | Controls whether empty chunks are written to storage. See [Empty chunks](performance.md#empty-chunks).
+| `read_missing_chunks` | `bool` | `True` | Controls whether missing chunks are filled with the array's fill value on read. If `False`, reading missing chunks raises a [`ChunkNotFoundError`][zarr.errors.ChunkNotFoundError].
 | `order` | `Literal["C", "F"]` | `"C"` | The memory layout of arrays returned when reading data from the store.
 
+!!! info
+    The Zarr V3 spec states that readers should interpret an uninitialized chunk as containing the
+    array's `fill_value`. By default, Zarr-Python follows this behavior: a missing chunk is treated
+    as uninitialized and filled with the array's `fill_value`. However, if you know that all chunks
+    have been written (i.e., are initialized), you may want to treat a missing chunk as an error. Set
+    `read_missing_chunks=False` to raise a [`ChunkNotFoundError`][zarr.errors.ChunkNotFoundError] instead.
+
+!!! note
+    `write_empty_chunks=False` skips writing chunks that are entirely the array's fill value.
+    If `read_missing_chunks=False`, attempting to read these missing chunks will raise a [`ChunkNotFoundError`][zarr.errors.ChunkNotFoundError].
+
 You can specify the configuration when you create an array with the `config` keyword argument.
 `config` can be passed as either a `dict` or an `ArrayConfig` object.
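The semantics documented above can be modeled in a few lines. The following is a toy sketch, not zarr's real implementation: the "store" is a plain dict of chunk keys, and `ChunkNotFoundError` here is a local stand-in for `zarr.errors.ChunkNotFoundError`.

```python
# Toy model of the `read_missing_chunks` semantics described above.
# Nothing here is zarr's actual API; the "store" is a plain dict.

class ChunkNotFoundError(Exception):
    """Local stand-in for zarr.errors.ChunkNotFoundError."""

def read_chunk(store, key, fill_value, *, read_missing_chunks=True):
    """Return the chunk's data, or handle a missing chunk per the config."""
    if key in store:
        return store[key]
    if read_missing_chunks:
        # Spec-compliant default: treat the missing chunk as uninitialized.
        return fill_value
    raise ChunkNotFoundError(f"chunk {key!r} not found in store")

store = {"c/0/0": [1, 2, 3]}
print(read_chunk(store, "c/0/1", fill_value=0))  # -> 0 (filled with fill value)
```

With `read_missing_chunks=False`, the same lookup raises instead of silently filling, which is the error-detection behavior the info box describes.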

docs/user-guide/config.md
Lines changed: 1 addition & 0 deletions

@@ -30,6 +30,7 @@ Configuration options include the following:
 - Default Zarr format `default_zarr_version`
 - Default array order in memory `array.order`
 - Whether empty chunks are written to storage `array.write_empty_chunks`
+- Whether missing chunks are filled with the array's fill value on read `array.read_missing_chunks` (default `True`). Set to `False` to raise a [`ChunkNotFoundError`][zarr.errors.ChunkNotFoundError] instead.
 - Async and threading options, e.g. `async.concurrency` and `threading.max_workers`
 - Selections of implementations of codecs, codec pipelines and buffers
 - Enabling GPU support with `zarr.config.enable_gpu()`. See GPU support for more.
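Options like `array.read_missing_chunks` are addressed by dotted keys resolved against a nested defaults mapping. As a rough, self-contained sketch of that key scheme (a simplification; zarr's real config is donfig-based and this mimics only the lookup):

```python
# Simplified sketch of dotted-key config lookup over nested defaults.
# The defaults shown mirror values from this commit; config_get is a toy helper.

defaults = {
    "array": {"order": "C", "write_empty_chunks": False, "read_missing_chunks": True},
    "async": {"concurrency": 10, "timeout": None},
}

def config_get(config, dotted_key):
    """Walk the nested dict one dotted segment at a time."""
    node = config
    for part in dotted_key.split("."):
        node = node[part]
    return node

print(config_get(defaults, "array.read_missing_chunks"))  # -> True
```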

docs/user-guide/v3_migration.md
Lines changed: 9 additions & 0 deletions

@@ -114,6 +114,15 @@ The following sections provide details on breaking changes in Zarr-Python 3.
    - Use [`zarr.Group.require_array`][] in place of `zarr.Group.require_dataset`
 3. Disallow "." syntax for getting group members. To get a member of a group named `foo`,
    use `group["foo"]` in place of `group.foo`.
+4. The `zarr.storage.init_group` low-level helper function has been removed. Use
+   [`zarr.open_group`][] or [`zarr.create_group`][] instead:
+
+   ```diff
+   - from zarr.storage import init_group
+   - init_group(store, overwrite=True, path="my/path")
+   + import zarr
+   + zarr.open_group(store, mode="w", path="my/path")
+   ```
 
 ### The Store class

src/zarr/abc/codec.py
Lines changed: 7 additions & 8 deletions

@@ -67,20 +67,19 @@ def _check_codecjson_v2(data: object) -> TypeGuard[CodecJSON_V2[str]]:
 
 
 @runtime_checkable
-class SupportsSyncCodec(Protocol):
+class SupportsSyncCodec[CI: CodecInput, CO: CodecOutput](Protocol):
     """Protocol for codecs that support synchronous encode/decode.
 
-    Codecs implementing this protocol provide ``_decode_sync`` and ``_encode_sync``
+    Codecs implementing this protocol provide `_decode_sync` and `_encode_sync`
     methods that perform encoding/decoding without requiring an async event loop.
+
+    The type parameters mirror `BaseCodec`: `CI` is the decoded type and `CO` is
+    the encoded type.
     """
 
-    def _decode_sync(
-        self, chunk_data: NDBuffer | Buffer, chunk_spec: ArraySpec
-    ) -> NDBuffer | Buffer: ...
+    def _decode_sync(self, chunk_data: CO, chunk_spec: ArraySpec) -> CI: ...
 
-    def _encode_sync(
-        self, chunk_data: NDBuffer | Buffer, chunk_spec: ArraySpec
-    ) -> NDBuffer | Buffer | None: ...
+    def _encode_sync(self, chunk_data: CI, chunk_spec: ArraySpec) -> CO | None: ...
 
 
 class BaseCodec[CI: CodecInput, CO: CodecOutput](Metadata):
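The structural check this change enables (`isinstance(codec, SupportsSyncCodec)` in `ChunkTransform.__post_init__`) works because the protocol is `runtime_checkable`: a class matches by providing the methods, without inheriting from the protocol. A minimal self-contained sketch, using toy types and the pre-3.12 `TypeVar` spelling instead of zarr's `Buffer`/`NDBuffer` and `class P[CI, CO]` syntax:

```python
from __future__ import annotations

from typing import Protocol, TypeVar, runtime_checkable

CI = TypeVar("CI")  # decoded ("input") type
CO = TypeVar("CO")  # encoded ("output") type

@runtime_checkable
class SupportsSyncCodec(Protocol[CI, CO]):
    """Toy version of the protocol: synchronous decode/encode, no event loop."""

    def _decode_sync(self, chunk_data: CO, chunk_spec: dict) -> CI: ...
    def _encode_sync(self, chunk_data: CI, chunk_spec: dict) -> CO | None: ...

class BytesCodec:
    """Encodes a list of small ints to bytes and back; never names the protocol."""

    def _encode_sync(self, chunk_data: list[int], chunk_spec: dict) -> bytes:
        return bytes(chunk_data)

    def _decode_sync(self, chunk_data: bytes, chunk_spec: dict) -> list[int]:
        return list(chunk_data)

codec = BytesCodec()
print(isinstance(codec, SupportsSyncCodec))  # structural match -> True
```

Note that `isinstance` against a runtime-checkable protocol only verifies that the method names exist; it cannot check the type parameters, which is why the real code pairs the check with type annotations.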

src/zarr/core/array.py
Lines changed: 19 additions & 2 deletions

@@ -117,6 +117,7 @@
 from zarr.core.sync import sync
 from zarr.errors import (
     ArrayNotFoundError,
+    ChunkNotFoundError,
     MetadataValidationError,
     ZarrDeprecationWarning,
     ZarrUserWarning,
@@ -5610,7 +5611,8 @@ async def _get_selection(
     _config = replace(_config, order=order)
 
     # reading chunks and decoding them
-    await codec_pipeline.read(
+    indexed_chunks = list(indexer)
+    results = await codec_pipeline.read(
         [
             (
                 store_path / metadata.encode_chunk_key(chunk_coords),
@@ -5619,11 +5621,26 @@ async def _get_selection(
                 out_selection,
                 is_complete_chunk,
             )
-            for chunk_coords, chunk_selection, out_selection, is_complete_chunk in indexer
+            for chunk_coords, chunk_selection, out_selection, is_complete_chunk in indexed_chunks
         ],
         out_buffer,
         drop_axes=indexer.drop_axes,
     )
+    if _config.read_missing_chunks is False:
+        missing_info = []
+        for i, result in enumerate(results):
+            if result["status"] == "missing":
+                coords = indexed_chunks[i][0]
+                key = metadata.encode_chunk_key(coords)
+                missing_info.append(f" chunk '{key}' (grid position {coords})")
+        if missing_info:
+            chunks_str = "\n".join(missing_info)
+            raise ChunkNotFoundError(
+                f"{len(missing_info)} chunk(s) not found in store '{store_path}'.\n"
+                f"Set the 'array.read_missing_chunks' config to True to fill "
+                f"missing chunks with the fill value.\n"
+                f"Missing chunks:\n{chunks_str}"
+            )
     if isinstance(indexer, BasicIndexer) and indexer.shape == ():
         return out_buffer.as_scalar()
     return out_buffer.as_ndarray_like()
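The post-read check added here can be sketched in isolation: given per-chunk read statuses and the indexed chunk list, collect the missing chunks and raise with a descriptive message. This is a hypothetical stand-alone version with the codec pipeline replaced by plain data; `ChunkNotFoundError` is a local stand-in for the zarr exception.

```python
# Stand-alone sketch of the missing-chunk check in `_get_selection`.
# `results[i]` is a dict with a "status" key; indexed chunks are 4-tuples
# whose first element is the chunk's grid position.

class ChunkNotFoundError(Exception):
    """Local stand-in for zarr.errors.ChunkNotFoundError."""

def check_missing(results, indexed_chunks, *, read_missing_chunks=True):
    if read_missing_chunks:
        return  # default: missing chunks are silently filled with the fill value
    missing = [
        f" chunk at grid position {coords}"
        for result, (coords, *_rest) in zip(results, indexed_chunks)
        if result["status"] == "missing"
    ]
    if missing:
        raise ChunkNotFoundError(
            f"{len(missing)} chunk(s) not found in store.\n"
            f"Missing chunks:\n" + "\n".join(missing)
        )

results = [{"status": "ok"}, {"status": "missing"}]
chunks = [((0, 0), None, None, True), ((0, 1), None, None, True)]
check_missing(results, chunks)  # read_missing_chunks=True: no error raised
```

Materializing the indexer into `indexed_chunks` (as the diff does) is what makes this possible: an indexer is a one-shot iterable, and the error path needs a second pass over the same chunk coordinates.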

src/zarr/core/array_spec.py
Lines changed: 18 additions & 3 deletions

@@ -28,6 +28,7 @@ class ArrayConfigParams(TypedDict):
 
     order: NotRequired[MemoryOrder]
     write_empty_chunks: NotRequired[bool]
+    read_missing_chunks: NotRequired[bool]
 
 
 @dataclass(frozen=True)
@@ -41,17 +42,25 @@ class ArrayConfig:
         The memory layout of the arrays returned when reading data from the store.
     write_empty_chunks : bool
         If True, empty chunks will be written to the store.
+    read_missing_chunks : bool
+        If True, missing chunks will be filled with the array's fill value on read.
+        If False, reading missing chunks will raise a ``ChunkNotFoundError``.
     """
 
     order: MemoryOrder
     write_empty_chunks: bool
+    read_missing_chunks: bool
 
-    def __init__(self, order: MemoryOrder, write_empty_chunks: bool) -> None:
+    def __init__(
+        self, order: MemoryOrder, write_empty_chunks: bool, *, read_missing_chunks: bool = True
+    ) -> None:
         order_parsed = parse_order(order)
         write_empty_chunks_parsed = parse_bool(write_empty_chunks)
+        read_missing_chunks_parsed = parse_bool(read_missing_chunks)
 
         object.__setattr__(self, "order", order_parsed)
         object.__setattr__(self, "write_empty_chunks", write_empty_chunks_parsed)
+        object.__setattr__(self, "read_missing_chunks", read_missing_chunks_parsed)
 
     @classmethod
     def from_dict(cls, data: ArrayConfigParams) -> Self:
@@ -62,7 +71,9 @@ def from_dict(cls, data: ArrayConfigParams) -> Self:
         """
         kwargs_out: ArrayConfigParams = {}
         for f in fields(ArrayConfig):
-            field_name = cast("Literal['order', 'write_empty_chunks']", f.name)
+            field_name = cast(
+                "Literal['order', 'write_empty_chunks', 'read_missing_chunks']", f.name
+            )
             if field_name not in data:
                 kwargs_out[field_name] = zarr_config.get(f"array.{field_name}")
             else:
@@ -73,7 +84,11 @@ def to_dict(self) -> ArrayConfigParams:
         """
         Serialize an instance of this class to a dict.
         """
-        return {"order": self.order, "write_empty_chunks": self.write_empty_chunks}
+        return {
+            "order": self.order,
+            "write_empty_chunks": self.write_empty_chunks,
+            "read_missing_chunks": self.read_missing_chunks,
+        }
 
 
 ArrayConfigLike = ArrayConfig | ArrayConfigParams
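The diff above uses a pattern worth calling out: a frozen dataclass with a hand-written `__init__` must assign fields via `object.__setattr__`, because ordinary assignment raises `FrozenInstanceError`, and `@dataclass` will not overwrite a user-defined `__init__`. A self-contained sketch of the same pattern, with a toy validator standing in for zarr's `parse_bool`:

```python
from dataclasses import dataclass

def parse_bool(value):
    """Toy validator standing in for zarr's parse_bool."""
    if not isinstance(value, bool):
        raise TypeError(f"expected bool, got {type(value).__name__}")
    return value

@dataclass(frozen=True)
class ToyConfig:
    write_empty_chunks: bool
    read_missing_chunks: bool

    def __init__(self, write_empty_chunks: bool, *, read_missing_chunks: bool = True) -> None:
        # `self.x = ...` would raise FrozenInstanceError on a frozen dataclass,
        # so validated values are installed with object.__setattr__.
        object.__setattr__(self, "write_empty_chunks", parse_bool(write_empty_chunks))
        object.__setattr__(self, "read_missing_chunks", parse_bool(read_missing_chunks))

cfg = ToyConfig(False)
print(cfg.read_missing_chunks)  # -> True (keyword-only default applies)
```

Making `read_missing_chunks` keyword-only with a default of `True`, as the real change does, keeps existing positional calls like `ArrayConfig(order, write_empty_chunks)` working unchanged.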

src/zarr/core/codec_pipeline.py
Lines changed: 107 additions & 1 deletion

@@ -1,6 +1,6 @@
 from __future__ import annotations
 
-from dataclasses import dataclass
+from dataclasses import dataclass, field
 from itertools import islice, pairwise
 from typing import TYPE_CHECKING, Any
 from warnings import warn
@@ -14,6 +14,7 @@
     Codec,
     CodecPipeline,
     GetResult,
+    SupportsSyncCodec,
 )
 from zarr.core.common import concurrent_map
 from zarr.core.config import config
@@ -66,6 +67,111 @@ def fill_value_or_default(chunk_spec: ArraySpec) -> Any:
     return fill_value
 
 
+@dataclass(slots=True, kw_only=True)
+class ChunkTransform:
+    """A synchronous codec chain bound to an ArraySpec.
+
+    Provides `encode` and `decode` for pure-compute codec operations
+    (no IO, no threading, no batching).
+
+    All codecs must implement `SupportsSyncCodec`. Construction will
+    raise `TypeError` if any codec does not.
+    """
+
+    codecs: tuple[Codec, ...]
+    array_spec: ArraySpec
+
+    # (sync codec, input_spec) pairs in pipeline order.
+    _aa_codecs: tuple[tuple[SupportsSyncCodec[NDBuffer, NDBuffer], ArraySpec], ...] = field(
+        init=False, repr=False, compare=False
+    )
+    _ab_codec: SupportsSyncCodec[NDBuffer, Buffer] = field(init=False, repr=False, compare=False)
+    _ab_spec: ArraySpec = field(init=False, repr=False, compare=False)
+    _bb_codecs: tuple[SupportsSyncCodec[Buffer, Buffer], ...] = field(
+        init=False, repr=False, compare=False
+    )
+
+    def __post_init__(self) -> None:
+        non_sync = [c for c in self.codecs if not isinstance(c, SupportsSyncCodec)]
+        if non_sync:
+            names = ", ".join(type(c).__name__ for c in non_sync)
+            raise TypeError(
+                f"All codecs must implement SupportsSyncCodec. The following do not: {names}"
+            )
+
+        aa, ab, bb = codecs_from_list(list(self.codecs))
+
+        aa_codecs: list[tuple[SupportsSyncCodec[NDBuffer, NDBuffer], ArraySpec]] = []
+        spec = self.array_spec
+        for aa_codec in aa:
+            assert isinstance(aa_codec, SupportsSyncCodec)
+            aa_codecs.append((aa_codec, spec))
+            spec = aa_codec.resolve_metadata(spec)
+
+        self._aa_codecs = tuple(aa_codecs)
+        assert isinstance(ab, SupportsSyncCodec)
+        self._ab_codec = ab
+        self._ab_spec = spec
+        bb_sync: list[SupportsSyncCodec[Buffer, Buffer]] = []
+        for bb_codec in bb:
+            assert isinstance(bb_codec, SupportsSyncCodec)
+            bb_sync.append(bb_codec)
+        self._bb_codecs = tuple(bb_sync)
+
+    def decode(
+        self,
+        chunk_bytes: Buffer,
+    ) -> NDBuffer:
+        """Decode a single chunk through the full codec chain, synchronously.
+
+        Pure compute -- no IO.
+        """
+        data: Buffer = chunk_bytes
+        for bb_codec in reversed(self._bb_codecs):
+            data = bb_codec._decode_sync(data, self._ab_spec)
+
+        chunk_array: NDBuffer = self._ab_codec._decode_sync(data, self._ab_spec)
+
+        for aa_codec, spec in reversed(self._aa_codecs):
+            chunk_array = aa_codec._decode_sync(chunk_array, spec)
+
+        return chunk_array
+
+    def encode(
+        self,
+        chunk_array: NDBuffer,
+    ) -> Buffer | None:
+        """Encode a single chunk through the full codec chain, synchronously.
+
+        Pure compute -- no IO.
+        """
+        aa_data: NDBuffer = chunk_array
+        for aa_codec, spec in self._aa_codecs:
+            aa_result = aa_codec._encode_sync(aa_data, spec)
+            if aa_result is None:
+                return None
+            aa_data = aa_result
+
+        ab_result = self._ab_codec._encode_sync(aa_data, self._ab_spec)
+        if ab_result is None:
+            return None
+
+        bb_data: Buffer = ab_result
+        for bb_codec in self._bb_codecs:
+            bb_result = bb_codec._encode_sync(bb_data, self._ab_spec)
+            if bb_result is None:
+                return None
+            bb_data = bb_result
+
+        return bb_data
+
+    def compute_encoded_size(self, byte_length: int, array_spec: ArraySpec) -> int:
+        for codec in self.codecs:
+            byte_length = codec.compute_encoded_size(byte_length, array_spec)
+            array_spec = codec.resolve_metadata(array_spec)
+        return byte_length
+
+
 @dataclass(frozen=True)
 class BatchedCodecPipeline(CodecPipeline):
     """Default codec pipeline.

src/zarr/core/config.py
Lines changed: 1 addition & 0 deletions

@@ -96,6 +96,7 @@ def enable_gpu(self) -> ConfigSet:
         "array": {
             "order": "C",
             "write_empty_chunks": False,
+            "read_missing_chunks": True,
             "target_shard_size_bytes": None,
         },
         "async": {"concurrency": 10, "timeout": None},

0 commit comments

Comments
 (0)