Commit 1bbc826

williamsnell, tomwhite, d-v-b, and maxrjones authored
Fill missing chunks (#3748)
* Add `codec_pipeline.fill_missing_chunks` config
* Set default for `fill_missing_chunks` in config.py. Add test replicating example in zarr-python #486.
* Add fill_missing_chunks to examples of config options.
* Add to /changes
* Parameterize tests to make sure we hit both branches of `if self.supports_partial_decode`.
* Fix lint errors: remove parentheses, type kwargs.
* Move config from codec_pipeline -> array. Update docs, tests.
* Delegate missing-shard detection away from _get_chunk_spec. Codify expected behaviour of fill_missing_chunks for both sharding and write_empty_chunks via tests. Use elif to make control flow slightly clearer.
* Define ChunkNotFoundError; expose chunk key and chunk index in ChunkNotFoundError
* update docs
* fix links
* cleanup
* Pass chunk indexes up
* fill_missing_chunks -> read_missing_chunks
* Resolve behavioural differences between main and maxrjones/zarr-python@37a40e3. Update docstrings to match current behaviour. Move description of sharding behaviour to test, now that it has no dedicated codepath.

---------

Co-authored-by: Tom White <tom.e.white@gmail.com>
Co-authored-by: Davis Bennett <davis.v.bennett@gmail.com>
Co-authored-by: Max Jones <14077947+maxrjones@users.noreply.github.com>
1 parent c9b534a commit 1bbc826

8 files changed: +164 -7 lines changed

changes/3748.feature.md

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+Added `array.read_missing_chunks` configuration option. When set to `False`, reading missing chunks raises a `ChunkNotFoundError` instead of filling them with the array's fill value.

docs/user-guide/arrays.md

Lines changed: 13 additions & 1 deletion
@@ -158,13 +158,25 @@ print(f"Shape after second append: {z.shape}")

 Zarr arrays are parametrized with a configuration that determines certain aspects of array behavior.

-We currently support two configuration options for arrays: `write_empty_chunks` and `order`.
+We currently support three configuration options for arrays: `write_empty_chunks`, `read_missing_chunks`, and `order`.

 | field | type | default | description |
 | - | - | - | - |
 | `write_empty_chunks` | `bool` | `False` | Controls whether empty chunks are written to storage. See [Empty chunks](performance.md#empty-chunks).
+| `read_missing_chunks` | `bool` | `True` | Controls whether missing chunks are filled with the array's fill value on read. If `False`, reading missing chunks raises a [`ChunkNotFoundError`][zarr.errors.ChunkNotFoundError].
 | `order` | `Literal["C", "F"]` | `"C"` | The memory layout of arrays returned when reading data from the store.

+!!! info
+    The Zarr V3 spec states that readers should interpret an uninitialized chunk as containing the
+    array's `fill_value`. By default, Zarr-Python follows this behavior: a missing chunk is treated
+    as uninitialized and filled with the array's `fill_value`. However, if you know that all chunks
+    have been written (i.e., are initialized), you may want to treat a missing chunk as an error. Set
+    `read_missing_chunks=False` to raise a [`ChunkNotFoundError`][zarr.errors.ChunkNotFoundError] instead.
+
+!!! note
+    `write_empty_chunks=False` skips writing chunks that are entirely the array's fill value.
+    If `read_missing_chunks=False`, attempting to read these missing chunks will raise a [`ChunkNotFoundError`][zarr.errors.ChunkNotFoundError].
+
 You can specify the configuration when you create an array with the `config` keyword argument.
 `config` can be passed as either a `dict` or an `ArrayConfig` object.
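Reviewer sketch (not part of the commit): the interaction described in the two admonitions can be modeled without Zarr at all. The toy below assumes a 1-D array whose chunks live in a plain `dict`; `write_1d`, `read_1d`, and the local `ChunkNotFoundError` are hypothetical stand-ins, not zarr-python APIs.

```python
from typing import Dict, List


class ChunkNotFoundError(Exception):
    """Stand-in for zarr.errors.ChunkNotFoundError (illustration only)."""


def write_1d(
    store: Dict[int, List[int]],
    data: List[int],
    chunk_len: int,
    fill_value: int,
    write_empty_chunks: bool = False,
) -> None:
    """Write `data` chunk by chunk into a dict-backed store (hypothetical helper)."""
    for start in range(0, len(data), chunk_len):
        chunk = data[start : start + chunk_len]
        key = start // chunk_len
        if not write_empty_chunks and all(v == fill_value for v in chunk):
            # An all-fill chunk is dropped, so a later read sees it as missing.
            store.pop(key, None)
        else:
            store[key] = chunk


def read_1d(
    store: Dict[int, List[int]],
    n_chunks: int,
    chunk_len: int,
    fill_value: int,
    read_missing_chunks: bool = True,
) -> List[int]:
    """Read every chunk; fill or raise on missing ones depending on the flag."""
    out: List[int] = []
    missing: List[int] = []
    for key in range(n_chunks):
        chunk = store.get(key)
        if chunk is None:
            missing.append(key)
            chunk = [fill_value] * chunk_len
        out.extend(chunk)
    if missing and not read_missing_chunks:
        raise ChunkNotFoundError(f"{len(missing)} chunk(s) not found: {missing}")
    return out


store: Dict[int, List[int]] = {}
write_1d(store, [1, 2, 0, 0], chunk_len=2, fill_value=0)

# Default mode: the dropped second chunk is filled with the fill value.
filled = read_1d(store, n_chunks=2, chunk_len=2, fill_value=0)

# Strict mode: the same read now raises.
try:
    read_1d(store, n_chunks=2, chunk_len=2, fill_value=0, read_missing_chunks=False)
    strict_error = None
except ChunkNotFoundError as exc:
    strict_error = str(exc)
```

In this model the second chunk is never stored because it equals the fill value, which is why the strict read fails even though every element was "written".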

docs/user-guide/config.md

Lines changed: 1 addition & 0 deletions
@@ -30,6 +30,7 @@ Configuration options include the following:
 - Default Zarr format `default_zarr_version`
 - Default array order in memory `array.order`
 - Whether empty chunks are written to storage `array.write_empty_chunks`
+- Whether missing chunks are filled with the array's fill value on read `array.read_missing_chunks` (default `True`). Set to `False` to raise a [`ChunkNotFoundError`][zarr.errors.ChunkNotFoundError] instead.
 - Async and threading options, e.g. `async.concurrency` and `threading.max_workers`
 - Selections of implementations of codecs, codec pipelines and buffers
 - Enabling GPU support with `zarr.config.enable_gpu()`. See GPU support for more.

src/zarr/core/array.py

Lines changed: 19 additions & 2 deletions
@@ -117,6 +117,7 @@
 from zarr.core.sync import sync
 from zarr.errors import (
     ArrayNotFoundError,
+    ChunkNotFoundError,
     MetadataValidationError,
     ZarrDeprecationWarning,
     ZarrUserWarning,
@@ -5610,7 +5611,8 @@ async def _get_selection(
         _config = replace(_config, order=order)

     # reading chunks and decoding them
-    await codec_pipeline.read(
+    indexed_chunks = list(indexer)
+    results = await codec_pipeline.read(
         [
             (
                 store_path / metadata.encode_chunk_key(chunk_coords),
@@ -5619,11 +5621,26 @@ async def _get_selection(
                 out_selection,
                 is_complete_chunk,
             )
-            for chunk_coords, chunk_selection, out_selection, is_complete_chunk in indexer
+            for chunk_coords, chunk_selection, out_selection, is_complete_chunk in indexed_chunks
         ],
         out_buffer,
         drop_axes=indexer.drop_axes,
     )
+    if _config.read_missing_chunks is False:
+        missing_info = []
+        for i, result in enumerate(results):
+            if result["status"] == "missing":
+                coords = indexed_chunks[i][0]
+                key = metadata.encode_chunk_key(coords)
+                missing_info.append(f" chunk '{key}' (grid position {coords})")
+        if missing_info:
+            chunks_str = "\n".join(missing_info)
+            raise ChunkNotFoundError(
+                f"{len(missing_info)} chunk(s) not found in store '{store_path}'.\n"
+                f"Set the 'array.read_missing_chunks' config to True to fill "
+                f"missing chunks with the fill value.\n"
+                f"Missing chunks:\n{chunks_str}"
+            )
     if isinstance(indexer, BasicIndexer) and indexer.shape == ():
         return out_buffer.as_scalar()
     return out_buffer.as_ndarray_like()
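Reviewer sketch (not part of the commit): the new block in `_get_selection` relies on the read results being positionally aligned with the chunks that were requested, which is presumably why the indexer is materialized into a list before being used twice. The standalone version below isolates that aggregation step. `raise_if_missing`, `ReadResult`, `v3_style_key`, and the `"present"` status value are hypothetical; the diff only shows a `"status"` key compared against `"missing"`.

```python
from typing import Callable, Optional, Sequence, Tuple, TypedDict


class ChunkNotFoundError(Exception):
    """Stand-in for zarr.errors.ChunkNotFoundError (illustration only)."""


class ReadResult(TypedDict):
    # Assumed shape of one per-chunk result; only "status" appears in the diff.
    status: str


def raise_if_missing(
    results: Sequence[ReadResult],
    chunk_coords: Sequence[Tuple[int, ...]],
    encode_chunk_key: Callable[[Tuple[int, ...]], str],
    store_name: str,
) -> None:
    """Raise a single error that names every chunk whose status is "missing"."""
    missing_info = [
        f"  chunk '{encode_chunk_key(coords)}' (grid position {coords})"
        for result, coords in zip(results, chunk_coords)
        if result["status"] == "missing"
    ]
    if missing_info:
        raise ChunkNotFoundError(
            f"{len(missing_info)} chunk(s) not found in store '{store_name}'.\n"
            "Missing chunks:\n" + "\n".join(missing_info)
        )


def v3_style_key(coords: Tuple[int, ...]) -> str:
    """Hypothetical key encoder that joins coordinates with '/' under a 'c' prefix."""
    return "c/" + "/".join(str(c) for c in coords)


coords = [(0, 0), (0, 1)]
results: Sequence[ReadResult] = [{"status": "present"}, {"status": "missing"}]

message: Optional[str] = None
try:
    raise_if_missing(results, coords, v3_style_key, "memory://demo")
except ChunkNotFoundError as exc:
    message = str(exc)
```

Collecting all missing chunks before raising means one failed read reports every absent key at once, rather than stopping at the first.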

src/zarr/core/array_spec.py

Lines changed: 18 additions & 3 deletions
@@ -28,6 +28,7 @@ class ArrayConfigParams(TypedDict):

     order: NotRequired[MemoryOrder]
     write_empty_chunks: NotRequired[bool]
+    read_missing_chunks: NotRequired[bool]


 @dataclass(frozen=True)
@@ -41,17 +42,25 @@ class ArrayConfig:
         The memory layout of the arrays returned when reading data from the store.
     write_empty_chunks : bool
         If True, empty chunks will be written to the store.
+    read_missing_chunks : bool
+        If True, missing chunks will be filled with the array's fill value on read.
+        If False, reading missing chunks will raise a ``ChunkNotFoundError``.
     """

     order: MemoryOrder
     write_empty_chunks: bool
+    read_missing_chunks: bool

-    def __init__(self, order: MemoryOrder, write_empty_chunks: bool) -> None:
+    def __init__(
+        self, order: MemoryOrder, write_empty_chunks: bool, *, read_missing_chunks: bool = True
+    ) -> None:
         order_parsed = parse_order(order)
         write_empty_chunks_parsed = parse_bool(write_empty_chunks)
+        read_missing_chunks_parsed = parse_bool(read_missing_chunks)

         object.__setattr__(self, "order", order_parsed)
         object.__setattr__(self, "write_empty_chunks", write_empty_chunks_parsed)
+        object.__setattr__(self, "read_missing_chunks", read_missing_chunks_parsed)

     @classmethod
     def from_dict(cls, data: ArrayConfigParams) -> Self:
@@ -62,7 +71,9 @@ def from_dict(cls, data: ArrayConfigParams) -> Self:
         """
         kwargs_out: ArrayConfigParams = {}
         for f in fields(ArrayConfig):
-            field_name = cast("Literal['order', 'write_empty_chunks']", f.name)
+            field_name = cast(
+                "Literal['order', 'write_empty_chunks', 'read_missing_chunks']", f.name
+            )
             if field_name not in data:
                 kwargs_out[field_name] = zarr_config.get(f"array.{field_name}")
             else:
@@ -73,7 +84,11 @@ def to_dict(self) -> ArrayConfigParams:
         """
         Serialize an instance of this class to a dict.
         """
-        return {"order": self.order, "write_empty_chunks": self.write_empty_chunks}
+        return {
+            "order": self.order,
+            "write_empty_chunks": self.write_empty_chunks,
+            "read_missing_chunks": self.read_missing_chunks,
+        }


 ArrayConfigLike = ArrayConfig | ArrayConfigParams
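Reviewer sketch (not part of the commit): the pattern used by `ArrayConfig` here is a frozen dataclass with a hand-written `__init__`, plus a `from_dict` that falls back to global defaults for omitted fields. The minimal model below uses only the standard library; `DemoArrayConfig` and `GLOBAL_DEFAULTS` are hypothetical names, and `bool(...)` is a simplification standing in for the `parse_bool` validation in the real code.

```python
from dataclasses import FrozenInstanceError, dataclass, fields
from typing import Any, Dict

# Hypothetical stand-in for the "array.*" section of the global config.
GLOBAL_DEFAULTS: Dict[str, Any] = {
    "order": "C",
    "write_empty_chunks": False,
    "read_missing_chunks": True,
}


@dataclass(frozen=True)
class DemoArrayConfig:
    order: str
    write_empty_chunks: bool
    read_missing_chunks: bool

    def __init__(
        self, order: str, write_empty_chunks: bool, *, read_missing_chunks: bool = True
    ) -> None:
        # A frozen dataclass blocks ordinary attribute assignment, so a custom
        # __init__ has to go through object.__setattr__.
        object.__setattr__(self, "order", order)
        object.__setattr__(self, "write_empty_chunks", bool(write_empty_chunks))
        object.__setattr__(self, "read_missing_chunks", bool(read_missing_chunks))

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "DemoArrayConfig":
        # Any field the caller omits is taken from the global defaults.
        kwargs = {f.name: data.get(f.name, GLOBAL_DEFAULTS[f.name]) for f in fields(cls)}
        return cls(**kwargs)


# Two-argument construction still works because the new field is
# keyword-only and has a default.
legacy = DemoArrayConfig("F", True)

# Only the overridden field differs from the global defaults.
strict = DemoArrayConfig.from_dict({"read_missing_chunks": False})

# Instances stay immutable after construction.
try:
    strict.order = "F"  # type: ignore[misc]
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Making the new parameter keyword-only with a default of `True` appears to keep existing two-argument callers working unchanged, which matches the signature shown in the diff.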

src/zarr/core/config.py

Lines changed: 1 addition & 0 deletions
@@ -96,6 +96,7 @@ def enable_gpu(self) -> ConfigSet:
             "array": {
                 "order": "C",
                 "write_empty_chunks": False,
+                "read_missing_chunks": True,
                 "target_shard_size_bytes": None,
             },
             "async": {"concurrency": 10, "timeout": None},

src/zarr/errors.py

Lines changed: 7 additions & 0 deletions
@@ -3,6 +3,7 @@
     "ArrayNotFoundError",
     "BaseZarrError",
     "BoundsCheckError",
+    "ChunkNotFoundError",
     "ContainsArrayAndGroupError",
     "ContainsArrayError",
     "ContainsGroupError",
@@ -144,3 +145,9 @@ class BoundsCheckError(IndexError): ...


 class ArrayIndexError(IndexError): ...
+
+
+class ChunkNotFoundError(BaseZarrError):
+    """
+    Raised when a chunk that was expected to exist in storage was not retrieved successfully.
+    """

tests/test_config.py

Lines changed: 104 additions & 1 deletion
@@ -23,7 +23,7 @@
 from zarr.core.codec_pipeline import BatchedCodecPipeline
 from zarr.core.config import BadConfigError, config
 from zarr.core.indexing import SelectorTuple
-from zarr.errors import ZarrUserWarning
+from zarr.errors import ChunkNotFoundError, ZarrUserWarning
 from zarr.registry import (
     fully_qualified_name,
     get_buffer_class,
@@ -53,6 +53,7 @@ def test_config_defaults_set() -> None:
             "array": {
                 "order": "C",
                 "write_empty_chunks": False,
+                "read_missing_chunks": True,
                 "target_shard_size_bytes": None,
             },
             "async": {"concurrency": 10, "timeout": None},
@@ -319,6 +320,108 @@ class NewCodec2(BytesCodec):
         get_codec_class("new_codec")


+@pytest.mark.parametrize("store", ["local", "memory"], indirect=["store"])
+@pytest.mark.parametrize(
+    "kwargs",
+    [
+        {"shards": (4, 4)},
+        {"compressors": None},
+    ],
+    ids=["partial_decode", "full_decode"],
+)
+def test_config_read_missing_chunks(store: Store, kwargs: dict[str, Any]) -> None:
+    arr = zarr.create_array(
+        store=store,
+        shape=(4, 4),
+        chunks=(2, 2),
+        dtype="int32",
+        fill_value=42,
+        **kwargs,
+    )
+
+    # default behavior: missing chunks are filled with the fill value
+    result = zarr.open_array(store)[:]
+    assert np.array_equal(result, np.full((4, 4), 42, dtype="int32"))
+
+    # with read_missing_chunks=False, reading missing chunks raises an error
+    with config.set({"array.read_missing_chunks": False}):
+        with pytest.raises(ChunkNotFoundError):
+            zarr.open_array(store)[:]
+
+    # after writing data, all chunks exist and no error is raised
+    arr[:] = np.arange(16, dtype="int32").reshape(4, 4)
+    with config.set({"array.read_missing_chunks": False}):
+        result = zarr.open_array(store)[:]
+        assert np.array_equal(result, np.arange(16, dtype="int32").reshape(4, 4))
+
+
+@pytest.mark.parametrize("store", ["local", "memory"], indirect=["store"])
+def test_config_read_missing_chunks_sharded_inner(store: Store) -> None:
+    """Because the shard index and inner chunks should be stored
+    together in a single storage object (read: a file or blob),
+    we delegate to the shard index the responsibility of determining
+    what chunks should be present.
+
+    Thus, `read_missing_chunks` raises an error only if the entire *shard*
+    is missing. Missing inner chunks are filled with the array's fill value
+    and do not raise an error, even if `read_missing_chunks=False` at the
+    array level.
+    """
+    arr = zarr.create_array(
+        store=store,
+        shape=(8, 4),
+        chunks=(2, 2),
+        shards=(4, 4),
+        dtype="int32",
+        fill_value=42,
+    )
+
+    # write only one inner chunk in the first shard, leaving the second shard empty
+    arr[0:2, 0:2] = np.ones((2, 2), dtype="int32")
+
+    with config.set({"array.read_missing_chunks": False}):
+        a = zarr.open_array(store)
+
+        # first shard exists: missing inner chunks are filled, no error
+        result = a[:4]
+        expected = np.full((4, 4), 42, dtype="int32")
+        expected[0:2, 0:2] = 1
+        assert np.array_equal(result, expected)
+
+        # second shard is entirely missing: raises an error
+        with pytest.raises(ChunkNotFoundError):
+            a[4:]
+
+
+@pytest.mark.parametrize("store", ["local", "memory"], indirect=["store"])
+def test_config_read_missing_chunks_write_empty_chunks(store: Store) -> None:
+    """write_empty_chunks=False drops chunks equal to fill_value, which then
+    appear missing to read_missing_chunks=False."""
+    arr = zarr.create_array(
+        store=store,
+        shape=(4,),
+        chunks=(2,),
+        dtype="int32",
+        fill_value=0,
+        config={"write_empty_chunks": False, "read_missing_chunks": False},
+    )
+
+    # write non-fill-value data: chunks are stored
+    arr[:] = [1, 2, 3, 4]
+    assert np.array_equal(arr[:], [1, 2, 3, 4])
+
+    # overwrite with fill_value: chunks are dropped by write_empty_chunks=False
+    arr[:] = 0
+    with pytest.raises(ChunkNotFoundError):
+        arr[:]
+
+    # with write_empty_chunks=True, chunks are kept and no error is raised
+    with config.set({"array.write_empty_chunks": True}):
+        arr = zarr.open_array(store)
+        arr[:] = 0
+        assert np.array_equal(arr[:], [0, 0, 0, 0])
+
+
 @pytest.mark.parametrize(
     "key",
     [
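Reviewer sketch (not part of the commit): the sharding rule stated in the `test_config_read_missing_chunks_sharded_inner` docstring can be modeled with nested dicts, where the outer dict is the store and each inner dict plays the role of a shard's index. `read_shard`, the `Shard` alias, and the `"c/0"`-style keys are hypothetical and only illustrate the rule that a whole missing shard raises while a missing inner chunk is filled.

```python
from typing import Dict, List, Optional


class ChunkNotFoundError(Exception):
    """Stand-in for zarr.errors.ChunkNotFoundError (illustration only)."""


# Inner-chunk index -> data; an absent key models an empty shard-index entry.
Shard = Dict[int, List[int]]


def read_shard(
    store: Dict[str, Shard],
    shard_key: str,
    n_inner: int,
    chunk_len: int,
    fill_value: int,
    read_missing_chunks: bool,
) -> List[int]:
    """Read one shard from a dict-backed store (hypothetical helper)."""
    shard = store.get(shard_key)
    if shard is None:
        # The whole storage object is absent: the only case that can raise.
        if not read_missing_chunks:
            raise ChunkNotFoundError(f"shard '{shard_key}' not found")
        return [fill_value] * (n_inner * chunk_len)
    out: List[int] = []
    for inner in range(n_inner):
        # The shard exists, so an absent inner chunk is filled even in strict mode.
        out.extend(shard.get(inner, [fill_value] * chunk_len))
    return out


# One shard with a single written inner chunk; the second shard was never stored.
store: Dict[str, Shard] = {"c/0": {0: [1, 1]}}

first = read_shard(
    store, "c/0", n_inner=2, chunk_len=2, fill_value=42, read_missing_chunks=False
)

second_error: Optional[str] = None
try:
    read_shard(
        store, "c/1", n_inner=2, chunk_len=2, fill_value=42, read_missing_chunks=False
    )
except ChunkNotFoundError as exc:
    second_error = str(exc)
```

The design choice this illustrates is that presence is judged per storage object: the shard index is trusted to say which inner chunks exist, so only an absent shard counts as a missing chunk.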
