Commit f64de89

Merge remote-tracking branch 'upstream/main' into perf/codec-chain
# Conflicts:
#   src/zarr/abc/store.py
#   src/zarr/storage/_common.py
#   src/zarr/storage/_local.py
#   src/zarr/testing/store.py
#   tests/test_codecs/test_zstd.py
2 parents 55821b8 + 03355b8 commit f64de89

14 files changed

Lines changed: 211 additions & 118 deletions

docs/_static/favicon-96x96.png

12.4 KB

docs/overrides/stylesheets/extra.css

Lines changed: 0 additions & 5 deletions
@@ -56,11 +56,6 @@
 .md-header .md-search__input {
   background-color: rgba(255, 255, 255, 0.15);
   border: 1px solid rgba(255, 255, 255, 0.2);
-  color: white;
-}
-
-.md-header .md-search__input::placeholder {
-  color: rgba(255, 255, 255, 0.7);
 }
 
 /* Navigation tabs */

docs/user-guide/glossary.md

Lines changed: 110 additions & 0 deletions
@@ -0,0 +1,110 @@
+# Glossary
+
+This page defines key terms used throughout the zarr-python documentation and API.
+
+## Array Structure
+
+### Array
+
+An N-dimensional typed array stored in a Zarr [store](#store). An array's
+[metadata](#metadata) defines its shape, data type, chunk layout, and codecs.
+
+### Chunk
+
+The fundamental unit of data in a Zarr array. An array is divided into chunks
+along each dimension according to the [chunk grid](#chunk-grid), which is currently
+part of Zarr's private API. Each chunk is independently compressed and encoded
+through the array's [codec](#codec) pipeline.
+
+When [sharding](#shard) is used, "chunk" refers to the inner chunks within each
+shard, because those are the compressible units. The chunks are the smallest units
+that can be read independently.
+
+!!! warning "Convention specific to zarr-python"
+    The use of "chunk" to mean the inner sub-chunk within a shard is a convention
+    adopted by zarr-python's `Array` API. In the Zarr V3 specification and in other
+    Zarr implementations, "chunk" may refer to the top-level grid cells (which
+    zarr-python calls "shards" when the sharding codec is used). Be aware of this
+    distinction when working across libraries.
+
+**API**: [`Array.chunks`][zarr.Array.chunks] returns the chunk shape. When
+sharding is used, this is the inner chunk shape.
+
+### Chunk Grid
+
+The partitioning of an array's elements into [chunks](#chunk). In Zarr V3, the
+chunk grid is defined in the array [metadata](#metadata) and determines the
+boundaries of each storage object.
+
+When sharding is used, the chunk grid defines the [shard](#shard) boundaries,
+not the inner chunk boundaries. The inner chunk shape is defined within the
+[sharding codec](#shard).
+
+**API**: The `chunk_grid` field in array metadata contains the storage-level
+grid.
+
+### Shard
+
+A storage object that contains one or more [chunks](#chunk). Sharding reduces the
+number of objects in a [store](#store) by grouping chunks together, which
+improves performance on file systems and object storage.
+
+Within each shard, chunks are compressed independently and can be read
+individually. However, writing requires updating the full shard for consistency,
+making shards the unit of writing and chunks the unit of reading.
+
+Sharding is implemented as a [codec](#codec) (the sharding indexed codec).
+When sharding is used:
+
+- The [chunk grid](#chunk-grid) in metadata defines the shard boundaries
+- The sharding codec's `chunk_shape` defines the inner chunk size
+- Each shard contains `shard_shape / chunk_shape` chunks per dimension
+
+**API**: [`Array.shards`][zarr.Array.shards] returns the shard shape, or `None`
+if sharding is not used. [`Array.chunks`][zarr.Array.chunks] returns the inner
+chunk shape.
+
+## Storage
+
+### Store
+
+A key-value storage backend that holds Zarr data and metadata. Stores implement
+the [`zarr.abc.store.Store`][] interface. Examples include local file systems,
+cloud object storage (S3, GCS, Azure), zip files, and in-memory dictionaries.
+
+Each [chunk](#chunk) or [shard](#shard) is stored as a single value (object or
+file) in the store, addressed by a key derived from its grid coordinates.
+
+### Metadata
+
+The JSON document (`zarr.json`) that describes an [array](#array) or group. For
+arrays, metadata includes the shape, data type, [chunk grid](#chunk-grid), fill
+value, and [codec](#codec) pipeline. Metadata is stored alongside the data in
+the [store](#store). Zarr-Python does not yet expose its internal metadata
+representation as part of its public API.
+
+## Codecs
+
+### Codec
+
+A transformation applied to array data during reading and writing. Codecs are
+chained into a pipeline and come in three types:
+
+- **Array-to-array**: Transforms like transpose that rearrange array elements
+- **Array-to-bytes**: Serialization that converts an array to a byte sequence
+  (exactly one required)
+- **Bytes-to-bytes**: Compression or checksums applied to the serialized bytes
+
+The [sharding indexed codec](#shard) is a special array-to-bytes codec that
+groups multiple [chunks](#chunk) into a single storage object.
+
+## API Properties
+
+The following properties are available on [`zarr.Array`][]:
+
+| Property | Description |
+|----------|-------------|
+| `.chunks` | Chunk shape — the inner chunk shape when sharding is used |
+| `.shards` | Shard shape, or `None` if no sharding |
+| `.nchunks` | Total number of independently compressible units across the array |
+| `.cdata_shape` | Number of independently compressible units per dimension |
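
Note on the new glossary's "Shard" entry (not part of the commit): the statement that each shard contains `shard_shape / chunk_shape` chunks per dimension can be checked with plain arithmetic. The sketch below is illustrative only. The helper name `chunks_per_shard` and the example shapes are hypothetical, and it assumes the shard shape is an exact multiple of the inner chunk shape in every dimension.

```python
import math


def chunks_per_shard(shard_shape, chunk_shape):
    """Return the number of inner chunks along each dimension of one shard.

    Hypothetical helper for illustration; it is not part of zarr-python.
    """
    if len(shard_shape) != len(chunk_shape):
        raise ValueError("shard_shape and chunk_shape must have the same length")
    counts = []
    for shard_size, chunk_size in zip(shard_shape, chunk_shape):
        # Assumption: shards are whole multiples of the inner chunk size.
        if shard_size % chunk_size != 0:
            raise ValueError(
                f"shard size {shard_size} is not a multiple of chunk size {chunk_size}"
            )
        counts.append(shard_size // chunk_size)
    return tuple(counts)


# Example shapes chosen purely for illustration.
per_dimension = chunks_per_shard((1000, 1000), (100, 250))
total_in_shard = math.prod(per_dimension)
```

With these example shapes, `per_dimension` is the per-axis count and `total_in_shard` is the number of independently readable chunks that live inside one stored object.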

docs/user-guide/index.md

Lines changed: 4 additions & 0 deletions
@@ -35,6 +35,10 @@ Take your skills to the next level:
 - **[Extending](extending.md)** - Extend functionality with custom code
 - **[Consolidated Metadata](consolidated_metadata.md)** - Advanced metadata management
 
+## Reference
+
+- **[Glossary](glossary.md)** - Definitions of key terms (chunks, shards, codecs, etc.)
+
 ## Need Help?
 
 - Browse the [API Reference](../api/zarr/index.md) for detailed function documentation

mkdocs.yml

Lines changed: 2 additions & 0 deletions
@@ -27,6 +27,7 @@ nav:
     - user-guide/gpu.md
     - user-guide/consolidated_metadata.md
     - user-guide/experimental.md
+    - user-guide/glossary.md
   - Examples:
     - user-guide/examples/custom_dtype.md
   - API Reference:
@@ -84,6 +85,7 @@ theme:
   name: material
   custom_dir: docs/overrides
   logo: _static/logo_bw.png
+  favicon: _static/favicon-96x96.png
 
   palette:
     # Light mode

src/zarr/abc/store.py

Lines changed: 1 addition & 9 deletions
@@ -22,7 +22,6 @@
     "Store",
     "SupportsDeleteSync",
     "SupportsGetSync",
-    "SupportsSetRangeSync",
     "SupportsSetSync",
     "SupportsSyncStore",
     "set_or_delete",
@@ -726,20 +725,13 @@ class SupportsSetSync(Protocol):
     def set_sync(self, key: str, value: Buffer) -> None: ...
 
 
-@runtime_checkable
-class SupportsSetRangeSync(Protocol):
-    def set_range_sync(self, key: str, value: Buffer, start: int) -> None: ...
-
-
 @runtime_checkable
 class SupportsDeleteSync(Protocol):
     def delete_sync(self, key: str) -> None: ...
 
 
 @runtime_checkable
-class SupportsSyncStore(
-    SupportsGetSync, SupportsSetSync, SupportsSetRangeSync, SupportsDeleteSync, Protocol
-): ...
+class SupportsSyncStore(SupportsGetSync, SupportsSetSync, SupportsDeleteSync, Protocol): ...
 
 
 async def set_or_delete(byte_setter: ByteSetter, value: Buffer | None) -> None:

src/zarr/core/chunk_grids.py

Lines changed: 8 additions & 5 deletions
@@ -126,11 +126,14 @@ def normalize_chunks(chunks: Any, shape: tuple[int, ...], typesize: int) -> tupl
         chunks = tuple(int(chunks) for _ in shape)
 
     # handle dask-style chunks (iterable of iterables)
-    if all(isinstance(c, (tuple | list)) for c in chunks):
-        # take first chunk size for each dimension
-        chunks = tuple(
-            c[0] for c in chunks
-        )  # TODO: check/error/warn for irregular chunks (e.g. if c[0] != c[1:-1])
+    if all(isinstance(c, (tuple, list)) for c in chunks):
+        for i, c in enumerate(chunks):
+            if any(x != y for x, y in itertools.pairwise(c[:-1])) or (len(c) > 1 and c[-1] > c[0]):
+                raise ValueError(
+                    f"Irregular chunk sizes in dimension {i}: {tuple(c)}. "
+                    "Only uniform chunks (with an optional smaller final chunk) are supported."
+                )
+        chunks = tuple(c[0] for c in chunks)
 
     # handle bad dimensionality
     if len(chunks) > len(shape):

src/zarr/core/codec_pipeline.py

Lines changed: 4 additions & 2 deletions
@@ -370,6 +370,8 @@ async def read_batch(
                 chunk_array_batch, batch_info, strict=False
             ):
                 if chunk_array is not None:
+                    if drop_axes:
+                        chunk_array = chunk_array.squeeze(axis=drop_axes)
                     out[out_selection] = chunk_array
                 else:
                     out[out_selection] = fill_value_or_default(chunk_spec)
@@ -392,7 +394,7 @@ async def read_batch(
             ):
                 if chunk_array is not None:
                     tmp = chunk_array[chunk_selection]
-                    if drop_axes != ():
+                    if drop_axes:
                         tmp = tmp.squeeze(axis=drop_axes)
                     out[out_selection] = tmp
                 else:
@@ -431,7 +433,7 @@ def _merge_chunk_array(
         else:
             chunk_value = value[out_selection]
             # handle missing singleton dimensions
-            if drop_axes != ():
+            if drop_axes:
                 item = tuple(
                     None  # equivalent to np.newaxis
                     if idx in drop_axes
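
Note on the hunks above (not part of the commit): replacing `drop_axes != ()` with a truthiness test changes behavior only for empty sequences that are not tuples. The sketch below shows the difference in plain Python. The two helper names are hypothetical, and whether any caller actually passes a list for `drop_axes` is not visible in this diff.

```python
def needs_squeeze_old(drop_axes):
    # Pre-change check: only the empty *tuple* means "nothing to drop".
    return drop_axes != ()


def needs_squeeze_new(drop_axes):
    # Post-change check: any empty sequence means "nothing to drop".
    return bool(drop_axes)


# Compare both checks on a few candidate values.
comparison = {
    repr(candidate): (needs_squeeze_old(candidate), needs_squeeze_new(candidate))
    for candidate in [(), [], (0,), [1, 2]]
}
```

The two checks agree on every input here except the empty list, which the old comparison treats as "axes to drop" because a list never compares equal to a tuple.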

src/zarr/storage/_common.py

Lines changed: 18 additions & 8 deletions
@@ -5,7 +5,13 @@
 from pathlib import Path
 from typing import TYPE_CHECKING, Any, Literal, Self, TypeAlias
 
-from zarr.abc.store import ByteRequest, Store
+from zarr.abc.store import (
+    ByteRequest,
+    Store,
+    SupportsDeleteSync,
+    SupportsGetSync,
+    SupportsSetSync,
+)
 from zarr.core.buffer import Buffer, default_buffer_prototype
 from zarr.core.common import (
     ANY_ACCESS_MODE,
@@ -239,21 +245,25 @@ def get_sync(
         byte_range: ByteRequest | None = None,
     ) -> Buffer | None:
         """Synchronous read — delegates to ``self.store.get_sync(self.path, ...)``."""
+        if not isinstance(self.store, SupportsGetSync):
+            raise TypeError(f"Store {type(self.store).__name__} does not support synchronous get.")
         if prototype is None:
             prototype = default_buffer_prototype()
-        return self.store.get_sync(self.path, prototype=prototype, byte_range=byte_range)  # type: ignore[attr-defined, no-any-return]
+        return self.store.get_sync(self.path, prototype=prototype, byte_range=byte_range)
 
     def set_sync(self, value: Buffer) -> None:
         """Synchronous write — delegates to ``self.store.set_sync(self.path, value)``."""
-        self.store.set_sync(self.path, value)  # type: ignore[attr-defined]
-
-    def set_range_sync(self, value: Buffer, start: int) -> None:
-        """Synchronous byte-range write."""
-        self.store.set_range_sync(self.path, value, start)  # type: ignore[attr-defined]
+        if not isinstance(self.store, SupportsSetSync):
+            raise TypeError(f"Store {type(self.store).__name__} does not support synchronous set.")
+        self.store.set_sync(self.path, value)
 
     def delete_sync(self) -> None:
         """Synchronous delete — delegates to ``self.store.delete_sync(self.path)``."""
-        self.store.delete_sync(self.path)  # type: ignore[attr-defined]
+        if not isinstance(self.store, SupportsDeleteSync):
+            raise TypeError(
+                f"Store {type(self.store).__name__} does not support synchronous delete."
+            )
+        self.store.delete_sync(self.path)
 
     def __truediv__(self, other: str) -> StorePath:
         """Combine this store path with another path"""

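Note on the `_common.py` hunks above (not part of the commit): the new guards rely on `isinstance` checks against `@runtime_checkable` protocols, which lets the `# type: ignore` comments go away and turns a missing method into an explicit `TypeError`. The following is a minimal sketch of the pattern. The protocol here is a simplified stand-in with a shortened signature, and `SyncDictStore`, `AsyncOnlyStore`, and `read_sync` are hypothetical names used only for illustration.

```python
from typing import Optional, Protocol, runtime_checkable


@runtime_checkable
class SupportsGetSync(Protocol):
    # Simplified signature for illustration; the protocol in the diff is
    # called with additional keyword arguments (prototype, byte_range).
    def get_sync(self, key: str) -> Optional[bytes]: ...


class SyncDictStore:
    """Hypothetical store that offers a synchronous getter."""

    def __init__(self):
        self._data = {}

    def get_sync(self, key):
        return self._data.get(key)


class AsyncOnlyStore:
    """Hypothetical store with no synchronous getter."""

    async def get(self, key):
        return None


def read_sync(store, key):
    # Fail early with a clear message instead of an AttributeError later.
    if not isinstance(store, SupportsGetSync):
        raise TypeError(f"Store {type(store).__name__} does not support synchronous get.")
    return store.get_sync(key)
```

One caveat worth keeping in mind: an `isinstance` check against a runtime-checkable protocol only verifies that the named methods exist. It does not validate their signatures or return types.
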
src/zarr/storage/_local.py

Lines changed: 0 additions & 19 deletions
@@ -85,19 +85,6 @@ def _put(path: Path, value: Buffer, exclusive: bool = False) -> int:
         return f.write(view)
 
 
-def _put_range(path: Path, value: Buffer, start: int) -> None:
-    view = value.as_buffer_like()
-    file_size = path.stat().st_size
-    if start + len(view) > file_size:
-        raise ValueError(
-            f"set_range would write beyond the end of the stored value: "
-            f"start={start}, len(value)={len(view)}, stored size={file_size}"
-        )
-    with path.open("r+b") as f:
-        f.seek(start)
-        f.write(view)
-
-
 class LocalStore(Store):
     """
     Store for the local file system.
@@ -241,12 +228,6 @@ def set_sync(self, key: str, value: Buffer) -> None:
         path = self.root / key
         _put(path, value)
 
-    def set_range_sync(self, key: str, value: Buffer, start: int) -> None:
-        self._ensure_open_sync()
-        self._check_writable()
-        path = self.root / key
-        _put_range(path, value, start)
-
     def delete_sync(self, key: str) -> None:
         self._ensure_open_sync()
         self._check_writable()
