Commit 682273c

Merge branch 'main' of github.com:zarr-developers/zarr-python into feat/v3-scale-offset-cast
2 parents 83fad88 + 7c78574 commit 682273c

5 files changed (+132, -5 lines)

docs/user-guide/glossary.md

Lines changed: 110 additions & 0 deletions
@@ -0,0 +1,110 @@
# Glossary

This page defines key terms used throughout the zarr-python documentation and API.

## Array Structure

### Array

An N-dimensional typed array stored in a Zarr [store](#store). An array's
[metadata](#metadata) defines its shape, data type, chunk layout, and codecs.

### Chunk

The fundamental unit of data in a Zarr array. An array is divided into chunks
along each dimension according to the [chunk grid](#chunk-grid), which is currently
part of Zarr's private API. Each chunk is independently compressed and encoded
through the array's [codec](#codec) pipeline.

When [sharding](#shard) is used, "chunk" refers to the inner chunks within each
shard, because those are the compressible units. The chunks are the smallest units
that can be read independently.

!!! warning "Convention specific to zarr-python"
    The use of "chunk" to mean the inner sub-chunk within a shard is a convention
    adopted by zarr-python's `Array` API. In the Zarr V3 specification and in other
    Zarr implementations, "chunk" may refer to the top-level grid cells (which
    zarr-python calls "shards" when the sharding codec is used). Be aware of this
    distinction when working across libraries.

**API**: [`Array.chunks`][zarr.Array.chunks] returns the chunk shape. When
sharding is used, this is the inner chunk shape.

### Chunk Grid

The partitioning of an array's elements into [chunks](#chunk). In Zarr V3, the
chunk grid is defined in the array [metadata](#metadata) and determines the
boundaries of each storage object.

When sharding is used, the chunk grid defines the [shard](#shard) boundaries,
not the inner chunk boundaries. The inner chunk shape is defined within the
[sharding codec](#shard).

**API**: The `chunk_grid` field in array metadata contains the storage-level
grid.

### Shard

A storage object that contains one or more [chunks](#chunk). Sharding reduces the
number of objects in a [store](#store) by grouping chunks together, which
improves performance on file systems and object storage.

Within each shard, chunks are compressed independently and can be read
individually. However, writing requires updating the full shard for consistency,
making shards the unit of writing and chunks the unit of reading.

Sharding is implemented as a [codec](#codec) (the sharding indexed codec).
When sharding is used:

- The [chunk grid](#chunk-grid) in metadata defines the shard boundaries
- The sharding codec's `chunk_shape` defines the inner chunk size
- Each shard contains `shard_shape / chunk_shape` chunks per dimension

**API**: [`Array.shards`][zarr.Array.shards] returns the shard shape, or `None`
if sharding is not used. [`Array.chunks`][zarr.Array.chunks] returns the inner
chunk shape.
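
The `shard_shape / chunk_shape` relationship in the list above can be sketched in plain Python. This is an illustration only: `chunks_per_shard` is a hypothetical helper, not part of zarr-python, and it assumes each shard dimension is an exact multiple of the inner chunk dimension.

```python
import math


def chunks_per_shard(shard_shape, chunk_shape):
    """Return the number of inner chunks per dimension within one shard.

    Hypothetical helper for illustration; assumes even division.
    """
    if len(shard_shape) != len(chunk_shape):
        raise ValueError("shard_shape and chunk_shape must have the same length")
    counts = []
    for shard_len, chunk_len in zip(shard_shape, chunk_shape):
        if shard_len % chunk_len != 0:
            raise ValueError(
                f"shard length {shard_len} is not a multiple of chunk length {chunk_len}"
            )
        counts.append(shard_len // chunk_len)
    return tuple(counts)


per_dim = chunks_per_shard((1000, 1000), (100, 250))
print(per_dim)             # (10, 4)
print(math.prod(per_dim))  # 40 inner chunks in each shard
```

Here a write touching any of the 40 inner chunks would rewrite the one shard that holds them, while a read can fetch a single inner chunk.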

## Storage

### Store

A key-value storage backend that holds Zarr data and metadata. Stores implement
the [`zarr.abc.store.Store`][] interface. Examples include local file systems,
cloud object storage (S3, GCS, Azure), zip files, and in-memory dictionaries.

Each [chunk](#chunk) or [shard](#shard) is stored as a single value (object or
file) in the store, addressed by a key derived from its grid coordinates.
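
As a rough illustration of how a key can be derived from grid coordinates, the sketch below follows the default chunk key encoding of the Zarr V3 specification (a `c` prefix followed by the coordinates, joined with `/`). The function name `default_chunk_key` is hypothetical and is not a zarr-python API; the full store key would also carry the array's path as a prefix.

```python
def default_chunk_key(coords, separator="/"):
    """Build a chunk key from grid coordinates (hypothetical helper).

    Mirrors the V3 "default" chunk key encoding: "c" plus one
    component per dimension.
    """
    return separator.join(["c", *(str(i) for i in coords)])


print(default_chunk_key((0, 1)))     # c/0/1
print(default_chunk_key((3, 0, 7)))  # c/3/0/7
print(default_chunk_key(()))         # c  (zero-dimensional array)
```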

### Metadata

The JSON document (`zarr.json`) that describes an [array](#array) or group. For
arrays, metadata includes the shape, data type, [chunk grid](#chunk-grid), fill
value, and [codec](#codec) pipeline. Metadata is stored alongside the data in
the [store](#store). Zarr-Python does not yet expose its internal metadata
representation as part of its public API.
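
To make the fields above concrete, here is a hand-written sketch of what a `zarr.json` document for a sharded array might look like, built as a Python dictionary. The values (shapes, data type, compression settings) are invented for illustration and the document was not produced by zarr-python; the field names follow the Zarr V3 specification as best understood here. Note that the `chunk_grid` carries the shard shape, while the inner chunk shape lives in the sharding codec's configuration.

```python
import json

# Illustrative values only (assumptions, not taken from a real array).
shard_shape = [1000, 1000]
inner_chunk_shape = [100, 100]

metadata = {
    "zarr_format": 3,
    "node_type": "array",
    "shape": [10000, 10000],
    "data_type": "float32",
    # Storage-level grid: one grid cell per shard.
    "chunk_grid": {"name": "regular", "configuration": {"chunk_shape": shard_shape}},
    "chunk_key_encoding": {"name": "default", "configuration": {"separator": "/"}},
    "fill_value": 0,
    "codecs": [
        {
            "name": "sharding_indexed",
            "configuration": {
                # Inner chunks: the independently compressed units.
                "chunk_shape": inner_chunk_shape,
                "codecs": [
                    {"name": "bytes", "configuration": {"endian": "little"}},
                    {"name": "gzip", "configuration": {"level": 5}},
                ],
                "index_codecs": [
                    {"name": "bytes", "configuration": {"endian": "little"}},
                    {"name": "crc32c"},
                ],
                "index_location": "end",
            },
        }
    ],
}

print(json.dumps(metadata, indent=2))
```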

## Codecs

### Codec

A transformation applied to array data during reading and writing. Codecs are
chained into a pipeline and come in three types:

- **Array-to-array**: Transforms like transpose that rearrange array elements
- **Array-to-bytes**: Serialization that converts an array to a byte sequence
  (exactly one required)
- **Bytes-to-bytes**: Compression or checksums applied to the serialized bytes

The [sharding indexed codec](#shard) is a special array-to-bytes codec that
groups multiple [chunks](#chunk) into a single storage object.
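
The three codec stages can be mimicked with the standard library alone. The sketch below is a toy model, not zarr-python's codec pipeline: all function names are hypothetical, the "array" is a nested list of integers, and `zlib` stands in for whatever bytes-to-bytes codec an array actually uses.

```python
import struct
import zlib


def transpose(rows):
    """Array-to-array stage: swap the two axes of a 2-D nested list."""
    return [list(col) for col in zip(*rows)]


def to_bytes(rows):
    """Array-to-bytes stage: serialize as little-endian 32-bit integers."""
    flat = [value for row in rows for value in row]
    return struct.pack(f"<{len(flat)}i", *flat)


def from_bytes(data, nrows, ncols):
    """Inverse of to_bytes for a known 2-D shape."""
    flat = struct.unpack(f"<{nrows * ncols}i", data)
    return [list(flat[r * ncols:(r + 1) * ncols]) for r in range(nrows)]


def encode(chunk):
    """Run the stages in order: transpose, serialize, compress."""
    return zlib.compress(to_bytes(transpose(chunk)))


def decode(data, nrows, ncols):
    """Undo the stages in reverse order; the stored array is (ncols, nrows)."""
    stored = from_bytes(zlib.decompress(data), ncols, nrows)
    return transpose(stored)


chunk = [[1, 2, 3], [4, 5, 6]]
assert decode(encode(chunk), 2, 3) == chunk
```

Decoding applies the inverse of each stage in reverse order, which is why the byte stage must know the transposed shape.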

## API Properties

The following properties are available on [`zarr.Array`][]:

| Property | Description |
|----------|-------------|
| `.chunks` | Chunk shape — the inner chunk shape when sharding is used |
| `.shards` | Shard shape, or `None` if no sharding |
| `.nchunks` | Total number of independently compressible units across the array |
| `.cdata_shape` | Number of independently compressible units per dimension |
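
The last two rows of the table can be read as simple arithmetic on the array shape and the chunk shape. The following sketch follows the table's descriptions rather than zarr-python's actual implementation; both functions are hypothetical stand-ins, and `chunks` is assumed to be the inner chunk shape when sharding is used.

```python
import math


def cdata_shape(shape, chunks):
    """Units per dimension: ceiling division of array length by chunk length."""
    return tuple(-(-s // c) for s, c in zip(shape, chunks))


def nchunks(shape, chunks):
    """Total units across the array: product of the per-dimension counts."""
    return math.prod(cdata_shape(shape, chunks))


print(cdata_shape((250, 100), (100, 50)))  # (3, 2)
print(nchunks((250, 100), (100, 50)))      # 6
```

The first dimension needs three chunks because the final chunk covers only the remaining 50 elements.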

docs/user-guide/index.md

Lines changed: 4 additions & 0 deletions
@@ -35,6 +35,10 @@ Take your skills to the next level:
 - **[Extending](extending.md)** - Extend functionality with custom code
 - **[Consolidated Metadata](consolidated_metadata.md)** - Advanced metadata management
 
+## Reference
+
+- **[Glossary](glossary.md)** - Definitions of key terms (chunks, shards, codecs, etc.)
+
 ## Need Help?
 
 - Browse the [API Reference](../api/zarr/index.md) for detailed function documentation

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -27,6 +27,7 @@ nav:
       - user-guide/gpu.md
       - user-guide/consolidated_metadata.md
       - user-guide/experimental.md
+      - user-guide/glossary.md
   - Examples:
       - user-guide/examples/custom_dtype.md
   - API Reference:

src/zarr/core/chunk_grids.py

Lines changed: 8 additions & 5 deletions
@@ -126,11 +126,14 @@ def normalize_chunks(chunks: Any, shape: tuple[int, ...], typesize: int) -> tupl
         chunks = tuple(int(chunks) for _ in shape)
 
     # handle dask-style chunks (iterable of iterables)
-    if all(isinstance(c, (tuple | list)) for c in chunks):
-        # take first chunk size for each dimension
-        chunks = tuple(
-            c[0] for c in chunks
-        )  # TODO: check/error/warn for irregular chunks (e.g. if c[0] != c[1:-1])
+    if all(isinstance(c, (tuple, list)) for c in chunks):
+        for i, c in enumerate(chunks):
+            if any(x != y for x, y in itertools.pairwise(c[:-1])) or (len(c) > 1 and c[-1] > c[0]):
+                raise ValueError(
+                    f"Irregular chunk sizes in dimension {i}: {tuple(c)}. "
+                    "Only uniform chunks (with an optional smaller final chunk) are supported."
+                )
+        chunks = tuple(c[0] for c in chunks)
 
     # handle bad dimensionality
     if len(chunks) > len(shape):

tests/test_chunk_grids.py

Lines changed: 9 additions & 0 deletions
@@ -35,6 +35,10 @@ def test_guess_chunks(shape: tuple[int, ...], itemsize: int) -> None:
         ((30, None, None), (100, 20, 10), 1, (30, 20, 10)),
         ((30, 20, None), (100, 20, 10), 1, (30, 20, 10)),
         ((30, 20, 10), (100, 20, 10), 1, (30, 20, 10)),
+        # dask-style chunks (uniform with optional smaller final chunk)
+        (((100, 100, 100), (50, 50)), (300, 100), 1, (100, 50)),
+        (((100, 100, 50),), (250,), 1, (100,)),
+        (((100,),), (100,), 1, (100,)),
         # auto chunking
         (None, (100,), 1, (100,)),
         (-1, (100,), 1, (100,)),
@@ -52,3 +56,8 @@ def test_normalize_chunks_errors() -> None:
         normalize_chunks("foo", (100,), 1)
     with pytest.raises(ValueError):
         normalize_chunks((100, 10), (100,), 1)
+    # dask-style irregular chunks should raise
+    with pytest.raises(ValueError, match="Irregular chunk sizes"):
+        normalize_chunks(((10, 20, 30),), (60,), 1)
+    with pytest.raises(ValueError, match="Irregular chunk sizes"):
+        normalize_chunks(((100, 100), (10, 20)), (200, 30), 1)
