Commit 373910c

maxrjones, d-v-b, dstansby, dcherian, and ilan-gold authored
feat: add experimental support for rectilinear (variable-sized) chunks (#3802)
Introduces a unified `ChunkGridMetadata` model that handles both regular and rectilinear chunk layouts through a common `RegularDimension`/`VaryingDimension` abstraction. Rectilinear chunks are gated behind a feature flag (`zarr.config.set({'array.rectilinear_chunks': True})`).

Key changes:
- New `ChunkGridMetadata` replaces `RegularChunkGrid` as the internal representation, supporting both regular and rectilinear dimensions
- Rectilinear chunk grids can be created via nested sequences passed to `chunks` (e.g., `[[10, 20, 30], [50, 50]]`)
- Rectilinear sharding: shard boundaries can be rectilinear while inner chunks remain regular
- Existing arrays with regular chunk grids are read/written identically

Breaking change:
- `BaseCodec.validate()` and `CodecPipeline.validate()` now receive `ChunkGridMetadata` instead of `ChunkGrid` for the `chunk_grid` parameter

---------

Co-authored-by: Davis Bennett <davis.v.bennett@gmail.com>
Co-authored-by: David Stansby <dstansby@gmail.com>
Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>
Co-authored-by: Sam Levang <39069044+slevang@users.noreply.github.com>
1 parent ad99861 commit 373910c

38 files changed
+6014 -427 lines changed

changes/3802.feature.md

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
Add support for rectilinear (variable-sized) chunk grids. This feature is experimental and
must be explicitly enabled via ``zarr.config.set({'array.rectilinear_chunks': True})``.

Rectilinear chunks can be used through:

- **Creating arrays**: Pass nested sequences (e.g., ``[[10, 20, 30], [50, 50]]``) to ``chunks``
  in ``zarr.create_array``, ``zarr.from_array``, ``zarr.zeros``, ``zarr.ones``, ``zarr.full``,
  ``zarr.open``, and related functions, or to ``chunk_shape`` in ``zarr.create``.
- **Opening existing arrays**: Arrays stored with the ``rectilinear`` chunk grid are read
  transparently via ``zarr.open`` and ``zarr.open_array``.
- **Rectilinear sharding**: Shard boundaries can be rectilinear while inner chunks remain regular.

**Breaking change**: The ``validate`` method on ``BaseCodec`` and ``CodecPipeline`` now receives
a ``ChunkGridMetadata`` instance instead of a ``ChunkGrid`` instance for the ``chunk_grid``
parameter. Third-party codecs that override ``validate`` and inspect the chunk grid will need to
update their type annotations. No known downstream packages were using this parameter.

design/chunk-grid.md

Lines changed: 711 additions & 0 deletions
Large diffs are not rendered by default.

docs/user-guide/arrays.md

Lines changed: 165 additions & 0 deletions
@@ -611,6 +611,171 @@ In this example a shard shape of (1000, 1000) and a chunk shape of (100, 100) is
This means that `10*10` chunks are stored in each shard, and there are `10*10` shards in total.
Without the `shards` argument, there would be 10,000 chunks stored as individual files.

## Rectilinear (variable) chunk grids

!!! warning "Experimental"
    Rectilinear chunk grids are an experimental feature and may change in
    future releases. This feature is expected to stabilize in Zarr version 3.3.

Because the feature is still stabilizing, it is disabled by default and
must be explicitly enabled:

```python
import zarr
zarr.config.set({"array.rectilinear_chunks": True})
```

Alternatively, set the environment variable `ZARR_ARRAY__RECTILINEAR_CHUNKS=True`.

The examples below assume this config has been set.

By default, Zarr arrays use a regular chunk grid where every chunk along a
given dimension has the same size (except possibly the final boundary chunk).
Rectilinear chunk grids allow each chunk along a dimension to have a different
size. This is useful when the natural partitioning of the data is not uniform,
for example satellite swaths of varying width, time series with irregular
intervals, or spatial tiles of different extents.

### Creating arrays with rectilinear chunks

To create an array with rectilinear chunks, pass a nested list to the `chunks`
parameter where each inner list gives the chunk sizes along one dimension:

```python exec="true" session="arrays" source="above" result="ansi"
zarr.config.set({"array.rectilinear_chunks": True})
z = zarr.create_array(
    store=zarr.storage.MemoryStore(),
    shape=(60, 100),
    chunks=[[10, 20, 30], [50, 50]],
    dtype='int32',
)
print(z.info)
```

In this example the first dimension is split into three chunks of sizes 10, 20,
and 30, while the second dimension is split into two equal chunks of size 50.
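
The layout described above can be pictured as index ranges along each axis. The following is a minimal standard-library sketch, not Zarr's implementation; `chunk_bounds` is a hypothetical helper introduced only for illustration, and the sketch relies on the fact that in this example the chunk sizes along each axis add up to the array extent.

```python
from itertools import accumulate


def chunk_bounds(sizes):
    """Return (start, stop) index pairs for chunks of the given sizes along one axis.

    Hypothetical helper for illustration; not part of the Zarr API.
    """
    stops = list(accumulate(sizes))
    starts = [0] + stops[:-1]
    return list(zip(starts, stops))


chunks = [[10, 20, 30], [50, 50]]
shape = (60, 100)

for axis, sizes in enumerate(chunks):
    # In this example the sizes along each axis add up to the array extent.
    assert sum(sizes) == shape[axis]
    print(axis, chunk_bounds(sizes))
# 0 [(0, 10), (10, 30), (30, 60)]
# 1 [(0, 50), (50, 100)]
```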

### Reading and writing data

Rectilinear arrays support the same indexing interface as regular arrays.
Reads and writes that cross chunk boundaries of different sizes are handled
automatically:

```python exec="true" session="arrays" source="above" result="ansi"
import numpy as np
data = np.arange(60 * 100, dtype='int32').reshape(60, 100)
z[:] = data
# Read a slice that spans the first two chunks (sizes 10 and 20) along axis 0
print(z[5:25, 0:5])
```
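
To see why the slice `5:25` along axis 0 touches exactly two chunks, the lookup can be sketched with a binary search over the cumulative chunk edges. This is an illustrative sketch under the assumption that `0 <= start < stop <= sum(sizes)`; `chunks_for_slice` is a hypothetical name, and Zarr's actual indexing code may be organized differently.

```python
from bisect import bisect_right
from itertools import accumulate


def chunks_for_slice(sizes, start, stop):
    """Map the half-open range [start, stop) to (chunk_index, local_start, local_stop).

    Hypothetical helper for illustration; not part of the Zarr API.
    """
    edges = [0, *accumulate(sizes)]  # chunk i covers edges[i]:edges[i + 1]
    first = bisect_right(edges, start) - 1  # chunk containing `start`
    pieces = []
    for i in range(first, len(sizes)):
        lo, hi = edges[i], edges[i + 1]
        if lo >= stop:
            break
        # Clip the request to this chunk and express it in chunk-local indices.
        pieces.append((i, max(start, lo) - lo, min(stop, hi) - lo))
    return pieces


print(chunks_for_slice([10, 20, 30], 5, 25))
# [(0, 5, 10), (1, 0, 15)]
```

The result reads as: rows 5 to 9 of chunk 0, followed by the first 15 rows of chunk 1.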

### Inspecting chunk sizes

The `.write_chunk_sizes` property returns the actual data size of each storage
chunk along every dimension. It works for both regular and rectilinear arrays
and returns a tuple of tuples (matching the dask `Array.chunks` convention).
When sharding is used, `.read_chunk_sizes` returns the inner chunk sizes instead:

```python exec="true" session="arrays" source="above" result="ansi"
print(z.write_chunk_sizes)
```

For regular arrays, this includes the boundary chunk:

```python exec="true" session="arrays" source="above" result="ansi"
z_regular = zarr.create_array(
    store=zarr.storage.MemoryStore(),
    shape=(100, 80),
    chunks=(30, 40),
    dtype='int32',
)
print(z_regular.write_chunk_sizes)
```
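
The idea of a clipped boundary chunk can be worked through by hand for the shape and chunk shape used above. The sketch below is plain Python rather than a call into Zarr, and `regular_chunk_sizes` is a hypothetical helper; it shows the kind of tuple-of-tuples value the paragraph describes, without claiming to reproduce the library's exact output.

```python
def regular_chunk_sizes(shape, chunks):
    """Per-axis chunk sizes for a regular grid, with the last chunk clipped to the extent.

    Hypothetical helper for illustration; not part of the Zarr API.
    """
    result = []
    for extent, size in zip(shape, chunks):
        full, remainder = divmod(extent, size)
        # `full` whole chunks, plus one smaller boundary chunk if the extent is not a multiple.
        result.append((size,) * full + ((remainder,) if remainder else ()))
    return tuple(result)


print(regular_chunk_sizes((100, 80), (30, 40)))
# ((30, 30, 30, 10), (40, 40))
```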

Note that the `.chunks` property is only available for regular chunk grids. For
rectilinear arrays, use `.write_chunk_sizes` (or `.read_chunk_sizes`) instead.

### Resizing and appending

Rectilinear arrays can be resized. When growing past the current edge sum, a
new chunk is appended covering the additional extent. When shrinking, the chunk
edges are preserved and the extent is re-bound (chunks beyond the new extent
simply become inactive):

```python exec="true" session="arrays" source="above" result="ansi"
z = zarr.create_array(
    store=zarr.storage.MemoryStore(),
    shape=(30,),
    chunks=[[10, 20]],
    dtype='float64',
)
z[:] = np.arange(30, dtype='float64')
print(f"Before resize: chunk_sizes={z.write_chunk_sizes}")
z.resize((50,))
print(f"After resize: chunk_sizes={z.write_chunk_sizes}")
```

The `append` method also works with rectilinear arrays:

```python exec="true" session="arrays" source="above" result="ansi"
z.append(np.arange(10, dtype='float64'))
print(f"After append: shape={z.shape}, chunk_sizes={z.write_chunk_sizes}")
```
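
The grow and shrink rules described in this section can be modelled on a single axis as follows. This is a sketch of the rule as stated in the prose, not of Zarr's code: `resize_axis` is a hypothetical helper, and treating a chunk as "active" when it starts before the new extent is an assumption made for illustration.

```python
def resize_axis(sizes, new_extent):
    """Return (chunk edge sizes, number of active chunks) after resizing one axis.

    Hypothetical model of the rule described in the text; not part of the Zarr API.
    """
    sizes = list(sizes)
    total = sum(sizes)
    if new_extent > total:
        # Growing past the current edge sum: one new chunk covers the extra extent.
        sizes.append(new_extent - total)
    # Shrinking keeps every edge. Assumption: a chunk is active if it starts
    # before the new extent.
    active, start = 0, 0
    for size in sizes:
        if start >= new_extent:
            break
        active += 1
        start += size
    return sizes, active


print(resize_axis([10, 20], 50))      # ([10, 20, 20], 3)
print(resize_axis([10, 20, 20], 15))  # ([10, 20, 20], 2)
```

Under this model, growing from 30 to 50 adds a single chunk of size 20, while shrinking to 15 leaves all three edges in place and only the first two chunks active.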

### Compressors and filters

Rectilinear arrays work with all codecs: compressors, filters, and checksums.
Since each chunk may have a different size, the codec pipeline processes each
chunk independently:

```python exec="true" session="arrays" source="above" result="ansi"
z = zarr.create_array(
    store=zarr.storage.MemoryStore(),
    shape=(60, 100),
    chunks=[[10, 20, 30], [50, 50]],
    dtype='float64',
    filters=[zarr.codecs.TransposeCodec(order=(1, 0))],
    compressors=[zarr.codecs.BloscCodec(cname='zstd', clevel=3)],
)
z[:] = np.arange(60 * 100, dtype='float64').reshape(60, 100)
np.testing.assert_array_equal(z[:], np.arange(60 * 100, dtype='float64').reshape(60, 100))
print("Roundtrip OK")
```

### Rectilinear shard boundaries

Rectilinear chunk grids can also be used for shard boundaries when combined
with sharding. In this case, the outer grid (shards) is rectilinear while the
inner chunks remain regular. Each shard dimension must be divisible by the
corresponding inner chunk size:

```python exec="true" session="arrays" source="above" result="ansi"
z = zarr.create_array(
    store=zarr.storage.MemoryStore(),
    shape=(120, 100),
    chunks=(10, 10),
    shards=[[60, 40, 20], [50, 50]],
    dtype='int32',
)
z[:] = np.arange(120 * 100, dtype='int32').reshape(120, 100)
print(z[50:70, 40:60])
```

Note that rectilinear inner chunks with sharding are not supported; only the
shard boundaries can be rectilinear.
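
The divisibility requirement can be checked by hand before creating an array. The sketch below is a standalone illustration with a hypothetical helper name (`inner_chunks_per_shard`); it does not call into Zarr, and the error raised by the library for an invalid layout may differ.

```python
def inner_chunks_per_shard(shard_sizes, chunk_shape):
    """Check divisibility and return, per axis, how many inner chunks fit in each shard.

    Hypothetical helper for illustration; not part of the Zarr API.
    """
    counts = []
    for axis, (sizes, chunk) in enumerate(zip(shard_sizes, chunk_shape)):
        for size in sizes:
            if size % chunk != 0:
                raise ValueError(
                    f"axis {axis}: shard size {size} is not divisible by chunk size {chunk}"
                )
        counts.append([size // chunk for size in sizes])
    return counts


print(inner_chunks_per_shard([[60, 40, 20], [50, 50]], (10, 10)))
# [[6, 4, 2], [5, 5]]
```

For the example above, the first shard along axis 0 holds 6 inner chunks, the second 4, and the third 2, while each shard along axis 1 holds 5.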

### Metadata format

Rectilinear chunk grid metadata uses run-length encoding (RLE) for compact
serialization. When reading metadata, both bare integers and `[value, count]`
pairs are accepted:

- `[10, 20, 30]`: three chunks with explicit sizes
- `[[10, 3]]`: three chunks of size 10 (RLE shorthand)
- `[[10, 3], 5]`: three chunks of size 10, then one chunk of size 5

When writing, Zarr automatically compresses repeated values into RLE format.
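
The encoding described above can be sketched in a few lines of standard-library Python. This is an illustration of the format as listed here, not Zarr's serializer: `rle_expand` and `rle_compress` are hypothetical names, and writing runs of length one as bare integers is an assumption about the output form.

```python
from itertools import groupby


def rle_expand(encoded):
    """Expand a mix of bare integers and [value, count] pairs into explicit chunk sizes."""
    sizes = []
    for item in encoded:
        if isinstance(item, int):
            sizes.append(item)
        else:
            value, count = item
            sizes.extend([value] * count)
    return sizes


def rle_compress(sizes):
    """Collapse runs of equal sizes into [value, count] pairs.

    Assumption: runs of length one are emitted as bare integers.
    """
    encoded = []
    for value, run in groupby(sizes):
        count = len(list(run))
        encoded.append(value if count == 1 else [value, count])
    return encoded


print(rle_expand([[10, 3], 5]))        # [10, 10, 10, 5]
print(rle_compress([10, 10, 10, 5]))   # [[10, 3], 5]
```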

## Missing features in 3.0

The following features have not been ported to 3.0 yet.

docs/user-guide/config.md

Lines changed: 1 addition & 0 deletions
@@ -30,6 +30,7 @@ Configuration options include the following:
- Default Zarr format `default_zarr_version`
- Default array order in memory `array.order`
- Whether empty chunks are written to storage `array.write_empty_chunks`
- Enable experimental rectilinear chunks `array.rectilinear_chunks`
- Whether missing chunks are filled with the array's fill value on read `array.read_missing_chunks` (default `True`). Set to `False` to raise a [`ChunkNotFoundError`][zarr.errors.ChunkNotFoundError] instead.
- Async and threading options, e.g. `async.concurrency` and `threading.max_workers`
- Selections of implementations of codecs, codec pipelines and buffers
