Skip to content

Commit 6e7f2f3

Browse files
d-v-bclaudemaxrjones
authored
refactor: simplify internal chunk representation (#3899)
* refactor: rename guess_chunks to more clearly indicate that it guesses regular chunks * fix: use the same chunk normalization path in all cases Previously rectilinear chunk grids and regular chunk grids normalized chunks inconsistently. This change ensures that chunk specifications are always normalized by the same routines in all cases. This change also ensures that chunks=(-1, ...) consistently normalizes to a full length chunk along that axis. * refactor: use newtype pattern * docs: changelog * fix: handle 0-length arrays * test: test untested cases of chunk normalization * fix: don't accept inane input * test: check error states in normalize_chunks_1d * refactor: make resolvedchunking recursive to support nested sharding * test: add _assert_chunks_equal helper for ChunksTuple Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * perf: ChunksTuple uses 1D int64 arrays per axis Switch normalize_chunks_1d to return np.ndarray[tuple[int], np.dtype[np.int64]] instead of tuple[int, ...]. The uniform-chunks branch now constructs in O(1) via np.full, recovering the single-allocation fast path that regressed when the canonical ChunksTuple representation was introduced. Update create_chunk_grid_metadata in v3.py to convert arrays to tuples of ints before passing to is_regular_nd and RectilinearChunkGridMetadata, keeping those downstream functions' signatures unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * revert: undo out-of-scope v3.py changes from 14788aa The previous commit 14788aa was meant to only touch chunk_grids.py (Tasks 2+3 of the ChunksTuple → int64-array refactor). It also modified create_chunk_grid_metadata in v3.py — that change belongs to a later task with a different approach (widen annotations rather than materialize tuples) and a better perf profile. Restoring v3.py to its pre-14788aa state. The proper v3.py change will land in a follow-up commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: cast ChunksTuple elements to int at consumer sites in array.py Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: widen is_regular_* annotations and cast ChunksTuple in create_chunk_grid_metadata Accept ndarray[int64] in is_regular_1d/is_regular_nd alongside Sequence[int]. Cast only the first element per axis on the regular path so D ints are allocated rather than N*D. Materialize fully on the rectilinear path because _validate_chunk_shapes checks isinstance(dim_spec, int) which rejects np.int64. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: cast ChunksTuple elements to int in create_array_metadata fixture Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: rewrite ChunksTuple equality checks and add return-type characterization Use _assert_chunks_equal for the three call sites that compared a tuple of int64 arrays against a tuple of int tuples. Add three small tests asserting that normalize_chunks_1d returns a 1D int64 ndarray for uniform, explicit-list, and -1 sentinel inputs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * perf: vectorize is_regular_1d for ndarray inputs Per-element Python iteration over a 100K-element int64 array dominated create_array runtime on the (10**8,) chunks=(1000,) regression case (~6 ms in is_regular_nd, downstream of ChunksTuple normalization). Dispatch on np.ndarray and use a single vectorized comparison instead. End-to-end create_array on the regression case: ~7.9 ms -> ~0.6 ms. Promote numpy to a runtime import (was TYPE_CHECKING-only) for the isinstance dispatch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor: rename ResolvedChunking to ChunkLayout Names the structure (a layout) rather than the operation that produced it. Reads cleanly for both sharded and unsharded cases, fits the recursive inner-layout pattern, and is what one reaches for when reading the code cold. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: add informative error for invalid chunk grid parameter * chore: make tests compact * fix: typos * test: add tests for regular grid helper functions * fix: reject chunks=True * docs: update changelog * docs: remove unneeded changelog entry * test: test for explicit 0-sized chunk length rejection * chore: hoist imports * refactor: add as_regular_shape helper routine * Update src/zarr/core/chunk_grids.py Co-authored-by: Max Jones <14077947+maxrjones@users.noreply.github.com> * Update src/zarr/core/chunk_grids.py Co-authored-by: Max Jones <14077947+maxrjones@users.noreply.github.com> * Update src/zarr/core/array.py Co-authored-by: Max Jones <14077947+maxrjones@users.noreply.github.com> * chore: cleanup --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Max Jones <14077947+maxrjones@users.noreply.github.com>
1 parent f9c53d5 commit 6e7f2f3

9 files changed

Lines changed: 671 additions & 225 deletions

File tree

changes/3899.bugfix.md

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,7 @@
1+
Make chunk normalization properly handle `-1` as a compact representation of the
2+
length of an entire axis. Reject several previously-accepted but ill-defined
3+
chunk specifications: `chunks=True` (previously silently produced size-1 chunks),
4+
chunk tuples shorter than the array's number of dimensions (previously padded to
5+
the array's shape), and `None` as a per-dimension chunk size. These all now
6+
raise informative errors. Also fix chunk handling for 0-length array dimensions,
7+
and add explicit rejection of 0-length chunks.

src/zarr/core/array.py

Lines changed: 42 additions & 64 deletions
Original file line numberDiff line numberDiff line change
@@ -39,9 +39,13 @@
3939
)
4040
from zarr.core.buffer.cpu import buffer_prototype as cpu_buffer_prototype
4141
from zarr.core.chunk_grids import (
42+
SHARDED_INNER_CHUNK_MAX_BYTES,
4243
ChunkGrid,
43-
_auto_partition,
44-
normalize_chunks,
44+
_is_rectilinear_chunks,
45+
as_regular_shape,
46+
guess_chunks,
47+
normalize_chunks_nd,
48+
resolve_outer_and_inner_chunks,
4549
)
4650
from zarr.core.chunk_key_encodings import (
4751
ChunkKeyEncoding,
@@ -121,10 +125,8 @@
121125
)
122126
from zarr.core.metadata.v3 import (
123127
ChunkGridMetadata,
124-
RectilinearChunkGridMetadata,
125-
RegularChunkGridMetadata,
128+
create_chunk_grid_metadata,
126129
parse_node_type_array,
127-
resolve_chunks,
128130
)
129131
from zarr.core.sync import sync
130132
from zarr.errors import (
@@ -411,8 +413,7 @@ async def _create(
411413
if chunks is not None and chunk_shape is not None:
412414
raise ValueError("Only one of chunk_shape or chunks can be provided.")
413415

414-
from zarr.core.chunk_grids import _is_rectilinear_chunks
415-
416+
# Unify the v2 (chunks) and v3 (chunk_shape) parameter names
416417
_raw_chunks = chunks if chunks is not None else chunk_shape
417418

418419
config_parsed = parse_array_config(config)
@@ -438,7 +439,11 @@ async def _create(
438439
item_size = 1
439440
if isinstance(dtype_parsed, HasItemSize):
440441
item_size = dtype_parsed.item_size
441-
chunk_grid = resolve_chunks(_raw_chunks, shape, item_size)
442+
if _raw_chunks is None:
443+
outer_chunks = guess_chunks(shape, item_size)
444+
else:
445+
outer_chunks = normalize_chunks_nd(_raw_chunks, shape)
446+
chunk_grid = create_chunk_grid_metadata(outer_chunks)
442447
result = await cls._create_v3(
443448
store_path,
444449
shape=shape,
@@ -469,10 +474,12 @@ async def _create(
469474
item_size = 1
470475
if isinstance(dtype_parsed, HasItemSize):
471476
item_size = dtype_parsed.item_size
472-
if chunks:
473-
_chunks = normalize_chunks(chunks, shape, item_size)
477+
_raw = chunks or chunk_shape
478+
if _raw is None:
479+
outer_chunks = guess_chunks(shape, item_size)
474480
else:
475-
_chunks = normalize_chunks(chunk_shape, shape, item_size)
481+
outer_chunks = normalize_chunks_nd(_raw, shape)
482+
_chunks = as_regular_shape(outer_chunks)
476483

477484
if order is None:
478485
order_parsed = config_parsed.order
@@ -4385,6 +4392,7 @@ async def init_array(
43854392

43864393
zdtype = parse_dtype(dtype, zarr_format=zarr_format)
43874394
shape_parsed = parse_shapelike(shape)
4395+
item_size = zdtype.item_size if isinstance(zdtype, HasItemSize) else 1
43884396
chunk_key_encoding_parsed = _parse_chunk_key_encoding(
43894397
chunk_key_encoding, zarr_format=zarr_format
43904398
)
@@ -4397,12 +4405,7 @@ async def init_array(
43974405
else:
43984406
await ensure_no_existing_node(store_path, zarr_format=zarr_format)
43994407

4400-
# Detect rectilinear (nested list) chunks or shards, e.g. [[10, 20, 30], [25, 25]]
4401-
from zarr.core.chunk_grids import _is_rectilinear_chunks
4402-
4403-
rectilinear_meta: RectilinearChunkGridMetadata | None = None
4404-
rectilinear_shards = _is_rectilinear_chunks(shards)
4405-
4408+
# Validate rectilinear chunks constraints
44064409
if _is_rectilinear_chunks(chunks):
44074410
if zarr_format == 2:
44084411
raise ValueError("Zarr format 2 does not support rectilinear chunk grids.")
@@ -4412,43 +4415,29 @@ async def init_array(
44124415
"Use rectilinear shards instead: "
44134416
"chunks=(inner_size, ...), shards=[[shard_sizes], ...]"
44144417
)
4415-
rectilinear_meta = RectilinearChunkGridMetadata(
4416-
chunk_shapes=tuple(tuple(dim_edges) for dim_edges in chunks)
4418+
4419+
# Normalize the user's chunks into canonical ChunksTuple form
4420+
if chunks is None or chunks == "auto":
4421+
chunks_normalized = guess_chunks(
4422+
shape_parsed,
4423+
item_size,
4424+
max_bytes=SHARDED_INNER_CHUNK_MAX_BYTES if shards is not None else None,
44174425
)
4418-
# Use first chunk size per dim as placeholder for _auto_partition
4419-
chunks_flat: tuple[int, ...] | Literal["auto"] = tuple(dim_edges[0] for dim_edges in chunks)
44204426
else:
4421-
# Normalize scalar int to per-dimension tuple (e.g. chunks=100000 for a 1D array)
4422-
if isinstance(chunks, int):
4423-
chunks = tuple(chunks for _ in shape_parsed)
4424-
chunks_flat = cast("tuple[int, ...] | Literal['auto']", chunks)
4425-
4426-
# Handle rectilinear shards: shards=[[60, 40, 20], [50, 50]]
4427-
# means variable-sized shard boundaries with uniform inner chunks
4428-
shards_for_partition: ShardsLike | None = shards
4429-
if _is_rectilinear_chunks(shards):
4430-
if zarr_format == 2:
4431-
raise ValueError("Zarr format 2 does not support rectilinear chunk grids.")
4432-
rectilinear_meta = RectilinearChunkGridMetadata(
4433-
chunk_shapes=tuple(tuple(dim_edges) for dim_edges in shards)
4434-
)
4435-
# Use first shard size per dim as placeholder for _auto_partition
4436-
shards_for_partition = tuple(dim_edges[0] for dim_edges in shards)
4437-
4438-
item_size = 1
4439-
if isinstance(zdtype, HasItemSize):
4440-
item_size = zdtype.item_size
4427+
chunks_normalized = normalize_chunks_nd(chunks, shape_parsed)
44414428

4442-
shard_shape_parsed, chunk_shape_parsed = _auto_partition(
4429+
# Resolve chunks + shards into outer_chunks (grid metadata) and
4430+
# inner (sub-chunk structure for ShardingCodec, None if no sharding)
4431+
outer_chunks, inner = resolve_outer_and_inner_chunks(
44434432
array_shape=shape_parsed,
4444-
shard_shape=shards_for_partition,
4445-
chunk_shape=chunks_flat,
4433+
chunks=chunks_normalized,
4434+
shard_shape=shards,
44464435
item_size=item_size,
44474436
)
4448-
chunks_out: tuple[int, ...]
4437+
44494438
meta: ArrayV2Metadata | ArrayV3Metadata
44504439
if zarr_format == 2:
4451-
if shard_shape_parsed is not None:
4440+
if inner is not None:
44524441
msg = (
44534442
"Zarr format 2 arrays can only be created with `shard_shape` set to `None`. "
44544443
f"Got `shard_shape={shards}` instead."
@@ -4472,7 +4461,7 @@ async def init_array(
44724461
meta = AsyncArray._create_metadata_v2(
44734462
shape=shape_parsed,
44744463
dtype=zdtype,
4475-
chunks=chunk_shape_parsed,
4464+
chunks=as_regular_shape(outer_chunks),
44764465
dimension_separator=chunk_key_encoding_parsed.separator,
44774466
fill_value=fill_value,
44784467
order=order_parsed,
@@ -4488,40 +4477,29 @@ async def init_array(
44884477
dtype=zdtype,
44894478
)
44904479
sub_codecs = cast("tuple[Codec, ...]", (*array_array, array_bytes, *bytes_bytes))
4480+
grid = create_chunk_grid_metadata(outer_chunks)
44914481
codecs_out: tuple[Codec, ...]
4492-
if shard_shape_parsed is not None:
4482+
if inner is not None:
4483+
inner_chunks_flat = as_regular_shape(inner.outer_chunks)
44934484
index_location = None
44944485
if isinstance(shards, dict):
44954486
index_location = ShardingCodecIndexLocation(shards.get("index_location", None))
44964487
if index_location is None:
44974488
index_location = ShardingCodecIndexLocation.end
44984489
sharding_codec = ShardingCodec(
4499-
chunk_shape=chunk_shape_parsed, codecs=sub_codecs, index_location=index_location
4490+
chunk_shape=inner_chunks_flat, codecs=sub_codecs, index_location=index_location
45004491
)
4501-
# Use rectilinear grid for validation when shards are rectilinear
4502-
if rectilinear_shards and rectilinear_meta is not None:
4503-
validation_grid: ChunkGridMetadata = rectilinear_meta
4504-
else:
4505-
validation_grid = RegularChunkGridMetadata(chunk_shape=shard_shape_parsed)
45064492
sharding_codec.validate(
4507-
shape=chunk_shape_parsed,
4493+
shape=inner_chunks_flat,
45084494
dtype=zdtype,
4509-
chunk_grid=validation_grid,
4495+
chunk_grid=grid,
45104496
)
45114497
codecs_out = (sharding_codec,)
4512-
chunks_out = shard_shape_parsed
45134498
else:
4514-
chunks_out = chunk_shape_parsed
45154499
codecs_out = sub_codecs
45164500

45174501
if order is not None:
45184502
_warn_order_kwarg()
4519-
4520-
grid: ChunkGridMetadata
4521-
if rectilinear_meta is not None:
4522-
grid = rectilinear_meta
4523-
else:
4524-
grid = RegularChunkGridMetadata(chunk_shape=chunks_out)
45254503
meta = AsyncArray._create_metadata_v3(
45264504
shape=shape_parsed,
45274505
dtype=zdtype,

0 commit comments

Comments
 (0)