Commit 88ed0ac

d-v-b, claude, dependabot[bot], ilan-gold, and mkitti authored
perf: cache lexicographic chunk coords in sharding codec (zarr-developers#4012)
* perf: cache lexicographic chunk coords in sharding codec

The subchunk_write_order feature (zarr-developers#3826) regressed sharded write performance: _encode_partial_single rebuilt the full per-shard chunk coordinate grid on every write via `np.array(list(_subchunk_order_iter(..., "lexicographic")))`, and `to_dict_vectorized` rebuilt a tuple key per row with `tuple(coords.ravel())`. For a single-chunk write into a shard with tens of thousands of chunks this roughly doubled write time (~22ms -> ~40ms on test_sharded_morton_write_single_chunk, matching the -44% CodSpeed regression).

Add cached `_lexicographic_order` (array) and `_lexicographic_order_keys` (tuples) helpers in indexing.py, mirroring `_morton_order`/`_morton_order_keys`, and pass the cached keys into `to_dict_vectorized` instead of deriving them row-by-row. This restores write throughput to the pre-zarr-developers#3826 baseline while preserving identical chunk ordering (verified equal to np.ndindex across shapes including 0-d and empty).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(deps): bump the actions group across 1 directory with 8 updates (zarr-developers#176)

Bumps the actions group with 8 updates in the / directory:

| Package | From | To |
| --- | --- | --- |
| [prefix-dev/setup-pixi](https://github.com/prefix-dev/setup-pixi) | `0.9.5` | `0.9.6` |
| [codecov/codecov-action](https://github.com/codecov/codecov-action) | `6.0.0` | `6.0.1` |
| [github/issue-metrics](https://github.com/github/issue-metrics) | `4.2.2` | `4.2.7` |
| [j178/prek-action](https://github.com/j178/prek-action) | `2.0.3` | `2.0.4` |
| [actions/upload-artifact](https://github.com/actions/upload-artifact) | `7.0.0` | `7.0.1` |
| [actions/download-artifact](https://github.com/actions/download-artifact) | `7.0.0` | `8.0.1` |
| [pypa/gh-action-pypi-publish](https://github.com/pypa/gh-action-pypi-publish) | `1.13.0` | `1.14.0` |
| [zizmorcore/zizmor-action](https://github.com/zizmorcore/zizmor-action) | `0.5.3` | `0.5.6` |

Updates `prefix-dev/setup-pixi` from 0.9.5 to 0.9.6
- [Release notes](https://github.com/prefix-dev/setup-pixi/releases)
- [Commits](prefix-dev/setup-pixi@1b2de7f...5185adf)

Updates `codecov/codecov-action` from 6.0.0 to 6.0.1
- [Release notes](https://github.com/codecov/codecov-action/releases)
- [Changelog](https://github.com/codecov/codecov-action/blob/main/CHANGELOG.md)
- [Commits](codecov/codecov-action@57e3a13...e79a696)

Updates `github/issue-metrics` from 4.2.2 to 4.2.7
- [Release notes](https://github.com/github/issue-metrics/releases)
- [Commits](github-community-projects/issue-metrics@c9e9838...1e38d5e)

Updates `j178/prek-action` from 2.0.3 to 2.0.4
- [Release notes](https://github.com/j178/prek-action/releases)
- [Commits](j178/prek-action@6ad8027...bdca6f1)

Updates `actions/upload-artifact` from 7.0.0 to 7.0.1
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](actions/upload-artifact@v7...043fb46)

Updates `actions/download-artifact` from 7.0.0 to 8.0.1
- [Release notes](https://github.com/actions/download-artifact/releases)
- [Commits](actions/download-artifact@v7...3e5f45b)

Updates `pypa/gh-action-pypi-publish` from 1.13.0 to 1.14.0
- [Release notes](https://github.com/pypa/gh-action-pypi-publish/releases)
- [Commits](pypa/gh-action-pypi-publish@v1.13.0...cef2210)

Updates `zizmorcore/zizmor-action` from 0.5.3 to 0.5.6
- [Release notes](https://github.com/zizmorcore/zizmor-action/releases)
- [Commits](zizmorcore/zizmor-action@b1d7e1f...5f14fd0)

---
updated-dependencies:
- dependency-name: prefix-dev/setup-pixi
  dependency-version: 0.9.6
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: actions
- dependency-name: codecov/codecov-action
  dependency-version: 6.0.1
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: actions
- dependency-name: github/issue-metrics
  dependency-version: 4.2.7
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: actions
- dependency-name: j178/prek-action
  dependency-version: 2.0.4
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: actions
- dependency-name: actions/upload-artifact
  dependency-version: 7.0.1
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: actions
- dependency-name: actions/download-artifact
  dependency-version: 8.0.1
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: actions
- dependency-name: pypa/gh-action-pypi-publish
  dependency-version: 1.14.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: actions
- dependency-name: zizmorcore/zizmor-action
  dependency-version: 0.5.6
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: actions
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* refactor(sharding): derive coords inside to_dict_vectorized

Address review feedback: `_ShardReader.to_dict_vectorized` took the lexicographic coordinate array and key tuples as parameters, even though the reader already knows its own `chunks_per_shard` and both structures are `lru_cache`d. Thread nothing in; fetch them inside the method via `_lexicographic_order`/`_lexicographic_order_keys`. Same cache, so no perf change; the call site collapses to `to_dict_vectorized()`.

Add a unit test covering the method directly across 0-d, 1-d, and 2-d shard grids: present chunks map to their stored bytes, empty chunks to None, and every lexicographic coordinate appears as a key.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Update src/zarr/core/indexing.py

Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>

* refactor(sharding): drop redundant lexicographic_order_iter

Address review feedback from @ilan-gold and @chuckwondo on the `lexicographic_order_iter` helper. `lexicographic_order_iter` returned a *lazy* iterator over an *eagerly-built, cached* tuple (`_lexicographic_order_keys`), which chuckwondo rightly flagged as confusing, and its output is byte-for-byte identical to the pre-existing, genuinely-lazy `c_order_iter` (verified across 0-d, empty, and N-d shapes). So the name promised laziness the implementation didn't provide, over a sequence we could already produce.

Remove the wrapper and use the cached `_lexicographic_order_keys` tuple directly at the two `dict.fromkeys` call sites and in `_subchunk_order_iter`. This keeps the eager/cached coordinate tuples, which is the actual optimization: `dict.fromkeys` over the cached tuple is ~1.4x faster than over lazy `c_order_iter` at 32^3 (≈900us vs ≈1300us), because the cache amortizes tuple construction across repeated writes to same-shaped shards.
Switching to `c_order_iter` would have reintroduced that cost, so it is deliberately not used here.

Also drop the now-dead `tuple()` wrap in `morton_order_iter` (its argument is typed `tuple[int, ...]` and every caller passes one), per ilan-gold.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(indexing): prefer lexicographic_order_iter, soft-deprecate c_order_iter

`c_order_iter` names a memory layout ("C order") rather than what the iterator actually yields. Reintroduce `lexicographic_order_iter` as the clearer name for the same row-major coordinate sequence, and make `c_order_iter` a thin alias that delegates to it, with a docstring note steering new code to the preferred name. No runtime warning: these are internal helpers.

`lexicographic_order_iter` keeps the eager/cached implementation (iter over the lru_cached `_lexicographic_order_keys` tuple), which is ~1.4x faster than the old lazy `itertools.product` on the `dict.fromkeys` shard-write path and is the optimization this branch exists to deliver. The alias therefore changes `c_order_iter` from lazy to eager/cached; all in-repo callers (_ShardReader.__iter__, _is_total_shard, _subchunk_order_iter, and two tests) are migrated to `lexicographic_order_iter`, so nothing in-tree relies on the old laziness.

Output is unchanged: lexicographic_order_iter, the c_order_iter alias, and np.ndindex all agree across 0-d, empty, and N-d shapes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(indexing): make lexicographic_order_iter the lazy primitive

Per review from @mkitti: invert the relationship between the lazy iterator and the eagerly-collected tuple. `lexicographic_order_iter` is now a genuine lazy generator over the chunk-grid coordinates, and `_lexicographic_order_keys` collects it into a cached tuple; the eager version is "collect the lazy one", not the other way around.

Previously lexicographic_order_iter returned iter() over the cached tuple, so any consumer that only needed a prefix still paid to materialize the entire grid. _is_total_shard does exactly that (an early-exit `all(coord in set for coord in ...)`), and on a cold cache for a 32^3 shard whose first coordinate is absent this dropped from ~15.8ms to ~24us (the lazy generator builds one coordinate and bails).

The hot path is unchanged: the two dict.fromkeys sites consume the full grid and use the cached `_lexicographic_order_keys` tuple directly (~0.9ms at 32^3), so the regression fix this branch delivers is intact. This also resolves @chuckwondo's point: the iterator is now actually lazy rather than a thin wrapper over eager data.

Co-authored-by: Mark Kittisopikul <mkitti@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(indexing): make morton_order_iter the lazy primitive too

Per @mkitti: the morton pair was backwards in the same way the lexicographic pair was. Invert it to match: `morton_order_iter` is now the lazy generator primitive and `_morton_order_keys` collects it into a cached tuple, mirroring `lexicographic_order_iter` / `_lexicographic_order_keys`.

No behavioral change for the in-tree consumers (all fully consume the sequence) and the Z-order is identical; this keeps the two coordinate-order families symmetric and gives morton the same lazy/early-exit option lexicographic now has.

Co-authored-by: Mark Kittisopikul <mkitti@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(indexing): expose chunk-order coordinates as cached sequences

Replace the morton/lexicographic order iterators (and the c_order_iter alias) with two cached, numpy-backed sequences: `morton_order_coords(shape)` and `lexicographic_order_coords(shape)`, each returning the grid coordinates in that order as a tuple of coordinate tuples.

This addresses several points from review:

- The earlier "lazy primitive" inversion de-optimized the hot write path: `morton_order_iter` rebuilt every coordinate tuple from the array on each call, and that path runs in `_encode_shard_dict` on every shard write (~16ms/write at 32^3 chunks-per-shard). The coords are a finite set of known length reused in full, so they are an indexable sequence built once and cached, not a lazily-rebuilt generator. (per @mkitti)
- `lexicographic_order_iter` was never genuinely lazy (`_lexicographic_order` materializes the whole `np.indices` grid up front), so the early-exit framing was inaccurate. (per @Copilot, @chuckwondo)
- Two functions differing only in caching vs laziness was redundant (per @ilan-gold); there is now one sequence per order. `_ShardReader.__iter__` wraps it in `iter()`, the only site that needs an iterator.
- `_is_total_shard` no longer iterates the order at all: `all_chunk_coords` is always a subset of the shard grid (guaranteed by `validate`'s shard/chunk divisibility check), so a count check proves totality. A subset assertion documents the invariant.

Coordinates are Python int tuples because every consumer uses them as dict keys / set members, which numpy arrays cannot be (unhashable, mutable); the numpy array is kept only for the vectorized index lookup in `to_dict_vectorized`. The per-shape cache holds ~prod(chunks_per_shard) tuples (~0.07% of shard size for multi-GB shards with (64,64,64) chunks), capped at 16 shapes per order.

Co-authored-by: Mark Kittisopikul <mkitti@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(bench): add warm-cache shard-write benchmark

The existing test_sharded_morton_write_single_chunk clears the chunk-order cache before every iteration, so it only measures the cold grid-build cost.
That made it blind to a regression where the per-shard coordinate tuples were rebuilt on every write instead of being reused from the cache; the cold benchmark could not distinguish the two (both pay the build each iteration).

Add test_sharded_morton_write_single_chunk_warm_cache, which warms the cache once and then times repeated same-shape writes, the amortized regime the cache exists to optimize (many shards of one shape per array). Verified it discriminates: with the cached sequence it is ~4x faster than the cold benchmark, and a rebuild-every-write regression shows up as a ~4x slowdown here while staying invisible to the cold benchmark.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs: update changelog for full-shard write coverage

The fix caches the per-shard coordinate grid for every shard write, not only partial writes, and the win is amortized across repeated writes to same-shaped shards. Reword the note accordingly; keep it user-facing (the internal indexing helper refactor is not part of the public API).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* perf: build order-coord tuples via .tolist(); document dual representation

`morton_order_coords` / `lexicographic_order_coords` built their tuple-of-tuples with a row-by-row `tuple(int(x) for x in row)` comprehension. Using `map(tuple, arr.tolist())` instead does the int conversion in a single C-level call, producing byte-identical native-int tuples ~8-9x faster (~16ms -> ~1.9ms cold build at 32^3). It is a per-shape cached build, so this only speeds the first write to each shard shape, but it is free.

Also document in `to_dict_vectorized` why the chunk coordinates are needed in two forms (a numpy array for the vectorized index lookup and hashable tuples for the dict keys), since numpy rows are unhashable and a tuple list can't be used for the vectorized modulo/advanced-indexing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* perf,test: address code-review findings in the sharding coord cache

- Drop the O(n_chunks) assert in _is_total_shard. It built a fresh set(lexicographic_order_coords(...)) on every partial read/write to check an invariant `validate` already guarantees, regressing the very partial-access hot path this PR optimizes (~673us vs ~112ns at 32^3 chunks-per-shard) and vanishing under -O. The invariant is documented in the comment; the count check alone proves totality.
- Cache the colexicographic subchunk order. The colex branch of _subchunk_order_iter rebuilt the grid via uncached np.ndindex on every write while its morton/lexicographic siblings hit the cache; add colexicographic_order_coords (cached, derived from lexicographic_order_coords of the reversed shape) and use it.
- Fix two benchmark docstrings: the cold benchmark now clears the lexicographic caches too (the write path builds that grid via dict.fromkeys / to_dict_vectorized, so a morton-only clear left it warm and under-reported the cold cost); the warm benchmark docstring now describes what it actually exercises (repeated writes to one shard, which reuse the cache identically to writes across same-shaped shards).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>
Co-authored-by: Mark Kittisopikul <mkitti@users.noreply.github.com>
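
The `.tolist()` change described in the message above can be illustrated with a small sketch. The helper names `coords_rowwise` and `coords_tolist` are hypothetical and exist only for this illustration; the sketch checks that both constructions yield the same native-int tuples and makes no claim about the timings quoted in the message.

```python
import numpy as np


def coords_rowwise(arr):
    # Row-by-row conversion: one Python-level int() call per element.
    return tuple(tuple(int(x) for x in row) for row in arr)


def coords_tolist(arr):
    # One bulk conversion to nested lists of Python ints, then a tuple per row.
    return tuple(map(tuple, arr.tolist()))


# Lexicographic coordinates of a small (2, 3) grid, one row per chunk.
grid = np.indices((2, 3), dtype=np.intp).reshape(2, -1).T

rowwise = coords_rowwise(grid)
bulk = coords_tolist(grid)
all_native_ints = all(type(x) is int for row in bulk for x in row)
```

Both results compare equal and contain plain `int` values, so either can be used as dict keys.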
1 parent 97d781b commit 88ed0ac

6 files changed

Lines changed: 229 additions & 69 deletions

changes/4001.misc.md

Lines changed: 6 additions & 1 deletion
@@ -1 +1,6 @@
-Consolidated the array indexing test suite (`tests/test_indexing.py`): the loop-and-`np.random` based selection tests were rewritten as deterministic, parametrized `Expect`/`ExpectFail` cases on small arrays, error paths were split into their own named tests, and the two divergent `Expect` test-case dataclass pairs were unified onto the canonical one in `tests/conftest.py` (whose `ExpectFail` now has an optional regex `msg` and a `raises()` helper). Test-only change with no effect on the public API.
+Restore sharding write performance for shards with many inner chunks. The
+`subchunk_write_order` feature inadvertently rebuilt the per-shard chunk
+coordinate grid (up to tens of thousands of coordinate tuples) on every shard
+write. These coordinates are now computed once per shard shape and cached, so
+repeated writes to same-shaped shards reuse them, restoring write throughput to
+its previous level.
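
As a rough illustration of "computed once per shard shape and cached", here is a minimal standard-library sketch. `grid_coords` is a hypothetical stand-in, not the helper added in this commit; it only shows how `functools.lru_cache` keyed on the shape tuple turns repeated builds into cache hits.

```python
from functools import lru_cache
from itertools import product


@lru_cache(maxsize=16)
def grid_coords(shape: tuple[int, ...]) -> tuple[tuple[int, ...], ...]:
    # Hypothetical helper: build every coordinate of the grid once per
    # distinct shape; later calls with the same shape return the cached tuple.
    return tuple(product(*(range(n) for n in shape)))


first = grid_coords((4, 4, 4))   # cache miss: the grid is built
second = grid_coords((4, 4, 4))  # cache hit: the same tuple object comes back
info = grid_coords.cache_info()
```

Because the cached value is an immutable tuple of tuples, sharing one object between callers is safe.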

src/zarr/codecs/sharding.py

Lines changed: 44 additions & 29 deletions
@@ -46,9 +46,11 @@
     BasicIndexer,
     ChunkProjection,
     SelectorTuple,
-    c_order_iter,
+    _lexicographic_order,
+    colexicographic_order_coords,
     get_indexer,
-    morton_order_iter,
+    lexicographic_order_coords,
+    morton_order_coords,
 )
 from zarr.core.metadata.v3 import (
     ChunkGridMetadata,
@@ -261,31 +263,42 @@ def __len__(self) -> int:
         return int(self.index.offsets_and_lengths.size / 2)

     def __iter__(self) -> Iterator[tuple[int, ...]]:
-        return c_order_iter(self.index.chunks_per_shard)
+        return iter(lexicographic_order_coords(self.index.chunks_per_shard))

-    def to_dict_vectorized(
-        self,
-        chunk_coords_array: npt.NDArray[np.integer[Any]],
-    ) -> dict[tuple[int, ...], Buffer | None]:
+    def to_dict_vectorized(self) -> dict[tuple[int, ...], Buffer | None]:
         """Build a dict of chunk coordinates to buffers using vectorized lookup.

-        Parameters
-        ----------
-        chunk_coords_array : ndarray of shape (n_chunks, n_dims)
-            Array of chunk coordinates for vectorized index lookup.
+        The full per-shard chunk coordinate grid (both the array used for the
+        vectorized index lookup and the plain tuples used as dict keys) is
+        cached on `chunks_per_shard`, so neither is rebuilt on every call. For a
+        shard with tens of thousands of chunks this avoids reconstructing that
+        many tuples on every partial write.

         Returns
         -------
         dict mapping chunk coordinate tuples to Buffer or None
         """
+        chunks_per_shard = self.index.chunks_per_shard
+        # The same chunk-grid coordinates are needed in two forms, and neither can
+        # stand in for the other:
+        # - `chunk_coords_array`: an (n_chunks, n_dims) numpy array, fed to the
+        #   vectorized index lookup, which does modulo + advanced indexing on it.
+        #   A list of tuples can't be used for that without first being arrayified.
+        # - `chunk_coords_keys`: the same coordinates as hashable Python tuples,
+        #   used as the result dict's keys. numpy array rows are unhashable
+        #   (mutable), so they can't key a dict.
+        # Both are cached per shape (see indexing.py), so neither is rebuilt here;
+        # row i of the array and key i refer to the same chunk.
+        chunk_coords_array = _lexicographic_order(chunks_per_shard)
+        chunk_coords_keys = lexicographic_order_coords(chunks_per_shard)
         starts, ends, valid = self.index.get_chunk_slices_vectorized(chunk_coords_array)

         result: dict[tuple[int, ...], Buffer | None] = {}
-        for i, coords in enumerate(chunk_coords_array):
+        for i, coords in enumerate(chunk_coords_keys):
             if valid[i]:
-                result[tuple(coords.ravel())] = self.buf[int(starts[i]) : int(ends[i])]
+                result[coords] = self.buf[int(starts[i]) : int(ends[i])]
             else:
-                result[tuple(coords.ravel())] = None
+                result[coords] = None

         return result

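
The dual representation explained in the comment above can be sketched on a toy 2x2 grid. Everything here is an illustrative assumption: the `offsets` table and its `-1` "empty" marker are made up for the example and do not reflect the real shard index layout.

```python
import numpy as np

# One row per chunk, in lexicographic order: shape (n_chunks, n_dims) = (4, 2).
coords_array = np.indices((2, 2), dtype=np.intp).reshape(2, -1).T
# The same coordinates as hashable tuples; key i matches row i of the array.
coords_keys = tuple(map(tuple, coords_array.tolist()))

# Hypothetical per-chunk start offsets on the grid; -1 marks an empty chunk.
offsets = np.array([[0, -1], [10, 20]])

# Vectorized lookup: advanced indexing with the coordinate columns.
starts = offsets[coords_array[:, 0], coords_array[:, 1]]
valid = starts >= 0

result = {
    key: (int(starts[i]) if valid[i] else None)
    for i, key in enumerate(coords_keys)
}

# A numpy row cannot serve as a dict key, which is why the tuples are kept.
try:
    {coords_array[0]: None}
    row_is_hashable = True
except TypeError:
    row_is_hashable = False
```

The array drives the bulk lookup and the tuples key the result, so neither form has to be converted on the fly.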

@@ -533,13 +546,14 @@ async def _decode_partial_single(
     def _subchunk_order_iter(
         self, chunks_per_shard: tuple[int, ...], subchunk_write_order: SubchunkWriteOrder
     ) -> Iterable[tuple[int, ...]]:
+        subchunk_iter: Iterable[tuple[int, ...]]
         match subchunk_write_order:
             case "morton":
-                subchunk_iter = morton_order_iter(chunks_per_shard)
+                subchunk_iter = morton_order_coords(chunks_per_shard)
             case "lexicographic":
-                subchunk_iter = np.ndindex(chunks_per_shard)
+                subchunk_iter = lexicographic_order_coords(chunks_per_shard)
             case "colexicographic":
-                subchunk_iter = (c[::-1] for c in np.ndindex(chunks_per_shard[::-1]))
+                subchunk_iter = colexicographic_order_coords(chunks_per_shard)
             case "unordered":
                 # "unordered" promises no particular layout; today it happens to be
                 # lexicographic, but callers must not rely on that.
@@ -565,9 +579,7 @@ async def _encode_single(
                 chunk_grid=ChunkGrid.from_sizes(shard_shape, chunk_shape),
             )
         )
-        # The key order of this intermediate dict is immaterial; the physical layout is
-        # decided later by the `subchunk_write_order` loop in `_encode_shard_dict`.
-        shard_builder = dict.fromkeys(np.ndindex(chunks_per_shard))
+        shard_builder = dict.fromkeys(lexicographic_order_coords(chunks_per_shard))

         await self.codec_pipeline.write(
             [
@@ -610,19 +622,18 @@ async def _encode_partial_single(
         )

         if self._is_complete_shard_write(indexer, chunks_per_shard):
-            # Intermediate key order is immaterial (see `_encode_single`).
-            shard_dict = dict.fromkeys(np.ndindex(chunks_per_shard))
+            shard_dict = dict.fromkeys(lexicographic_order_coords(chunks_per_shard))
         else:
             shard_reader = await self._load_full_shard_maybe(
                 byte_getter=byte_setter,
                 prototype=chunk_spec.prototype,
                 chunks_per_shard=chunks_per_shard,
             )
             shard_reader = shard_reader or _ShardReader.create_empty(chunks_per_shard)
-            # Use vectorized lookup for better performance
-            shard_dict = shard_reader.to_dict_vectorized(
-                np.array(list(np.ndindex(chunks_per_shard)))
-            )
+            # Use vectorized lookup for better performance. The lexicographic
+            # coordinate array and keys are cached, so neither is rebuilt on
+            # every write.
+            shard_dict = shard_reader.to_dict_vectorized()

         await self.codec_pipeline.write(
             [
@@ -692,9 +703,13 @@ async def _encode_shard_dict(
     def _is_total_shard(
         self, all_chunk_coords: set[tuple[int, ...]], chunks_per_shard: tuple[int, ...]
     ) -> bool:
-        return len(all_chunk_coords) == product(chunks_per_shard) and all(
-            chunk_coords in all_chunk_coords for chunk_coords in c_order_iter(chunks_per_shard)
-        )
+        # `all_chunk_coords` comes from an indexer over this shard's chunk grid, so
+        # it is always a subset of that grid (`validate` requires the shard shape to
+        # be divisible by the inner chunk shape, so the indexer cannot produce an
+        # out-of-grid coordinate). A subset whose size equals the grid's is the
+        # whole grid, so the count check alone proves totality — no need to build
+        # and membership-test the full coordinate set on this hot path.
+        return len(all_chunk_coords) == product(chunks_per_shard)

     def _is_complete_shard_write(
         self,
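
The counting argument in the new `_is_total_shard` comment can be checked exhaustively on a tiny grid. This is a standalone sketch: `is_total_by_count` and `is_total_by_membership` are hypothetical names, and the equivalence holds only under the stated precondition that the input is a subset of the grid.

```python
from itertools import combinations, product
from math import prod


def is_total_by_count(coords: set, chunks_per_shard: tuple) -> bool:
    # Valid only when `coords` is already known to be a subset of the grid.
    return len(coords) == prod(chunks_per_shard)


def is_total_by_membership(coords: set, chunks_per_shard: tuple) -> bool:
    # Reference check: build the full grid and test every coordinate.
    grid = product(*(range(n) for n in chunks_per_shard))
    return all(c in coords for c in grid)


shape = (2, 2)
grid = list(product(*(range(n) for n in shape)))

# Compare both checks on every subset of the grid (16 subsets for a 2x2 grid).
agree = all(
    is_total_by_count(set(subset), shape) == is_total_by_membership(set(subset), shape)
    for r in range(len(grid) + 1)
    for subset in combinations(grid, r)
)
```

If the subset precondition were dropped, a set containing out-of-grid coordinates could pass the count check, which is why the source comment ties the shortcut to `validate`.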

src/zarr/core/indexing.py

Lines changed: 55 additions & 15 deletions
@@ -1521,19 +1521,19 @@ def decode_morton_vectorized(


 @lru_cache(maxsize=16)
-def _morton_order(chunk_shape: tuple[int, ...]) -> npt.NDArray[np.intp]:
-    n_total = product(chunk_shape)
-    n_dims = len(chunk_shape)
+def _morton_order(shape: tuple[int, ...]) -> npt.NDArray[np.intp]:
+    n_total = product(shape)
+    n_dims = len(shape)
     if n_total == 0:
         out = np.empty((0, n_dims), dtype=np.intp)
         out.flags.writeable = False
         return out

     # Ceiling hypercube: smallest power-of-2 hypercube whose Morton codes span
-    # all valid coordinates in chunk_shape. (c-1).bit_length() gives the number
+    # all valid coordinates in shape. (c-1).bit_length() gives the number
     # of bits needed to index c values (0 for singleton dims). n_z = 2**total_bits
     # is the size of this hypercube.
-    total_bits = sum((c - 1).bit_length() for c in chunk_shape)
+    total_bits = sum((c - 1).bit_length() for c in shape)
     n_z = 1 << total_bits if total_bits > 0 else 1

     # Decode all Morton codes in the ceiling hypercube, then filter to valid coords.
@@ -1544,8 +1544,8 @@ def _morton_order(chunk_shape: tuple[int, ...]) -> npt.NDArray[np.intp]:
         # Ceiling strategy: decode all n_z codes vectorized, filter in-bounds.
         # Works well when the overgeneration ratio n_z/n_total is small (≤4).
         z_values = np.arange(n_z, dtype=np.intp)
-        all_coords = decode_morton_vectorized(z_values, chunk_shape)
-        shape_arr = np.array(chunk_shape, dtype=np.intp)
+        all_coords = decode_morton_vectorized(z_values, shape)
+        shape_arr = np.array(shape, dtype=np.intp)
         valid_mask = np.all(all_coords < shape_arr, axis=1)
         order = all_coords[valid_mask]
     else:
@@ -1554,11 +1554,11 @@ def _morton_order(chunk_shape: tuple[int, ...]) -> npt.NDArray[np.intp]:
         # larger overgeneration penalty for near-miss shapes like (33,33,33).
         # Cost: O(n_total * bits) encode + O(n_total log n_total) sort,
         # vs O(n_z * bits) = O(8 * n_total * bits) for ceiling.
-        grids = np.meshgrid(*[np.arange(c, dtype=np.intp) for c in chunk_shape], indexing="ij")
+        grids = np.meshgrid(*[np.arange(c, dtype=np.intp) for c in shape], indexing="ij")
         all_coords = np.stack([g.ravel() for g in grids], axis=1)

         # Encode all coordinates to Morton codes (vectorized).
-        bits_per_dim = tuple((c - 1).bit_length() for c in chunk_shape)
+        bits_per_dim = tuple((c - 1).bit_length() for c in shape)
         max_coord_bits = max(bits_per_dim)
         z_codes = np.zeros(n_total, dtype=np.intp)
         output_bit = 0
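
The "encode, then sort" strategy named in the comments above can be sketched in pure Python. This is not the vectorized code from the diff: `morton_code` and `morton_sorted` are hypothetical helpers, and the choice that dimension 0 contributes the lowest interleaved bit is an assumption made for the illustration.

```python
from itertools import product


def morton_code(coord, bits_per_dim):
    # Interleave coordinate bits, lowest bit first, cycling through the
    # dimensions. Assumption: dimension 0 supplies the lowest output bit.
    code = 0
    output_bit = 0
    for coord_bit in range(max(bits_per_dim)):
        for dim, value in enumerate(coord):
            if coord_bit < bits_per_dim[dim]:
                code |= ((value >> coord_bit) & 1) << output_bit
                output_bit += 1
    return code


def morton_sorted(shape):
    # Requires at least one dimension. Bits needed to index each axis:
    # (c - 1).bit_length(), which is 0 for singleton axes.
    bits_per_dim = tuple((c - 1).bit_length() for c in shape)
    coords = product(*(range(c) for c in shape))
    return tuple(sorted(coords, key=lambda c: morton_code(c, bits_per_dim)))


order_2x2 = morton_sorted((2, 2))
order_2x3 = morton_sorted((2, 3))
```

Sorting by code only touches in-bounds coordinates, which is the reason the source comments prefer it for shapes that sit just above a power of two.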
@@ -1576,16 +1576,56 @@ def _morton_order(chunk_shape: tuple[int, ...]) -> npt.NDArray[np.intp]:


 @lru_cache(maxsize=16)
-def _morton_order_keys(chunk_shape: tuple[int, ...]) -> tuple[tuple[int, ...], ...]:
-    return tuple(tuple(int(x) for x in row) for row in _morton_order(chunk_shape))
+def morton_order_coords(shape: tuple[int, ...]) -> tuple[tuple[int, ...], ...]:
+    # The grid coordinates in Morton (Z) order, as a cached sequence. The
+    # coordinate set of a finite grid has a known length and is reused in full on
+    # every shard write, so it is built once (vectorized, via `_morton_order`) and
+    # cached per shape rather than recomputed. Indexable and `len`-able; iterate it
+    # directly where an iterator is needed.
+    #
+    # `.tolist()` converts the whole array to native Python ints in one C-level
+    # call; building the tuples row-by-row with `int(x)` is ~9x slower.
+    return tuple(map(tuple, _morton_order(shape).tolist()))


-def morton_order_iter(chunk_shape: tuple[int, ...]) -> Iterator[tuple[int, ...]]:
-    return iter(_morton_order_keys(tuple(chunk_shape)))
+@lru_cache(maxsize=16)
+def _lexicographic_order(shape: tuple[int, ...]) -> npt.NDArray[np.intp]:
+    # Lexicographic (C-order) coordinates, computed vectorized and cached so that
+    # the sharding codec's per-shard chunk grid is not rebuilt on every call.
+    # Equivalent to `np.array(list(np.ndindex(shape)))` but without the
+    # Python-level iteration over every coordinate.
+    n_dims = len(shape)
+    if n_dims == 0:
+        # A 0-d shard holds a single chunk addressed by the empty coordinate, so
+        # the coordinate array has one row and zero columns. np.indices(()) cannot
+        # express this, so build it directly. Matches list(np.ndindex(())) == [()].
+        order = np.empty((1, 0), dtype=np.intp)
+    else:
+        order = np.indices(shape, dtype=np.intp).reshape(n_dims, -1).T
+    order.flags.writeable = False
+    return order


-def c_order_iter(chunks_per_shard: tuple[int, ...]) -> Iterator[tuple[int, ...]]:
-    return itertools.product(*(range(x) for x in chunks_per_shard))
+@lru_cache(maxsize=16)
+def lexicographic_order_coords(shape: tuple[int, ...]) -> tuple[tuple[int, ...], ...]:
+    # The grid coordinates in lexicographic (row-major / C) order, as a cached
+    # sequence. The coordinate set of a finite grid has a known length and is
+    # reused in full on every shard write, so it is built once (vectorized, via
+    # `_lexicographic_order`) and cached per shape. Indexable and `len`-able;
+    # iterate it directly where an iterator is needed.
+    #
+    # `.tolist()` converts the whole array to native Python ints in one C-level
+    # call; building the tuples row-by-row with `int(x)` is ~9x slower.
+    return tuple(map(tuple, _lexicographic_order(shape).tolist()))
+
+
+@lru_cache(maxsize=16)
+def colexicographic_order_coords(shape: tuple[int, ...]) -> tuple[tuple[int, ...], ...]:
+    # The grid coordinates in colexicographic (column-major / F) order, as a cached
+    # sequence: the first axis varies fastest. Equivalent to reversing each axis,
+    # taking lexicographic order, and reversing the coordinates back. Cached per
+    # shape like its siblings so shard writes don't rebuild it.
+    return tuple(c[::-1] for c in lexicographic_order_coords(shape[::-1]))


 def get_indexer(
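
The relationship between the lexicographic and colexicographic orders in the hunk above can be shown with a small standard-library sketch. `lex_coords` and `colex_coords` are hypothetical, uncached stand-ins for the cached helpers; they use `itertools.product` rather than the numpy-backed build.

```python
from itertools import product


def lex_coords(shape):
    # Row-major order: the last axis varies fastest.
    return tuple(product(*(range(n) for n in shape)))


def colex_coords(shape):
    # Column-major order: the first axis varies fastest. Take lexicographic
    # order over the reversed shape, then reverse each coordinate back.
    return tuple(c[::-1] for c in lex_coords(shape[::-1]))


lex = lex_coords((2, 3))
colex = colex_coords((2, 3))
zero_dim = lex_coords(())
```

Both orders visit the same set of coordinates; only the sequence differs, and a 0-d shape yields the single empty coordinate in either order.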
