Chunk-aligned indexing on ManifestArray #994
Merged
TomNicholas merged 10 commits on May 18, 2026
Integer and slice indexers now subset a ManifestArray when they align with chunk boundaries (closes zarr-developers#51, supersedes zarr-developers#499). Misaligned selections raise SubChunkIndexingError — an IndexError subclass, not NotImplementedError, since splitting compressed chunks without loading the underlying data is a permanent constraint of a virtual array, not a missing feature. Previously slice misalignment silently no-op'd while ints raised NotImplementedError.
Sub-chunk indexing is impossible by design for a virtual array; it is neither a missing feature nor a bad index, so a ValueError subclass is a closer fit than IndexError.
The step, start, and stop alignment checks all raise the same error for the same reason, so combine them into a single conditional with one message.
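A minimal sketch of what such a combined alignment check could look like, assuming a single axis with a fixed chunk length. The function name `slice_to_chunk_slice` and the local re-definition of `SubChunkIndexingError` are hypothetical stand-ins for illustration, not the code merged in this PR; allowing a stop at the end of the axis (to cover a partial final chunk) is also an assumption.

```python
class SubChunkIndexingError(ValueError):
    """Hypothetical stand-in for the error class named in this PR."""


def slice_to_chunk_slice(sel: slice, axis_len: int, chunk_len: int) -> slice:
    """Map an element-space slice onto chunk-grid indices, or raise."""
    # Normalise None / negative bounds against the axis length.
    start, stop, step = sel.indices(axis_len)
    # One conditional, one message: every failure is the same kind of
    # misalignment with the chunk grid.
    misaligned = (
        step != 1
        or start % chunk_len != 0
        # A stop at the very end of the axis is accepted even when the
        # final chunk is partial (assumption of this sketch).
        or (stop % chunk_len != 0 and stop != axis_len)
    )
    if misaligned:
        raise SubChunkIndexingError(
            f"slice {sel} does not align with chunks of length {chunk_len}"
        )
    # Ceil-divide the stop so a partial final chunk is included.
    return slice(start // chunk_len, -(-stop // chunk_len))
```

With an axis of length 10 and chunks of length 4, `slice(4, 8)` maps to chunk slice `slice(1, 2)`, while `slice(2, 8)` raises.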
Cover the use case where ZarrParser produces a virtual dataset whose total chunk-ref count exceeds Icechunk's 50M-per-commit limit: slice along a chunk-aligned axis and write each slice with append_dim. Chunk-aligned indexing on ManifestArray (PR for zarr-developers#51) is what makes this cheap.
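As a rough illustration of the batching arithmetic behind this workflow, the sketch below computes chunk-aligned slices along one axis so that each batch stays under a chunk-ref budget. The helper `chunk_aligned_batches` and its parameters are hypothetical and not part of VirtualiZarr or Icechunk; the actual write step with append_dim is deliberately left out.

```python
from typing import Iterator


def chunk_aligned_batches(
    axis_len: int, chunk_len: int, refs_per_chunk_step: int, max_refs: int
) -> Iterator[slice]:
    """Yield slices along one axis that cover whole chunks only.

    refs_per_chunk_step is the number of chunk refs contributed by one
    chunk-step along the axis (the product of the chunk-grid sizes of all
    other axes, summed over variables). Hypothetical helper.
    """
    if refs_per_chunk_step > max_refs:
        raise ValueError("one chunk-step along this axis already exceeds max_refs")
    chunks_per_batch = max_refs // refs_per_chunk_step
    step = chunks_per_batch * chunk_len  # always a multiple of chunk_len
    for start in range(0, axis_len, step):
        # Only the last slice may stop short, at the end of the axis.
        yield slice(start, min(start + step, axis_len))


# Example: 10 elements, chunks of 2, 3 refs per chunk-step, budget of 6 refs.
batches = list(chunk_aligned_batches(10, 2, 3, 6))
# Each slice could then be passed to .isel(...) on the virtual dataset
# and written as one commit.
```

Because every start is a multiple of the chunk length, each slice is chunk-aligned and the slices together cover each chunk exactly once.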
Exercises the workflow documented in docs/scaling.md: chunk-aligned .isel on a virtual Dataset subsets the underlying ChunkManifest, misaligned splits raise SubChunkIndexingError, and iterating chunk-aligned slices covers every original ref exactly once. Includes a note that selecting a single chunk requires a length-1 slice rather than an int (since ManifestArray preserves the indexed axis).
Match numpy / array-API semantics: int indexers drop the indexed axis, slice indexers preserve it. Integer indexing remains legal only when chunk_size == 1 along that axis — picking one element of a larger chunk is still sub-chunk indexing and raises SubChunkIndexingError. Dropping happens by passing int (rather than length-1 slice) selectors into the manifest's chunk-grid arrays, then trimming shape/chunks/dimension_names to the kept axes. The all-int case collapses the manifest to 0D, so wrap the result of numpy indexing with np.asarray to keep ChunkManifest.from_arrays happy when numpy hands back a Python scalar instead of a 0D ndarray. End-to-end win: xarray.Dataset.isel(time=N) now routes through cleanly on a virtual dataset (when chunk_size == 1 along time), matching the workflow documented in docs/scaling.md.
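The numpy behaviour described above can be seen on a toy array. This is only an illustration with a made-up 2 x 3 grid standing in for one of the manifest's chunk-grid arrays; `keep_axes` is a hypothetical helper, not a VirtualiZarr function.

```python
import numpy as np

# Toy stand-in for a chunk-grid array: 2 x 3 chunks.
grid = np.arange(6).reshape(2, 3)

kept = grid[1:2, :]          # length-1 slice: both axes preserved
dropped = grid[1, :]         # int selector: first axis dropped
scalar = grid[1, 2]          # all-int selection: a NumPy scalar, not an ndarray
zero_d = np.asarray(scalar)  # wrapped back into a 0-D ndarray


def keep_axes(values: tuple, indexers: tuple) -> tuple:
    """Trim per-axis metadata to the axes that were not int-indexed."""
    return tuple(v for v, ix in zip(values, indexers) if not isinstance(ix, int))


dims = keep_axes(("time", "lat", "lon"), (0, slice(None), slice(None)))
```

The same trimming would apply to shape and chunks, so all three stay consistent with the subset chunk grid.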
data_structures.md and faq.md still implied indexing on a ManifestArray was unimplemented. Reflect the chunk-aligned int/slice semantics that now work end-to-end, and flip the isel row in the kerchunk-comparison table.
Codecov Report
❌ Patch coverage is
Additional details and impacted files

@@           Coverage Diff           @@
##             main     #994   +/-   ##
=======================================
  Coverage   89.97%   89.98%
=======================================
  Files          33       33
  Lines        2064     2066    +2
=======================================
+ Hits         1857     1859    +2
  Misses        207      207
Three narrow type fixes: collect the indexers into a list typed int | slice once np.ndarray is ruled out, broaden the dimension_names local to Any (zarr's tuple[str | None, ...] doesn't fit copy_and_replace_metadata's Iterable[str] hint but the runtime is fine), and cast np.asarray's dtype-erased return back to the manifest array types that from_arrays expects.
This was referenced May 18, 2026
TomNicholas added a commit that referenced this pull request on May 18, 2026
Summary

Implements chunk-aligned integer and slice indexing on ManifestArray, making xarray.Dataset.isel work end-to-end on virtual datasets for any chunk-aligned selection.

- Integer indexing is legal only when chunk_size == 1 along that axis; picking a single element of a larger chunk is sub-chunk indexing and raises.
- Misaligned selections raise SubChunkIndexingError, a ValueError subclass, not NotImplementedError, because splitting compressed chunks without loading bytes is a permanent constraint of a virtual array.
- Previously, slice misalignment silently no-op'd while ints raised NotImplementedError. Both paths are now real.

Implementation lives in virtualizarr/manifests/indexing.py: each 1D indexer is translated into a chunk-grid selector (int when the array axis is dropped, slice otherwise), the manifest's _paths/_offsets/_lengths arrays are subset in one shot via numpy fancy indexing, the inlined-chunk dict is filtered and re-keyed, and the metadata is rebuilt with the kept axes' shape, chunks, and dimension_names.

Documents the headline use case in docs/scaling.md under "Tips for success": parse one massive Zarr store with ZarrParser, then write to Icechunk in chunk-aligned .isel slices to stay under the 50M-chunk-refs-per-commit limit.

Supersedes #499 (the old branch had diverged 151 files from main and the indexing framework on main is a clean rewrite from #734 / #730, so I picked up the goal of #499 on a fresh branch rather than rebasing).

Test plan

- virtualizarr/tests/test_manifests/test_array.py::TestIndexing: 49 cases covering chunk-aligned int/slice, multi-dim, mixed, partial final chunk, axis dropping, and misalignment errors
- virtualizarr/tests/test_xarray.py::TestIsel: 6 cases routing through xarray's .isel: slice along chunk boundary, length-1 slice (keeps axis), integer (drops axis), integer misaligned (raises), slice misaligned (raises), iterative chunk-aligned append simulation
- test_manifests/, test_xarray.py, test_integration.py, test_writers/ (333 tests pass)

Acceptance criteria

- docs/releases.md
- docs/scaling.md, docs/data_structures.md, docs/faq.md, and the __getitem__ docstring