Skip to content

Chunk-aligned indexing on ManifestArray#994

Merged
TomNicholas merged 10 commits into
zarr-developers:mainfrom
TomNicholas:chunk_aligned_indexing_v2
May 18, 2026
Merged

Chunk-aligned indexing on ManifestArray#994
TomNicholas merged 10 commits into
zarr-developers:mainfrom
TomNicholas:chunk_aligned_indexing_v2

Conversation

@TomNicholas

Copy link
Copy Markdown
Member

Summary

Implements chunk-aligned integer and slice indexing on ManifestArray, making xarray.Dataset.isel work end-to-end on virtual datasets for any chunk-aligned selection.

  • Slice indexers preserve the axis. Multi-chunk slices, mixed integer + slice, and selections that end on a partial final chunk all work.
  • Integer indexers drop the indexed axis (numpy / array-API semantics). Only legal when chunk_size == 1 along that axis; picking a single element of a larger chunk is sub-chunk indexing and raises.
  • Misaligned selections raise the new SubChunkIndexingError — a ValueError subclass, not NotImplementedError, because splitting compressed chunks without loading bytes is a permanent constraint of a virtual array.
  • Previously slice misalignment silently no-op'd while integer indexing unconditionally raised NotImplementedError. Both paths are now real.

Implementation lives in virtualizarr/manifests/indexing.py: each 1D indexer is translated into a chunk-grid selector (int when the array axis is dropped, slice otherwise), the manifest's _paths / _offsets / _lengths arrays are subset in one shot via numpy fancy indexing, the inlined-chunk dict is filtered and re-keyed, and the metadata is rebuilt with the kept axes' shape, chunks, and dimension_names.

Documents the headline use case in docs/scaling.md under "Tips for success": parse one massive Zarr store with ZarrParser, then write to Icechunk in chunk-aligned .isel slices to stay under the 50M-chunk-refs-per-commit limit.

Supersedes #499 (the old branch had diverged 151 files from main and the indexing framework on main is a clean rewrite from #734 / #730, so I picked up the goal of #499 on a fresh branch rather than rebasing).

Test plan

  • virtualizarr/tests/test_manifests/test_array.py::TestIndexing — 49 cases covering chunk-aligned int/slice, multi-dim, mixed, partial final chunk, axis dropping, and misalignment errors
  • virtualizarr/tests/test_xarray.py::TestIsel — 6 cases routing through xarray's .isel: slice along chunk boundary, length-1-slice (keeps axis), integer (drops axis), integer misaligned (raises), slice misaligned (raises), iterative chunk-aligned append simulation
  • No regressions across test_manifests/, test_xarray.py, test_integration.py, test_writers/ (333 tests pass)
  • Pre-commit hooks clean

Acceptance criteria

  • Closes Support indexing by slicing along chunk boundaries? #51
  • Tests added
  • Tests passing
  • Full type hint coverage
  • Changes are documented in docs/releases.md
  • New functionality has documentation (docs/scaling.md, docs/data_structures.md, docs/faq.md, and the __getitem__ docstring)

Integer and slice indexers now subset a ManifestArray when they align with
chunk boundaries (closes zarr-developers#51, supersedes zarr-developers#499). Misaligned selections raise
SubChunkIndexingError — an IndexError subclass, not NotImplementedError,
since splitting compressed chunks without loading the underlying data is a
permanent constraint of a virtual array, not a missing feature. Previously
slice misalignment silently no-op'd while ints raised NotImplementedError.
Sub-chunk indexing is impossible by design for a virtual array, not a
missing-feature index error, so a ValueError subclass is a closer fit.
The step, start, and stop alignment checks all raise the same error for
the same reason, so combine them into a single conditional with one message.
Cover the use case where ZarrParser produces a virtual dataset whose total
chunk-ref count exceeds Icechunk's 50M-per-commit limit: slice along a
chunk-aligned axis and write each slice with append_dim. Chunk-aligned
indexing on ManifestArray (PR for zarr-developers#51) is what makes this cheap.
Exercises the workflow documented in docs/scaling.md: chunk-aligned .isel on
a virtual Dataset subsets the underlying ChunkManifest, misaligned splits
raise SubChunkIndexingError, and iterating chunk-aligned slices covers
every original ref exactly once. Includes a note that selecting a single
chunk requires a length-1 slice rather than an int (since ManifestArray
preserves the indexed axis).
Match numpy / array-API semantics: int indexers drop the indexed axis, slice
indexers preserve it. Integer indexing remains legal only when chunk_size == 1
along that axis — picking one element of a larger chunk is still sub-chunk
indexing and raises SubChunkIndexingError.

Dropping happens by passing int (rather than length-1 slice) selectors into
the manifest's chunk-grid arrays, then trimming shape/chunks/dimension_names
to the kept axes. The all-int case collapses the manifest to 0D, so wrap the
result of numpy indexing with np.asarray to keep ChunkManifest.from_arrays
happy when numpy hands back a Python scalar instead of a 0D ndarray.

End-to-end win: xarray.Dataset.isel(time=N) now routes through cleanly on a
virtual dataset (when chunk_size == 1 along time), matching the workflow
documented in docs/scaling.md.
data_structures.md and faq.md still implied indexing on a ManifestArray
was unimplemented. Reflect the chunk-aligned int/slice semantics that now
work end-to-end, and flip the isel row in the kerchunk-comparison table.
@codecov

codecov Bot commented May 18, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 90.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.98%. Comparing base (8d7c33a) to head (34ff775).

Files with missing lines Patch % Lines
virtualizarr/manifests/indexing.py 90.00% 2 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #994   +/-   ##
=======================================
  Coverage   89.97%   89.98%           
=======================================
  Files          33       33           
  Lines        2064     2066    +2     
=======================================
+ Hits         1857     1859    +2     
  Misses        207      207           
Files with missing lines Coverage Δ
virtualizarr/manifests/array.py 85.00% <ø> (ø)
virtualizarr/manifests/indexing.py 93.33% <90.00%> (+0.47%) ⬆️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Three narrow type fixes: collect the indexers into a list typed
int | slice once np.ndarray is ruled out, broaden the dimension_names
local to Any (zarr's tuple[str | None, ...] doesn't fit
copy_and_replace_metadata's Iterable[str] hint but the runtime is fine),
and cast np.asarray's dtype-erased return back to the manifest array
types that from_arrays expects.
@TomNicholas TomNicholas changed the title Chunk-aligned indexing on ManifestArray (closes #51) Chunk-aligned indexing on ManifestArray May 18, 2026
@TomNicholas TomNicholas merged commit e213350 into zarr-developers:main May 18, 2026
16 of 17 checks passed
TomNicholas added a commit that referenced this pull request May 18, 2026
Add a one-line summary, fill in missing PRs (#974, #997, #999) and the
missing #994 link for chunk-aligned indexing, tighten the long entries,
and set the release date.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support indexing by slicing along chunk boundaries?

1 participant