Chunk-aligned indexing on ManifestArray by TomNicholas · Pull Request #994 · zarr-developers/VirtualiZarr

TomNicholas · 2026-05-18T16:33:53Z

Summary

Implements chunk-aligned integer and slice indexing on ManifestArray, making xarray.Dataset.isel work end-to-end on virtual datasets for any chunk-aligned selection.

Slice indexers preserve the axis. Multi-chunk slices, mixed integer + slice, and selections that end on a partial final chunk all work.
Integer indexers drop the indexed axis (numpy / array-API semantics). Only legal when chunk_size == 1 along that axis; picking a single element of a larger chunk is sub-chunk indexing and raises.
Misaligned selections raise the new SubChunkIndexingError — a ValueError subclass, not NotImplementedError, because splitting compressed chunks without loading bytes is a permanent constraint of a virtual array.
Previously slice misalignment silently no-op'd while integer indexing unconditionally raised NotImplementedError. Both paths are now real.

Implementation lives in virtualizarr/manifests/indexing.py: each 1D indexer is translated into a chunk-grid selector (int when the array axis is dropped, slice otherwise), the manifest's _paths / _offsets / _lengths arrays are subset in one shot via numpy fancy indexing, the inlined-chunk dict is filtered and re-keyed, and the metadata is rebuilt with the kept axes' shape, chunks, and dimension_names.

Documents the headline use case in docs/scaling.md under "Tips for success": parse one massive Zarr store with ZarrParser, then write to Icechunk in chunk-aligned .isel slices to stay under the 50M-chunk-refs-per-commit limit.

Supersedes #499 (the old branch had diverged 151 files from main and the indexing framework on main is a clean rewrite from #734 / #730, so I picked up the goal of #499 on a fresh branch rather than rebasing).

Test plan

virtualizarr/tests/test_manifests/test_array.py::TestIndexing — 49 cases covering chunk-aligned int/slice, multi-dim, mixed, partial final chunk, axis dropping, and misalignment errors
virtualizarr/tests/test_xarray.py::TestIsel — 6 cases routing through xarray's .isel: slice along chunk boundary, length-1-slice (keeps axis), integer (drops axis), integer misaligned (raises), slice misaligned (raises), iterative chunk-aligned append simulation
No regressions across test_manifests/, test_xarray.py, test_integration.py, test_writers/ (333 tests pass)
Pre-commit hooks clean

Acceptance criteria

Closes Support indexing by slicing along chunk boundaries? #51
Tests added
Tests passing
Full type hint coverage
Changes are documented in docs/releases.md
New functionality has documentation (docs/scaling.md, docs/data_structures.md, docs/faq.md, and the __getitem__ docstring)

Integer and slice indexers now subset a ManifestArray when they align with chunk boundaries (closes zarr-developers#51, supersedes zarr-developers#499). Misaligned selections raise SubChunkIndexingError — an IndexError subclass, not NotImplementedError, since splitting compressed chunks without loading the underlying data is a permanent constraint of a virtual array, not a missing feature. Previously slice misalignment silently no-op'd while ints raised NotImplementedError.

Sub-chunk indexing is impossible by design for a virtual array, not a missing-feature index error, so a ValueError subclass is a closer fit.

The step, start, and stop alignment checks all raise the same error for the same reason, so combine them into a single conditional with one message.

Cover the use case where ZarrParser produces a virtual dataset whose total chunk-ref count exceeds Icechunk's 50M-per-commit limit: slice along a chunk-aligned axis and write each slice with append_dim. Chunk-aligned indexing on ManifestArray (PR for zarr-developers#51) is what makes this cheap.

Exercises the workflow documented in docs/scaling.md: chunk-aligned .isel on a virtual Dataset subsets the underlying ChunkManifest, misaligned splits raise SubChunkIndexingError, and iterating chunk-aligned slices covers every original ref exactly once. Includes a note that selecting a single chunk requires a length-1 slice rather than an int (since ManifestArray preserves the indexed axis).

Match numpy / array-API semantics: int indexers drop the indexed axis, slice indexers preserve it. Integer indexing remains legal only when chunk_size == 1 along that axis — picking one element of a larger chunk is still sub-chunk indexing and raises SubChunkIndexingError. Dropping happens by passing int (rather than length-1 slice) selectors into the manifest's chunk-grid arrays, then trimming shape/chunks/dimension_names to the kept axes. The all-int case collapses the manifest to 0D, so wrap the result of numpy indexing with np.asarray to keep ChunkManifest.from_arrays happy when numpy hands back a Python scalar instead of a 0D ndarray. End-to-end win: xarray.Dataset.isel(time=N) now routes through cleanly on a virtual dataset (when chunk_size == 1 along time), matching the workflow documented in docs/scaling.md.

data_structures.md and faq.md still implied indexing on a ManifestArray was unimplemented. Reflect the chunk-aligned int/slice semantics that now work end-to-end, and flip the isel row in the kerchunk-comparison table.

codecov · 2026-05-18T16:37:43Z

Codecov Report

❌ Patch coverage is 90.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.98%. Comparing base (8d7c33a) to head (34ff775).

Files with missing lines	Patch %	Lines
virtualizarr/manifests/indexing.py	90.00%	2 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #994   +/-   ##
=======================================
  Coverage   89.97%   89.98%           
=======================================
  Files          33       33           
  Lines        2064     2066    +2     
=======================================
+ Hits         1857     1859    +2     
  Misses        207      207

Files with missing lines	Coverage Δ
virtualizarr/manifests/array.py	`85.00% <ø> (ø)`
virtualizarr/manifests/indexing.py	`93.33% <90.00%> (+0.47%)`	⬆️

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Three narrow type fixes: collect the indexers into a list typed int | slice once np.ndarray is ruled out, broaden the dimension_names local to Any (zarr's tuple[str | None, ...] doesn't fit copy_and_replace_metadata's Iterable[str] hint but the runtime is fine), and cast np.asarray's dtype-erased return back to the manifest array types that from_arrays expects.

Add a one-line summary, fill in missing PRs (#974, #997, #999) and the missing #994 link for chunk-aligned indexing, tighten the long entries, and set the release date.

TomNicholas added 9 commits May 18, 2026 10:52

refactor: make SubChunkIndexingError a ValueError, not IndexError

cb18b7c

Sub-chunk indexing is impossible by design for a virtual array, not a missing-feature index error, so a ValueError subclass is a closer fit.

update __getitem__ docstring

3a3fcd7

refactor: collapse SubChunkIndexingError raises into one

dec8679

The step, start, and stop alignment checks all raise the same error for the same reason, so combine them into a single conditional with one message.

update __getitem__ docstring

0479216

docs: update outdated mentions of indexing limitations

f7626e5

data_structures.md and faq.md still implied indexing on a ManifestArray was unimplemented. Reflect the chunk-aligned int/slice semantics that now work end-to-end, and flip the isel row in the kerchunk-comparison table.

TomNicholas mentioned this pull request May 18, 2026

Chunk aligned indexing using zarr-python v3 #499

Closed

7 tasks

TomNicholas temporarily deployed to test-release May 18, 2026 16:34 — with GitHub Actions Inactive

TomNicholas temporarily deployed to test-release May 18, 2026 16:50 — with GitHub Actions Inactive

TomNicholas changed the title ~~Chunk-aligned indexing on ManifestArray (closes #51)~~ Chunk-aligned indexing on ManifestArray May 18, 2026

TomNicholas merged commit e213350 into zarr-developers:main May 18, 2026
16 of 17 checks passed

This was referenced May 18, 2026

Sub-chunk slicing for uncompressed ManifestArrays (part of #86) #996

Merged

Support indexing when it removes a dimension of length 1 #879

Closed

Release notes for v2.6.2 #1000

Merged

TomNicholas added a commit that referenced this pull request May 18, 2026

docs: prepare v2.6.2 release notes (#1000)

122d959

Add a one-line summary, fill in missing PRs (#974, #997, #999) and the missing #994 link for chunk-aligned indexing, tighten the long entries, and set the release date.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Chunk-aligned indexing on ManifestArray#994

Chunk-aligned indexing on ManifestArray#994
TomNicholas merged 10 commits into
zarr-developers:mainfrom
TomNicholas:chunk_aligned_indexing_v2

TomNicholas commented May 18, 2026

Uh oh!

codecov Bot commented May 18, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

TomNicholas commented May 18, 2026

Summary

Test plan

Acceptance criteria

Uh oh!

codecov Bot commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

codecov Bot commented May 18, 2026 •

edited

Loading