Sub-chunk slicing for uncompressed ManifestArrays (part of #86)#996
Merged
TomNicholas merged 7 commits intoMay 18, 2026
Merged
Conversation
When an array has only a BytesCodec (no compressors, no value transforms), a slice that lands within a single source chunk along axis 0 can be expressed as a rewritten byte offset and length on every surviving chunk reference — no data loaded. Picks up part of issue zarr-developers#86: now you can isel a single timestep from a multi-row chunk produced by, e.g., the netCDF3 parser without rechunking. The dispatch is deterministic: axis 0, uncompressed, slice contained in one source chunk, step == 1 → byte-shift path; everything else → existing chunk-aligned path, which raises SubChunkIndexingError for misaligned selections. step != 1 and sub-chunk indexing on other axes remain out of scope here and still raise.
Replace the hardcoded axis-0 check with a detector that reads the codec stack and returns the eligible axis (or None). [BytesCodec] gives C-order with axis 0; [TransposeCodec(order=...), BytesCodec] gives whatever axis order[0] points to in the logical array — for the reversed-order F-order case that's the last axis. The byte stride is computed commutatively as prod(chunks of the other axes) * itemsize, so the same formula works for any chosen axis. Also fix a latent no-op short-circuit: when the eligible axis happens to have a single source chunk, its chunk-grid selector slice(0, 1, 1) would equal slice(0, dim, 1) and skip the pending byte adjustment. Suppress the short-circuit when a sub-chunk adjustment is queued.
The sub-chunk branch in apply_selection had inline arithmetic mixed with the surrounding dispatch logic. Pull it out alongside the other _compute_* helpers so apply_selection's per-axis loop reads as a straight two-way choice between chunk-aligned and sub-chunk paths.
Removes the explicit isinstance assert at the call site: TypeGuard[slice] tells mypy that when the function returns True, its int|slice argument is in fact a slice. _compute_sub_chunk_axis_selection then accepts the narrowed value directly. Pure typing-level change, no runtime effect.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #996 +/- ##
==========================================
+ Coverage 89.98% 90.04% +0.06%
==========================================
Files 33 33
Lines 2066 2089 +23
==========================================
+ Hits 1859 1881 +22
- Misses 207 208 +1
🚀 New features to boost your workflow:
|
Reads a 2D netCDF3 file, .isel's within the single source chunk on axis 0, writes to an in-memory icechunk store, reads back via xr.open_zarr, and asserts byte-equality against a direct xr.open_dataset + .isel on the same file. Exercises the sub-chunk byte-adjustment path through the parser, accessor, and store layers in one go.
Three small cleanups around test_integration.py: - Move the netcdf3_file fixture from test_parsers/conftest.py to the root conftest.py and turn it into a factory that accepts a Dataset, so both the existing parser test and the new integration test can share it. - Replace pytest.mark.skipif(not has_icechunk, ...) with the existing @requires_icechunk decorator from virtualizarr.tests. - Drop @requires_zarr_python and its definition: zarr is a hard project dependency, so the decorator is a no-op import-skip that misleadingly suggests zarr is optional.
TomNicholas
commented
May 18, 2026
| has_tifffile, requires_tifffile = _importorskip("tifffile") | ||
| has_imagecodecs, requires_imagecodecs = _importorskip("imagecodecs") | ||
| has_hdf5plugin, requires_hdf5plugin = _importorskip("hdf5plugin") | ||
| has_zarr_python, requires_zarr_python = _importorskip("zarr") |
Member
Author
There was a problem hiding this comment.
Drive-by cleanup - zarr is a required dep so this is outdated
TomNicholas
commented
May 18, 2026
|
|
||
| @requires_scipy | ||
| @requires_icechunk | ||
| def test_subchunk_slice_netcdf3_through_icechunk_roundtrip( |
Member
Author
There was a problem hiding this comment.
This integration test tests the actual use case
Collaborator
|
🚀 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Lets you sub-divide a chunk by slicing an uncompressed
ManifestArrayalong its largest-stride storage axis. The result is a new chunk reference with a bumped byte offset and a smaller length — no data loaded, no rechunking.The motivating case is
xarray.Dataset.isel(time=slice(...))on a virtual dataset produced by a parser like the netCDF3 one, where a single source "chunk" covers many timesteps and the user wants to pick out a subset without splitting compressed bytes (which is impossible) or rewriting refs.Eligible codec stacks:
[BytesCodec]— C-order layout, the axis is axis 0.[TransposeCodec(order=...), BytesCodec]— stored layout is the logical array permuted byorder. The eligible axis isorder[0]of the logical array (e.g. the last axis whenorder=(N-1, ..., 0), i.e. F-order).Out of scope and still raising
SubChunkIndexingError:step != 1(would also break contiguity).Detection lives in
_uncompressed_sub_chunk_axis(metadata)which returns the eligible axis orNone. The byte-math is in_compute_sub_chunk_axis_selectionalongside the other_compute_*helpers inindexing.py.The dispatch is deterministic: a single
if axis == sub_chunk_axis and _is_sub_chunk_slice(...)choice inapply_selection's per-axis loop, falling through to the existing chunk-aligned path otherwise.Builds on #994 (chunk-aligned indexing). Addresses part of #86 —
step != 1along the eligible axis remains a follow-up.Test plan
TestSubChunkSlicingUncompressed: 4 C-order success cases (parametrized over 1D/2D shapes, single- and multi-chunk surviving along other axes), 3 raise cases (compressed, crossing chunk boundary, non-eligible axis), 2 F-order success cases viaTransposeCodec(order=(1, 0)), 1 F-order axis-0 raise case (storage's fast axis).test_manifests/,test_xarray.py,test_integration.py,test_writers/(344 passing).TypeGuard[slice]on_is_sub_chunk_sliceto drop a runtimeisinstanceassert).Acceptance criteria
step != 1along the eligible axis as a follow-up)docs/releases.mddocs/data_structures.md,__getitem__docstring)