Skip to content

Commit 9f90d2e

Browse files
authored
Sub-chunk slicing for uncompressed ManifestArrays (part of #86) (#996)
* feat: axis-0 sub-chunk slicing on uncompressed ManifestArrays When an array has only a BytesCodec (no compressors, no value transforms), a slice that lands within a single source chunk along axis 0 can be expressed as a rewritten byte offset and length on every surviving chunk reference — no data loaded. Picks up part of issue #86: now you can isel a single timestep from a multi-row chunk produced by, e.g., the netCDF3 parser without rechunking. The dispatch is deterministic: axis 0, uncompressed, slice contained in one source chunk, step == 1 → byte-shift path; everything else → existing chunk-aligned path, which raises SubChunkIndexingError for misaligned selections. step != 1 and sub-chunk indexing on other axes remain out of scope here and still raise. * feat: generalize sub-chunk slicing to any largest-stride storage axis Replace the hardcoded axis-0 check with a detector that reads the codec stack and returns the eligible axis (or None). [BytesCodec] gives C-order with axis 0; [TransposeCodec(order=...), BytesCodec] gives whatever axis order[0] points to in the logical array — for the reversed-order F-order case that's the last axis. The byte stride is computed commutatively as prod(chunks of the other axes) * itemsize, so the same formula works for any chosen axis. Also fix a latent no-op short-circuit: when the eligible axis happens to have a single source chunk, its chunk-grid selector slice(0, 1, 1) would equal slice(0, dim, 1) and skip the pending byte adjustment. Suppress the short-circuit when a sub-chunk adjustment is queued. * refactor: extract _compute_sub_chunk_axis_selection helper The sub-chunk branch in apply_selection had inline arithmetic mixed with the surrounding dispatch logic. Pull it out alongside the other _compute_* helpers so apply_selection's per-axis loop reads as a straight two-way choice between chunk-aligned and sub-chunk paths. * refactor: narrow _is_sub_chunk_slice's return type with TypeGuard Removes the explicit isinstance assert at the call site: TypeGuard[slice] tells mypy that when the function returns True, its int|slice argument is in fact a slice. _compute_sub_chunk_axis_selection then accepts the narrowed value directly. Pure typing-level change, no runtime effect. * test: end-to-end netCDF3 sub-chunk slice through icechunk Reads a 2D netCDF3 file, .isel's within the single source chunk on axis 0, writes to an in-memory icechunk store, reads back via xr.open_zarr, and asserts byte-equality against a direct xr.open_dataset + .isel on the same file. Exercises the sub-chunk byte-adjustment path through the parser, accessor, and store layers in one go. * test: tidy fixtures and decorators around the sub-chunk integration test Three small cleanups around test_integration.py: - Move the netcdf3_file fixture from test_parsers/conftest.py to the root conftest.py and turn it into a factory that accepts a Dataset, so both the existing parser test and the new integration test can share it. - Replace pytest.mark.skipif(not has_icechunk, ...) with the existing @requires_icechunk decorator from virtualizarr.tests. - Drop @requires_zarr_python and its definition: zarr is a hard project dependency, so the decorator is a no-op import-skip that misleadingly suggests zarr is optional.
1 parent abd7d0f commit 9f90d2e

10 files changed

Lines changed: 437 additions & 23 deletions

File tree

conftest.py

Lines changed: 14 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -80,6 +80,20 @@ def local_registry():
8080
return ObjectStoreRegistry({"file://": LocalStore()})
8181

8282

83+
@pytest.fixture
84+
def netcdf3_file(tmp_path: Path):
85+
"""Factory for writing a temporary netCDF3 file with a caller-supplied Dataset."""
86+
87+
def _make(ds: xr.Dataset | None = None, name: str = "file.nc") -> Path:
88+
if ds is None:
89+
ds = xr.Dataset({"foo": ("x", np.array([1, 2, 3]))})
90+
filepath = tmp_path / name
91+
ds.to_netcdf(filepath, format="NETCDF3_CLASSIC")
92+
return filepath
93+
94+
return _make
95+
96+
8397
@pytest.fixture(params=["int8", "uint8", "float32"])
8498
def zarr_array_fill_value(request):
8599
store = zarr.storage.MemoryStore()

docs/data_structures.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -246,6 +246,7 @@ The whole point is to manipulate references to the data without actually loading
246246
!!! note
247247
You can index into a `ManifestArray` as long as the selection aligns with chunk boundaries — slicing through the interior of a chunk would require loading the chunk's bytes, which a virtual array deliberately cannot do.
248248
Chunk-aligned integer and slice indexing is supported, including mixed integer + slice indexers; integer indexers drop the indexed axis as in numpy. Misaligned selections raise `SubChunkIndexingError`.
249+
As a special case, an **uncompressed** array supports slicing _within_ a single source chunk along its largest-stride storage axis — the result is a new chunk reference with a bumped byte offset and a smaller length, no data loaded. This works for `[BytesCodec]` arrays (C-order; the axis is axis 0) and `[TransposeCodec(order=...), BytesCodec]` arrays (the axis is `order[0]` — e.g. the last axis for F-order). Useful for picking out a row from a multi-row chunk produced by a parser like the netCDF3 one.
249250
Arbitrary fancy indexing (e.g. with a boolean mask or integer array) is not supported, since it would generally require loading data.
250251

251252
## Zarr Groups

docs/releases.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -6,6 +6,8 @@
66

77
- `ManifestArray` now supports chunk-aligned integer and slice indexing along each axis, including multi-chunk slices, mixed integer + slice indexers, and selections that include a partial final chunk. Integer indexers drop the indexed axis (numpy / array-API semantics) and are legal only when `chunk_size == 1` along that axis; slice indexers preserve the axis. This makes `xarray.Dataset.isel` work end-to-end on virtual datasets for any chunk-aligned selection. Indexers that would split individual chunks raise a new `SubChunkIndexingError` (a `ValueError` subclass) — a permanent constraint of a virtual array, not a missing feature. Previously slice misalignment silently no-op'd while integer indexing unconditionally raised `NotImplementedError`. Closes [#51](https://github.com/zarr-developers/VirtualiZarr/issues/51), supersedes [#499](https://github.com/zarr-developers/VirtualiZarr/pull/499).
88
By [Tom Nicholas](https://github.com/TomNicholas).
9+
- Slicing along the largest-stride storage axis of an uncompressed `ManifestArray` can now sub-divide a chunk — the result is a new reference into the same source file with a bumped byte offset and a smaller length. Eligible codec stacks are `[BytesCodec]` (C-order; the axis is axis 0) and `[TransposeCodec(order=...), BytesCodec]` (e.g. F-order with `order=(N-1, ..., 0)`, where the axis is `order[0]` — typically the last axis). Useful for picking a single timestep from a multi-row chunk produced by, e.g., the netCDF3 parser, without rechunking. Limited to slices that fit within one source chunk; multi-chunk-spanning sub-chunk slices, `step != 1`, and sub-chunk indexing on other axes still raise `SubChunkIndexingError`. Addresses part of [#86](https://github.com/zarr-developers/VirtualiZarr/issues/86).
10+
By [Tom Nicholas](https://github.com/TomNicholas).
911

1012
### Bug fixes
1113

virtualizarr/manifests/array.py

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -242,6 +242,12 @@ def __getitem__(
242242
- ``int`` — drops the indexed axis, following numpy / array-API semantics. Only legal
243243
when ``chunk_size == 1`` along that axis; otherwise picking a single element would
244244
require splitting a chunk.
245+
- Slice along the largest-stride storage axis of an **uncompressed** array that fits
246+
entirely within one source chunk — handled by rewriting the chunk reference's byte
247+
offset/length rather than splitting bytes. Useful for picking a single timestep from
248+
a multi-row chunk on a parser like the netCDF3 one. The eligible-axis is axis 0 for
249+
a plain ``[BytesCodec]`` array (C-order) or axis ``order[0]`` of a prepended
250+
``[TransposeCodec(order=...), BytesCodec]`` (e.g. the last axis for F-order).
245251
246252
Anything else — fancy indexing with arrays, misaligned slices, ``step != 1`` —
247253
raises ``SubChunkIndexingError`` or ``NotImplementedError``.

virtualizarr/manifests/indexing.py

Lines changed: 154 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,9 @@
11
from types import EllipsisType
2-
from typing import TYPE_CHECKING, Any, TypeAlias, cast
2+
from typing import TYPE_CHECKING, Any, TypeAlias, TypeGuard, cast
33

44
import numpy as np
5+
from zarr.codecs import BytesCodec, TransposeCodec
6+
from zarr.core.metadata.v3 import ArrayV3Metadata
57

68
from virtualizarr.manifests.array_api import expand_dims
79
from virtualizarr.manifests.manifest import ChunkManifest
@@ -172,33 +174,62 @@ def apply_selection(
172174
raise TypeError(f"Invalid indexer type: {indexer_1d}")
173175
narrowed_indexers.append(indexer_1d)
174176

177+
sub_chunk_axis = _uncompressed_sub_chunk_axis(marr.metadata)
178+
175179
new_shape: list[int] = []
176180
new_chunks: list[int] = []
177181
chunk_grid_selectors: list[int | slice] = []
178182
kept_axes: list[int] = []
183+
# At most one sub-chunk axis (whichever axis has the largest byte stride in storage).
184+
# The byte adjustment is uniform across every surviving chunk, since chunks share layout.
185+
sub_chunk_byte_adjust: tuple[int, int] | None = None
179186
for axis, (axis_length, chunk_size, indexer_1d) in enumerate(
180187
zip(marr.shape, marr.chunks, narrowed_indexers, strict=True)
181188
):
182-
chunk_grid_selector, new_axis_length = _compute_chunk_aligned_selection_1d(
183-
indexer_1d, axis_length=axis_length, chunk_size=chunk_size
184-
)
189+
chunk_grid_selector: int | slice
190+
if axis == sub_chunk_axis and _is_sub_chunk_slice(
191+
indexer_1d, axis_length, chunk_size
192+
):
193+
chunk_grid_selector, new_axis_length, sub_chunk_byte_adjust = (
194+
_compute_sub_chunk_axis_selection(
195+
indexer_1d,
196+
axis_length=axis_length,
197+
chunk_size=chunk_size,
198+
other_axis_chunks=tuple(
199+
c for i, c in enumerate(marr.chunks) if i != axis
200+
),
201+
itemsize=marr.dtype.itemsize,
202+
)
203+
)
204+
new_chunks_for_axis = new_axis_length
205+
else:
206+
chunk_grid_selector, new_axis_length = _compute_chunk_aligned_selection_1d(
207+
indexer_1d, axis_length=axis_length, chunk_size=chunk_size
208+
)
209+
new_chunks_for_axis = chunk_size
210+
185211
chunk_grid_selectors.append(chunk_grid_selector)
186-
# int selectors drop the axis from the output array
212+
# int indexers drop the axis from the output array; slices preserve it (including the
213+
# sub-chunk path, which uses a length-1 chunk-grid slice selector).
187214
if not isinstance(indexer_1d, int):
188215
new_shape.append(new_axis_length)
189-
new_chunks.append(chunk_size)
216+
new_chunks.append(new_chunks_for_axis)
190217
kept_axes.append(axis)
191218

192219
chunk_grid_selectors_tuple = tuple(chunk_grid_selectors)
193220

194-
# short-circuit if every axis selects the whole chunk grid via a slice (a no-op)
195-
if all(
221+
# short-circuit if every axis selects the whole chunk grid via a slice (a no-op).
222+
# A pending sub-chunk byte adjustment is real work even if its single source chunk
223+
# happens to span the whole chunk grid along that axis, so don't short-circuit then.
224+
if sub_chunk_byte_adjust is None and all(
196225
isinstance(cgs, slice) and cgs == slice(0, dim, 1)
197226
for cgs, dim in zip(chunk_grid_selectors_tuple, marr.manifest.shape_chunk_grid)
198227
):
199228
return marr
200229

201230
new_manifest = _subset_manifest(marr.manifest, chunk_grid_selectors_tuple)
231+
if sub_chunk_byte_adjust is not None:
232+
new_manifest = _shift_manifest_byte_ranges(new_manifest, *sub_chunk_byte_adjust)
202233
old_dimension_names = marr.metadata.dimension_names
203234
# zarr's dimension_names is tuple[str | None, ...] but copy_and_replace_metadata's
204235
# type hint says Iterable[str]; the runtime handles None entries fine, so cast through.
@@ -265,6 +296,38 @@ def _compute_chunk_aligned_selection_1d(
265296
return slice(chunk_start, chunk_stop, 1), stop - start
266297

267298

299+
def _compute_sub_chunk_axis_selection(
300+
indexer_1d: slice,
301+
axis_length: int,
302+
chunk_size: int,
303+
other_axis_chunks: tuple[int, ...],
304+
itemsize: int,
305+
) -> tuple[slice, int, tuple[int, int]]:
306+
"""
307+
Translate a sub-chunk slice along the eligible (largest-stride) storage axis into a
308+
chunk-grid selector, an output axis length, and a uniform byte adjustment
309+
``(offset_delta, new_chunk_byte_length)`` applied to every surviving chunk reference.
310+
311+
Callers must have already confirmed that this slice is sub-chunk-eligible via
312+
``_is_sub_chunk_slice`` and that the array is uncompressed via
313+
``_uncompressed_sub_chunk_axis``.
314+
"""
315+
start, stop, _ = indexer_1d.indices(axis_length)
316+
chunk_index = start // chunk_size
317+
new_axis_length = stop - start
318+
# Bytes per index step along this axis within one chunk is the product of every
319+
# *other* axis's chunk size, times itemsize. Order doesn't matter since the product
320+
# is commutative.
321+
stride_bytes = int(np.prod(other_axis_chunks)) * itemsize
322+
inner_offset_bytes = (start - chunk_index * chunk_size) * stride_bytes
323+
sub_chunk_byte_adjust = (inner_offset_bytes, new_axis_length * stride_bytes)
324+
return (
325+
slice(chunk_index, chunk_index + 1, 1),
326+
new_axis_length,
327+
sub_chunk_byte_adjust,
328+
)
329+
330+
268331
def _subset_manifest(
269332
manifest: ChunkManifest, chunk_grid_selectors: tuple[int | slice, ...]
270333
) -> ChunkManifest:
@@ -323,3 +386,86 @@ def _subset_manifest(
323386
inlined=new_inlined,
324387
validate_paths=False,
325388
)
389+
390+
391+
def _uncompressed_sub_chunk_axis(metadata: ArrayV3Metadata) -> int | None:
392+
"""
393+
Return the axis along which sub-chunk slicing is implementable for this array, or
394+
``None`` if the codec stack disqualifies it.
395+
396+
Sub-chunk slicing rewrites an existing chunk reference's byte offset and length,
397+
so it only works when chunk bytes are raw element values in a fixed memory order —
398+
i.e., no compression, no value transforms, no checksums. The eligible codec stacks
399+
are:
400+
401+
- ``[BytesCodec]`` — C-order layout; the axis with the largest byte stride is axis 0.
402+
- ``[TransposeCodec(order=perm), BytesCodec]`` — stored layout is the logical array
403+
permuted by ``perm``; the axis with the largest byte stride in storage is logical
404+
axis ``perm[0]``. For the F-order case ``perm = (n-1, n-2, ..., 0)`` this picks out
405+
the last axis.
406+
"""
407+
codecs = metadata.codecs
408+
if len(codecs) == 1 and isinstance(codecs[0], BytesCodec):
409+
return 0
410+
if (
411+
len(codecs) == 2
412+
and isinstance(codecs[0], TransposeCodec)
413+
and isinstance(codecs[1], BytesCodec)
414+
):
415+
return int(codecs[0].order[0])
416+
return None
417+
418+
419+
def _is_sub_chunk_slice(
420+
indexer_1d: int | slice, axis_length: int, chunk_size: int
421+
) -> TypeGuard[slice]:
422+
"""
423+
True iff this is a slice that should take the sub-chunk path: step == 1, non-empty,
424+
fits entirely within one source chunk, and is NOT already chunk-aligned (chunk-aligned
425+
slices go through the simpler aligned path).
426+
427+
Typed as ``TypeGuard[slice]`` so callers can pass the narrowed indexer straight into
428+
helpers that take a ``slice``.
429+
"""
430+
if not isinstance(indexer_1d, slice):
431+
return False
432+
start, stop, step = indexer_1d.indices(axis_length)
433+
if step != 1 or start >= stop:
434+
return False
435+
# chunk-aligned slices are handled by _compute_chunk_aligned_selection_1d
436+
aligned = start % chunk_size == 0 and (
437+
stop == axis_length or stop % chunk_size == 0
438+
)
439+
if aligned:
440+
return False
441+
# contained in a single source chunk?
442+
return start // chunk_size == (stop - 1) // chunk_size
443+
444+
445+
def _shift_manifest_byte_ranges(
446+
manifest: ChunkManifest, offset_delta: int, new_length: int
447+
) -> ChunkManifest:
448+
"""
449+
Return a new ``ChunkManifest`` whose virtual chunk references point to a uniform
450+
sub-range of each original chunk: ``offset += offset_delta`` and ``length = new_length``.
451+
452+
Used by the uncompressed-axis-0 sub-chunk path, where every surviving chunk shares the
453+
same byte layout and therefore the same byte adjustment.
454+
"""
455+
new_offsets = cast(
456+
"np.ndarray[Any, np.dtype[np.uint64]]",
457+
manifest._offsets + np.uint64(offset_delta),
458+
)
459+
new_lengths = cast(
460+
"np.ndarray[Any, np.dtype[np.uint64]]",
461+
np.full_like(manifest._lengths, np.uint64(new_length)),
462+
)
463+
# paths and any inlined-chunk dict carry through unchanged: inlined chunks aren't
464+
# involved here (this path is only taken for uncompressed virtual references).
465+
return ChunkManifest.from_arrays(
466+
paths=manifest._paths,
467+
offsets=new_offsets,
468+
lengths=new_lengths,
469+
inlined=dict(manifest._inlined),
470+
validate_paths=False,
471+
)

virtualizarr/tests/__init__.py

Lines changed: 0 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -38,7 +38,6 @@ def _importorskip(
3838
has_tifffile, requires_tifffile = _importorskip("tifffile")
3939
has_imagecodecs, requires_imagecodecs = _importorskip("imagecodecs")
4040
has_hdf5plugin, requires_hdf5plugin = _importorskip("hdf5plugin")
41-
has_zarr_python, requires_zarr_python = _importorskip("zarr")
4241
has_dask, requires_dask = _importorskip("dask")
4342
has_obstore, requires_obstore = _importorskip("obstore")
4443
has_tiff, requires_tiff = _importorskip("virtual_tiff")

virtualizarr/tests/test_integration.py

Lines changed: 49 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -18,15 +18,16 @@
1818
ManifestStore,
1919
)
2020
from virtualizarr.manifests.utils import create_v3_array_metadata
21-
from virtualizarr.parsers import HDFParser, ZarrParser
21+
from virtualizarr.parsers import HDFParser, NetCDF3Parser, ZarrParser
2222
from virtualizarr.parsers.kerchunk.translator import manifestgroup_from_kerchunk_refs
2323
from virtualizarr.tests import (
2424
has_fastparquet,
2525
has_icechunk,
2626
has_kerchunk,
27+
requires_icechunk,
2728
requires_kerchunk,
2829
requires_network,
29-
requires_zarr_python,
30+
requires_scipy,
3031
slow_test,
3132
)
3233
from virtualizarr.tests.utils import PYTEST_TMP_DIRECTORY_URL_PREFIX
@@ -179,7 +180,6 @@ def roundtrip_as_in_memory_icechunk(
179180
)
180181

181182

182-
@requires_zarr_python
183183
@pytest.mark.parametrize(
184184
"roundtrip_func",
185185
[
@@ -551,3 +551,49 @@ def test_roundtrip_dataset_with_multiple_compressors():
551551
) as observed,
552552
):
553553
xr.testing.assert_allclose(expected, observed)
554+
555+
556+
@requires_scipy
557+
@requires_icechunk
558+
def test_subchunk_slice_netcdf3_through_icechunk_roundtrip(
559+
netcdf3_file, local_registry
560+
):
561+
# End-to-end check of the uncompressed sub-chunk slicing path: parse a netCDF3
562+
# file (whose variables become single multi-row chunks), .isel within a chunk,
563+
# write to icechunk, read back, and confirm bytes match a direct netCDF3 read +
564+
# slice on the same file.
565+
data = np.arange(32, dtype=np.float64).reshape(8, 4)
566+
nc_path = netcdf3_file(xr.Dataset({"foo": (["time", "x"], data)}))
567+
nc_url = f"file://{nc_path}"
568+
569+
with open_virtual_dataset(
570+
url=nc_url, parser=NetCDF3Parser(), registry=local_registry
571+
) as vds:
572+
# netCDF3 puts all rows in one source chunk; this slice is sub-chunk on axis 0.
573+
sliced_vds = vds.isel(time=slice(1, 3))
574+
575+
storage = icechunk.Storage.new_in_memory()
576+
config = icechunk.RepositoryConfig.default()
577+
container = icechunk.VirtualChunkContainer(
578+
url_prefix=PYTEST_TMP_DIRECTORY_URL_PREFIX,
579+
store=icechunk.local_filesystem_store(PYTEST_TMP_DIRECTORY_URL_PREFIX),
580+
)
581+
config.set_virtual_chunk_container(container)
582+
repo = icechunk.Repository.create(
583+
storage=storage,
584+
config=config,
585+
authorize_virtual_chunk_access={PYTEST_TMP_DIRECTORY_URL_PREFIX: None},
586+
)
587+
session = repo.writable_session("main")
588+
sliced_vds.vz.to_icechunk(session.store)
589+
session.commit("sub-chunk slice")
590+
591+
read_session = repo.readonly_session("main")
592+
with (
593+
xr.open_zarr(
594+
read_session.store, zarr_format=3, consolidated=False
595+
) as roundtripped,
596+
xr.open_dataset(nc_path) as direct,
597+
):
598+
expected = direct.isel(time=slice(1, 3))
599+
xrt.assert_identical(roundtripped.load(), expected.load())

0 commit comments

Comments
 (0)