
feat: generalize Zarr I/O in spatialdata to local and add remote (fsspec / UPath) backends #1107

Draft

selmanozleyen wants to merge 54 commits into scverse:main from selmanozleyen:feat/zarr-store-class

Conversation


@selmanozleyen commented Apr 17, 2026

(disclaimer: prettified with AI)

This PR generalizes Zarr I/O in spatialdata so local and remote
(fsspec / UPath) backends are treated uniformly, and tightens the write
path to be safe against accidental clobbering of remote data.

Main changes

  • ZarrStore abstraction (src/spatialdata/_store.py): a dataclass
    wrapping a UPath with its embedded fsspec filesystem, plus
    make_zarr_store / make_zarr_store_from_group constructors and
    open_read_store / open_write_store context managers. All Zarr opens
    go through these.
  • Read-only plumbing: _resolve_zarr_store(..., read_only=...)
    forwards the flag to every LocalStore / FsspecStore construction
    site, so readers cannot accidentally write.
  • Arrow-filesystem I/O for points / shapes Parquet (io_points.py,
    io_shapes.py): read_parquet / to_parquet now take
    filesystem=arrow_fs derived from the store's fsspec fs. Includes a
    fix for a to_parquet categorical-schema regression introduced by the
    handoff (demote known=[] categoricals back to unknown in
    _read_points so write_points's as_known() can unify categories
    across partitions).
  • Overwrite semantics for remote targets: top-level SpatialData.write
    to a remote UPath (i.e. not PosixUPath / WindowsUPath) requires
    overwrite=True, because we cannot reliably probe existence on
    arbitrary remotes. Local paths and internal per-element writes are
    unaffected.
  • zarr v3 cleanup: removed dead ConsolidatedMetadataStore branches
    that were unreachable on the zarr>=3.0.0 pin.
  • Dask task-graph adapter: _extract_parquet_paths_from_task
    restores get_dask_backing_files for the dask-expression API
    introduced by PR #1006 ("unpinning dask")
    (FragmentWrapper.fragment.path) while still tolerating the legacy
    {"piece": (file, None, None)} shape; the matching key filter is
    broadened to "parquet" in name.lower() to catch fused-expression keys
    such as readparquetpyarrowfs-fused-values-*.
  • Legacy branches are tagged, not hidden: every getattr branch that
    exists only to tolerate older dask/zarr shapes carries a TODO(...)
    marker with an explicit retirement condition
    (TODO(legacy-dask), TODO(dask-task-api), TODO(zarr-v3-store-path),
    TODO(async-pyarrow-fs)).
  • Abstraction stress tests (tests/io/test_store_abstractions.py):
    memory-fsspec round-trip of every element type, HTTP-like no-listing
    filesystem, tamper-evident read, and write-side consolidated-metadata
    coverage. Plus unit tests in tests/io/test_store.py for
    _resolve_zarr_store and ZarrStore construction.
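The value-object idea behind the first two bullets can be sketched in a few lines. This is an illustrative stdlib-only mock, not the actual spatialdata code: a dict stands in for the fsspec filesystem, and the names (ZarrStoreSketch, write_key) are hypothetical; the real ZarrStore wraps a UPath and real Zarr stores.

```python
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class ZarrStoreSketch:
    """Value object pairing a root path with the filesystem that owns it."""

    path: str
    fs: dict  # stand-in for an fsspec filesystem
    read_only: bool = True


@contextmanager
def open_read_store(path: str, fs: dict) -> Iterator[ZarrStoreSketch]:
    # Readers always get a read-only handle, so they cannot accidentally write.
    yield ZarrStoreSketch(path=path, fs=fs, read_only=True)


@contextmanager
def open_write_store(path: str, fs: dict) -> Iterator[ZarrStoreSketch]:
    yield ZarrStoreSketch(path=path, fs=fs, read_only=False)


def write_key(store: ZarrStoreSketch, key: str, data: bytes) -> None:
    # Every write goes through the store object, so the read_only flag
    # is enforced at a single choke point.
    if store.read_only:
        raise PermissionError(f"store at {store.path!r} is read-only")
    store.fs[f"{store.path}/{key}"] = data
```

Because every reader and writer takes this one object, the read-only flag only has to be set at construction time rather than at each call site.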

Relation to prior work

This branch was forked off
SamirMoustafa:cloud-storage-support,
which introduced the initial plumbing for remote storage and a dedicated
CI job for remote-storage tests. Building on that, this PR:

  • Kept: the motivation of supporting UPath-addressable remote backends
    end-to-end.
  • Refactored: the ad-hoc path / storage_options threading into a
    single ZarrStore value object, so every reader and writer takes one
    argument instead of reconstructing filesystems downstream.
  • Replaced: the provider-specific (S3 / Azure) test scaffolding with
    provider-agnostic fsspec.filesystem("memory") stress tests. No moto /
    Azurite needed; the abstractions are exercised against a synthetic
    remote that mimics the awkward parts of cloud backends (no directory
    listing, read-only enforcement, byte-level tamper detection).
  • Added: the read-only store mode, the remote-overwrite guard, the
    categorical fix, and the dask task-graph adapter.

The abstraction-stress tests here cover the same correctness invariants
that #971 targeted on the remote side, without requiring cloud credentials
or emulators. Real-network integration can be a follow-up behind a
--run-remote marker.
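To make "a synthetic remote that mimics the awkward parts of cloud backends" concrete, here is a minimal stdlib-only sketch of the idea (not the actual test fixture; NoListingFS and its method names merely echo the fsspec verbs cat_file/pipe_file/ls): byte reads work, directory listing does not, so any code path that relies on listing fails loudly.

```python
class NoListingFS:
    """Stand-in for an HTTP-like remote: byte reads work, listing does not."""

    def __init__(self) -> None:
        self._blobs: dict = {}

    def cat_file(self, key: str) -> bytes:
        # Direct key fetch, the one operation every remote supports.
        try:
            return self._blobs[key]
        except KeyError:
            raise FileNotFoundError(key) from None

    def pipe_file(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def ls(self, prefix: str):
        # Plain HTTP(S) endpoints cannot enumerate keys; code that assumes
        # directory listing will surface immediately against this backend.
        raise NotImplementedError("this backend cannot list directories")
```

Running the round-trip tests against such a filesystem exercises the same invariants as a real bucket, without moto or Azurite.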

Known gap left deliberately open

Not hidden behind @pytest.mark.xfail -- I would rather have reviewer
input on where to wire the fix than merge it silently.

Consolidated metadata is written but not consumed on read.
tests/io/test_store_abstractions.py::TestConsolidatedMetadataOnRead::test_read_zarr_opens_via_consolidated_metadata
fails its GET budget (saw 120 GETs, expected fewer than 10): reading a single-image SpatialData from a
memory UPath currently issues ~120 small cat_file calls because
read_zarr opens the root with zarr.open_group(store, mode="r") and
does not pass use_consolidated=True. This is a latency regression
over remote stores only; correctness is unaffected. Suggestions on
whether the fix should live in read_zarr itself, in open_read_store,
or be opt-in via a parameter are very welcome.
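Why consolidated metadata matters here can be shown with a stdlib-only sketch (illustrative names; CountingStore is not part of spatialdata): piecemeal reads cost one GET per metadata document, while a consolidated .zmetadata blob is fetched in a single GET.

```python
import json


class CountingStore:
    """Dict-backed store that counts individual GETs, mimicking remote round-trips."""

    def __init__(self, blobs: dict) -> None:
        self.blobs = blobs
        self.gets = 0

    def get(self, key: str) -> bytes:
        self.gets += 1
        return self.blobs[key]


def read_metadata_piecemeal(store: CountingStore, keys: list) -> dict:
    # One GET per metadata document: fine on a local disk, slow over a network.
    return {k: store.get(k) for k in keys}


def read_metadata_consolidated(store: CountingStore) -> dict:
    # A single GET fetches every metadata document at once.
    return json.loads(store.get(".zmetadata"))["metadata"]
```

With 120 metadata documents, the piecemeal path issues 120 GETs and the consolidated path issues one, which is exactly the shape of the regression the failing test guards against.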

The previously listed "Dask Task-iteration" gap (introduced by commit
53b9438a in PR #1006) is now fixed in this PR via the task-graph adapter
above.

Test plan

  • pytest tests/io/test_store.py tests/io/test_store_abstractions.py
    (expected: 18 passed, 1 failed -- the consolidated-metadata
    invariant noted above).
  • pytest tests/io/test_readwrite.py vs main (expected: matches
    baseline, no new regressions; the Task-iteration failures present on
    main are resolved here).
  • Manual round-trip with a memory:// UPath using SpatialData.write
    / SpatialData.read.
  • grep -n "TODO(" src/spatialdata/_store.py src/spatialdata/_io/_utils.py
    lists four markers, each with a documented retirement condition.

cc: @melonora @SamirMoustafa

SamirMoustafa and others added 30 commits February 28, 2026 02:13
Patch da.to_zarr so ome_zarr's **kwargs are forwarded as zarr_array_kwargs,
avoiding FutureWarning and keeping behavior correct.
- _FsspecStoreRoot, _get_store_root for path-like store roots (local + fsspec)
- _storage_options_from_fs for parquet writes to Azure/S3/GCS
- _remote_zarr_store_exists, _ensure_async_fs for UPath/FsspecStore
- Extend _resolve_zarr_store for UPath and _FsspecStoreRoot with async fs
- _backed_elements_contained_in_path, _is_element_self_contained accept UPath
- path and _path accept Path | UPath; setter allows UPath
- write() accepts file_path: str | Path | UPath | None (None uses path)
- _validate_can_safely_write_to_path handles UPath and remote store existence
- _write_element accepts Path | UPath; skip local subfolder checks for UPath
- __repr__ and _get_groups_for_element use path without forcing Path()
…table, zarr

- Resolve store via _resolve_zarr_store in read paths (points, shapes, raster, table)
- Use _get_store_root for parquet paths; read/write parquet with storage_options for fsspec
- io_shapes: upload parquet to Azure/S3/GCS via temp file when path is _FsspecStoreRoot
- io_zarr: _get_store_root, UPath in _get_groups_for_element and _write_consolidated_metadata; set sdata.path to UPath when store is remote
- pyproject.toml: adlfs, gcsfs, moto[server], pytest-timeout in test extras
- Dockerfile.emulators: moto, Azurite, fake-gcs-server for tests/io/remote_storage/
… emulator config

- full_sdata fixture: two regions for table categorical (avoids 404 on remote read)
- tests/io/remote_storage/conftest.py: bucket/container creation, resilient async shutdown
- tests/io/remote_storage/test_remote_storage.py: parametrized Azure/S3/GCS roundtrip and write tests
- Added "dimension_separator" to the frozenset of internal keys that should not be passed to zarr.Group.create_array(), ensuring compatibility with various zarr versions.
- Updated test to set region labels for full_sdata table, allowing the test_set_table_annotates_spatialelement to succeed without errors.
- Updated the `test_subset` function to exclude labels and poly from the default table, ensuring accurate subset validation.
- Enhanced `test_validate_table_in_spatialdata` to assert that both regions (labels2d and poly) are correctly annotated in the table.
- Adjusted `test_labels_table_joins` to restrict the table to labels2d, ensuring the join returns the expected results.
…inux

- Added steps to build and run storage emulators (S3, Azure, GCS) using Docker, specifically for the Ubuntu environment.
- Implemented a wait mechanism to ensure emulators are ready before running tests.
- Adjusted test execution to skip remote storage tests on non-Linux platforms.
- Wrapped the fsspec async sync function to prevent RuntimeError "Loop is not running" during process exit when using remote storage (Azure, S3, GCS).
- Ensured compatibility with async session management in the _utils module.
…cols and improving storage options handling for parquet files.
… or UPath, and add tests to verify correct coercion of string paths to appropriate types.
…g for unsupported protocols in storage options, and add test cases to validate new functionality and ensure compatibility with cloud object store protocols.
@selmanozleyen force-pushed the feat/zarr-store-class branch from f105f5f to e6fba59 on April 17, 2026 12:07
@selmanozleyen marked this pull request as draft on April 17, 2026 12:13
@selmanozleyen changed the title from "feat: generalize Zarr I/O in spatialdata so local and add remote (fsspec / UPath) backends" to "feat: generalize Zarr I/O in spatialdata to local and add remote (fsspec / UPath) backends" on Apr 17, 2026
selmanozleyen and others added 9 commits April 17, 2026 14:19
This reverts commit 2c4a579.
Dask's task-graph shape changed in PR scverse#1006 ("unpinning dask", commit 53b9438):
parquet reads are now Task objects wrapping ``_fragment_to_table`` with a
``FragmentWrapper`` in kwargs or inside fused-expression subgraphs, instead
of the legacy dicts with a ``piece`` tuple. This broke
``_search_for_backing_files_recursively`` in two ways:

1. ``"piece" in v.args[0]`` raised ``TypeError: argument of type 'Task' is
   not iterable`` before the fallback branch ever ran -- affecting every
   test that writes+reads points (``test_points``, ``test_roundtrip[points]``,
   ``test_io_and_lazy_loading_points``).
2. Fused expressions use key prefix ``readparquetpyarrowfs-fused-values-*``
   (not ``read_parquet-*``) with the FragmentWrapper nested inside lists of
   tuples inside a subgraph dict, so even after fixing (1) the parquet file
   was never discovered -- affecting ``test_self_contained``.

Replace the ad-hoc ``v.args[0]["piece"]`` / ``v.args[0].values()`` logic
with a uniformly recursive helper ``_extract_parquet_paths_from_task`` that
walks Mappings, Sequences, ``.args`` and ``.kwargs``, detecting
FragmentWrapper via the ``.fragment.path`` attribute chain (no private
dask_expr import) and still validating the legacy ``piece`` tuple shape
for backward compatibility. Broaden the outer key-prefix match to any key
containing "parquet"; ``.endswith(".parquet")`` inside the extractor keeps
false positives out.

Validated: 130 passed / 1 failed on tests/io/test_readwrite.py +
test_store.py + test_store_abstractions.py (up from 113 / 18 on baseline;
the one remaining failure is the intentionally-exposed consolidated-
metadata-on-read gap, unrelated to this change).
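The uniformly recursive walk described above can be sketched as follows. This is an illustrative stand-alone version, not the code in this PR: the mock Task/FragmentWrapper classes in the usage below are hypothetical, and the real helper lives in spatialdata's I/O utilities.

```python
from __future__ import annotations

from collections.abc import Mapping, Sequence


def extract_parquet_paths(obj: object, found: set | None = None) -> set:
    """Recursively walk task-like structures, collecting backing .parquet paths."""
    if found is None:
        found = set()
    # FragmentWrapper-style objects expose the file via the
    # ``.fragment.path`` attribute chain (no private dask_expr import).
    path = getattr(getattr(obj, "fragment", None), "path", None)
    if isinstance(path, str) and path.endswith(".parquet"):
        found.add(path)
    if isinstance(obj, str):
        # Legacy ``piece`` tuples carry the path as a plain string.
        if obj.endswith(".parquet"):
            found.add(obj)
        return found
    if isinstance(obj, Mapping):
        for v in obj.values():
            extract_parquet_paths(v, found)
    elif isinstance(obj, Sequence):
        for v in obj:
            extract_parquet_paths(v, found)
    # Task-like objects: recurse into .args / .kwargs when present.
    for attr in ("args", "kwargs"):
        inner = getattr(obj, attr, None)
        if inner is not None and inner is not obj:
            extract_parquet_paths(inner, found)
    return found
```

Fed a graph mixing a fused-expression Task (wrapper nested in lists of tuples) and a legacy ``{"piece": (file, None, None)}`` dict, the walk finds both paths; the ``.endswith(".parquet")`` check keeps false positives out of the broadened key match.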

Made-with: Cursor