feat: generalize Zarr I/O in spatialdata to local and add remote (fsspec / UPath) backends#1107
Draft
selmanozleyen wants to merge 54 commits into scverse:main from
Conversation
Patch `da.to_zarr` so ome_zarr's `**kwargs` are forwarded as `zarr_array_kwargs`, avoiding a `FutureWarning` while keeping behavior correct.
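Illustratively, such a kwargs-forwarding patch boils down to a small wrapper; the names and the `direct` keyword set below are hypothetical stand-ins, not the actual `dask.array.to_zarr` / ome_zarr signatures:

```python
import functools


def forward_as_zarr_array_kwargs(to_zarr):
    """Wrap a ``to_zarr``-style function so loose keyword arguments are
    re-nested under ``zarr_array_kwargs`` instead of being passed loose
    (which is what triggers the FutureWarning). Illustrative sketch only."""
    # Keywords the wrapped function understands directly; everything else
    # is assumed to be meant for the underlying zarr array creation.
    direct = {"url", "component", "overwrite", "compute"}

    @functools.wraps(to_zarr)
    def wrapper(arr, *args, **kwargs):
        passthrough = {k: v for k, v in kwargs.items() if k in direct}
        array_kwargs = {k: v for k, v in kwargs.items() if k not in direct}
        if array_kwargs:
            passthrough["zarr_array_kwargs"] = array_kwargs
        return to_zarr(arr, *args, **passthrough)

    return wrapper
```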
- `_FsspecStoreRoot`, `_get_store_root` for path-like store roots (local + fsspec)
- `_storage_options_from_fs` for parquet writes to Azure/S3/GCS
- `_remote_zarr_store_exists`, `_ensure_async_fs` for UPath/FsspecStore
- Extend `_resolve_zarr_store` for UPath and `_FsspecStoreRoot` with async fs
- `_backed_elements_contained_in_path`, `_is_element_self_contained` accept UPath

- `path` and `_path` accept `Path | UPath`; setter allows UPath
- `write()` accepts `file_path: str | Path | UPath | None` (None uses `path`)
- `_validate_can_safely_write_to_path` handles UPath and remote store existence
- `_write_element` accepts `Path | UPath`; skip local subfolder checks for UPath
- `__repr__` and `_get_groups_for_element` use `path` without forcing `Path()`

…table, zarr
- Resolve store via `_resolve_zarr_store` in read paths (points, shapes, raster, table)
- Use `_get_store_root` for parquet paths; read/write parquet with `storage_options` for fsspec
- `io_shapes`: upload parquet to Azure/S3/GCS via temp file when path is `_FsspecStoreRoot`
- `io_zarr`: `_get_store_root`, UPath in `_get_groups_for_element` and `_write_consolidated_metadata`; set `sdata.path` to UPath when store is remote

- `pyproject.toml`: adlfs, gcsfs, moto[server], pytest-timeout in test extras
- `Dockerfile.emulators`: moto, Azurite, fake-gcs-server for `tests/io/remote_storage/`

… emulator config
- `full_sdata` fixture: two regions for table categorical (avoids 404 on remote read)
- `tests/io/remote_storage/conftest.py`: bucket/container creation, resilient async shutdown
- `tests/io/remote_storage/test_remote_storage.py`: parametrized Azure/S3/GCS roundtrip and write tests

- Added `"dimension_separator"` to the frozenset of internal keys that should not be passed to `zarr.Group.create_array()`, ensuring compatibility across zarr versions.
- Updated a test to set region labels for the `full_sdata` table, allowing `test_set_table_annotates_spatialelement` to succeed without errors.

- Updated `test_subset` to exclude labels and poly from the default table, ensuring accurate subset validation.
- Enhanced `test_validate_table_in_spatialdata` to assert that both regions (labels2d and poly) are correctly annotated in the table.
- Adjusted `test_labels_table_joins` to restrict the table to labels2d, ensuring the join returns the expected results.

…inux
- Added steps to build and run storage emulators (S3, Azure, GCS) using Docker, specifically for the Ubuntu environment.
- Implemented a wait mechanism to ensure emulators are ready before running tests.
- Adjusted test execution to skip remote-storage tests on non-Linux platforms.

- Wrapped the fsspec async sync function to prevent `RuntimeError: Loop is not running` during process exit when using remote storage (Azure, S3, GCS).
- Ensured compatibility with async session management in the `_utils` module.
for more information, see https://pre-commit.ci
…nd io_points modules
…cols and improving storage options handling for parquet files.
… or UPath, and add tests to verify correct coercion of string paths to appropriate types.
…g for unsupported protocols in storage options, and add test cases to validate new functionality and ensure compatibility with cloud object store protocols.
…load configurations, and refining test execution conditions for different operating systems.
…to feat/zarr-store-class
force-pushed the branch from f105f5f to e6fba59
changed the title: "…spatialdata so local and add remote (fsspec / UPath) backends" → "…spatialdata to local and add remote (fsspec / UPath) backends"
This reverts commit 2c4a579.
Dask's task-graph shape changed in PR scverse#1006 ("unpinning dask", commit 53b9438): parquet reads are now Task objects wrapping ``_fragment_to_table`` with a ``FragmentWrapper`` in kwargs, or inside fused-expression subgraphs, instead of the legacy dicts with a ``piece`` tuple. This broke ``_search_for_backing_files_recursively`` in two ways:

1. ``"piece" in v.args[0]`` raised ``TypeError: argument of type 'Task' is not iterable`` before the fallback branch ever ran -- affecting every test that writes+reads points (``test_points``, ``test_roundtrip[points]``, ``test_io_and_lazy_loading_points``).
2. Fused expressions use key prefix ``readparquetpyarrowfs-fused-values-*`` (not ``read_parquet-*``), with the FragmentWrapper nested inside lists of tuples inside a subgraph dict, so even after fixing (1) the parquet file was never discovered -- affecting ``test_self_contained``.

Replace the ad-hoc ``v.args[0]["piece"]`` / ``v.args[0].values()`` logic with a uniformly recursive helper ``_extract_parquet_paths_from_task`` that walks Mappings, Sequences, ``.args`` and ``.kwargs``, detecting FragmentWrapper via the ``.fragment.path`` attribute chain (no private dask_expr import) and still validating the legacy ``piece`` tuple shape for backward compatibility. Broaden the outer key-prefix match to any key containing "parquet"; ``.endswith(".parquet")`` inside the extractor keeps false positives out.

Validated: 130 passed / 1 failed on tests/io/test_readwrite.py + test_store.py + test_store_abstractions.py (up from 113 / 18 on baseline; the one remaining failure is the intentionally-exposed consolidated-metadata-on-read gap, unrelated to this change).

Made-with: Cursor
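The uniformly recursive walk described above can be sketched as follows. This is an illustrative re-implementation of the idea, not the spatialdata code: it walks Mappings, Sequences, and `.args`/`.kwargs` uniformly, detects fragment wrappers via the `.fragment.path` attribute chain, and still accepts the legacy `{"piece": (file, None, None)}` shape.

```python
from collections.abc import Mapping, Sequence


def extract_parquet_paths(obj, found=None):
    """Collect parquet file paths from a dask-style task-graph value
    (illustrative sketch; names are stand-ins for spatialdata internals)."""
    if found is None:
        found = []
    # New-style: any object exposing a fragment with a .path string.
    path = getattr(getattr(obj, "fragment", None), "path", None)
    if isinstance(path, str) and path.endswith(".parquet"):
        found.append(path)
    if isinstance(obj, Mapping):
        # Legacy shape: {"piece": ("/data/part.parquet", None, None)}
        piece = obj.get("piece")
        if (
            isinstance(piece, tuple)
            and piece
            and isinstance(piece[0], str)
            and piece[0].endswith(".parquet")
        ):
            found.append(piece[0])
        for value in obj.values():
            extract_parquet_paths(value, found)
    elif isinstance(obj, Sequence) and not isinstance(obj, (str, bytes)):
        for value in obj:
            extract_parquet_paths(value, found)
    # Task-like objects carry nested values in .args / .kwargs.
    for attr in ("args", "kwargs"):
        value = getattr(obj, attr, None)
        if value is not None and not isinstance(obj, (str, bytes)):
            extract_parquet_paths(value, found)
    return found
```

The point of the single recursive entry point is that fused-expression subgraphs (lists of tuples inside dicts) need no special casing: every container shape funnels through the same function.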
(disclaimer prettified with AI)
This PR generalizes Zarr I/O in `spatialdata` so local and remote (fsspec / UPath) backends are treated uniformly, and tightens the write path to be safe against accidental clobbering of remote data.
Main changes
- `ZarrStore` abstraction (`src/spatialdata/_store.py`): a dataclass wrapping a `UPath` with its embedded fsspec filesystem, plus `make_zarr_store`/`make_zarr_store_from_group` constructors and `open_read_store`/`open_write_store` context managers. All Zarr opens go through these.
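The `ZarrStore` + `open_read_store`/`open_write_store` design could look roughly like the following toy sketch. A plain dict stands in for the fsspec filesystem, and the read-only view is a simplification; the real class wraps a `UPath` and constructs zarr stores:

```python
from contextlib import contextmanager
from dataclasses import dataclass, field


class _ReadOnlyView:
    """Read-only wrapper handed out by read opens (toy stand-in for a
    read-only zarr store)."""

    def __init__(self, backing):
        self._backing = backing

    def __getitem__(self, key):
        return self._backing[key]

    def __setitem__(self, key, value):
        raise PermissionError("store was opened read-only")


@dataclass
class ZarrStore:
    """Toy value object: one place that owns the path and its filesystem,
    with explicit read/write entry points."""

    path: str
    fs: dict = field(default_factory=dict)  # stand-in for an fsspec filesystem

    @contextmanager
    def open_read_store(self):
        # Readers only ever see a view that cannot write.
        yield _ReadOnlyView(self.fs)

    @contextmanager
    def open_write_store(self):
        yield self.fs
```

The payoff of routing every open through two context managers is that read-only enforcement lives in one place instead of at each call site.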
- `_resolve_zarr_store(..., read_only=...)` forwards the flag to every `LocalStore`/`FsspecStore` construction site, so readers cannot accidentally write.
- `io_points.py`, `io_shapes.py`: `read_parquet`/`to_parquet` now take `filesystem=arrow_fs` derived from the store's fsspec fs. Includes a fix for a `to_parquet` categorical-schema regression introduced by the handoff (demote `known=[]` categoricals back to unknown in `_read_points` so `write_points`'s `as_known()` can unify categories across partitions).
- `SpatialData.write` to a remote `UPath` (i.e. not `PosixUPath`/`WindowsUPath`) requires `overwrite=True`, because we cannot reliably probe existence on arbitrary remotes. Local paths and internal per-element writes are unaffected.
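A minimal sketch of that write guard, assuming string paths and scheme sniffing purely for illustration (the real check operates on `UPath` subclasses, not raw strings):

```python
from urllib.parse import urlparse


def validate_can_safely_write(path: str, overwrite: bool) -> None:
    """Refuse remote writes without an explicit overwrite=True, since the
    existence of a remote zarr store cannot be probed reliably.
    Hypothetical helper, not the spatialdata implementation."""
    scheme = urlparse(path).scheme
    is_remote = scheme not in ("", "file")  # e.g. s3://, az://, gs://
    if is_remote and not overwrite:
        raise ValueError(
            f"Refusing to write to remote path {path!r} without overwrite=True: "
            "existence of a remote store cannot be probed reliably."
        )
```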
- Dropped `ConsolidatedMetadataStore` branches that were unreachable on the `zarr>=3.0.0` pin.
- `_extract_parquet_paths_from_task` restores `get_dask_backing_files` for the dask expression API after #1006 ("unpinning dask") (`FragmentWrapper.fragment.path`) while still tolerating the legacy `{"piece": (file, None, None)}` shape; the matching key filter is broadened to `"parquet" in name.lower()` to catch fused-expression keys such as `readparquetpyarrowfs-fused-values-*`.
- Every `getattr` branch that exists only to tolerate older dask/zarr shapes carries a `TODO(...)` marker with an explicit retirement condition (`TODO(legacy-dask)`, `TODO(dask-task-api)`, `TODO(zarr-v3-store-path)`, `TODO(async-pyarrow-fs)`).
- `tests/io/test_store_abstractions.py`: memory-fsspec round-trip of every element type, HTTP-like no-listing filesystem, tamper-evident read, and write-side consolidated-metadata coverage. Plus unit tests in `tests/io/test_store.py` for `_resolve_zarr_store` and `ZarrStore` construction.

Relation to prior work
This branch was forked off `SamirMoustafa:cloud-storage-support`, which introduced the initial plumbing for remote storage and a dedicated CI job for remote-storage tests. Building on that, this PR:

- …end-to-end.
- …the `path`/`storage_options` threading into a single `ZarrStore` value object, so every reader and writer takes one argument instead of reconstructing filesystems downstream.
- …provider-agnostic `fsspec.filesystem("memory")` stress tests. No moto / Azurite needed; the abstractions are exercised against a synthetic remote that mimics the awkward parts of cloud backends (no directory listing, read-only enforcement, byte-level tamper detection).
- …categorical fix, and the dask task-graph adapter.

The abstraction-stress tests here cover the same correctness invariants #971 targeted on the remote side, without requiring cloud credentials or emulators. Real-network integration can be a follow-up behind a `--run-remote` marker.

Known gap left deliberately open
Not hidden behind `@pytest.mark.xfail` -- I would rather have reviewer input on where to wire the fix than merge it silenced.

Consolidated metadata is written but not consumed on read. `tests/io/test_store_abstractions.py::TestConsolidatedMetadataOnRead::test_read_zarr_opens_via_consolidated_metadata` fails with `saw 120 GETs < 10`: reading a single-image `SpatialData` from a memory UPath currently issues ~120 small `cat_file` calls, because `read_zarr` opens the root with `zarr.open_group(store, mode="r")` and does not pass `use_consolidated=True`. This is a latency regression over remote stores only; correctness is unaffected. Suggestions on whether the fix should live in `read_zarr` itself, in `open_read_store`, or be opt-in via a parameter are very welcome.
Test plan
- `pytest tests/io/test_store.py tests/io/test_store_abstractions.py` (expected: 18 passed, 1 failed -- the consolidated-metadata invariant noted above).
- `pytest tests/io/test_readwrite.py` vs main (expected: matches baseline, no new regressions; the Task-iteration failures present on main are resolved here).
- …`memory://` UPath using `SpatialData.write`/`SpatialData.read`.
- `grep -n "TODO(" src/spatialdata/_store.py src/spatialdata/_io/_utils.py` lists four markers, each with a documented retirement condition.
cc: @melonora @SamirMoustafa