# Design: Fully Synchronous Read/Write Bypass

## Problem

Zarr-python's read/write path is inherently async: every `Array.__getitem__`
or `Array.__setitem__` call passes through several layers of async machinery
before any actual work happens. For workloads where both the codec chain and
the store are fundamentally synchronous (e.g. gzip + MemoryStore, or
zstd + LocalStore), this async overhead dominates latency.

The call chain looks like this:

```
Array.__getitem__
└─ sync()                          # (1) thread hop: submits coroutine to background event loop
   └─ AsyncArray._get_selection    # runs on the event loop thread
      └─ CodecPipeline.read        # async pipeline
         ├─ concurrent_map         # (2) launches tasks on event loop
         │  └─ ByteGetter.get(prototype)   # (3) async store IO
         │     └─ MemoryStore.get()        # just a dict lookup!
         └─ codec.decode()
            └─ asyncio.to_thread(...)      # (4) thread hop for CPU work
               └─ gzip.decompress(...)     # actual compute
```

There are four sources of overhead, marked (1)-(4):

1. **`sync()` bridge**: Every synchronous `Array` method calls `sync()`, which
   uses `asyncio.run_coroutine_threadsafe()` to submit work to a background
   event loop thread. Even when the coroutine does zero awaiting, this costs
   ~30-50us for the round-trip through the event loop.

2. **`concurrent_map` batching**: The pipeline groups chunks into batches and
   dispatches them via `concurrent_map`, which creates asyncio tasks. For
   single-chunk reads (the common case), this is pure overhead.

3. **Async store IO**: `StorePath.get()` / `StorePath.set()` are `async def`.
   For `MemoryStore` (a dict lookup) and `LocalStore` (a file read), the
   underlying operation is synchronous; wrapping it in `async def` forces an
   unnecessary context switch through the event loop.

4. **`asyncio.to_thread` for codec compute**: `BatchedCodecPipeline` runs each
   codec's encode/decode in `asyncio.to_thread()`, adding another thread hop.
   `SyncCodecPipeline` (the foundation this work builds on) already eliminates
   this by calling `_decode_sync` / `_encode_sync` inline.

The net effect: a MemoryStore read of a single small chunk spends more time
in async machinery than in actual decompression.
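
The cost of hop (1) is easy to measure in isolation. Below is a minimal sketch (not zarr code; the helper names are hypothetical) that reproduces the `sync()` pattern: a background event loop thread plus `asyncio.run_coroutine_threadsafe()`, compared against not using an event loop at all:

```python
import asyncio
import threading
import time

# Background event loop thread, the same shape as zarr's sync() bridge.
loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()

async def noop() -> int:
    return 42  # coroutine that never awaits

def via_bridge() -> int:
    # Submit to the background loop and block on the result, like sync().
    return asyncio.run_coroutine_threadsafe(noop(), loop).result()

first = via_bridge()

N = 1_000
t0 = time.perf_counter()
for _ in range(N):
    via_bridge()
bridge_us = (time.perf_counter() - t0) / N * 1e6

t0 = time.perf_counter()
for _ in range(N):
    noop().close()  # create and discard the coroutine, no event loop
direct_us = (time.perf_counter() - t0) / N * 1e6

print(f"bridge: {bridge_us:.1f}us/call, no bridge: {direct_us:.3f}us/call")
loop.call_soon_threadsafe(loop.stop)
```

The absolute numbers are machine-dependent, but the bridged call costs tens of microseconds even though the coroutine does no work.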


## Solution

When the codec pipeline and store both support synchronous operation, bypass
the event loop entirely: run IO, codec compute, and buffer scatter all on the
calling thread, with zero async overhead.

The solution has three layers:

### Layer 1: Sync Store IO

Add `supports_sync`, `get_sync()`, `set_sync()`, and `delete_sync()` to the
store abstraction. These are opt-in: the `Store` ABC provides default
implementations that raise `NotImplementedError`, and only stores with native
sync capabilities override them.

```
Store ABC (defaults: supports_sync=False, methods raise NotImplementedError)
├── MemoryStore   (supports_sync=True, direct dict access)
├── LocalStore    (supports_sync=True, direct file IO via _get/_put)
└── FsspecStore   (unchanged, remains async-only)

StorePath delegates to its underlying Store:
  get_sync() → self.store.get_sync(self.path, ...)
  set_sync() → self.store.set_sync(self.path, ...)
```

**Key decision**: `StorePath` is what gets passed to the codec pipeline as a
`ByteGetter` / `ByteSetter`. By adding sync methods to `StorePath`, the
pipeline can call them directly without knowing the concrete store type.

**Protocol gap**: The `ByteGetter` / `ByteSetter` protocols only define async
methods (`get`, `set`, `delete`). Rather than modifying these widely-used
protocols, the sync pipeline methods use `Any` type annotations for the
byte_getter/byte_setter parameters and call `.get_sync()` etc. at runtime.
This is a pragmatic tradeoff: the sync path is an optimization that only
activates when `supports_sync` is True, so the runtime type is always a
`StorePath` that has these methods.
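
The shape of Layer 1 can be sketched as follows, with simplified types: plain `bytes` instead of zarr's `Buffer`, and no prototype or byte-range arguments. Class and method names follow the design; the bodies are illustrative, not zarr's actual implementation:

```python
from __future__ import annotations

class Store:
    """Default: async-only. The sync methods are opt-in for subclasses."""

    @property
    def supports_sync(self) -> bool:
        return False

    def get_sync(self, key: str) -> bytes | None:
        raise NotImplementedError

    def set_sync(self, key: str, value: bytes) -> None:
        raise NotImplementedError

    def delete_sync(self, key: str) -> None:
        raise NotImplementedError


class MemoryStore(Store):
    def __init__(self) -> None:
        self._store_dict: dict[str, bytes] = {}

    @property
    def supports_sync(self) -> bool:
        return True

    def get_sync(self, key: str) -> bytes | None:
        return self._store_dict.get(key)  # direct dict access, no event loop

    def set_sync(self, key: str, value: bytes) -> None:
        self._store_dict[key] = value

    def delete_sync(self, key: str) -> None:
        self._store_dict.pop(key, None)


class StorePath:
    """Binds a store to one key and delegates the sync methods to it."""

    def __init__(self, store: Store, path: str) -> None:
        self.store = store
        self.path = path

    def get_sync(self) -> bytes | None:
        return self.store.get_sync(self.path)

    def set_sync(self, value: bytes) -> None:
        self.store.set_sync(self.path, value)
```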

### Layer 2: Sync Codec Pipeline IO

Add `supports_sync_io`, `read_sync()`, and `write_sync()` to the
`CodecPipeline` ABC (non-abstract, default raises `NotImplementedError`).

`SyncCodecPipeline` implements these with a simple sequential loop:

```python
# read_sync: for each chunk
for byte_getter, chunk_spec, chunk_selection, out_selection, _ in batch_info:
    chunk_bytes = byte_getter.get_sync(prototype=chunk_spec.prototype)  # sync IO
    chunk_array = self._decode_one(chunk_bytes, ...)                    # sync compute
    out[out_selection] = chunk_array[chunk_selection]                   # scatter
```

No batching, no `concurrent_map`, no event loop: just a Python for-loop.

**Sharding fallback**: When `supports_partial_decode` is True (i.e. the codec
pipeline uses sharding), `supports_sync_io` returns False and the Array falls
back to the standard `sync()` path. This is because `ShardingCodec`'s
`decode_partial` is async (it reads sub-ranges from the store) and does not
have a sync equivalent.
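
`write_sync` mirrors the read loop in reverse: gather from the input buffer, encode, store. A runnable sketch with stand-in types — `RecordingSetter` and `encode_one` are hypothetical; the real loop unpacks the full `batch_info` tuples and runs the configured codec chain rather than bare gzip:

```python
from __future__ import annotations

import gzip

class RecordingSetter:
    """Stands in for a StorePath acting as ByteSetter."""

    def __init__(self) -> None:
        self.stored: bytes | None = None

    def set_sync(self, value: bytes) -> None:
        self.stored = value  # direct store write, no event loop


def encode_one(chunk: bytes) -> bytes:
    return gzip.compress(chunk)  # stands in for the sync codec chain


def write_sync(batch_info, value: bytes) -> None:
    # write_sync: for each chunk -- gather, encode, store,
    # all on the calling thread.
    for byte_setter, out_selection in batch_info:
        chunk = value[out_selection]             # gather from input buffer
        byte_setter.set_sync(encode_one(chunk))  # sync compute + sync IO


setter = RecordingSetter()
write_sync([(setter, slice(0, 4))], b"abcdefgh")
```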

### Layer 3: Array Bypass

Each of the 10 sync `Array` selection methods (5 getters, 5 setters) gains a
fast path:

```python
def get_basic_selection(self, selection, *, out=None, prototype=None, fields=None):
    indexer = BasicIndexer(selection, self.shape, self.metadata.chunk_grid)
    if self._can_use_sync_path():
        return _get_selection_sync(
            self.async_array.store_path, self.async_array.metadata,
            self.async_array.codec_pipeline, self.async_array.config,
            indexer, out=out, fields=fields, prototype=prototype,
        )
    return sync(self.async_array._get_selection(indexer, ...))
```

`_can_use_sync_path()` checks three conditions:
1. The codec pipeline supports sync IO (`supports_sync_io`)
2. No partial decode is active (rules out sharding)
3. The store supports sync (`supports_sync`)

When all three hold, `_get_selection_sync` / `_set_selection_sync` run the
entire operation on the calling thread. These functions mirror the async
`_get_selection` / `_set_selection` exactly, but call `codec_pipeline.read_sync()`
/ `write_sync()` instead of `await codec_pipeline.read()` / `write()`.
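
The three-way check can be sketched as a pure function over the two capability surfaces. The dataclasses below are stand-ins for the real codec pipeline and store objects; only the attribute names come from the design:

```python
from dataclasses import dataclass

@dataclass
class PipelineCaps:
    supports_sync_io: bool
    supports_partial_decode: bool

@dataclass
class StoreCaps:
    supports_sync: bool

def can_use_sync_path(pipeline: PipelineCaps, store: StoreCaps) -> bool:
    return (
        pipeline.supports_sync_io                 # (1) pipeline has read_sync/write_sync
        and not pipeline.supports_partial_decode  # (2) rules out sharding
        and store.supports_sync                   # (3) store has get_sync/set_sync
    )

# MemoryStore + SyncCodecPipeline, no sharding: bypass active.
bypass = can_use_sync_path(PipelineCaps(True, False), StoreCaps(True))
# Sharded array, or a remote store: falls back to sync().
sharded = can_use_sync_path(PipelineCaps(True, True), StoreCaps(True))
remote = can_use_sync_path(PipelineCaps(True, False), StoreCaps(False))
```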


## Resulting Call Chain

With the sync bypass active, the call chain becomes:

```
Array.__getitem__
└─ _get_selection_sync             # runs on calling thread
   └─ SyncCodecPipeline.read_sync
      ├─ StorePath.get_sync        # direct dict/file access, no event loop
      ├─ _decode_one               # inline codec chain, no to_thread
      └─ out[sel] = array          # scatter into output
```

No `sync()`, no event loop, no `asyncio.to_thread`, no `concurrent_map`.


## Additional Optimization: Codec Instance Caching

`GzipCodec` was creating a new `GZip(level)` instance on every encode/decode
call. `ZstdCodec` and `BloscCodec` already cache their codec instances via
`@cached_property`. We apply the same pattern to `GzipCodec`:

```python
@cached_property
def _gzip_codec(self) -> GZip:
    return GZip(self.level)
```

This is safe because `GzipCodec` is a frozen dataclass: `level` never
changes after construction, so the cached instance is always valid.
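
`functools.cached_property` works here despite the frozen flag because it stores the computed value in the instance `__dict__` directly, bypassing the frozen `__setattr__` (this requires that the dataclass not use `slots=True`). A self-contained demonstration, with `FakeGzip` standing in for numcodecs' `GZip`:

```python
from dataclasses import dataclass
from functools import cached_property

class FakeGzip:
    """Hypothetical stand-in for the numcodecs GZip codec object."""

    def __init__(self, level: int) -> None:
        self.level = level


@dataclass(frozen=True)
class GzipCodec:
    level: int = 5

    @cached_property
    def _gzip_codec(self) -> FakeGzip:
        return FakeGzip(self.level)  # built once, reused on every call


codec = GzipCodec(level=7)
same = codec._gzip_codec is codec._gzip_codec  # cached: same instance
```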


## What Stays Unchanged

- **`BatchedCodecPipeline`**: Unmodified. It inherits the default
  `supports_sync_io=False` from the ABC.
- **Remote stores** (`FsspecStore`): `supports_sync` stays `False`. All
  remote IO remains async.
- **Sharded arrays**: Fall back to the `sync()` path because
  `supports_partial_decode` is True.
- **All async APIs**: `AsyncArray`, `async def read/write`, etc. are
  completely untouched. The sync bypass is an optimization of the
  synchronous `Array` class only.


## Files Modified

| File | Layer | Change |
|------|-------|--------|
| `src/zarr/abc/store.py` | 1 | `supports_sync`, `get_sync`, `set_sync`, `delete_sync` on `Store` ABC |
| `src/zarr/storage/_memory.py` | 1 | Sync store methods (direct dict access) |
| `src/zarr/storage/_local.py` | 1 | Sync store methods (direct `_get`/`_put` calls) |
| `src/zarr/storage/_common.py` | 1 | Sync methods on `StorePath` (delegates to store) |
| `src/zarr/abc/codec.py` | 2 | `supports_sync_io`, `read_sync`, `write_sync` on `CodecPipeline` ABC |
| `src/zarr/experimental/sync_codecs.py` | 2 | `read_sync`, `write_sync` implementation |
| `src/zarr/core/array.py` | 3 | `_can_use_sync_path`, `_get_selection_sync`, `_set_selection_sync`, 10 method modifications |
| `src/zarr/codecs/gzip.py` | — | `@cached_property` for GZip instance |


## Design Tradeoffs

**Duplication of `_get_selection` / `_set_selection`**: The sync versions
(`_get_selection_sync`, `_set_selection_sync`) duplicate the setup logic
(dtype resolution, buffer creation, value coercion) from the async originals.
This is intentional: extracting shared helpers would add complexity and
indirection to the hot path for no functional benefit. The cost is that the
two versions must be kept in sync manually.

**Sequential chunk processing**: `read_sync` and `write_sync` process chunks
sequentially in a for-loop, with no parallelism. For the target use case
(MemoryStore, LocalStore), this is optimal: a MemoryStore read is a dict lookup
(~1us), a LocalStore read is a file read that benefits from the OS page cache,
and Python's GIL prevents true parallelism for CPU-bound codec work anyway. The
async path with `concurrent_map` remains better for remote stores, where IO
latency can be overlapped.

**`Any` type annotations**: The `read_sync` and `write_sync` methods on
`SyncCodecPipeline` use `Any` for the byte_getter/byte_setter type in the
`batch_info` tuples. This avoids modifying the `ByteGetter`/`ByteSetter`
protocols, which are public API. The runtime type is always `StorePath`, which
has the sync methods; the type system just can't express this constraint
through the existing protocol hierarchy.

**No sync partial decode/encode**: Sharding's `decode_partial` /
`encode_partial` methods are inherently async (they issue byte-range reads to
the store). Rather than adding sync variants to the sharding codec (which
would require significant refactoring), we simply fall back to the `sync()`
path for sharded arrays. This is the right tradeoff because sharded arrays
typically involve remote stores, where async IO is beneficial anyway.