# Codec Pipeline Invariants

This document describes the contracts that the codec pipeline, codec
chain, sharding codec, and buffer abstractions rely on. Each invariant
is enforced by a runtime test in `tests/test_codec_invariants.py`.
Any change to the pipeline that violates one of these invariants
should fail those tests immediately, instead of waiting for an
end-to-end test in a particular configuration to surface the bug.

## Why this exists

The new `SyncCodecPipeline` accumulated several correctness bugs that
all had the same shape: **case-by-case reasoning about how codec /
shard / IO / buffer pieces interact, missing a combination.** Examples:

- The `ChunkTransform` cached prototype-bearing specs at construction
  time, so per-call GPU prototypes were silently discarded.
- The byte-range write path produced a dense layout, breaking the
  shard-compactness contract that downstream code (and existing
  tests) depended on.
- The shard index decoded from a read-only buffer crashed when the
  byte-range path tried to mutate it.
- Empty inner chunks weren't compacted out of partial shard writes.

In each case, the code "worked" for the configuration the author was
thinking about, and broke for a different one. The invariants below
are the contracts the author should have written down first.

## Codec chain invariants

### C1. Codecs only mutate the array spec's `shape`.

`Codec.resolve_metadata(spec)` returns an `ArraySpec` that may differ
from `spec` in `shape` (e.g. `TransposeCodec` permutes it). Every
other field — `prototype`, `dtype`, `fill_value`, `config` — is
unchanged.

**Consequence:** A `ChunkTransform` cannot bake any field other than
`shape` into a cache. Anything else (notably `prototype`) must be
read from the runtime `chunk_spec` on every call.

**Test:** for every registered codec, call `resolve_metadata(spec)`
with various specs and assert `result.prototype is spec.prototype`,
`result.dtype == spec.dtype`, and so on.
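
The C1 check can be sketched as follows. The `ArraySpec` and `TransposeCodec` here are toy stand-ins (the real classes live in the codebase), but the shape of the assertion is the point: identity on `prototype`, equality on the rest.

```python
# Toy model of the C1 invariant check; ArraySpec/TransposeCodec here
# are simplified stand-ins for the real classes.
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ArraySpec:
    shape: tuple
    dtype: str
    fill_value: int
    prototype: object  # buffer prototype, treated as opaque here


class TransposeCodec:
    def __init__(self, order):
        self.order = order

    def resolve_metadata(self, spec):
        # C1: only `shape` may change; every other field passes through.
        return replace(spec, shape=tuple(spec.shape[i] for i in self.order))


sentinel_proto = object()
spec = ArraySpec(shape=(2, 3, 4), dtype="u1", fill_value=0, prototype=sentinel_proto)
out = TransposeCodec(order=(2, 0, 1)).resolve_metadata(spec)

assert out.shape == (4, 2, 3)
assert out.prototype is spec.prototype  # identity, not mere equality
assert out.dtype == spec.dtype
assert out.fill_value == spec.fill_value
```

Note the `is` check on `prototype`: a codec that rebuilt the spec with an equal-but-different prototype object would still break GPU-buffer callers.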

### C2. Each codec call receives the spec at its position in the chain.

A `chunk_spec` flows through the codec chain. The AA (array-to-array)
codecs see the running spec (each step potentially changes shape).
The AB (array-to-bytes) codec sees the spec after all AA codecs have
resolved. The BB (bytes-to-bytes) codecs all see the post-AB spec.

**Consequence:** the pipeline must walk the AA codecs to get the AB
spec for any given input. The walk is cheap; what matters is that
the pipeline walks from the *runtime* `chunk_spec`, not from a
constructor-time spec.

**Test:** a fake codec that asserts `chunk_spec.prototype is X`, where
`X` was passed to the pipeline at call time.
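
A minimal sketch of that test idea, with hypothetical names (dict-based specs instead of the real `ArraySpec`): a recording codec captures the prototype it sees, and the pipeline must forward the runtime spec rather than one captured at construction.

```python
# Sketch of the C2 test: the codec must observe per-call prototypes.
class RecordingCodec:
    def __init__(self):
        self.seen = []

    def decode(self, data, chunk_spec):
        self.seen.append(chunk_spec["prototype"])
        return data


class Pipeline:
    def __init__(self, codec, construction_spec):
        self.codec = codec
        # Caching this spec is the bug C2 guards against: it must never
        # be substituted for the runtime chunk_spec below.
        self._cached = construction_spec

    def decode(self, data, chunk_spec):
        # Correct: forward the runtime spec, never self._cached.
        return self.codec.decode(data, chunk_spec)


cpu, gpu = object(), object()
codec = RecordingCodec()
pipeline = Pipeline(codec, construction_spec={"prototype": cpu})
pipeline.decode(b"...", {"prototype": cpu})
pipeline.decode(b"...", {"prototype": gpu})

assert codec.seen == [cpu, gpu]  # per-call prototypes, not a cached one
```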

### C3. The pipeline owns IO unless a codec opts in to partial encode/decode.

Codecs are pure compute. The pipeline is responsible for fetching
encoded blobs from the store, calling `decode_chunk`, and writing
encoded results back. The exception is when the AB codec implements
`ArrayBytesCodecPartialDecodeMixin` / `ArrayBytesCodecPartialEncodeMixin`
— then the codec receives the store handle and handles its own IO
(this is how `ShardingCodec` does byte-range reads/writes).

**Consequence:** A pipeline must check `supports_partial_encode` /
`supports_partial_decode` and dispatch to the codec when true. It must
not branch on codec type (`isinstance(..., ShardingCodec)`).

**Test:** the pipeline should produce identical results when forced
through both code paths (force a partial-encode-capable codec to use
the generic path, and vice versa where possible).
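
The dispatch rule can be sketched with hypothetical names (a dict as the store, simplified codec classes): the pipeline asks the codec for its capability flag instead of checking its concrete type.

```python
# Sketch of capability-based dispatch (C3). Names are illustrative,
# not the real pipeline API.
class GenericCodec:
    supports_partial_decode = False

    def decode(self, blob, spec):
        return ("decoded", blob)


class PartialCodec:
    supports_partial_decode = True

    def decode_partial(self, store, key, byte_ranges, spec):
        # The codec owns IO here: it would fetch only the ranges it needs.
        return [("partial", key, r) for r in byte_ranges]


def pipeline_read(codec, store, key, byte_ranges, spec):
    # Dispatch on the capability flag, never on isinstance(codec, ...).
    if getattr(codec, "supports_partial_decode", False):
        return codec.decode_partial(store, key, byte_ranges, spec)
    blob = store[key]  # the pipeline owns IO on the generic path
    return codec.decode(blob, spec)


store = {"c/0": b"\x00\x01\x02"}
assert pipeline_read(GenericCodec(), store, "c/0", None, None) == ("decoded", b"\x00\x01\x02")
assert pipeline_read(PartialCodec(), store, "c/0", [(0, 2)], None) == [("partial", "c/0", (0, 2))]
```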

## Shard layout invariants

### S1. The shard layout is *compact*, not dense.

A shard blob contains only the inner chunks that are present, packed
together (in Morton order). Absent chunks are recorded in the index
with `(MAX_UINT_64, MAX_UINT_64)` and consume no space in the data
region.

**Consequence:** Even with fixed-size inner codecs, the shard size
varies with how many chunks are present. Anything that writes a shard
must produce the compact layout.

**Counter-example we hit:** the byte-range fast path wrote every
affected chunk into a fixed slot, producing a dense layout. This
worked for reads (the index correctly recorded offsets) but broke
size-checking tests and wasted space.
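
The compact layout can be illustrated with a toy builder (the real index encoding is defined by the sharding spec; this just shows where the bytes go):

```python
# Toy illustration of the compact shard layout (S1).
MAX_UINT_64 = 2**64 - 1


def build_compact_shard(inner_chunks):
    """inner_chunks: encoded-chunk bytes, or None for an absent chunk.
    Assumed to already be ordered (Morton order in the real layout)."""
    data = bytearray()
    index = []
    for chunk in inner_chunks:
        if chunk is None:
            # Absent chunk: sentinel entry, zero bytes in the data region.
            index.append((MAX_UINT_64, MAX_UINT_64))
        else:
            index.append((len(data), len(chunk)))
            data += chunk
    return bytes(data), index


data, index = build_compact_shard([b"aaaa", None, b"bb", None])
assert len(data) == 6  # only present chunks consume space
assert index == [
    (0, 4),
    (MAX_UINT_64, MAX_UINT_64),
    (4, 2),
    (MAX_UINT_64, MAX_UINT_64),
]
```

A dense layout (the counter-example above) would have reserved four fixed slots here regardless of how many chunks are present.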

### S2. `write_empty_chunks=False` (the default) means an inner chunk that equals `fill_value` must NOT be written.

When merged data equals the fill value, the chunk is omitted from
the shard entirely (no entry in the data region, `MAX_UINT_64` in
the index). If all chunks become absent, the entire shard is deleted.

**Consequence:** any partial-shard-write code path must compute the
merged chunk content, check it against `fill_value`, and omit the
chunk if they match.

**Test:** write fill-value data to a sharded array; assert the
relevant store keys do not exist.
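
The omission rule reduces to a small filter; this sketch uses plain lists for chunk data and a hypothetical helper name, but the branch structure is the invariant:

```python
# Sketch of S2: a merged chunk equal to fill_value is omitted unless
# the user opted in with write_empty_chunks=True.
def chunks_to_write(merged, fill_value, write_empty_chunks=False):
    """merged: mapping of chunk key -> merged chunk values (toy model)."""
    out = {}
    for key, values in merged.items():
        if not write_empty_chunks and all(v == fill_value for v in values):
            # Omitted: no data-region entry, MAX_UINT_64 in the index.
            continue
        out[key] = values
    return out


merged = {"0.0": [0, 0, 0, 0], "0.1": [0, 1, 0, 0]}
assert set(chunks_to_write(merged, fill_value=0)) == {"0.1"}
assert set(chunks_to_write(merged, 0, write_empty_chunks=True)) == {"0.0", "0.1"}
```

The "all chunks absent" case falls out naturally: when `chunks_to_write` returns empty, the caller deletes the shard key.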

### S3. The byte-range fast path requires `write_empty_chunks=True`.

Byte-range writes update fixed slots in an existing shard blob. They
cannot compact away chunks that newly equal `fill_value` (that would
require rewriting the whole blob anyway). The optimization is only
valid when the user explicitly opts out of empty-chunk skipping.

**Consequence:** `_encode_partial_sync`'s byte-range path is gated
on `not skip_empty`. Under the default config, partial shard writes
take the full-rewrite path.
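
The gate itself is a one-line condition; this sketch names it explicitly (function and flag names are illustrative, not the real ones):

```python
# Sketch of the S3 gate: the fast path is only legal when empty-chunk
# skipping is disabled (write_empty_chunks=True means skip_empty=False).
def choose_write_path(skip_empty, shard_exists):
    if not skip_empty and shard_exists:
        return "byte-range"  # fixed-slot updates, no compaction required
    return "full-rewrite"    # must compact, and possibly delete the shard


assert choose_write_path(skip_empty=True, shard_exists=True) == "full-rewrite"
assert choose_write_path(skip_empty=False, shard_exists=True) == "byte-range"
assert choose_write_path(skip_empty=False, shard_exists=False) == "full-rewrite"
```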

## Buffer invariants

### B1. Buffers returned from store IO may be read-only.

`LocalStore` returns mmap-backed buffers; `ZipStore` returns views
into a zip member; remote stores may return immutable buffers.

**Consequence:** any code that decodes from a store-returned buffer
and then mutates the result must `.copy()` first.

**Counter-example we hit:** `_ShardIndex.set_chunk_slice` mutates
`self.offsets_and_lengths`. When the index was decoded from a
read-only buffer, this raised `ValueError: assignment destination
is read-only`.
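
The failure mode is easy to reproduce standalone with NumPy, since `np.frombuffer` over immutable bytes yields a read-only array, just like a decoded mmap-backed store buffer:

```python
# Sketch of B1: an index decoded from store bytes is read-only and
# must be copied before mutation.
import numpy as np

blob = b"\x01\x00\x02\x00"            # stands in for a store-returned buffer
index = np.frombuffer(blob, dtype="<u2")
assert not index.flags.writeable       # view over immutable memory

try:
    index[0] = 99                      # what set_chunk_slice effectively did
except ValueError:
    pass                               # "assignment destination is read-only"

index = index.copy()                   # the fix: copy before mutating
index[0] = 99
assert index[0] == 99
```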

### B2. `decode_chunk` may return a view of its input.

`BytesCodec._decode_sync` returns an `NDBuffer` that views the same
memory as the input `Buffer`. Subsequent mutations affect both.

**Consequence:** if the caller intends to mutate the decoded array
(e.g. for a partial-write merge), it must `.copy()` first.
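
The aliasing is the same as NumPy's zero-copy `frombuffer` over mutable memory, which makes the hazard easy to demonstrate without the real codec:

```python
# Sketch of B2: a zero-copy decode aliases the encoded bytes, so
# mutating the decoded array silently changes the input too.
import numpy as np

encoded = bytearray(b"\x01\x02\x03\x04")       # mutable buffer memory
decoded = np.frombuffer(encoded, dtype="u1")   # zero-copy view, like _decode_sync
decoded[0] = 9
assert encoded[0] == 9                         # the input changed too!

safe = np.frombuffer(encoded, dtype="u1").copy()  # merging callers must copy
safe[1] = 7
assert encoded[1] == 2                         # original untouched
```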

### B3. The buffer prototype is a runtime parameter, not metadata.

The `prototype` field of `ArraySpec` indicates what kind of buffer
the *caller* wants for this operation. The same array can be read
into CPU buffers on one call and GPU buffers on the next. The
codec pipeline must use the per-call prototype, not a cached one.

**Consequence:** see C1 and C2 above.

## Test plan

`tests/test_codec_invariants.py` should contain:

1. **C1 enforcement:** a parametric test over all registered codecs
   that calls `resolve_metadata` with a sentinel `ArraySpec` and
   asserts the non-shape fields are unchanged.

2. **C2 enforcement:** a test that passes different prototypes to a
   `ChunkTransform` on each call (using a fake codec that records
   the prototype it was called with) and asserts the recorded
   prototypes match.

3. **S1 + S2 enforcement:** a test that writes various combinations
   of present/absent chunks to a sharded array and asserts the
   resulting shard sizes match the compact-layout formula. Run for
   both `BatchedCodecPipeline` and `SyncCodecPipeline`.

4. **S3 enforcement:** a test that uses a mock store to record
   `set_range_sync` calls. Verify that under the default config no
   `set_range_sync` happens for partial shard writes; under
   `write_empty_chunks=True` it does.

5. **B1 enforcement:** a test that creates a sharded array on
   `LocalStore`, then triggers a partial write, and asserts no
   `ValueError: assignment destination is read-only` is raised.

6. **B2 enforcement:** a test that calls `BytesCodec._decode_sync`
   and then mutates the result, and verifies the input buffer is not
   modified (or, if we accept the view semantics, documents that
   callers must copy and adds a test that decode-then-mutate
   without a copy gives a clear error / known behavior).