# Codec Pipeline Invariants

This document describes the contracts that the codec pipeline, codec
chain, sharding codec, and buffer abstractions rely on. Each invariant
is enforced by a runtime test in `tests/test_codec_invariants.py`.
Any change to the pipeline that violates one of these invariants
should fail those tests immediately, instead of waiting for an
end-to-end test in a particular configuration to surface the bug.

## Why this exists

The new `SyncCodecPipeline` accumulated several correctness bugs that
all had the same shape: **case-by-case reasoning about how codec /
shard / IO / buffer pieces interact, missing a combination.** Examples:

- The `ChunkTransform` cached prototype-bearing specs at construction
  time, so per-call GPU prototypes were silently discarded.
- The byte-range write path produced a dense layout, breaking the
  shard-compactness contract that downstream code (and existing
  tests) depended on.
- The shard index decoded from a read-only buffer crashed when the
  byte-range path tried to mutate it.
- Empty inner chunks weren't compacted out of partial shard writes.

In each case, the code "worked" for the configuration the author was
thinking about, and broke for a different one. The invariants below
are the contracts the author should have written down first.

## Codec chain invariants

### C1. Codecs only mutate the array spec's `shape`.

`Codec.resolve_metadata(spec)` returns an `ArraySpec` that may differ
from `spec` in `shape` (e.g. `TransposeCodec` permutes it). Every
other field — `prototype`, `dtype`, `fill_value`, `config` — is
unchanged.

**Consequence:** A `ChunkTransform` cannot bake any field other than
`shape` into a cache. Anything else (notably `prototype`) must be
read from the runtime `chunk_spec` on every call.

**Test:** for every registered codec, call `resolve_metadata(spec)`
with various specs and assert `result.prototype is spec.prototype`,
`result.dtype == spec.dtype`, and so on.
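
The C1 check can be sketched as follows. The `ArraySpec` and `TransposeCodec` here are toy stand-ins (the real classes live in the codebase), but the shape of the assertion is the point: identity on `prototype`, equality on the rest.

```python
# Toy model of the C1 invariant check; ArraySpec/TransposeCodec here
# are simplified stand-ins for the real classes.
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ArraySpec:
    shape: tuple
    dtype: str
    fill_value: int
    prototype: object  # buffer prototype, treated as opaque here


class TransposeCodec:
    def __init__(self, order):
        self.order = order

    def resolve_metadata(self, spec):
        # C1: only `shape` may change; every other field passes through.
        return replace(spec, shape=tuple(spec.shape[i] for i in self.order))


sentinel_proto = object()
spec = ArraySpec(shape=(2, 3, 4), dtype="u1", fill_value=0, prototype=sentinel_proto)
out = TransposeCodec(order=(2, 0, 1)).resolve_metadata(spec)

assert out.shape == (4, 2, 3)
assert out.prototype is spec.prototype  # identity, not mere equality
assert out.dtype == spec.dtype
assert out.fill_value == spec.fill_value
```

Note the `is` check on `prototype`: a codec that rebuilt the spec with an equal-but-different prototype object would still break GPU-buffer callers.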

### C2. Each codec call receives the spec at its position in the chain.

A `chunk_spec` flows through the codec chain. The AA (array-to-array)
codecs see the running spec (each step potentially changes shape).
The AB (array-to-bytes) codec sees the spec after all AA codecs have
resolved. The BB (bytes-to-bytes) codecs all see the post-AB spec.

**Consequence:** the pipeline must walk the AA codecs to get the AB
spec for any given input. The walk is cheap; what matters is that
the pipeline walks from the *runtime* `chunk_spec`, not from a
constructor-time spec.

**Test:** a fake codec that asserts `chunk_spec.prototype is X`, where
`X` was passed to the pipeline at call time.
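
A minimal sketch of that test idea, with hypothetical names (dict-based specs instead of the real `ArraySpec`): a recording codec captures the prototype it sees, and the pipeline must forward the runtime spec rather than one captured at construction.

```python
# Sketch of the C2 test: the codec must observe per-call prototypes.
class RecordingCodec:
    def __init__(self):
        self.seen = []

    def decode(self, data, chunk_spec):
        self.seen.append(chunk_spec["prototype"])
        return data


class Pipeline:
    def __init__(self, codec, construction_spec):
        self.codec = codec
        # Caching this spec is the bug C2 guards against: it must never
        # be substituted for the runtime chunk_spec below.
        self._cached = construction_spec

    def decode(self, data, chunk_spec):
        # Correct: forward the runtime spec, never self._cached.
        return self.codec.decode(data, chunk_spec)


cpu, gpu = object(), object()
codec = RecordingCodec()
pipeline = Pipeline(codec, construction_spec={"prototype": cpu})
pipeline.decode(b"...", {"prototype": cpu})
pipeline.decode(b"...", {"prototype": gpu})

assert codec.seen == [cpu, gpu]  # per-call prototypes, not a cached one
```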

### C3. The pipeline owns IO unless a codec opts in to partial encode/decode.

Codecs are pure compute. The pipeline is responsible for fetching
encoded blobs from the store, calling `decode_chunk`, and writing
encoded results back. The exception is when the AB codec implements
`ArrayBytesCodecPartialDecodeMixin` / `ArrayBytesCodecPartialEncodeMixin`
— then the codec receives the store handle and handles its own IO
(this is how `ShardingCodec` does byte-range reads/writes).

**Consequence:** A pipeline must check `supports_partial_encode` /
`supports_partial_decode` and dispatch to the codec when true. It must
not branch on codec type (`isinstance(..., ShardingCodec)`).

**Test:** the pipeline should produce identical results when forced
through both code paths (force a partial-encode-capable codec to use
the generic path, and vice versa where possible).
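
The dispatch rule can be sketched with hypothetical names (a dict as the store, simplified codec classes): the pipeline asks the codec for its capability flag instead of checking its concrete type.

```python
# Sketch of capability-based dispatch (C3). Names are illustrative,
# not the real pipeline API.
class GenericCodec:
    supports_partial_decode = False

    def decode(self, blob, spec):
        return ("decoded", blob)


class PartialCodec:
    supports_partial_decode = True

    def decode_partial(self, store, key, byte_ranges, spec):
        # The codec owns IO here: it would fetch only the ranges it needs.
        return [("partial", key, r) for r in byte_ranges]


def pipeline_read(codec, store, key, byte_ranges, spec):
    # Dispatch on the capability flag, never on isinstance(codec, ...).
    if getattr(codec, "supports_partial_decode", False):
        return codec.decode_partial(store, key, byte_ranges, spec)
    blob = store[key]  # the pipeline owns IO on the generic path
    return codec.decode(blob, spec)


store = {"c/0": b"\x00\x01\x02"}
assert pipeline_read(GenericCodec(), store, "c/0", None, None) == ("decoded", b"\x00\x01\x02")
assert pipeline_read(PartialCodec(), store, "c/0", [(0, 2)], None) == [("partial", "c/0", (0, 2))]
```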

## Shard layout invariants

### S1. The shard layout is *compact*, not dense.

A shard blob contains only the inner chunks that are present, packed
together (in Morton order). Absent chunks are recorded in the index
with `(MAX_UINT_64, MAX_UINT_64)` and consume no space in the data
region.

**Consequence:** Even with fixed-size inner codecs, the shard size
varies with how many chunks are present. Anything that writes a shard
must produce the compact layout.

**Counter-example we hit:** the byte-range fast path wrote every
affected chunk into a fixed slot, producing a dense layout. This
worked for reads (the index correctly recorded offsets) but broke
size-checking tests and wasted space.
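
The compact layout can be illustrated with a toy builder (the real index encoding is defined by the sharding spec; this just shows where the bytes go):

```python
# Toy illustration of the compact shard layout (S1).
MAX_UINT_64 = 2**64 - 1


def build_compact_shard(inner_chunks):
    """inner_chunks: encoded-chunk bytes, or None for an absent chunk.
    Assumed to already be ordered (Morton order in the real layout)."""
    data = bytearray()
    index = []
    for chunk in inner_chunks:
        if chunk is None:
            # Absent chunk: sentinel entry, zero bytes in the data region.
            index.append((MAX_UINT_64, MAX_UINT_64))
        else:
            index.append((len(data), len(chunk)))
            data += chunk
    return bytes(data), index


data, index = build_compact_shard([b"aaaa", None, b"bb", None])
assert len(data) == 6  # only present chunks consume space
assert index == [
    (0, 4),
    (MAX_UINT_64, MAX_UINT_64),
    (4, 2),
    (MAX_UINT_64, MAX_UINT_64),
]
```

A dense layout (the counter-example above) would have reserved four fixed slots here regardless of how many chunks are present.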

### S2. `write_empty_chunks=False` (the default) means an inner chunk that equals `fill_value` must NOT be written.

When merged data equals the fill value, the chunk is omitted from
the shard entirely (no entry in the data region, `MAX_UINT_64` in
the index). If all chunks become absent, the entire shard is deleted.

**Consequence:** any partial-shard-write code path must compute the
merged chunk content, check it against `fill_value`, and omit the
chunk if they match.

**Test:** write fill-value data to a sharded array; assert the
relevant store keys do not exist.
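
The omission rule reduces to a small filter; this sketch uses plain lists for chunk data and a hypothetical helper name, but the branch structure is the invariant:

```python
# Sketch of S2: a merged chunk equal to fill_value is omitted unless
# the user opted in with write_empty_chunks=True.
def chunks_to_write(merged, fill_value, write_empty_chunks=False):
    """merged: mapping of chunk key -> merged chunk values (toy model)."""
    out = {}
    for key, values in merged.items():
        if not write_empty_chunks and all(v == fill_value for v in values):
            # Omitted: no data-region entry, MAX_UINT_64 in the index.
            continue
        out[key] = values
    return out


merged = {"0.0": [0, 0, 0, 0], "0.1": [0, 1, 0, 0]}
assert set(chunks_to_write(merged, fill_value=0)) == {"0.1"}
assert set(chunks_to_write(merged, 0, write_empty_chunks=True)) == {"0.0", "0.1"}
```

The "all chunks absent" case falls out naturally: when `chunks_to_write` returns empty, the caller deletes the shard key.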

### S3. The byte-range fast path requires `write_empty_chunks=True`.

Byte-range writes update fixed slots in an existing shard blob. They
cannot compact away chunks that newly equal `fill_value` (that would
require rewriting the whole blob anyway). The optimization is only
valid when the user explicitly opts out of empty-chunk skipping.

**Consequence:** `_encode_partial_sync`'s byte-range path is gated
on `not skip_empty`. Under the default config, partial shard writes
take the full-rewrite path.
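
The gate itself is a one-line condition; this sketch names it explicitly (function and flag names are illustrative, not the real ones):

```python
# Sketch of the S3 gate: the fast path is only legal when empty-chunk
# skipping is disabled (write_empty_chunks=True means skip_empty=False).
def choose_write_path(skip_empty, shard_exists):
    if not skip_empty and shard_exists:
        return "byte-range"  # fixed-slot updates, no compaction required
    return "full-rewrite"    # must compact, and possibly delete the shard


assert choose_write_path(skip_empty=True, shard_exists=True) == "full-rewrite"
assert choose_write_path(skip_empty=False, shard_exists=True) == "byte-range"
assert choose_write_path(skip_empty=False, shard_exists=False) == "full-rewrite"
```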

## Buffer invariants

### B1. Buffers returned from store IO may be read-only.

`LocalStore` returns mmap-backed buffers; `ZipStore` returns views
into a zip member; remote stores may return immutable buffers.

**Consequence:** any code that decodes from a store-returned buffer
and then mutates the result must `.copy()` first.

**Counter-example we hit:** `_ShardIndex.set_chunk_slice` mutates
`self.offsets_and_lengths`. When the index was decoded from a
read-only buffer, this raised `ValueError: assignment destination
is read-only`.
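
The failure mode is easy to reproduce standalone with NumPy, since `np.frombuffer` over immutable bytes yields a read-only array, just like a decoded mmap-backed store buffer:

```python
# Sketch of B1: an index decoded from store bytes is read-only and
# must be copied before mutation.
import numpy as np

blob = b"\x01\x00\x02\x00"            # stands in for a store-returned buffer
index = np.frombuffer(blob, dtype="<u2")
assert not index.flags.writeable       # view over immutable memory

try:
    index[0] = 99                      # what set_chunk_slice effectively did
except ValueError:
    pass                               # "assignment destination is read-only"

index = index.copy()                   # the fix: copy before mutating
index[0] = 99
assert index[0] == 99
```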

### B2. `decode_chunk` may return a view of its input.

`BytesCodec._decode_sync` returns an `NDBuffer` that views the same
memory as the input `Buffer`. Subsequent mutations affect both.

**Consequence:** if the caller intends to mutate the decoded array
(e.g. for a partial-write merge), it must `.copy()` first.
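
The aliasing is the same as NumPy's zero-copy `frombuffer` over mutable memory, which makes the hazard easy to demonstrate without the real codec:

```python
# Sketch of B2: a zero-copy decode aliases the encoded bytes, so
# mutating the decoded array silently changes the input too.
import numpy as np

encoded = bytearray(b"\x01\x02\x03\x04")       # mutable buffer memory
decoded = np.frombuffer(encoded, dtype="u1")   # zero-copy view, like _decode_sync
decoded[0] = 9
assert encoded[0] == 9                         # the input changed too!

safe = np.frombuffer(encoded, dtype="u1").copy()  # merging callers must copy
safe[1] = 7
assert encoded[1] == 2                         # original untouched
```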

### B3. The buffer prototype is a runtime parameter, not metadata.

The `prototype` field of `ArraySpec` indicates what kind of buffer
the *caller* wants for this operation. The same array can be read
into CPU buffers on one call and GPU buffers on the next. The
codec pipeline must use the per-call prototype, not a cached one.

**Consequence:** see C1 and C2 above.

## Test plan

`tests/test_codec_invariants.py` should contain:

1. **C1 enforcement:** a parametric test over all registered codecs
   that calls `resolve_metadata` with a sentinel `ArraySpec` and
   asserts the non-shape fields are unchanged.

2. **C2 enforcement:** a test that passes different prototypes to a
   `ChunkTransform` on each call (using a fake codec that records
   the prototype it was called with) and asserts the recorded
   prototypes match.

3. **S1 + S2 enforcement:** a test that writes various combinations
   of present/absent chunks to a sharded array and asserts the
   resulting shard sizes match the compact-layout formula. Run for
   both `BatchedCodecPipeline` and `SyncCodecPipeline`.

4. **S3 enforcement:** a test that uses a mock store to record
   `set_range_sync` calls. Verify that under the default config no
   `set_range_sync` happens for partial shard writes; under
   `write_empty_chunks=True` it does.

5. **B1 enforcement:** a test that creates a sharded array on
   `LocalStore`, then triggers a partial write, and asserts no
   `ValueError: assignment destination is read-only` is raised.

6. **B2 enforcement:** a test that calls `BytesCodec._decode_sync`
   and then mutates the result, and verifies the input buffer is not
   modified (or, if we accept the view semantics, documents that
   callers must copy and adds a test that decode-then-mutate
   without a copy gives a clear error / known behavior).