|
| 1 | +# ADR 0003 — Verifier ↔ Slab Pool Integration |
| 2 | + |
| 3 | +- **Status**: Accepted |
| 4 | +- **Date**: 2026-05-24 |
| 5 | +- **Decision drivers**: Memory accounting accuracy, multi-session |
| 6 | + serving correctness, engineering risk vs reward at v0.2.0 scope. |
| 7 | +- **Depends on**: ADR 0001, ADR 0002. |
| 8 | +- **Supersedes**: nothing. |
| 9 | + |
| 10 | +## 1. Context |
| 11 | + |
| 12 | +ADR 0001 §5.3 and `docs/local-inference-engine.md` envisioned a |
| 13 | +fixed-slab KV pool replacing the verifier's `transformers.cache_utils.DynamicCache` |
| 14 | +entirely. PR #8 shipped the slab pool and admission scheduler; PR |
| 15 | +#12 wired HTTP routes through that scheduler; PR #13 added Prometheus |
| 16 | +metrics including `scheduler_pool_in_use` and `scheduler_pool_total` |
| 17 | +gauges. |
| 18 | + |
| 19 | +There is one residual asymmetry: the slab tensors handed out by |
| 20 | +`SlabPool.acquire()` are currently **placeholder bookkeeping bytes** |
| 21 | +(1-element bf16 tensors per slab in the default placeholder pool; |
| 22 | +~4 bytes total). The verifier's actual KV cache continues to live in |
| 23 | +the `DynamicCache` that `transformers` allocates and manages. This |
| 24 | +means: |
| 25 | + |
| 26 | +- `scheduler_pool_in_use` reports the count of held slabs honestly, |
| 27 | + but `slab.kv_bytes` and `slab.live_kv_bytes` are misleading: the |
| 28 | + numbers reflect the placeholder tensors, not the real KV memory |
| 29 | + the session is consuming. |
| 30 | +- A multi-session deployment with `max_concurrent=N` actually holds |
| 31 | + `N × DynamicCache_bytes` of KV in `transformers`-managed memory. |
| 32 | + None of that shows up in the slab pool's `total_kv_bytes` property. |
| 33 | +- The original design vision — *the slab pool's tensors ARE the |
| 34 | + verifier's KV cache* — would close this gap by making the slab |
| 35 | + tensors hold the real K/V data and having the model forward |
| 36 | + consume them directly. |
| 37 | + |
| 38 | +## 2. The full refactor and why we are not doing it now |
| 39 | + |
| 40 | +The full refactor target replaces `DynamicCache` with a custom |
| 41 | +`SlabBackedCache` subclass that: |
| 42 | + |
| 43 | +1. Implements every method on `transformers.cache_utils.Cache` that |
| 44 | + the Qwen3 forward uses (`update`, `get_seq_length`, |
| 45 | + `crop_past_key_values`, layer-iteration, etc.). |
| 46 | +2. Stores K/V layer tensors as views into the slab's pre-allocated |
| 47 | + `[num_layers, num_heads, capacity, head_dim]` buffers rather |
| 48 | + than allocating fresh per-step tensors. |
| 49 | +3. Routes the sink+window trim through `KVSlab.append` / |
| 50 | + `KVSlab.truncate` / the existing window-slide logic. |
| 51 | +4. Preserves RoPE correctness: surviving K vectors keep the rotation |
| 52 | + they had at their original positions, and new keys rotate at |
| 53 | + their true global position. |
| 54 | +5. Preserves the speculative decoder's bit-equivalence with vanilla |
| 55 | + greedy AR (the existing test contract). |
| 56 | + |
| 57 | +This is a substantial body of work. Two factors push the engineering |
| 58 | +risk meaningfully higher than a typical refactor: |
| 59 | + |
| 60 | +- **Correctness fragility.** `transformers` 4.x's `Cache` API has |
| 61 | + documented behaviors but no formal contract. Subtle wrong-output |
| 62 | + bugs from a slightly off `cache_position` or `update()` semantic |
| 63 | + would not show up in our current test suite — we have no |
| 64 | + bit-equivalence harness comparing a `SlabBackedCache` run against |
| 65 | + a `DynamicCache` run on the same prompt. Without that test |
| 66 | + infrastructure, "the tests pass" does not mean "the model is |
| 67 | + generating correctly". |
| 68 | +- **Cross-version churn.** Qwen3's modeling code lives inside |
| 69 | + `transformers`; its expectations of `past_key_values` change |
| 70 | + across `transformers` minor versions. A `SlabBackedCache` that |
| 71 | + works on 4.45 may break silently on 4.52. Maintenance load is |
| 72 | + unbounded until we add a CI matrix that exercises both ends of |
| 73 | + our pinned `transformers` range. |
| 74 | + |
| 75 | +The combination of "high probability of subtle wrong-output bugs" |
| 76 | +and "no test infrastructure to detect them" makes shipping the full |
| 77 | +refactor in v0.2.0 a poor risk/reward trade. We defer it. |
| 78 | + |
| 79 | +## 3. Decision: ship an intermediate step now, full refactor in v0.3 |
| 80 | + |
| 81 | +For v0.2.0, we ship the **smallest concrete step that makes the |
| 82 | +metrics accurate** without modifying the verifier's model-forward |
| 83 | +path: |
| 84 | + |
| 85 | +1. `KVSlab` gains a `live_kv_bytes_override: Optional[int]` attribute |
| 86 | + and the `live_kv_bytes` property returns the override when set. |
| 87 | +2. A new `inference_engine/scheduler/pooled_verifier.py` defines |
| 88 | + `PooledVerifier`, a wrapper around any verifier (PyTorch |
| 89 | + `SinkWindowVerifier` or `MLXSinkWindowVerifier`) that: |
| 90 | + - Holds an optional reference to a `SlabPool`. |
| 91 | + - On `prefill()`: acquires a slab (releasing any previously |
| 92 | + held one). |
| 93 | + - On `reset()`: releases the held slab, if any. |
| 94 | + - After every forward (`prefill` / `forward_block` / `append_token` |
| 95 | + / `commit_or_truncate`): writes the verifier's real |
| 96 | + `stats.peak_kv_bytes` snapshot into the slab's |
| 97 | + `live_kv_bytes_override`, so `scheduler_pool_in_use_bytes` |
| 98 | + (a future metric) and `slab.live_kv_bytes` report real numbers. |
| 99 | +3. `Scheduler.submit()` continues to acquire / release placeholder |
| 100 | + slabs as today; integrators wiring real verifiers into the |
| 101 | + scheduler use `PooledVerifier(verifier, scheduler.pool)` to bind |
| 102 | + the two. |
| 103 | +4. The slab tensors stay as placeholders. The verifier's K/V |
| 104 | + tensors stay in `DynamicCache`. Behavior under model forward is |
| 105 | + bit-identical to v0.1.0. |
| 106 | + |
| 107 | +The intermediate step costs ~150 lines of code + tests. It cannot |
| 108 | +introduce wrong-output bugs because it does not touch the model |
| 109 | +forward. |
| 110 | + |
| 111 | +## 4. Acceptance criteria for v0.3 (the full refactor) |
| 112 | + |
| 113 | +When the full refactor lands in a future PR, it must: |
| 114 | + |
| 115 | +1. **Pass a bit-equivalence test** comparing N tokens of greedy AR |
| 116 | + output between (a) the old `DynamicCache` path and (b) the new |
| 117 | + `SlabBackedCache` path on real Qwen3-1.7B for at least three |
| 118 | + distinct prompts including one ≥ 256 tokens. |
| 119 | +2. **Run on both ends of the supported `transformers` range** |
| 120 | + (currently 4.45.x and 4.52.x; may shift). CI gains a matrix. |
| 121 | +3. **Preserve sink+window trim correctness**: a regression test |
| 122 | + exercises a session that exceeds `sink_size + window_size` by |
| 123 | + ≥ 50 % so the slide path runs. |
| 124 | +4. **Show measurable memory savings** in the |
| 125 | + `bench_mlx_verifier_quant.py`-style comparison: total resident |
| 126 | + memory at `B=N, S=8192` should be ≤ 1.05× of the analytical |
| 127 | + prediction `N * (sink+window) * num_layers * num_heads * head_dim * 2`. |
| 128 | +5. **Be reversible**: a `--legacy-cache` flag on `scripts/serve.py` |
| 129 | + (or a config switch) keeps the `DynamicCache` path available for |
| 130 | + one minor release in case the refactor surfaces a real-world |
| 131 | + issue we miss in CI. |
| 132 | + |
| 133 | +The full refactor has its own ADR (planned 0005) at the time it |
| 134 | +ships, which records the test fixtures, the memory measurements, |
| 135 | +and the version matrix. |
| 136 | + |
| 137 | +## 5. Alternatives Considered |
| 138 | + |
| 139 | +### 5.1 Ship the full refactor in v0.2.0 (rejected — see §2) |
| 140 | + |
| 141 | +### 5.2 Ship nothing for #3 in v0.2.0; tag v0.2.0 without it (rejected) |
| 142 | + |
| 143 | +The user-visible `scheduler_pool_in_use` gauge is misleading today. |
| 144 | +Even a small accuracy improvement is worth shipping. Status-quo |
| 145 | +silence on this asymmetry leaves operators unable to size pool |
| 146 | +capacity from telemetry alone. |
| 147 | + |
| 148 | +### 5.3 Replace `DynamicCache` only on the MLX backend first (deferred) |
| 149 | + |
| 150 | +MLX's `inference_engine.backends.mlx.cache.SinkWindowKVCache` |
| 151 | +already manages slab-like fixed buffers. Unifying it under |
| 152 | +`KVSlab` is structurally cleaner than the PyTorch `DynamicCache` |
| 153 | +path because we control the entire MLX cache implementation. It is |
| 154 | +attractive as a smaller proving ground for the full refactor — but |
| 155 | +deferring it to a separate PR alongside the PyTorch refactor lets |
| 156 | +both share the bit-equivalence harness rather than each inventing |
| 157 | +its own. |
| 158 | + |
| 159 | +## 6. Consequences |
| 160 | + |
| 161 | +### 6.1 Positive |
| 162 | + |
| 163 | +- **Metrics become honest** for v0.2.0 deployments that wire |
| 164 | + `PooledVerifier` into the scheduler. `slab.live_kv_bytes` reports |
| 165 | + real KV memory; `scheduler_pool_in_use` plus a follow-up |
| 166 | + `scheduler_pool_kv_bytes` metric give operators the data to size |
| 167 | + pool capacity. |
| 168 | +- **The full refactor's test infrastructure can be specified |
| 169 | + upfront** (§4) rather than retrofitted after a problem is |
| 170 | + observed in production. |
| 171 | +- **No correctness risk introduced now**. The model forward path is |
| 172 | + unchanged. |
| 173 | + |
| 174 | +### 6.2 Negative / accepted trade-offs |
| 175 | + |
| 176 | +- The slab pool's `kv_bytes` and `total_kv_bytes` properties remain |
| 177 | + reporting placeholder bytes for v0.2.0 deployments that don't |
| 178 | + wire `PooledVerifier`. They become accurate only via the wrapper. |
| 179 | + This is documented in `inference_engine.memory.pool` docstring. |
| 180 | +- Two cache paths coexist in the codebase (DynamicCache via verifier, |
| 181 | + KVSlab via pool) until v0.3. Code reviewers must hold both in |
| 182 | + mind. This is the cost of staging a high-risk refactor. |
| 183 | + |
| 184 | +### 6.3 Implications for code |
| 185 | + |
| 186 | +- `inference_engine/memory/slab.py`: add |
| 187 | + `live_kv_bytes_override: Optional[int]` and modify the |
| 188 | + `live_kv_bytes` property. |
| 189 | +- `inference_engine/scheduler/pooled_verifier.py` (new): the |
| 190 | + wrapper class. |
| 191 | +- `inference_engine/scheduler/__init__.py`: export `PooledVerifier`. |
| 192 | +- README + this ADR cross-referenced from |
| 193 | + `docs/local-inference-engine.md`. |
| 194 | +- Tests: pure-CPU unit tests against a `_FakeVerifier` real |
| 195 | + concrete class. No HF cache required for CI. |
| 196 | + |
| 197 | +## 7. Validation |
| 198 | + |
| 199 | +This ADR is considered validated when: |
| 200 | + |
| 201 | +1. The intermediate step (§3) is implemented with 100% line coverage |
| 202 | + on the new code. |
| 203 | +2. A walkthrough of `inference_engine.memory` and |
| 204 | + `inference_engine.scheduler` documents which paths are |
| 205 | + "placeholder bookkeeping" and which produce real KV byte counts. |
| 206 | +3. The full refactor's acceptance criteria (§4) are restated in |
| 207 | + the future ADR 0005 when that PR opens — this ADR's §4 is |
| 208 | + normative for that future work. |
| 209 | + |
| 210 | +## 8. References |
| 211 | + |
| 212 | +- ADR 0001 — proposer sizing + alignment. |
| 213 | +- ADR 0002 — verifier selection + quantization. |
| 214 | +- `docs/local-inference-engine.md` — original engine architecture. |
| 215 | +- PR #8 (E3 slab pool), PR #9 (E4 scheduler), PR #12 (E2↔E4 |
| 216 | + integration), PR #13 (metrics). |
| 217 | +- `transformers.cache_utils.Cache` — the contract a future |
| 218 | + `SlabBackedCache` must implement. |
0 commit comments