- Status: Accepted
- Date: 2026-05-24
- Decision drivers: Memory accounting accuracy, multi-session serving correctness, engineering risk vs reward at v0.2.0 scope.
- Depends on: ADR 0001, ADR 0002.
- Supersedes: nothing.
ADR 0001 §5.3 and docs/local-inference-engine.md envisioned a
fixed-slab KV pool replacing the verifier's transformers.cache_utils.DynamicCache
entirely. PR #8 shipped the slab pool and admission scheduler; PR
#12 wired HTTP routes through that scheduler; PR #13 added Prometheus
metrics including scheduler_pool_in_use and scheduler_pool_total
gauges.
There is one residual asymmetry: the slab tensors handed out by
SlabPool.acquire() are currently placeholder bookkeeping bytes
(1-element bf16 tensors per slab in the default placeholder pool;
~4 bytes total). The verifier's actual KV cache continues to live in
the DynamicCache that transformers allocates and manages. This
means:
- scheduler_pool_in_use reports the count of held slabs honestly, but slab.kv_bytes and slab.live_kv_bytes are misleading: the numbers reflect the placeholder tensors, not the real KV memory the session is consuming.
- A multi-session deployment with max_concurrent=N actually holds N × DynamicCache_bytes of KV in transformers-managed memory. None of that shows up in the slab pool's total_kv_bytes property.
- The original design vision (the slab pool's tensors ARE the verifier's KV cache) would close this gap by making the slab tensors hold the real K/V data and having the model forward consume them directly.
The full refactor target replaces DynamicCache with a custom
SlabBackedCache subclass that:
- Implements every method on transformers.cache_utils.Cache that the Qwen3 forward uses (update, get_seq_length, crop_past_key_values, layer-iteration, etc.).
- Stores K/V layer tensors as views into the slab's pre-allocated [num_layers, num_heads, capacity, head_dim] buffers rather than allocating fresh per-step tensors.
- Routes the sink+window trim through KVSlab.append / KVSlab.truncate / the existing window-slide logic.
- Preserves RoPE correctness: surviving K vectors keep the rotation they had at their original positions, and new keys rotate at their true global position.
- Preserves the speculative decoder's bit-equivalence with vanilla greedy AR (the existing test contract).
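The sink+window and RoPE requirements in the list above can be illustrated with a toy, pure-Python sketch. It is not the KVSlab implementation: the function names trim_sink_window and append_token and the (position, key) tuple representation are hypothetical, chosen only to show that survivors of a trim keep the position at which they were originally rotated.

```python
from typing import List, Tuple

# (original global position, stand-in for an already-rotated K vector)
Entry = Tuple[int, str]


def trim_sink_window(entries: List[Entry], sink_size: int, window_size: int) -> List[Entry]:
    """Keep the first `sink_size` entries and the last `window_size` entries.

    Entries are never re-indexed: each survivor keeps its original position.
    """
    if len(entries) <= sink_size + window_size:
        return list(entries)  # nothing to evict yet
    sink = entries[:sink_size]
    window = entries[len(entries) - window_size:]
    return sink + window


def append_token(entries: List[Entry], position: int,
                 sink_size: int, window_size: int) -> List[Entry]:
    """Append a new key at its true global position, then trim."""
    entries = entries + [(position, f"k{position}")]
    return trim_sink_window(entries, sink_size, window_size)


cache: List[Entry] = []
for pos in range(10):
    cache = append_token(cache, pos, sink_size=2, window_size=3)

kept_positions = [p for p, _ in cache]
print(kept_positions)  # [0, 1, 7, 8, 9]
```

After the slide, the cache holds five entries but the newest key still carries position 9, not index 4. A cache that re-derived positions from buffer indices would rotate new keys incorrectly.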
This is a substantial body of work. Two factors push the engineering risk meaningfully higher than a typical refactor:
- Correctness fragility. transformers 4.x's Cache API has documented behaviors but no formal contract. Subtle wrong-output bugs from a slightly off cache_position or update() semantic would not show up in our current test suite: we have no bit-equivalence harness comparing a SlabBackedCache run against a DynamicCache run on the same prompt. Without that test infrastructure, "the tests pass" does not mean "the model is generating correctly".
- Cross-version churn. Qwen3's modeling code lives inside transformers; its expectations of past_key_values change across transformers minor versions. A SlabBackedCache that works on 4.45 may break silently on 4.52. Maintenance load is unbounded until we add a CI matrix that exercises both ends of our pinned transformers range.
The combination of "high probability of subtle wrong-output bugs" and "no test infrastructure to detect them" makes shipping the full refactor in v0.2.0 a poor risk/reward trade. We defer it.
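To make the missing test infrastructure concrete, here is a minimal sketch of what a bit-equivalence harness could look like. It assumes each cache path can be wrapped in a callable that maps (prompt, max_new_tokens) to a list of token IDs. The names check_bit_equivalence and first_divergence and the two fake generators are hypothetical stand-ins, not existing project code.

```python
from typing import Callable, List, Sequence

# (prompt, max_new_tokens) -> generated token IDs
GenerateFn = Callable[[str, int], List[int]]


def first_divergence(a: Sequence[int], b: Sequence[int]) -> int:
    """Return the index of the first differing token, or -1 if identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return -1 if len(a) == len(b) else min(len(a), len(b))


def check_bit_equivalence(reference: GenerateFn, candidate: GenerateFn,
                          prompts: Sequence[str], max_new_tokens: int) -> List[str]:
    """Run both paths on every prompt and collect readable mismatch reports."""
    failures = []
    for prompt in prompts:
        ref = reference(prompt, max_new_tokens)
        cand = candidate(prompt, max_new_tokens)
        idx = first_divergence(ref, cand)
        if idx != -1:
            failures.append(f"prompt {prompt[:20]!r}: diverges at token {idx}")
    return failures


# Stand-ins for the two cache paths; a real harness would run the model.
def fake_dynamic_cache_path(prompt: str, n: int) -> List[int]:
    return [(len(prompt) + i) % 7 for i in range(n)]


def fake_buggy_slab_path(prompt: str, n: int) -> List[int]:
    out = fake_dynamic_cache_path(prompt, n)
    if n > 3:
        out[3] += 1  # simulate a subtle off-by-one style bug at step 3
    return out


# The last prompt is 256 characters long, standing in for a long prompt.
prompts = ["hello", "a longer prompt", "x" * 256]
ok = check_bit_equivalence(fake_dynamic_cache_path, fake_dynamic_cache_path, prompts, 8)
bad = check_bit_equivalence(fake_dynamic_cache_path, fake_buggy_slab_path, prompts, 8)
print(ok)        # []
print(len(bad))  # 3
```

Reporting the first diverging index, rather than a bare pass/fail, narrows a failure to the decode step where the two caches first disagree.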
For v0.2.0, we ship the smallest concrete step that makes the metrics accurate without modifying the verifier's model-forward path:
- KVSlab gains a live_kv_bytes_override: Optional[int] attribute and the live_kv_bytes property returns the override when set.
- A new inference_engine/scheduler/pooled_verifier.py defines PooledVerifier, a wrapper around any verifier (PyTorchSinkWindowVerifier or MLXSinkWindowVerifier) that:
  - Holds an optional reference to a SlabPool.
  - On prefill(): acquires a slab (releasing any previously held one).
  - On reset(): releases the held slab, if any.
  - After every forward (prefill / forward_block / append_token / commit_or_truncate): writes the verifier's real stats.peak_kv_bytes snapshot into the slab's live_kv_bytes_override, so scheduler_pool_in_use_bytes (a future metric) and slab.live_kv_bytes report real numbers.
- Scheduler.submit() continues to acquire / release placeholder slabs as today; integrators wiring real verifiers into the scheduler use PooledVerifier(verifier, scheduler.pool) to bind the two.
- The slab tensors stay as placeholders. The verifier's K/V tensors stay in DynamicCache. Behavior under model forward is bit-identical to v0.1.0.
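The wrapper described above can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the shipped code: the names KVSlab, SlabPool.acquire, PooledVerifier, live_kv_bytes_override, and stats.peak_kv_bytes come from this ADR, while SlabPool.release, placeholder_bytes, the method signatures, and FakeVerifier's 1024-bytes-per-token model are hypothetical. forward_block and commit_or_truncate would follow the same pattern as append_token and are omitted for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class KVSlab:
    placeholder_bytes: int = 4                    # hypothetical field name
    live_kv_bytes_override: Optional[int] = None

    @property
    def live_kv_bytes(self) -> int:
        # Return the override when set, else the placeholder figure.
        if self.live_kv_bytes_override is not None:
            return self.live_kv_bytes_override
        return self.placeholder_bytes


class SlabPool:
    def __init__(self, capacity: int) -> None:
        self._free: List[KVSlab] = [KVSlab() for _ in range(capacity)]
        self.in_use = 0

    def acquire(self) -> KVSlab:
        slab = self._free.pop()                   # IndexError when exhausted
        self.in_use += 1
        return slab

    def release(self, slab: KVSlab) -> None:      # hypothetical method name
        slab.live_kv_bytes_override = None        # do not leak stale numbers
        self._free.append(slab)
        self.in_use -= 1


@dataclass
class Stats:
    peak_kv_bytes: int = 0


class FakeVerifier:
    """Stand-in verifier: pretends each cached token costs 1024 bytes."""

    def __init__(self) -> None:
        self.stats = Stats()
        self._tokens = 0

    def _grow(self, n: int) -> None:
        self._tokens += n
        self.stats.peak_kv_bytes = max(self.stats.peak_kv_bytes, self._tokens * 1024)

    def prefill(self, token_ids: List[int]) -> None:
        self._tokens = 0
        self._grow(len(token_ids))

    def append_token(self, token_id: int) -> None:
        self._grow(1)

    def reset(self) -> None:
        self._tokens = 0
        self.stats = Stats()


class PooledVerifier:
    def __init__(self, verifier, pool: Optional[SlabPool] = None) -> None:
        self.verifier = verifier
        self.pool = pool
        self.slab: Optional[KVSlab] = None

    def _release(self) -> None:
        if self.pool is not None and self.slab is not None:
            self.pool.release(self.slab)
        self.slab = None

    def _sync(self) -> None:
        # Copy the verifier's real byte count into the held slab.
        if self.slab is not None:
            self.slab.live_kv_bytes_override = self.verifier.stats.peak_kv_bytes

    def prefill(self, token_ids: List[int]) -> None:
        self._release()                           # drop any previously held slab
        if self.pool is not None:
            self.slab = self.pool.acquire()
        self.verifier.prefill(token_ids)
        self._sync()

    def append_token(self, token_id: int) -> None:
        self.verifier.append_token(token_id)
        self._sync()

    def reset(self) -> None:
        self.verifier.reset()
        self._release()


pool = SlabPool(capacity=2)
pv = PooledVerifier(FakeVerifier(), pool)
pv.prefill([1, 2, 3])
pv.append_token(4)
in_use_while_active = pool.in_use
reported_bytes = pv.slab.live_kv_bytes
print(in_use_while_active, reported_bytes)  # 1 4096
pv.reset()
print(pool.in_use)  # 0
```

The wrapper only reads the verifier's stats after delegating each call, so the forward path itself is untouched.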
The intermediate step costs ~150 lines of code + tests. It cannot introduce wrong-output bugs because it does not touch the model forward.
When the full refactor lands in a future PR, it must:
- Pass a bit-equivalence test comparing N tokens of greedy AR output between (a) the old DynamicCache path and (b) the new SlabBackedCache path on real Qwen3-1.7B for at least three distinct prompts including one ≥ 256 tokens.
- Run on both ends of the supported transformers range (currently 4.45.x and 4.52.x; may shift). CI gains a matrix.
- Preserve sink+window trim correctness: a regression test exercises a session that exceeds sink_size + window_size by ≥ 50% so the slide path runs.
- Show measurable memory savings in the bench_mlx_verifier_quant.py-style comparison: total resident memory at B=N, S=8192 should be ≤ 1.05× of the analytical prediction N * (sink+window) * num_layers * num_heads * head_dim * 2.
- Be reversible: a --legacy-cache flag on scripts/serve.py (or a config switch) keeps the DynamicCache path available for one minor release in case the refactor surfaces a real-world issue we miss in CI.
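The memory criterion above reduces to a small arithmetic check, sketched here. The helper names and the shape values are hypothetical and chosen for illustration only; they are not measurements and are not taken from a real model config. The ADR does not say whether the trailing factor of 2 counts K and V or bytes per bf16 element, so the sketch reproduces the expression as written and treats its result as the reference figure.

```python
def predicted_kv_budget(n_sessions: int, sink: int, window: int,
                        num_layers: int, num_heads: int, head_dim: int) -> int:
    """Analytical prediction, term for term as written in the criterion."""
    return n_sessions * (sink + window) * num_layers * num_heads * head_dim * 2


def within_budget(measured: int, predicted: int, tolerance: float = 1.05) -> bool:
    """True when measured memory is at most `tolerance` times the prediction."""
    return measured <= tolerance * predicted


# Hypothetical shapes for illustration only.
pred = predicted_kv_budget(n_sessions=4, sink=4, window=1020,
                           num_layers=28, num_heads=8, head_dim=128)
print(pred)                               # 234881024
print(within_budget(240_000_000, pred))   # True
print(within_budget(250_000_000, pred))   # False
```

Note that the prediction depends on sink+window rather than on the sequence length S, which is the point of the criterion: with the trim in place, resident memory at S=8192 should not grow with S.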
The full refactor will get its own ADR (planned 0005) when it ships, recording the test fixtures, the memory measurements, and the version matrix.
The user-visible scheduler_pool_in_use gauge counts held slabs
correctly, but the byte figures behind those slabs are placeholders
today. Even a small accuracy improvement is worth shipping. Staying
silent on this asymmetry leaves operators unable to size pool
capacity from telemetry alone.
MLX's inference_engine.backends.mlx.cache.SinkWindowKVCache
already manages slab-like fixed buffers. Unifying it under
KVSlab is structurally cleaner than the PyTorch DynamicCache
path because we control the entire MLX cache implementation. It is
attractive as a smaller proving ground for the full refactor — but
deferring it to a separate PR alongside the PyTorch refactor lets
both share the bit-equivalence harness rather than each inventing
its own.
- Metrics become honest for v0.2.0 deployments that wire PooledVerifier into the scheduler. slab.live_kv_bytes reports real KV memory; scheduler_pool_in_use plus a follow-up scheduler_pool_kv_bytes metric give operators the data to size pool capacity.
- The full refactor's test infrastructure can be specified upfront (§4) rather than retrofitted after a problem is observed in production.
- No correctness risk introduced now. The model forward path is unchanged.
- The slab pool's kv_bytes and total_kv_bytes properties continue to report placeholder bytes for v0.2.0 deployments that don't wire PooledVerifier. They become accurate only via the wrapper. This is documented in the inference_engine.memory.pool docstring.
- Two cache paths coexist in the codebase (DynamicCache via verifier, KVSlab via pool) until v0.3. Code reviewers must hold both in mind. This is the cost of staging a high-risk refactor.
- inference_engine/memory/slab.py: add live_kv_bytes_override: Optional[int] and modify the live_kv_bytes property.
- inference_engine/scheduler/pooled_verifier.py (new): the wrapper class.
- inference_engine/scheduler/__init__.py: export PooledVerifier.
- README + this ADR cross-referenced from docs/local-inference-engine.md.
- Tests: pure-CPU unit tests against a _FakeVerifier implemented as a real concrete class. No HF cache required for CI.
This ADR is considered validated when:
- The intermediate step (§3) is implemented with 100% line coverage on the new code.
- A walkthrough of inference_engine.memory and inference_engine.scheduler documents which paths are "placeholder bookkeeping" and which produce real KV byte counts.
- The full refactor's acceptance criteria (§4) are restated in the future ADR 0005 when that PR opens; this ADR's §4 is normative for that future work.
- ADR 0001: proposer sizing + alignment.
- ADR 0002: verifier selection + quantization.
- docs/local-inference-engine.md: original engine architecture.
- PR #8 (E3 slab pool), PR #9 (E4 scheduler), PR #12 (E2↔E4 integration), PR #13 (metrics).
- transformers.cache_utils.Cache: the contract a future SlabBackedCache must implement.