Skip to content

Commit 9d2c8d6

Browse files
Merge PR #16: ADR 0003 + PooledVerifier (intermediate step)
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
2 parents 857ac8a + 182e327 commit 9d2c8d6

8 files changed

Lines changed: 745 additions & 3 deletions

File tree

README.md

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -461,3 +461,10 @@ explicitly rejected.
461461
memory rule for choosing bf16 vs 4-bit, and why closed-weight APIs
462462
(GPT/Claude/Gemini) cannot be aligned with EAGLE-3 and are out of
463463
scope for v1 / v2.
464+
- [ADR 0003 — Verifier ↔ slab pool integration: deferred refactor +
465+
intermediate step](docs/adr/0003-verifier-slab-pool-integration.md):
466+
why the full "slab tensors hold the real KV" refactor is deferred
467+
to v0.3 (correctness fragility without a bit-equivalence harness)
468+
and what intermediate step ships in v0.2 — `PooledVerifier`
469+
wrapper that makes pool memory accounting accurate without
470+
touching the model forward.
Lines changed: 218 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,218 @@
1+
# ADR 0003 — Verifier ↔ Slab Pool Integration
2+
3+
- **Status**: Accepted
4+
- **Date**: 2026-05-24
5+
- **Decision drivers**: Memory accounting accuracy, multi-session
6+
serving correctness, engineering risk vs reward at v0.2.0 scope.
7+
- **Depends on**: ADR 0001, ADR 0002.
8+
- **Supersedes**: nothing.
9+
10+
## 1. Context
11+
12+
ADR 0001 §5.3 and `docs/local-inference-engine.md` envisioned a
13+
fixed-slab KV pool replacing the verifier's `transformers.cache_utils.DynamicCache`
14+
entirely. PR #8 shipped the slab pool and admission scheduler; PR
15+
#12 wired HTTP routes through that scheduler; PR #13 added Prometheus
16+
metrics including `scheduler_pool_in_use` and `scheduler_pool_total`
17+
gauges.
18+
19+
There is one residual asymmetry: the slab tensors handed out by
20+
`SlabPool.acquire()` are currently **placeholder bookkeeping bytes**
21+
(1-element bf16 tensors per slab in the default placeholder pool;
22+
~4 bytes total). The verifier's actual KV cache continues to live in
23+
the `DynamicCache` that `transformers` allocates and manages. This
24+
means:
25+
26+
- `scheduler_pool_in_use` reports the count of held slabs honestly,
27+
but `slab.kv_bytes` and `slab.live_kv_bytes` are misleading: the
28+
numbers reflect the placeholder tensors, not the real KV memory
29+
the session is consuming.
30+
- A multi-session deployment with `max_concurrent=N` actually holds
31+
`N × DynamicCache_bytes` of KV in `transformers`-managed memory.
32+
None of that shows up in the slab pool's `total_kv_bytes` property.
33+
- The original design vision — *the slab pool's tensors ARE the
34+
verifier's KV cache* — would close this gap by making the slab
35+
tensors hold the real K/V data and having the model forward
36+
consume them directly.
37+
38+
## 2. The full refactor and why we are not doing it now
39+
40+
The full refactor target replaces `DynamicCache` with a custom
41+
`SlabBackedCache` subclass that:
42+
43+
1. Implements every method on `transformers.cache_utils.Cache` that
44+
the Qwen3 forward uses (`update`, `get_seq_length`,
45+
`crop_past_key_values`, layer-iteration, etc.).
46+
2. Stores K/V layer tensors as views into the slab's pre-allocated
47+
`[num_layers, num_heads, capacity, head_dim]` buffers rather
48+
than allocating fresh per-step tensors.
49+
3. Routes the sink+window trim through `KVSlab.append` /
50+
`KVSlab.truncate` / the existing window-slide logic.
51+
4. Preserves RoPE correctness: surviving K vectors keep the rotation
52+
they had at their original positions, and new keys rotate at
53+
their true global position.
54+
5. Preserves the speculative decoder's bit-equivalence with vanilla
55+
greedy AR (the existing test contract).
56+
57+
This is a substantial body of work. Two factors push the engineering
58+
risk meaningfully higher than a typical refactor:
59+
60+
- **Correctness fragility.** `transformers` 4.x's `Cache` API has
61+
documented behaviors but no formal contract. Subtle wrong-output
62+
bugs from a slightly off `cache_position` or `update()` semantic
63+
would not show up in our current test suite — we have no
64+
bit-equivalence harness comparing a `SlabBackedCache` run against
65+
a `DynamicCache` run on the same prompt. Without that test
66+
infrastructure, "the tests pass" does not mean "the model is
67+
generating correctly".
68+
- **Cross-version churn.** Qwen3's modeling code lives inside
69+
`transformers`; its expectations of `past_key_values` change
70+
across `transformers` minor versions. A `SlabBackedCache` that
71+
works on 4.45 may break silently on 4.52. Maintenance load is
72+
unbounded until we add a CI matrix that exercises both ends of
73+
our pinned `transformers` range.
74+
75+
The combination of "high probability of subtle wrong-output bugs"
76+
and "no test infrastructure to detect them" makes shipping the full
77+
refactor in v0.2.0 a poor risk/reward trade. We defer it.
78+
79+
## 3. Decision: ship an intermediate step now, full refactor in v0.3
80+
81+
For v0.2.0, we ship the **smallest concrete step that makes the
82+
metrics accurate** without modifying the verifier's model-forward
83+
path:
84+
85+
1. `KVSlab` gains a `live_kv_bytes_override: Optional[int]` attribute
86+
and the `live_kv_bytes` property returns the override when set.
87+
2. A new `inference_engine/scheduler/pooled_verifier.py` defines
88+
`PooledVerifier`, a wrapper around any verifier (PyTorch
89+
`SinkWindowVerifier` or `MLXSinkWindowVerifier`) that:
90+
- Holds an optional reference to a `SlabPool`.
91+
- On `prefill()`: acquires a slab (releasing any previously
92+
held one).
93+
- On `reset()`: releases the held slab, if any.
94+
- After every forward (`prefill` / `forward_block` / `append_token`
95+
/ `commit_or_truncate`): writes the verifier's real
96+
`stats.peak_kv_bytes` snapshot into the slab's
97+
`live_kv_bytes_override`, so `scheduler_pool_in_use_bytes`
98+
(a future metric) and `slab.live_kv_bytes` report real numbers.
99+
3. `Scheduler.submit()` continues to acquire / release placeholder
100+
slabs as today; integrators wiring real verifiers into the
101+
scheduler use `PooledVerifier(verifier, scheduler.pool)` to bind
102+
the two.
103+
4. The slab tensors stay as placeholders. The verifier's K/V
104+
tensors stay in `DynamicCache`. Behavior under model forward is
105+
bit-identical to v0.1.0.
106+
107+
The intermediate step costs ~150 lines of code + tests. It cannot
108+
introduce wrong-output bugs because it does not touch the model
109+
forward.
110+
111+
## 4. Acceptance criteria for v0.3 (the full refactor)
112+
113+
When the full refactor lands in a future PR, it must:
114+
115+
1. **Pass a bit-equivalence test** comparing N tokens of greedy AR
116+
output between (a) the old `DynamicCache` path and (b) the new
117+
`SlabBackedCache` path on real Qwen3-1.7B for at least three
118+
distinct prompts including one ≥ 256 tokens.
119+
2. **Run on both ends of the supported `transformers` range**
120+
(currently 4.45.x and 4.52.x; may shift). CI gains a matrix.
121+
3. **Preserve sink+window trim correctness**: a regression test
122+
exercises a session that exceeds `sink_size + window_size` by
123+
≥ 50 % so the slide path runs.
124+
4. **Show measurable memory savings** in the
125+
`bench_mlx_verifier_quant.py`-style comparison: total resident
126+
memory at `B=N, S=8192` should be ≤ 1.05× of the analytical
127+
prediction `N * (sink+window) * num_layers * num_heads * head_dim * 2`.
128+
5. **Be reversible**: a `--legacy-cache` flag on `scripts/serve.py`
129+
(or a config switch) keeps the `DynamicCache` path available for
130+
one minor release in case the refactor surfaces a real-world
131+
issue we miss in CI.
132+
133+
The full refactor has its own ADR (planned 0005) at the time it
134+
ships, which records the test fixtures, the memory measurements,
135+
and the version matrix.
136+
137+
## 5. Alternatives Considered
138+
139+
### 5.1 Ship the full refactor in v0.2.0 (rejected — see §2)
140+
141+
### 5.2 Ship nothing for #3 in v0.2.0; tag v0.2.0 without it (rejected)
142+
143+
The user-visible `scheduler_pool_in_use` gauge is misleading today.
144+
Even a small accuracy improvement is worth shipping. Status-quo
145+
silence on this asymmetry leaves operators unable to size pool
146+
capacity from telemetry alone.
147+
148+
### 5.3 Replace `DynamicCache` only on the MLX backend first (deferred)
149+
150+
MLX's `inference_engine.backends.mlx.cache.SinkWindowKVCache`
151+
already manages slab-like fixed buffers. Unifying it under
152+
`KVSlab` is structurally cleaner than the PyTorch `DynamicCache`
153+
path because we control the entire MLX cache implementation. It is
154+
attractive as a smaller proving ground for the full refactor — but
155+
deferring it to a separate PR alongside the PyTorch refactor lets
156+
both share the bit-equivalence harness rather than each inventing
157+
its own.
158+
159+
## 6. Consequences
160+
161+
### 6.1 Positive
162+
163+
- **Metrics become honest** for v0.2.0 deployments that wire
164+
`PooledVerifier` into the scheduler. `slab.live_kv_bytes` reports
165+
real KV memory; `scheduler_pool_in_use` plus a follow-up
166+
`scheduler_pool_kv_bytes` metric give operators the data to size
167+
pool capacity.
168+
- **The full refactor's test infrastructure can be specified
169+
upfront** (§4) rather than retrofitted after a problem is
170+
observed in production.
171+
- **No correctness risk introduced now**. The model forward path is
172+
unchanged.
173+
174+
### 6.2 Negative / accepted trade-offs
175+
176+
- The slab pool's `kv_bytes` and `total_kv_bytes` properties remain
177+
reporting placeholder bytes for v0.2.0 deployments that don't
178+
wire `PooledVerifier`. They become accurate only via the wrapper.
179+
This is documented in `inference_engine.memory.pool` docstring.
180+
- Two cache paths coexist in the codebase (DynamicCache via verifier,
181+
KVSlab via pool) until v0.3. Code reviewers must hold both in
182+
mind. This is the cost of staging a high-risk refactor.
183+
184+
### 6.3 Implications for code
185+
186+
- `inference_engine/memory/slab.py`: add
187+
`live_kv_bytes_override: Optional[int]` and modify the
188+
`live_kv_bytes` property.
189+
- `inference_engine/scheduler/pooled_verifier.py` (new): the
190+
wrapper class.
191+
- `inference_engine/scheduler/__init__.py`: export `PooledVerifier`.
192+
- README + this ADR cross-referenced from
193+
`docs/local-inference-engine.md`.
194+
- Tests: pure-CPU unit tests against a `_FakeVerifier` real
195+
concrete class. No HF cache required for CI.
196+
197+
## 7. Validation
198+
199+
This ADR is considered validated when:
200+
201+
1. The intermediate step (§3) is implemented with 100% line coverage
202+
on the new code.
203+
2. A walkthrough of `inference_engine.memory` and
204+
`inference_engine.scheduler` documents which paths are
205+
"placeholder bookkeeping" and which produce real KV byte counts.
206+
3. The full refactor's acceptance criteria (§4) are restated in
207+
the future ADR 0005 when that PR opens — this ADR's §4 is
208+
normative for that future work.
209+
210+
## 8. References
211+
212+
- ADR 0001 — proposer sizing + alignment.
213+
- ADR 0002 — verifier selection + quantization.
214+
- `docs/local-inference-engine.md` — original engine architecture.
215+
- PR #8 (E3 slab pool), PR #9 (E4 scheduler), PR #12 (E2↔E4
216+
integration), PR #13 (metrics).
217+
- `transformers.cache_utils.Cache` — the contract a future
218+
`SlabBackedCache` must implement.

docs/adr/README.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -34,3 +34,4 @@ reader what was *not* chosen.
3434
| ---- | --------------------------------------------------------------- | -------- |
3535
| 0001 | [Proposer sizing, alignment, and verifier decoupling](0001-proposer-sizing-and-alignment.md) | Accepted |
3636
| 0002 | [Verifier selection, quantization, and the open-vs-closed-weight constraint](0002-verifier-selection-and-quantization.md) | Accepted |
37+
| 0003 | [Verifier ↔ slab pool integration: deferred refactor + intermediate step](0003-verifier-slab-pool-integration.md) | Accepted |

inference_engine/memory/slab.py

Lines changed: 20 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -110,6 +110,12 @@ def __init__(self, config: SlabConfig) -> None:
110110
self.keys = torch.zeros(shape, dtype=config.dtype, device=config.device)
111111
self.values = torch.zeros(shape, dtype=config.dtype, device=config.device)
112112
self.logical_size = 0
113+
# Override for live_kv_bytes when this slab is a placeholder
114+
# (e.g. used for admission control while the real KV lives in
115+
# transformers' DynamicCache). PooledVerifier sets this from
116+
# the verifier's actual cache size after every forward, so
117+
# /metrics reports real numbers. See ADR 0003.
118+
self.live_kv_bytes_override: int | None = None
113119

114120
# ------------------------------------------------------------------
115121
# Mutators
@@ -187,9 +193,12 @@ def reset(self) -> None:
187193
"""Empty the slab; underlying buffers are kept allocated.
188194
189195
Called by the pool on release so the slab is ready to serve
190-
a fresh session without re-allocating tensors.
196+
a fresh session without re-allocating tensors. Also clears
197+
any ``live_kv_bytes_override`` so a fresh acquirer doesn't
198+
inherit stale accounting from the previous session.
191199
"""
192200
self.logical_size = 0
201+
self.live_kv_bytes_override = None
193202
# We do not zero the buffers; logical_size is the truth.
194203
# Callers must respect logical_size when reading.
195204

@@ -236,9 +245,17 @@ def kv_bytes(self) -> int:
236245
def live_kv_bytes(self) -> int:
237246
"""Bytes for the live region only (logical_size, not capacity).
238247
239-
Useful for reporting "how much KV is currently in use" vs the
240-
physical footprint reported by :attr:`kv_bytes`.
248+
If ``live_kv_bytes_override`` is set (typically by
249+
:class:`PooledVerifier` from the verifier's real
250+
``DynamicCache`` size), that value is returned verbatim. This
251+
lets a placeholder-tensor slab report accurate memory
252+
accounting for the actual cache it tracks. See ADR 0003.
253+
254+
Otherwise this is computed from the slab's own tensors:
255+
``num_layers * num_heads * logical_size * head_dim * 2``.
241256
"""
257+
if self.live_kv_bytes_override is not None:
258+
return int(self.live_kv_bytes_override)
242259
if self.logical_size == 0:
243260
return 0
244261
elem = self.keys.element_size()

inference_engine/scheduler/__init__.py

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -26,11 +26,13 @@
2626
"""
2727

2828
from .config import AdmissionPolicy, SchedulerConfig
29+
from .pooled_verifier import PooledVerifier
2930
from .scheduler import RequestRejected, Scheduler
3031
from .session import Session, SessionState
3132

3233
__all__ = [
3334
"AdmissionPolicy",
35+
"PooledVerifier",
3436
"RequestRejected",
3537
"Scheduler",
3638
"SchedulerConfig",

0 commit comments

Comments
 (0)