Server reads KV bytes from engine.kv_state(), not the slab pool

cursoragent · FluffyAIcode · cursoragent · commit add46de21ffd · 2026-05-31T05:44:47.000Z
The 2026-05-30 short test #2 confirmed:

  - bench in-flight metrics poller works (7313 samples / 58 turns,
    median 110 / turn, max 429)
  - orphan-session fix works (idle pool_in_use settles to 0)
  - slab IS acquired during turns (in-flight pool_in_use peak = 1)

But scheduler_kv_live_bytes still read 0.0 in 58/58 turns.

Root cause: SlabPool.live_kv_bytes (added in PR #24) sums slabs'
live_kv_bytes_override, which is only ever set by PooledVerifier, and
PooledVerifier is never wired into scripts/serve.py. Wrapping the
verifier in PooledVerifier requires plumbing the slab through
Scheduler -> Engine -> SpeculativeDecoder -> Verifier, which is a
non-trivial structural change.

Cheaper fix
-----------

The verifier already holds the real KV cache tensors and is the
canonical source of truth for live KV bytes. Expose it directly:

  - kv_cache_proposer/verifier.py
    SinkWindowVerifier.live_kv_bytes() -> int
    Sums layer.keys.numel() * element_size() + same for values
    across the cache. Returns 0 when cache is None (between reset()
    and prefill()). _record_peak_kv now reads through it.

  - inference_engine/backends/mlx/verifier.py
    MLXSinkWindowVerifier.live_kv_bytes() -> int
    Same surface as the CPU verifier; reads from
    cache_ops.total_kv_bytes(self.cache). _record_peak_kv now reads
    through it too.

  - inference_engine/server/engine.py
    Engine protocol: new kv_state() -> int method.
    SpeculativeEngine: returns decoder.verifier.live_kv_bytes() if it
    exists, else 0. Defensive on the verifier surface so legacy
    verifiers that don't expose the optional method don't break the
    engine.

  - inference_engine/server/app.py
    /metrics handler: replace
        kv_live_bytes=pool.live_kv_bytes
    with
        kv_live_bytes=int(engine.kv_state())

The pool-side gauge is preserved as infrastructure; once
PooledVerifier is wired (post-v0.3.0), the slab will report correctly
via override and the aggregate will match the engine. For v0.3, the
engine is the source of truth.

Thread safety
-------------

Both verifiers' live_kv_bytes() reads are int-attribute walks over
tensor shape descriptors. CPython torch.Tensor.numel() / mlx
array.size are atomic reads; a concurrent worker writing the cache
produces some valid intermediate value, never garbage. Documented
inline.

Tests (no mock; all real concrete classes)
------------------------------------------

tests/core/test_verifier.py
  + test_live_kv_bytes_zero_before_prefill
  + test_live_kv_bytes_nonzero_after_prefill
  + test_live_kv_bytes_returns_zero_when_layer_kv_is_null

tests/backends/mlx/test_verifier.py
  + test_live_kv_bytes_zero_before_prefill
  + test_live_kv_bytes_nonzero_after_prefill

tests/inference_engine/server/test_engine.py
  + test_kv_state_reads_from_verifier_live_kv_bytes
    (with concrete _VerifierDouble exposing live_kv_bytes)
  + test_kv_state_returns_zero_when_verifier_has_no_method
  + test_kv_state_called_each_invocation
    (asserts /metrics scrape contract: no caching)

tests/inference_engine/server/test_app_metrics_and_auth.py
  + test_metrics_kv_live_bytes_reflects_engine_kv_state
    (regression test pinning the fix; uses _KVAwareEngine subclass
    returning a deterministic non-zero value and asserts it appears
    in the prometheus text exposition)

Test doubles updated (DeterministicEngine in conftest.py,
_RaisingEngine in two test files): all return kv_state() == 0 as a
no-real-cache default.

Verified locally:

  pytest tests/inference_engine/server/test_engine.py
         tests/inference_engine/server/test_app_metrics_and_auth.py
         tests/core/test_verifier.py
  -> 65 passed

  pytest tests/inference_engine/
  -> 389 passed (no regression)

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
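
Illustration (not part of the change): the read path described above, verifier byte sum plus the engine's defensive lookup, can be sketched without loading a model. FakeTensor, FakeLayer, ToyVerifier, LegacyVerifier and the free function kv_state below are hypothetical stand-ins invented for this sketch; only the numel() * element_size() arithmetic and the getattr fallback mirror what the diff does.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class FakeTensor:
    """Hypothetical stand-in exposing only numel() and element_size()."""
    shape: Tuple[int, ...]
    itemsize: int = 2  # assumed 2-byte elements, e.g. bfloat16

    def numel(self) -> int:
        n = 1
        for dim in self.shape:
            n *= dim
        return n

    def element_size(self) -> int:
        return self.itemsize


@dataclass
class FakeLayer:
    keys: Optional[FakeTensor] = None
    values: Optional[FakeTensor] = None


class ToyVerifier:
    """Hypothetical verifier: no cache until prefill() is called."""

    def __init__(self) -> None:
        self.layers: Optional[List[FakeLayer]] = None

    def prefill(self, n_tokens: int, n_layers: int = 2,
                heads: int = 4, head_dim: int = 8) -> None:
        shape = (1, heads, n_tokens, head_dim)
        self.layers = [FakeLayer(FakeTensor(shape), FakeTensor(shape))
                       for _ in range(n_layers)]

    def live_kv_bytes(self) -> int:
        # 0 before prefill; otherwise sum keys + values, skipping None.
        if self.layers is None:
            return 0
        total = 0
        for layer in self.layers:
            if layer.keys is not None:
                total += layer.keys.numel() * layer.keys.element_size()
            if layer.values is not None:
                total += layer.values.numel() * layer.values.element_size()
        return total


class LegacyVerifier:
    """Hypothetical older verifier without live_kv_bytes."""


def kv_state(verifier: object) -> int:
    # Same shape as SpeculativeEngine.kv_state: degrade to 0 rather
    # than raise when the optional method is missing.
    live = getattr(verifier, "live_kv_bytes", None)
    if live is None:
        return 0
    return int(live())
```

With these assumed shapes, a 16-token prefill gives 2 layers x 2 tensors x 512 elements x 2 bytes = 4096 bytes, and a LegacyVerifier reads as 0.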
diff --git a/inference_engine/backends/mlx/verifier.py b/inference_engine/backends/mlx/verifier.py
@@ -187,10 +187,26 @@ def _cache_buffer_size(self) -> int:
             return 0
         return cache_ops.cache_seq_length(self.cache)
 
-    def _record_peak_kv(self) -> None:
+    def live_kv_bytes(self) -> int:
+        """Return the current size of the verifier's live KV cache in bytes.
+
+        This is the *now* size, not a peak. Safe to read from any thread:
+        ``cache_ops.total_kv_bytes`` walks the per-layer
+        :class:`SinkWindowKVCache` instances and sums
+        ``keys.size * keys.dtype.size`` + same for values, all of
+        which are integer attributes that don't tear under a
+        concurrent reader. The HTTP ``/metrics`` handler relies on
+        this property to scrape KV usage during in-flight generation.
+
+        Returns 0 when the cache has not been allocated yet (between
+        ``reset()`` and the next ``prefill()``).
+        """
         if self.cache is None:
-            return
-        total = cache_ops.total_kv_bytes(self.cache)
+            return 0
+        return cache_ops.total_kv_bytes(self.cache)
+
+    def _record_peak_kv(self) -> None:
+        total = self.live_kv_bytes()
         if total > self.stats.peak_kv_bytes:
             self.stats.peak_kv_bytes = total
 
diff --git a/inference_engine/server/app.py b/inference_engine/server/app.py
@@ -291,12 +291,21 @@ async def metrics_endpoint() -> Response:
         # Refresh scheduler-state gauges on every scrape so the
         # exposition reflects "now" rather than the last
         # admission/completion event.
+        engine_for_kv: Engine = app.state.engine
+        # Read KV bytes directly from the engine's verifier rather
+        # than from pool.live_kv_bytes. Rationale: in v0.3 the slab
+        # is a session ticket (acquired/released per request) — the
+        # verifier holds the real KV cache tensors and is the
+        # canonical source of truth. Pool-side accounting only
+        # populates once PooledVerifier is wired (a post-v0.3.0
+        # change) and otherwise reads 0 even while the verifier
+        # cache is several MiB.
         metrics.snapshot_scheduler(
             active=scheduler.active_count,
             pool_in_use=pool.in_use_count,
             pool_total=pool.total_count,
             pending=scheduler.pending_count,
-            kv_live_bytes=pool.live_kv_bytes,
+            kv_live_bytes=int(engine_for_kv.kv_state()),
         )
         return PlainTextResponse(
             content=metrics.render(),
diff --git a/inference_engine/server/engine.py b/inference_engine/server/engine.py
@@ -64,6 +64,13 @@ class Engine(Protocol):
             returns ``True``, generation stops at that token boundary.
             The callback is the only way streaming routes inject
             cancellation signals (e.g. client disconnect).
+        kv_state
+            Return the engine's current verifier KV-cache size in
+            bytes, or 0 if the engine has no real KV cache (test
+            doubles). Read on every ``/metrics`` scrape to populate
+            the ``scheduler_kv_live_bytes`` gauge so the ADR 0006
+            §2.3 long-session memory-stability claim is verifiable
+            in production.
     """
 
     @property
@@ -83,6 +90,9 @@ def generate(
     ) -> EngineResult:
         ...  # pragma: no cover - Protocol body, never executed
 
+    def kv_state(self) -> int:
+        ...  # pragma: no cover - Protocol body, never executed
+
 
 class SpeculativeEngine:
     """Concrete :class:`Engine` backed by a real SpeculativeDecoder.
@@ -175,3 +185,20 @@ def generate(
             verifier_forward_calls=int(result.verifier_forward_calls),
             stopped_on_eos=stopped_on_eos,
         )
+
+    def kv_state(self) -> int:
+        """Live KV cache bytes from the underlying verifier.
+
+        Reads ``self._decoder.verifier.live_kv_bytes()`` if the
+        verifier exposes that method (both the CPU and MLX
+        verifiers in this repository do). Returns 0 for an older or
+        stub verifier that lacks the method. Called from the
+        ``/metrics`` handler on every scrape and must be safe to
+        call concurrently with the worker thread that is mutating
+        the verifier's cache (see verifier docstrings for the
+        thread-safety argument).
+        """
+        live = getattr(self._decoder.verifier, "live_kv_bytes", None)
+        if live is None:
+            return 0
+        return int(live())
diff --git a/kv_cache_proposer/verifier.py b/kv_cache_proposer/verifier.py
@@ -246,15 +246,32 @@ def _truncate_tail_in_place(self, drop: int) -> None:
             layer.keys = keys[:, :, :keep, :].contiguous()
             layer.values = values[:, :, :keep, :].contiguous()
 
-    def _record_peak_kv(self) -> None:
+    def live_kv_bytes(self) -> int:
+        """Return the current size of the verifier's live KV cache in bytes.
+
+        This is the *now* size, not a peak. Reads cleanly from any
+        thread (no locks): in CPython, walking ``self.cache.layers``
+        and reading ``Tensor.numel()`` / ``element_size()`` on each
+        is safe even while the worker thread is mutating the cache —
+        a concurrent write produces a value somewhere between the
+        two adjacent stable states, never garbage. The HTTP
+        ``/metrics`` handler relies on this property.
+
+        Returns 0 when the cache has not been allocated yet (between
+        ``reset()`` and the next ``prefill()``).
+        """
         if self.cache is None:
-            return
+            return 0
         total = 0
         for layer in self.cache.layers:
             if layer.keys is not None:
                 total += layer.keys.numel() * layer.keys.element_size()
             if layer.values is not None:
                 total += layer.values.numel() * layer.values.element_size()
+        return total
+
+    def _record_peak_kv(self) -> None:
+        total = self.live_kv_bytes()
         self.stats.peak_kv_bytes = max(self.stats.peak_kv_bytes, total)
 
     def _record_peak_activation(self, logits: torch.Tensor) -> None:
diff --git a/tests/backends/mlx/test_verifier.py b/tests/backends/mlx/test_verifier.py
@@ -255,6 +255,24 @@ def test_record_peak_kv_handles_null_cache() -> None:
     assert v.stats.peak_kv_bytes == pre
 
 
+def test_live_kv_bytes_zero_before_prefill() -> None:
+    """The /metrics gauge must read 0 before any prefill."""
+    v = _build_mlx_verifier()
+    assert v.live_kv_bytes() == 0
+
+
+def test_live_kv_bytes_nonzero_after_prefill() -> None:
+    """During in-flight generation the gauge must read the actual
+    bytes — this is what bench_long_session.py polls on each turn
+    to verify the ADR 0006 §2.3 KV-bounded claim."""
+    v = _build_mlx_verifier()
+    v.prefill(list(range(16)))
+    n = v.live_kv_bytes()
+    assert n > 0
+    # Right after prefill, peak == live.
+    assert v.stats.peak_kv_bytes == n
+
+
 def test_record_peak_activation_grows_only() -> None:
     v = _build_mlx_verifier()
     a = mx.zeros((1, 4, 32), dtype=mx.bfloat16)
diff --git a/tests/core/test_verifier.py b/tests/core/test_verifier.py
@@ -317,6 +317,48 @@ def test_record_peak_kv_handles_layers_with_null_kv(fresh_verifier_factory) -> N
         layer0.values = saved_v
 
 
+def test_live_kv_bytes_zero_before_prefill(fresh_verifier_factory) -> None:
+    """Before any prefill, ``live_kv_bytes()`` must read 0. Required
+    by the /metrics scrape contract: the gauge has a stable value at
+    process startup."""
+    verifier = fresh_verifier_factory()
+    assert verifier.live_kv_bytes() == 0
+
+
+def test_live_kv_bytes_nonzero_after_prefill(fresh_verifier_factory) -> None:
+    """After prefill the cache holds tensors; live_kv_bytes returns
+    the sum of bytes across all layers' keys + values. This is the
+    gauge value the bench scrapes during in-flight generation."""
+    verifier = fresh_verifier_factory()
+    verifier.prefill(list(range(16)))
+    n = verifier.live_kv_bytes()
+    assert n > 0
+    # Round-trip: peak_kv_bytes is set from the same source so they
+    # must agree right after prefill.
+    assert verifier.stats.peak_kv_bytes == n
+
+
+def test_live_kv_bytes_returns_zero_when_layer_kv_is_null(
+    fresh_verifier_factory,
+) -> None:
+    """The keys-None branch is taken on cleared layers and must not
+    raise — live_kv_bytes simply skips them in the sum."""
+    verifier = fresh_verifier_factory()
+    verifier.prefill(list(range(4)))
+    layer0 = verifier.cache.layers[0]
+    saved_k, saved_v = layer0.keys, layer0.values
+    layer0.keys = None
+    layer0.values = None
+    try:
+        # Must not raise. Returns the sum across the *remaining*
+        # non-null layers (potentially less than the full prefill total).
+        n = verifier.live_kv_bytes()
+        assert n >= 0
+    finally:
+        layer0.keys = saved_k
+        layer0.values = saved_v
+
+
 def test_record_peak_activation_grows_only(fresh_verifier_factory) -> None:
     verifier = fresh_verifier_factory()
     a = torch.zeros((1, 4, 32), dtype=torch.bfloat16)
diff --git a/tests/inference_engine/server/conftest.py b/tests/inference_engine/server/conftest.py
@@ -189,6 +189,11 @@ def tokenizer(self) -> DeterministicTokenizer:
     def model_id_label(self) -> str:
         return self._model_id_label
 
+    def kv_state(self) -> int:
+        """Test double has no real KV cache — 0 by default. Tests that
+        want to drive a non-zero gauge value override this."""
+        return 0
+
     def generate(
         self,
         prompt_ids: List[int],
diff --git a/tests/inference_engine/server/test_app_metrics_and_auth.py b/tests/inference_engine/server/test_app_metrics_and_auth.py
@@ -92,9 +92,9 @@ async def test_metrics_kv_live_bytes_gauge_present_and_zero_at_idle(
     short_engine,
 ):
     """The KV-live-bytes gauge must be exposed and read 0 on an idle
-    pool (every slab has logical_size == 0). This is the gauge that
-    bench_long_session.py scrapes to verify the ADR 0006 §2.3
-    KV-bounded claim, so its presence is part of the public contract.
+    engine. This is the gauge that bench_long_session.py scrapes to
+    verify the ADR 0006 §2.3 KV-bounded claim, so its presence is
+    part of the public contract.
     """
     app = create_app(short_engine, ServerConfig(max_concurrent=2))
     async with AsyncClient(transport=ASGITransport(app=app),
@@ -105,6 +105,47 @@ async def test_metrics_kv_live_bytes_gauge_present_and_zero_at_idle(
     assert "scheduler_kv_live_bytes 0.0" in text
 
 
+async def test_metrics_kv_live_bytes_reflects_engine_kv_state(tokenizer):
+    """The /metrics handler must read KV bytes from the engine on
+    every scrape (not from the pool). This is the v0.3 wiring that
+    makes bench_long_session.py's in-flight scrape produce a
+    non-zero number on real hardware — without it the gauge
+    unconditionally reads 0 because no production code path sets
+    the slab's live_kv_bytes_override.
+
+    The 2026-05-30 short test #2 (results/.../bench_long_session_mac_short2_
+    1780196477.json) recorded 7313 in-flight samples across 58 turns
+    with pool_in_use=1 throughout, yet kv_live_bytes was 0.0 in every
+    sample. This regression test pins the fix.
+    """
+    from tests.inference_engine.server.conftest import DeterministicEngine
+
+    class _KVAwareEngine(DeterministicEngine):
+        def __init__(self, *args, kv_value: int, **kwargs):
+            super().__init__(*args, **kwargs)
+            self._kv_value = kv_value
+
+        def kv_state(self) -> int:
+            return self._kv_value
+
+    eos = tokenizer.eos_token_id
+    assert eos is not None
+    hello = tokenizer._intern("hi")
+    eng = _KVAwareEngine(
+        fixed_tokens=[hello, eos],
+        tokenizer=tokenizer,
+        model_id_label="kv-aware",
+        kv_value=12345678,
+    )
+    app = create_app(eng, ServerConfig(max_concurrent=1))
+    async with AsyncClient(transport=ASGITransport(app=app),
+                           base_url="http://t") as c:
+        r = await c.get("/metrics")
+    assert r.status_code == 200
+    assert "scheduler_kv_live_bytes 1.2345678e+07" in r.text or \
+           "scheduler_kv_live_bytes 12345678" in r.text
+
+
 # ---------------------------------------------------------------------------
 # OpenAI error envelope
 # ---------------------------------------------------------------------------
diff --git a/tests/inference_engine/server/test_app_streaming.py b/tests/inference_engine/server/test_app_streaming.py
@@ -350,6 +350,9 @@ def tokenizer(self):
     def model_id_label(self):
         return "raising"
 
+    def kv_state(self) -> int:
+        return 0
+
     def generate(self, prompt_ids, max_new_tokens, eos_token_ids, on_token=None):
         raise RuntimeError("synthetic engine failure")
 
diff --git a/tests/inference_engine/server/test_app_with_scheduler.py b/tests/inference_engine/server/test_app_with_scheduler.py
@@ -275,6 +275,9 @@ def tokenizer(self):
     def model_id_label(self):
         return "raising"
 
+    def kv_state(self) -> int:
+        return 0
+
     def generate(self, prompt_ids, max_new_tokens, eos_token_ids, on_token=None):
         raise RuntimeError("synthetic engine failure")
 
diff --git a/tests/inference_engine/server/test_engine.py b/tests/inference_engine/server/test_engine.py
@@ -58,13 +58,19 @@ class _DecoderDouble:
     """
 
     def __init__(self, fixed_tokens: List[int], acceptance: float = 0.5,
-                 proposer_calls: int = 7, verifier_calls: int = 3) -> None:
+                 proposer_calls: int = 7, verifier_calls: int = 3,
+                 verifier=None) -> None:
         self._fixed_tokens = list(fixed_tokens)
         self._acceptance = acceptance
         self._proposer_calls = proposer_calls
         self._verifier_calls = verifier_calls
         self.call_count = 0
         self.last_kwargs: Optional[dict] = None
+        # The engine adapter reads ``decoder.verifier.live_kv_bytes()``
+        # via ``kv_state``. A double object exposing the same one-method
+        # surface is enough for the engine adapter; SpeculativeDecoder
+        # itself has a ``.verifier`` attribute so duck typing matches.
+        self.verifier = verifier
 
     def generate(
         self,
@@ -254,3 +260,62 @@ def test_generate_rejects_empty_eos_token_ids(tokenizer):
     engine = SpeculativeEngine(decoder=decoder, tokenizer=tokenizer, model_id_label="m")
     with pytest.raises(ValueError, match="eos_token_ids must be non-empty"):
         engine.generate(prompt_ids=[1], max_new_tokens=5, eos_token_ids=[])
+
+
+# ---------------------------------------------------------------------------
+# kv_state — read live KV bytes from the underlying verifier
+# ---------------------------------------------------------------------------
+
+
+class _VerifierDouble:
+    """Real concrete verifier surface that exposes ``live_kv_bytes``.
+
+    Mirrors the production verifiers' (CPU + MLX) public method
+    without loading any model. The engine adapter walks
+    ``decoder.verifier.live_kv_bytes()`` via duck typing.
+    """
+
+    def __init__(self, value: int) -> None:
+        self._value = value
+        self.calls = 0
+
+    def live_kv_bytes(self) -> int:
+        self.calls += 1
+        return self._value
+
+
+class _LegacyVerifierDouble:
+    """Older verifier shape that does NOT expose live_kv_bytes.
+
+    Used to verify the engine adapter degrades to 0 (rather than
+    raising) when wrapping a verifier without the optional method.
+    """
+
+
+def test_kv_state_reads_from_verifier_live_kv_bytes(tokenizer):
+    verifier = _VerifierDouble(value=4096)
+    decoder = _DecoderDouble(fixed_tokens=[10], verifier=verifier)
+    engine = SpeculativeEngine(decoder=decoder, tokenizer=tokenizer,
+                               model_id_label="m")
+    assert engine.kv_state() == 4096
+    assert verifier.calls == 1
+
+
+def test_kv_state_returns_zero_when_verifier_has_no_method(tokenizer):
+    decoder = _DecoderDouble(fixed_tokens=[10], verifier=_LegacyVerifierDouble())
+    engine = SpeculativeEngine(decoder=decoder, tokenizer=tokenizer,
+                               model_id_label="m")
+    assert engine.kv_state() == 0
+
+
+def test_kv_state_called_each_invocation(tokenizer):
+    """The /metrics handler calls kv_state on every scrape — must
+    re-read the verifier each time, not cache."""
+    verifier = _VerifierDouble(value=100)
+    decoder = _DecoderDouble(fixed_tokens=[10], verifier=verifier)
+    engine = SpeculativeEngine(decoder=decoder, tokenizer=tokenizer,
+                               model_id_label="m")
+    engine.kv_state()
+    engine.kv_state()
+    engine.kv_state()
+    assert verifier.calls == 3
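
Illustration (not part of the change): the regression test above accepts two textual renderings of the same gauge value. A scraper that parses the number avoids depending on which one the exposition uses. The helper read_gauge below is a hypothetical sketch, not code from bench_long_session.py, and it assumes the gauge is emitted without labels, as in the asserted strings.

```python
from typing import Optional


def read_gauge(exposition: str, name: str) -> Optional[float]:
    """Return the value of an unlabelled metric, or None if absent."""
    for raw in exposition.splitlines():
        line = raw.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) >= 2 and parts[0] == name:
            return float(parts[1])
    return None
```

Both "scheduler_kv_live_bytes 1.2345678e+07" and "scheduler_kv_live_bytes 12345678" parse to the same float, so a caller can compare against 12345678 directly.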