Fix bench_long_session metric names + expose scheduler_kv_live_bytes gauge#24
Merged
Merged
Conversation
Triage of the 2026-05-30 Mac M4 4h run (results/platform-tests/
bench_long_session_mac_1780130542.aborted.json) surfaced a bench-side
bug that made the entire run produce no KV-memory data.
Root cause
----------
scripts/bench_agentic/bench_long_session.py looked for these metric
names:
inference_engine_scheduler_active_sessions
inference_engine_scheduler_pool_in_use
inference_engine_scheduler_pool_size
inference_engine_scheduler_kv_live_bytes
but the server actually exposes:
scheduler_active_sessions
scheduler_pool_in_use
scheduler_pool_total
scheduler_pending
(no scheduler_kv_live_bytes at all)
So all 58 turns recorded an empty 'metrics: {}' dict and the
KV-bounded check (the headline ADR 0006 §2.3 claim) was effectively
disabled.
Fixes in this commit
--------------------
1. inference_engine/memory/pool.py
SlabPool.live_kv_bytes property — sums RolloutSlab.live_kv_bytes
across the in-use slabs. Free slabs report 0 (logical_size is
reset on release). Verifier-side bytes flow in via
PooledVerifier's existing live_kv_bytes_override mechanism, so
this aggregate matches the verifier's actual footprint.
2. inference_engine/server/metrics.py
New 'scheduler_kv_live_bytes' Gauge with descriptive HELP text
pointing at ADR 0006 §2.3. snapshot_scheduler() takes an extra
kv_live_bytes kwarg (defaults to 0 so existing callers stay
valid; calling without it now also explicitly resets the gauge
so a stale prior value never bleeds into the next scrape).
3. inference_engine/server/app.py
Bootstrap snapshot + the per-/metrics-scrape refresh both now
pass pool.live_kv_bytes through.
4. scripts/bench_agentic/bench_long_session.py
- Corrected metric names everywhere they appear (5 sites:
_METRIC_NAMES, docstring, progress line, bucketize aggregate,
global aggregate).
- Added a 'note on what this bench measures and what it
doesn't' to the module docstring documenting the prefill-
grows-linearly-with-history finding from the same triage.
That observation is a *protocol-level* limitation of stateless
OpenAI chat-completions, NOT a sink+window failure: KV memory
is bounded but prefill cost is O(history). The bench reports
both metrics independently — KV bounded is a hard claim,
latency drift is a measurement.
Tests
-----
tests/inference_engine/memory/test_pool.py +3 tests for
live_kv_bytes
aggregation
tests/inference_engine/server/test_metrics.py +1 test for
default-zero kwarg,
+scheduler_kv_live_bytes
in expected-name set,
+assertions in
existing snapshot
tests
tests/.../test_app_metrics_and_auth.py +1 integration test
asserting /metrics
exposes the new
gauge at idle
Verified:
pytest tests/inference_engine/ → 389 passed
python3 scripts/bench_agentic/bench_long_session.py --dry-run → OK
parser smoke test on real metric-name sample → parses all 5
Out of scope
------------
- Server-side fixes for the orphan-session-on-disconnect bug
(already applied locally on the Mac per the aborted.json
fixes_applied list; will be a separate PR from that work).
- Architectural decision on cross-request KV reuse (the actual
fix for prefill latency growth) — needs ADR 0006 amendment
and likely a v0.4 implementation, separate work line.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
This was referenced May 30, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Triage + fixes for the 2026-05-30 Mac M4 4-hour
bench_long_session.pyrun that stopped advancing at 29.4 min. Fixes the bench-script bug that
made the run record no KV memory data at all, and adds the
server-side
scheduler_kv_live_bytesgauge the bench was looking for.Results data lives at
results/platform-tests/bench_long_session_mac_1780130542.{partial,aborted}.jsonon
AgentMemory/bench-long-session-mac-results-8e7f.Triage
The bench wrote a partial checkpoint with 58 successful turns over
29.4 minutes, 0 errors recorded, but
metrics: {}on every turn.After turn 57 the server-side logs show 3+ hours of
429 Too Many Requestswith no client-side advance — i.e. the bench was hung on asingle in-flight HTTP call while a previous orphan session pinned the
only scheduler slot.
Three independent issues turned up:
fixes_applied(separate PR from that work line)Issue 1 — bench/server metric-name mismatch
inference_engine_scheduler_active_sessionsscheduler_active_sessionsinference_engine_scheduler_pool_in_usescheduler_pool_in_useinference_engine_scheduler_pool_sizescheduler_pool_totalinference_engine_scheduler_kv_live_bytesprometheus-clientdoes not add a service prefix, and the KV-live-bytesgauge was simply missing from the exporter — both ends of the contract
needed to move.
Issue 3 — prefill cost growth (architectural)
Documented inline in the bench docstring. KV memory IS bounded
(verifier sink+window holds), but per-turn latency is not because
the OpenAI chat-completions protocol is stateless: every turn the
server tokenizes the full history and runs the verifier prefill
from token 0. Sink+window only bounds generation-phase memory, not
prefill cost. Cross-request KV reuse is a v0.4 feature; the
recommended user-side mitigation is summarization or sliding-window
prompt management.
The bench now reports KV-bounded as a hard claim and latency drift
as a measurement, not a gate — so a long-session run produces a
useful report even when latency walks up.
Changes
Server side (3 files, +36 lines)
inference_engine/memory/pool.py—SlabPool.live_kv_bytespropertyaggregating across in-use slabs.
inference_engine/server/metrics.py— newscheduler_kv_live_bytesGauge with HELP pointing at ADR 0006 §2.3;
snapshot_schedulernowtakes (and explicitly resets) the
kv_live_byteskwarg.inference_engine/server/app.py— both bootstrap snapshot and theper-
/metrics-scrape refresh passpool.live_kv_bytesthrough.Bench script (1 file, +52 lines / -13)
scripts/bench_agentic/bench_long_session.py_METRIC_NAMES,docstring, progress line, bucketize, aggregate).
the prefill-vs-memory distinction surfaced by this triage.
Tests (3 files, +51 lines)
tests/inference_engine/memory/test_pool.pytest_live_kv_bytes_zero_when_empty,test_live_kv_bytes_counts_only_in_use_slabs,test_live_kv_bytes_aggregates_across_multiple_in_usetests/inference_engine/server/test_metrics.pytest_snapshot_scheduler_kv_live_bytes_default_zero+scheduler_kv_live_bytesadded to expected-names set + assertions in existing snapshot teststests/inference_engine/server/test_app_metrics_and_auth.pytest_metrics_kv_live_bytes_gauge_present_and_zero_at_idle(HTTP-level integration)Verification
Recommended next-run path
Once this PR + the server-side orphan-session fix from aborted.json
are both merged, a 30-minute short-run is the right validation step
before another 4h attempt:
PYTHONPATH=. python3 scripts/bench_agentic/bench_long_session.py \ --kakeya-url http://127.0.0.1:8000 --kakeya-model kakeya-v1 \ --duration-s 1800 --turn-spacing-s 5 --max-tokens 64 \ --report results/platform-tests/bench_long_session_mac_short_$(date +%s).jsonSuccess criteria for the short run:
metricsdict with 5 keys includingscheduler_kv_live_bytes.scheduler_kv_live_bytesstays roughly constant (<10% spread) —the §2.3 KV-bounded claim is verified directly.
agg.kv_bounded == Truein the final JSON.If both check out, a fresh 4-hour run becomes the GA-evidence step.
Quality bars
SlabPool/Metrics/ASGI app; no
unittest.mock.kv_live_bytesalways set on every snapshot;the kwarg defaults to 0 explicitly so a stale prior value cannot
leak into the next scrape.
registered, exposed, summed, defaults reset), not specific
byte counts.
References
results/platform-tests/bench_long_session_mac_1780130542.aborted.json(on
AgentMemory/bench-long-session-mac-results-8e7f)live_kv_bytes_overridemechanism onKVSlabthatthis aggregation reads through