Fix bench_long_session metric names + expose scheduler_kv_live_bytes gauge by FluffyAIcode · Pull Request #24 · FluffyAIcode/Kakeya-LLM-Inference-engine

FluffyAIcode · 2026-05-30T12:27:26Z

Summary

Triage + fixes for the 2026-05-30 Mac M4 4-hour bench_long_session.py
run that stopped advancing at 29.4 min. Fixes the bench-script bug that
made the run record no KV memory data at all, and adds the
server-side scheduler_kv_live_bytes gauge the bench was looking for.

Results data lives at results/platform-tests/bench_long_session_mac_1780130542.{partial,aborted}.json
on AgentMemory/bench-long-session-mac-results-8e7f.

Triage

The bench wrote a partial checkpoint with 58 successful turns over
29.4 minutes, 0 errors recorded, but metrics: {} on every turn.
After turn 57 the server-side logs show 3+ hours of 429 Too Many Requests with no client-side advance — i.e. the bench was hung on a
single in-flight HTTP call while a previous orphan session pinned the
only scheduler slot.

Three independent issues turned up:

| Issue | Layer | Status |
|---|---|---|
| 1. Bench scraped wrong metric names → all metrics empty → KV-bounded claim couldn't be checked | bench script | Fixed in this PR |
| 2. Orphan session on client disconnect → 429 saturation | server | Fixed locally per aborted.json `fixes_applied` (separate PR from that work line) |
| 3. p50 latency grew 14.7 s → 55.5 s in 30 min (prefill is re-done every turn) | architecture | Documented; out of scope for v0.3.0 |

Issue 1 — bench/server metric-name mismatch

| Bench expected | Server actually exposes |
|---|---|
| `inference_engine_scheduler_active_sessions` | `scheduler_active_sessions` |
| `inference_engine_scheduler_pool_in_use` | `scheduler_pool_in_use` |
| `inference_engine_scheduler_pool_size` | `scheduler_pool_total` |
| `inference_engine_scheduler_kv_live_bytes` | didn't exist |

prometheus-client does not add a service prefix, and the KV-live-bytes
gauge was simply missing from the exporter — both ends of the contract
needed to move.
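A minimal sketch of how a scraper can surface this kind of name mismatch instead of silently recording `metrics: {}`. This is not the bench's actual `_parse_prom_text`; the function names `parse_prom_text` and `select_metrics` are hypothetical, and the `WANTED` tuple simply lists the four server-side names from the table above.

```python
# Hypothetical sketch; not the code in bench_long_session.py.
WANTED = (
    "scheduler_active_sessions",
    "scheduler_pool_in_use",
    "scheduler_pool_total",
    "scheduler_kv_live_bytes",
)


def parse_prom_text(text: str) -> dict[str, float]:
    """Parse unlabelled samples from Prometheus text exposition format."""
    samples: dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        parts = line.split()
        if len(parts) < 2 or "{" in parts[0]:  # this sketch ignores labelled series
            continue
        try:
            samples[parts[0]] = float(parts[1])
        except ValueError:
            continue  # malformed value; ignore the line
    return samples


def select_metrics(samples, wanted=WANTED):
    """Return (found, missing) so the caller can report absent names."""
    found = {name: samples[name] for name in wanted if name in samples}
    missing = [name for name in wanted if name not in samples]
    return found, missing
```

Returning the `missing` list alongside the values lets a caller log or abort on the first scrape when an expected name is absent, rather than discovering an empty dict after the run.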

Issue 3 — prefill cost growth (architectural)

Documented inline in the bench docstring. KV memory IS bounded
(verifier sink+window holds), but per-turn latency is not because
the OpenAI chat-completions protocol is stateless: every turn the
server tokenizes the full history and runs the verifier prefill
from token 0. Sink+window only bounds generation-phase memory, not
prefill cost. Cross-request KV reuse is a v0.4 feature; the
recommended user-side mitigation is summarization or sliding-window
prompt management.
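As an illustration of the sliding-window mitigation, a client can trim the message list before each request so the prompt the server must prefill stops growing. This is a sketch under stated assumptions: messages use the OpenAI chat format (`role` / `content` dicts), the helper name `trim_history` and the `max_recent` parameter are hypothetical, and the budget is counted in messages rather than tokens.

```python
def trim_history(messages, max_recent=8):
    """Keep all system messages plus the last `max_recent` non-system messages.

    System messages are moved to the front of the returned list.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    recent = rest[-max_recent:] if max_recent > 0 else []
    return system + recent
```

A token-based budget would track prefill cost more closely than a message count, at the price of needing the model's tokenizer on the client.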

The bench now reports KV-bounded as a hard claim and latency drift
as a measurement, not a gate — so a long-session run produces a
useful report even when latency walks up.

Changes

Server side (3 files, +36 lines)

- inference_engine/memory/pool.py: SlabPool.live_kv_bytes property
  aggregating across in-use slabs.
- inference_engine/server/metrics.py: new scheduler_kv_live_bytes
  Gauge with HELP pointing at ADR 0006 §2.3; snapshot_scheduler now
  takes (and explicitly resets) the kv_live_bytes kwarg.
- inference_engine/server/app.py: both the bootstrap snapshot and the
  per-/metrics-scrape refresh pass pool.live_kv_bytes through.
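The shape of the server-side change can be sketched with stdlib-only stand-ins. Everything here is hypothetical: `Slab`, `Pool`, and `Gauges` are illustrative substitutes for the real SlabPool and metrics classes, and the `snapshot_scheduler` signature is assumed, not copied from the repository. The sketch shows two behaviours: only in-use slabs are summed, and the gauge is written on every snapshot so omitting the kwarg resets it to 0.

```python
from dataclasses import dataclass, field


@dataclass
class Slab:
    """Hypothetical stand-in for a pooled KV slab."""
    live_kv_bytes: int = 0
    in_use: bool = False


@dataclass
class Pool:
    """Hypothetical stand-in for the slab pool."""
    slabs: list = field(default_factory=list)

    @property
    def live_kv_bytes(self) -> int:
        # Free slabs are excluded; only in-use slabs contribute.
        return sum(s.live_kv_bytes for s in self.slabs if s.in_use)


class Gauges:
    """Dict-backed stand-in for the exporter's gauges."""

    def __init__(self):
        self.values = {}

    def snapshot_scheduler(self, *, pool_in_use, pool_total, kv_live_bytes=0):
        self.values["scheduler_pool_in_use"] = pool_in_use
        self.values["scheduler_pool_total"] = pool_total
        # Written unconditionally, so a call without the kwarg resets it to 0.
        self.values["scheduler_kv_live_bytes"] = kv_live_bytes
```

Writing the gauge unconditionally is what keeps a stale value from surviving into a later scrape.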

Bench script (1 file, +52 lines / -13)

scripts/bench_agentic/bench_long_session.py
- Corrected metric names everywhere (5 sites: _METRIC_NAMES,
  docstring, progress line, bucketize, aggregate).
- Added a 'what this bench measures and doesn't' note documenting
  the prefill-vs-memory distinction surfaced by this triage.

Tests (3 files, +51 lines)

| File | New tests |
|---|---|
| `tests/inference_engine/memory/test_pool.py` | `test_live_kv_bytes_zero_when_empty`, `test_live_kv_bytes_counts_only_in_use_slabs`, `test_live_kv_bytes_aggregates_across_multiple_in_use` |
| `tests/inference_engine/server/test_metrics.py` | `test_snapshot_scheduler_kv_live_bytes_default_zero` + `scheduler_kv_live_bytes` added to expected-names set + assertions in existing snapshot tests |
| `tests/inference_engine/server/test_app_metrics_and_auth.py` | `test_metrics_kv_live_bytes_gauge_present_and_zero_at_idle` (HTTP-level integration) |

Verification

$ pytest tests/inference_engine/server/test_metrics.py \
         tests/inference_engine/memory/test_pool.py \
         tests/inference_engine/server/test_app_metrics_and_auth.py -q
52 passed in 1.37s

$ pytest tests/inference_engine/ -q
389 passed in 10.60s

$ python3 scripts/bench_agentic/bench_long_session.py --dry-run
[bench] dry-run: argparse OK; would drive single agent for 14400s @ turn_spacing=5.0s

$ python3 -c "from scripts.bench_agentic.bench_long_session import _parse_prom_text; \
    print(_parse_prom_text('scheduler_kv_live_bytes 1048576.0\\n'))"
{'scheduler_kv_live_bytes': 1048576.0}

Recommended next-run path

Once this PR + the server-side orphan-session fix from aborted.json
are both merged, a 30-minute short-run is the right validation step
before another 4h attempt:

PYTHONPATH=. python3 scripts/bench_agentic/bench_long_session.py \
    --kakeya-url http://127.0.0.1:8000 --kakeya-model kakeya-v1 \
    --duration-s 1800 --turn-spacing-s 5 --max-tokens 64 \
    --report results/platform-tests/bench_long_session_mac_short_$(date +%s).json

Success criteria for the short run:

- All turns record a non-empty metrics dict with 5 keys, including
  scheduler_kv_live_bytes.
- scheduler_kv_live_bytes stays roughly constant (<10% spread), which
  verifies the §2.3 KV-bounded claim directly.
- No sustained 429, which verifies the orphan-session fix.
- agg.kv_bounded == True in the final JSON.

If all four criteria hold, a fresh 4-hour run becomes the GA-evidence step.
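The "<10% spread" criterion can be checked offline from the per-turn gauge values in the report. This is a sketch only: the PR does not define "spread", so the formula below, (max - min) / mean, is an assumption, and the function names are hypothetical. It is not necessarily how the bench computes `agg.kv_bounded`.

```python
def kv_live_bytes_spread(samples):
    """Relative spread of the gauge: (max - min) / mean (assumed definition)."""
    if not samples:
        raise ValueError("no scheduler_kv_live_bytes samples recorded")
    mean = sum(samples) / len(samples)
    if mean == 0:
        return 0.0  # gauge stayed at zero for the whole run
    return (max(samples) - min(samples)) / mean


def kv_bounded(samples, max_spread=0.10):
    """True when the relative spread is below the threshold."""
    return kv_live_bytes_spread(samples) < max_spread
```

Raising on an empty sample list keeps a run that recorded no metrics from being reported as bounded.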

Quality bars

- No mock: every test exercises real SlabPool / Metrics /
  ASGI app; no unittest.mock.
- No fallback: kv_live_bytes is always set on every snapshot;
  the kwarg defaults to 0 explicitly so a stale prior value cannot
  leak into the next scrape.
- No overfit: tests assert structural invariants (gauge is
  registered, exposed, summed, defaults reset), not specific
  byte counts.

References

- Triage data: results/platform-tests/bench_long_session_mac_1780130542.aborted.json
  (on AgentMemory/bench-long-session-mac-results-8e7f)
- ADR 0006 §2.3: long-session memory-stability claim that this gauge verifies
- ADR 0003: live_kv_bytes_override mechanism on KVSlab that
  this aggregation reads through

Triage of the 2026-05-30 Mac M4 4h run
(results/platform-tests/bench_long_session_mac_1780130542.aborted.json)
surfaced a bench-side bug that made the entire run produce no
KV-memory data.

Root cause
----------
scripts/bench_agentic/bench_long_session.py looked for these metric
names:

    inference_engine_scheduler_active_sessions
    inference_engine_scheduler_pool_in_use
    inference_engine_scheduler_pool_size
    inference_engine_scheduler_kv_live_bytes

but the server actually exposes:

    scheduler_active_sessions
    scheduler_pool_in_use
    scheduler_pool_total
    scheduler_pending
    (no scheduler_kv_live_bytes at all)

So all 58 turns recorded an empty 'metrics: {}' dict and the
KV-bounded check (the headline ADR 0006 §2.3 claim) was effectively
disabled.

Fixes in this commit
--------------------
1. inference_engine/memory/pool.py
   SlabPool.live_kv_bytes property: sums RolloutSlab.live_kv_bytes
   across the in-use slabs. Free slabs report 0 (logical_size is
   reset on release). Verifier-side bytes flow in via PooledVerifier's
   existing live_kv_bytes_override mechanism, so this aggregate
   matches the verifier's actual footprint.

2. inference_engine/server/metrics.py
   New 'scheduler_kv_live_bytes' Gauge with descriptive HELP text
   pointing at ADR 0006 §2.3. snapshot_scheduler() takes an extra
   kv_live_bytes kwarg (defaults to 0 so existing callers stay valid;
   calling without it now also explicitly resets the gauge so a stale
   prior value never bleeds into the next scrape).

3. inference_engine/server/app.py
   Bootstrap snapshot + the per-/metrics-scrape refresh both now pass
   pool.live_kv_bytes through.

4. scripts/bench_agentic/bench_long_session.py
   - Corrected metric names everywhere they appear (5 sites:
     _METRIC_NAMES, docstring, progress line, bucketize aggregate,
     global aggregate).
   - Added a 'note on what this bench measures and what it doesn't'
     to the module docstring documenting the
     prefill-grows-linearly-with-history finding from the same triage.

That observation is a *protocol-level* limitation of stateless OpenAI
chat-completions, NOT a sink+window failure: KV memory is bounded but
prefill cost is O(history). The bench reports both metrics
independently: KV bounded is a hard claim, latency drift is a
measurement.

Tests
-----
tests/inference_engine/memory/test_pool.py
    +3 tests for live_kv_bytes aggregation
tests/inference_engine/server/test_metrics.py
    +1 test for default-zero kwarg, +scheduler_kv_live_bytes in
    expected-name set, +assertions in existing snapshot tests
tests/.../test_app_metrics_and_auth.py
    +1 integration test asserting /metrics exposes the new gauge at
    idle

Verified:
    pytest tests/inference_engine/ → 389 passed
    python3 scripts/bench_agentic/bench_long_session.py --dry-run → OK
    parser smoke test on real metric-name sample → parses all 5

Out of scope
------------
- Server-side fixes for the orphan-session-on-disconnect bug (already
  applied locally on the Mac per the aborted.json fixes_applied list;
  will be a separate PR from that work).
- Architectural decision on cross-request KV reuse (the actual fix for
  prefill latency growth): needs ADR 0006 amendment and likely a v0.4
  implementation, separate work line.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

FluffyAIcode merged commit a99f8b2 into main May 31, 2026
3 checks passed

FluffyAIcode mentioned this pull request May 31, 2026

ADR 0007 (Proposed): Cross-request KV cache reuse for long sessions #30

Merged

FluffyAIcode merged 1 commit into main from AgentMemory/bench-long-session-fixes-8e7f
