Skip to content

Fix bench_long_session metric names + expose scheduler_kv_live_bytes gauge#24

Merged
FluffyAIcode merged 1 commit into
mainfrom
AgentMemory/bench-long-session-fixes-8e7f
May 31, 2026
Merged

Fix bench_long_session metric names + expose scheduler_kv_live_bytes gauge#24
FluffyAIcode merged 1 commit into
mainfrom
AgentMemory/bench-long-session-fixes-8e7f

Conversation

@FluffyAIcode

Copy link
Copy Markdown
Owner

Summary

Triage + fixes for the 2026-05-30 Mac M4 4-hour bench_long_session.py
run that stopped advancing at 29.4 min. Fixes the bench-script bug that
made the run record no KV memory data at all, and adds the
server-side scheduler_kv_live_bytes gauge the bench was looking for.

Results data lives at results/platform-tests/bench_long_session_mac_1780130542.{partial,aborted}.json
on AgentMemory/bench-long-session-mac-results-8e7f.

Triage

The bench wrote a partial checkpoint with 58 successful turns over
29.4 minutes, 0 errors recorded, but metrics: {} on every turn
.
After turn 57 the server-side logs show 3+ hours of 429 Too Many Requests with no client-side advance — i.e. the bench was hung on a
single in-flight HTTP call while a previous orphan session pinned the
only scheduler slot.

Three independent issues turned up:

Issue Layer Status
1. Bench scraped wrong metric names → all metrics empty → KV-bounded claim couldn't be checked bench script Fixed in this PR
2. Orphan session on client disconnect → 429 saturation server Fixed locally per aborted.json fixes_applied (separate PR from that work line)
3. p50 latency grew 14.7 s → 55.5 s in 30 min (prefill is re-done every turn) architecture Documented; out of scope for v0.3.0

Issue 1 — bench/server metric-name mismatch

Bench expected Server actually exposes
inference_engine_scheduler_active_sessions scheduler_active_sessions
inference_engine_scheduler_pool_in_use scheduler_pool_in_use
inference_engine_scheduler_pool_size scheduler_pool_total
inference_engine_scheduler_kv_live_bytes didn't exist

prometheus-client does not add a service prefix, and the KV-live-bytes
gauge was simply missing from the exporter — both ends of the contract
needed to move.

Issue 3 — prefill cost growth (architectural)

Documented inline in the bench docstring. KV memory IS bounded
(verifier sink+window holds), but per-turn latency is not because
the OpenAI chat-completions protocol is stateless: every turn the
server tokenizes the full history and runs the verifier prefill
from token 0. Sink+window only bounds generation-phase memory, not
prefill cost. Cross-request KV reuse is a v0.4 feature; the
recommended user-side mitigation is summarization or sliding-window
prompt management.

The bench now reports KV-bounded as a hard claim and latency drift
as a measurement, not a gate — so a long-session run produces a
useful report even when latency walks up.

Changes

Server side (3 files, +36 lines)

  • inference_engine/memory/pool.pySlabPool.live_kv_bytes property
    aggregating across in-use slabs.
  • inference_engine/server/metrics.py — new scheduler_kv_live_bytes
    Gauge with HELP pointing at ADR 0006 §2.3; snapshot_scheduler now
    takes (and explicitly resets) the kv_live_bytes kwarg.
  • inference_engine/server/app.py — both bootstrap snapshot and the
    per-/metrics-scrape refresh pass pool.live_kv_bytes through.

Bench script (1 file, +52 lines / -13)

  • scripts/bench_agentic/bench_long_session.py
    • Corrected metric names everywhere (5 sites: _METRIC_NAMES,
      docstring, progress line, bucketize, aggregate).
    • Added a 'what this bench measures and doesn't' note documenting
      the prefill-vs-memory distinction surfaced by this triage.

Tests (3 files, +51 lines)

File New tests
tests/inference_engine/memory/test_pool.py test_live_kv_bytes_zero_when_empty, test_live_kv_bytes_counts_only_in_use_slabs, test_live_kv_bytes_aggregates_across_multiple_in_use
tests/inference_engine/server/test_metrics.py test_snapshot_scheduler_kv_live_bytes_default_zero + scheduler_kv_live_bytes added to expected-names set + assertions in existing snapshot tests
tests/inference_engine/server/test_app_metrics_and_auth.py test_metrics_kv_live_bytes_gauge_present_and_zero_at_idle (HTTP-level integration)

Verification

$ pytest tests/inference_engine/server/test_metrics.py \
         tests/inference_engine/memory/test_pool.py \
         tests/inference_engine/server/test_app_metrics_and_auth.py -q
52 passed in 1.37s

$ pytest tests/inference_engine/ -q
389 passed in 10.60s

$ python3 scripts/bench_agentic/bench_long_session.py --dry-run
[bench] dry-run: argparse OK; would drive single agent for 14400s @ turn_spacing=5.0s

$ python3 -c "from scripts.bench_agentic.bench_long_session import _parse_prom_text; \
    print(_parse_prom_text('scheduler_kv_live_bytes 1048576.0\\n'))"
{'scheduler_kv_live_bytes': 1048576.0}

Recommended next-run path

Once this PR + the server-side orphan-session fix from aborted.json
are both merged, a 30-minute short-run is the right validation step
before another 4h attempt:

PYTHONPATH=. python3 scripts/bench_agentic/bench_long_session.py \
    --kakeya-url http://127.0.0.1:8000 --kakeya-model kakeya-v1 \
    --duration-s 1800 --turn-spacing-s 5 --max-tokens 64 \
    --report results/platform-tests/bench_long_session_mac_short_$(date +%s).json

Success criteria for the short run:

  • All turns record non-empty metrics dict with 5 keys including
    scheduler_kv_live_bytes.
  • scheduler_kv_live_bytes stays roughly constant (<10% spread) —
    the §2.3 KV-bounded claim is verified directly.
  • No sustained 429 — orphan-session fix verified.
  • agg.kv_bounded == True in the final JSON.

If both check out, a fresh 4-hour run becomes the GA-evidence step.

Quality bars

  • No mock: every test exercises real SlabPool / Metrics /
    ASGI app; no unittest.mock.
  • No fallback: kv_live_bytes always set on every snapshot;
    the kwarg defaults to 0 explicitly so a stale prior value cannot
    leak into the next scrape.
  • No overfit: tests assert structural invariants (gauge is
    registered, exposed, summed, defaults reset), not specific
    byte counts.

References

  • Triage data: results/platform-tests/bench_long_session_mac_1780130542.aborted.json
    (on AgentMemory/bench-long-session-mac-results-8e7f)
  • ADR 0006 §2.3 — long-session memory-stability claim that this gauge verifies
  • ADR 0003 — live_kv_bytes_override mechanism on KVSlab that
    this aggregation reads through
Open in Web Open in Cursor 

Triage of the 2026-05-30 Mac M4 4h run (results/platform-tests/
bench_long_session_mac_1780130542.aborted.json) surfaced a bench-side
bug that made the entire run produce no KV-memory data.

Root cause
----------
scripts/bench_agentic/bench_long_session.py looked for these metric
names:

    inference_engine_scheduler_active_sessions
    inference_engine_scheduler_pool_in_use
    inference_engine_scheduler_pool_size
    inference_engine_scheduler_kv_live_bytes

but the server actually exposes:

    scheduler_active_sessions
    scheduler_pool_in_use
    scheduler_pool_total
    scheduler_pending
    (no scheduler_kv_live_bytes at all)

So all 58 turns recorded an empty 'metrics: {}' dict and the
KV-bounded check (the headline ADR 0006 §2.3 claim) was effectively
disabled.

Fixes in this commit
--------------------
1. inference_engine/memory/pool.py
     SlabPool.live_kv_bytes property — sums RolloutSlab.live_kv_bytes
     across the in-use slabs. Free slabs report 0 (logical_size is
     reset on release). Verifier-side bytes flow in via
     PooledVerifier's existing live_kv_bytes_override mechanism, so
     this aggregate matches the verifier's actual footprint.

2. inference_engine/server/metrics.py
     New 'scheduler_kv_live_bytes' Gauge with descriptive HELP text
     pointing at ADR 0006 §2.3. snapshot_scheduler() takes an extra
     kv_live_bytes kwarg (defaults to 0 so existing callers stay
     valid; calling without it now also explicitly resets the gauge
     so a stale prior value never bleeds into the next scrape).

3. inference_engine/server/app.py
     Bootstrap snapshot + the per-/metrics-scrape refresh both now
     pass pool.live_kv_bytes through.

4. scripts/bench_agentic/bench_long_session.py
     - Corrected metric names everywhere they appear (5 sites:
       _METRIC_NAMES, docstring, progress line, bucketize aggregate,
       global aggregate).
     - Added a 'note on what this bench measures and what it
       doesn't' to the module docstring documenting the prefill-
       grows-linearly-with-history finding from the same triage.
       That observation is a *protocol-level* limitation of stateless
       OpenAI chat-completions, NOT a sink+window failure: KV memory
       is bounded but prefill cost is O(history). The bench reports
       both metrics independently — KV bounded is a hard claim,
       latency drift is a measurement.

Tests
-----
  tests/inference_engine/memory/test_pool.py      +3 tests for
                                                  live_kv_bytes
                                                  aggregation
  tests/inference_engine/server/test_metrics.py   +1 test for
                                                  default-zero kwarg,
                                                  +scheduler_kv_live_bytes
                                                  in expected-name set,
                                                  +assertions in
                                                  existing snapshot
                                                  tests
  tests/.../test_app_metrics_and_auth.py          +1 integration test
                                                  asserting /metrics
                                                  exposes the new
                                                  gauge at idle

Verified:
  pytest tests/inference_engine/   →  389 passed
  python3 scripts/bench_agentic/bench_long_session.py --dry-run  →  OK
  parser smoke test on real metric-name sample  →  parses all 5

Out of scope
------------
  - Server-side fixes for the orphan-session-on-disconnect bug
    (already applied locally on the Mac per the aborted.json
    fixes_applied list; will be a separate PR from that work).
  - Architectural decision on cross-request KV reuse (the actual
    fix for prefill latency growth) — needs ADR 0006 amendment
    and likely a v0.4 implementation, separate work line.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants