Cancel scheduler session on HTTP disconnect / drain error by FluffyAIcode · Pull Request #25 · FluffyAIcode/Kakeya-LLM-Inference-engine

FluffyAIcode · 2026-05-30T12:33:57Z

Summary

Server-side fix for the orphan-session bug surfaced by the 2026-05-30
Mac M4 4-hour bench_long_session.py run. This is the second of the
two PRs needed to unblock a clean re-run; PR #24 fixed the bench-side
metric-name mismatch and added the scheduler_kv_live_bytes gauge.

The aborted run report (results/platform-tests/bench_long_session_mac_1780130542.aborted.json
on AgentMemory/bench-long-session-mac-results-8e7f) recorded:

58 successful turns over 29.4 minutes
Server-side logs: 3+ hours of POST /v1/chat/completions 429 Too Many Requests
Bench client showed 0 errors in errors[] because it was hung on a
single in-flight HTTP request

Diagnosis: when a non-streaming chat request was cancelled or timed
out client-side, the scheduler worker kept running until generation
finished naturally, holding the only slab. With --max-concurrent 1
every subsequent request returned 429.

Changes

`inference_engine/scheduler/scheduler.py` (+5 / -1 lines)

The cross-thread token enqueue inside _run_session now blocks on
the run_coroutine_threadsafe future before returning. Without this,
the worker thread can keep producing tokens while the event loop
serially drains them, but on generate() return the terminal sentinel
is pushed back on the loop and can race ahead of still-pending token
puts — corrupting the SSE chunk order.

enqueue = asyncio.run_coroutine_threadsafe(
    session.token_queue.put(int(tok_id)), loop
)
enqueue.result()  # ← preserve token-before-sentinel ordering

`inference_engine/server/app.py` (+38 / -5 lines)

Two changes to the non-streaming /v1/chat/completions path:

Cancel on disconnect / cancellation. A new helper
_collect_non_streaming_tokens polls
request.is_disconnected() every 50 ms while draining the
session's token queue, and calls scheduler.cancel_session(session)
when the client has gone away. Streaming responses already had
this loop; the non-streaming path was the orphan-session source.
Cancel on engine error. The BaseException catch around the
drain now calls scheduler.cancel_session(session) before
converting to HTTP 500, so a verifier crash mid-generate also
releases the slab.
The asyncio.CancelledError path (transport-level cancel from
the test client / a real timed-out client) cancels the session
and re-raises, so the route handler never returns from a
half-drained generation with the slab still held.

`tests/inference_engine/server/test_app_streaming.py` (+32 lines)

test_collect_non_streaming_tokens_cancels_on_disconnect exercises
the new helper directly with a _FakeRequest that returns
is_disconnected() == True on first poll; asserts the session
state transitions to CANCELLED and scheduler.active_count
returns to zero. Real concrete classes only — no mocks.

Verification

$ pytest tests/inference_engine/server/ tests/inference_engine/scheduler/ -q
278 passed in 3.98s

$ pytest tests/inference_engine/ -q   (with PR #24 merged on top)
390 passed in 10.57s

Compatibility with PR #24

Both PRs touch inference_engine/server/app.py but on different
lines (PR #24: bootstrap snapshot + /metrics scrape; this PR: non-
streaming chat handler). I tested the auto-merge locally:

$ git merge origin/AgentMemory/scheduler-orphan-session-fix-8e7f \
       (onto AgentMemory/bench-long-session-fixes-8e7f)
Auto-merging inference_engine/server/app.py
Automatic merge went well

$ pytest tests/inference_engine/ -q
390 passed

Merge order does not matter; either PR can land first.

Recommended re-run path

After both PR #24 and this PR land, the validation sequence is:

# 30-min short run first
PYTHONPATH=. python3 scripts/bench_agentic/bench_long_session.py \
    --kakeya-url http://127.0.0.1:8000 --kakeya-model kakeya-v1 \
    --duration-s 1800 --turn-spacing-s 5 --max-tokens 64 \
    --report results/platform-tests/bench_long_session_mac_short_$(date +%s).json

Success criteria:

All turns record non-empty metrics dict with 5 keys including
scheduler_kv_live_bytes. (PR Fix bench_long_session metric names + expose scheduler_kv_live_bytes gauge #24)
scheduler_kv_live_bytes stays bounded — verifies ADR 0006 §2.3.
No sustained 429. This PR is the gate on that one.
agg.kv_bounded == True in the final JSON.

If the short run is clean, a 4-hour run is the GA-evidence step.

References

Triage data: results/platform-tests/bench_long_session_mac_1780130542.aborted.json
on AgentMemory/bench-long-session-mac-results-8e7f
Companion PR: Fix bench_long_session metric names + expose scheduler_kv_live_bytes gauge #24 (bench metric names + scheduler_kv_live_bytes gauge)
ADR 0003 — slab pool / verifier integration
ADR 0006 §2.3 — long-session memory-stability claim

Co-authored-by: Cursor <cursoragent@cursor.com>

CI on PR #25 reported 99.83% coverage with 2 statements missing on inference_engine/server/app.py — specifically the route handler's 'except asyncio.CancelledError' branch (line 384): try: output_token_ids = await _collect_non_streaming_tokens(...) except asyncio.CancelledError: await scheduler.cancel_session(session) # ← 384, uncovered raise This branch fires when the route handler's awaiting Future is cancelled externally (uvicorn shutdown, transport-level client timeout cancelling the request task) — distinct from the disconnect-poll path covered by the previous test. Adds two tests, both real concrete classes / no mocks: test_collect_non_streaming_tokens_propagates_cancel_and_cleans_up — unit-level: drives _collect_non_streaming_tokens in an asyncio.Task; cancels the task externally; asserts CancelledError propagates and that cancel_session releases the slab. test_route_handler_cancelled_error_branch_releases_slab — integration-level via httpx.ASGITransport: cancels an in-flight POST /v1/chat/completions task; asserts a follow-up request returns 200 (not 429), proving the route handler's CancelledError catch ran cancel_session and released the slab. Verified locally: 387 inference_engine tests pass; 14 cancel-related streaming tests all green. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

The 2026-05-31 4-hour Mac M4 run completed and produced two independent pieces of evidence: §2.3.a: same 7,798,784-byte KV peak on 58/58 first-30-min turns (identical to short test #3 — observed/expected 100.0000% on two separate runs). The 4h run additionally confirmed the orphan-session fix (PR #25) holds for 4 hours of continuous server uptime: idle_pool_in_use_settles_to_zero=True throughout. §2.3.b: empirical confirmation that long-session latency outpaces any fixed client timeout. After the first 30 min, the run produced 182 errors over 3.5 hours — 96 ReadTimeouts (prefill cost > 120s default timeout) and 86 HTTP 429s (the timed-out request's slab still held while server worker finished prefill). This is exactly the §2.3.b pattern made manifest. Updates ------- §2.3.a Status block: Replace single-run citation with a comparison table listing both runs. Add the orphan-session invariant observation from the 4h run. Both evidence file paths now in §2.3.a. §2.3.b Status block: Add the empirical 4h timeout/429 pattern as direct evidence of the protocol-level limit. Quantify the practical envelope on Mac M4: ~30 min / ~60 turns of useful work for a single continuous session at default config, after which client-side prompt management is required. §5 GA gate item 4: Cite both runs as joint validation evidence. Diff: +24 / -10 lines, ADR markdown only. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

fluffy314 and others added 2 commits May 30, 2026 20:30

Cancel scheduler session on HTTP disconnect / drain error

52237c4

Co-authored-by: Cursor <cursoragent@cursor.com>

FluffyAIcode merged commit 066d2c3 into main May 31, 2026
3 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cancel scheduler session on HTTP disconnect / drain error#25

Cancel scheduler session on HTTP disconnect / drain error#25
FluffyAIcode merged 2 commits into
mainfrom
AgentMemory/scheduler-orphan-session-fix-8e7f

FluffyAIcode commented May 30, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

FluffyAIcode commented May 30, 2026

Summary

Changes

inference_engine/scheduler/scheduler.py (+5 / -1 lines)

inference_engine/server/app.py (+38 / -5 lines)

tests/inference_engine/server/test_app_streaming.py (+32 lines)

Verification

Compatibility with PR #24

Recommended re-run path

References

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

`inference_engine/scheduler/scheduler.py` (+5 / -1 lines)

`inference_engine/server/app.py` (+38 / -5 lines)

`tests/inference_engine/server/test_app_streaming.py` (+32 lines)