Cancel scheduler session on HTTP disconnect / drain error#25
Merged
FluffyAIcode merged 2 commits intoMay 31, 2026
Conversation
Co-authored-by: Cursor <cursoragent@cursor.com>
CI on PR #25 reported 99.83% coverage with 2 statements missing on inference_engine/server/app.py — specifically the route handler's 'except asyncio.CancelledError' branch (line 384): try: output_token_ids = await _collect_non_streaming_tokens(...) except asyncio.CancelledError: await scheduler.cancel_session(session) # ← 384, uncovered raise This branch fires when the route handler's awaiting Future is cancelled externally (uvicorn shutdown, transport-level client timeout cancelling the request task) — distinct from the disconnect-poll path covered by the previous test. Adds two tests, both real concrete classes / no mocks: test_collect_non_streaming_tokens_propagates_cancel_and_cleans_up — unit-level: drives _collect_non_streaming_tokens in an asyncio.Task; cancels the task externally; asserts CancelledError propagates and that cancel_session releases the slab. test_route_handler_cancelled_error_branch_releases_slab — integration-level via httpx.ASGITransport: cancels an in-flight POST /v1/chat/completions task; asserts a follow-up request returns 200 (not 429), proving the route handler's CancelledError catch ran cancel_session and released the slab. Verified locally: 387 inference_engine tests pass; 14 cancel-related streaming tests all green. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
This was referenced May 30, 2026
cursor Bot
pushed a commit
that referenced
this pull request
May 31, 2026
The 2026-05-31 4-hour Mac M4 run completed and produced two independent pieces of evidence: §2.3.a: same 7,798,784-byte KV peak on 58/58 first-30-min turns (identical to short test #3 — observed/expected 100.0000% on two separate runs). The 4h run additionally confirmed the orphan-session fix (PR #25) holds for 4 hours of continuous server uptime: idle_pool_in_use_settles_to_zero=True throughout. §2.3.b: empirical confirmation that long-session latency outpaces any fixed client timeout. After the first 30 min, the run produced 182 errors over 3.5 hours — 96 ReadTimeouts (prefill cost > 120s default timeout) and 86 HTTP 429s (the timed-out request's slab still held while server worker finished prefill). This is exactly the §2.3.b pattern made manifest. Updates ------- §2.3.a Status block: Replace single-run citation with a comparison table listing both runs. Add the orphan-session invariant observation from the 4h run. Both evidence file paths now in §2.3.a. §2.3.b Status block: Add the empirical 4h timeout/429 pattern as direct evidence of the protocol-level limit. Quantify the practical envelope on Mac M4: ~30 min / ~60 turns of useful work for a single continuous session at default config, after which client-side prompt management is required. §5 GA gate item 4: Cite both runs as joint validation evidence. Diff: +24 / -10 lines, ADR markdown only. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Server-side fix for the orphan-session bug surfaced by the 2026-05-30
Mac M4 4-hour
bench_long_session.pyrun. This is the second of thetwo PRs needed to unblock a clean re-run; PR #24 fixed the bench-side
metric-name mismatch and added the
scheduler_kv_live_bytesgauge.The aborted run report (
results/platform-tests/bench_long_session_mac_1780130542.aborted.jsonon
AgentMemory/bench-long-session-mac-results-8e7f) recorded:POST /v1/chat/completions 429 Too Many Requestserrors[]because it was hung on asingle in-flight HTTP request
Diagnosis: when a non-streaming chat request was cancelled or timed
out client-side, the scheduler worker kept running until generation
finished naturally, holding the only slab. With
--max-concurrent 1every subsequent request returned 429.
Changes
inference_engine/scheduler/scheduler.py(+5 / -1 lines)The cross-thread token enqueue inside
_run_sessionnow blocks onthe
run_coroutine_threadsafefuture before returning. Without this,the worker thread can keep producing tokens while the event loop
serially drains them, but on
generate()return the terminal sentinelis pushed back on the loop and can race ahead of still-pending token
puts — corrupting the SSE chunk order.
inference_engine/server/app.py(+38 / -5 lines)Two changes to the non-streaming
/v1/chat/completionspath:_collect_non_streaming_tokenspollsrequest.is_disconnected()every 50 ms while draining thesession's token queue, and calls
scheduler.cancel_session(session)when the client has gone away. Streaming responses already had
this loop; the non-streaming path was the orphan-session source.
BaseExceptioncatch around thedrain now calls
scheduler.cancel_session(session)beforeconverting to HTTP 500, so a verifier crash mid-generate also
releases the slab.
asyncio.CancelledErrorpath (transport-level cancel fromthe test client / a real timed-out client) cancels the session
and re-raises, so the route handler never returns from a
half-drained generation with the slab still held.
tests/inference_engine/server/test_app_streaming.py(+32 lines)test_collect_non_streaming_tokens_cancels_on_disconnectexercisesthe new helper directly with a
_FakeRequestthat returnsis_disconnected() == Trueon first poll; asserts the sessionstate transitions to
CANCELLEDandscheduler.active_countreturns to zero. Real concrete classes only — no mocks.
Verification
Compatibility with PR #24
Both PRs touch
inference_engine/server/app.pybut on differentlines (PR #24: bootstrap snapshot +
/metricsscrape; this PR: non-streaming chat handler). I tested the auto-merge locally:
Merge order does not matter; either PR can land first.
Recommended re-run path
After both PR #24 and this PR land, the validation sequence is:
Success criteria:
metricsdict with 5 keys includingscheduler_kv_live_bytes. (PR Fix bench_long_session metric names + expose scheduler_kv_live_bytes gauge #24)scheduler_kv_live_bytesstays bounded — verifies ADR 0006 §2.3.agg.kv_bounded == Truein the final JSON.If the short run is clean, a 4-hour run is the GA-evidence step.
References
results/platform-tests/bench_long_session_mac_1780130542.aborted.jsonon
AgentMemory/bench-long-session-mac-results-8e7fscheduler_kv_live_bytesgauge)