Skip to content

Cancel scheduler session on HTTP disconnect / drain error#25

Merged
FluffyAIcode merged 2 commits into
mainfrom
AgentMemory/scheduler-orphan-session-fix-8e7f
May 31, 2026
Merged

Cancel scheduler session on HTTP disconnect / drain error#25
FluffyAIcode merged 2 commits into
mainfrom
AgentMemory/scheduler-orphan-session-fix-8e7f

Conversation

@FluffyAIcode

Copy link
Copy Markdown
Owner

Summary

Server-side fix for the orphan-session bug surfaced by the 2026-05-30
Mac M4 4-hour bench_long_session.py run. This is the second of the
two PRs needed to unblock a clean re-run; PR #24 fixed the bench-side
metric-name mismatch and added the scheduler_kv_live_bytes gauge.

The aborted run report (results/platform-tests/bench_long_session_mac_1780130542.aborted.json
on AgentMemory/bench-long-session-mac-results-8e7f) recorded:

  • 58 successful turns over 29.4 minutes
  • Server-side logs: 3+ hours of POST /v1/chat/completions 429 Too Many Requests
  • Bench client showed 0 errors in errors[] because it was hung on a
    single in-flight HTTP request

Diagnosis: when a non-streaming chat request was cancelled or timed
out client-side, the scheduler worker kept running until generation
finished naturally, holding the only slab. With --max-concurrent 1
every subsequent request returned 429.

Changes

inference_engine/scheduler/scheduler.py (+5 / -1 lines)

The cross-thread token enqueue inside _run_session now blocks on
the run_coroutine_threadsafe future before returning. Without this,
the worker thread can keep producing tokens while the event loop
serially drains them, but on generate() return the terminal sentinel
is pushed back on the loop and can race ahead of still-pending token
puts — corrupting the SSE chunk order.

enqueue = asyncio.run_coroutine_threadsafe(
    session.token_queue.put(int(tok_id)), loop
)
enqueue.result()  # ← preserve token-before-sentinel ordering

inference_engine/server/app.py (+38 / -5 lines)

Two changes to the non-streaming /v1/chat/completions path:

  1. Cancel on disconnect / cancellation. A new helper
    _collect_non_streaming_tokens polls
    request.is_disconnected() every 50 ms while draining the
    session's token queue, and calls scheduler.cancel_session(session)
    when the client has gone away. Streaming responses already had
    this loop; the non-streaming path was the orphan-session source.
  2. Cancel on engine error. The BaseException catch around the
    drain now calls scheduler.cancel_session(session) before
    converting to HTTP 500, so a verifier crash mid-generate also
    releases the slab.
  3. The asyncio.CancelledError path (transport-level cancel from
    the test client / a real timed-out client) cancels the session
    and re-raises, so the route handler never returns from a
    half-drained generation with the slab still held.

tests/inference_engine/server/test_app_streaming.py (+32 lines)

test_collect_non_streaming_tokens_cancels_on_disconnect exercises
the new helper directly with a _FakeRequest that returns
is_disconnected() == True on first poll; asserts the session
state transitions to CANCELLED and scheduler.active_count
returns to zero. Real concrete classes only — no mocks.

Verification

$ pytest tests/inference_engine/server/ tests/inference_engine/scheduler/ -q
278 passed in 3.98s

$ pytest tests/inference_engine/ -q   (with PR #24 merged on top)
390 passed in 10.57s

Compatibility with PR #24

Both PRs touch inference_engine/server/app.py but on different
lines
(PR #24: bootstrap snapshot + /metrics scrape; this PR: non-
streaming chat handler). I tested the auto-merge locally:

$ git merge origin/AgentMemory/scheduler-orphan-session-fix-8e7f \
       (onto AgentMemory/bench-long-session-fixes-8e7f)
Auto-merging inference_engine/server/app.py
Automatic merge went well

$ pytest tests/inference_engine/ -q
390 passed

Merge order does not matter; either PR can land first.

Recommended re-run path

After both PR #24 and this PR land, the validation sequence is:

# 30-min short run first
PYTHONPATH=. python3 scripts/bench_agentic/bench_long_session.py \
    --kakeya-url http://127.0.0.1:8000 --kakeya-model kakeya-v1 \
    --duration-s 1800 --turn-spacing-s 5 --max-tokens 64 \
    --report results/platform-tests/bench_long_session_mac_short_$(date +%s).json

Success criteria:

If the short run is clean, a 4-hour run is the GA-evidence step.

References

Open in Web Open in Cursor 

fluffy314 and others added 2 commits May 30, 2026 20:30
Co-authored-by: Cursor <cursoragent@cursor.com>
CI on PR #25 reported 99.83% coverage with 2 statements missing on
inference_engine/server/app.py — specifically the route handler's
'except asyncio.CancelledError' branch (line 384):

    try:
        output_token_ids = await _collect_non_streaming_tokens(...)
    except asyncio.CancelledError:
        await scheduler.cancel_session(session)   # ← 384, uncovered
        raise

This branch fires when the route handler's awaiting Future is
cancelled externally (uvicorn shutdown, transport-level client
timeout cancelling the request task) — distinct from the
disconnect-poll path covered by the previous test.

Adds two tests, both real concrete classes / no mocks:

  test_collect_non_streaming_tokens_propagates_cancel_and_cleans_up
    — unit-level: drives _collect_non_streaming_tokens in an
      asyncio.Task; cancels the task externally; asserts
      CancelledError propagates and that cancel_session releases
      the slab.

  test_route_handler_cancelled_error_branch_releases_slab
    — integration-level via httpx.ASGITransport: cancels an
      in-flight POST /v1/chat/completions task; asserts a
      follow-up request returns 200 (not 429), proving the
      route handler's CancelledError catch ran cancel_session
      and released the slab.

Verified locally: 387 inference_engine tests pass; 14
cancel-related streaming tests all green.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
cursor Bot pushed a commit that referenced this pull request May 31, 2026
The 2026-05-31 4-hour Mac M4 run completed and produced two
independent pieces of evidence:

§2.3.a: same 7,798,784-byte KV peak on 58/58 first-30-min turns
  (identical to short test #3 — observed/expected 100.0000% on
  two separate runs). The 4h run additionally confirmed the
  orphan-session fix (PR #25) holds for 4 hours of continuous
  server uptime: idle_pool_in_use_settles_to_zero=True throughout.

§2.3.b: empirical confirmation that long-session latency outpaces
  any fixed client timeout. After the first 30 min, the run
  produced 182 errors over 3.5 hours — 96 ReadTimeouts (prefill
  cost > 120s default timeout) and 86 HTTP 429s (the timed-out
  request's slab still held while server worker finished
  prefill). This is exactly the §2.3.b pattern made manifest.

Updates
-------
§2.3.a Status block:
  Replace single-run citation with a comparison table listing
  both runs. Add the orphan-session invariant observation from
  the 4h run. Both evidence file paths now in §2.3.a.

§2.3.b Status block:
  Add the empirical 4h timeout/429 pattern as direct evidence
  of the protocol-level limit. Quantify the practical envelope
  on Mac M4: ~30 min / ~60 turns of useful work for a single
  continuous session at default config, after which client-side
  prompt management is required.

§5 GA gate item 4:
  Cite both runs as joint validation evidence.

Diff: +24 / -10 lines, ADR markdown only.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
@FluffyAIcode FluffyAIcode merged commit 066d2c3 into main May 31, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants