Commit 724be1b
PR-D2 (ADR 0008 Phase D): refactor HTTP shim onto SessionStore
Retires the Scheduler + PooledVerifier + SpeculativeEngine machinery
from the HTTP shim's request path. Each /v1/chat/completions request
is now a single-shot session under SessionStore: CreateSession →
AppendTokens(prompt) → Generate → CloseSession. The semantics match
the gRPC RuntimeService surface; the shim itself is deprecated per
ADR 0008 §2.7.
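
For illustration only, the single-shot lifecycle described above can be
sketched as follows. ToySessionStore, its method names, and the "last
token + 1" placeholder model are all assumptions made for this example;
they are not the project's SessionStore API. The point is the shape:
the session is always closed, on success or on error.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class ToySessionStore:
    """Hypothetical in-memory stand-in; NOT the project's SessionStore."""

    sessions: dict[str, list[int]] = field(default_factory=dict)
    _next_id: int = 0

    def create_session(self) -> str:
        self._next_id += 1
        sid = f"s{self._next_id}"
        self.sessions[sid] = []
        return sid

    def append_tokens(self, sid: str, tokens: list[int]) -> int:
        self.sessions[sid].extend(tokens)
        return len(self.sessions[sid])  # history length after the append

    def generate(self, sid: str, max_new_tokens: int) -> Iterator[int]:
        # Placeholder "model": each new token is the previous token + 1.
        for _ in range(max_new_tokens):
            nxt = self.sessions[sid][-1] + 1
            self.sessions[sid].append(nxt)
            yield nxt

    def close_session(self, sid: str) -> None:
        self.sessions.pop(sid, None)


def single_shot_completion(
    store: ToySessionStore, prompt_tokens: list[int], max_new_tokens: int
) -> list[int]:
    """One request == one session: create, append, generate, close."""
    sid = store.create_session()
    try:
        store.append_tokens(sid, prompt_tokens)
        return list(store.generate(sid, max_new_tokens))
    finally:
        # Runs on success and on error, so no session outlives its request.
        store.close_session(sid)
```

The try/finally is the design choice worth noting: with one session per
request, a leaked session would hold its resources until the process
exits.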
Three architectural changes
---------------------------
1. Speculative decoding is no longer applied on the HTTP path.
The session-bound runtime is pure AR against the verifier;
the proposer is deferred to the v0.4 alignment work
(ADR 0004). Pre-PR-D2 the HTTP shim used SpeculativeEngine
(proposer + verifier together); post-PR-D2 it runs at roughly
the same speed as transformers-vanilla AR. Migrate to gRPC for
v0.3's full perf story.
2. Admission control is now an asyncio.Semaphore instead of a
full Scheduler. REJECT vs QUEUE policy with queue_max_wait_s
is preserved (queue_max_wait_s=0 means wait forever); the
in-flight slab-pool bookkeeping moved into SessionStore. The
Scheduler module + integration tests stay (used by other
callers), but the HTTP shim no longer wires it.
3. ADR 0008 §2.7 deprecation headers are stamped onto every
response by a new _DeprecationHeadersMiddleware:
Deprecation: true
Sunset: Wed, 31 Dec 2025 00:00:00 GMT
Link: </docs/adr/0008-...>; rel="successor-version"
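
As a rough sketch of the admission policy in item 2 above (REJECT vs
QUEUE on an asyncio.Semaphore, with queue_max_wait_s=0 meaning "wait
forever"): the admit helper, the AdmissionRejected exception, and the
string policy values are hypothetical names chosen for this example,
not the shim's actual code.

```python
import asyncio
from contextlib import asynccontextmanager


class AdmissionRejected(Exception):
    """Hypothetical error; a real shim would map it to an HTTP status."""


@asynccontextmanager
async def admit(sem: asyncio.Semaphore, policy: str, queue_max_wait_s: float = 0.0):
    """Hold one admission slot for the duration of the `async with` body."""
    if policy == "REJECT":
        # Fail fast when no slot is free right now.
        if sem.locked():
            raise AdmissionRejected("at capacity")
        # A slot is free and we have not yielded to the loop since the
        # check, so this acquire returns without suspending.
        await sem.acquire()
    elif queue_max_wait_s == 0:
        # QUEUE with 0: wait for a slot indefinitely.
        await sem.acquire()
    else:
        # QUEUE with a bound: give up once the wait budget is spent.
        try:
            await asyncio.wait_for(sem.acquire(), timeout=queue_max_wait_s)
        except asyncio.TimeoutError:
            raise AdmissionRejected("queue wait exceeded") from None
    try:
        yield
    finally:
        sem.release()
```

A handler would wrap its work in `async with admit(sem, policy, wait):`
so the slot is released even if generation raises.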
Production-side changes
-----------------------
inference_engine/server/app.py ~rewrite, +330 / -300 net
- create_app's signature changed: now takes (verifier, config,
*, slab_pool=None, model_id_label=None) instead of
(engine, config, pool=None). Caller passes the underlying
SinkWindowVerifier directly.
- Internal: builds SessionStore + AppendTokensCoordinator +
GenerationCoordinator. asyncio.Semaphore for admission.
- Route handler: tokenize → CreateSession → append → generate
(the sync generator runs in asyncio.to_thread so the event loop
stays free to poll for client disconnects) → CloseSession on
success/error.
- SSE streaming: same pattern; queue-bridged from the sync
generator coordinator. HistoryTruncatedEvent is consumed
silently (no OpenAI analog).
- app.state.engine → app.state.{verifier, store, append_coord,
gen_coord, model_id_label, admission_sem}.
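
The queue-bridged streaming pattern from the route-handler and SSE
bullets above might look roughly like this. It is a minimal sketch
under assumptions: bridge_sync_generator and its make_gen parameter are
illustrative names, items are plain strings rather than the
coordinator's event types, and event filtering (e.g. dropping
HistoryTruncatedEvent) and SSE framing are left out.

```python
import asyncio
from typing import AsyncIterator, Callable, Iterator

_DONE = object()  # sentinel marking the end of the stream


async def bridge_sync_generator(
    make_gen: Callable[[], Iterator[str]],
) -> AsyncIterator[str]:
    """Expose a blocking generator as an async iterator via a queue."""
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()

    def pump() -> None:
        # Runs in a worker thread. asyncio.Queue is not thread-safe, so
        # every put is handed to the event loop via call_soon_threadsafe.
        try:
            for item in make_gen():
                loop.call_soon_threadsafe(queue.put_nowait, item)
        except Exception as exc:
            # Forward the failure so the consumer can re-raise it.
            loop.call_soon_threadsafe(queue.put_nowait, exc)
        finally:
            loop.call_soon_threadsafe(queue.put_nowait, _DONE)

    worker = asyncio.create_task(asyncio.to_thread(pump))
    try:
        while True:
            item = await queue.get()
            if item is _DONE:
                break
            if isinstance(item, Exception):
                raise item
            yield item
    finally:
        # Waits for the thread to finish. This sketch has no stop flag,
        # so an early consumer exit still lets generation run to the end.
        await worker
```

Because the blocking generator never runs on the event loop, the loop
remains free between items; a production version would also need a way
to signal the worker thread to stop when the client disconnects.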
inference_engine/scheduler/__init__.py -1 line export
Dropped 'PooledVerifier' from __all__.
inference_engine/scheduler/pooled_verifier.py DELETED, -150 lines
scripts/serve.py ~rewrite, +12 / -50 net
- _build_engine → _build_verifier (returns SinkWindowVerifier
or MLXSinkWindowVerifier).
- main() builds the verifier and passes to create_app(verifier,
config). Mirrors PR-E1b's start_grpc_runtime_server.py.
- --block-size and --num-diffusion-steps flags retained for CLI
compat but documented as ignored.
- Banner now says 'DEPRECATED HTTP shim' and points at the
gRPC entrypoint.
Tests
-----
tests/inference_engine/scheduler/test_pooled_verifier.py DELETED, -250 lines
PR-N1 had marked this file exempt from no-doubles cleanup
precisely because PR-D2 was going to retire the module. PR-D2
delivers; the file goes with it.
tests/inference_engine/server/test_grpc_app.py +120 lines, 3 new tests
Restores coverage of grpc_app.py paths that lost it when PR-N3
retired the test_app_* files (which previously hit them via the
FakeVerifier-backed SchedulerEngine path):
test_append_tokens_session_not_found_returns_not_found
Coordinator override raises SessionNotFoundError. Covers
grpc_app.py:208 (NOT_FOUND abort branch).
test_append_tokens_success_returns_response
Coordinator override returns a synthetic history_length;
asserts the response carries it. Covers grpc_app.py:213
(return AppendTokensResponse on success).
test_generate_yields_history_truncated_then_done
Generator override yields HistoryTruncated + Token + Done
events; asserts the wire frames in order. Covers
grpc_app.py:295-310 (HistoryTruncatedEvent yield + DoneEvent
yield).
tests/integration/test_http_shim_real.py ~30 line update
Fixture wiring: real_speculative_engine →
real_speculative_engine._decoder.verifier (since create_app's
signature changed). Tests reading
real_app.state.engine.model_id_label → real_app.state.model_id_label.
CI workflow
-----------
.github/workflows/ci.yaml: dropped pooled_verifier.py from the
--include= filter (it no longer exists).
Linux verification
------------------
PYTHONPATH=.:sdks/python coverage run -m pytest <Linux gate paths>:
476 passed (was 473 on main; +3 net = added 3 grpc_app
success-path tests).
100% coverage on 915 stmts (was 987 on main; -72 net = the
deleted PooledVerifier module).
Mac M4 evidence (REQUIRED for merge)
------------------------------------
This is the single most invasive PR in the v0.3 sequence: it
rewrites the deprecated HTTP shim's entire request path. The
integration suite's test_http_shim_real.py is the binding gate.
Reviewer runs:
bash scripts/review_pr_d2_on_mac.sh
git add results/platform-tests/pr-d2-mac-*
git commit -m 'Mac M4 review evidence for PR-D2'
git push
Acceptance: all integration tests pass against real Qwen3-0.6B,
including the now-rewired test_http_shim_real.py which exercises
chat-completions (streaming + non-streaming), auth, /healthz,
/metrics, /v1/models against the new SessionStore-driven path.
Stack
-----
PR-D2 is branched off post-N1..N4 main. It is independent of PR-E2
(#57), which adds CI workflow YAML; the two can merge in either order.
Next PR
-------
v0.4 brings the proposer back into the session-bound path:
PR-V0.4-A wires SparseLogitsProposer into a new
SpeculativeAppendTokensCoordinator (or extends the existing one)
so speculative decoding is restored on both gRPC and HTTP paths.
The ADR 0001/0004 alignment training feeds into that work.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
1 parent e8e8415 commit 724be1b
9 files changed
Lines changed: 779 additions & 878 deletions
File tree
- .github/workflows
- inference_engine
  - scheduler
  - server
- scripts
- tests
  - inference_engine
    - scheduler
    - server
  - integration