Commit 724be1b
PR-D2 (ADR 0008 Phase D): refactor HTTP shim onto SessionStore
Retires the Scheduler + PooledVerifier + SpeculativeEngine machinery from the HTTP shim's request path. Each /v1/chat/completions request is now a single-shot session under SessionStore:

    CreateSession → AppendTokens(prompt) → Generate → CloseSession

Same semantics as the gRPC RuntimeService surface; ADR 0008 §2.7 deprecation.

Three architectural changes
---------------------------

1. Speculative decoding is no longer applied on the HTTP path. The session-bound runtime is pure AR against the verifier; the proposer is wired into the v0.4 alignment work (ADR 0004). Pre-PR-D2 the HTTP shim used SpeculativeEngine (proposer + verifier together); post-PR-D2 it's roughly the same speed as transformers-vanilla AR. Migrate to gRPC for v0.3's full perf story.

2. Admission control is now an asyncio.Semaphore instead of a full Scheduler. REJECT vs QUEUE policy with queue_max_wait_s is preserved (queue_max_wait_s=0 means wait forever); the in-flight slab-pool bookkeeping moved into SessionStore. The Scheduler module + integration tests stay (used by other callers), but the HTTP shim no longer wires it.

3. ADR 0008 §2.7 deprecation headers are stamped onto every response by a new _DeprecationHeadersMiddleware:

     Deprecation: true
     Sunset: Wed, 31 Dec 2025 00:00:00 GMT
     Link: </docs/adr/0008-...>; rel="successor-version"

Production-side changes
-----------------------

inference_engine/server/app.py    ~rewrite, +330 / -300 net
  - create_app's signature changed: now takes (verifier, config, *, slab_pool=None, model_id_label=None) instead of (engine, config, pool=None). Caller passes the underlying SinkWindowVerifier directly.
  - Internal: builds SessionStore + AppendTokensCoordinator + GenerationCoordinator. asyncio.Semaphore for admission.
  - Route handler: tokenize → CreateSession → append → generate (sync gen run in asyncio.to_thread for disconnect-poll responsiveness) → CloseSession on success/error.
  - SSE streaming: same pattern; queue-bridged from the sync generator coordinator. HistoryTruncatedEvent is consumed silently (no OpenAI analog).
  - app.state.engine → app.state.{verifier, store, append_coord, gen_coord, model_id_label, admission_sem}.

inference_engine/scheduler/__init__.py    -1 line export
  Dropped 'PooledVerifier' from __all__.

inference_engine/scheduler/pooled_verifier.py    DELETED, -150 lines

scripts/serve.py    ~rewrite, +12 / -50 net
  - _build_engine → _build_verifier (returns SinkWindowVerifier or MLXSinkWindowVerifier).
  - main() builds the verifier and passes it to create_app(verifier, config). Mirrors PR-E1b's start_grpc_runtime_server.py.
  - --block-size and --num-diffusion-steps flags retained for CLI compat but documented as ignored.
  - Banner now says 'DEPRECATED HTTP shim' and points at the gRPC entrypoint.

Tests
-----

tests/inference_engine/scheduler/test_pooled_verifier.py    DELETED, -250 lines
  PR-N1 had marked this file exempt from no-doubles cleanup precisely because PR-D2 was going to retire the module. PR-D2 delivers; the file goes with it.

tests/inference_engine/server/test_grpc_app.py    +120 lines, 3 new tests
  Restores coverage of grpc_app.py's success paths, which were left uncovered when PR-N3 retired the test_app_* files (these previously hit them via the FakeVerifier-backed SchedulerEngine path):

  test_append_tokens_session_not_found_returns_not_found
    Coordinator override raises SessionNotFoundError. Covers grpc_app.py:208 (NOT_FOUND abort branch).

  test_append_tokens_success_returns_response
    Coordinator override returns a synthetic history_length; asserts the response carries it. Covers grpc_app.py:213 (return AppendTokensResponse on success).

  test_generate_yields_history_truncated_then_done
    Generator override yields HistoryTruncated + Token + Done events; asserts the wire frames in order. Covers grpc_app.py:295-310 (HistoryTruncatedEvent yield + DoneEvent yield).

tests/integration/test_http_shim_real.py    ~30 line update
  Fixture wiring: real_speculative_engine → real_speculative_engine._decoder.verifier (since create_app's signature changed). Tests reading real_app.state.engine.model_id_label → real_app.state.model_id_label.

CI workflow
-----------

.github/workflows/ci.yaml: dropped pooled_verifier.py from the --include= filter (it no longer exists).

Linux verification
------------------

PYTHONPATH=.:sdks/python coverage run -m pytest <Linux gate paths>: 476 passed (was 473 on main; +3 net = added 3 grpc_app success-path tests). 100% coverage on 915 stmts (was 987 on main; -72 net = the deleted PooledVerifier module).

Mac M4 evidence (REQUIRED for merge)
------------------------------------

This is the single most invasive PR in the v0.3 sequence: it rewrites the deprecated HTTP shim's entire request path. The integration suite's test_http_shim_real.py is the binding gate. Reviewer runs:

    bash scripts/review_pr_d2_on_mac.sh
    git add results/platform-tests/pr-d2-mac-*
    git commit -m 'Mac M4 review evidence for PR-D2'
    git push

Acceptance: all integration tests pass against real Qwen3-0.6B, including the now-rewired test_http_shim_real.py, which exercises chat-completions (streaming + non-streaming), auth, /healthz, /metrics, /v1/models against the new SessionStore-driven path.

Stack
-----

PR-D2 is branched off post-N1..N4 main. Independent of PR-E2 (#57), which adds CI workflow YAML; the two can merge in either order.

Next PR
-------

v0.4 brings the proposer back into the session-bound path: PR-V0.4-A wires SparseLogitsProposer into a new SpeculativeAppendTokensCoordinator (or extends the existing one) so speculative decoding is restored on both gRPC and HTTP paths. The ADR 0001/0004 alignment training feeds into that work.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
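The semaphore-based admission control described under architectural change 2 can be sketched roughly as below. This is an illustration only, not the code in app.py: the AdmissionGate class and its constructor arguments are hypothetical, AdmissionPolicy and RequestRejected are local stand-ins for the names exported by inference_engine.scheduler, and the sketch assumes that a QUEUE request whose wait exceeds queue_max_wait_s is rejected.

```python
import asyncio
import enum


class AdmissionPolicy(enum.Enum):
    """Local stand-in for the scheduler package's enum (values assumed)."""
    REJECT = "reject"
    QUEUE = "queue"


class RequestRejected(Exception):
    """Local stand-in for the scheduler package's exception."""


class AdmissionGate:
    """Hypothetical semaphore-backed admission gate."""

    def __init__(self, max_in_flight, policy, queue_max_wait_s=0.0):
        # Build the gate inside a running event loop so the semaphore
        # binds to that loop on older Python versions as well.
        self._sem = asyncio.Semaphore(max_in_flight)
        self._policy = policy
        self._queue_max_wait_s = queue_max_wait_s

    async def acquire(self):
        if self._policy is AdmissionPolicy.REJECT:
            # REJECT: fail fast when every slot is taken.
            if self._sem.locked():
                raise RequestRejected("no free slot")
            await self._sem.acquire()  # returns at once: a slot is free
            return
        if self._queue_max_wait_s == 0:
            # QUEUE with queue_max_wait_s=0: wait forever.
            await self._sem.acquire()
            return
        try:
            # QUEUE with a bound: wait up to queue_max_wait_s seconds.
            await asyncio.wait_for(self._sem.acquire(), self._queue_max_wait_s)
        except asyncio.TimeoutError:
            # Assumption: an expired wait is surfaced as a rejection.
            raise RequestRejected("queue wait exceeded") from None

    def release(self):
        self._sem.release()


async def demo():
    """Fill the single slot, then see how each policy treats the next caller."""
    reject_gate = AdmissionGate(1, AdmissionPolicy.REJECT)
    await reject_gate.acquire()
    try:
        await reject_gate.acquire()
        rejected_immediately = False
    except RequestRejected:
        rejected_immediately = True

    queue_gate = AdmissionGate(1, AdmissionPolicy.QUEUE, queue_max_wait_s=0.05)
    await queue_gate.acquire()
    try:
        await queue_gate.acquire()
        timed_out = False
    except RequestRejected:
        timed_out = True
    return rejected_immediately, timed_out
```

In a route handler, acquire() would run before CreateSession and release() would sit in a finally block after CloseSession, so a failed request still frees its slot.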
1 parent e8e8415 commit 724be1b
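The deprecation headers listed under architectural change 3 of the commit message could be attached by a small wrapper around the application. The sketch below assumes the shim is an ASGI app and uses a plain ASGI middleware; the class name, hello_app, and collect_response are hypothetical helpers for illustration, and the commit's own _DeprecationHeadersMiddleware may be built differently.

```python
import asyncio

# Header values copied from the commit message; the Link path is elided there.
DEPRECATION_HEADERS = [
    (b"deprecation", b"true"),
    (b"sunset", b"Wed, 31 Dec 2025 00:00:00 GMT"),
    (b"link", b'</docs/adr/0008-...>; rel="successor-version"'),
]


class DeprecationHeadersMiddleware:
    """Hypothetical pure-ASGI middleware that stamps every HTTP response."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            # Lifespan and websocket traffic passes through untouched.
            await self.app(scope, receive, send)
            return

        async def send_with_headers(message):
            if message["type"] == "http.response.start":
                # Copy the message so the inner app's dict is not mutated.
                headers = list(message.get("headers", []))
                headers.extend(DEPRECATION_HEADERS)
                message = {**message, "headers": headers}
            await send(message)

        await self.app(scope, receive, send_with_headers)


async def hello_app(scope, receive, send):
    """Tiny ASGI app standing in for the real shim."""
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [(b"content-type", b"text/plain")],
    })
    await send({"type": "http.response.body", "body": b"ok"})


async def collect_response(app):
    """Drive one GET request through the app and capture what it sends."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    scope = {"type": "http", "method": "GET", "path": "/healthz", "headers": []}
    await app(scope, receive, send)
    return sent


def demo():
    return asyncio.run(collect_response(DeprecationHeadersMiddleware(hello_app)))
```

Wrapping at the ASGI layer means streamed (SSE) responses get the headers too, because they are added on the response-start message before any body chunk is sent.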

9 files changed

Lines changed: 779 additions & 878 deletions

.github/workflows/ci.yaml

Lines changed: 2 additions & 2 deletions
@@ -100,10 +100,10 @@ jobs:
   --junitxml=junit.xml \
   -v
   coverage report \
-  --include='inference_engine/server/auth.py,inference_engine/server/config.py,inference_engine/server/errors.py,inference_engine/server/grpc_app.py,inference_engine/server/metrics.py,inference_engine/server/schemas.py,inference_engine/server/proto_gen/**/*.py,inference_engine/memory/*,inference_engine/scheduler/config.py,inference_engine/scheduler/session.py,inference_engine/scheduler/pooled_verifier.py,inference_engine/pipeline/*,inference_engine/session/store.py,sdks/python/kakeya/__init__.py,sdks/python/kakeya/errors.py,training/repr_align/*' \
+  --include='inference_engine/server/auth.py,inference_engine/server/config.py,inference_engine/server/errors.py,inference_engine/server/grpc_app.py,inference_engine/server/metrics.py,inference_engine/server/schemas.py,inference_engine/server/proto_gen/**/*.py,inference_engine/memory/*,inference_engine/scheduler/config.py,inference_engine/scheduler/session.py,inference_engine/pipeline/*,inference_engine/session/store.py,sdks/python/kakeya/__init__.py,sdks/python/kakeya/errors.py,training/repr_align/*' \
   --fail-under=100
   coverage xml -o coverage.xml \
-  --include='inference_engine/server/auth.py,inference_engine/server/config.py,inference_engine/server/errors.py,inference_engine/server/grpc_app.py,inference_engine/server/metrics.py,inference_engine/server/schemas.py,inference_engine/server/proto_gen/**/*.py,inference_engine/memory/*,inference_engine/scheduler/config.py,inference_engine/scheduler/session.py,inference_engine/scheduler/pooled_verifier.py,inference_engine/pipeline/*,inference_engine/session/store.py,sdks/python/kakeya/__init__.py,sdks/python/kakeya/errors.py,training/repr_align/*'
+  --include='inference_engine/server/auth.py,inference_engine/server/config.py,inference_engine/server/errors.py,inference_engine/server/grpc_app.py,inference_engine/server/metrics.py,inference_engine/server/schemas.py,inference_engine/server/proto_gen/**/*.py,inference_engine/memory/*,inference_engine/scheduler/config.py,inference_engine/scheduler/session.py,inference_engine/pipeline/*,inference_engine/session/store.py,sdks/python/kakeya/__init__.py,sdks/python/kakeya/errors.py,training/repr_align/*'

   - name: Upload coverage artifact
   if: always()

inference_engine/scheduler/__init__.py

Lines changed: 4 additions & 2 deletions
@@ -26,13 +26,15 @@
   """

   from .config import AdmissionPolicy, SchedulerConfig
-  from .pooled_verifier import PooledVerifier
+  # PooledVerifier was retired by PR-D2; the HTTP shim now drives
+  # SessionStore + AppendTokensCoordinator directly. Imports kept
+  # stable by removing the export entirely (no soft-deprecation
+  # layer — the symbol is gone from the package).
   from .scheduler import RequestRejected, Scheduler
   from .session import Session, SessionState

   __all__ = [
       "AdmissionPolicy",
-      "PooledVerifier",
       "RequestRejected",
       "Scheduler",
       "SchedulerConfig",

inference_engine/scheduler/pooled_verifier.py

Lines changed: 0 additions & 175 deletions
This file was deleted.
