PR-D1 (ADR 0008 Phase D): remove ADR 0007 server-side dead code

cursoragent · FluffyAIcode · cursoragent · commit ac533d3f619c · 2026-06-01T15:46:21.000Z
ADR 0008 §6.4 PR-D1 originally proposed a coupled change: (a)
remove ADR-0007-vintage server dead code, (b) refactor the HTTP
shim's chat-completions handler onto SessionStore. Implementation
revealed that (a) is a pure subtraction with no behavioral dependence
on (b), so the two are split (same pattern as PR-A3 / PR-A3b). This
PR is (a); (b) becomes PR-D2 (queued, not in this diff).

Net diff: 540 deletions, 78 insertions. ADR 0008 §6.4 is amended
to record the split.

Files modified (production):

  inference_engine/server/engine.py
    -12 lines: EngineResult.path_selection / .tokens_skipped /
    .prefill_duration_seconds fields removed; SpeculativeEngine
    no longer forwards them from SpeculativeRunResult.

  inference_engine/server/metrics.py
    -74 lines: path_selection_total, continuation_tokens_skipped_total,
    verifier_prefill_duration_seconds, cache_invariant_violations_total
    metrics removed from Metrics + factory; record_path_selection and
    record_cache_invariant_violation methods removed.

  inference_engine/server/app.py
    -53 lines: _session_acceptance_rate and _emit_path_selection_metric
    helpers removed; the two call sites in the streaming and
    non-streaming completion paths now pass acceptance_rate=None to
    record_completion. The OpenAI response loses its acceptance_rate
    field as a result, which is acceptable on a feature-frozen
    deprecated shim per ADR 0008 §2.7. Clients that need richer
    telemetry should migrate to gRPC.

  inference_engine/scheduler/session.py
    -7 lines: engine_result field on Session removed. The scheduler
    worker (scheduler.py) no longer stashes engine.generate()'s
    result on the session; the only readers were the helpers
    removed from app.py.

  inference_engine/scheduler/scheduler.py
    ±3 lines (assignment replaced by del): the line that wrote
    session.engine_result = result is gone; the result is now
    dropped with del result.

  scripts/bench_agentic/bench_long_session.py
    -211/+30 lines: removed _PATH_SELECTION_METRIC /
    _CONTINUATION_TOKENS_SKIPPED_METRIC / _CACHE_INVARIANT_VIOLATIONS_METRIC
    constants, the labeled-line regex, the labeled-metric branch in
    _parse_prom_text, the _extract_label helper (no callers after
    PR-D1), the _adr_0007_summary aggregator, the adr_0007 payload
    field, and the §2.10 block in render_summary.
    Module docstring updated to point at PR-E1's
    bench_session_long_run.py for the replacement bench.
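
Because the shim's response no longer carries acceptance_rate, a
client that used to read it needs to tolerate its absence. A minimal
sketch, assuming the field previously sat at the top level of the JSON
body (this commit does not say where it lived); the helper name and
the sample bodies are hypothetical:

```python
from typing import Any, Mapping, Optional


def read_acceptance_rate(response_body: Mapping[str, Any]) -> Optional[float]:
    """Return the acceptance rate if the server sent one, else None.

    Hypothetical client-side helper; the top-level field location is an
    assumption, not something this PR documents.
    """
    rate = response_body.get("acceptance_rate")
    if rate is None:
        return None
    return float(rate)


# Illustrative bodies: field present (before PR-D1) and absent (after).
old_body = {"id": "chatcmpl-1", "acceptance_rate": 0.75}
new_body = {"id": "chatcmpl-1"}
```

Reading the field with .get() rather than indexing keeps the same
client code working against both shapes.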

Files modified (tests):

  tests/inference_engine/server/test_metrics.py
    -107 lines: 4 entries dropped from test_build_registers_all_documented_metrics
    expected set; 9 tests removed from the 'ADR 0007 §2.10 —
    path_selection observability' section.

  tests/inference_engine/server/test_app_metrics_and_auth.py
    -98 lines: 4 ADR-0007-specific tests removed
    (test_metrics_path_selection_metrics_present_on_idle_metrics_scrape,
    test_metrics_path_selection_recorded_after_completion,
    test_session_acceptance_rate_returns_none_when_result_missing_rate,
    test_emit_path_selection_metric_noop_when_path_unset).

Local verification (Linux VM, py3.12):
  PYTHONPATH=.:sdks/python pytest <Linux CI gate set>
  682 passed (was 695 - 13 ADR-0007-specific tests),
  TOTAL  1660 stmts  100.00 % coverage (was 1694 - 34 dead stmts).
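
To confirm the cleanup on a running shim, a /metrics scrape can be
checked for the four removed metric families. A minimal sketch,
assuming the standard Prometheus text exposition format; the helper
name and the sample scrape text are illustrative and not part of the
repo:

```python
from typing import Set

# The four families this PR removes from Metrics.
REMOVED_FAMILIES = (
    "path_selection_total",
    "continuation_tokens_skipped_total",
    "verifier_prefill_duration_seconds",
    "cache_invariant_violations_total",
)


def removed_metrics_present(prom_text: str) -> Set[str]:
    """Return the removed families that still appear as samples."""
    found: Set[str] = set()
    for raw_line in prom_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP / TYPE comments
        # The sample name ends at the first '{' (labels) or space (value).
        sample_name = line.split("{", 1)[0].split(" ", 1)[0]
        for family in REMOVED_FAMILIES:
            # Histograms expose <family>_bucket / _sum / _count samples.
            if sample_name == family or sample_name.startswith(family + "_"):
                found.add(family)
    return found


# Illustrative scrape text, not captured from a real server.
sample_scrape = """\
# TYPE scheduler_pending gauge
scheduler_pending 0.0
path_selection_total{path="continuation"} 3.0
verifier_prefill_duration_seconds_bucket{le="0.1",path="new_session"} 1.0
"""
```

An empty set from removed_metrics_present() on a post-merge scrape is
the expected outcome.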

Per ADR 0008 §9: this PR is pure deletion / cleanup on the
Linux-runnable surface. Zero MLX runtime code touched. The §9
carve-out applies; no Mac M4 report needed.

Next PR after merge:
  PR-D2 (§6.4 amended, queued): HTTP-shim refactor onto SessionStore
        proper. Each /v1/chat/completions request becomes a
        single-shot session; PooledVerifier retires; Deprecation /
        Sunset headers added per §2.7.
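
The single-shot shape PR-D2 describes (prefill, generate, close) can
be sketched as a context manager that guarantees the close step. This
is only an illustration under stated assumptions: StubSession,
single_shot_session, and handle_request are hypothetical names, and
the real SessionStore API is not shown in this PR.

```python
from contextlib import contextmanager
from typing import Iterator, List, Tuple


class StubSession:
    """Hypothetical stand-in for a SessionStore-backed session."""

    def __init__(self) -> None:
        self.events: List[str] = []
        self.closed = False

    def prefill(self, prompt_ids: List[int]) -> None:
        self.events.append(f"prefill:{len(prompt_ids)}")

    def generate(self, max_tokens: int) -> List[int]:
        self.events.append(f"generate:{max_tokens}")
        return list(range(max_tokens))  # placeholder token ids

    def close(self) -> None:
        self.closed = True
        self.events.append("close")


@contextmanager
def single_shot_session() -> Iterator[StubSession]:
    """Open one session per request and always close it afterwards."""
    session = StubSession()
    try:
        yield session
    finally:
        session.close()


def handle_request(
    prompt_ids: List[int], max_tokens: int
) -> Tuple[List[int], StubSession]:
    """One request maps to one session: prefill, generate, close."""
    with single_shot_session() as session:
        session.prefill(prompt_ids)
        tokens = session.generate(max_tokens)
    return tokens, session
```

Putting close() in a finally block means a failed or cancelled
generation still releases the session.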

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
diff --git a/docs/adr/0008-session-bound-runtime-and-grpc-protocol.md b/docs/adr/0008-session-bound-runtime-and-grpc-protocol.md
@@ -775,12 +775,44 @@ parallelize.
 
 ### 6.4 Phase D — Deprecated HTTP+SSE shim
 
-- **PR-D1**: Update `inference_engine/server/app.py` so each
-  `/v1/chat/completions` request creates a single-shot session under
-  the new `SessionStore`, prefills, generates, and closes. Removes
-  any path-selection / cross-request logic (none of which exists on
-  `main` after C3). Adds `Deprecation` / `Sunset` headers. Updates
-  the existing 461-test integration suite to match.
+*(scope split, recorded 2026-06-01 during implementation of PR-D1.)*
+
+The original PR-D1 entry conflated two coupled changes:
+
+  (a) Remove the ADR 0007 dead code from the server-side surface
+      (path_selection metrics, `_emit_path_selection_metric` helper,
+      `engine_result` field on the scheduler session, etc.).
+  (b) Refactor the HTTP shim's chat-completions handler onto the new
+      `SessionStore` so each request becomes a single-shot session
+      (prefill → generate → close) instead of being driven by the
+      legacy `PooledVerifier`.
+
+(a) is a pure subtraction: the dead code was reachable only from the
+ADR 0007 path_select stack that PR-A3 already removed from the
+verifier side; the server-side metrics and helpers it left behind
+are unreachable at runtime in any healthy completion. (b) is a
+larger refactor of feature-frozen code (per §2.7), with a
+corresponding test-update tail.
+
+The two are split, same pattern as PR-A3 / PR-A3b:
+
+- **PR-D1** (this PR, dead-code removal): cleans up §6.6 rows for
+  `app.py` / `engine.py` / `metrics.py` / `scheduler/session.py` /
+  `bench_long_session.py`. The HTTP shim continues to use
+  `PooledVerifier` exactly as before; nothing user-observable
+  changes except the disappearance of the four ADR 0007 metrics
+  from `/metrics` and the `acceptance_rate` field from the OpenAI
+  response (the latter was sourced from `engine_result`, which is
+  gone). 100% Linux unit coverage.
+
+- **PR-D2** (queued, not in PR-D1's diff): the HTTP-shim refactor
+  proper. Each `/v1/chat/completions` request creates a single-shot
+  session under `SessionStore`, prefills, generates, and closes;
+  `PooledVerifier` is retired. Adds `Deprecation` / `Sunset`
+  headers per §2.7. Updates the existing integration suite to
+  match. Linux-only path; §9 carve-out continues to apply. PR-D2
+  is non-blocking for v0.3 GA — the deprecated shim works on
+  `main` post-PR-D1 in its v0.3.0-rc1 shape, just lighter.
 
 ### 6.5 Phase E — Mac M4 integration test marker + CI workflow
 
diff --git a/inference_engine/scheduler/scheduler.py b/inference_engine/scheduler/scheduler.py
@@ -358,12 +358,12 @@ def on_token(tok_id: int) -> bool:
                     session.eos_token_ids, on_token,
                 )
 
-            # Out of engine lock — finalize state.
-            # Stash the engine result on the session so route handlers
-            # can read path-selection observability fields (ADR 0007
-            # §2.10) and acceptance rate. tokens were already streamed
-            # via on_token.
-            session.engine_result = result
+            # Out of engine lock — finalize state. Tokens were already
+            # streamed via on_token; the engine result is otherwise
+            # discarded (PR-D1 of ADR 0008 removed the engine_result
+            # stash that ADR 0007 §2.10 used for path-selection
+            # observability).
+            del result
             if session.state == SessionState.CANCELLED:
                 # Already counted by cancel_session caller; we just
                 # observe the terminal state here.
diff --git a/inference_engine/scheduler/session.py b/inference_engine/scheduler/session.py
@@ -64,13 +64,6 @@ class Session:
     # the scheduler.iter_tokens() async iterator drain this; the
     # scheduler's worker pushes into it.
     token_queue: asyncio.Queue = field(default_factory=lambda: asyncio.Queue())
-    # The engine's full result, set by the scheduler worker after
-    # ``engine.generate()`` returns. Route handlers read this to
-    # populate ADR 0007 §2.10 path-selection observability metrics
-    # (path_selection, tokens_skipped, prefill_duration_seconds) and
-    # acceptance-rate stats. ``None`` until the engine returns —
-    # callers must check before reading.
-    engine_result: Optional[object] = None
 
     def __post_init__(self) -> None:
         if not self.prompt_ids:
diff --git a/inference_engine/server/app.py b/inference_engine/server/app.py
@@ -442,9 +442,8 @@ async def chat_completions(req: ChatCompletionRequest, request: Request):
         metrics.record_completion(
             finish_reason=finish_reason,
             n_tokens=len(output_token_ids),
-            acceptance_rate=_session_acceptance_rate(scheduler, session),
+            acceptance_rate=None,
         )
-        _emit_path_selection_metric(metrics, session)
 
         return JSONResponse(
             content=ChatCompletionResponse(
@@ -495,53 +494,6 @@ def _encode_prompt(engine: Engine, req: ChatCompletionRequest) -> List[int]:
     return prompt_ids
 
 
-def _session_acceptance_rate(
-    scheduler: Scheduler, session: Session,
-) -> Optional[float]:
-    """Per-session acceptance rate from the stashed EngineResult.
-
-    The scheduler worker stores ``engine.generate()``'s result on
-    ``session.engine_result`` after generation completes (PR 7-4).
-    Returns ``None`` if the result is unavailable (session was
-    cancelled / failed before the engine returned, or the engine
-    is a test double that doesn't expose the field).
-    """
-    _ = scheduler  # kept for signature stability with existing callers
-    result = getattr(session, "engine_result", None)
-    if result is None:
-        return None
-    rate = getattr(result, "acceptance_rate", None)
-    if rate is None:
-        return None
-    return float(rate)
-
-
-def _emit_path_selection_metric(
-    metrics: "Metrics", session: Session,
-) -> None:
-    """Emit ADR 0007 §2.10 path-selection observability for one
-    completed session, if the engine reported the relevant fields.
-
-    Called from both the streaming and non-streaming completion
-    paths after the session reaches a terminal state. No-op when
-    the engine result is unavailable (e.g., test doubles that
-    don't populate path_selection).
-    """
-    result = getattr(session, "engine_result", None)
-    if result is None:
-        return
-    path = getattr(result, "path_selection", None)
-    if path not in ("continuation", "new_session"):
-        return
-    metrics.record_path_selection(
-        path=path,
-        tokens_skipped=int(getattr(result, "tokens_skipped", 0)),
-        prefill_duration_s=float(
-            getattr(result, "prefill_duration_seconds", 0.0)
-        ),
-    )
-
-
 async def _collect_non_streaming_tokens(
     *,
     scheduler: Scheduler,
@@ -662,7 +614,6 @@ def envelope(content_delta, role_delta, finish_reason) -> dict:
     metrics.record_completion(
         finish_reason=finish_reason,
         n_tokens=len(session.output_token_ids),
-        acceptance_rate=_session_acceptance_rate(scheduler, session),
+        acceptance_rate=None,
     )
-    _emit_path_selection_metric(metrics, session)
     yield {"data": "[DONE]"}
diff --git a/inference_engine/server/engine.py b/inference_engine/server/engine.py
@@ -44,13 +44,6 @@ class EngineResult:
     proposer_forward_calls: int
     verifier_forward_calls: int
     stopped_on_eos: bool
-    # ADR 0007 §2.10 observability — populated by the speculative
-    # engine; test doubles default to ``new_session`` / 0 so the
-    # route layer's metric emission code path is exercisable
-    # against either backend.
-    path_selection: str = "new_session"  # "continuation" | "new_session"
-    tokens_skipped: int = 0
-    prefill_duration_seconds: float = 0.0
 
 
 @runtime_checkable
@@ -191,11 +184,6 @@ def generate(
             proposer_forward_calls=int(result.proposer_forward_calls),
             verifier_forward_calls=int(result.verifier_forward_calls),
             stopped_on_eos=stopped_on_eos,
-            path_selection=str(getattr(result, "path_selection", "new_session")),
-            tokens_skipped=int(getattr(result, "tokens_skipped", 0)),
-            prefill_duration_seconds=float(
-                getattr(result, "prefill_duration_seconds", 0.0)
-            ),
         )
 
     def kv_state(self) -> int:
diff --git a/inference_engine/server/metrics.py b/inference_engine/server/metrics.py
@@ -107,13 +107,6 @@ class Metrics:
     scheduler_pending: Gauge
     scheduler_kv_live_bytes: Gauge
     scheduler_admission_total: Counter
-    # ADR 0007 §2.10 — cross-request KV reuse observability.
-    # Both ``path`` labels are first-class outcomes; neither is an
-    # "error" or "fallback" (per ADR 0007 §2.4.c).
-    path_selection_total: Counter
-    continuation_tokens_skipped_total: Counter
-    verifier_prefill_duration_seconds: Histogram
-    cache_invariant_violations_total: Counter
 
     @classmethod
     def build(cls) -> "Metrics":
@@ -194,47 +187,6 @@ def build(cls) -> "Metrics":
                 labelnames=["result"],
                 registry=registry,
             ),
-            path_selection_total=Counter(
-                "path_selection_total",
-                "Total path-selection decisions made by the verifier "
-                "for cross-request KV cache reuse (ADR 0007 §2.4). "
-                "Both 'continuation' and 'new_session' are first-class "
-                "first-class outcomes; neither is an 'error' or "
-                "'fallback' (§2.4.c). Healthy long-session agent "
-                "workloads see continuation rate >= 95%.",
-                labelnames=["path"],
-                registry=registry,
-            ),
-            continuation_tokens_skipped_total=Counter(
-                "continuation_tokens_skipped_total",
-                "Cumulative prompt tokens that the continuation path "
-                "did not need to re-prefill (ADR 0007 §2.10). Sums "
-                "ContinuationPlan.skip_n across every continuation-"
-                "path request the server has handled. The win.",
-                registry=registry,
-            ),
-            verifier_prefill_duration_seconds=Histogram(
-                "verifier_prefill_duration_seconds",
-                "Wall time of the prefill phase of a single request, "
-                "partitioned by path. Continuation-path histogram "
-                "centers around per-incremental-token cost; "
-                "new-session-path histogram tracks full-prefill cost "
-                "(O(history_length)).",
-                labelnames=["path"],
-                buckets=(
-                    0.001, 0.005, 0.01, 0.05, 0.1, 0.5,
-                    1.0, 5.0, 10.0, 30.0, 60.0, 120.0, 300.0,
-                ),
-                registry=registry,
-            ),
-            cache_invariant_violations_total=Counter(
-                "cache_invariant_violations_total",
-                "Count of ADR 0007 §2.9 INV-1 / INV-2 detections at "
-                "runtime. Should always read 0; any non-zero value is "
-                "a critical operational alert (page on it).",
-                labelnames=["kind"],
-                registry=registry,
-            ),
         )
 
     # ------------------------------------------------------------------
@@ -255,32 +207,6 @@ def record_admission(self, *, admitted: bool) -> None:
             result="admitted" if admitted else "rejected"
         ).inc()
 
-    def record_path_selection(self, *, path: str, tokens_skipped: int,
-                              prefill_duration_s: float) -> None:
-        """Record one path-selection decision (ADR 0007 §2.10).
-
-        ``path`` must be ``"continuation"`` or ``"new_session"``. The
-        method does not validate the label set explicitly because
-        prometheus-client's ``labels()`` already raises for unknown
-        labels; we want such a violation to surface loudly per the
-        no-silent-failure principle.
-        """
-        self.path_selection_total.labels(path=path).inc()
-        if tokens_skipped > 0:
-            self.continuation_tokens_skipped_total.inc(tokens_skipped)
-        self.verifier_prefill_duration_seconds.labels(path=path).observe(
-            float(prefill_duration_s)
-        )
-
-    def record_cache_invariant_violation(self, *, kind: str) -> None:
-        """Record an INV-1 or INV-2 detection (ADR 0007 §2.9).
-
-        ``kind`` must be ``"inv1"`` or ``"inv2"``. Should never be
-        called in healthy operation; any increment of this counter
-        is a critical alert.
-        """
-        self.cache_invariant_violations_total.labels(kind=kind).inc()
-
     def record_completion(self, *, finish_reason: str, n_tokens: int,
                           acceptance_rate: Optional[float]) -> None:
         self.inference_completions_total.labels(
diff --git a/scripts/bench_agentic/bench_long_session.py b/scripts/bench_agentic/bench_long_session.py
diff --git a/tests/inference_engine/server/test_app_metrics_and_auth.py b/tests/inference_engine/server/test_app_metrics_and_auth.py
diff --git a/tests/inference_engine/server/test_metrics.py b/tests/inference_engine/server/test_metrics.py