
Commit c4ce331

ADR 0007 revision: remove fallback semantics per project no-fallback principle
User feedback (2026-05-31): the previous draft's 'mismatch handling / automatic fallback to reset' framing in §2.4 / §2.8 violates the project's engineering principle 'no mock, no fallback, no overfit'. Mismatch between expected and actual cache state should never be a silent graceful-degradation path; it must be either (a) a different but equally first-class state transition, or (b) a critical bug raised to the operator. The previous draft conflated two separate things into one fallback-shaped pattern. This revision separates them.

Changes
-------

§2.4 — restructured from 'reset criteria' to 'path selection'
  Two named first-class paths:
    2.4.a Continuation path (incremental prefill on the new tail)
    2.4.b New-session path (cold-start full prefill)
    2.4.c Path semantics — both produce bit-identical output; selecting new-session is NOT degradation, it is the correct action when the input does not satisfy the continuation precondition.
  The path-selection function is total over the input space, so 'mismatch' as a runtime concept does not exist.

§2.5 — 'eviction' wording removed
  Cache state lifecycle described in terms of §2.4's two transitions (extend vs replace). 'Implicit eviction' was a fallback-shaped phrase; replaced with 'state is overwritten via path-selection'.

§2.8 — 'graceful degradation' explicitly rejected
  Renamed to 'path totality'. The phrase 'graceful degradation' is now called out as deliberately not used, with a one-paragraph rationale citing the project's no-fallback principle.

§2.9 NEW — 'Anomaly invariants (these are bugs, not states)'
  Three required invariants:
    INV-1 parallel-sequence consistency between cached_token_sequence and K/V tensor seq dim
    INV-2 cache_position_start monotonic non-decreasing within a continuation chain
    INV-3 continuation-path output bit-identical to full-prefill output for any input satisfying §2.4.a
  INV-1 / INV-2 enforced by Python assert statements; violations raise AssertionError to the route handler, which surfaces as HTTP 500 with the OpenAI error envelope and a unique error id for log correlation. The implementation does NOT retry, NOT fall back, NOT silently choose the new-session path to recover.
  INV-3 enforced offline by the §2.7 determinism gate test (mandatory before merge).

§2.10 — observability metrics renamed to match the new framing
  cross_request_kv_reuse_decisions_total{outcome=hit|partial|miss} → path_selection_total{path=continuation|new_session}
    Both labels are first-class outcomes; neither is an 'error' or 'fallback'.
  cross_request_kv_reuse_tokens_skipped_total → continuation_tokens_skipped_total
  verifier_prefill_duration_seconds{path=continuation|new_session}
    (added path label so the per-path cost profile is observable)
  cache_invariant_violations_total{kind=inv1|inv2}
    NEW counter; should always be 0; non-zero is a critical operational alert (page on it).

§2.7 — OQ on Mac Metal numerical determinism strengthened
  The previous OQ said 'if strict bit-identical fails, fall back to ULP-equivalent'. That is itself a fallback in disguise. Revised: any relaxation must be written into this ADR explicitly before the gate is changed; tests that adapt their strictness based on whether the strict path passes are prohibited.

§5 — implementation plan annotated for the new invariants
  PR 7-1: INV-1 assert at every cache mutation site
  PR 7-2: rename find_reusable_prefix → path_select returning a ContinuationPlan | NewSession sum type; INV-2 assert in path_select; tests must cover both path outputs explicitly (not just 'most common path')
  PR 7-4: route INV violations to OpenAI error envelope per §2.9
  PR 7-5: §2.7 determinism gate also covers all path-selection branches

§6 — validation criteria updated
  New gate item: 'INV-1 and INV-2 assertions never fire during validation runs'. cache_invariant_violations_total must be 0 across the determinism test, the synthetic suite, and the 4h Mac M4 run. Any non-zero value is a release blocker.

Verification
------------

$ grep -i 'fallback\|fall back\|graceful degr' docs/adr/0007-cross-request-kv-reuse.md

All remaining hits are explicit prohibitions of the pattern, not designs invoking it. Verified by inspection.

Diff: +166 / -27 lines, all docs/adr/0007-*.md.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
1 parent a2cb801 commit c4ce331

1 file changed

Lines changed: 247 additions & 81 deletions

docs/adr/0007-cross-request-kv-reuse.md

@@ -168,47 +168,97 @@ asst_1_edited, user_2_edited]. Two paths:
168168
cached state monotonically, reset. Edge case rare enough that the
169169
conservative path is correct.
170170

171-
### 2.4 Reset criteria (when prefix matching fails)
172-
173-
**Decision**: the verifier resets and runs full prefill in any of:
174-
175-
1. The cache is empty (first request after server start).
176-
2. `find_reusable_prefix` returns 0 (no overlap with cache).
177-
3. The matched prefix length is below a threshold (default 4 tokens):
178-
the savings of skipping 1–3 tokens of prefill don't justify the
179-
logic complexity, just reset.
180-
4. `cache_position_start > 0` (cache has already evicted earlier
181-
tokens) AND `len(new_prompt) < cache_position_start`: the new
182-
prompt is too short to overlap with what the cache currently
183-
holds.
184-
5. The new request's `system` message differs from the system
185-
message that was active when the cache state was built (covered
186-
by automatic mismatch in matching, but called out for clarity).
187-
188-
**Open question 2.4.a**: should we add a `force_reset=True` request-
189-
level escape hatch? Default no — automatic prefix matching is correct
190-
in 100% of cases (mismatch → fall back to reset). An escape hatch
191-
adds protocol surface for marginal benefit. Revisit in v0.4 if a real
192-
user-facing need surfaces.
193-
194-
### 2.5 Cache eviction policy (single-tenant scope)
195-
196-
**Decision** for v0.3: **the cache lives as long as the server
197-
process**. There is exactly one cache state at any time
198-
(`max_concurrent=1`), and it gets replaced via the prefix-match
199-
algorithm on every request. No idle timeout, no LRU, no explicit
200-
eviction.
201-
202-
**Rationale**: with a single cache slot and prefix-match-based reuse,
203-
"eviction" happens implicitly: if a new request doesn't match, the
204-
cache resets. Adding idle timeout machinery for v0.3 is premature
205-
optimization.
206-
207-
**Forward-compatibility**: in v0.4 (multi-tenant), eviction becomes
208-
a real concern (many sessions, finite memory). v0.4's ADR will
209-
specify LRU + idle timeout. The session abstraction we introduce
210-
here is structured so v0.4 can extend it without rewriting the v0.3
211-
core.
171+
### 2.4 Path selection: continuation vs new-session
172+
173+
**Decision**: every request takes exactly one of two deterministic
174+
paths. Both paths are first-class correct actions for their input
175+
class. There is no "fallback" semantic — the project's engineering
176+
principles forbid fallback as a design pattern (alongside no-mock
177+
and no-overfit), and this section's split between continuation and
178+
new-session reflects that prohibition.
179+
180+
#### 2.4.a Continuation path
181+
182+
Triggered when the new request's prompt is a strict monotonic
183+
extension of the cached state. Formally, both must hold:
184+
185+
- `len(new_prompt) >= cache_position_start + len(cached_token_sequence)`
186+
(the new prompt extends at or past the cached region's logical end), AND
187+
- `new_prompt[cache_position_start : cache_position_start + len(cached_token_sequence)]
188+
== cached_token_sequence` (every cached position matches the new
189+
prompt at the same logical position).
190+
191+
Action: skip prefill of the matched logical positions; run
192+
incremental prefill on
193+
`new_prompt[cache_position_start + len(cached_token_sequence):]`.
194+
195+
#### 2.4.b New-session path
196+
197+
Triggered when the request is **not** a continuation, i.e. fails
198+
either of the §2.4.a conditions. Concrete sub-cases:
199+
200+
1. **Cold start**: the cache is empty (first request after server
201+
boot, or after the previous request's incremental path
202+
completed and truncated the cache to empty by trim).
203+
2. **Shorter history**: the new prompt is shorter than the cached
204+
state's logical end. Caused by the client deliberately
205+
shortening conversation history (e.g., the user opened a new
206+
chat tab in the agent UI).
207+
3. **Diverging history**: the cached state's tokens disagree with
208+
the new prompt at one or more cached logical positions. Caused
209+
by the client switching to a different conversation that may
210+
share an early prefix (e.g., the same system prompt) but
211+
diverges before the cache window's end.
212+
213+
Action: reset the verifier and run full prefill on `new_prompt`. The
214+
cache state is replaced with the new session's state.
215+
216+
#### 2.4.c Path semantics
217+
218+
Both paths produce **bit-identical** output for the same input
219+
prompt (per §2.7). The only difference is computational cost: the
220+
continuation path skips already-prefilled tokens.
221+
222+
Selecting the new-session path is **not** a degradation of the
223+
continuation path; it is the **correct** action when the input
224+
does not satisfy the continuation precondition. A new conversation
225+
genuinely requires a fresh prefill — there is no shortcut, and
226+
choosing to fresh-prefill is not a "fall back to a worse path".
227+
228+
The two-path structure exhausts the input space: every valid input
229+
prompt satisfies the continuation precondition or it does not. The
230+
selection function is total. There is no third path. Inputs that
231+
violate the path's preconditions at runtime cannot exist by
232+
construction; if such an input appears, it is an anomaly invariant
233+
violation per §2.9, which is a bug not a fallback.
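
The two preconditions in §2.4.a and the sub-cases in §2.4.b can be sketched as a single total function. This is a minimal illustration, not the implementation from PR 7-2: the names `path_select`, `ContinuationPlan`, `NewSession` and `skip_n` come from the commit message and §5, while the exact signature and the plain-list representation of token sequences are assumptions. The empty-cache check is written out explicitly because an empty cache would otherwise satisfy both §2.4.a conditions trivially, whereas §2.4.b lists cold start as a new-session case.

```python
from dataclasses import dataclass
from typing import Sequence, Union


@dataclass(frozen=True)
class ContinuationPlan:
    # Number of leading prompt positions whose prefill is skipped;
    # incremental prefill would run on new_prompt[skip_n:].
    skip_n: int


@dataclass(frozen=True)
class NewSession:
    # No payload: the whole prompt is prefilled from scratch.
    pass


PathDecision = Union[ContinuationPlan, NewSession]


def path_select(
    new_prompt: Sequence[int],
    cached_token_sequence: Sequence[int],
    cache_position_start: int,
) -> PathDecision:
    """Return exactly one of the two paths for any input (assumed signature)."""
    cached = list(cached_token_sequence)
    if not cached:
        return NewSession()  # cold start (§2.4.b case 1)

    logical_end = cache_position_start + len(cached)
    if len(new_prompt) < logical_end:
        return NewSession()  # shorter history (§2.4.b case 2)
    if list(new_prompt[cache_position_start:logical_end]) != cached:
        return NewSession()  # diverging history (§2.4.b case 3)

    # Both §2.4.a conditions hold.
    return ContinuationPlan(skip_n=logical_end)
```

Every branch returns a value of the sum type, so a caller matching on `ContinuationPlan` versus `NewSession` never needs a third "mismatch" case.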
234+
235+
### 2.5 Cache state lifecycle (single-tenant scope)
236+
237+
**Decision** for v0.3: **the cache holds exactly one state at any
238+
time, lives as long as the server process, and is overwritten via
239+
the path-selection function on every request.** There is no LRU,
240+
no idle timeout, no explicit eviction call.
241+
242+
**Rationale**: with a single cache slot (`max_concurrent=1`), the
243+
state lifecycle is fully described by §2.4's two paths:
244+
245+
- Continuation path: cache state is **extended** with the
246+
incremental tokens.
247+
- New-session path: cache state is **replaced** with the fresh
248+
prefill's output.
249+
250+
Both transitions are deterministic correct actions per §2.4.c. No
251+
state is left "stale" — the state at any moment reflects whichever
252+
session was most recently observed. Adding idle-timeout machinery
253+
for v0.3 single-tenant is premature optimization without a
254+
concrete user need.
255+
256+
**Forward-compatibility**: in v0.4 (multi-tenant), the cache space
257+
holds N states (one per concurrent session), and the lifecycle
258+
includes eviction (LRU + idle timeout) because the state count is
259+
bounded but the request stream is not. v0.4's ADR will specify
260+
those policies. The session abstraction this ADR introduces is
261+
structured so v0.4 can extend it without rewriting the v0.3 core.
212262

213263
### 2.6 Concurrency: explicit single-tenant scope for v0.3
214264

@@ -261,43 +311,151 @@ mandatory before v0.3.0 GA.
261311
**Open question 2.7.a**: numerical determinism on Apple Metal —
262312
mlx_lm sometimes produces tiny floating-point differences across
263313
runs even with greedy decoding. If the bit-identical test is too
264-
strict on Mac M4, fall back to "logits agree to within float16
265-
ULPs" and document the relaxation.
314+
strict on Mac M4, the test gate should be **relaxed once,
315+
explicitly, in this ADR** to "logits agree to within float16 ULPs"
316+
— with the relaxation written down here, not silently applied at
317+
test runtime. A test that adapts its strictness based on whether
318+
the strict path passes is a fallback in disguise.
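
One way to read "logits agree to within float16 ULPs" is as a distance between the float16 bit patterns of two values, compared against a tolerance that is a fixed constant rather than something chosen at test runtime. The sketch below uses only the standard library; `MAX_ULPS`, `ulp_distance_f16` and `logits_agree` are hypothetical names, and NaN handling is deliberately left out.

```python
import struct
from typing import Sequence

# Hypothetical gate constant. It would be changed only together with this ADR,
# never adapted by the test itself. 0 means zero ULP distance.
MAX_ULPS = 0


def _ordered_bits(x: float) -> int:
    """Map a float16-representable value to an integer ordered like the float."""
    (bits,) = struct.unpack("<H", struct.pack("<e", x))
    magnitude = bits & 0x7FFF
    # Negative floats map to negative integers; +0.0 and -0.0 both map to 0.
    return -magnitude if bits & 0x8000 else magnitude


def ulp_distance_f16(a: float, b: float) -> int:
    """Number of float16 steps between a and b."""
    return abs(_ordered_bits(a) - _ordered_bits(b))


def logits_agree(xs: Sequence[float], ys: Sequence[float], max_ulps: int = MAX_ULPS) -> bool:
    """True if both logit vectors have equal length and every pair is within max_ulps."""
    return len(xs) == len(ys) and all(
        ulp_distance_f16(x, y) <= max_ulps for x, y in zip(xs, ys)
    )
```

Under this reading, a test that first tries `max_ulps=0` and then retries with a larger value on failure would be exactly the adaptive pattern the paragraph above prohibits.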
266319

267-
### 2.8 Backward compatibility: graceful degradation
320+
### 2.8 Backward compatibility: path totality
268321

269322
**Decision**: cross-request reuse is **transparent and automatic**;
270-
there is no opt-out. If a request comes in that doesn't share a
271-
prefix with the cache (e.g. a totally new conversation, or
272-
multi-tenant traffic in v0.4), the server falls back to reset +
273-
full prefill. Behavior is then identical to v0.3.0-rc1.
323+
there is no opt-out. The path-selection function (§2.4) is total
324+
over all valid inputs. There is no fallback semantic.
274325

275-
**Rationale**: removing the opt-out keeps the protocol surface
276-
small. The fallback path is the v0.3.0-rc1 behavior, which is
277-
already tested and shipped.
326+
**Rationale**:
327+
328+
- A client whose request happens to satisfy the continuation
329+
precondition (the dominant case for an agent in a multi-turn
330+
loop) takes the continuation path. The server's behavior on its
331+
output is bit-identical to v0.3.0-rc1's per-turn-reset behavior.
332+
- A client whose request does not satisfy the continuation
333+
precondition (cold start, new chat, edited history) takes the
334+
new-session path. The server's behavior on its output is **also**
335+
bit-identical to v0.3.0-rc1's per-turn-reset behavior — because
336+
full prefill is exactly what v0.3.0-rc1 always did.
337+
338+
Therefore the upgrade is observably indistinguishable from
339+
v0.3.0-rc1 on output (same tokens, same `usage` block, same
340+
`/healthz`), with the single observable change being **per-turn
341+
latency**: continuation path turns are O(new tokens), new-session
342+
path turns are O(history length), same as v0.3.0-rc1.
343+
344+
The phrase "graceful degradation" is deliberately not used.
345+
Degradation implies a primary correct path and a less-correct
346+
backup. Both paths here are equally correct for their input
347+
classes, just with different cost profiles. This framing is
348+
required by the project's no-fallback principle (alongside no-mock
349+
and no-overfit).
350+
351+
### 2.9 Anomaly invariants (these are bugs, not states)
352+
353+
The path-selection function (§2.4) is total over the input space,
354+
but the verifier's internal state has invariants that the
355+
implementation is responsible for maintaining. Their violation is
356+
a **bug**, not a path. Violations must surface immediately as
357+
runtime errors; the implementation must not silently recover, retry,
358+
or take an alternate path.
359+
360+
**Required invariants**:
361+
362+
- **INV-1: parallel-sequence consistency.** For every layer's
363+
`SinkWindowKVCache`,
364+
`len(cached_token_sequence) == cache.cache_seq_length()` must
365+
hold after every cache mutation (`update_and_fetch`, `trim`,
366+
`reset`). The parallel token sequence must never drift from the
367+
K/V tensor sequence dimension.
368+
- **INV-2: position monotonicity within a session.** During a
369+
continuation chain (consecutive continuation-path requests for
370+
the same session), `cache_position_start` is monotonically non-
371+
decreasing across requests. A continuation that decreases
372+
`cache_position_start` indicates a cache-management bug.
373+
- **INV-3: continuation-path determinism.** For inputs that satisfy
374+
the continuation precondition (§2.4.a), the incremental-prefill
375+
output must be bit-identical (or float-precision-equivalent per
376+
§2.7) to the full-prefill output for the same input. This is the
377+
contract that makes §2.4.c's "both paths correct for their
378+
inputs" claim hold.
379+
380+
**Detection and response**:
381+
382+
- INV-1 is checked at every cache mutation via Python `assert`
383+
statements (cheap, in-process). Violation raises `AssertionError`
384+
to the route handler, which surfaces as an HTTP 500 with the
385+
OpenAI error envelope and a unique error id for log correlation.
386+
- INV-2 is checked when path selection runs; violation raises
387+
`RuntimeError` with the offending values for the bug report.
388+
- INV-3 is checked offline via the §2.7 determinism gate test
389+
(mandatory before merge). It is not a runtime check because the
390+
comparison requires running both paths on the same input, which
391+
is too expensive in production.
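
The two runtime checks can be sketched as small helpers. This is an illustration under assumptions: `check_inv1` and `check_inv2` are hypothetical names, the K/V sequence length is passed in as a plain integer instead of being read from a `SinkWindowKVCache`, and the exception types follow the "Detection and response" list above.

```python
from typing import Optional, Sequence


def check_inv1(cached_token_sequence: Sequence[int], kv_seq_length: int) -> None:
    """INV-1: the parallel token sequence matches the K/V sequence dimension."""
    assert len(cached_token_sequence) == kv_seq_length, (
        f"INV-1 violated: {len(cached_token_sequence)} cached tokens "
        f"vs K/V seq length {kv_seq_length}"
    )


def check_inv2(previous_start: Optional[int], current_start: int) -> None:
    """INV-2: cache_position_start never decreases within a continuation chain.

    previous_start is None for the first request of a chain (nothing to compare).
    """
    if previous_start is not None and current_start < previous_start:
        raise RuntimeError(
            f"INV-2 violated: cache_position_start decreased "
            f"from {previous_start} to {current_start}"
        )
```

Neither helper catches or retries anything; the exception propagates to the caller. Python skips `assert` statements when run with `-O`, so an assert-based INV-1 check is only active in processes started without that flag.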
392+
393+
A violation of any of these is a **critical bug**. The
394+
implementation does not retry, does not fall back, does not silently
395+
choose the new-session path to "recover". It raises. Operators
396+
encountering an INV-1 or INV-2 violation should file a bug report
397+
and restart the server. The next request after restart takes the
398+
cold-start sub-case of the new-session path (§2.4.b case 1), which
399+
is correct in its own right — but the assertion that surfaced the
400+
bug must be investigated, never papered over.
401+
402+
The OpenAI error envelope returned for INV-1 and INV-2 violations
403+
follows the convention from PR #13:
404+
405+
```json
406+
{
407+
"error": {
408+
"message": "internal cache invariant violation; bug id <UUID>",
409+
"type": "internal_error",
410+
"code": "kv_cache_inv_violation"
411+
}
412+
}
413+
```
414+
415+
The bug id is a UUID logged alongside the assertion stack trace so
416+
the report can be correlated with server logs.
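
A route handler could build this envelope roughly as follows. The sketch is framework-agnostic and the function name `invariant_violation_response` and the logger name are assumptions; only the envelope fields and the HTTP 500 status come from the text above.

```python
import logging
import uuid
from typing import Tuple

logger = logging.getLogger("kv_cache")  # hypothetical logger name


def invariant_violation_response(exc: BaseException) -> Tuple[int, dict]:
    """Turn an INV-1 / INV-2 exception into (status_code, envelope)."""
    bug_id = str(uuid.uuid4())
    # Log the bug id together with the stack trace so the two can be correlated.
    logger.error("cache invariant violation; bug id %s", bug_id, exc_info=exc)
    envelope = {
        "error": {
            "message": f"internal cache invariant violation; bug id {bug_id}",
            "type": "internal_error",
            "code": "kv_cache_inv_violation",
        }
    }
    return 500, envelope
```

The handler only reports: it does not retry the request or re-run it on the new-session path.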
278417

279-
### 2.9 Observability
418+
### 2.10 Observability
280419

281420
**Decision**: extend Prometheus metrics with:
282421

283-
- `cross_request_kv_reuse_decisions_total{outcome="hit|partial|miss"}`
284-
— counter of per-request decisions: `hit` (full reuse, prefill
285-
bypassed), `partial` (some prefix reused), `miss` (full reset).
286-
- `cross_request_kv_reuse_tokens_skipped_total` — counter of cumulative
287-
prompt tokens that did not need to be prefilled because of prefix
288-
match.
289-
- `verifier_prefill_duration_seconds` — histogram of prefill wall
290-
time per request, for observing the win.
422+
- `path_selection_total{path="continuation|new_session"}`
423+
counter of per-request path-selection decisions. Both labels are
424+
first-class outcomes; neither is an "error" or "fallback". The
425+
ratio `continuation / (continuation + new_session)` over a long
426+
session indicates how well the upstream client preserves
427+
prefix-extending history.
428+
- `continuation_tokens_skipped_total` — counter of cumulative
429+
prompt tokens that the continuation path did not need to
430+
re-prefill across the lifetime of the server. Concretely
431+
measures the win.
432+
- `verifier_prefill_duration_seconds{path="continuation|new_session"}`
433+
— histogram of prefill wall time per request, partitioned by
434+
path. Continuation-path histogram should center around the
435+
per-incremental-token cost; new-session-path histogram tracks
436+
full-prefill cost.
437+
- `cache_invariant_violations_total{kind="inv1|inv2"}` — counter
438+
of INV-1 / INV-2 anomaly detections (per §2.9). Should always
439+
be 0. Any non-zero value is a critical alert.
291440

292441
These are net additions; existing `scheduler_kv_live_bytes` and
293442
friends keep their semantics.
294443

295-
**Operational use**: in production, an operator should see hit-rate
296-
≥ 95% for a healthy long-session agent. A drop to < 50% means
297-
either (a) prompt-management code on the client side is breaking
298-
the prefix (e.g. inserting timestamps) or (b) different sessions are
299-
multiplexed onto one server (a v0.4 deployment running under v0.3
300-
infrastructure).
444+
**Operational use**:
445+
446+
- **Healthy long-session agent**: continuation rate is high (e.g.
447+
≥ 95% of requests for a multi-turn LangChain conversation).
448+
- **Healthy mixed workload**: continuation rate may be lower —
449+
agents spawning short-lived conversations, multiple parallel
450+
threads, or system-prompt rotation will all legitimately
451+
generate new-session-path requests. A "low" continuation rate
452+
by itself is not a problem; it is a workload characterization.
453+
- **Critical alert**: any non-zero `cache_invariant_violations_total`.
454+
This is a bug, not a degraded state. Page on this metric.
455+
- **Performance regression alert**: a previously-high continuation
456+
rate dropping unexpectedly. This indicates an upstream change
457+
(client started inserting timestamps in history, framework
458+
upgrade changed message format) that broke the prefix.
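
The continuation rate referred to above is the ratio `continuation / (continuation + new_session)` over the two `path_selection_total` label values. A minimal sketch of that arithmetic and of the paging rule, with hypothetical helper names and plain integers standing in for scraped counter values:

```python
from typing import Optional


def continuation_rate(continuation: int, new_session: int) -> Optional[float]:
    """Share of requests that took the continuation path, or None with no traffic."""
    total = continuation + new_session
    if total == 0:
        return None
    return continuation / total


def should_page(invariant_violations: int) -> bool:
    """Any non-zero cache_invariant_violations_total is a critical alert."""
    return invariant_violations > 0
```

A low rate returned by `continuation_rate` is a workload characterization on its own; only `should_page` maps directly to an alert.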
301459

302460
## 3. Alternatives Considered
303461

@@ -451,12 +609,12 @@ tests. PR breakdown in §5 below.
451609

452610
| # | PR | Scope | Coverage gate |
453611
|---|---|---|---|
454-
| 7-1 | `SinkWindowKVCache` + parallel token sequence (MLX + CPU) | logical_token_sequence + logical_position_start, update/trim invariants, unit tests | 100% on touched modules |
455-
| 7-2 | `Verifier.find_reusable_prefix` + `prefill_incremental` (MLX + CPU) | the prefix-match algorithm + the incremental prefill path; unit tests with synthetic verifier | 100% |
456-
| 7-3 | `SpeculativeDecoder` integration | accept reusable-prefix hint; route between full-prefill and incremental-prefill paths | 100% |
457-
| 7-4 | `SpeculativeEngine` route-handler integration | call find_reusable_prefix before delegating to decoder.generate; emit decision to metrics | 100% |
458-
| 7-5 | Determinism gate test | bit-identical comparison between reuse path and always-reset path on a 30-turn synthetic conversation | mandatory before merge |
459-
| 7-6 | bench_long_session_v2 + 4h Mac re-run | bench observes per-turn cost stable at O(new_message) and §2.3.a still holds | 4h Mac evidence |
612+
| 7-1 | `SinkWindowKVCache` + parallel token sequence (MLX + CPU) | `logical_token_sequence` + `logical_position_start`; `update_and_fetch` / `trim` / `reset` paths sync the parallel sequence; **INV-1 assert at every mutation site**; unit tests | 100% on touched modules |
613+
| 7-2 | `Verifier.path_select(prompt) -> ContinuationPlan \| NewSession` + `prefill_incremental(skip_n)` (MLX + CPU) | the path-selection function (§2.4) + the incremental prefill path; **INV-2 assert in path-select**; unit tests with synthetic verifier covering both paths' inputs explicitly | 100% |
614+
| 7-3 | `SpeculativeDecoder` integration | accept the path-selection result; route between full-prefill and incremental-prefill paths | 100% |
615+
| 7-4 | `SpeculativeEngine` route-handler integration | call verifier.path_select before delegating to decoder.generate; emit `path_selection_total` metric; route INV violations to OpenAI error envelope per §2.9 | 100% |
616+
| 7-5 | Determinism gate test (§2.7 + INV-3) | bit-identical comparison between continuation path and always-reset path on a 30-turn synthetic conversation; covers all path-selection branches; mandatory before merge | mandatory before merge |
617+
| 7-6 | bench_long_session_v2 + 4h Mac re-run | bench observes per-turn cost stable at O(new_message); §2.3.a still holds; INV violations counter is 0 over 4h | 4h Mac evidence |
460618
| 7-7 | ADR 0006 §2.3.b deletion + §2.3.a expansion | delete the no-longer-valid caveat; expand §2.3.a with v2 bench evidence | doc-only |
461619

462620
Estimated: 7 PRs total. Each independently reviewable. PRs 7-1 →
@@ -468,20 +626,28 @@ validation + docs.
468626
This ADR is considered validated when:
469627

470628
1. All 7 implementation PRs land on main.
471-
2. The §2.7 determinism gate passes: bit-identical (or float16-ULP-
472-
identical on Metal) output between reuse and always-reset paths
473-
over a 30-turn synthetic test.
474-
3. A 4-hour Mac M4 run with `bench_long_session_v2.py` produces
629+
2. The §2.7 determinism gate passes: bit-identical (or
630+
float16-ULP-identical on Metal, per the §2.7 OQ resolution) output
631+
between continuation path and always-reset path over a 30-turn
632+
synthetic test. **The relaxation, if any, is recorded in this
633+
ADR before the gate is changed — never silently relaxed at test
634+
runtime.**
635+
3. INV-1 and INV-2 (§2.9) assertions never fire during the §6
636+
validation runs. The `cache_invariant_violations_total` counter
637+
stays at 0 across the determinism test, the synthetic suite, and
638+
the 4h Mac M4 run. Any non-zero value is a release blocker.
639+
4. A 4-hour Mac M4 run with `bench_long_session_v2.py` produces
475640
≥ 200 successful turns (vs the 58 turns of v0.3.0-rc1's 4h run)
476641
with `agg.kv_bounded == True` and `agg.n_errors < 5`.
477-
4. Per-turn p50 latency drift over the 4-hour run is ≤ 5 seconds
642+
5. Per-turn p50 latency drift over the 4-hour run is ≤ 5 seconds
478643
(vs the +39.74 s drift of v0.3.0-rc1).
479-
5. ADR 0006 §2.3.b is deleted and replaced with a paragraph in §2.3.a
644+
6. ADR 0006 §2.3.b is deleted and replaced with a paragraph in §2.3.a
480645
citing this ADR's evidence.
481-
6. `cross_request_kv_reuse_decisions_total{outcome="hit"}` reports
482-
≥ 95% hit rate on the 4h bench.
646+
7. `path_selection_total{path="continuation"}` reports ≥ 95% of
647+
total path selections on the 4h bench.
483648

484-
Items 2-4 are GA gates; v0.3.0 cannot promote to GA without them.
649+
Items 2–5 are GA gates; v0.3.0 cannot promote to GA without all of
650+
them.
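
The GA gates (items 2–5) can be expressed as one check over a summary of the validation runs. The sketch below is illustrative: `ValidationRun` and its field names are hypothetical, while the thresholds are the ones listed in the items above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ValidationRun:
    """Hypothetical summary of the validation runs; field names are illustrative."""
    determinism_gate_passed: bool  # item 2
    invariant_violations: int      # item 3: cache_invariant_violations_total
    successful_turns: int          # item 4
    kv_bounded: bool               # item 4: agg.kv_bounded
    n_errors: int                  # item 4: agg.n_errors
    p50_drift_seconds: float       # item 5


def failed_ga_gates(run: ValidationRun) -> List[str]:
    """Return the GA gate items that fail; an empty list means all gates pass."""
    failures = []
    if not run.determinism_gate_passed:
        failures.append("item 2: determinism gate")
    if run.invariant_violations != 0:
        failures.append("item 3: invariant violations")
    if run.successful_turns < 200 or not run.kv_bounded or run.n_errors >= 5:
        failures.append("item 4: 4h Mac M4 run")
    if run.p50_drift_seconds > 5.0:
        failures.append("item 5: p50 latency drift")
    return failures
```

For example, a run with a single invariant violation fails item 3 regardless of how the other numbers look, matching the "release blocker" wording.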
485651

486652
## 7. Open questions (require decision before implementation)
487653
