Commit c360dff

extension/llm/server: warm append-only session resume (V2b.1)

Builds on the isolated named sessions: a named session now keeps its decoded context across requests. When the next request's prompt tokens are an exact prefix extension of the session's resident tokens (the same conversation plus a new turn), the worker prefills ONLY the new suffix -- continuing the KV/recurrent state in place -- instead of resetting and re-prefilling the whole prompt. The match is exact-token (never re-tokenized text), so it is always correct: a token mismatch, a stop-string trim, or a prior error falls back to a full reset + prefill. This is per-session resume, not global prefix caching.

The decision lives in a dependency-free pure helper (worker_prefill_plan.h, unit-tested standalone); worker_loop.h tracks each session's resident token ids (invariant: resident size == session position) and executes the plan, and the `done` event reports reused_prompt_tokens / prefilled_prompt_tokens / session_reset_reason for measuring the hit rate.

POST /v1/sessions/{id}/reset clears a session's context while keeping its slot. The qwen worker's --warm_resume (serve.py --no-warm-resume) gates the behavior for A/B measurement.

Review order: worker_prefill_plan.h + its test; then worker_loop.h (resident tracking + plan execution); then the control-plane reset op + metrics; then docs.

Part of #20001

ghstack-source-id: 5f5e0a2
ghstack-comment-id: 4661783803
Pull-Request: #20160

1 parent e4f4ab2 commit c360dff

12 files changed

Lines changed: 480 additions & 67 deletions

examples/models/qwen3_5_moe/README.md

Lines changed: 23 additions & 5 deletions
@@ -197,6 +197,7 @@ is safe under asyncio.
 | `--max-context` | (none) | Reject prompts that exceed it with 400 |
 | `--no-think` | off | Default reasoning off (`enable_thinking=False`) |
 | `--max-sessions` | `1` | Isolated sessions on one weight load (see Sessions) |
+| `--warm-resume` / `--no-warm-resume` | on | Reuse a session's KV across turns (see Sessions) |

 ### Sessions

@@ -211,24 +212,41 @@ aliases, the `X-ExecuTorch-Session-ID` / `session_id` / `x-session-affinity`
 headers (body wins, then that header order). The header aliases let a client that
 already emits a stable per-conversation affinity id (e.g. pi's
 `sendSessionAffinityHeaders`) route with no extra config. Requests without any
-share a transient scratch session. Free a session with `DELETE /v1/sessions/{id}`.
+share a transient scratch session.

 ```bash
 curl http://127.0.0.1:8000/v1/chat/completions \
   -H 'Content-Type: application/json' \
   -d '{"model":"qwen3.5-moe","session_id":"alice",
        "messages":[{"role":"user","content":"hi"}]}'
+
+curl -X POST http://127.0.0.1:8000/v1/sessions/alice/reset # clear context, keep the slot
+curl -X DELETE http://127.0.0.1:8000/v1/sessions/alice # free context + slot (VRAM)
 ```

 Admission is up front: an explicit `session_id` on a single-session server
 returns **400** (`unsupported_session`); past capacity it returns **429**
 (`capacity_exhausted`) before any response bytes.

-This is **isolation, not concurrency or warm resume**: execution is still
+**Warm append-only resume** (on by default): when a named session's next request
+is an exact-token extension of its resident context (e.g. the same conversation
+plus a new turn), the worker prefills **only the new suffix** instead of
+re-prefilling the whole prompt — continuing the KV/recurrent state in place. The
+check is exact-token (never re-tokenized text), so it is always correct: anything
+that can't be proven an exact extension (token mismatch, a stop-string trim, a
+prior error) falls back to a full reset + prefill. This is **per-session** warm
+append-only resume, **not** global prefix caching: there is no cross-session
+prefix sharing, so a system prompt common to two different `session_id`s is
+prefilled independently for each (unlike vLLM/llama.cpp global prefix reuse).
+Each `done` event reports
+`reused_prompt_tokens`, `prefilled_prompt_tokens`, and `session_reset_reason`
+(`new`/`exact_prefix`/`dirty`/`mismatch`/`equal`) for measuring the hit rate.
+`--no-warm-resume` forces a full prefill every request (for A/B comparison).
+
+This is **isolation + warm resume, not concurrency**: execution is still
 synchronous (one in-flight request; `--num-runners > 1` is rejected since more
-workers would duplicate the weights), and each request resets its session — the
-recurrent/conv state cannot be rewound by position (`seek()` is NotSupported), so
-turn-to-turn KV reuse (append-only warm resume) is a follow-up.
+workers would duplicate the weights). Fair interleaving across in-flight requests
+is a follow-up.

 ### Other limitations

examples/models/qwen3_5_moe/qwen35_moe_worker.cpp

Lines changed: 13 additions & 6 deletions
@@ -18,11 +18,11 @@
 // process segfaults in the int4 matmul (validated). Here the model runs in a
 // plain synchronous loop in its own process, which is reliable.
 //
-// Multi-session (isolation): the engine loads weights once and hosts multiple
-// isolated sessions on that one ~18GB allocation; the shared worker loop
-// (worker_loop.h) routes requests to per-session_id state, up to
-// --max_sessions. Execution is still synchronous (one in-flight request); warm
-// context reuse across requests is a follow-up.
+// Multi-session: the engine loads weights once and hosts multiple isolated
+// sessions on that one ~18GB allocation; the shared worker loop (worker_loop.h)
+// routes requests to per-session_id state (up to --max_sessions) and warm-
+// resumes each session's context across requests (append-only suffix prefill).
+// Execution is synchronous (one in-flight request).

 #include <gflags/gflags.h>

@@ -41,6 +41,12 @@ DEFINE_int32(
     "Max physical sessions to host on the one weight allocation (CUDA "
     "per-session mutable rebinding). Clamped to 1 if the backend cannot "
     "rebind.");
+DEFINE_bool(
+    warm_resume,
+    true,
+    "Warm append-only resume for named sessions: prefill only the suffix when a "
+    "request's tokens extend the session's resident context. Off resets every "
+    "request (useful for A/B measurement).");

 namespace {
 namespace llm = ::executorch::extension::llm;
@@ -73,5 +79,6 @@ int main(int argc, char** argv) {
   // ids back to text internally. The shared loop owns per-session_id state.
   ::tokenizers::Tokenizer* tokenizer = engine->tokenizer();

-  return llm::run_worker_stdio_loop(*engine, *tokenizer, engine->metadata());
+  return llm::run_worker_stdio_loop(
+      *engine, *tokenizer, engine->metadata(), FLAGS_warm_resume);
 }

examples/models/qwen3_5_moe/serve.py

Lines changed: 13 additions & 2 deletions
@@ -23,8 +23,10 @@
   requests share a scratch session). See --max-sessions.
 * Execution is synchronous: one in-flight request at a time, concurrent HTTP
   requests queue. Sessions provide isolation, not concurrent throughput.
-* No warm context reuse yet: each request resets its session (Qwen seek() is
-  NotSupported; append-only reuse is a follow-up).
+* Warm append-only resume is on by default (--warm-resume): a named session
+  reuses its resident context across turns when the prompt is an exact-token
+  extension, including tool-call turns via token-ID prompt segments. Anonymous
+  (scratch) requests always reset.
 * The control plane only does blocking pipe I/O on its executor thread (no
   CUDA), which is safe under asyncio.

@@ -83,6 +85,7 @@ def _spawn(args):
     if args.data_path:
         cmd += ["--data_path", args.data_path]
     cmd += ["--max_sessions", str(args.max_sessions)]
+    cmd += [f"--warm_resume={'true' if args.warm_resume else 'false'}"]
     logger.info("Starting Qwen worker subprocess (loads the model once)...")
     return spawn_worker(cmd, env=env)

@@ -162,6 +165,14 @@ def main() -> None:
         "cannot rebind. One slot is reserved for anonymous requests, so the "
         "number of addressable session_ids is max-sessions - 1.",
     )
+    p.add_argument(
+        "--warm-resume",
+        action=argparse.BooleanOptionalAction,
+        default=True,
+        help="Warm append-only resume for named sessions: a request whose tokens "
+        "extend the session's resident context prefills only the suffix. "
+        "--no-warm-resume resets every request (for A/B measurement).",
+    )
     p.add_argument(
         "--worker-bin",
         default=None,

examples/models/qwen3_5_moe/test_serve.py

Lines changed: 3 additions & 0 deletions
@@ -76,6 +76,7 @@ def fake_spawn(cmd, env=None):
             tokenizer_path="t.json",
             data_path="d.ptd",
             max_sessions=4,
+            warm_resume=True,
         )
     )
     assert captured["cmd"] == [
@@ -88,6 +89,7 @@ def fake_spawn(cmd, env=None):
         "d.ptd",
         "--max_sessions",
         "4",
+        "--warm_resume=true",
     ]


@@ -103,6 +105,7 @@ def test_spawn_defaults_worker_bin_and_omits_empty_data_path(monkeypatch):
             tokenizer_path="t.json",
             data_path=None,
             max_sessions=4,
+            warm_resume=True,
         )
     )
     cmd = captured["cmd"]

extension/llm/server/cpp/CMakeLists.txt

Lines changed: 9 additions & 0 deletions
@@ -86,3 +86,12 @@ if(NOT CMAKE_BUILD_TYPE STREQUAL "Debug")
   target_link_options_gc_sections(text_llm_worker)
   target_link_options(text_llm_worker PRIVATE "LINKER:-s")
 endif()
+
+# Pure unit test for the warm-resume prefill planner (worker_prefill_plan.h). No
+# ET/model/tokenizer dependency, so it builds and runs standalone via ctest.
+enable_testing()
+add_executable(test_worker_prefill_plan test_worker_prefill_plan.cpp)
+target_include_directories(
+  test_worker_prefill_plan PUBLIC ${_common_include_directories}
+)
+add_test(NAME worker_prefill_plan COMMAND test_worker_prefill_plan)

Lines changed: 120 additions & 0 deletions
@@ -0,0 +1,120 @@
+/*
+ * Copyright (c) Meta Platforms, Inc. and affiliates.
+ * All rights reserved.
+ *
+ * This source code is licensed under the BSD-style license found in the
+ * LICENSE file in the root directory of this source tree.
+ */
+
+// Unit tests for plan_prefill() (warm-resume decision). No model/session/ET
+// runtime dependency -- the header is pure, so this compiles and runs
+// standalone. Self-contained assertions (no gtest) so it has no build deps.
+
+#include <executorch/extension/llm/server/cpp/worker_prefill_plan.h>
+
+#include <cstdio>
+#include <cstring>
+#include <string>
+#include <vector>
+
+using executorch::extension::llm::plan_prefill;
+using executorch::extension::llm::PrefillPlan;
+
+namespace {
+int g_failures = 0;
+
+void expect(
+    const char* name,
+    const PrefillPlan& p,
+    PrefillPlan::Action action,
+    size_t suffix_start,
+    const char* reason) {
+  bool ok = p.action == action && p.suffix_start == suffix_start &&
+      std::strcmp(p.reason, reason) == 0;
+  if (!ok) {
+    ++g_failures;
+    printf(
+        " [FAIL] %s: got action=%d suffix_start=%zu reason=%s\n",
+        name,
+        (int)p.action,
+        p.suffix_start,
+        p.reason);
+  } else {
+    printf(" [PASS] %s\n", name);
+  }
+}
+} // namespace
+
+int main() {
+  using V = std::vector<uint64_t>;
+
+  // First request: nothing resident -> full prefill, "new".
+  expect(
+      "new (resident empty)",
+      plan_prefill(V{}, V{1, 2, 3}, false),
+      PrefillPlan::kFull,
+      0,
+      "new");
+
+  // Exact token extension -> prefill only the suffix.
+  expect(
+      "exact_prefix (suffix reuse)",
+      plan_prefill(V{1, 2, 3}, V{1, 2, 3, 4, 5}, false),
+      PrefillPlan::kSuffix,
+      3,
+      "exact_prefix");
+
+  // Single-token extension still reuses.
+  expect(
+      "exact_prefix (one-token suffix)",
+      plan_prefill(V{1, 2, 3}, V{1, 2, 3, 4}, false),
+      PrefillPlan::kSuffix,
+      3,
+      "exact_prefix");
+
+  // Divergent token -> mismatch, full reset.
+  expect(
+      "mismatch (divergent token)",
+      plan_prefill(V{1, 2, 3}, V{1, 2, 9, 4}, false),
+      PrefillPlan::kFull,
+      0,
+      "mismatch");
+
+  // Prompt shorter than resident (rewind) -> mismatch, full reset.
+  expect(
+      "mismatch (prompt shorter)",
+      plan_prefill(V{1, 2, 3}, V{1, 2}, false),
+      PrefillPlan::kFull,
+      0,
+      "mismatch");
+
+  // Dirty wins even over an otherwise-exact extension.
+  expect(
+      "dirty (overrides exact prefix)",
+      plan_prefill(V{1, 2, 3}, V{1, 2, 3, 4}, true),
+      PrefillPlan::kFull,
+      0,
+      "dirty");
+
+  // Prompt identical to resident -> reset + full (no empty-suffix prefill).
+  expect(
+      "equal (prompt == resident)",
+      plan_prefill(V{1, 2, 3}, V{1, 2, 3}, false),
+      PrefillPlan::kFull,
+      0,
+      "equal");
+
+  // Dirty + empty resident still resets as dirty (dirty checked first).
+  expect(
+      "dirty (empty resident)",
+      plan_prefill(V{}, V{1, 2}, true),
+      PrefillPlan::kFull,
+      0,
+      "dirty");
+
+  printf(
+      "\n%s (%d failure(s))\n",
+      g_failures == 0 ? "ALL PASS" : "FAILED",
+      g_failures);
+  return g_failures == 0 ? 0 : 1;
+}
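
The tests above cover only the decision. The commit message also states that worker_loop.h executes the plan while keeping the invariant "resident size == session position", and that file is not shown on this page. The toy below illustrates how a caller might keep the two in lockstep. `ToySession`, `feed`, and `execute_prefill` are hypothetical names, and the real loop drives an engine instead of a counter.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy stand-in for one named session. `position` models the engine-side
// sequence position; `resident` mirrors the token ids that produced it.
struct ToySession {
  std::vector<uint64_t> resident;
  size_t position = 0;

  void reset() {
    resident.clear();
    position = 0;
  }

  // Stand-in for a prefill or decode step: the ids and the position are
  // always advanced together, so resident.size() == position holds.
  void feed(const uint64_t* ids, size_t n) {
    resident.insert(resident.end(), ids, ids + n);
    position += n;
  }
};

// Executes a planner decision and returns how many prompt tokens were
// prefilled. A suffix request that does not line up with the resident
// context is defensively downgraded to a full reset + prefill.
inline size_t execute_prefill(
    ToySession& s,
    const std::vector<uint64_t>& prompt,
    bool suffix_only,
    size_t suffix_start) {
  const bool usable = suffix_only && suffix_start == s.resident.size() &&
      suffix_start <= prompt.size();
  if (!usable) {
    s.reset();
    suffix_start = 0;
  }
  const size_t n = prompt.size() - suffix_start;
  s.feed(prompt.data() + suffix_start, n);
  return n;
}
```

In this toy, decoded tokens would also go through `feed`, so the next turn's prompt can extend the resident ids only if it contains exactly the tokens that were generated.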
