Skip to content

Commit e4f4ab2

Browse files
committed
extension/llm/server: isolated multi-session serving (V2a)
A worker now hosts one LLMEngine (weights loaded once) and serves multiple isolated sessions keyed by session_id, each with its own KV/recurrent state, up to the engine's serving capacity -- so one ~18GB model load backs many independent conversations instead of one. Execution stays synchronous (one in-flight request; the control plane serializes): this is isolation, not concurrent streaming. The shared worker loop (worker_loop.h) owns the sessions. Named sessions are created on first use (or an `open` op) and capped at capacity-1; one slot is reserved for a scratch session that serves anonymous, session-less requests. Over-capacity or single-session backends return structured errors (capacity_exhausted / unsupported_session); there is no eviction/TTL in the MVP, so capacity_exhausted stands when named sessions exceed worker capacity. Both workers (text_llm_worker, qwen3_5_moe_worker) now pass their LLMEngine to the loop. The control plane (over the SessionRuntime boundary introduced earlier) routes a request to a session via the session_id body field or, as header aliases, X-ExecuTorch-Session-ID / session_id / x-session-affinity (body wins, then that header order). The aliases let a client that already emits a stable per-conversation affinity id (e.g. pi's sendSessionAffinityHeaders) route with no client-specific server code. The session is admitted up front (so a capacity refusal is HTTP 429/400 before any stream bytes), and DELETE /v1/sessions/{id} frees one. Review order: worker_loop.h (session ownership + protocol), then the two workers; then the control plane (protocol.py, errors.py, serving_chat.py, server.py, serve.py); then tests. Part of #20001 ghstack-source-id: fca5361 ghstack-comment-id: 4661783584 Pull-Request: #20159
1 parent 4d68d06 commit e4f4ab2

12 files changed

Lines changed: 541 additions & 71 deletions

File tree

examples/models/qwen3_5_moe/README.md

Lines changed: 36 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -173,8 +173,8 @@ serve.py (control plane: FastAPI/asyncio, OpenAI protocol, chat
173173
templating, tool parsing, validation — NO CUDA, NO pybind)
174174
│ JSONL over stdin/stdout
175175
176-
qwen3_5_moe_worker (C++ binary: one Qwen35MoEEngine + one session, synchronous
177-
loop — the CUDA model; NO asyncio server)
176+
qwen3_5_moe_worker (C++ binary: one Qwen35MoEEngine, many isolated sessions,
177+
synchronous loop — the CUDA model; NO asyncio server)
178178
```
179179

180180
The model runs in a **separate worker process** because executing the AOTI CUDA
@@ -196,13 +196,42 @@ is safe under asyncio.
196196
| `--host` / `--port` | `127.0.0.1` / `8000` | Bind address |
197197
| `--max-context` | (none) | Reject prompts that exceed it with 400 |
198198
| `--no-think` | off | Default reasoning off (`enable_thinking=False`) |
199+
| `--max-sessions` | `1` | Isolated sessions on one weight load (see Sessions) |
199200

200-
### V1 limitations
201+
### Sessions
202+
203+
One worker loads the weights once (~18 GB) and hosts multiple **isolated**
204+
sessions on that single allocation — each with its own KV/recurrent state, via
205+
CUDA per-session mutable rebinding. Set `--max-sessions N` (clamped to 1 if the
206+
backend cannot rebind); one slot is reserved for anonymous requests, so up to
207+
`N - 1` named `session_id`s are addressable.
208+
209+
Route a request to a persistent session with the `session_id` body field or, as
210+
aliases, the `X-ExecuTorch-Session-ID` / `session_id` / `x-session-affinity`
211+
headers (body wins, then that header order). The header aliases let a client that
212+
already emits a stable per-conversation affinity id (e.g. pi's
213+
`sendSessionAffinityHeaders`) route with no extra config. Requests without any
214+
share a transient scratch session. Free a session with `DELETE /v1/sessions/{id}`.
215+
216+
```bash
217+
curl http://127.0.0.1:8000/v1/chat/completions \
218+
-H 'Content-Type: application/json' \
219+
-d '{"model":"qwen3.5-moe","session_id":"alice",
220+
"messages":[{"role":"user","content":"hi"}]}'
221+
```
222+
223+
Admission is up front: an explicit `session_id` on a single-session server
224+
returns **400** (`unsupported_session`); past capacity it returns **429**
225+
(`capacity_exhausted`) before any response bytes.
226+
227+
This is **isolation, not concurrency or warm resume**: execution is still
228+
synchronous (one in-flight request; `--num-runners > 1` is rejected since more
229+
workers would duplicate the weights), and each request resets its session — the
230+
recurrent/conv state cannot be rewound by position (`seek()` is NotSupported), so
231+
turn-to-turn KV reuse (append-only warm resume) is a follow-up.
232+
233+
### Other limitations
201234

202-
- **Single-slot** (`serving_capacity=1`): one worker, one session, one model
203-
load. `--num-runners > 1` is rejected; concurrent requests queue on the worker.
204-
- **No prefix cache**: the recurrent/conv state cannot be rewound by position
205-
(`seek()` is NotSupported), so turn-to-turn KV reuse is off.
206235
- Supports the chat-completions contract of the generic server; `top_p != 1`,
207236
`seed`, `top_k`, `logprobs`, etc. are rejected (only temperature is plumbed).
208237

examples/models/qwen3_5_moe/qwen35_moe_worker.cpp

Lines changed: 16 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -18,23 +18,29 @@
1818
// process segfaults in the int4 matmul (validated). Here the model runs in a
1919
// plain synchronous loop in its own process, which is reliable.
2020
//
21-
// Single-slot serving: this worker creates one session and the control plane
22-
// queues concurrent requests on it. (The engine itself can host multiple
23-
// sessions on the one ~18GB weight allocation; exposing that over the worker
24-
// protocol is a follow-up.)
21+
// Multi-session (isolation): the engine loads weights once and hosts multiple
22+
// isolated sessions on that one ~18GB allocation; the shared worker loop
23+
// (worker_loop.h) routes requests to per-session_id state, up to
24+
// --max_sessions. Execution is still synchronous (one in-flight request); warm
25+
// context reuse across requests is a follow-up.
2526

2627
#include <gflags/gflags.h>
2728

2829
#include <executorch/examples/models/qwen3_5_moe/qwen35_moe_engine.h>
29-
#include <executorch/extension/llm/runner/llm_session.h>
3030
#include <executorch/extension/llm/server/cpp/worker_loop.h>
3131
#include <executorch/runtime/platform/log.h>
3232

33-
#include <utility>
33+
#include <cstdint>
3434

3535
DEFINE_string(model_path, "", "Model .pte file path.");
3636
DEFINE_string(tokenizer_path, "", "HuggingFace tokenizer.json path.");
3737
DEFINE_string(data_path, "", "Data file (.ptd) for the CUDA backend.");
38+
DEFINE_int32(
39+
max_sessions,
40+
1,
41+
"Max physical sessions to host on the one weight allocation (CUDA "
42+
"per-session mutable rebinding). Clamped to 1 if the backend cannot "
43+
"rebind.");
3844

3945
namespace {
4046
namespace llm = ::executorch::extension::llm;
@@ -54,6 +60,7 @@ int main(int argc, char** argv) {
5460
config.model_path = FLAGS_model_path;
5561
config.data_path = FLAGS_data_path;
5662
config.tokenizer_path = FLAGS_tokenizer_path;
63+
config.max_sessions = FLAGS_max_sessions;
5764

5865
auto engine_result = llm::Qwen35MoEEngine::create(config);
5966
if (engine_result.error() != Error::Ok) {
@@ -62,16 +69,9 @@ int main(int argc, char** argv) {
6269
}
6370
auto engine = std::move(engine_result.get());
6471

65-
auto session_result = engine->create_session();
66-
if (session_result.error() != Error::Ok) {
67-
ET_LOG(Error, "qwen35_moe_worker: failed to create session");
68-
return 1;
69-
}
70-
auto session = std::move(session_result.get());
71-
72-
// The engine's tokenizer encodes the rendered prompt to ids; the session
73-
// decodes ids back to text internally.
72+
// The engine's tokenizer encodes the rendered prompt to ids; sessions decode
73+
// ids back to text internally. The shared loop owns per-session_id state.
7474
::tokenizers::Tokenizer* tokenizer = engine->tokenizer();
7575

76-
return llm::run_worker_stdio_loop(*session, *tokenizer, engine->metadata());
76+
return llm::run_worker_stdio_loop(*engine, *tokenizer, engine->metadata());
7777
}

examples/models/qwen3_5_moe/serve.py

Lines changed: 23 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -17,9 +17,14 @@
1717
CUDA execution while a live asyncio loop is resident). Isolating CUDA in a plain
1818
(no-asyncio) C++ worker process is the reliable shape, and it loads weights once.
1919
20-
V1 constraints:
21-
* single-slot: one worker, one session; concurrent HTTP requests queue.
22-
* prefix cache off (Qwen seek() is NotSupported).
20+
Sessions and constraints:
21+
* One worker hosts many isolated sessions on a single ~18GB weight load (CUDA
22+
per-session mutable rebinding); requests route by session_id (anonymous
23+
requests share a scratch session). See --max-sessions.
24+
* Execution is synchronous: one in-flight request at a time, concurrent HTTP
25+
requests queue. Sessions provide isolation, not concurrent throughput.
26+
* No warm context reuse yet: each request resets its session (Qwen seek() is
27+
NotSupported; append-only reuse is a follow-up).
2328
* The control plane only does blocking pipe I/O on its executor thread (no
2429
CUDA), which is safe under asyncio.
2530
@@ -77,6 +82,7 @@ def _spawn(args):
7782
]
7883
if args.data_path:
7984
cmd += ["--data_path", args.data_path]
85+
cmd += ["--max_sessions", str(args.max_sessions)]
8086
logger.info("Starting Qwen worker subprocess (loads the model once)...")
8187
return spawn_worker(cmd, env=env)
8288

@@ -88,7 +94,7 @@ def build_app_from_args(args):
8894
args.hf_tokenizer, default_template_kwargs=default_template_kwargs
8995
)
9096

91-
worker = _spawn(args) # one worker == one session (single-slot V1)
97+
worker = _spawn(args) # one worker, weights once, many isolated sessions
9298
runtime = SessionRuntime(worker)
9399
serving = ServingChat(
94100
runtime,
@@ -144,7 +150,17 @@ def main() -> None:
144150
"--num-runners",
145151
type=int,
146152
default=1,
147-
help="V1 supports 1 only (single-slot).",
153+
help="Workers (processes). 1 only: a worker hosts many isolated sessions "
154+
"on one weight load; more workers would duplicate the ~18GB weights.",
155+
)
156+
p.add_argument(
157+
"--max-sessions",
158+
type=int,
159+
default=1,
160+
help="Isolated sessions the one worker hosts on a single weight load "
161+
"(CUDA per-session mutable rebinding); clamped to 1 if the backend "
162+
"cannot rebind. One slot is reserved for anonymous requests, so the "
163+
"number of addressable session_ids is max-sessions - 1.",
148164
)
149165
p.add_argument(
150166
"--worker-bin",
@@ -157,8 +173,8 @@ def main() -> None:
157173

158174
if args.num_runners != 1:
159175
p.error(
160-
"Qwen3.5 MoE V1 is single-slot: one worker serves one session; "
161-
"concurrent requests queue."
176+
"Only 1 worker process is supported (it hosts many isolated sessions "
177+
"on one ~18GB weight load); more workers would duplicate the weights."
162178
)
163179

164180
app, _ = build_app_from_args(args)

examples/models/qwen3_5_moe/test_serve.py

Lines changed: 9 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -8,7 +8,7 @@
88
99
Hermetic: no model, GPU, or worker subprocess. Covers layering (Qwen stays an
1010
example; the control plane runs no CUDA and imports no model pybind), the worker
11-
spawn command, and the single-slot CLI guard. The generic JSONL protocol is
11+
spawn command, and the single-worker CLI guard. The generic JSONL protocol is
1212
covered by extension/llm/server/python/tests/test_worker_client.py; the live
1313
HTTP smoke test is documented in README.md and run on a CUDA box.
1414
"""
@@ -75,6 +75,7 @@ def fake_spawn(cmd, env=None):
7575
model_path="m.pte",
7676
tokenizer_path="t.json",
7777
data_path="d.ptd",
78+
max_sessions=4,
7879
)
7980
)
8081
assert captured["cmd"] == [
@@ -85,6 +86,8 @@ def fake_spawn(cmd, env=None):
8586
"t.json",
8687
"--data_path",
8788
"d.ptd",
89+
"--max_sessions",
90+
"4",
8891
]
8992

9093

@@ -95,7 +98,11 @@ def test_spawn_defaults_worker_bin_and_omits_empty_data_path(monkeypatch):
9598
)
9699
serve._spawn(
97100
SimpleNamespace(
98-
worker_bin=None, model_path="m.pte", tokenizer_path="t.json", data_path=None
101+
worker_bin=None,
102+
model_path="m.pte",
103+
tokenizer_path="t.json",
104+
data_path=None,
105+
max_sessions=4,
99106
)
100107
)
101108
cmd = captured["cmd"]

extension/llm/server/cpp/text_llm_worker.cpp

Lines changed: 5 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -12,25 +12,24 @@
1212
// the stable serving abstraction) — no Python model code, no pybind, no
1313
// in-process Python serving. The OpenAI control plane (Python) spawns this
1414
// process and drives it over JSONL on stdin/stdout (see worker_client.py). The
15-
// JSONL protocol and the decode loop are shared across all workers in
16-
// worker_loop.h; this file only constructs the engine/session/tokenizer.
15+
// JSONL protocol, session management, and the decode loop are shared across all
16+
// workers in worker_loop.h; this file only constructs the engine/tokenizer.
17+
// TextLLMEngine hosts a single session, so the worker serves anonymous requests
18+
// via the shared loop's scratch session and reports no named sessions.
1719

1820
#include <gflags/gflags.h>
1921

2022
#include <executorch/extension/llm/runner/llm_runner_helper.h>
21-
#include <executorch/extension/llm/runner/llm_session.h>
2223
#include <executorch/extension/llm/server/cpp/worker_loop.h>
2324
#include <executorch/runtime/platform/log.h>
2425

2526
#include <optional>
26-
#include <utility>
2727

2828
DEFINE_string(model_path, "", "Self-contained model .pte file path.");
2929
DEFINE_string(tokenizer_path, "", "HuggingFace tokenizer.json path.");
3030

3131
namespace {
3232
namespace llm = ::executorch::extension::llm;
33-
using ::executorch::runtime::Error;
3433
} // namespace
3534

3635
int main(int argc, char** argv) {
@@ -50,12 +49,6 @@ int main(int argc, char** argv) {
5049
ET_LOG(Error, "text_llm_worker: failed to create engine");
5150
return 1;
5251
}
53-
auto session_result = engine->create_session();
54-
if (session_result.error() != Error::Ok) {
55-
ET_LOG(Error, "text_llm_worker: failed to create session");
56-
return 1;
57-
}
58-
auto session = std::move(session_result.get());
5952

6053
// The session decodes token ids to text internally; this tokenizer encodes
6154
// the rendered prompt to ids. Same tokenizer.json -> same vocabulary.
@@ -65,5 +58,5 @@ int main(int argc, char** argv) {
6558
return 1;
6659
}
6760

68-
return llm::run_worker_stdio_loop(*session, *tokenizer, engine->metadata());
61+
return llm::run_worker_stdio_loop(*engine, *tokenizer, engine->metadata());
6962
}

0 commit comments

Comments
 (0)