Commit 68286cb
Qwen3.5-MoE CUDA V2 foundation: one model, many isolated sessions
Proves the V2 core independently of any serving change: one physical AOTI-CUDA
Qwen model with weights loaded once can host multiple logical sessions, each with
its own KV/conv/recurrent state, with no cross-session bleed. The mechanism is
the AOTI user-managed-constant rebind, kept CUDA-backend-private; the public
surface stays LLMEngine/LLMSession. This commit adds no multi-session serving
wiring (that is a follow-up PR); the worker/runner still create a single session.
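The isolation property described above can be illustrated with a toy model. This is a minimal sketch, not the CUDA implementation: `SharedModel`, `Session`, `make_engine`, and the arithmetic in `step` are hypothetical stand-ins, and a plain dict plays the role of the per-session KV/conv/recurrent buffers.

```python
import threading


class SharedModel:
    """Toy stand-in for the one loaded model; 'weights' exist once."""

    def __init__(self, weights):
        self.weights = weights
        self.bound_state = None  # state currently bound for execution

    def step(self, token):
        # Toy decode step: reads and updates only the bound session state.
        state = self.bound_state
        state["acc"] = (state["acc"] * 31 + token + self.weights) % 1000
        return state["acc"]


class Session:
    """Logical session holding its own mutable state."""

    def __init__(self, model, lock, template):
        self._model = model
        self._lock = lock
        self._state = dict(template)  # cloned from the initial template

    def decode(self, token):
        with self._lock:  # the engine lock serializes rebind + execute
            self._model.bound_state = self._state  # rebind before execute
            return self._model.step(token)


def make_engine(weights=7):
    """Return a factory for sessions that all share one model and one lock."""
    model = SharedModel(weights)
    lock = threading.Lock()
    template = {"acc": 0}
    return lambda: Session(model, lock, template)


new_session = make_engine()
tokens = [348, 10, 4838, 1665]

# Session A running alone.
a_solo = new_session()
solo = [a_solo.decode(t) for t in tokens]

# Session A interleaved with session B on the same shared model.
a, b = new_session(), new_session()
interleaved = []
for t in tokens:
    interleaved.append(a.decode(t))
    b.decode(t + 1)  # the other session runs in between
```

If the rebind were skipped, session B's updates would land in A's state and `interleaved` would diverge from `solo`.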
Review order:
1. examples/models/qwen3_5_moe/export.py - the model-facing change: emits
get_mutable_buffer_metadata ({"version":1,"mutable_buffers":[fqn...]}) naming
the model's per-session state buffers (FQN list only; the backend owns the
tensor descriptors). The FQNs come from the model's explicit module contract.
2. backends/cuda/runtime/cuda_mutable_state.h - CUDA-private API, keyed by a
per-engine context (not a generic Module/Method/BackendInterface API, and not
process-global).
3. backends/cuda/runtime/cuda_mutable_state.cpp - context-keyed manager:
descriptor table + initial-template capture at load, per-session GPU buffers
cloned from the template, rebind via update_user_managed_constant_buffer_pairs,
and declared-vs-discovered FQN coverage validation.
4. backends/cuda/runtime/cuda_backend.cpp - two hooks: note_handle in init(),
rebind_for_execute in execute().
5. examples/models/qwen3_5_moe/qwen35_moe_engine.{h,cpp} - the engine owns one
shared Module and its context; sessions rebind their own state under the
engine lock; adds serving_capacity, capacity enforcement, the coverage check,
and context teardown. Also removes the now-incompatible cuda_graph flag (a
captured graph's baked pointers would ignore per-session rebinds), dropping it
from the engine config, the qwen3_5_moe_worker/runner (main.cpp) CLIs, and the
README/model.md docs.
6. backends/cuda/{CMakeLists.txt,runtime/TARGETS} and the qwen CMakeLists - build
wiring (CMake + Buck).
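The metadata payload from item 1 and the coverage check from item 3 can be sketched as follows. The JSON shape is taken from the description above; the helper names and the example FQNs are hypothetical, and the real check lives in the C++ runtime rather than in Python.

```python
import json


def build_mutable_buffer_metadata(fqns):
    """Serialize the FQN list in the {"version":1,"mutable_buffers":[...]} shape."""
    return json.dumps({"version": 1, "mutable_buffers": list(fqns)})


def coverage_matches(metadata_json, discovered_fqns):
    """Return True only if declared and discovered FQNs match exactly."""
    meta = json.loads(metadata_json)
    if meta.get("version") != 1:
        return False  # unknown version: treat as no coverage
    declared = set(meta.get("mutable_buffers", []))
    return declared == set(discovered_fqns)


# Hypothetical FQNs, for illustration only.
declared = ["layers.0.attn.k_cache", "layers.0.attn.v_cache", "layers.1.conv_state"]
payload = build_mutable_buffer_metadata(declared)

full_match = coverage_matches(payload, declared)
missing_one = coverage_matches(payload, declared[:-1])
extra_one = coverage_matches(payload, declared + ["layers.2.recurrent_state"])
```

Requiring set equality, rather than a subset in either direction, means an undeclared buffer or a stale declaration both count as a mismatch.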
Validated on the real 35B-A3B model (HQQ-INT4) with two interleaved sessions:
after engine load: 17983 MB | capacity max_sessions=4, est 117309440 B/session
A solo : 348 10 4838 1665 15 16 17 18 19 20 21 22
A inter: 348 10 4838 1665 15 16 17 18 19 20 21 22 (identical -> no bleed)
GPU: engine=17983MB, +2 sessions=+108MB (weights once; ~112 MiB state/session)
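As a quick check of the figures above, the per-session estimate can be converted to MiB and the two token sequences compared directly. This is only arithmetic on the numbers printed in the log; the variable names are illustrative.

```python
MIB = 1024 * 1024

# Estimated per-session state size reported at engine load.
est_bytes_per_session = 117309440
est_mib = est_bytes_per_session / MIB

# Token IDs copied from the "A solo" and "A inter" lines.
solo = [348, 10, 4838, 1665, 15, 16, 17, 18, 19, 20, 21, 22]
inter = [348, 10, 4838, 1665, 15, 16, 17, 18, 19, 20, 21, 22]
no_bleed = solo == inter
```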
Fails closed: capacity falls back to a single session if the AOTI
constant-management symbols are absent or the declared mutable FQNs do not fully
match the loaded methods.
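The fallback rule can be sketched as a small decision function. The function name and its parameters are hypothetical; the engine's actual serving_capacity logic may weigh other inputs, such as available GPU memory.

```python
def effective_capacity(max_sessions, symbols_present, coverage_ok):
    """Return the number of sessions the engine may host.

    Any failed precondition collapses capacity to a single session
    instead of raising or guessing.
    """
    if not symbols_present or not coverage_ok:
        return 1
    return max(1, max_sessions)


healthy = effective_capacity(4, symbols_present=True, coverage_ok=True)
no_symbols = effective_capacity(4, symbols_present=False, coverage_ok=True)
bad_coverage = effective_capacity(4, symbols_present=True, coverage_ok=False)
```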
MLX V2 (shared constants + per-session MutableBufferData) is the next backend and
is not addressed here.
ghstack-source-id: 6bd59c4
ghstack-comment-id: 4652591759
Pull-Request: #201171
parent 8fcec77, commit 68286cb
13 files changed
Lines changed: 976 additions & 68 deletions