Commit 68286cb

Qwen3.5-MoE CUDA V2 foundation: one model, many isolated sessions
Proves the V2 core independently of any serving change: one physical AOTI-CUDA Qwen model with weights loaded once can host multiple logical sessions, each with its own KV/conv/recurrent state, with no cross-session bleed. The mechanism is the AOTI user-managed-constant rebind, kept CUDA-backend-private; the public surface stays LLMEngine/LLMSession. No multi-session serving wiring here (that is a follow-up PR); the worker/runner still create a single session.

Review order:

1. examples/models/qwen3_5_moe/export.py - the model-facing change: emits get_mutable_buffer_metadata ({"version":1,"mutable_buffers":[fqn...]}) naming the model's per-session state buffers (FQN list only; the backend owns the tensor descriptors). The FQNs come from the model's explicit module contract.
2. backends/cuda/runtime/cuda_mutable_state.h - CUDA-private API, keyed by a per-engine context (not a generic Module/Method/BackendInterface API, and not process-global).
3. backends/cuda/runtime/cuda_mutable_state.cpp - context-keyed manager: descriptor table + initial-template capture at load, per-session GPU buffers cloned from the template, rebind via update_user_managed_constant_buffer_pairs, and declared-vs-discovered FQN coverage validation.
4. backends/cuda/runtime/cuda_backend.cpp - two hooks: note_handle in init(), rebind_for_execute in execute().
5. examples/models/qwen3_5_moe/qwen35_moe_engine.{h,cpp} - the engine owns one shared Module and its context; sessions rebind their own state under the engine lock; adds serving_capacity, capacity enforcement, the coverage check, and context teardown. Also removes the now-incompatible cuda_graph flag (a captured graph's baked pointers would ignore per-session rebinds), dropping it from the engine config, the qwen3_5_moe_worker/runner (main.cpp) CLIs, and the README/model.md docs.
6. backends/cuda/{CMakeLists.txt,runtime/TARGETS} and the qwen CMakeLists - build wiring (CMake + Buck).
Validated on the real 35B-A3B model (HQQ-INT4) with two interleaved sessions:

  after engine load: 17983 MB | capacity max_sessions=4, est 117309440 B/session
  A solo : 348 10 4838 1665 15 16 17 18 19 20 21 22
  A inter: 348 10 4838 1665 15 16 17 18 19 20 21 22   (identical -> no bleed)
  GPU: engine=17983MB, +2 sessions=+108MB (weights once; ~112 MiB state/session)

Falls closed to single-session capacity if the AOTI constant-management symbols are absent or the declared mutable FQNs do not fully match the loaded methods.

MLX V2 (shared constants + per-session MutableBufferData) is the next backend and is not addressed here.

ghstack-source-id: 6bd59c4
ghstack-comment-id: 4652591759
Pull-Request: #20117
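The declared-vs-discovered coverage check and the single-session fallback described in the commit message can be illustrated with a minimal sketch. This is not the code in cuda_mutable_state.cpp: `CoverageResult`, `check_coverage`, and `serving_capacity` as written here are hypothetical names, and treating "fully match" as exact set equality in both directions is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical result type for illustration; not the backend's real API.
struct CoverageResult {
  bool ok;
  std::vector<std::string> missing;     // declared by the model, not found at load
  std::vector<std::string> undeclared;  // found at load, not declared by the model
};

// Compare the FQNs the model declared (e.g. via its mutable-buffer metadata)
// with the FQNs discovered in the loaded methods.
// Assumption: a full match means the two sets are equal.
inline CoverageResult check_coverage(
    const std::vector<std::string>& declared,
    const std::vector<std::string>& discovered) {
  const std::set<std::string> d(declared.begin(), declared.end());
  const std::set<std::string> f(discovered.begin(), discovered.end());
  CoverageResult r{true, {}, {}};
  for (const auto& fqn : d) {
    if (f.count(fqn) == 0) r.missing.push_back(fqn);
  }
  for (const auto& fqn : f) {
    if (d.count(fqn) == 0) r.undeclared.push_back(fqn);
  }
  r.ok = r.missing.empty() && r.undeclared.empty();
  return r;
}

// Multi-session capacity is granted only when both preconditions hold;
// otherwise capacity collapses to a single session.
inline std::size_t serving_capacity(
    bool constant_symbols_present,
    const CoverageResult& coverage,
    std::size_t max_sessions) {
  return (constant_symbols_present && coverage.ok) ? max_sessions : 1;
}
```

With `max_sessions=4` as in the validation run, a mismatch in either direction or a missing symbol would yield a capacity of 1 in this sketch, rather than a partially rebound model.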
1 parent 8fcec77 commit 68286cb

13 files changed

Lines changed: 976 additions & 68 deletions

backends/cuda/CMakeLists.txt

Lines changed: 3 additions & 1 deletion
@@ -179,7 +179,9 @@ install(
 )
 
 # CUDA backend implementation
-set(_aoti_cuda_backend_sources runtime/cuda_backend.cpp)
+set(_aoti_cuda_backend_sources runtime/cuda_backend.cpp
+    runtime/cuda_mutable_state.cpp
+)
 if(_cuda_is_msvc_toolchain)
   # MSVC links aoti_cuda_backend into portable_lib without relying on C++
   # symbols exported from aoti_cuda_shims.dll.

backends/cuda/runtime/TARGETS

Lines changed: 2 additions & 0 deletions
@@ -105,9 +105,11 @@ runtime.cxx_library(
     name = "cuda_backend",
     srcs = [
         "cuda_backend.cpp",
+        "cuda_mutable_state.cpp",
     ],
     headers = [
         "cuda_delegate_handle.h",
+        "cuda_mutable_state.h",
     ],
     # @lint-ignore BUCKLINT: Avoid `link_whole=True` (https://fburl.com/avoid-link-whole)
     link_whole = True,

backends/cuda/runtime/cuda_backend.cpp

Lines changed: 11 additions & 0 deletions
@@ -42,6 +42,7 @@
 #include <executorch/backends/aoti/utils.h>
 #include <executorch/backends/cuda/runtime/cuda_allocator.h>
 #include <executorch/backends/cuda/runtime/cuda_delegate_handle.h>
+#include <executorch/backends/cuda/runtime/cuda_mutable_state.h>
 #include <executorch/backends/cuda/runtime/platform/platform.h>
 #include <executorch/backends/cuda/runtime/shims/memory.h>
 #include <executorch/backends/cuda/runtime/utils.h>
@@ -466,6 +467,10 @@ class ET_EXPERIMENTAL CudaBackend final
           kCudaGraphWarmupSteps);
     }
 
+    // Record whether this AOTI build exposes the constant-management symbols
+    // needed for per-session mutable-buffer rebinding (CUDA V2 multi-session).
+    mutable_state_note_handle(handle);
+
     return (DelegateHandle*)handle; // Return the handle post-processing
   }
 
@@ -514,6 +519,12 @@ class ET_EXPERIMENTAL CudaBackend final
           static_cast<int>(device_type));
     }
 
+    // CUDA V2 multi-session: if a logical session is active on this thread,
+    // rebind this container's mutable constants (KV/conv/recurrent) to the
+    // session's own GPU buffers before running. No-op for
+    // single-session/legacy.
+    ET_CHECK_OK_OR_RETURN_ERROR(mutable_state_rebind_for_execute(handle));
+
     // ---------------------------------------------------------------
     // CUDA graph REPLAY path — skip all tensor setup and just replay
     // ---------------------------------------------------------------
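
The execute() hook above rebinds only when a logical session is active on the calling thread and is otherwise a no-op. A minimal, CPU-only sketch of that pattern follows. Every name here (`SessionState`, `Container`, `ScopedSession`, `active_session`, `rebind_for_execute`, `execute_as`) is a hypothetical stand-in, not the real cuda_mutable_state API, and the actual rebind goes through the AOTI constant-buffer update rather than an integer field.

```cpp
#include <cassert>

// Hypothetical stand-in for a session's private KV/conv/recurrent buffers.
struct SessionState {
  int id;
};

// Hypothetical stand-in for one loaded AOTI container (delegate handle).
struct Container {
  int bound_session = -1;  // -1: still bound to the initial (legacy) buffers
  int rebind_count = 0;
};

// The session active on the calling thread, or nullptr for legacy callers.
inline SessionState*& active_session() {
  thread_local SessionState* current = nullptr;
  return current;
}

// RAII guard: marks a session active for the current scope and restores the
// previous value on exit, so nested or interleaved use stays balanced.
class ScopedSession {
 public:
  explicit ScopedSession(SessionState* s) : prev_(active_session()) {
    active_session() = s;
  }
  ~ScopedSession() { active_session() = prev_; }
  ScopedSession(const ScopedSession&) = delete;
  ScopedSession& operator=(const ScopedSession&) = delete;

 private:
  SessionState* prev_;
};

// Called at the top of execute(): point the container at the active session's
// buffers, or do nothing when no session is active. Returns true on success.
inline bool rebind_for_execute(Container& c) {
  SessionState* s = active_session();
  if (s == nullptr) {
    return true;  // single-session/legacy path: leave bindings untouched
  }
  c.bound_session = s->id;
  ++c.rebind_count;
  return true;
}

// Convenience wrapper: run one "execute" as the given session and report
// which session the container ended up bound to.
inline int execute_as(Container& c, SessionState* s) {
  ScopedSession guard(s);
  rebind_for_execute(c);
  return c.bound_session;
}
```

Because every execute rebinds before running, two sessions can interleave on one shared container without seeing each other's state, provided the caller serializes executes (the commit does this under the engine lock).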
