
Commit 1b1fac1

examples/models/qwen3_5_moe: CUDA Engine/Session adapter + OpenAI serving
Implement Qwen35MoEEngine / Qwen35MoESession (the model-agnostic LLMEngine / LLMSession contract) over the exported prefill/decode methods. serving_capacity() reports a single physical session; the model is hybrid_recurrent with seek() NotSupported (no prefix reuse). main.cpp is a thin CLI over the engine/session.

OpenAI serving runs process-isolated and model execution stays in C++: serve.py is the control plane (FastAPI, chat templating, Qwen XML tool parsing, validation; no CUDA, no pybind) and spawns qwen3_5_moe_worker (qwen35_moe_worker.cpp), a C++ worker that constructs the engine and one session and speaks the same JSONL protocol as the generic text worker. Executing the AOTI CUDA model inside a live asyncio server process segfaults in the int4 matmul; isolating it in a plain worker process makes serving reliable while loading weights once. Single-slot: concurrent requests queue. Tool calls use the Qwen XML <function=...> format (QwenFunctionCallDetector).

Review order: qwen35_moe_engine.{h,cpp} (adapter) and main.cpp; then qwen35_moe_worker.cpp and serve.py (serving); then tests and docs.

ghstack-source-id: ca70937
ghstack-comment-id: 4625142707
Pull-Request: #20043
1 parent 34aee62 commit 1b1fac1
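
The control-plane/worker split described in the commit message can be illustrated with a minimal sketch. It is not the code from serve.py or qwen35_moe_worker.cpp: the worker here is a small Python stub standing in for the C++ binary, and the message fields (`id`, `prompt`, `text`, `done`) and the `JsonlWorkerClient` name are hypothetical. It only shows the general shape of one long-lived worker process, one JSON object per line over stdin/stdout, and a lock so that concurrent callers queue on the single slot.

```python
import json
import subprocess
import sys
import threading

# Stand-in worker: replies to each request with one JSON line.
# The real worker is the C++ qwen3_5_moe_worker binary; this stub and its
# message fields ("id", "prompt", "text", "done") are hypothetical.
STUB_WORKER = r"""
import json, sys
for line in sys.stdin:
    req = json.loads(line)
    reply = {"id": req["id"], "text": req["prompt"].upper(), "done": True}
    sys.stdout.write(json.dumps(reply) + "\n")
    sys.stdout.flush()
"""


class JsonlWorkerClient:
    """Owns one worker process and serializes requests to it (single slot)."""

    def __init__(self, argv):
        # The worker is started once, so any expensive setup happens once.
        self._proc = subprocess.Popen(
            argv,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
            bufsize=1,
        )
        self._lock = threading.Lock()
        self._next_id = 0

    def request(self, prompt):
        # Holding the lock for the whole round trip makes callers queue.
        with self._lock:
            self._next_id += 1
            msg = {"id": self._next_id, "prompt": prompt}
            self._proc.stdin.write(json.dumps(msg) + "\n")
            self._proc.stdin.flush()
            line = self._proc.stdout.readline()
            if not line:
                raise RuntimeError("worker exited before replying")
            return json.loads(line)

    def close(self):
        # Closing stdin ends the worker's read loop, so it exits cleanly.
        self._proc.stdin.close()
        self._proc.wait(timeout=10)


def demo():
    client = JsonlWorkerClient([sys.executable, "-c", STUB_WORKER])
    try:
        return [client.request(p) for p in ("hello", "tools")]
    finally:
        client.close()
```

In an asyncio server the blocking pipe round trip would typically be moved off the event loop (for example with `asyncio.to_thread`); the thread lock above is used only to keep the sketch short.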

11 files changed

Lines changed: 1277 additions & 302 deletions


Makefile

Lines changed: 13 additions & 1 deletion
@@ -91,7 +91,7 @@
 #
 # ==============================================================================
 
-.PHONY: voxtral-cuda voxtral-cpu voxtral-metal voxtral-mlx voxtral_realtime-cuda voxtral_realtime-cpu voxtral_realtime-metal voxtral_realtime-mlx voxtral_tts-cpu voxtral_tts-cuda whisper-cuda whisper-cuda-debug whisper-cpu whisper-metal parakeet-cuda parakeet-cuda-debug parakeet-cpu parakeet-metal parakeet-mlx parakeet-vulkan dinov2-cuda dinov2-cuda-debug sortformer-cuda sortformer-cpu silero-vad-cpu llama-cuda llama-cuda-debug llama-cpu lfm_2_5-mlx llava-cpu gemma3-cuda gemma3-cpu gemma4_31b-cuda gemma4_31b-mlx qwen3_5_moe-cuda qwen3_5_moe-metal clean help
+.PHONY: voxtral-cuda voxtral-cpu voxtral-metal voxtral-mlx voxtral_realtime-cuda voxtral_realtime-cpu voxtral_realtime-metal voxtral_realtime-mlx voxtral_tts-cpu voxtral_tts-cuda whisper-cuda whisper-cuda-debug whisper-cpu whisper-metal parakeet-cuda parakeet-cuda-debug parakeet-cpu parakeet-metal parakeet-mlx parakeet-vulkan dinov2-cuda dinov2-cuda-debug sortformer-cuda sortformer-cpu silero-vad-cpu llama-cuda llama-cuda-debug llama-cpu lfm_2_5-mlx llava-cpu gemma3-cuda gemma3-cpu gemma4_31b-cuda gemma4_31b-mlx qwen3_5_moe-cuda qwen3_5_moe-cuda-serve qwen3_5_moe-metal clean help
 
 help:
 	@echo "This Makefile adds targets to build runners for various models on various backends. Run using \`make <target>\`. Available targets:"
@@ -130,6 +130,7 @@ help:
 	@echo " gemma4_31b-cuda - Build Gemma 4 31B runner with CUDA backend"
 	@echo " gemma4_31b-mlx - Build Gemma 4 31B runner with MLX backend"
 	@echo " qwen3_5_moe-cuda - Build Qwen3.5 MoE runner with CUDA backend"
+	@echo " qwen3_5_moe-cuda-serve - Build Qwen3.5 MoE runner + OpenAI serving worker (CUDA)"
 	@echo " qwen3_5_moe-metal - Build Qwen3.5 MoE runner with Metal backend"
 	@echo " clean - Clean build artifacts"
 
@@ -455,6 +456,17 @@ gemma4_31b-mlx:
 	@echo "✓ Build complete!"
 	@echo " Binary: cmake-out/examples/models/gemma4_31b/gemma4_31b_runner"
 
+qwen3_5_moe-cuda-serve:
+	@echo "==> Building and installing ExecuTorch with CUDA..."
+	cmake --workflow --preset llm-release-cuda
+	@echo "==> Building Qwen3.5 MoE runner + serving worker with CUDA..."
+	cd examples/models/qwen3_5_moe && cmake --workflow --preset qwen3-5-moe-cuda-serve
+	@echo ""
+	@echo "✓ Build complete!"
+	@echo " Binary: cmake-out/examples/models/qwen3_5_moe/qwen3_5_moe_runner"
+	@echo " Serving worker: cmake-out/examples/models/qwen3_5_moe/qwen3_5_moe_worker"
+	@echo " Launch: see examples/models/qwen3_5_moe/README.md (Serving)"
+
 qwen3_5_moe-metal:
 	@echo "==> Building and installing ExecuTorch with Metal..."
 	cmake --workflow --preset llm-release-metal

examples/models/qwen3_5_moe/CMakeLists.txt

Lines changed: 21 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -15,6 +15,11 @@ set(EXECUTORCH_ROOT ${CMAKE_CURRENT_SOURCE_DIR}/../../..)
1515
include(${EXECUTORCH_ROOT}/tools/cmake/Utils.cmake)
1616

1717
set(_common_include_directories ${EXECUTORCH_ROOT}/..)
18+
# Vendored single-include nlohmann/json for the worker JSONL protocol (no new
19+
# dependency).
20+
set(_json_include
21+
${EXECUTORCH_ROOT}/extension/llm/tokenizers/third-party/json/single_include
22+
)
1823

1924
# gflags
2025
set(gflags_DIR ${CMAKE_CURRENT_BINARY_DIR}/../../../third-party/gflags)
@@ -60,7 +65,7 @@ endif()
6065
# Tokenizer
6166
list(APPEND link_libraries tokenizers::tokenizers)
6267

63-
add_executable(qwen3_5_moe_runner main.cpp)
68+
add_executable(qwen3_5_moe_runner main.cpp qwen35_moe_engine.cpp)
6469
target_include_directories(
6570
qwen3_5_moe_runner PUBLIC ${_common_include_directories}
6671
)
@@ -70,3 +75,18 @@ if(NOT CMAKE_BUILD_TYPE STREQUAL "Debug")
7075
target_link_options_gc_sections(qwen3_5_moe_runner)
7176
target_link_options(qwen3_5_moe_runner PRIVATE "LINKER:-s")
7277
endif()
78+
79+
# Process-isolated serving worker (qwen3_5_moe_worker): constructs
80+
# Qwen35MoEEngine directly and speaks the JSONL worker protocol that the Python
81+
# control plane drives via WorkerClient (no pybind, no Python model code). Used
82+
# by the qwen3_5_moe-cuda-serve flow.
83+
add_executable(qwen3_5_moe_worker qwen35_moe_worker.cpp qwen35_moe_engine.cpp)
84+
target_include_directories(
85+
qwen3_5_moe_worker PUBLIC ${_common_include_directories} ${_json_include}
86+
)
87+
target_link_libraries(qwen3_5_moe_worker PUBLIC ${link_libraries})
88+
89+
if(NOT CMAKE_BUILD_TYPE STREQUAL "Debug")
90+
target_link_options_gc_sections(qwen3_5_moe_worker)
91+
target_link_options(qwen3_5_moe_worker PRIVATE "LINKER:-s")
92+
endif()

examples/models/qwen3_5_moe/CMakePresets.json

Lines changed: 25 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -24,6 +24,11 @@
2424
"list": ["Linux", "Windows"]
2525
}
2626
},
27+
{
28+
"name": "qwen3-5-moe-cuda-serve",
29+
"displayName": "Qwen3.5 MoE runner + serving worker (CUDA)",
30+
"inherits": ["qwen3-5-moe-cuda"]
31+
},
2732
{
2833
"name": "qwen3-5-moe-metal",
2934
"displayName": "Qwen3.5 MoE runner (Metal)",
@@ -45,6 +50,12 @@
4550
"configurePreset": "qwen3-5-moe-cuda",
4651
"targets": ["qwen3_5_moe_runner"]
4752
},
53+
{
54+
"name": "qwen3-5-moe-cuda-serve",
55+
"displayName": "Build Qwen3.5 MoE runner + serving worker (CUDA)",
56+
"configurePreset": "qwen3-5-moe-cuda-serve",
57+
"targets": ["qwen3_5_moe_runner", "qwen3_5_moe_worker"]
58+
},
4859
{
4960
"name": "qwen3-5-moe-metal",
5061
"displayName": "Build Qwen3.5 MoE runner (Metal)",
@@ -67,6 +78,20 @@
6778
}
6879
]
6980
},
81+
{
82+
"name": "qwen3-5-moe-cuda-serve",
83+
"displayName": "Configure and build Qwen3.5 MoE runner + serving worker (CUDA)",
84+
"steps": [
85+
{
86+
"type": "configure",
87+
"name": "qwen3-5-moe-cuda-serve"
88+
},
89+
{
90+
"type": "build",
91+
"name": "qwen3-5-moe-cuda-serve"
92+
}
93+
]
94+
},
7095
{
7196
"name": "qwen3-5-moe-metal",
7297
"displayName": "Configure and build Qwen3.5 MoE runner (Metal)",

examples/models/qwen3_5_moe/README.md

Lines changed: 84 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -133,11 +133,95 @@ cmake-out/examples/models/qwen3_5_moe/qwen3_5_moe_runner \
133133
| `--data_path` | (none) | Path to `.ptd` delegate data file (required for CUDA) |
134134
| `--tokenizer_path` | (required) | Path to HuggingFace `tokenizer.json` |
135135
| `--prompt` | `"Hello"` | Input prompt text |
136+
| `--prompt_file` | (none) | Path to a file with the prompt (overrides `--prompt`) |
136137
| `--temperature` | `0.8` | Sampling temperature (0 = greedy) |
137138
| `--max_new_tokens` | `128` | Maximum tokens to generate |
139+
| `--cuda_graph` | off | Capture/replay the decode method as a CUDA graph (CUDA only). See the caveat below. |
140+
| `--warmup` | `0` | Warmup iterations to discard before timing (one model load; the session is reset between iterations) |
141+
| `--num_iters` | `1` | Timed iterations to average, after warmup |
142+
143+
## Serving (OpenAI-compatible)
144+
145+
Run an OpenAI-compatible HTTP server so an agent harness (pi, opencode, …) can
146+
use the model for local tool-use. Point your client at `http://<host>:<port>/v1`.
147+
148+
Build the runner **and** the serving worker:
149+
150+
```bash
151+
make qwen3_5_moe-cuda-serve
152+
```
153+
154+
Launch (the `LD_LIBRARY_PATH` shim is forwarded to the worker for the CUDA blob):
155+
156+
```bash
157+
LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH \
158+
python -m executorch.examples.models.qwen3_5_moe.serve \
159+
--model-path qwen35_moe_exports/model.pte \
160+
--data-path qwen35_moe_exports/aoti_cuda_blob.ptd \
161+
--tokenizer-path ~/models/Qwen3.5-35B-A3B/tokenizer.json \
162+
--hf-tokenizer ~/models/Qwen3.5-35B-A3B \
163+
--model-id qwen3.5-moe --no-think
164+
```
165+
166+
### Architecture (process isolation)
167+
168+
Two processes, one model load:
169+
170+
```
171+
serve.py (control plane: FastAPI/asyncio, OpenAI protocol, chat
172+
templating, tool parsing, validation — NO CUDA, NO pybind)
173+
│ JSONL over stdin/stdout
174+
175+
qwen3_5_moe_worker (C++ binary: one Qwen35MoEEngine + one session, synchronous
176+
loop — the CUDA model; NO asyncio server)
177+
```
178+
179+
The model runs in a **separate worker process** because executing the AOTI CUDA
180+
model inside a live asyncio server process segfaults in the int4 matmul
181+
(reproducible, and isolated by elimination to the asyncio-loop × CUDA
182+
interaction). The worker runs the model like the CLI — a plain synchronous loop —
183+
which is reliable. The control plane only does blocking pipe I/O (no CUDA), which
184+
is safe under asyncio.
185+
186+
### Serve Options
187+
188+
| Flag | Default | Description |
189+
|------|---------|-------------|
190+
| `--model-path` | (required) | Path to exported `.pte` model |
191+
| `--data-path` | (none) | Path to `.ptd` delegate data file (required for CUDA) |
192+
| `--tokenizer-path` | (required) | Path to HuggingFace `tokenizer.json` |
193+
| `--hf-tokenizer` | (required) | HF tokenizer id/dir for the chat template + encoding |
194+
| `--model-id` | `qwen3.5-moe` | Model id reported on `/v1/models` |
195+
| `--host` / `--port` | `127.0.0.1` / `8000` | Bind address |
196+
| `--max-context` | (none) | Reject prompts that exceed it with 400 |
197+
| `--no-think` | off | Default reasoning off (`enable_thinking=False`) |
198+
199+
### V1 limitations
200+
201+
- **Single-slot** (`serving_capacity=1`): one worker, one session, one model
202+
load. `--num-runners > 1` is rejected; concurrent requests queue on the worker.
203+
- **No prefix cache**: the recurrent/conv state cannot be rewound by position
204+
(`seek()` is NotSupported), so turn-to-turn KV reuse is off.
205+
- Supports the chat-completions contract of the generic server; `top_p != 1`,
206+
`seed`, `top_k`, `logprobs`, etc. are rejected (only temperature is plumbed).
138207

139208
## Troubleshooting
140209

210+
- **Runner exits silently right after `Loading methods...`**: the AOTI CUDA blob
211+
is compiled with the conda toolchain's `libstdc++`, which is newer than the
212+
system one (it needs e.g. `GLIBCXX_3.4.34`). Prepend the conda lib dir so the
213+
runner loads the matching `libstdc++`:
214+
215+
```bash
216+
LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH \
217+
cmake-out/examples/models/qwen3_5_moe/qwen3_5_moe_runner ...
218+
```
219+
- **`aoti_torch_cuda_sort_stable ... API call failed` when re-running prefill
220+
with `--cuda_graph`**: capturing the decode CUDA graph and then running another
221+
prefill in the same process currently fails (allocator interaction). Use
222+
`--cuda_graph` for single prefill+decode runs; omit it when looping with
223+
`--warmup`/`--num_iters`.
224+
141225
- **OOM during export**: The model requires significant GPU memory even
142226
with int4 quantization. Try reducing `--max-seq-len` or using a GPU
143227
with more VRAM.
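
The request-level rules in the README's "V1 limitations" and `--max-context` rows can be sketched as a small validation function. This is an illustration only, not the check in serve.py: the function name `check_request`, the field list, the error messages, and the use of status 400 for unsupported sampling fields are assumptions (the README states 400 only for prompts over `--max-context`, and says the other fields are "rejected").

```python
# Hypothetical list; the README names these fields and adds "etc.".
UNSUPPORTED_FIELDS = ("seed", "top_k", "logprobs")


def check_request(body, prompt_tokens, max_context=None):
    """Return (status, message) for a chat-completions request body.

    Assumption: every rejection maps to HTTP 400. Only temperature is
    treated as a supported sampling parameter, per the README.
    """
    # top_p is accepted only when absent or exactly 1.
    if body.get("top_p", 1) != 1:
        return 400, "top_p other than 1 is not supported"
    # Any explicitly provided unsupported field is rejected (seed=0 included).
    for field in UNSUPPORTED_FIELDS:
        if body.get(field) is not None:
            return 400, f"{field} is not supported"
    # With a context limit configured, prompts that exceed it are rejected.
    if max_context is not None and prompt_tokens > max_context:
        return 400, f"prompt has {prompt_tokens} tokens, limit is {max_context}"
    return 200, "ok"
```

For example, `check_request({"temperature": 0.8}, prompt_tokens=10)` passes, while a body containing `"top_p": 0.9` or `"seed": 0` is rejected before any work reaches the worker.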
