Commit a90def7
feat(qwen3.5): MTP speculative decoding for Qwen3.5/3.6 dense + MoE (mlx-node#65)
## Overview
Adds the **Multi-Token Prediction (MTP) speculative-decoding** stack for
Qwen3.5/3.6 dense + MoE (Metal/Apple-Silicon), plus the supporting
checkpoint-conversion, server, and benchmarking changes accumulated over
~80 commits. Draft→verify→accept is lossless (Leviathan–Chen); T=0 greedy
output is byte-identical to autoregressive (AR) decoding.
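
A minimal sketch of why greedy draft→verify→accept keeps the output identical to AR: every emitted token is one the target model would have chosen itself. `acceptGreedy`, `VerifyResult`, and the array shapes are hypothetical and not taken from this branch.

```typescript
// Hypothetical helper; names and shapes are illustrative, not the mlx-node API.
interface VerifyResult {
  accepted: number[]; // draft tokens confirmed by the target model
  next: number;       // target token emitted right after the accepted prefix
}

// draft:  k tokens proposed by the MTP head.
// target: k + 1 greedy (argmax) tokens from the verify pass, where target[i]
//         is the target model's choice given the context plus draft[0..i).
function acceptGreedy(draft: number[], target: number[]): VerifyResult {
  if (target.length !== draft.length + 1) {
    throw new Error("verify pass must yield draft.length + 1 tokens");
  }
  // Keep the longest prefix on which draft and target agree.
  let n = 0;
  while (n < draft.length && draft[n] === target[n]) n++;
  // target[n] is either the correction at the first mismatch or, on full
  // acceptance, the extra token the verify pass produced anyway.
  return { accepted: draft.slice(0, n), next: target[n]! };
}
```

On a partial accept, anything computed for the rejected draft tokens has to be rolled back, which is where the GDN state snapshot/restore listed below comes in.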
## What's in here
- **MTP draft/verify loop** (flat + paged compiled C++ forward paths),
GDN linear-attention state snapshot/restore on partial-accept, chained
cycles (Step-A elimination; M5 gen-gated ON, byte-parity verified).
- **MTP norm handling** — convert-time `+1.0` RMSNorm shift + load-time
raw-checkpoint shift, with double-shift guards.
- **Quantized MTP checkpoints** — convert retains the bf16 MTP head
(`--q-mtp off` default), splits fused MoE MTP experts to `switch_mlp.*`;
validated `qwen3.6-27b-nvfp4-mtp` (dense) and
`qwen3.6-35b-a3b-mxfp8-mtp` (MoE).
- **`extra_body.generation_mode` / `mtp_depth`** plumbed through the
Anthropic `/v1/messages` mapper.
- **Server fix (this session):** `/v1/messages` no longer returns
`400 Unsupported message role: "system"` — Claude Code SessionStart
hook messages are folded into the leading system prompt
(position-agnostic, documented contract). Adversarially reviewed; the
warm-slot cache key is intentionally left coarse (native token-prefix
verifier is the correctness authority; folding hook text into the key
would churn it and force cold prefills with no correctness gain).
- **Bench harness:** `examples/qwen35-mtp-controlled-verdict.ts` gains a
`--prompt` flag (essay/counting/code presets + raw string).
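
A rough sketch of the position-agnostic system-message folding described in the server fix above. `foldSystemMessages`, the simplified string-only `Message` type, and the blank-line separator are assumptions for illustration; the real mapper may handle content blocks and join text differently.

```typescript
// Simplified, hypothetical types; real Anthropic messages may carry content blocks.
type Role = "system" | "user" | "assistant";

interface Message {
  role: Role;
  content: string;
}

interface Folded {
  system: string | undefined;
  messages: Message[];
}

// Moves every `role: "system"` message, wherever it appears, into the leading
// system prompt and returns the remaining messages in their original order.
function foldSystemMessages(system: string | undefined, messages: Message[]): Folded {
  const parts: string[] = system ? [system] : [];
  const rest: Message[] = [];
  for (const m of messages) {
    if (m.role === "system") parts.push(m.content);
    else rest.push(m);
  }
  return {
    system: parts.length > 0 ? parts.join("\n\n") : undefined, // separator is an assumption
    messages: rest,
  };
}
```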
## Performance (MTP vs AR, self-normalized, M5 Max, T=0)
MTP speed is **prompt-gated** (acceptance ∝ prompt predictability). Dense
breaks even at lower acceptance than MoE (MoE cycle ≈ 2× an AR step).
| prompt | 27b dense (d1) | 35b MoE (d1) | notes |
|---|---|---|---|
| essay (abstract prose) | ~1.03× | 0.82× (loss) | MoE worst case |
| "summarize architecture" | 1.09× | 1.09× (CV 1.6%) | realistic agentic prose |
| counting (predictable) | 1.58× (d2 1.95×) | 1.26–1.33× | MTP best case |
Optimal depth: **dense → 2**, **MoE → 1** (MoE d2 acc ~1.7 < the ~1.9–2.0
needed to beat d1).
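
One way to read the break-even remark is through a toy cost model: speedup over AR is the mean number of tokens emitted per cycle divided by the cycle's cost in AR-step units. This model, `mtpSpeedup`, and its parameter names are assumptions for illustration, not the benchmark's methodology, and the sample inputs are not measurements.

```typescript
// Toy cost model (an assumption, not how the bench harness computes results).
// tokensPerCycle: mean tokens emitted per draft+verify cycle.
// cycleCost:      wall time of one cycle, in units of one AR decode step.
function mtpSpeedup(tokensPerCycle: number, cycleCost: number): number {
  if (!(tokensPerCycle > 0) || !(cycleCost > 0)) {
    throw new RangeError("tokensPerCycle and cycleCost must be positive");
  }
  return tokensPerCycle / cycleCost;
}

// Illustrative inputs only: a cycle costing 2 AR steps breaks even at exactly
// 2 tokens per cycle and loses below that.
const atBreakEven = mtpSpeedup(2, 2);
const belowBreakEven = mtpSpeedup(1.5, 2);
```

Under this model a cycle that costs about 2 AR steps needs more than about 2 tokens per cycle to win, while a cheaper cycle clears the bar at lower acceptance.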
## Known issues (NOT merge-ready as-is)
From the whole-branch adversarial review:
- **BLOCKING:** compiled graph caches bake the first model's weights and
are never invalidated on reload (`mlx_clear_weights` lacks a
`compile_clear_cache()`), so loading a second same-arch model in one
process silently reuses the first's weights.
- **HIGH:** a mid-cycle non-EOS stop under `reuse_cache` can leave the
cache over-advanced vs `token_history` (flat/MoE/paged).
- Reproducible 37G-MoE teardown OOM (`64 bytes failed` at child exit) —
memory brushing the ceiling / possible minor MTP-path leak.
These should be resolved before merge; opening for review of the overall
design and the shippable hot paths.
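
The blocking issue is an instance of a general stale-cache hazard, sketched below in miniature. `CompiledCache` is a hypothetical stand-in written for this illustration; it is not the MLX or mlx-node API, and the real fix would live on the native side.

```typescript
// Hypothetical stand-in for a compiled-graph cache; not the MLX API.
type CompiledFn = (x: number) => number;

class CompiledCache {
  private cache = new Map<string, CompiledFn>();

  // "Compiling" closes over the weights seen on the first call for a key,
  // mirroring a compiled graph that bakes its weights in.
  get(archKey: string, weights: Float32Array): CompiledFn {
    let fn = this.cache.get(archKey);
    if (!fn) {
      const baked = weights[0] ?? 0;
      fn = (x: number) => x * baked;
      this.cache.set(archKey, fn);
    }
    return fn;
  }

  // What a reload path must call so the next get() recompiles.
  clear(): void {
    this.cache.clear();
  }
}

const cache = new CompiledCache();
const first = cache.get("same-arch", new Float32Array([2]))(1);  // uses weight 2
const stale = cache.get("same-arch", new Float32Array([3]))(1);  // still uses weight 2
cache.clear();
const fresh = cache.get("same-arch", new Float32Array([3]))(1);  // now uses weight 3
```

Because the cache key only describes the architecture, a second model with the same key silently gets the first model's function until the cache is cleared.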
## Validation
- Server fix: 76 mapper unit tests pass; `yarn typecheck` / lint / fmt
clean.
- MTP correctness: T=0 byte-identical to AR; acceptance scales with
depth.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> **High Risk**
> Touches core Qwen3.5 inference, checkpoint conversion, and server
request mapping; the PR description also flags blocking compiled-weight
cache invalidation and mid-cycle stop/cache desync risks not fully
resolved in this slice.
>
> **Overview**
> Extends the Qwen3.5/3.6 **MTP speculative-decode** surface end-to-end:
**`ChatSession`** auto-enables **`enableMtp`** when the loaded model
reports MTP weights (explicit **`enableMtp: false`** still wins), and
HTTP mappers map **`extra_body.generation_mode`** / **`mtp_depth`** to
chat config on both Anthropic and OpenAI-style paths.
>
> **Server/API hardening:** Anthropic **`/v1/messages`** folds injected
**`{ role: 'system' }`** hook messages into the leading system prompt
instead of **400**; Responses and Messages reject invalid
**`max_output_tokens`** / **`max_tokens`** (non-positive, non-integer,
above **`i32::MAX`**) before NAPI truncation can yield silent empty
completions. **Qwen3 `generate()`** and **GRPO** reject nonpositive
token budgets up front.
>
> **Convert & checkpoints:** Qwen3.5 sanitize/convert now **retains and
normalizes `mtp.*` weights**, optional **`quant_mtp`** policies
(cyankiwi / all / **split** drafter dir), sidecar metadata, guards
against re-quantizing pre-quantized MTP, and recipe tweaks (8-bit
**`o_proj` / `out_proj` / GDN low-rank paths**) for MTP/AR
bit-exactness; **NVFP4 without a recipe** is refused. Default **paged
decode MLX cache clear cadence** moves **64 → 1024** steps.
>
> **Observability & tooling:** **`DecodeProfiler`** gains nested phases,
**`record_mtp_cycle`**, and mlx-vlm-comparable acceptance metrics on
**`PerformanceMetrics`**; a new **`quantized_qmv_microbench`** NAPI hook
supports dispatch benchmarking.
>
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
007c03a. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
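
The token-budget checks mentioned in the summary above (non-positive, non-integer, above `i32::MAX`) could look roughly like this. `validateTokenBudget` and its error messages are hypothetical; the actual endpoints may report errors differently.

```typescript
// Largest value representable as a signed 32-bit integer (Rust's i32::MAX).
const I32_MAX = 2147483647;

// Hypothetical validator: rejects budgets that would be truncated or become
// meaningless once passed across the NAPI boundary as an i32.
function validateTokenBudget(value: unknown, field: string): number {
  if (typeof value !== "number" || !Number.isInteger(value)) {
    throw new TypeError(`${field} must be an integer`);
  }
  if (value <= 0) {
    throw new RangeError(`${field} must be positive`);
  }
  if (value > I32_MAX) {
    throw new RangeError(`${field} must not exceed ${I32_MAX}`);
  }
  return value;
}
```

Rejecting early turns a silent empty completion into an explicit client error.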
---------
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
1 parent 3360e4f commit a90def7
106 files changed
Lines changed: 37973 additions & 919 deletions
File tree
- __test__
- models
- server
- trainers
- crates
- mlx-core
- src
- array
- grpo
- models
- gemma4
- lfm2
- paddleocr_vl
- qianfan_ocr
- qwen3_5_moe
- qwen3_5
- qwen3
- utils
- tests
- mlx-paged-attn
- metal/attention
- src
- metal
- tests
- mlx-sys
- src
- metal
- docs
- examples
- packages
- cli
- src/commands
- launch-claude
- core
- lm/src
- models
- server/src
- endpoints
- mappers
- trl/src/trainers