Commit a90def7

Brooooooklyn and claude authored
feat(qwen3.5): MTP speculative decoding for Qwen3.5/3.6 dense + MoE (mlx-node#65)
## Overview

Adds the **Multi-Token Prediction (MTP) speculative-decoding** stack for Qwen3.5/3.6 dense + MoE (Metal/Apple-Silicon), plus the supporting checkpoint-conversion, server, and benchmarking changes accumulated over ~80 commits. Draft→verify→accept is lossless (Leviathan–Chen); T=0 greedy output is byte-identical to autoregressive.

## What's in here

- **MTP draft/verify loop** (flat + paged compiled C++ forward paths), GDN linear-attention state snapshot/restore on partial-accept, chained cycles (Step-A elimination; M5 gen-gated ON, byte-parity verified).
- **MTP norm handling:** convert-time `+1.0` RMSNorm shift + load-time raw-checkpoint shift, with double-shift guards.
- **Quantized MTP checkpoints:** convert retains the bf16 MTP head (`--q-mtp off` default), splits fused MoE MTP experts to `switch_mlp.*`; validated `qwen3.6-27b-nvfp4-mtp` (dense) and `qwen3.6-35b-a3b-mxfp8-mtp` (MoE).
- **`extra_body.generation_mode` / `mtp_depth`** plumbed through the Anthropic `/v1/messages` mapper.
- **Server fix (this session):** `/v1/messages` no longer returns `400 Unsupported message role: "system"`. Claude Code SessionStart hook messages are folded into the leading system prompt (position-agnostic, documented contract). Adversarially reviewed; the warm-slot cache key is intentionally left coarse (native token-prefix verifier is the correctness authority; folding hook text into the key would churn it and force cold prefills with no correctness gain).
- **Bench harness:** `examples/qwen35-mtp-controlled-verdict.ts` gains a `--prompt` flag (essay/counting/code presets + raw string).

## Performance (MTP vs AR, self-normalized, M5 Max, T=0)

MTP speed is **prompt-gated** (acceptance ∝ prompt predictability). Dense breaks even at lower acceptance than MoE (MoE cycle ≈ 2× an AR step).

| prompt | 27b dense (d1) | 35b MoE (d1) | notes |
|---|---|---|---|
| essay (abstract prose) | ~1.03× | 0.82× (loss) | MoE worst case |
| "summarize architecture" | 1.09× | 1.09× (CV 1.6%) | realistic agentic prose |
| counting (predictable) | 1.58× (d2 1.95×) | 1.26–1.33× | MTP best case |

Optimal depth: **dense → 2**, **MoE → 1** (MoE d2 acc ~1.7 < the ~1.9–2.0 needed to beat d1).

## Known issues (NOT merge-ready as-is)

From the whole-branch adversarial review:

- **BLOCKING:** compiled graph caches bake the first model's weights and are never invalidated on reload (`mlx_clear_weights` lacks a `compile_clear_cache()`), so loading a second same-arch model in one process silently reuses the first's weights.
- **HIGH:** a mid-cycle non-EOS stop under `reuse_cache` can leave the cache over-advanced vs `token_history` (flat/MoE/paged).
- Reproducible 37G-MoE teardown OOM (`64 bytes failed` at child exit): memory brushing the ceiling / possible minor MTP-path leak.

These should be resolved before merge; opening for review of the overall design and the shippable hot paths.

## Validation

- Server fix: 76 mapper unit tests pass; `yarn typecheck` / lint / fmt clean.
- MTP correctness: T=0 byte-identical to AR; acceptance scales with depth.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- CURSOR_SUMMARY -->

---

> [!NOTE]
> **High Risk**
> Touches core Qwen3.5 inference, checkpoint conversion, and server request mapping; the PR description also flags blocking compiled-weight cache invalidation and mid-cycle stop/cache desync risks not fully resolved in this slice.
>
> **Overview**
> Extends the Qwen3.5/3.6 **MTP speculative-decode** surface end-to-end: **`ChatSession`** auto-enables **`enableMtp`** when the loaded model reports MTP weights (explicit **`enableMtp: false`** still wins), and HTTP mappers map **`extra_body.generation_mode`** / **`mtp_depth`** to chat config on both Anthropic and OpenAI-style paths.
>
> **Server/API hardening:** Anthropic **`/v1/messages`** folds injected **`{ role: 'system' }`** hook messages into the leading system prompt instead of **400**; Responses and Messages reject invalid **`max_output_tokens`** / **`max_tokens`** (non-positive, non-integer, above **`i32::MAX`**) before NAPI truncation can yield silent empty completions. **Qwen3 `generate()`** and **GRPO** reject nonpositive token budgets up front.
>
> **Convert & checkpoints:** Qwen3.5 sanitize/convert now **retains and normalizes `mtp.*` weights**, optional **`quant_mtp`** policies (cyankiwi / all / **split** drafter dir), sidecar metadata, guards against re-quantizing pre-quantized MTP, and recipe tweaks (8-bit **`o_proj` / `out_proj` / GDN low-rank paths**) for MTP/AR bit-exactness; **NVFP4 without a recipe** is refused. Default **paged decode MLX cache clear cadence** moves **64 → 1024** steps.
>
> **Observability & tooling:** **`DecodeProfiler`** gains nested phases, **`record_mtp_cycle`**, and mlx-vlm-comparable acceptance metrics on **`PerformanceMetrics`**; a new **`quantized_qmv_microbench`** NAPI hook supports dispatch benchmarking.
>
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit 007c03a. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).</sup>

<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
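The lossless draft→verify→accept claim in the commit message can be illustrated with a toy greedy (T=0) accept rule. This is only a sketch under stated assumptions, not the compiled C++ path from this commit: `acceptGreedy` and `estimatedSpeedup` are hypothetical names, tokens are plain numbers, and the speedup formula is a first-order model that ignores draft-head cost and state snapshot/restore overhead.

```typescript
// Hypothetical toy model of one greedy (T=0) draft -> verify -> accept cycle.
// `draft` holds the tokens proposed by the draft head. `verified[i]` is the
// target model's argmax at the position where draft[i] would be placed, and
// verified[draft.length] is the argmax that follows the full draft.
function acceptGreedy(draft: number[], verified: number[]): number[] {
  if (verified.length !== draft.length + 1) {
    throw new Error('verified must have draft.length + 1 entries');
  }
  const out: number[] = [];
  for (let i = 0; i < draft.length; i++) {
    // Always emit the target model's own token at this position...
    out.push(verified[i]);
    // ...and stop at the first disagreement: later verified entries were
    // conditioned on a wrong prefix and must be discarded.
    if (draft[i] !== verified[i]) return out;
  }
  // Whole draft accepted: the verify pass also yields one extra token.
  out.push(verified[draft.length]);
  return out;
}

// Assumed first-order model: tokens emitted per cycle divided by the cost of
// a cycle measured in plain autoregressive (AR) steps.
function estimatedSpeedup(tokensPerCycle: number, cycleCostInArSteps: number): number {
  return tokensPerCycle / cycleCostInArSteps;
}
```

Every emitted token is the target model's argmax, which is why greedy output can match plain AR decoding, and each cycle emits between 1 and `draft.length + 1` tokens. Under the assumed cost model, a cycle that costs about two AR steps needs roughly two tokens per cycle to break even, which is consistent with the note that MoE breaks even at higher acceptance than dense.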
1 parent 3360e4f commit a90def7

106 files changed

Lines changed: 37973 additions & 919 deletions

__test__/models/chat-session.test.ts

Lines changed: 55 additions & 0 deletions
@@ -242,6 +242,61 @@ describe('ChatSession', () => {
     // restart which rebuilds chatSessionStart from history.
     expect(session.turns).toBe(2);
   });
+
+  // -----------------------------------------------------------------
+  // W7 (MTP): `enableMtp` auto-default
+  // -----------------------------------------------------------------
+
+  it('auto-defaults enableMtp=true when the model exposes hasMtpWeights()==true', async () => {
+    const { model, chatSessionStart } = makeMockModel();
+    (model as SessionCapableModel).hasMtpWeights = () => true;
+    const session = new ChatSession(model);
+
+    await session.send('Hello');
+
+    const [, config] = chatSessionStart.mock.calls[0];
+    expect(config?.enableMtp).toBe(true);
+  });
+
+  it('does not set enableMtp when the model exposes hasMtpWeights()==false', async () => {
+    const { model, chatSessionStart } = makeMockModel();
+    (model as SessionCapableModel).hasMtpWeights = () => false;
+    const session = new ChatSession(model);
+
+    await session.send('Hello');
+
+    const [, config] = chatSessionStart.mock.calls[0];
+    // Auto-default never fires → property stays undefined (not `false`),
+    // mirroring the contract from the JSDoc on `mergeConfig`.
+    expect(config?.enableMtp).toBeUndefined();
+  });
+
+  it('does not set enableMtp when the model omits hasMtpWeights() entirely', async () => {
+    // Models predating W7 (Qwen3, Gemma4, LFM2, etc.) do NOT define
+    // `hasMtpWeights` on their native wrapper. The duck check inside
+    // `mergeConfig` must skip the auto-default cleanly.
+    const { model, chatSessionStart } = makeMockModel();
+    const session = new ChatSession(model);
+
+    await session.send('Hello');
+
+    const [, config] = chatSessionStart.mock.calls[0];
+    expect(config?.enableMtp).toBeUndefined();
+  });
+
+  it('respects an explicit enableMtp=false even when the model has MTP weights', async () => {
+    // An explicit opt-out from the caller must win over the auto-
+    // default: operators benchmarking MTP-vs-AR need to be able to
+    // force the AR path on a checkpoint that ships an MTP head.
+    const { model, chatSessionStart } = makeMockModel();
+    (model as SessionCapableModel).hasMtpWeights = () => true;
+    const session = new ChatSession(model);
+
+    await session.send('Hello', { config: { enableMtp: false } });
+
+    const [, config] = chatSessionStart.mock.calls[0];
+    expect(config?.enableMtp).toBe(false);
+  });
 });

 // -------------------------------------------------------------------
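The four tests above pin a small precedence rule for `enableMtp`. As a minimal sketch of that rule, assuming simplified shapes: `ChatConfigSketch`, `ModelSketch`, and `resolveMtpConfig` are hypothetical stand-ins and are not the repository's `mergeConfig`, whose real signature is not shown in this diff.

```typescript
// Hypothetical, simplified shapes; the real types live in the repository.
interface ChatConfigSketch {
  enableMtp?: boolean;
  mtpDepth?: number;
}

interface ModelSketch {
  // Optional on purpose: models without MTP support do not define it.
  hasMtpWeights?: () => boolean;
}

function resolveMtpConfig(model: ModelSketch, config: ChatConfigSketch = {}): ChatConfigSketch {
  // An explicit caller choice (true or false) always wins.
  if (config.enableMtp !== undefined) return { ...config };
  // Duck-typed capability check: auto-enable only when the method exists
  // and reports true. Otherwise the property stays undefined, not false.
  if (typeof model.hasMtpWeights === 'function' && model.hasMtpWeights()) {
    return { ...config, enableMtp: true };
  }
  return { ...config };
}
```

Leaving the property `undefined` rather than writing `false` keeps "the caller did not choose" distinguishable from "the caller opted out", which is the distinction the second and fourth tests rely on.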

__test__/models/qwen3.test.ts

Lines changed: 20 additions & 0 deletions
@@ -61,5 +61,25 @@ describe.sequential('Qwen3 Model', () => {
       expect(result.numTokens).toBeGreaterThanOrEqual(0);
       expect(result.numTokens).toBeLessThanOrEqual(20);
     });
+
+    it('should reject a nonpositive maxNewTokens budget (parity with Qwen3.5)', async () => {
+      // The public generate() API rejects a nonpositive budget (Err)
+      // instead of panicking the model thread on Vec::with_capacity(-1 as
+      // usize) == usize::MAX. Requires a real model since the guard runs
+      // inside generate_sync on the model thread (post-load).
+      const modelPath = process.env.QWEN3_MODEL_PATH;
+
+      if (!modelPath) {
+        console.log(' ⏭️ Skipping nonpositive-budget reject test (set QWEN3_MODEL_PATH to enable)');
+        return;
+      }
+
+      const model = await Qwen3Model.load(modelPath);
+      const messages = [{ role: 'user', content: 'Hello' }];
+
+      // Both 0 and a negative budget must reject (not panic, not resolve).
+      await expect(model.generate(messages, { maxNewTokens: 0 })).rejects.toThrow(/max_new_tokens must be > 0/);
+      await expect(model.generate(messages, { maxNewTokens: -1 })).rejects.toThrow(/max_new_tokens must be > 0/);
+    });
   });
 });
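The test above checks the native guard, which runs on the model thread. The commit message also describes a server-side check that rejects non-positive, non-integer, and above-`i32::MAX` budgets before the value reaches NAPI. A minimal sketch of such a check follows, assuming a hypothetical helper name `validateTokenBudget`; only the `must be > 0` wording comes from the test, and the other error messages are invented for illustration.

```typescript
// Largest value that fits in a Rust i32.
const I32_MAX = 2147483647;

// Hypothetical guard: returns the value unchanged when it is a usable token
// budget and throws otherwise, so a bad value never reaches the native layer.
function validateTokenBudget(value: number, name: string = 'max_new_tokens'): number {
  // Rejects NaN, Infinity, and fractional values such as 1.5.
  if (!Number.isInteger(value)) {
    throw new RangeError(`${name} must be an integer`);
  }
  // Rejects 0 and negatives; the message mirrors the one the test expects.
  if (value <= 0) {
    throw new RangeError(`${name} must be > 0`);
  }
  // Rejects values that would overflow a 32-bit signed integer.
  if (value > I32_MAX) {
    throw new RangeError(`${name} must be <= ${I32_MAX}`);
  }
  return value;
}
```

Checking at the boundary turns a silent empty completion or a thread panic into an ordinary rejected request.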

__test__/server/anthropic-request-mapper.test.ts

Lines changed: 240 additions & 0 deletions
@@ -4,6 +4,7 @@ import {
   canonicalizeSystemForCacheKey,
   mapAnthropicRequest,
 } from '../../packages/server/src/mappers/anthropic-request.js';
+import type { AnthropicContentBlock } from '../../packages/server/src/types-anthropic.js';

 describe('mapAnthropicRequest', () => {
   it('maps a simple string user message to a single user ChatMessage', () => {
@@ -1286,4 +1287,243 @@ describe('mapAnthropicRequest', () => {
       expect(messages[0].images).toHaveLength(2);
     });
   });
+
+  // -------------------------------------------------------------------
+  // W7 (MTP): `extra_body.generation_mode` + `extra_body.mtp_depth`
+  // -------------------------------------------------------------------
+  describe('extra_body MTP overrides', () => {
+    it('maps generation_mode "mtp" to enableMtp=true', () => {
+      const { config } = mapAnthropicRequest({
+        model: 'claude-3-5-sonnet-20241022',
+        max_tokens: 1024,
+        messages: [{ role: 'user', content: 'Hello' }],
+        extra_body: { generation_mode: 'mtp' },
+      });
+      expect(config.enableMtp).toBe(true);
+    });
+
+    it('maps generation_mode "ar" to enableMtp=false', () => {
+      const { config } = mapAnthropicRequest({
+        model: 'claude-3-5-sonnet-20241022',
+        max_tokens: 1024,
+        messages: [{ role: 'user', content: 'Hello' }],
+        extra_body: { generation_mode: 'ar' },
+      });
+      expect(config.enableMtp).toBe(false);
+    });
+
+    it('leaves enableMtp untouched when extra_body is absent', () => {
+      const { config } = mapAnthropicRequest({
+        model: 'claude-3-5-sonnet-20241022',
+        max_tokens: 1024,
+        messages: [{ role: 'user', content: 'Hello' }],
+      });
+      expect(config.enableMtp).toBeUndefined();
+    });
+
+    it('forwards a valid mtp_depth onto config.mtpDepth', () => {
+      const { config } = mapAnthropicRequest({
+        model: 'claude-3-5-sonnet-20241022',
+        max_tokens: 1024,
+        messages: [{ role: 'user', content: 'Hello' }],
+        extra_body: { mtp_depth: 2 },
+      });
+      expect(config.mtpDepth).toBe(2);
+    });
+  });
+
+  // -------------------------------------------------------------------
+  // system-role message folding
+  //
+  // `system` is not a role in the Anthropic Messages spec, but Claude
+  // Code's SessionStart hooks (e.g. superpowers) inject a
+  // `{ role: 'system' }` message carrying "additional context" into the
+  // `messages` array. The mapper previously rejected this with HTTP 400
+  // ("Unsupported message role"). It now folds such a message's text into
+  // the single leading system prompt instead. The content is positionless
+  // additional context, so its location in the array does not matter; we
+  // accumulate in encounter order and prepend after the message loop.
+  // -------------------------------------------------------------------
+  describe('system-role message folding', () => {
+    it('folds a trailing system-role message into a leading system prompt (the 400 repro)', () => {
+      // Exact shape that produced `400 Unsupported message role: "system"`:
+      // a SessionStart hook appends a `{ role: 'system' }` message after the
+      // user turn. It must now be accepted and hoisted to the front.
+      const { messages } = mapAnthropicRequest({
+        model: 'claude-3-5-sonnet-20241022',
+        max_tokens: 1024,
+        messages: [
+          { role: 'user', content: 'Hi' },
+          { role: 'system', content: 'Additional context: the repo is mlx-node.' },
+        ],
+      });
+
+      expect(messages).toEqual([
+        { role: 'system', content: 'Additional context: the repo is mlx-node.' },
+        { role: 'user', content: 'Hi' },
+      ]);
+    });
+
+    it('joins a top-level system and a folded system-role message with a blank line', () => {
+      // The separator between distinct contributions is `'\n\n'`. Pin it.
+      const { messages } = mapAnthropicRequest({
+        model: 'claude-3-5-sonnet-20241022',
+        max_tokens: 1024,
+        system: 'You are a helpful assistant.',
+        messages: [
+          { role: 'user', content: 'Hi' },
+          { role: 'system', content: 'Session note: be concise.' },
+        ],
+      });
+
+      expect(messages).toEqual([
+        { role: 'system', content: 'You are a helpful assistant.\n\nSession note: be concise.' },
+        { role: 'user', content: 'Hi' },
+      ]);
+    });
+
+    it('folds multiple system-role messages in encounter order after the top-level system', () => {
+      const { messages } = mapAnthropicRequest({
+        model: 'claude-3-5-sonnet-20241022',
+        max_tokens: 1024,
+        system: 'Base prompt.',
+        messages: [
+          { role: 'system', content: 'context A' },
+          { role: 'user', content: 'Hi' },
+          { role: 'system', content: 'context B' },
+        ],
+      });
+
+      expect(messages).toEqual([
+        { role: 'system', content: 'Base prompt.\n\ncontext A\n\ncontext B' },
+        { role: 'user', content: 'Hi' },
+      ]);
+    });
+
+    it('concatenates a system-role array content with no separator within the message, tolerating extra block fields', () => {
+      // Within a single array-content message, text blocks join with `''`
+      // (matching the top-level `system`-array behaviour); the `'\n\n'`
+      // separator only appears BETWEEN distinct contributions. Extra block
+      // fields a client may attach (e.g. `cache_control`) are ignored; the
+      // cast models the wire shape Claude Code can send even though the
+      // text-block type does not declare the field.
+      const { messages } = mapAnthropicRequest({
+        model: 'claude-3-5-sonnet-20241022',
+        max_tokens: 1024,
+        system: 'Base prompt.',
+        messages: [
+          {
+            role: 'system',
+            content: [
+              { type: 'text', text: 'part one ', cache_control: { type: 'ephemeral' } },
+              { type: 'text', text: 'part two' },
+            ] as unknown as AnthropicContentBlock[],
+          },
+          { role: 'user', content: 'Hi' },
+        ],
+      });
+
+      expect(messages).toEqual([
+        { role: 'system', content: 'Base prompt.\n\npart one part two' },
+        { role: 'user', content: 'Hi' },
+      ]);
+    });
+
+    it('hoists a mid-conversation system-role message to the front while preserving the rest of the order', () => {
+      // CONTRACT: a system-role message is positionless and hoisted to the
+      // single leading system prompt regardless of where it appears. This is
+      // deliberate: the Anthropic wire format defines no positional `system`
+      // role, the only known producer (Claude Code SessionStart hooks) emits
+      // positionless additional context, and the internal
+      // `ChatMessage`/`primeHistory` pipeline can represent only a single
+      // leading system message. See the matching block comment in the mapper.
+      const { messages } = mapAnthropicRequest({
+        model: 'claude-3-5-sonnet-20241022',
+        max_tokens: 1024,
+        messages: [
+          { role: 'user', content: 'What is 2+2?' },
+          { role: 'assistant', content: '4' },
+          { role: 'system', content: 'midstream note' },
+          { role: 'user', content: 'Are you sure?' },
+        ],
+      });
+
+      expect(messages).toEqual([
+        { role: 'system', content: 'midstream note' },
+        { role: 'user', content: 'What is 2+2?' },
+        { role: 'assistant', content: '4' },
+        { role: 'user', content: 'Are you sure?' },
+      ]);
+    });
+
+    it('drops an empty system-role message rather than corrupting the system prompt with a trailing blank line', () => {
+      // An empty hook context message must neither append a dangling `'\n\n'`
+      // to a real system prompt nor synthesise a bare empty system message.
+      const withTopLevel = mapAnthropicRequest({
+        model: 'claude-3-5-sonnet-20241022',
+        max_tokens: 1024,
+        system: 'Real system prompt.',
+        messages: [
+          { role: 'user', content: 'Hi' },
+          { role: 'system', content: '' },
+        ],
+      });
+      expect(withTopLevel.messages).toEqual([
+        { role: 'system', content: 'Real system prompt.' },
+        { role: 'user', content: 'Hi' },
+      ]);
+
+      // With no top-level system and only an empty folded message, no system
+      // message is emitted at all (mirrors the all-stripped-array contract).
+      const withoutTopLevel = mapAnthropicRequest({
+        model: 'claude-3-5-sonnet-20241022',
+        max_tokens: 1024,
+        messages: [
+          { role: 'user', content: 'Hi' },
+          { role: 'system', content: '' },
+        ],
+      });
+      expect(withoutTopLevel.messages).toEqual([{ role: 'user', content: 'Hi' }]);
+    });
+
+    it('rejects a non-text content block inside a system-role message', () => {
+      const imageData =
+        'iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mNk+M9QDwADhgGAWjR9awAAAABJRU5ErkJggg==';
+      expect(() =>
+        mapAnthropicRequest({
+          model: 'claude-3-5-sonnet-20241022',
+          max_tokens: 1024,
+          messages: [
+            {
+              role: 'system',
+              content: [{ type: 'image', source: { type: 'base64', media_type: 'image/png', data: imageData } }],
+            },
+          ],
+        }),
+      ).toThrow(/Unsupported content block type "image" in system-role message/i);
+    });
+
+    it('regression: a request with no system-role message is unaffected by the folding path', () => {
+      // The folding path must be inert for normal traffic: a top-level
+      // system plus an ordinary multi-turn conversation maps exactly as
+      // before, with no `'\n\n'` artifacts introduced.
+      const { messages } = mapAnthropicRequest({
+        model: 'claude-3-5-sonnet-20241022',
+        max_tokens: 1024,
+        system: 'You are a helpful assistant.',
+        messages: [
+          { role: 'user', content: 'What is 2+2?' },
+          { role: 'assistant', content: '4' },
+          { role: 'user', content: 'Are you sure?' },
+        ],
+      });
+
+      expect(messages).toEqual([
+        { role: 'system', content: 'You are a helpful assistant.' },
+        { role: 'user', content: 'What is 2+2?' },
+        { role: 'assistant', content: '4' },
+        { role: 'user', content: 'Are you sure?' },
+      ]);
+    });
+  });
 });
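The folding contract these tests pin down can be sketched in a few lines. The following is an illustration only, assuming simplified types: `foldSystemMessages`, `blockText`, `Block`, and `Msg` are hypothetical names, the top-level `system` is taken as a plain string (the tests indicate the real mapper also accepts an array form), and non-system messages are passed through untouched rather than mapped.

```typescript
// Hypothetical, simplified wire shapes. Extra fields such as `cache_control`
// are allowed by the index signature and ignored.
type Block = { type: string; text?: string; [key: string]: unknown };
type Msg = { role: 'user' | 'assistant' | 'system'; content: string | Block[] };

// Extracts the text of one block, rejecting anything that is not text.
function blockText(block: Block): string {
  if (block.type !== 'text' || typeof block.text !== 'string') {
    throw new Error(`Unsupported content block type "${block.type}" in system-role message`);
  }
  return block.text;
}

function foldSystemMessages(system: string | undefined, messages: Msg[]): Msg[] {
  // Distinct contributions to the system prompt, top-level prompt first.
  const parts: string[] = [];
  if (system) parts.push(system);

  const rest: Msg[] = [];
  for (const msg of messages) {
    if (msg.role !== 'system') {
      rest.push(msg);
      continue;
    }
    // Blocks inside one message join with no separator.
    const text =
      typeof msg.content === 'string' ? msg.content : msg.content.map((b) => blockText(b)).join('');
    // Empty contributions are dropped so no dangling separator appears.
    if (text.length > 0) parts.push(text);
  }

  // Contributions join with a blank line; emit no system message if none remain.
  if (parts.length === 0) return rest;
  return [{ role: 'system', content: parts.join('\n\n') }, ...rest];
}
```

Collecting the parts first and prepending after the loop is what makes the result independent of where the system-role message sits in the array, while the relative order of the other turns is preserved.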
