@@ -14,6 +14,44 @@ See also `MTP.md` (Gemma) and `docs/speculative.md` for shared CLI concepts.
1414
1515---
1616
17+ ## 0. Pre-built model GGUFs
18+
19+ Recommended source for Qwen 3.6 combined ` *_MTP.gguf ` checkpoints is the
20+ ** unsloth** Hugging Face collection — the same files exercised in the
21+ matrix bench (§7):
22+
23+ | Target | Combined ` _MTP.gguf ` (target + NextN head) | Recommended quant | Architecture |
24+ | ---| ---| ---| ---|
25+ | Qwen 3.6 35B-A3B (MoE) | [ ` unsloth/Qwen3.6-35B-A3B-MTP-GGUF ` ] ( https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF ) | ** ` UD-Q4_K_XL ` ** (22.9 GB) | ` qwen35moe ` |
26+ | Qwen 3.6 27B (dense) | [ ` unsloth/Qwen3.6-27B-MTP-GGUF ` ] ( https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF ) | ** ` UD-Q4_K_XL ` ** | ` qwen35 ` |
27+
28+ Both repos ship ` UD-IQ1_M ` … ` BF16 ` quants. The shared-model NextN path
29+ works on ** any** of them as long as the file contains the NextN auxiliary
30+ head (` nextn_predict_layers > 0 ` ) — which all ` *-MTP-GGUF ` quants do by
31+ construction. ` scripts/verify-qwen36-nextn-gguf.py ` will refuse to load a
32+ file missing the NextN layer.
33+
34+ Quick pull via ` -hf ` (target) + ` -hfd ` (draft); the server resolves both to
35+ the same file in the HF cache and takes the shared-model branch:
36+
37+ ``` bash
38+ # 35B-A3B MoE (headline +24-36 % cell in the matrix)
39+ llama-server \
40+ -hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
41+ -hfd unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
42+ --spec-type nextn --draft-max 2 --draft-min 1 \
43+ -c 8192 -ngl 99 -ngld 99 -fa on
44+
45+ # 27B dense
46+ llama-server \
47+ -hf unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL \
48+ -hfd unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL \
49+ --spec-type nextn --draft-max 2 --draft-min 1 \
50+ -c 8192 -ngl 99 -ngld 99 -fa on
51+ ```
52+
53+ ---
54+
1755## 1. Architecture
1856
1957| Piece | Role |
@@ -76,35 +114,80 @@ PYTHONPATH=gguf-py python3 scripts/verify-qwen36-nextn-gguf.py /path/to/model.gg
76114- ` scripts/run-qwen36-27b-nextn-server.sh `
77115- ` scripts/run-qwen36-35ba3b-nextn-server.sh `
78116
79- Set ` MAIN_GGUF ` to your Qwen3.6 GGUF; draft defaults to the same path.
117+ Set ` MAIN_GGUF ` to your Qwen3.6 ` *_MTP.gguf ` (see §0 for the recommended
118+ unsloth quants); draft defaults to the same path so the server takes the
119+ shared-model branch. Alternatively use ` -hf ` (target) + ` -hfd ` (draft) to
120+ let ` llama-server ` pull both from Hugging Face into the local cache:
121+
122+ ``` bash
123+ llama-server \
124+ -hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
125+ -hfd unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
126+ --spec-type nextn --draft-max 2 --draft-min 1
127+ ```
80128
81129---
82130
83- ## 7. Performance notes (Apple M4 Max, Metal)
131+ ## 7. Performance notes (MacBook Pro M4 Max, 40-core GPU, 48 GB , Metal)
84132
85133Median TPS over 3 runs, prompt = 50-token instruction, ` --draft-max=2 --draft-min=1 ` ,
86- NextN draft DM=2 (single async chain), context 8192. See ` .scratch/bench-logs/qwen-matrix-shared-*.md ` .
87-
88- | model | mode | short tps (n=128) | long tps (n=512) | accept (long) | Δ vs base (long) |
89- | ---| ---| ---:| ---:| ---:| ---:|
90- | qwen-27B dense | f16-base | 20.82 | 20.49 | — | — |
91- | qwen-27B dense | f16-nextn | 20.33 | 18.93 | 72.0% | ** −7.6%** |
92- | qwen-27B dense | turbo3-base | 18.41 | 17.85 | — | — |
93- | qwen-27B dense | turbo3-nextn | 17.88 | 15.72 | 65.4% | ** −11.9%** |
94- | qwen-35B-A3B MoE | f16-base | 69.31 | 69.30 | — | — |
95- | qwen-35B-A3B MoE | f16-nextn | 91.86 | 83.63 | 66.1% | ** +20.7%** |
96- | qwen-35B-A3B MoE | turbo3-base | 62.46 | 61.97 | — | — |
97- | qwen-35B-A3B MoE | turbo3-nextn | 84.91 | 78.41 | 67.7% | ** +26.5%** |
98-
99- ** Where NextN helps** : MoE targets (qwen-35B-A3B) — verify is heavy enough that the draft
100- compute fully overlaps via the async pipeline. Wins range from ** +20% (f16, long)** to
101- ** +36% (turbo3, short)** .
102-
103- ** Known limitation: 27B dense NextN draft is draft-compute-bound.** The NextN-layer is a
104- full transformer block, so on a dense model ` t_draft ≈ 2.6× t_verify ` . The async pipeline
105- cannot overlap that fully → speculative wins are negative or paritetical. turbo3 KV
106- quantization adds another ** ~ 7%** to draft compute (Metal dequant overhead inside the
107- NextN attention), pushing 27B turbo3-nextn long to ** −12%** vs baseline. This is not a bug:
108- isolated diagnostics (` accept_token ` 71.2% f16 ≈ 71.5% turbo3 — H1/H3 rejected,
109- ` t_draft ` 1354 → 1449 ms — H4 partially confirmed) point to physical compute limits on
110- M4 Max. Stick to f16 KV when running NextN on dense Qwen3.6 27B if every percent matters.
134+ NextN draft DM=2 (single async chain), context 8192. Single-slot
135+ (` --parallel 1 -np 1 --cont-batching ` ), full GPU offload (` -ngl 99 -ngld 99 -fa on ` ),
136+ shared-model draft path (no second mmap of combined ` _MTP.gguf ` ). See
137+ ` .scratch/bench-logs/qwen-matrix-fullrun-20260512-222625.md ` .
138+
139+ ### Bench host
140+
141+ | Component | Value |
142+ | ---| ---|
143+ | Machine | MacBook Pro (` Mac16,5 ` , MX313LL/A) |
144+ | SoC | Apple ** M4 Max** — 16 CPU cores (12P + 4E), ** 40-core GPU** |
145+ | Unified memory | ** 48 GB** LPDDR5 |
146+ | OS | macOS 26.3.1 (build 25D2128), Darwin 25.3.0 |
147+ | llama.cpp backend | Metal (full GPU offload: ` -ngl 99 -ngld 99 ` , ` -fa on ` ) |
148+ | Server | local ` llama-server ` over ` 127.0.0.1:8080 ` |
149+ | Client | ` python3 urllib ` → ` /v1/chat/completions ` , ` temperature=0 ` , ` cache_prompt=false ` , ` stream=false ` |
150+ | Driver | ` scripts/bench-matrix-qwen.sh ` (3 runs/cell, median tps, mean accept) |
151+
152+ Single-slot configuration (` --parallel 1 -np 1 --cont-batching ` ); no other
153+ heavy GPU/CPU workloads were running on the host during the matrix sweep.
154+
155+ | model | mode | short tps (n=128) | long tps (n=512) | short accept | long accept | Δ short | Δ long |
156+ | ---| ---| ---:| ---:| ---:| ---:| ---:| ---:|
157+ | qwen-27B dense | f16-base | 21.34 | 20.82 | — | — | — | — |
158+ | qwen-27B dense | f16-nextn | ** 22.86** | ** 21.57** | 93.9% | 85.1% | ** +7.1%** | ** +3.6%** |
159+ | qwen-27B dense | turbo3-base | 19.71 | 18.74 | — | — | — | — |
160+ | qwen-27B dense | turbo3-nextn | ** 20.75** | ** 19.73** | 85.5% | 78.7% | ** +5.3%** | ** +5.3%** |
161+ | qwen-35B-A3B MoE | f16-base | 70.09 | 69.63 | — | — | — | — |
162+ | qwen-35B-A3B MoE | f16-nextn | ** 95.22** | ** 89.13** | 88.2% | 78.7% | ** +35.8%** | ** +28.0%** |
163+ | qwen-35B-A3B MoE | turbo3-base | 61.84 | 62.01 | — | — | — | — |
164+ | qwen-35B-A3B MoE | turbo3-nextn | ** 82.73** | ** 77.20** | 82.9% | 80.6% | ** +33.8%** | ** +24.5%** |
165+
166+ ** Where NextN helps the most: MoE targets (qwen-35B-A3B).** Verify is heavy enough that the
167+ draft compute fully overlaps via the async pipeline; acceptance stays high (≥78%) at both
168+ prompt lengths. Wins range from ** +24% (turbo3, long)** to ** +36% (f16, short)** , on top of
169+ the +13% TurboQuant memory-bandwidth lift from ` turbo3 ` KV.
170+
171+ ** Dense 27B is draft-compute-bound but no longer regresses.** The NextN-layer is a full
172+ transformer block; on a dense model ` t_draft ≈ 2.6× t_verify ` , so the async pipeline cannot
173+ overlap it fully and the upside is bounded by accept-rate × ` (t_verify / (t_verify + non-overlapped t_draft)) ` .
174+ With the shared-model draft path (no double mmap, no graph rebuilds across submits) we land
175+ at ** +5-7% across short/long, both KV typings** — modest but consistent, and * positive*
176+ where the previous double-mmap path was negative (the old ` qwen-matrix-shared ` matrix logged
177+ −7.6% / −11.9% on long for f16-nextn / turbo3-nextn respectively). ` turbo3 ` KV adds ~ 5% extra
178+ draft compute on this rig (Metal dequant inside NextN attention) but it is hidden in the
179+ overlap and TurboQuant's bandwidth win covers the rest.
180+
181+ ### History within this branch (27B regression resolved)
182+
183+ | Bench log (mtime) | Path | 27B f16-nextn long (Δ vs f16-base) | 27B turbo3-nextn long (Δ vs turbo3-base) | Note |
184+ | ---| ---| ---:| ---:| ---|
185+ | ` qwen-matrix-shared-20260512-202358.md ` | double mmap | −7.6 % (18.93 vs 20.49) | −11.9 % (15.72 vs 17.85) | 35B-A3B OOM on long prompts |
186+ | ` qwen-matrix-fullrun-20260512-222625.md ` | shared model | ** +3.6 % (21.57 vs 20.82)** | ** +5.3 % (19.73 vs 18.74)** | this matrix |
187+
188+ The jump came from a single architectural change: dropping the second
189+ ` llama_model_load_from_file ` and reusing the target's already-loaded NextN tensors via
190+ ` cparams.nextn_draft = true ` . Side-effects: (a) 22 GB second ` MTLBuffer ` gone — 35B-A3B MoE
191+ now runs without OOM and posts +24-36%; (b) draft KV cache resized only for the NextN layer
192+ (` kv_only_nextn = true ` is mutated transparently in ` llama_context ` ctor for draft); (c) the
193+ NextN graph builder now flows through ` LLM_GRAPH_TYPE_NEXTN ` instead of ` override_arch ` .
0 commit comments