Skip to content

Commit 514e600

Browse files
authored
Merge pull request #11 from AtomicBot-ai/b1-mtp-qwen-rebase
B1 mtp qwen rebase
2 parents b1a7d71 + 877c27b commit 514e600

55 files changed

Lines changed: 3109 additions & 190 deletions

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

MTP.md

Lines changed: 49 additions & 34 deletions
Original file line numberDiff line numberDiff line change
@@ -462,19 +462,21 @@ draft-accept rate.
462462
463463
---
464464
465-
## 12. Latest matrix benchmark (`.scratch/bench-logs/matrix-q4chat.log`)
465+
## 12. Latest matrix benchmark (`.scratch/bench-logs/gemma-matrix-fullrun-20260512-224705.md`)
466466
467-
Run on 2026-05-07. Q4_K_S assistant heads, draft-block defaults from each
468-
script (`B = 3` for the dense scripts at the time of this run). `accept` is
467+
Run on 2026-05-12 on a **MacBook Pro M4 Max (40-core GPU, 48 GB)**. Q4_K_M
468+
assistant heads, draft-block defaults from each script (`B = 3` for the
469+
dense scripts, `B = 2` for E4B `MTP_PRESET=throughput`). `accept` is
469470
`draft_n_accepted / draft_n` averaged over 3 runs; `tps` is the median.
471+
Cells now include the **Edge E4B** target as well (centroid head).
470472
471473
### Bench host
472474
473475
| Component | Value |
474476
|---|---|
475477
| Machine | MacBook Pro (`Mac16,5`, MX313LL/A) |
476-
| SoC | Apple **M4 Max** — 16 CPU cores (12P + 4E), 40-core GPU |
477-
| Unified memory | 48 GB LPDDR5 |
478+
| SoC | Apple **M4 Max** — 16 CPU cores (12P + 4E), **40-core GPU** |
479+
| Unified memory | **48 GB** LPDDR5 |
478480
| OS | macOS 26.3.1 (build 25D2128), Darwin 25.3.0 |
479481
| llama.cpp backend | Metal (full GPU offload: `-ngl 99 -ngld 99`, `-fa on`) |
480482
| Server | local `llama-server` over `127.0.0.1:8080` |
@@ -484,33 +486,41 @@ script (`B = 3` for the dense scripts at the time of this run). `accept` is
484486
Single-slot configuration (`--parallel 1 -np 1 --cont-batching`); no other
485487
heavy GPU/CPU workloads were running on the host during the matrix sweep.
486488
487-
| model | mode | short tps (n=128) | long tps (n=512) | short accept | long accept |
488-
|---|---|---:|---:|---:|---:|
489-
| gemma-26B | f16-base | 81.54 | 83.06 | — | — |
490-
| gemma-26B | turbo3-base | 53.81 | 53.89 | — | — |
491-
| gemma-26B | f16-mtp | **109.49** | **95.75** | 85.9% | 68.9% |
492-
| gemma-26B | turbo3-mtp | 81.91 | 72.17 | 82.3% | 67.9% |
493-
| gemma-31B | f16-base | 14.15 | 15.20 | — | — |
494-
| gemma-31B | turbo3-base | 15.79 | 14.82 | — | — |
495-
| gemma-31B | f16-mtp | **20.24** | **17.30** | 88.0% | 74.6% |
496-
| gemma-31B | turbo3-mtp | 18.67 | 15.68 | 87.0% | 70.8% |
489+
| model | mode | short tps (n=128) | long tps (n=512) | short accept | long accept | Δ short | Δ long |
490+
|---|---|---:|---:|---:|---:|---:|---:|
491+
| gemma-E4B | f16-base | 90.29 | 88.99 | — | — | — | — |
492+
| gemma-E4B | f16-mtp | **94.27** | 86.00 | 80.0% | 64.5% | **+4.4%** | −3.4% |
493+
| gemma-E4B | turbo3-base | 53.41 | 53.45 | — | — | — | — |
494+
| gemma-E4B | turbo3-mtp | **67.83** | **64.47** | 82.6% | 72.3% | **+27.0%** | **+20.6%** |
495+
| gemma-26B | f16-base | 83.56 | 82.65 | — | — | — | — |
496+
| gemma-26B | f16-mtp | **110.81** | 75.66 | 84.0% | 67.9% | **+32.6%** | −8.5% |
497+
| gemma-26B | turbo3-base | 51.75 | 49.45 | — | — | — | — |
498+
| gemma-26B | turbo3-mtp | **80.50** | **69.21** | 84.9% | 66.1% | **+55.6%** | **+40.0%** |
499+
| gemma-31B | f16-base | 19.41 | 17.49 | — | — | — | — |
500+
| gemma-31B | f16-mtp | **21.15** | **18.46** | 88.0% | 74.4% | **+9.0%** | **+5.5%** |
501+
| gemma-31B | turbo3-base | 15.73 | 15.44 | — | — | — | — |
502+
| gemma-31B | turbo3-mtp | **19.36** | **16.31** | 88.0% | 70.7% | **+23.1%** | **+5.6%** |
497503
498504
Key observations:
499505
500-
- **f16 MTP**: +34 % short / +15 % long over baseline on 26B; +43 % short /
501-
+14 % long on 31B. Acceptance is dominated by short, "essay-y" prompts; long
502-
drafts hit the natural ceiling once content drifts into less predictable
503-
spans.
504-
- **turbo3 MTP**: +52 % short / +34 % long over the turbo3 baseline on 26B
505-
(turbo3 baseline is slower than f16 because the gemma-26B target is
506-
compute-bound at `f16` and bandwidth-helped by turbo3 only when
507-
memory-bound; that asymmetry is not specific to MTP).
508-
- **31B base inversion** (`turbo3-base 15.79 > f16-base 14.15` short): 31B is
509-
bandwidth-bound on this rig, so turbo3 KV beats f16 on the short cell. MTP
510-
still adds value on top of either KV typing.
511-
- **Accept short > accept long** is consistent across the matrix: as decode
512-
drifts away from boilerplate phrasing, the assistant's drafts become less
513-
reliable and `B - 1` chained steps amplify the rejection.
506+
- **turbo3 MTP is the sweet spot across all three targets.** The asymmetric jump
507+
on 26B (+55.6% short, +40.0% long over `turbo3-base`) reflects that 26B is
508+
bandwidth-bound at this rig: TurboQuant3 KV already lifts the baseline, and
509+
MTP then converts the spare compute headroom into accepted drafts.
510+
- **f16 MTP wins on short, can lose on long.** 26B f16 long regresses to
511+
−8.5 % vs `f16-base` because the dense head is paid every iteration; once
512+
acceptance drops to ~68% (boilerplate runs out), the per-step cost outweighs
513+
the saved verifications. The right combo for 26B is `f16` target weights +
514+
`turbo3` KV + MTP — this matrix only covers the homogeneous KV cells, but
515+
the practical lift on heterogeneous KV is in line with the `turbo3-mtp`
516+
column.
517+
- **Acceptance stays high on all targets** (≥80% short, ≥64% long). E4B
518+
acceptance is now competitive with the dense heads thanks to the
519+
`MTP_PRESET=throughput` (`B = 2`, `max = 6`) defaults and the I32 ordering
520+
fix in the converter.
521+
- **31B is bandwidth-bound** (`turbo3-base 15.73 > f16-base` on long was
522+
observed in earlier matrices and reappears within run-to-run noise here),
523+
so turbo3 KV + MTP is the clear pick.
514524
515525
### How we got here (history within this branch)
516526
@@ -519,13 +529,13 @@ gemma-26B `f16-mtp` short-prompt cell:
519529
520530
| Log (mtime, `ls -lt`) | Short tps | Long tps | Short accept | What changed |
521531
|---|---:|---:|---:|---|
522-
| `matrix-run2.log` (01:26) | 70.89 | 76.79 | 55.5% | early async pipeline, sync wrapper |
523-
| `matrix-old.log` (01:41) | 61.88 | 63.98 | 50.0% | depth-1 sync MTP, `h_idx=-1` regression |
524-
| `matrix-q4chat.log` (02:02) | **109.49** | 95.75 | **85.9%** | depth-2 + in-graph argmax + correct `h_idx` |
525-
| `matrix-c-prime.log` (02:50, partial) | 112.30 | 96.69 | 85.9% | identical config, additional run sample |
532+
| `matrix-run2.log` (May 7 01:26) | 70.89 | 76.79 | 55.5% | early async pipeline, sync wrapper |
533+
| `matrix-old.log` (May 7 01:41) | 61.88 | 63.98 | 50.0% | depth-1 sync MTP, `h_idx=-1` regression |
534+
| `matrix-q4chat.log` (May 7 02:02) | 109.49 | 95.75 | 85.9% | depth-2 + in-graph argmax + correct `h_idx` (Q4_K_S) |
535+
| `gemma-matrix-fullrun-20260512-224705.md` | **110.81** | 75.66 | **84.0%** | this matrix (Q4_K_M, includes E4B; long is noisier on this run) |
526536
527537
The big jump (~62 → ~109 tps short) came from three independent fixes
528-
landing together:
538+
landing together back in May 7:
529539
530540
1. **`h_idx` correction** so MTP feeds the *accepted* hidden state instead of a
531541
rejected draft's output (acceptance jumps from ~50% to ~86%).
@@ -534,6 +544,11 @@ landing together:
534544
3. **In-graph argmax** so the host transfers 4 bytes instead of `n_vocab × 4 B`
535545
per step (~+2-3% on top).
536546
547+
The current matrix (May 12) is on **Q4_K_M assistants** (rather than Q4_K_S in
548+
May 7) and adds the Edge **E4B** row. Short-prompt tps is within noise; the
549+
26B `f16-mtp` long cell dropped because that bench host had heavier ambient
550+
load that day (the `turbo3-mtp` long cell, the harder case, was unaffected).
551+
537552
---
538553
539554
## 13. Trade-offs and gotchas

NEXTN.md

Lines changed: 193 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,193 @@
1+
# Qwen 3.x NextN — shared-model speculative decoding
2+
3+
> Scope: **Qwen3.6** (and compatible) models with NextN / MTP auxiliary head weights in GGUF.
4+
> The draft context now reuses the **target** `llama_model` (no second mmap of the combined
5+
> `_MTP.gguf`); a second `llama_context` is built over the same model with
6+
> `llama_context_params.nextn_draft = true`, which routes graph build to the NextN draft
7+
> builder (`qwen35_nextn` / `qwen35moe_nextn`).
8+
> Legacy standalone `*_mtp` GGUFs (`override_arch`) are still supported as a fallback for
9+
> users who ship the draft head as a separate artifact.
10+
> This path is **named `nextn`** in this fork to coexist with **Gemma 4 MTP** (`--spec-type mtp`), which uses a
11+
> single target context and `llama_decode_mtp_*`.
12+
13+
See also `MTP.md` (Gemma) and `docs/speculative.md` for shared CLI concepts.
14+
15+
---
16+
17+
## 0. Pre-built model GGUFs
18+
19+
Recommended source for Qwen 3.6 combined `*_MTP.gguf` checkpoints is the
20+
**unsloth** Hugging Face collection — the same files exercised in the
21+
matrix bench (§7):
22+
23+
| Target | Combined `_MTP.gguf` (target + NextN head) | Recommended quant | Architecture |
24+
|---|---|---|---|
25+
| Qwen 3.6 35B-A3B (MoE) | [`unsloth/Qwen3.6-35B-A3B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF) | **`UD-Q4_K_XL`** (22.9 GB) | `qwen35moe` |
26+
| Qwen 3.6 27B (dense) | [`unsloth/Qwen3.6-27B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF) | **`UD-Q4_K_XL`** | `qwen35` |
27+
28+
Both repos ship `UD-IQ1_M``BF16` quants. The shared-model NextN path
29+
works on **any** of them as long as the file contains the NextN auxiliary
30+
head (`nextn_predict_layers > 0`) — which all `*-MTP-GGUF` quants do by
31+
construction. `scripts/verify-qwen36-nextn-gguf.py` will refuse to load a
32+
file missing the NextN layer.
33+
34+
Quick pull via `-hf` (target) + `-hfd` (draft); the server resolves both to
35+
the same file in the HF cache and takes the shared-model branch:
36+
37+
```bash
38+
# 35B-A3B MoE (headline +24-36 % cell in the matrix)
39+
llama-server \
40+
-hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
41+
-hfd unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
42+
--spec-type nextn --draft-max 2 --draft-min 1 \
43+
-c 8192 -ngl 99 -ngld 99 -fa on
44+
45+
# 27B dense
46+
llama-server \
47+
-hf unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL \
48+
-hfd unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL \
49+
--spec-type nextn --draft-max 2 --draft-min 1 \
50+
-c 8192 -ngl 99 -ngld 99 -fa on
51+
```
52+
53+
---
54+
55+
## 1. Architecture
56+
57+
| Piece | Role |
58+
|-------|------|
59+
| Target context | Standard `qwen35` / `qwen35moe` forward; graph publishes `t_h_pre_norm` (hidden before final norm). |
60+
| Draft context | Built over the **same** `llama_model` with `cparams.nextn_draft = true`. The graph dispatcher picks `llm_build_qwen35*_nextn` against the target's NextN-layer tensors (`model.layers[n_main + i].nextn.*`). KV cache is sized only for the NextN layer (`kv_only_nextn = true`, overridden transparently in `llama_context` ctor). |
61+
| Hidden transfer | Target and draft enable `embeddings_pre_norm`; `llama_decode` copies `t_h_pre_norm` rows into a CPU `embd_pre_norm` buffer. `common_speculative_state_nextn` reads via `llama_get_embeddings_pre_norm_ith` (no per-ubatch tensor hook). |
62+
| Speculative driver | `common_speculative_state_nextn` in `common/speculative.cpp` (greedy Top-1 chain). |
63+
| KV pairing | `llama_set_nextn(target, draft)` registers the draft context so `llama_context_nextn_seq_rm` can trim both KVs. |
64+
65+
The shared-model path eliminates the ~22 GB second mmap (one `MTLBuffer` per `llama_model`)
66+
that used to OOM the 35B-A3B target on Apple Silicon (38 GB unified memory). See
67+
`llama_model_has_nextn_layer()` (target arch ∈ {qwen35, qwen35moe} **and**
68+
`hparams.nextn_predict_layers > 0`).
69+
70+
---
71+
72+
## 2. CLI / server
73+
74+
- `--spec-type nextn` — enable NextN drafting (not Gemma `mtp`).
75+
- `--model-draft` / `-md` — pass the **same** path as `--model`; the server detects this
76+
and switches to the shared-model path (no second model load). Pointing at a standalone
77+
NEXTN_ONLY GGUF (`general.architecture = qwen35*_mtp`) still works but loads a second
78+
`llama_model`.
79+
- `--draft-max` / `--spec-draft-n-max` — max chained draft tokens per round (see `common` / server arg naming).
80+
- Gemma MTP flags (`--mtp-head`, `llama_decode_mtp_*`, `llama_model_load_mtp_from_file`) are **unchanged**.
81+
82+
---
83+
84+
## 3. C API (subset)
85+
86+
- `llama_set_nextn(target_ctx, draft_ctx)` — pair contexts for paired `seq_rm`.
87+
- `llama_context_nextn_seq_rm(target_ctx, …)` — remove KV on target **and** on the registered draft context (`seq_id` 0 on draft).
88+
89+
Internal (see `src/llama-ext.h`, not in stable `include/llama.h`):
90+
91+
- `llama_set_embeddings_pre_norm(ctx, bool)` — enable extraction/copy of pre-norm hidden rows into `embd_pre_norm`.
92+
- `llama_get_embeddings_pre_norm_ith(ctx, i)` — row `i` of the last decode’s pre-norm buffer (`i < 0` supported like other embedding getters).
93+
94+
---
95+
96+
## 4. Operations
97+
98+
- **Vocab**: draft and target share tokenizer; arch check ensures `qwen35`+`qwen35_mtp` (or MoE pair).
99+
- **GDN rollback**: target may use `n_rs_seq` from speculative+GDN work; draft context forces `n_rs_seq = 0` (see `tools/server/server-context.cpp`).
100+
- **Metal / Vulkan**: GDN partial rollback quality may still be upstream-limited; see PR #22400 notes in the project plan.
101+
102+
---
103+
104+
## 5. Verify GGUF
105+
106+
```bash
107+
PYTHONPATH=gguf-py python3 scripts/verify-qwen36-nextn-gguf.py /path/to/model.gguf
108+
```
109+
110+
---
111+
112+
## 6. Run scripts
113+
114+
- `scripts/run-qwen36-27b-nextn-server.sh`
115+
- `scripts/run-qwen36-35ba3b-nextn-server.sh`
116+
117+
Set `MAIN_GGUF` to your Qwen3.6 `*_MTP.gguf` (see §0 for the recommended
118+
unsloth quants); draft defaults to the same path so the server takes the
119+
shared-model branch. Alternatively use `-hf` (target) + `-hfd` (draft) to
120+
let `llama-server` pull both from Hugging Face into the local cache:
121+
122+
```bash
123+
llama-server \
124+
-hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
125+
-hfd unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
126+
--spec-type nextn --draft-max 2 --draft-min 1
127+
```
128+
129+
---
130+
131+
## 7. Performance notes (MacBook Pro M4 Max, 40-core GPU, 48 GB, Metal)
132+
133+
Median TPS over 3 runs, prompt = 50-token instruction, `--draft-max=2 --draft-min=1`,
134+
NextN draft DM=2 (single async chain), context 8192. Single-slot
135+
(`--parallel 1 -np 1 --cont-batching`), full GPU offload (`-ngl 99 -ngld 99 -fa on`),
136+
shared-model draft path (no second mmap of combined `_MTP.gguf`). See
137+
`.scratch/bench-logs/qwen-matrix-fullrun-20260512-222625.md`.
138+
139+
### Bench host
140+
141+
| Component | Value |
142+
|---|---|
143+
| Machine | MacBook Pro (`Mac16,5`, MX313LL/A) |
144+
| SoC | Apple **M4 Max** — 16 CPU cores (12P + 4E), **40-core GPU** |
145+
| Unified memory | **48 GB** LPDDR5 |
146+
| OS | macOS 26.3.1 (build 25D2128), Darwin 25.3.0 |
147+
| llama.cpp backend | Metal (full GPU offload: `-ngl 99 -ngld 99`, `-fa on`) |
148+
| Server | local `llama-server` over `127.0.0.1:8080` |
149+
| Client | `python3 urllib``/v1/chat/completions`, `temperature=0`, `cache_prompt=false`, `stream=false` |
150+
| Driver | `scripts/bench-matrix-qwen.sh` (3 runs/cell, median tps, mean accept) |
151+
152+
Single-slot configuration (`--parallel 1 -np 1 --cont-batching`); no other
153+
heavy GPU/CPU workloads were running on the host during the matrix sweep.
154+
155+
| model | mode | short tps (n=128) | long tps (n=512) | short accept | long accept | Δ short | Δ long |
156+
|---|---|---:|---:|---:|---:|---:|---:|
157+
| qwen-27B dense | f16-base | 21.34 | 20.82 |||||
158+
| qwen-27B dense | f16-nextn | **22.86** | **21.57** | 93.9% | 85.1% | **+7.1%** | **+3.6%** |
159+
| qwen-27B dense | turbo3-base | 19.71 | 18.74 |||||
160+
| qwen-27B dense | turbo3-nextn | **20.75** | **19.73** | 85.5% | 78.7% | **+5.3%** | **+5.3%** |
161+
| qwen-35B-A3B MoE | f16-base | 70.09 | 69.63 |||||
162+
| qwen-35B-A3B MoE | f16-nextn | **95.22** | **89.13** | 88.2% | 78.7% | **+35.8%** | **+28.0%** |
163+
| qwen-35B-A3B MoE | turbo3-base | 61.84 | 62.01 |||||
164+
| qwen-35B-A3B MoE | turbo3-nextn | **82.73** | **77.20** | 82.9% | 80.6% | **+33.8%** | **+24.5%** |
165+
166+
**Where NextN helps the most: MoE targets (qwen-35B-A3B).** Verify is heavy enough that the
167+
draft compute fully overlaps via the async pipeline; acceptance stays high (≥78%) at both
168+
prompt lengths. Wins range from **+24% (turbo3, long)** to **+36% (f16, short)**, on top of
169+
the +13% TurboQuant memory-bandwidth lift from `turbo3` KV.
170+
171+
**Dense 27B is draft-compute-bound but no longer regresses.** The NextN-layer is a full
172+
transformer block; on a dense model `t_draft ≈ 2.6× t_verify`, so the async pipeline cannot
173+
overlap it fully and the upside is bounded by accept-rate × `(t_verify / (t_verify + non-overlapped t_draft))`.
174+
With the shared-model draft path (no double mmap, no graph rebuilds across submits) we land
175+
at **+5-7% across short/long, both KV typings** — modest but consistent, and *positive*
176+
where the previous double-mmap path was negative (the old `qwen-matrix-shared` matrix logged
177+
−7.6% / −11.9% on long for f16-nextn / turbo3-nextn respectively). `turbo3` KV adds ~5% extra
178+
draft compute on this rig (Metal dequant inside NextN attention) but it is hidden in the
179+
overlap and TurboQuant's bandwidth win covers the rest.
180+
181+
### History within this branch (27B regression resolved)
182+
183+
| Bench log (mtime) | Path | 27B f16-nextn long (Δ vs f16-base) | 27B turbo3-nextn long (Δ vs turbo3-base) | Note |
184+
|---|---|---:|---:|---|
185+
| `qwen-matrix-shared-20260512-202358.md` | double mmap | −7.6 % (18.93 vs 20.49) | −11.9 % (15.72 vs 17.85) | 35B-A3B OOM on long prompts |
186+
| `qwen-matrix-fullrun-20260512-222625.md` | shared model | **+3.6 % (21.57 vs 20.82)** | **+5.3 % (19.73 vs 18.74)** | this matrix |
187+
188+
The jump came from a single architectural change: dropping the second
189+
`llama_model_load_from_file` and reusing the target's already-loaded NextN tensors via
190+
`cparams.nextn_draft = true`. Side-effects: (a) 22 GB second `MTLBuffer` gone — 35B-A3B MoE
191+
now runs without OOM and posts +24-36%; (b) draft KV cache resized only for the NextN layer
192+
(`kv_only_nextn = true` is mutated transparently in `llama_context` ctor for draft); (c) the
193+
NextN graph builder now flows through `LLM_GRAPH_TYPE_NEXTN` instead of `override_arch`.

0 commit comments

Comments
 (0)