Commit 8893692

Merge pull request #13 from AtomicBot-ai/b1-mtp-qwen-rebase
\
2 parents 514e600 + c7e6138 commit 8893692

22 files changed

Lines changed: 969 additions & 55 deletions

NEXTN.md

Lines changed: 60 additions & 16 deletions
@@ -16,16 +16,16 @@ See also `MTP.md` (Gemma) and `docs/speculative.md` for shared CLI concepts.
1616

1717
## 0. Pre-built model GGUFs
1818

19-
Recommended source for Qwen 3.6 combined `*_MTP.gguf` checkpoints is the
20-
**unsloth** Hugging Face collection — the same files exercised in the
21-
matrix bench (§7):
19+
**Recommended:** the [AtomicChat — Qwen 3.6 UDT](https://huggingface.co/collections/AtomicChat/qwen-36-udt-atomicchat-6a0481f5cc5a057c07759176) collection — drop-in combined `*_MTP.gguf` quants tuned for this fork. Each repo ships Q3 / **Q4** / Q5 / Q6 / Q8 `_K_XL`, plus the `mmproj` for vision and a copy of `imatrix_unsloth.gguf_file` for reproducibility. Upstream Unsloth files keep working too — same arch metadata, same NextN tail.
2220

23-
| Target | Combined `_MTP.gguf` (target + NextN head) | Recommended quant | Architecture |
21+
| Target | Recommended (AtomicChat UDT) | Upstream baseline (Unsloth) | Architecture |
2422
|---|---|---|---|
25-
| Qwen 3.6 35B-A3B (MoE) | [`unsloth/Qwen3.6-35B-A3B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF) | **`UD-Q4_K_XL`** (22.9 GB) | `qwen35moe` |
26-
| Qwen 3.6 27B (dense) | [`unsloth/Qwen3.6-27B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF) | **`UD-Q4_K_XL`** | `qwen35` |
23+
| Qwen 3.6 35B-A3B (MoE) | [`AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF`](https://huggingface.co/AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF) (`Q4_K_XL` ≈ 20.7 GiB) | [`unsloth/Qwen3.6-35B-A3B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF) | `qwen35moe` |
24+
| Qwen 3.6 27B (dense) | [`AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF`](https://huggingface.co/AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF) (`Q4_K_XL` ≈ 17.7 GiB) | [`unsloth/Qwen3.6-27B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF) | `qwen35` |
2725

28-
Both repos ship `UD-IQ1_M` through `BF16` quants. The shared-model NextN path
26+
**Why UDT** — built on Unsloth's public MTP-aware [`imatrix_unsloth.gguf_file`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF/blob/main/imatrix_unsloth.gguf_file), then layered with this fork's tensor-type masks (see §8): every `blk.*.nextn.*` / `mtp.*` tensor pinned to `Q8_0` to preserve draft acceptance, and `attn_q` / `attn_k` lifted to `Q6_K` so the file pairs cleanly with TurboQuant3 KV. End-to-end recipe & runbook: [docs/qwen-udt/RUNBOOK.md](docs/qwen-udt/RUNBOOK.md). Attribution: Qwen team (weights), Unsloth (imatrix + BF16 sources), @TheTom (TurboQuant), AtomicChat (UDT masks + packaging).
27+
28+
The shared-model NextN path
2929
works on **any** of them as long as the file contains the NextN auxiliary
3030
head (`nextn_predict_layers > 0`) — which all `*-MTP-GGUF` quants do by
3131
construction. `scripts/verify-qwen36-nextn-gguf.py` will refuse to load a
@@ -37,15 +37,15 @@ the same file in the HF cache and takes the shared-model branch:
3737
```bash
3838
# 35B-A3B MoE (headline +24-36 % cell in the matrix)
3939
llama-server \
40-
-hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
41-
-hfd unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
40+
-hf AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF:Q4_K_XL \
41+
-hfd AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF:Q4_K_XL \
4242
--spec-type nextn --draft-max 2 --draft-min 1 \
4343
-c 8192 -ngl 99 -ngld 99 -fa on
4444

4545
# 27B dense
4646
llama-server \
47-
-hf unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL \
48-
-hfd unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL \
47+
-hf AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF:Q4_K_XL \
48+
-hfd AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF:Q4_K_XL \
4949
--spec-type nextn --draft-max 2 --draft-min 1 \
5050
-c 8192 -ngl 99 -ngld 99 -fa on
5151
```
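The same shared-model branch can be reached from a local file. The following is a minimal dry-run sketch, assuming a combined `_MTP.gguf` has already been downloaded; the file path is a hypothetical placeholder, and the helper name `nextn_server_cmd` is illustrative only. It prints the command instead of running it, using only flags that appear elsewhere in this document.

```bash
#!/usr/bin/env bash
# Sketch: compose the llama-server command for the shared-model NextN path
# from one local combined _MTP.gguf. Nothing is executed; the command is printed.

# nextn_server_cmd <gguf-path>
nextn_server_cmd() {
  local gguf="$1"
  # Target (-m) and draft (-md) deliberately point at the same file,
  # which is what lets the server reuse the already-loaded model.
  local args=(
    llama-server
    -m "$gguf"
    -md "$gguf"
    --spec-type nextn --draft-max 2 --draft-min 1
    -c 8192 -ngl 99 -ngld 99 -fa on
  )
  echo "${args[*]}"
}

# Hypothetical local path; substitute the file you actually downloaded.
nextn_server_cmd "/models/qwen36-35b-a3b-udt-q4_k_xl_MTP.gguf"
```

Printing first makes it easy to confirm that `-m` and `-md` resolve to the identical path before launching the server.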
@@ -121,20 +121,21 @@ let `llama-server` pull both from Hugging Face into the local cache:
121121

122122
```bash
123123
llama-server \
124-
-hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
125-
-hfd unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
124+
-hf AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF:Q4_K_XL \
125+
-hfd AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF:Q4_K_XL \
126126
--spec-type nextn --draft-max 2 --draft-min 1
127127
```
128128

129129
---
130130

131131
## 7. Performance notes (MacBook Pro M4 Max, 40-core GPU, 48 GB, Metal)
132132

133-
Median TPS over 3 runs, prompt = 50-token instruction, `--draft-max=2 --draft-min=1`,
133+
Median TPS over 2 runs, prompt = 50-token instruction, `--draft-max=2 --draft-min=1`,
134134
NextN draft DM=2 (single async chain), context 8192. Single-slot
135135
(`--parallel 1 -np 1 --cont-batching`), full GPU offload (`-ngl 99 -ngld 99 -fa on`),
136-
shared-model draft path (no second mmap of combined `_MTP.gguf`). See
137-
`.scratch/bench-logs/qwen-matrix-fullrun-20260512-222625.md`.
136+
shared-model draft path (no second mmap of combined `_MTP.gguf`),
137+
AtomicChat **`UDT-Q4_K_XL_MTP`** file. See
138+
`.scratch/bench-logs/qwen-udt-ab-20260513-132549.md`.
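As a rough illustration of how a median-TPS cell and a relative gain could be derived from per-run numbers, here is a small sketch. The helper names `median` and `speedup_pct` and the input values are hypothetical; they are not taken from the bench log or the bench scripts.

```bash
#!/usr/bin/env bash
# Sketch: median of N runs and relative speedup versus a baseline.

# median <n1> <n2> ...: print the median of the arguments.
median() {
  printf '%s\n' "$@" | sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR % 2) print v[(NR + 1) / 2]              # odd count: middle value
      else        print (v[NR / 2] + v[NR / 2 + 1]) / 2  # even count: mean of the two middle values
    }'
}

# speedup_pct <baseline_tps> <spec_tps>: gain in percent, one decimal place.
speedup_pct() {
  awk -v b="$1" -v s="$2" 'BEGIN { printf "%.1f\n", (s / b - 1) * 100 }'
}

# Hypothetical inputs, for illustration only.
median 10 20        # two runs: median is their mean
speedup_pct 40 50   # 40 tps baseline vs 50 tps with speculation
```

With only two runs the median equals the mean of both, so a single outlier run shifts the reported cell more than it would with three runs.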
138139

139140
### Bench host
140141

@@ -191,3 +192,46 @@ The jump came from a single architectural change: dropping the second
191192
now runs without OOM and posts +24-36%; (b) draft KV cache resized only for the NextN layer
192193
(`kv_only_nextn = true` is mutated transparently in `llama_context` ctor for draft); (c) the
193194
NextN graph builder now flows through `LLM_GRAPH_TYPE_NEXTN` instead of `override_arch`.
195+
196+
---
197+
198+
## 8. UDT quantization recipe (calibration + masks)
199+
200+
**Goal:** keep Unsloth’s **MTP-aware imatrix** (public `imatrix_unsloth.gguf_file` per HF repo) while applying **AtomicChat-specific** `--tensor-type-file` overrides:
201+
202+
| File | Extra tensors vs base |
203+
|------|-------------------------|
204+
| `scripts/quantize-masks/qwen36-ud-base.txt` | `token_embd` / `output` high bit width; `attn_v` / `ffn_down` lifted; `ffn_gate_inp` for MoE |
205+
| `qwen36-ud-v1-nextn.txt` | All `blk.*.nextn.*` and `mtp.*` at `q8_0` (draft-head preservation) |
206+
| `qwen36-ud-v2-turbo3.txt` | `attn_q` / `attn_k` at `q6_K` (stack with TurboQuant3 KV) |
207+
| `qwen36-ud-v3-combined.txt` | Union of v1 + v2 (default release build) |
208+
209+
**Build entrypoints**
210+
211+
- Single quant: `scripts/quantize-qwen-udt.sh`
212+
- Full sweep: `scripts/quantize-qwen-udt-matrix.sh`
213+
- Remote / bench / HF: **[docs/qwen-udt/RUNBOOK.md](../docs/qwen-udt/RUNBOOK.md)**
214+
215+
**Note:** `UDT` filenames use `…Q4_K_XL…` as a product tag; `llama-quantize` is still invoked with family types `Q4_K_M`, `Q5_K_M`, etc.
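To make the mask layering concrete, the sketch below resolves a tensor name to its override type under the combined (v3) rules described in the table above. It is a model of the rules only: the actual on-disk format of the mask files is not shown in this document, and the function name `udt_override` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: which override the combined (v3) mask would apply to a tensor name.
# Assumption: rules are modelled as shell glob patterns, not the real mask syntax.

# udt_override <tensor-name>: print the override type, or "default" when
# the tensor falls through to the base quant type.
udt_override() {
  case "$1" in
    blk.*.nextn.*|mtp.*)   echo "q8_0" ;;   # v1: draft-head preservation
    *attn_q.*|*attn_k.*)   echo "q6_K" ;;   # v2: stack with TurboQuant3 KV
    *)                     echo "default" ;;
  esac
}

udt_override "blk.40.nextn.eh_proj.weight"
udt_override "blk.3.attn_q.weight"
udt_override "blk.3.ffn_down.weight"
```

The NextN rule is checked first in this sketch, so a tensor inside the draft head keeps `q8_0` even if its name also contains an attention component.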
216+
217+
---
218+
219+
## 9. Released artifacts — AtomicChat UDT collection
220+
221+
The recipe above ships as two ready-to-pull Hugging Face repos, grouped into one collection:
222+
223+
- Collection — [AtomicChat — Qwen 3.6 UDT](https://huggingface.co/collections/AtomicChat/qwen-36-udt-atomicchat-6a0481f5cc5a057c07759176)
224+
- 27B dense — [`AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF`](https://huggingface.co/AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF)
225+
- 35B-A3B MoE — [`AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF`](https://huggingface.co/AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF)
226+
227+
What's actually in each repo, and why it's a bit unusual for a quant drop:
228+
229+
- **5 quants per model, all `_MTP.gguf`**: `Q3_K_XL` / `Q4_K_XL` / `Q5_K_XL` / `Q6_K` / `Q8_K_XL`. Every file already includes the NextN auxiliary head, so the same path works for `-m` *and* `-md` — no second GGUF, no second mmap, no second tokenizer.
230+
- **NextN-preserve mask (V1)** — every `blk.*.nextn.*` and `mtp.*` tensor pinned to `Q8_0`. The cost is ~10 MiB of file size; the win is that the draft head stays close to BF16 fidelity, which keeps `acceptance` high under `--spec-type nextn`. Plain UD quants compress the head at the same bit-width as the body and bleed acceptance under `turbo3` KV.
231+
- **TurboQuant3-friendly mask (V2)** — attention Q/K bumped to `Q6_K`. This is the piece we tuned specifically for this fork: when KV is compressed to 3-bit via `-ctk turbo3 -ctv turbo3`, the attention scores see extra dequant noise on K, so giving Q/K a little more headroom on the weight side cancels most of it out.
232+
- **Default release = V3 (V1 ∪ V2)** — the combined mask shipped on Hugging Face. V1-only and V2-only quants exist as ablation artifacts in the build tree but are not published; the V3 file simply has both lifts at once.
233+
- **mmproj mirrored from Unsloth**: `mmproj-F16.gguf` and `mmproj-BF16.gguf` re-hosted byte-for-byte from the corresponding `unsloth/Qwen3.6-*-MTP-GGUF` repo so a single `-hf` line gets you target + draft + projector.
234+
- **`imatrix_unsloth.gguf_file` re-hosted** — same artifact as Unsloth's (77-chunk, MTP-aware), included in each repo so the build is reproducible from a clean clone of the recipe.
235+
- **Apache-2.0**, attribution: Qwen team (weights), Unsloth (imatrix + BF16 sources), [@TheTom](https://github.com/TheTom) (TurboQuant), AtomicChat (UDT masks + packaging). Fork: [`AtomicBot-ai/atomic-llama-cpp-turboquant`](https://github.com/AtomicBot-ai/atomic-llama-cpp-turboquant).
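For a sense of scale, the sizes quoted in this file can be put side by side with a little arithmetic. The sketch below converts the `Q4_K_XL` sizes from §0 (given in GiB) to decimal GB and expresses the roughly 10 MiB NextN-preserve cost as a share of the file. The inputs are the approximate figures stated in this document, and the helper names are illustrative.

```bash
#!/usr/bin/env bash
# Sketch: unit conversion and relative overhead for the quoted file sizes.

# gib_to_gb <GiB>: binary gibibytes to decimal gigabytes, two decimals.
gib_to_gb() {
  awk -v g="$1" 'BEGIN { printf "%.2f\n", g * 1024 * 1024 * 1024 / 1e9 }'
}

# overhead_pct <extra MiB> <file GiB>: extra size as a percentage of the file.
overhead_pct() {
  awk -v m="$1" -v g="$2" 'BEGIN { printf "%.3f\n", m / (g * 1024) * 100 }'
}

gib_to_gb 20.7          # 35B-A3B Q4_K_XL -> 22.23
gib_to_gb 17.7          # 27B Q4_K_XL     -> 19.01
overhead_pct 10 20.7    # -> 0.047 (percent)
overhead_pct 10 17.7    # -> 0.055 (percent)
```

On these figures the pinned draft head adds on the order of 0.05 % to the file, which is consistent with describing the size cost as small.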
236+
237+
The whole pipeline (download → quantize on H100 → bench on M4 Max → upload) is scripted in [`docs/qwen-udt/RUNBOOK.md`](../docs/qwen-udt/RUNBOOK.md); re-running it on the same Unsloth sources reproduces the published files bit-identical.

README.md

Lines changed: 23 additions & 13 deletions
@@ -18,7 +18,7 @@ LLM inference in C/C++
1818
## Hot topics
1919

2020
- **Gemma 4 MTP speculative decoding: pair a `gemma4` target with the official `gemma4_assistant` head (loaded via `--mtp-head`) for ~+30-50 % short-prompt throughput. See [MTP.md](MTP.md) and the pre-built Q4 assistant GGUFs at the [AtomicChat/Gemma 4 Assistant GGUF collection](https://huggingface.co/collections/AtomicChat/gemma-4-assistant-gguf).**
21-
- **Qwen 3.6 NextN speculative decoding: point `--model-draft` at the same combined `*_MTP.gguf` and pass `--spec-type nextn` — the draft context reuses the target `llama_model` (no second mmap) and lands **+24-36 % tps** on Qwen 3.6 35B-A3B MoE, **+5-7 % tps** on Qwen 3.6 27B dense (MacBook Pro M4 Max, single-slot). See [NEXTN.md](NEXTN.md) and the pre-built combined `_MTP.gguf` quants at [`unsloth/Qwen3.6-35B-A3B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF) / [`unsloth/Qwen3.6-27B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF).**
21+
- **Qwen 3.6 NextN speculative decoding: point `--model-draft` at the same combined `*_MTP.gguf` and pass `--spec-type nextn` — the draft context reuses the target `llama_model` (no second mmap) and lands +24-36 % tps on Qwen 3.6 35B-A3B MoE, +5-7 % tps on Qwen 3.6 27B dense (MacBook Pro M4 Max, single-slot). See [NEXTN.md](NEXTN.md). Recommended pre-built combined `_MTP.gguf` quants live in the **[AtomicChat — Qwen 3.6 UDT](https://huggingface.co/collections/AtomicChat/qwen-36-udt-atomicchat-6a0481f5cc5a057c07759176)** collection ([27B](https://huggingface.co/AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF) · [35B-A3B](https://huggingface.co/AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF)) — built with the Unsloth public MTP-aware imatrix + fork masks that pin NextN/MTP tensors to `Q8_0` (preserves draft acceptance) and lift attention Q/K to `Q6_K` (pairs cleanly with TurboQuant3 KV); upstream sources also work: [`unsloth/Qwen3.6-35B-A3B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF) / [`unsloth/Qwen3.6-27B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF).**
2222
- **TurboQuant KV cache & weights: WHT-rotated low-bit quantization with backend-native kernels (Metal `TurboFlash`, CUDA, Vulkan, HIP). Use `-ctk turbo3 -ctv turbo3` for ~4.3× KV compression, or quantize weights to `TQ4_1S`/`TQ3_1S`. See [Compression below](#turboquant-kv-cache--weight-compression).**
2323
- **Hugging Face cache migration: models downloaded with `-hf` are now stored in the standard Hugging Face cache directory, enabling sharing with other HF tools.**
2424
- **[guide : using the new WebUI of llama.cpp](https://github.com/ggml-org/llama.cpp/discussions/16938)**
@@ -198,23 +198,31 @@ Highlights:
198198

199199
### Pre-built model GGUFs
200200

201-
Recommended source is the **unsloth** Hugging Face collection — the same
202-
combined `*_MTP.gguf` files exercised in the matrix bench. The
203-
`UD-Q4_K_XL` quant is the recommended default (matches the bench cells).
201+
**Recommended:** the AtomicChat **UDT** (UD-Turbo) collection — drop-in combined `_MTP.gguf` quants tuned for this fork. One repo per model, 5 quants each (Q3 / **Q4** / Q5 / Q6 / Q8 `_K_XL`), plus the `mmproj` for vision and the original Unsloth imatrix re-hosted for reproducibility:
204202

205-
| Target | Combined `_MTP.gguf` (target + NextN head) |
206-
|---|---|
207-
| Qwen 3.6 35B-A3B (MoE) | [`unsloth/Qwen3.6-35B-A3B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF) |
208-
| Qwen 3.6 27B (dense) | [`unsloth/Qwen3.6-27B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF) |
203+
| Target | Recommended (AtomicChat UDT) | Upstream baseline (Unsloth) |
204+
|---|---|---|
205+
| Qwen 3.6 35B-A3B (MoE) | [`AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF`](https://huggingface.co/AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF) | [`unsloth/Qwen3.6-35B-A3B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF) |
206+
| Qwen 3.6 27B (dense) | [`AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF`](https://huggingface.co/AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF) | [`unsloth/Qwen3.6-27B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF) |
207+
208+
What makes UDT different from a vanilla `llama-quantize -imatrix` run:
209+
210+
- **MTP-aware imatrix** — calibrated by Unsloth with the NextN head active (we re-host their public [`imatrix_unsloth.gguf_file`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF/blob/main/imatrix_unsloth.gguf_file) so you can reproduce or re-mix on top of it).
211+
- **NextN-preserve mask** — every `blk.*.nextn.*` and `mtp.*` tensor pinned to `Q8_0`. Tiny size cost (~10 MiB), keeps draft acceptance high.
212+
- **TurboQuant3-friendly mask**: `attn_q` / `attn_k` bumped to `Q6_K` so the file pairs cleanly with `-ctk turbo3 -ctv turbo3`.
213+
- **Combined `_MTP.gguf`** — target + NextN head in one file, ready for the shared-model speculative path (`-m` and `-md` point at the same path; no second mmap).
214+
- **Apache-2.0**, full attribution: Qwen team (weights), Unsloth (imatrix + BF16 sources), @TheTom (TurboQuant), AtomicChat (UDT masks + packaging).
215+
216+
Collection: [AtomicChat — Qwen 3.6 UDT](https://huggingface.co/collections/AtomicChat/qwen-36-udt-atomicchat-6a0481f5cc5a057c07759176). Full recipe & runbook: [docs/qwen-udt/RUNBOOK.md](docs/qwen-udt/RUNBOOK.md). Mask files: [`scripts/quantize-masks/qwen36-ud-{base,v1-nextn,v2-turbo3,v3-combined}.txt`](scripts/quantize-masks).
209217

210218
### Quick start
211219

212220
```bash
213221
# Pull both target (-hf) and draft (-hfd) from the same HF combined _MTP.gguf;
214222
# they resolve to the same cached file → the server takes the shared-model branch.
215223
llama-server \
216-
-hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
217-
-hfd unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
224+
-hf AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF:Q4_K_XL \
225+
-hfd AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF:Q4_K_XL \
218226
--spec-type nextn \
219227
--draft-max 2 --draft-min 1 \
220228
-c 8192 \
@@ -760,13 +768,15 @@ To learn more about model quantization, [read this documentation](tools/quantize
760768
35B-A3B MoE the combination is **+24-36 % tps** vs the same target
761769
without speculation.
762770

763-
Pre-built combined `_MTP.gguf` quants (recommended **`UD-Q4_K_XL`**,
771+
Pre-built combined `_MTP.gguf` quants (recommended **`Q4_K_XL`**,
764772
matches the matrix bench cells):
765773

766774
| Target | Combined `_MTP.gguf` |
767775
|---|---|
768-
| Qwen 3.6 35B-A3B (MoE) | [`unsloth/Qwen3.6-35B-A3B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF) |
769-
| Qwen 3.6 27B (dense) | [`unsloth/Qwen3.6-27B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF) |
776+
| Qwen 3.6 35B-A3B (MoE) — AtomicChat UDT | [`AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF`](https://huggingface.co/AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF) |
777+
| Qwen 3.6 27B (dense) — AtomicChat UDT | [`AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF`](https://huggingface.co/AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF) |
778+
| Qwen 3.6 35B-A3B (MoE) — Unsloth | [`unsloth/Qwen3.6-35B-A3B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF) |
779+
| Qwen 3.6 27B (dense) — Unsloth | [`unsloth/Qwen3.6-27B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF) |
770780

771781
```bash
772782
# Pull both target (-hf) and draft (-hfd) from the same HF combined _MTP.gguf.
