Update documentation and scripts for AtomicChat UDT quantization and Qwen 3.6 NextN enhancements
- Revised NEXTN.md to highlight the new AtomicChat UDT collection, detailing the combined `_MTP.gguf` quants and their benefits for NextN processing.
- Updated README.md to reflect changes in recommended sources for Qwen 3.6 models, emphasizing the AtomicChat UDT collection and its features.
- Enhanced quantization scripts to support improved file handling and added compatibility for new tensor types.
- Introduced a new script for running perplexity benchmarks on UDT quant models, generating detailed performance logs.
- Improved error handling and user feedback in various scripts to streamline the quantization and benchmarking processes.
NEXTN.md: 38 additions & 17 deletions
@@ -16,18 +16,16 @@ See also `MTP.md` (Gemma) and `docs/speculative.md` for shared CLI concepts.
## 0. Pre-built model GGUFs
-
Recommended source for Qwen 3.6 combined `*_MTP.gguf` checkpoints is the
-
**unsloth** Hugging Face collection — the same files exercised in the
-
matrix bench (§7):
+
**Recommended:** the [AtomicChat — Qwen 3.6 UDT](https://huggingface.co/collections/AtomicChat/qwen-36-udt-atomicchat-6a0481f5cc5a057c07759176) collection — drop-in combined `*_MTP.gguf` quants tuned for this fork. Each repo ships Q3 / **Q4** / Q5 / Q6 / Q8 `_K_XL`, plus the `mmproj` for vision and a copy of `imatrix_unsloth.gguf_file` for reproducibility. Upstream Unsloth files keep working too — same arch metadata, same NextN tail.
**AtomicChat `UDT` (UD-Turbo)** — this fork publishes additional combined `*_MTP.gguf` quants built with Unsloth’s public MTP-aware `imatrix_unsloth.gguf_file` plus our tensor-type masks (`scripts/quantize-masks/qwen36-ud-*.txt`) for NextN / TurboQuant3-oriented quality. End-to-end recipe: **[docs/qwen-udt/RUNBOOK.md](docs/qwen-udt/RUNBOOK.md)**; HF targets: [release/qwen-udt/HF_REPOS.md](release/qwen-udt/HF_REPOS.md).
+
**Why UDT** — built on Unsloth's public MTP-aware [`imatrix_unsloth.gguf_file`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF/blob/main/imatrix_unsloth.gguf_file), then layered with this fork's tensor-type masks (see §8): every `blk.*.nextn.*` / `mtp.*` tensor pinned to `Q8_0` to preserve draft acceptance, and `attn_q` / `attn_k` lifted to `Q6_K` so the file pairs cleanly with TurboQuant3 KV. End-to-end recipe & runbook: [docs/qwen-udt/RUNBOOK.md](docs/qwen-udt/RUNBOOK.md). Attribution: Qwen team (weights), Unsloth (imatrix + BF16 sources), @TheTom (TurboQuant), AtomicChat (UDT masks + packaging).
-
Both repos ship `UD-IQ1_M` … `BF16` quants. The shared-model NextN path
+
The shared-model NextN path
works on **any** of them as long as the file contains the NextN auxiliary
head (`nextn_predict_layers > 0`) — which all `*-MTP-GGUF` quants do by
construction. `scripts/verify-qwen36-nextn-gguf.py` will refuse to load a
@@ -39,15 +37,15 @@ the same file in the HF cache and takes the shared-model branch:
```bash
# 35B-A3B MoE (headline +24-36 % cell in the matrix)
What's actually in each repo, and why it's a bit unusual for a quant drop:
+
+
- **5 quants per model, all `_MTP.gguf`** — `Q3_K_XL` / `Q4_K_XL` / `Q5_K_XL` / `Q6_K` / `Q8_K_XL`. Every file already includes the NextN auxiliary head, so the same path works for `-m` *and* `-md` — no second GGUF, no second mmap, no second tokenizer.
+
- **NextN-preserve mask (V1)** — every `blk.*.nextn.*` and `mtp.*` tensor pinned to `Q8_0`. The cost is ~10 MiB of file size; the win is that the draft head stays close to BF16 fidelity, which keeps `acceptance` high under `--spec-type nextn`. Plain UD quants compress the head at the same bit-width as the body and bleed acceptance under `turbo3` KV.
+
- **TurboQuant3-friendly mask (V2)** — attention Q/K bumped to `Q6_K`. This is the piece we tuned specifically for this fork: when KV is compressed to 3-bit via `-ctk turbo3 -ctv turbo3`, the attention scores see extra dequant noise on K, so giving Q/K a little more headroom on the weight side cancels most of it out.
+
- **Default release = V3 (V1 ∪ V2)** — the combined mask shipped on Hugging Face. V1-only and V2-only quants exist as ablation artifacts in the build tree but are not published; the V3 file simply has both lifts at once.
+
- **mmproj mirrored from Unsloth** — `mmproj-F16.gguf` and `mmproj-BF16.gguf` re-hosted byte-for-byte from the corresponding `unsloth/Qwen3.6-*-MTP-GGUF` repo so a single `-hf` line gets you target + draft + projector.
+
- **`imatrix_unsloth.gguf_file` re-hosted** — same artifact as Unsloth's (77-chunk, MTP-aware), included in each repo so the build is reproducible from a clean clone of the recipe.
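
The mask layering described in the list above can be sketched as a small shell function. This is an illustration only: the real mask files live under `scripts/quantize-masks/` and their format is not shown here, so the function name `udt_tensor_type`, the exact glob patterns, and the sample tensor names are all assumptions.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the combined V3 mask (V1 plus V2): map a tensor
# name to the quant type the mask would pin it to. The patterns are
# illustrative and are not the contents of the real mask files.
udt_tensor_type() {
  case "$1" in
    # V1: NextN / MTP tensors stay at Q8_0 to protect draft acceptance.
    blk.*.nextn.*|mtp.*) echo "Q8_0" ;;
    # V2: attention Q/K weights get Q6_K for headroom under turbo3 KV.
    *.attn_q.*|*.attn_k.*) echo "Q6_K" ;;
    # Everything else keeps the base quant type of the file.
    *) echo "base" ;;
  esac
}

# Sample tensor names (assumed for illustration).
for t in blk.40.nextn.eh_proj.weight blk.0.attn_q.weight blk.0.ffn_down.weight; do
  printf '%s -> %s\n' "$t" "$(udt_tensor_type "$t")"
done
```

Because the V1 branch is checked first, a NextN tensor would stay at `Q8_0` even if its name also matched a V2 pattern. Whether the published masks resolve overlaps in this order is not stated in the text.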
The whole pipeline (download → quantize on H100 → bench on M4 Max → upload) is scripted in [`docs/qwen-udt/RUNBOOK.md`](../docs/qwen-udt/RUNBOOK.md); re-running it on the same Unsloth sources reproduces the published files bit-identical.
README.md: 22 additions & 14 deletions
@@ -18,7 +18,7 @@ LLM inference in C/C++
## Hot topics
- **Gemma 4 MTP speculative decoding: pair a `gemma4` target with the official `gemma4_assistant` head (loaded via `--mtp-head`) for ~+30-50 % short-prompt throughput. See [MTP.md](MTP.md) and the pre-built Q4 assistant GGUFs at the [AtomicChat/Gemma 4 Assistant GGUF collection](https://huggingface.co/collections/AtomicChat/gemma-4-assistant-gguf).**
-
-**Qwen 3.6 NextN speculative decoding: point `--model-draft` at the same combined `*_MTP.gguf` and pass `--spec-type nextn` — the draft context reuses the target `llama_model` (no second mmap) and lands **+24-36 % tps** on Qwen 3.6 35B-A3B MoE, **+5-7 % tps** on Qwen 3.6 27B dense (MacBook Pro M4 Max, single-slot). See [NEXTN.md](NEXTN.md) and the pre-built combined `_MTP.gguf` quants at [`unsloth/Qwen3.6-35B-A3B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF) / [`unsloth/Qwen3.6-27B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF). Optional **AtomicChat `UDT`** quants (Unsloth imatrix + fork masks): [docs/qwen-udt/RUNBOOK.md](docs/qwen-udt/RUNBOOK.md).**
+
- **Qwen 3.6 NextN speculative decoding: point `--model-draft` at the same combined `*_MTP.gguf` and pass `--spec-type nextn` — the draft context reuses the target `llama_model` (no second mmap) and lands +24-36 % tps on Qwen 3.6 35B-A3B MoE, +5-7 % tps on Qwen 3.6 27B dense (MacBook Pro M4 Max, single-slot). See [NEXTN.md](NEXTN.md). Recommended pre-built combined `_MTP.gguf` quants live in the **[AtomicChat — Qwen 3.6 UDT](https://huggingface.co/collections/AtomicChat/qwen-36-udt-atomicchat-6a0481f5cc5a057c07759176)** collection ([27B](https://huggingface.co/AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF) · [35B-A3B](https://huggingface.co/AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF)) — built with the Unsloth public MTP-aware imatrix + fork masks that pin NextN/MTP tensors to `Q8_0` (preserves draft acceptance) and lift attention Q/K to `Q6_K` (pairs cleanly with TurboQuant3 KV); upstream sources also work: [`unsloth/Qwen3.6-35B-A3B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF) / [`unsloth/Qwen3.6-27B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF).**
- **TurboQuant KV cache & weights: WHT-rotated low-bit quantization with backend-native kernels (Metal `TurboFlash`, CUDA, Vulkan, HIP). Use `-ctk turbo3 -ctv turbo3` for ~4.3× KV compression, or quantize weights to `TQ4_1S`/`TQ3_1S`. See [Compression below](#turboquant-kv-cache--weight-compression).**
- **Hugging Face cache migration: models downloaded with `-hf` are now stored in the standard Hugging Face cache directory, enabling sharing with other HF tools.**
- **[guide : using the new WebUI of llama.cpp](https://github.com/ggml-org/llama.cpp/discussions/16938)**
@@ -198,25 +198,31 @@ Highlights:
### Pre-built model GGUFs
-
Recommended source is the **unsloth** Hugging Face collection — the same
-
combined `*_MTP.gguf` files exercised in the matrix bench. The
-
`UD-Q4_K_XL` quant is the recommended default (matches the bench cells).
+
**Recommended:** the AtomicChat **UDT** (UD-Turbo) collection — drop-in combined `_MTP.gguf` quants tuned for this fork. One repo per model, 5 quants each (Q3 / **Q4** / Q5 / Q6 / Q8 `_K_XL`), plus the `mmproj` for vision and the original Unsloth imatrix re-hosted for reproducibility:
What makes UDT different from a vanilla `llama-quantize --imatrix` run:
+
+
- **MTP-aware imatrix** — calibrated by Unsloth with the NextN head active (we re-host their public [`imatrix_unsloth.gguf_file`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF/blob/main/imatrix_unsloth.gguf_file) so you can reproduce or re-mix on top of it).
+
- **NextN-preserve mask** — every `blk.*.nextn.*` and `mtp.*` tensor pinned to `Q8_0`. Tiny size cost (~10 MiB), keeps draft acceptance high.
+
- **TurboQuant3-friendly mask** — `attn_q` / `attn_k` bumped to `Q6_K` so the file pairs cleanly with `-ctk turbo3 -ctv turbo3`.
+
- **Combined `_MTP.gguf`** — target + NextN head in one file, ready for the shared-model speculative path (`-m` and `-md` point at the same path; no second mmap).
+
- **Apache-2.0**, full attribution: Qwen team (weights), Unsloth (imatrix + BF16 sources), @TheTom (TurboQuant), AtomicChat (UDT masks + packaging).
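
As a rough illustration of the shared-model path in the list above, the sketch below only prints a launch command instead of running it. The flags `-m`, `-md`, `--spec-type nextn`, `-ctk turbo3` and `-ctv turbo3` are taken from this README; the choice of the `llama-server` binary, the helper name `nextn_cmd`, and the model path are assumptions.

```bash
#!/usr/bin/env bash
# Dry-run helper (hypothetical name): prints a NextN launch command in which
# the target (-m) and the draft (-md) point at the same combined _MTP.gguf.
nextn_cmd() {
  local model="$1"
  echo llama-server -m "$model" -md "$model" \
    --spec-type nextn -ctk turbo3 -ctv turbo3
}

# Placeholder path; substitute a real combined *_MTP.gguf file.
nextn_cmd "/path/to/model_MTP.gguf"
```

Passing the same path twice is the point of the sketch: per the text, the draft context then reuses the already loaded target model instead of mapping a second file.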
-
**AtomicChat `UDT` quants (UD-Turbo)** — optional GGUFs built with Unsloth’s public MTP-aware imatrix plus fork-specific tensor masks for NextN + TurboQuant3 (`scripts/quantize-masks/qwen36-ud-*.txt`). See **[docs/qwen-udt/RUNBOOK.md](docs/qwen-udt/RUNBOOK.md)** and [release/qwen-udt/HF_REPOS.md](release/qwen-udt/HF_REPOS.md).