AtomicBot-ai
diff --git a/NEXTN.md b/NEXTN.md
Lines changed: 60 additions & 16 deletions
diff --git a/README.md b/README.md
Lines changed: 23 additions & 13 deletions
@@ -16,16 +16,16 @@ See also `MTP.md` (Gemma) and `docs/speculative.md` for shared CLI concepts.
 
 ## 0. Pre-built model GGUFs
 
-Recommended source for Qwen 3.6 combined `*_MTP.gguf` checkpoints is the
-**unsloth** Hugging Face collection — the same files exercised in the
-matrix bench (§7):
+**Recommended:** the [AtomicChat — Qwen 3.6 UDT](https://huggingface.co/collections/AtomicChat/qwen-36-udt-atomicchat-6a0481f5cc5a057c07759176) collection — drop-in combined `*_MTP.gguf` quants tuned for this fork. Each repo ships Q3 / **Q4** / Q5 / Q6 / Q8 `_K_XL`, plus the `mmproj` for vision and a copy of `imatrix_unsloth.gguf_file` for reproducibility. Upstream Unsloth files keep working too — same arch metadata, same NextN tail.
 
-| Target | Combined `_MTP.gguf` (target + NextN head) | Recommended quant | Architecture |
+| Target | Recommended (AtomicChat UDT) | Upstream baseline (Unsloth) | Architecture |
 |---|---|---|---|
-| Qwen 3.6 35B-A3B (MoE) | [`unsloth/Qwen3.6-35B-A3B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF) | **`UD-Q4_K_XL`** (22.9 GB) | `qwen35moe` |
-| Qwen 3.6 27B (dense)   | [`unsloth/Qwen3.6-27B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF) | **`UD-Q4_K_XL`** | `qwen35` |
+| Qwen 3.6 35B-A3B (MoE) | [`AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF`](https://huggingface.co/AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF) (`Q4_K_XL` ≈ 20.7 GiB) | [`unsloth/Qwen3.6-35B-A3B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF) | `qwen35moe` |
+| Qwen 3.6 27B (dense)   | [`AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF`](https://huggingface.co/AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF) (`Q4_K_XL` ≈ 17.7 GiB)             | [`unsloth/Qwen3.6-27B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF)             | `qwen35`    |
 
-Both repos ship `UD-IQ1_M` … `BF16` quants. The shared-model NextN path
+**Why UDT** — built on Unsloth's public MTP-aware [`imatrix_unsloth.gguf_file`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF/blob/main/imatrix_unsloth.gguf_file), then layered with this fork's tensor-type masks (see §8): every `blk.*.nextn.*` / `mtp.*` tensor pinned to `Q8_0` to preserve draft acceptance, and `attn_q` / `attn_k` lifted to `Q6_K` so the file pairs cleanly with TurboQuant3 KV. End-to-end recipe & runbook: [docs/qwen-udt/RUNBOOK.md](docs/qwen-udt/RUNBOOK.md). Attribution: Qwen team (weights), Unsloth (imatrix + BF16 sources), @TheTom (TurboQuant), AtomicChat (UDT masks + packaging).
+
+The shared-model NextN path
 works on **any** of them as long as the file contains the NextN auxiliary
 head (`nextn_predict_layers > 0`) — which all `*-MTP-GGUF` quants do by
 construction. `scripts/verify-qwen36-nextn-gguf.py` will refuse to load a
@@ -37,15 +37,15 @@ the same file in the HF cache and takes the shared-model branch:
 ```bash
 # 35B-A3B MoE (headline +24-36 % cell in the matrix)
 llama-server \
-  -hf  unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
-  -hfd unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
+  -hf  AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF:Q4_K_XL \
+  -hfd AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF:Q4_K_XL \
   --spec-type nextn --draft-max 2 --draft-min 1 \
   -c 8192 -ngl 99 -ngld 99 -fa on
 
 # 27B dense
 llama-server \
-  -hf  unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL \
-  -hfd unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL \
+  -hf  AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF:Q4_K_XL \
+  -hfd AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF:Q4_K_XL \
   --spec-type nextn --draft-max 2 --draft-min 1 \
   -c 8192 -ngl 99 -ngld 99 -fa on
 ```
@@ -121,20 +121,21 @@ let `llama-server` pull both from Hugging Face into the local cache:
 
 ```bash
 llama-server \
-  -hf  unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
-  -hfd unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
+  -hf  AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF:Q4_K_XL \
+  -hfd AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF:Q4_K_XL \
   --spec-type nextn --draft-max 2 --draft-min 1
 ```
 
 ---
 
 ## 7. Performance notes (MacBook Pro M4 Max, 40-core GPU, 48 GB, Metal)
 
-Median TPS over 3 runs, prompt = 50-token instruction, `--draft-max=2 --draft-min=1`,
+Median TPS over 2 runs, prompt = 50-token instruction, `--draft-max=2 --draft-min=1`,
 NextN draft DM=2 (single async chain), context 8192. Single-slot
 (`--parallel 1 -np 1 --cont-batching`), full GPU offload (`-ngl 99 -ngld 99 -fa on`),
-shared-model draft path (no second mmap of combined `_MTP.gguf`). See
-`.scratch/bench-logs/qwen-matrix-fullrun-20260512-222625.md`.
+shared-model draft path (no second mmap of combined `_MTP.gguf`),
+AtomicChat **`UDT-Q4_K_XL_MTP`** file. See
+`.scratch/bench-logs/qwen-udt-ab-20260513-132549.md`.
 
 ### Bench host
 
@@ -191,3 +192,46 @@ The jump came from a single architectural change: dropping the second
 now runs without OOM and posts +24-36%; (b) draft KV cache resized only for the NextN layer
 (`kv_only_nextn = true` is mutated transparently in `llama_context` ctor for draft); (c) the
 NextN graph builder now flows through `LLM_GRAPH_TYPE_NEXTN` instead of `override_arch`.
+
+---
+
+## 8. UDT quantization recipe (calibration + masks)
+
+**Goal:** keep Unsloth’s **MTP-aware imatrix** (public `imatrix_unsloth.gguf_file` per HF repo) while applying **AtomicChat-specific** `--tensor-type-file` overrides:
+
+| Mask file (under `scripts/quantize-masks/`) | Overrides (v1 to v3: additions on top of base) |
+|---|---|
+| `qwen36-ud-base.txt` | `token_embd` / `output` at high bit width; `attn_v` / `ffn_down` lifted; `ffn_gate_inp` for MoE |
+| `qwen36-ud-v1-nextn.txt` | All `blk.*.nextn.*` and `mtp.*` at `q8_0` (draft-head preservation) |
+| `qwen36-ud-v2-turbo3.txt` | `attn_q` / `attn_k` at `q6_K` (stack with TurboQuant3 KV) |
+| `qwen36-ud-v3-combined.txt` | Union of v1 + v2 (default release build) |
+
+**Build entrypoints**
+
+- Single quant: `scripts/quantize-qwen-udt.sh`
+- Full sweep: `scripts/quantize-qwen-udt-matrix.sh`
+- Remote / bench / HF: **[docs/qwen-udt/RUNBOOK.md](docs/qwen-udt/RUNBOOK.md)**
+
+**Note:** `UDT` filenames use `…Q4_K_XL…` as a product tag; `llama-quantize` is still invoked with family types `Q4_K_M`, `Q5_K_M`, etc.
+
+---
+
+## 9. Released artifacts — AtomicChat UDT collection
+
+The recipe above ships as two ready-to-pull Hugging Face repos, grouped into one collection:
+
+- Collection — [AtomicChat — Qwen 3.6 UDT](https://huggingface.co/collections/AtomicChat/qwen-36-udt-atomicchat-6a0481f5cc5a057c07759176)
+- 27B dense — [`AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF`](https://huggingface.co/AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF)
+- 35B-A3B MoE — [`AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF`](https://huggingface.co/AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF)
+
+What's actually in each repo, and why it's a bit unusual for a quant drop:
+
+- **5 quants per model, all `_MTP.gguf`** — `Q3_K_XL` / `Q4_K_XL` / `Q5_K_XL` / `Q6_K` / `Q8_K_XL`. Every file already includes the NextN auxiliary head, so the same path works for `-m` *and* `-md` — no second GGUF, no second mmap, no second tokenizer.
+- **NextN-preserve mask (V1)** — every `blk.*.nextn.*` and `mtp.*` tensor pinned to `Q8_0`. The cost is ~10 MiB of file size; the win is that the draft head stays close to BF16 fidelity, which keeps `acceptance` high under `--spec-type nextn`. Plain UD quants compress the head at the same bit-width as the body and bleed acceptance under `turbo3` KV.
+- **TurboQuant3-friendly mask (V2)** — attention Q/K bumped to `Q6_K`. This is the piece we tuned specifically for this fork: when KV is compressed to 3-bit via `-ctk turbo3 -ctv turbo3`, the attention scores see extra dequant noise on K, so giving Q/K a little more headroom on the weight side cancels most of it out.
+- **Default release = V3 (V1 ∪ V2)** — the combined mask shipped on Hugging Face. V1-only and V2-only quants exist as ablation artifacts in the build tree but are not published; the V3 file simply has both lifts at once.
+- **mmproj mirrored from Unsloth** — `mmproj-F16.gguf` and `mmproj-BF16.gguf` re-hosted byte-for-byte from the corresponding `unsloth/Qwen3.6-*-MTP-GGUF` repo so a single `-hf` line gets you target + draft + projector.
+- **`imatrix_unsloth.gguf_file` re-hosted** — same artifact as Unsloth's (77-chunk, MTP-aware), included in each repo so the build is reproducible from a clean clone of the recipe.
+- **Apache-2.0**, attribution: Qwen team (weights), Unsloth (imatrix + BF16 sources), [@TheTom](https://github.com/TheTom) (TurboQuant), AtomicChat (UDT masks + packaging). Fork: [`AtomicBot-ai/atomic-llama-cpp-turboquant`](https://github.com/AtomicBot-ai/atomic-llama-cpp-turboquant).
+
+The whole pipeline (download → quantize on H100 → bench on M4 Max → upload) is scripted in [`docs/qwen-udt/RUNBOOK.md`](docs/qwen-udt/RUNBOOK.md); re-running it on the same Unsloth sources reproduces the published files bit-for-bit.
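
The "V3 = V1 ∪ V2" layering described in §8 and §9 can be sketched in a few lines of shell. This is only an illustration under stated assumptions: it assumes a mask file holds one `pattern=type` entry per line, the sample entries are made up for the example, and `union_masks` / `check_nextn_pinned` are hypothetical helper names. The real files under `scripts/quantize-masks/` may use a different layout.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Scratch directory for the made-up sample masks; removed on exit.
workdir="$(mktemp -d)"
trap 'rm -rf "$workdir"' EXIT

# ASSUMPTION: one `pattern=type` entry per line. These entries are
# illustrative only, not copied from the repository's mask files.
cat > "$workdir/v1-nextn.txt" <<'EOF'
blk\..*\.nextn\..*=q8_0
mtp\..*=q8_0
EOF

cat > "$workdir/v2-turbo3.txt" <<'EOF'
attn_q=q6_K
attn_k=q6_K
EOF

# Hypothetical helper: union of mask files. Skips blank lines and drops
# exact duplicate entries while keeping first-seen order.
union_masks() {
  awk 'NF && !seen[$0]++' "$@"
}

# Hypothetical helper: exit non-zero if any entry mentioning nextn or mtp
# is assigned a type other than q8_0 (case-insensitive).
check_nextn_pinned() {
  awk -F= '/nextn|mtp/ && tolower($NF) != "q8_0" { bad = 1 } END { exit bad }' "$1"
}

union_masks "$workdir/v1-nextn.txt" "$workdir/v2-turbo3.txt" > "$workdir/v3-combined.txt"
check_nextn_pinned "$workdir/v3-combined.txt"
echo "combined mask OK: nextn/mtp entries pinned to q8_0"
```

A check like this could run before a quantization sweep to catch a mask edit that accidentally lowers the draft head below `Q8_0`, which is the property the V1 mask exists to protect.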
@@ -18,7 +18,7 @@ LLM inference in C/C++
 ## Hot topics
 
 - **Gemma 4 MTP speculative decoding: pair a `gemma4` target with the official `gemma4_assistant` head (loaded via `--mtp-head`) for ~+30-50 % short-prompt throughput. See [MTP.md](MTP.md) and the pre-built Q4 assistant GGUFs at the [AtomicChat/Gemma 4 Assistant GGUF collection](https://huggingface.co/collections/AtomicChat/gemma-4-assistant-gguf).**
-- **Qwen 3.6 NextN speculative decoding: point `--model-draft` at the same combined `*_MTP.gguf` and pass `--spec-type nextn` — the draft context reuses the target `llama_model` (no second mmap) and lands **+24-36 % tps** on Qwen 3.6 35B-A3B MoE, **+5-7 % tps** on Qwen 3.6 27B dense (MacBook Pro M4 Max, single-slot). See [NEXTN.md](NEXTN.md) and the pre-built combined `_MTP.gguf` quants at [`unsloth/Qwen3.6-35B-A3B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF) / [`unsloth/Qwen3.6-27B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF).**
+- **Qwen 3.6 NextN speculative decoding: point `--model-draft` at the same combined `*_MTP.gguf` and pass `--spec-type nextn`. The draft context reuses the target `llama_model` (no second mmap) and lands +24-36 % tps on Qwen 3.6 35B-A3B MoE, +5-7 % tps on Qwen 3.6 27B dense (MacBook Pro M4 Max, single-slot). See [NEXTN.md](NEXTN.md). Recommended pre-built combined `_MTP.gguf` quants live in the [AtomicChat — Qwen 3.6 UDT](https://huggingface.co/collections/AtomicChat/qwen-36-udt-atomicchat-6a0481f5cc5a057c07759176) collection ([27B](https://huggingface.co/AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF) · [35B-A3B](https://huggingface.co/AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF)), built with the Unsloth public MTP-aware imatrix plus fork masks that pin NextN/MTP tensors to `Q8_0` (preserves draft acceptance) and lift attention Q/K to `Q6_K` (pairs cleanly with TurboQuant3 KV). Upstream sources also work: [`unsloth/Qwen3.6-35B-A3B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF) / [`unsloth/Qwen3.6-27B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF).**
 - **TurboQuant KV cache & weights: WHT-rotated low-bit quantization with backend-native kernels (Metal `TurboFlash`, CUDA, Vulkan, HIP). Use `-ctk turbo3 -ctv turbo3` for ~4.3× KV compression, or quantize weights to `TQ4_1S`/`TQ3_1S`. See [Compression below](#turboquant-kv-cache--weight-compression).**
 - **Hugging Face cache migration: models downloaded with `-hf` are now stored in the standard Hugging Face cache directory, enabling sharing with other HF tools.**
 - **[guide : using the new WebUI of llama.cpp](https://github.com/ggml-org/llama.cpp/discussions/16938)**
@@ -198,23 +198,31 @@ Highlights:
 
 ### Pre-built model GGUFs
 
-Recommended source is the **unsloth** Hugging Face collection — the same
-combined `*_MTP.gguf` files exercised in the matrix bench. The
-`UD-Q4_K_XL` quant is the recommended default (matches the bench cells).
+**Recommended:** the AtomicChat **UDT** (UD-Turbo) collection — drop-in combined `_MTP.gguf` quants tuned for this fork. One repo per model, 5 quants each (Q3 / **Q4** / Q5 / Q6 / Q8 `_K_XL`), plus the `mmproj` for vision and the original Unsloth imatrix re-hosted for reproducibility:
 
-| Target | Combined `_MTP.gguf` (target + NextN head) |
-|---|---|
-| Qwen 3.6 35B-A3B (MoE) | [`unsloth/Qwen3.6-35B-A3B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF) |
-| Qwen 3.6 27B (dense)   | [`unsloth/Qwen3.6-27B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF) |
+| Target | Recommended (AtomicChat UDT) | Upstream baseline (Unsloth) |
+|---|---|---|
+| Qwen 3.6 35B-A3B (MoE) | [`AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF`](https://huggingface.co/AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF) | [`unsloth/Qwen3.6-35B-A3B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF) |
+| Qwen 3.6 27B (dense)   | [`AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF`](https://huggingface.co/AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF)         | [`unsloth/Qwen3.6-27B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF)         |
+
+What makes UDT different from a vanilla `llama-quantize --imatrix` run:
+
+- **MTP-aware imatrix** — calibrated by Unsloth with the NextN head active (we re-host their public [`imatrix_unsloth.gguf_file`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF/blob/main/imatrix_unsloth.gguf_file) so you can reproduce or re-mix on top of it).
+- **NextN-preserve mask** — every `blk.*.nextn.*` and `mtp.*` tensor pinned to `Q8_0`. Tiny size cost (~10 MiB), keeps draft acceptance high.
+- **TurboQuant3-friendly mask** — `attn_q` / `attn_k` bumped to `Q6_K` so the file pairs cleanly with `-ctk turbo3 -ctv turbo3`.
+- **Combined `_MTP.gguf`** — target + NextN head in one file, ready for the shared-model speculative path (`-m` and `-md` point at the same path; no second mmap).
+- **Apache-2.0**, full attribution: Qwen team (weights), Unsloth (imatrix + BF16 sources), @TheTom (TurboQuant), AtomicChat (UDT masks + packaging).
+
+Collection: [AtomicChat — Qwen 3.6 UDT](https://huggingface.co/collections/AtomicChat/qwen-36-udt-atomicchat-6a0481f5cc5a057c07759176). Full recipe & runbook: [docs/qwen-udt/RUNBOOK.md](docs/qwen-udt/RUNBOOK.md). Mask files: [`scripts/quantize-masks/qwen36-ud-{base,v1-nextn,v2-turbo3,v3-combined}.txt`](scripts/quantize-masks).
 
 ### Quick start
 
 ```bash
 # Pull both target (-hf) and draft (-hfd) from the same HF combined _MTP.gguf;
 # they resolve to the same cached file → the server takes the shared-model branch.
 llama-server \
-  -hf  unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
-  -hfd unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
+  -hf  AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF:Q4_K_XL \
+  -hfd AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF:Q4_K_XL \
   --spec-type nextn \
   --draft-max 2 --draft-min 1 \
   -c 8192 \
@@ -760,13 +768,15 @@ To learn more about model quantization, [read this documentation](tools/quantize
     35B-A3B MoE the combination is **+24-36 % tps** vs the same target
     without speculation.
 
-    Pre-built combined `_MTP.gguf` quants (recommended **`UD-Q4_K_XL`**,
+    Pre-built combined `_MTP.gguf` quants (recommended **`Q4_K_XL`**,
     matches the matrix bench cells):
 
     | Target | Combined `_MTP.gguf` |
     |---|---|
-    | Qwen 3.6 35B-A3B (MoE) | [`unsloth/Qwen3.6-35B-A3B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF) |
-    | Qwen 3.6 27B (dense)   | [`unsloth/Qwen3.6-27B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF) |
+    | Qwen 3.6 35B-A3B (MoE) — AtomicChat UDT | [`AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF`](https://huggingface.co/AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF) |
+    | Qwen 3.6 27B (dense)   — AtomicChat UDT | [`AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF`](https://huggingface.co/AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF) |
+    | Qwen 3.6 35B-A3B (MoE) — Unsloth        | [`unsloth/Qwen3.6-35B-A3B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF) |
+    | Qwen 3.6 27B (dense)   — Unsloth        | [`unsloth/Qwen3.6-27B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF) |
 
     ```bash
     # Pull both target (-hf) and draft (-hfd) from the same HF combined _MTP.gguf.