Update documentation and scripts for AtomicChat UDT quantization and Qwen 3.6 NextN enhancements

Ooooze · Ooooze · commit c7e6138971c0 · 2026-05-13T17:33:18.000+03:00
- Revised NEXTN.md to highlight the new AtomicChat UDT collection, detailing the combined `_MTP.gguf` quants and their benefits for NextN processing.
- Updated README.md to reflect changes in recommended sources for Qwen 3.6 models, emphasizing the AtomicChat UDT collection and its features.
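- Extended `scripts/quantize-qwen-udt.sh`: BF16 inputs are now also looked up under a `BF16/` subdirectory, `INP` is initialized from `BF16_INPUT` in the 35B-A3B branch, and `Q8_0` is accepted as an ftype (tagged `Q8_K_XL`, while `Q6_K` keeps its own tag).
- Introduced a new script, `scripts/qwen-udt/ppl-matrix-remote.sh`, for running perplexity benchmarks on UDT quant models and generating detailed logs.
- Improved error handling and user feedback across the quantization and benchmarking scripts; for example, `scripts/qwen-udt/hf-download-sources.sh` now uses `hf` when available, falls back to the older `huggingface-cli`, and exits with an error only if neither is installed.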
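
The new ftype-to-filename-tag mapping in `scripts/quantize-qwen-udt.sh` can be checked in isolation. The sketch below wraps the same `case` logic in a helper named `udt_tag`. That name is hypothetical and does not exist in the repo, and the loop is only there to print the mapping.

```bash
#!/usr/bin/env bash
# Sketch only: udt_tag is a hypothetical helper, not a function in the repo.
# It mirrors the FTYPE -> XL_TAG case statement added in this commit.
udt_tag() {
  local ftype="$1"
  case "$ftype" in
    # K_M family types are renamed to the _K_XL product tag.
    Q3_K_M|Q4_K_M|Q5_K_M) printf '%s\n' "${ftype/_K_M/_K_XL}" ;;
    # Q6_K keeps its own name.
    Q6_K)                 printf '%s\n' "Q6_K" ;;
    # Q8_0 is published under the Q8_K_XL tag.
    Q8_0)                 printf '%s\n' "Q8_K_XL" ;;
    *) echo "error: unsupported ftype '$ftype'" >&2; return 1 ;;
  esac
}

# Print the tag for every ftype the script accepts.
for f in Q3_K_M Q4_K_M Q5_K_M Q6_K Q8_0; do
  printf '%s -> %s\n' "$f" "$(udt_tag "$f")"
done
```

Keeping `Q6_K` as its own tag matches the released file list in NEXTN.md §9, where the Q6 quant is published as `Q6_K` rather than `Q6_K_XL`.
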
diff --git a/NEXTN.md b/NEXTN.md
@@ -16,18 +16,16 @@ See also `MTP.md` (Gemma) and `docs/speculative.md` for shared CLI concepts.
 
 ## 0. Pre-built model GGUFs
 
-Recommended source for Qwen 3.6 combined `*_MTP.gguf` checkpoints is the
-**unsloth** Hugging Face collection — the same files exercised in the
-matrix bench (§7):
+**Recommended:** the [AtomicChat — Qwen 3.6 UDT](https://huggingface.co/collections/AtomicChat/qwen-36-udt-atomicchat-6a0481f5cc5a057c07759176) collection, drop-in combined `*_MTP.gguf` quants tuned for this fork. Each repo ships five quants (`Q3_K_XL` / **`Q4_K_XL`** / `Q5_K_XL` / `Q6_K` / `Q8_K_XL`), plus the `mmproj` for vision and a copy of `imatrix_unsloth.gguf_file` for reproducibility. Upstream Unsloth files keep working too: same arch metadata, same NextN tail.
 
-| Target | Combined `_MTP.gguf` (target + NextN head) | Recommended quant | Architecture |
+| Target | Recommended (AtomicChat UDT) | Upstream baseline (Unsloth) | Architecture |
 |---|---|---|---|
-| Qwen 3.6 35B-A3B (MoE) | [`unsloth/Qwen3.6-35B-A3B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF) | **`UD-Q4_K_XL`** (22.9 GB) | `qwen35moe` |
-| Qwen 3.6 27B (dense)   | [`unsloth/Qwen3.6-27B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF) | **`UD-Q4_K_XL`** | `qwen35` |
+| Qwen 3.6 35B-A3B (MoE) | [`AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF`](https://huggingface.co/AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF) (`Q4_K_XL` ≈ 20.7 GiB) | [`unsloth/Qwen3.6-35B-A3B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF) | `qwen35moe` |
+| Qwen 3.6 27B (dense)   | [`AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF`](https://huggingface.co/AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF) (`Q4_K_XL` ≈ 17.7 GiB)             | [`unsloth/Qwen3.6-27B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF)             | `qwen35`    |
 
-**AtomicChat `UDT` (UD-Turbo)** — this fork publishes additional combined `*_MTP.gguf` quants built with Unsloth’s public MTP-aware `imatrix_unsloth.gguf_file` plus our tensor-type masks (`scripts/quantize-masks/qwen36-ud-*.txt`) for NextN / TurboQuant3-oriented quality. End-to-end recipe: **[docs/qwen-udt/RUNBOOK.md](docs/qwen-udt/RUNBOOK.md)**; HF targets: [release/qwen-udt/HF_REPOS.md](release/qwen-udt/HF_REPOS.md).
+**Why UDT** — built on Unsloth's public MTP-aware [`imatrix_unsloth.gguf_file`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF/blob/main/imatrix_unsloth.gguf_file), then layered with this fork's tensor-type masks (see §8): every `blk.*.nextn.*` / `mtp.*` tensor pinned to `Q8_0` to preserve draft acceptance, and `attn_q` / `attn_k` lifted to `Q6_K` so the file pairs cleanly with TurboQuant3 KV. End-to-end recipe & runbook: [docs/qwen-udt/RUNBOOK.md](docs/qwen-udt/RUNBOOK.md). Attribution: Qwen team (weights), Unsloth (imatrix + BF16 sources), @TheTom (TurboQuant), AtomicChat (UDT masks + packaging).
 
-Both repos ship `UD-IQ1_M` … `BF16` quants. The shared-model NextN path
+The shared-model NextN path
 works on **any** of them as long as the file contains the NextN auxiliary
 head (`nextn_predict_layers > 0`) — which all `*-MTP-GGUF` quants do by
 construction. `scripts/verify-qwen36-nextn-gguf.py` will refuse to load a
@@ -39,15 +37,15 @@ the same file in the HF cache and takes the shared-model branch:
 ```bash
 # 35B-A3B MoE (headline +24-36 % cell in the matrix)
 llama-server \
-  -hf  unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
-  -hfd unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
+  -hf  AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF:Q4_K_XL \
+  -hfd AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF:Q4_K_XL \
   --spec-type nextn --draft-max 2 --draft-min 1 \
   -c 8192 -ngl 99 -ngld 99 -fa on
 
 # 27B dense
 llama-server \
-  -hf  unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL \
-  -hfd unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL \
+  -hf  AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF:Q4_K_XL \
+  -hfd AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF:Q4_K_XL \
   --spec-type nextn --draft-max 2 --draft-min 1 \
   -c 8192 -ngl 99 -ngld 99 -fa on
 ```
@@ -123,20 +121,21 @@ let `llama-server` pull both from Hugging Face into the local cache:
 
 ```bash
 llama-server \
-  -hf  unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
-  -hfd unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
+  -hf  AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF:Q4_K_XL \
+  -hfd AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF:Q4_K_XL \
   --spec-type nextn --draft-max 2 --draft-min 1
 ```
 
 ---
 
 ## 7. Performance notes (MacBook Pro M4 Max, 40-core GPU, 48 GB, Metal)
 
-Median TPS over 3 runs, prompt = 50-token instruction, `--draft-max=2 --draft-min=1`,
+Median TPS over 2 runs, prompt = 50-token instruction, `--draft-max=2 --draft-min=1`,
 NextN draft DM=2 (single async chain), context 8192. Single-slot
 (`--parallel 1 -np 1 --cont-batching`), full GPU offload (`-ngl 99 -ngld 99 -fa on`),
-shared-model draft path (no second mmap of combined `_MTP.gguf`). See
-`.scratch/bench-logs/qwen-matrix-fullrun-20260512-222625.md`.
+shared-model draft path (no second mmap of combined `_MTP.gguf`),
+AtomicChat **`UDT-Q4_K_XL_MTP`** file. See
+`.scratch/bench-logs/qwen-udt-ab-20260513-132549.md`.
 
 ### Bench host
 
@@ -214,3 +213,25 @@ NextN graph builder now flows through `LLM_GRAPH_TYPE_NEXTN` instead of `overrid
 - Remote / bench / HF: **[docs/qwen-udt/RUNBOOK.md](../docs/qwen-udt/RUNBOOK.md)**
 
 **Note:** `UDT` filenames use `…Q4_K_XL…` as a product tag; `llama-quantize` is still invoked with family types `Q4_K_M`, `Q5_K_M`, etc.
+
+---
+
+## 9. Released artifacts — AtomicChat UDT collection
+
+The recipe above ships as two ready-to-pull Hugging Face repos, grouped into one collection:
+
+- Collection — [AtomicChat — Qwen 3.6 UDT](https://huggingface.co/collections/AtomicChat/qwen-36-udt-atomicchat-6a0481f5cc5a057c07759176)
+- 27B dense — [`AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF`](https://huggingface.co/AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF)
+- 35B-A3B MoE — [`AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF`](https://huggingface.co/AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF)
+
+What's actually in each repo, and why it's a bit unusual for a quant drop:
+
+- **5 quants per model, all `_MTP.gguf`** — `Q3_K_XL` / `Q4_K_XL` / `Q5_K_XL` / `Q6_K` / `Q8_K_XL`. Every file already includes the NextN auxiliary head, so the same path works for `-m` *and* `-md` — no second GGUF, no second mmap, no second tokenizer.
+- **NextN-preserve mask (V1)** — every `blk.*.nextn.*` and `mtp.*` tensor pinned to `Q8_0`. The cost is ~10 MiB of file size; the win is that the draft head stays close to BF16 fidelity, which keeps `acceptance` high under `--spec-type nextn`. Plain UD quants compress the head at the same bit-width as the body and bleed acceptance under `turbo3` KV.
+- **TurboQuant3-friendly mask (V2)** — attention Q/K bumped to `Q6_K`. This is the piece we tuned specifically for this fork: when KV is compressed to 3-bit via `-ctk turbo3 -ctv turbo3`, the attention scores see extra dequant noise on K, so giving Q/K a little more headroom on the weight side cancels most of it out.
+- **Default release = V3 (V1 ∪ V2)** — the combined mask shipped on Hugging Face. V1-only and V2-only quants exist as ablation artifacts in the build tree but are not published; the V3 file simply has both lifts at once.
+- **mmproj mirrored from Unsloth** — `mmproj-F16.gguf` and `mmproj-BF16.gguf` re-hosted byte-for-byte from the corresponding `unsloth/Qwen3.6-*-MTP-GGUF` repo so a single `-hf` line gets you target + draft + projector.
+- **`imatrix_unsloth.gguf_file` re-hosted** — same artifact as Unsloth's (77-chunk, MTP-aware), included in each repo so the build is reproducible from a clean clone of the recipe.
+- **Apache-2.0**, attribution: Qwen team (weights), Unsloth (imatrix + BF16 sources), [@TheTom](https://github.com/TheTom) (TurboQuant), AtomicChat (UDT masks + packaging). Fork: [`AtomicBot-ai/atomic-llama-cpp-turboquant`](https://github.com/AtomicBot-ai/atomic-llama-cpp-turboquant).
+
+The whole pipeline (download → quantize on H100 → bench on M4 Max → upload) is scripted in [`docs/qwen-udt/RUNBOOK.md`](../docs/qwen-udt/RUNBOOK.md); re-running it on the same Unsloth sources reproduces the published files bit for bit.
diff --git a/README.md b/README.md
@@ -18,7 +18,7 @@ LLM inference in C/C++
 ## Hot topics
 
 - **Gemma 4 MTP speculative decoding: pair a `gemma4` target with the official `gemma4_assistant` head (loaded via `--mtp-head`) for ~+30-50 % short-prompt throughput. See [MTP.md](MTP.md) and the pre-built Q4 assistant GGUFs at the [AtomicChat/Gemma 4 Assistant GGUF collection](https://huggingface.co/collections/AtomicChat/gemma-4-assistant-gguf).**
-- **Qwen 3.6 NextN speculative decoding: point `--model-draft` at the same combined `*_MTP.gguf` and pass `--spec-type nextn` — the draft context reuses the target `llama_model` (no second mmap) and lands **+24-36 % tps** on Qwen 3.6 35B-A3B MoE, **+5-7 % tps** on Qwen 3.6 27B dense (MacBook Pro M4 Max, single-slot). See [NEXTN.md](NEXTN.md) and the pre-built combined `_MTP.gguf` quants at [`unsloth/Qwen3.6-35B-A3B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF) / [`unsloth/Qwen3.6-27B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF). Optional **AtomicChat `UDT`** quants (Unsloth imatrix + fork masks): [docs/qwen-udt/RUNBOOK.md](docs/qwen-udt/RUNBOOK.md).**
+- **Qwen 3.6 NextN speculative decoding: point `--model-draft` at the same combined `*_MTP.gguf` and pass `--spec-type nextn` — the draft context reuses the target `llama_model` (no second mmap) and lands +24-36 % tps on Qwen 3.6 35B-A3B MoE, +5-7 % tps on Qwen 3.6 27B dense (MacBook Pro M4 Max, single-slot). See [NEXTN.md](NEXTN.md). Recommended pre-built combined `_MTP.gguf` quants live in the **[AtomicChat — Qwen 3.6 UDT](https://huggingface.co/collections/AtomicChat/qwen-36-udt-atomicchat-6a0481f5cc5a057c07759176)** collection ([27B](https://huggingface.co/AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF) · [35B-A3B](https://huggingface.co/AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF)) — built with the Unsloth public MTP-aware imatrix + fork masks that pin NextN/MTP tensors to `Q8_0` (preserves draft acceptance) and lift attention Q/K to `Q6_K` (pairs cleanly with TurboQuant3 KV); upstream sources also work: [`unsloth/Qwen3.6-35B-A3B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF) / [`unsloth/Qwen3.6-27B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF).**
 - **TurboQuant KV cache & weights: WHT-rotated low-bit quantization with backend-native kernels (Metal `TurboFlash`, CUDA, Vulkan, HIP). Use `-ctk turbo3 -ctv turbo3` for ~4.3× KV compression, or quantize weights to `TQ4_1S`/`TQ3_1S`. See [Compression below](#turboquant-kv-cache--weight-compression).**
 - **Hugging Face cache migration: models downloaded with `-hf` are now stored in the standard Hugging Face cache directory, enabling sharing with other HF tools.**
 - **[guide : using the new WebUI of llama.cpp](https://github.com/ggml-org/llama.cpp/discussions/16938)**
@@ -198,25 +198,31 @@ Highlights:
 
 ### Pre-built model GGUFs
 
-Recommended source is the **unsloth** Hugging Face collection — the same
-combined `*_MTP.gguf` files exercised in the matrix bench. The
-`UD-Q4_K_XL` quant is the recommended default (matches the bench cells).
+**Recommended:** the AtomicChat **UDT** (UD-Turbo) collection, drop-in combined `_MTP.gguf` quants tuned for this fork. One repo per model, five quants each (`Q3_K_XL` / **`Q4_K_XL`** / `Q5_K_XL` / `Q6_K` / `Q8_K_XL`), plus the `mmproj` for vision and the original Unsloth imatrix re-hosted for reproducibility:
 
-| Target | Combined `_MTP.gguf` (target + NextN head) |
-|---|---|
-| Qwen 3.6 35B-A3B (MoE) | [`unsloth/Qwen3.6-35B-A3B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF) |
-| Qwen 3.6 27B (dense)   | [`unsloth/Qwen3.6-27B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF) |
+| Target | Recommended (AtomicChat UDT) | Upstream baseline (Unsloth) |
+|---|---|---|
+| Qwen 3.6 35B-A3B (MoE) | [`AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF`](https://huggingface.co/AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF) | [`unsloth/Qwen3.6-35B-A3B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF) |
+| Qwen 3.6 27B (dense)   | [`AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF`](https://huggingface.co/AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF)         | [`unsloth/Qwen3.6-27B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF)         |
+
+What makes UDT different from a vanilla `llama-quantize -imatrix` run:
+
+- **MTP-aware imatrix** — calibrated by Unsloth with the NextN head active (we re-host their public [`imatrix_unsloth.gguf_file`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF/blob/main/imatrix_unsloth.gguf_file) so you can reproduce or re-mix on top of it).
+- **NextN-preserve mask** — every `blk.*.nextn.*` and `mtp.*` tensor pinned to `Q8_0`. Tiny size cost (~10 MiB), keeps draft acceptance high.
+- **TurboQuant3-friendly mask** — `attn_q` / `attn_k` bumped to `Q6_K` so the file pairs cleanly with `-ctk turbo3 -ctv turbo3`.
+- **Combined `_MTP.gguf`** — target + NextN head in one file, ready for the shared-model speculative path (`-m` and `-md` point at the same path; no second mmap).
+- **Apache-2.0**, full attribution: Qwen team (weights), Unsloth (imatrix + BF16 sources), @TheTom (TurboQuant), AtomicChat (UDT masks + packaging).
 
-**AtomicChat `UDT` quants (UD-Turbo)** — optional GGUFs built with Unsloth’s public MTP-aware imatrix plus fork-specific tensor masks for NextN + TurboQuant3 (`scripts/quantize-masks/qwen36-ud-*.txt`). See **[docs/qwen-udt/RUNBOOK.md](docs/qwen-udt/RUNBOOK.md)** and [release/qwen-udt/HF_REPOS.md](release/qwen-udt/HF_REPOS.md).
+Collection: [AtomicChat — Qwen 3.6 UDT](https://huggingface.co/collections/AtomicChat/qwen-36-udt-atomicchat-6a0481f5cc5a057c07759176). Full recipe & runbook: [docs/qwen-udt/RUNBOOK.md](docs/qwen-udt/RUNBOOK.md). Mask files: [`scripts/quantize-masks/qwen36-ud-{base,v1-nextn,v2-turbo3,v3-combined}.txt`](scripts/quantize-masks).
 
 ### Quick start
 
 ```bash
 # Pull both target (-hf) and draft (-hfd) from the same HF combined _MTP.gguf;
 # they resolve to the same cached file → the server takes the shared-model branch.
 llama-server \
-  -hf  unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
-  -hfd unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
+  -hf  AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF:Q4_K_XL \
+  -hfd AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF:Q4_K_XL \
   --spec-type nextn \
   --draft-max 2 --draft-min 1 \
   -c 8192 \
@@ -762,13 +768,15 @@ To learn more about model quantization, [read this documentation](tools/quantize
     35B-A3B MoE the combination is **+24-36 % tps** vs the same target
     without speculation.
 
-    Pre-built combined `_MTP.gguf` quants (recommended **`UD-Q4_K_XL`**,
+    Pre-built combined `_MTP.gguf` quants (recommended **`Q4_K_XL`**,
     matches the matrix bench cells):
 
     | Target | Combined `_MTP.gguf` |
     |---|---|
-    | Qwen 3.6 35B-A3B (MoE) | [`unsloth/Qwen3.6-35B-A3B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF) |
-    | Qwen 3.6 27B (dense)   | [`unsloth/Qwen3.6-27B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF) |
+    | Qwen 3.6 35B-A3B (MoE) — AtomicChat UDT | [`AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF`](https://huggingface.co/AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF) |
+    | Qwen 3.6 27B (dense)   — AtomicChat UDT | [`AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF`](https://huggingface.co/AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF) |
+    | Qwen 3.6 35B-A3B (MoE) — Unsloth        | [`unsloth/Qwen3.6-35B-A3B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF) |
+    | Qwen 3.6 27B (dense)   — Unsloth        | [`unsloth/Qwen3.6-27B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF) |
 
     ```bash
     # Pull both target (-hf) and draft (-hfd) from the same HF combined _MTP.gguf.
diff --git a/scripts/quantize-qwen-udt.sh b/scripts/quantize-qwen-udt.sh
@@ -44,7 +44,7 @@ case "$MODEL" in
     INP="${BF16_INPUT:-}"
     if [[ -z "$INP" ]]; then
       shopt -s nullglob
-      cand=( "${SOURCES}/${SUB}"/Qwen3.6-27B-BF16-*.gguf )
+      cand=( "${SOURCES}/${SUB}/BF16"/Qwen3.6-27B-BF16-*.gguf "${SOURCES}/${SUB}"/Qwen3.6-27B-BF16-*.gguf )
       shopt -u nullglob
       INP="${cand[0]:-}"
     fi
@@ -53,9 +53,10 @@ case "$MODEL" in
     PREFIX="Qwen3.6-35B-A3B"
     SUB="${QWEN_UDT_SUBDIR_35:-35a3b}"
     IMT="${IMATRIX_FILE:-${SOURCES}/${SUB}/imatrix_unsloth.gguf_file}"
+    INP="${BF16_INPUT:-}"
     if [[ -z "$INP" ]]; then
       shopt -s nullglob
-      cand=( "${SOURCES}/${SUB}"/Qwen3.6-35B-A3B-BF16-*.gguf )
+      cand=( "${SOURCES}/${SUB}/BF16"/Qwen3.6-35B-A3B-BF16-*.gguf "${SOURCES}/${SUB}"/Qwen3.6-35B-A3B-BF16-*.gguf )
       shopt -u nullglob
       INP="${cand[0]:-}"
     fi
@@ -66,11 +67,15 @@ case "$MODEL" in
 esac
 
 case "$FTYPE" in
-  Q3_K_M|Q4_K_M|Q5_K_M|Q6_K) ;;
-  *) echo "error: unsupported ftype '$FTYPE' (expected Q3_K_M|Q4_K_M|Q5_K_M|Q6_K)" >&2; exit 1 ;;
+  Q3_K_M|Q4_K_M|Q5_K_M|Q6_K|Q8_0) ;;
+  *) echo "error: unsupported ftype '$FTYPE' (expected Q3_K_M|Q4_K_M|Q5_K_M|Q6_K|Q8_0)" >&2; exit 1 ;;
 esac
 
-XL_TAG="${FTYPE/_K_M/_K_XL}"
+case "$FTYPE" in
+  Q3_K_M|Q4_K_M|Q5_K_M) XL_TAG="${FTYPE/_K_M/_K_XL}" ;;
+  Q6_K) XL_TAG="Q6_K" ;;
+  Q8_0) XL_TAG="Q8_K_XL" ;;
+esac
 case "$VARIANT" in
   base) SUFFIX="-base" ;;
   v1)   SUFFIX="-V1" ;;
diff --git a/scripts/qwen-udt/hf-download-sources.sh b/scripts/qwen-udt/hf-download-sources.sh
@@ -6,38 +6,42 @@
 #
 # Default DEST_DIR: ${REPO}/.scratch/qwen-ud-sources with per-model subdirs 27b/ and 35a3b/
 #
-# Requires: huggingface-cli (pip install -U "huggingface_hub[cli]") and HF read access.
+# Requires: `hf` (huggingface_hub>=1.0) or the older `huggingface-cli`.
 
 set -euo pipefail
 
 ROOT="$(cd "$(dirname "$0")/../.." && pwd)"
 DEST="${1:-${ROOT}/.scratch/qwen-ud-sources}"
 
-if ! command -v huggingface-cli >/dev/null 2>&1; then
-  echo "error: huggingface-cli not found (pip install -U \"huggingface_hub[cli]\")" >&2
+if command -v hf >/dev/null 2>&1; then
+  HF=hf
+elif command -v huggingface-cli >/dev/null 2>&1; then
+  HF=huggingface-cli
+else
+  echo 'error: neither `hf` nor `huggingface-cli` found (pip install -U "huggingface_hub[cli]")' >&2
   exit 1
 fi
 
 mkdir -p "$DEST/27b" "$DEST/35a3b"
 
+dl() {
+  local repo="$1"; shift
+  local local_dir="$1"; shift
+  "$HF" download "$repo" "$@" --local-dir "$local_dir"
+}
+
 echo "info: 27B — imatrix..."
-huggingface-cli download unsloth/Qwen3.6-27B-MTP-GGUF imatrix_unsloth.gguf_file \
-  --local-dir "$DEST/27b" --local-dir-use-symlinks False
-echo "info: 27B — reference quant..."
-huggingface-cli download unsloth/Qwen3.6-27B-MTP-GGUF Qwen3.6-27B-UD-Q4_K_XL.gguf \
-  --local-dir "$DEST/27b" --local-dir-use-symlinks False
+dl unsloth/Qwen3.6-27B-MTP-GGUF "$DEST/27b" imatrix_unsloth.gguf_file
+echo "info: 27B — reference UD-Q4_K_XL..."
+dl unsloth/Qwen3.6-27B-MTP-GGUF "$DEST/27b" Qwen3.6-27B-UD-Q4_K_XL.gguf
 echo "info: 27B — BF16 shards..."
-huggingface-cli download unsloth/Qwen3.6-27B-MTP-GGUF --include "BF16/*" \
-  --local-dir "$DEST/27b" --local-dir-use-symlinks False
+dl unsloth/Qwen3.6-27B-MTP-GGUF "$DEST/27b" --include "BF16/*"
 
 echo "info: 35B-A3B — imatrix..."
-huggingface-cli download unsloth/Qwen3.6-35B-A3B-MTP-GGUF imatrix_unsloth.gguf_file \
-  --local-dir "$DEST/35a3b" --local-dir-use-symlinks False
-echo "info: 35B-A3B — reference quant..."
-huggingface-cli download unsloth/Qwen3.6-35B-A3B-MTP-GGUF Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
-  --local-dir "$DEST/35a3b" --local-dir-use-symlinks False
+dl unsloth/Qwen3.6-35B-A3B-MTP-GGUF "$DEST/35a3b" imatrix_unsloth.gguf_file
+echo "info: 35B-A3B — reference UD-Q4_K_XL..."
+dl unsloth/Qwen3.6-35B-A3B-MTP-GGUF "$DEST/35a3b" Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf
 echo "info: 35B-A3B — BF16 shards..."
-huggingface-cli download unsloth/Qwen3.6-35B-A3B-MTP-GGUF --include "BF16/*" \
-  --local-dir "$DEST/35a3b" --local-dir-use-symlinks False
+dl unsloth/Qwen3.6-35B-A3B-MTP-GGUF "$DEST/35a3b" --include "BF16/*"
 
 echo "ok: sources under $DEST/{27b,35a3b}"
diff --git a/scripts/qwen-udt/hf-upload-qwen-udt.sh b/scripts/qwen-udt/hf-upload-qwen-udt.sh
diff --git a/scripts/qwen-udt/ppl-matrix-remote.sh b/scripts/qwen-udt/ppl-matrix-remote.sh