Commit c7e6138

Update documentation and scripts for AtomicChat UDT quantization and Qwen 3.6 NextN enhancements
- Revised NEXTN.md to highlight the new AtomicChat UDT collection, detailing the combined `_MTP.gguf` quants and their benefits for NextN processing.
- Updated README.md to reflect changes in recommended sources for Qwen 3.6 models, emphasizing the AtomicChat UDT collection and its features.
- Enhanced quantization scripts to support improved file handling and added compatibility for new tensor types.
- Introduced a new script for running perplexity benchmarks on UDT quant models, generating detailed performance logs.
- Improved error handling and user feedback in various scripts to streamline the quantization and benchmarking processes.
1 parent 33e9b6d commit c7e6138

6 files changed

Lines changed: 161 additions & 58 deletions

NEXTN.md

Lines changed: 38 additions & 17 deletions
Original file line number | Diff line number | Diff line change
@@ -16,18 +16,16 @@ See also `MTP.md` (Gemma) and `docs/speculative.md` for shared CLI concepts.
1616

1717
## 0. Pre-built model GGUFs
1818

19-
Recommended source for Qwen 3.6 combined `*_MTP.gguf` checkpoints is the
20-
**unsloth** Hugging Face collection — the same files exercised in the
21-
matrix bench (§7):
19+
**Recommended:** the [AtomicChat — Qwen 3.6 UDT](https://huggingface.co/collections/AtomicChat/qwen-36-udt-atomicchat-6a0481f5cc5a057c07759176) collection, drop-in combined `*_MTP.gguf` quants tuned for this fork. Each repo ships `Q3_K_XL` / **`Q4_K_XL`** / `Q5_K_XL` / `Q6_K` / `Q8_K_XL`, plus the `mmproj` for vision and a copy of `imatrix_unsloth.gguf_file` for reproducibility. Upstream Unsloth files keep working too: same arch metadata, same NextN tail.
2220

23-
| Target | Combined `_MTP.gguf` (target + NextN head) | Recommended quant | Architecture |
21+
| Target | Recommended (AtomicChat UDT) | Upstream baseline (Unsloth) | Architecture |
2422
|---|---|---|---|
25-
| Qwen 3.6 35B-A3B (MoE) | [`unsloth/Qwen3.6-35B-A3B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF) | **`UD-Q4_K_XL`** (22.9 GB) | `qwen35moe` |
26-
| Qwen 3.6 27B (dense) | [`unsloth/Qwen3.6-27B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF) | **`UD-Q4_K_XL`** | `qwen35` |
23+
| Qwen 3.6 35B-A3B (MoE) | [`AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF`](https://huggingface.co/AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF) (`Q4_K_XL` ≈ 20.7 GiB) | [`unsloth/Qwen3.6-35B-A3B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF) | `qwen35moe` |
24+
| Qwen 3.6 27B (dense) | [`AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF`](https://huggingface.co/AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF) (`Q4_K_XL` ≈ 17.7 GiB) | [`unsloth/Qwen3.6-27B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF) | `qwen35` |
2725

28-
**AtomicChat `UDT` (UD-Turbo)**: this fork publishes additional combined `*_MTP.gguf` quants built with Unsloth's public MTP-aware `imatrix_unsloth.gguf_file` plus our tensor-type masks (`scripts/quantize-masks/qwen36-ud-*.txt`) for NextN / TurboQuant3-oriented quality. End-to-end recipe: **[docs/qwen-udt/RUNBOOK.md](docs/qwen-udt/RUNBOOK.md)**; HF targets: [release/qwen-udt/HF_REPOS.md](release/qwen-udt/HF_REPOS.md).
26+
**Why UDT** — built on Unsloth's public MTP-aware [`imatrix_unsloth.gguf_file`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF/blob/main/imatrix_unsloth.gguf_file), then layered with this fork's tensor-type masks (see §8): every `blk.*.nextn.*` / `mtp.*` tensor pinned to `Q8_0` to preserve draft acceptance, and `attn_q` / `attn_k` lifted to `Q6_K` so the file pairs cleanly with TurboQuant3 KV. End-to-end recipe & runbook: [docs/qwen-udt/RUNBOOK.md](docs/qwen-udt/RUNBOOK.md). Attribution: Qwen team (weights), Unsloth (imatrix + BF16 sources), @TheTom (TurboQuant), AtomicChat (UDT masks + packaging).
2927

30-
Both repos ship `UD-IQ1_M` through `BF16` quants. The shared-model NextN path
28+
The shared-model NextN path
3129
works on **any** of them as long as the file contains the NextN auxiliary
3230
head (`nextn_predict_layers > 0`) — which all `*-MTP-GGUF` quants do by
3331
construction. `scripts/verify-qwen36-nextn-gguf.py` will refuse to load a
@@ -39,15 +37,15 @@ the same file in the HF cache and takes the shared-model branch:
3937
```bash
4038
# 35B-A3B MoE (headline +24-36 % cell in the matrix)
4139
llama-server \
42-
-hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
43-
-hfd unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
40+
-hf AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF:Q4_K_XL \
41+
-hfd AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF:Q4_K_XL \
4442
--spec-type nextn --draft-max 2 --draft-min 1 \
4543
-c 8192 -ngl 99 -ngld 99 -fa on
4644

4745
# 27B dense
4846
llama-server \
49-
-hf unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL \
50-
-hfd unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL \
47+
-hf AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF:Q4_K_XL \
48+
-hfd AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF:Q4_K_XL \
5149
--spec-type nextn --draft-max 2 --draft-min 1 \
5250
-c 8192 -ngl 99 -ngld 99 -fa on
5351
```
@@ -123,20 +121,21 @@ let `llama-server` pull both from Hugging Face into the local cache:
123121

124122
```bash
125123
llama-server \
126-
-hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
127-
-hfd unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
124+
-hf AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF:Q4_K_XL \
125+
-hfd AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF:Q4_K_XL \
128126
--spec-type nextn --draft-max 2 --draft-min 1
129127
```
130128

131129
---
132130

133131
## 7. Performance notes (MacBook Pro M4 Max, 40-core GPU, 48 GB, Metal)
134132

135-
Median TPS over 3 runs, prompt = 50-token instruction, `--draft-max=2 --draft-min=1`,
133+
Median TPS over 2 runs, prompt = 50-token instruction, `--draft-max=2 --draft-min=1`,
136134
NextN draft DM=2 (single async chain), context 8192. Single-slot
137135
(`--parallel 1 -np 1 --cont-batching`), full GPU offload (`-ngl 99 -ngld 99 -fa on`),
138-
shared-model draft path (no second mmap of combined `_MTP.gguf`). See
139-
`.scratch/bench-logs/qwen-matrix-fullrun-20260512-222625.md`.
136+
shared-model draft path (no second mmap of combined `_MTP.gguf`),
137+
AtomicChat **`UDT-Q4_K_XL_MTP`** file. See
138+
`.scratch/bench-logs/qwen-udt-ab-20260513-132549.md`.
140139

141140
### Bench host
142141

@@ -214,3 +213,25 @@ NextN graph builder now flows through `LLM_GRAPH_TYPE_NEXTN` instead of `overrid
214213
- Remote / bench / HF: **[docs/qwen-udt/RUNBOOK.md](../docs/qwen-udt/RUNBOOK.md)**
215214

216215
**Note:** `UDT` filenames use `…Q4_K_XL…` as a product tag; `llama-quantize` is still invoked with family types `Q4_K_M`, `Q5_K_M`, etc.
216+
217+
---
218+
219+
## 9. Released artifacts — AtomicChat UDT collection
220+
221+
The recipe above ships as two ready-to-pull Hugging Face repos, grouped into one collection:
222+
223+
- Collection — [AtomicChat — Qwen 3.6 UDT](https://huggingface.co/collections/AtomicChat/qwen-36-udt-atomicchat-6a0481f5cc5a057c07759176)
224+
- 27B dense — [`AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF`](https://huggingface.co/AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF)
225+
- 35B-A3B MoE — [`AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF`](https://huggingface.co/AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF)
226+
227+
What's actually in each repo, and why it's a bit unusual for a quant drop:
228+
229+
- **5 quants per model, all `_MTP.gguf`**: `Q3_K_XL` / `Q4_K_XL` / `Q5_K_XL` / `Q6_K` / `Q8_K_XL`. Every file already includes the NextN auxiliary head, so the same path works for `-m` *and* `-md`: no second GGUF, no second mmap, no second tokenizer.
230+
- **NextN-preserve mask (V1)** — every `blk.*.nextn.*` and `mtp.*` tensor pinned to `Q8_0`. The cost is ~10 MiB of file size; the win is that the draft head stays close to BF16 fidelity, which keeps `acceptance` high under `--spec-type nextn`. Plain UD quants compress the head at the same bit-width as the body and bleed acceptance under `turbo3` KV.
231+
- **TurboQuant3-friendly mask (V2)** — attention Q/K bumped to `Q6_K`. This is the piece we tuned specifically for this fork: when KV is compressed to 3-bit via `-ctk turbo3 -ctv turbo3`, the attention scores see extra dequant noise on K, so giving Q/K a little more headroom on the weight side cancels most of it out.
232+
- **Default release = V3 (V1 ∪ V2)** — the combined mask shipped on Hugging Face. V1-only and V2-only quants exist as ablation artifacts in the build tree but are not published; the V3 file simply has both lifts at once.
233+
- **mmproj mirrored from Unsloth**: `mmproj-F16.gguf` and `mmproj-BF16.gguf` re-hosted byte-for-byte from the corresponding `unsloth/Qwen3.6-*-MTP-GGUF` repo so a single `-hf` line gets you target + draft + projector.
234+
- **`imatrix_unsloth.gguf_file` re-hosted** — same artifact as Unsloth's (77-chunk, MTP-aware), included in each repo so the build is reproducible from a clean clone of the recipe.
235+
- **Apache-2.0**, attribution: Qwen team (weights), Unsloth (imatrix + BF16 sources), [@TheTom](https://github.com/TheTom) (TurboQuant), AtomicChat (UDT masks + packaging). Fork: [`AtomicBot-ai/atomic-llama-cpp-turboquant`](https://github.com/AtomicBot-ai/atomic-llama-cpp-turboquant).
236+
237+
The whole pipeline (download → quantize on H100 → bench on M4 Max → upload) is scripted in [`docs/qwen-udt/RUNBOOK.md`](../docs/qwen-udt/RUNBOOK.md); re-running it on the same Unsloth sources reproduces the published files bit-identical.
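The V2 mask described above is motivated by running the combined file with 3-bit KV. As a rough illustration only, the sketch below prints (without running) a `llama-server` command line that combines the NextN flags from §0 with `-ctk turbo3 -ctv turbo3`. The helper name `nextn_turbo3_cmd` is hypothetical, and using all of these flags together in one invocation is an assumption; the benchmark settings in §7 do not mention the KV flags.

```bash
# Hypothetical dry-run helper: prints a llama-server command line, never executes it.
# $1 is a "<repo>:<quant tag>" string as used with -hf / -hfd elsewhere in this file.
nextn_turbo3_cmd() {
  repo_tag="$1"
  # Same file for target (-hf) and draft (-hfd) so the shared-model branch is taken;
  # -ctk/-ctv turbo3 compress the KV cache (assumed to compose with --spec-type nextn).
  printf '%s\n' "llama-server -hf $repo_tag -hfd $repo_tag --spec-type nextn --draft-max 2 --draft-min 1 -ctk turbo3 -ctv turbo3 -c 8192 -ngl 99 -ngld 99 -fa on"
}

nextn_turbo3_cmd AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF:Q4_K_XL
```

Printing the command first makes it easy to compare against the bench settings before pulling a multi-gigabyte file.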

README.md

Lines changed: 22 additions & 14 deletions
@@ -18,7 +18,7 @@ LLM inference in C/C++
1818
## Hot topics
1919

2020
- **Gemma 4 MTP speculative decoding: pair a `gemma4` target with the official `gemma4_assistant` head (loaded via `--mtp-head`) for ~+30-50 % short-prompt throughput. See [MTP.md](MTP.md) and the pre-built Q4 assistant GGUFs at the [AtomicChat/Gemma 4 Assistant GGUF collection](https://huggingface.co/collections/AtomicChat/gemma-4-assistant-gguf).**
21-
- **Qwen 3.6 NextN speculative decoding: point `--model-draft` at the same combined `*_MTP.gguf` and pass `--spec-type nextn` — the draft context reuses the target `llama_model` (no second mmap) and lands **+24-36 % tps** on Qwen 3.6 35B-A3B MoE, **+5-7 % tps** on Qwen 3.6 27B dense (MacBook Pro M4 Max, single-slot). See [NEXTN.md](NEXTN.md) and the pre-built combined `_MTP.gguf` quants at [`unsloth/Qwen3.6-35B-A3B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF) / [`unsloth/Qwen3.6-27B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF). Optional **AtomicChat `UDT`** quants (Unsloth imatrix + fork masks): [docs/qwen-udt/RUNBOOK.md](docs/qwen-udt/RUNBOOK.md).**
21+
- **Qwen 3.6 NextN speculative decoding: point `--model-draft` at the same combined `*_MTP.gguf` and pass `--spec-type nextn` — the draft context reuses the target `llama_model` (no second mmap) and lands +24-36 % tps on Qwen 3.6 35B-A3B MoE, +5-7 % tps on Qwen 3.6 27B dense (MacBook Pro M4 Max, single-slot). See [NEXTN.md](NEXTN.md). Recommended pre-built combined `_MTP.gguf` quants live in the **[AtomicChat — Qwen 3.6 UDT](https://huggingface.co/collections/AtomicChat/qwen-36-udt-atomicchat-6a0481f5cc5a057c07759176)** collection ([27B](https://huggingface.co/AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF) · [35B-A3B](https://huggingface.co/AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF)) — built with the Unsloth public MTP-aware imatrix + fork masks that pin NextN/MTP tensors to `Q8_0` (preserves draft acceptance) and lift attention Q/K to `Q6_K` (pairs cleanly with TurboQuant3 KV); upstream sources also work: [`unsloth/Qwen3.6-35B-A3B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF) / [`unsloth/Qwen3.6-27B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF).**
2222
- **TurboQuant KV cache & weights: WHT-rotated low-bit quantization with backend-native kernels (Metal `TurboFlash`, CUDA, Vulkan, HIP). Use `-ctk turbo3 -ctv turbo3` for ~4.3× KV compression, or quantize weights to `TQ4_1S`/`TQ3_1S`. See [Compression below](#turboquant-kv-cache--weight-compression).**
2323
- **Hugging Face cache migration: models downloaded with `-hf` are now stored in the standard Hugging Face cache directory, enabling sharing with other HF tools.**
2424
- **[guide : using the new WebUI of llama.cpp](https://github.com/ggml-org/llama.cpp/discussions/16938)**
@@ -198,25 +198,31 @@ Highlights:
198198

199199
### Pre-built model GGUFs
200200

201-
Recommended source is the **unsloth** Hugging Face collection — the same
202-
combined `*_MTP.gguf` files exercised in the matrix bench. The
203-
`UD-Q4_K_XL` quant is the recommended default (matches the bench cells).
201+
**Recommended:** the AtomicChat **UDT** (UD-Turbo) collection, drop-in combined `_MTP.gguf` quants tuned for this fork. One repo per model, 5 quants each (`Q3_K_XL` / **`Q4_K_XL`** / `Q5_K_XL` / `Q6_K` / `Q8_K_XL`), plus the `mmproj` for vision and the original Unsloth imatrix re-hosted for reproducibility:
204202

205-
| Target | Combined `_MTP.gguf` (target + NextN head) |
206-
|---|---|
207-
| Qwen 3.6 35B-A3B (MoE) | [`unsloth/Qwen3.6-35B-A3B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF) |
208-
| Qwen 3.6 27B (dense) | [`unsloth/Qwen3.6-27B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF) |
203+
| Target | Recommended (AtomicChat UDT) | Upstream baseline (Unsloth) |
204+
|---|---|---|
205+
| Qwen 3.6 35B-A3B (MoE) | [`AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF`](https://huggingface.co/AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF) | [`unsloth/Qwen3.6-35B-A3B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF) |
206+
| Qwen 3.6 27B (dense) | [`AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF`](https://huggingface.co/AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF) | [`unsloth/Qwen3.6-27B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF) |
207+
208+
What makes UDT different from a vanilla `llama-quantize -imatrix` run:
209+
210+
- **MTP-aware imatrix** — calibrated by Unsloth with the NextN head active (we re-host their public [`imatrix_unsloth.gguf_file`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF/blob/main/imatrix_unsloth.gguf_file) so you can reproduce or re-mix on top of it).
211+
- **NextN-preserve mask** — every `blk.*.nextn.*` and `mtp.*` tensor pinned to `Q8_0`. Tiny size cost (~10 MiB), keeps draft acceptance high.
212+
- **TurboQuant3-friendly mask**: `attn_q` / `attn_k` bumped to `Q6_K` so the file pairs cleanly with `-ctk turbo3 -ctv turbo3`.
213+
- **Combined `_MTP.gguf`** — target + NextN head in one file, ready for the shared-model speculative path (`-m` and `-md` point at the same path; no second mmap).
214+
- **Apache-2.0**, full attribution: Qwen team (weights), Unsloth (imatrix + BF16 sources), @TheTom (TurboQuant), AtomicChat (UDT masks + packaging).
209215

210-
**AtomicChat `UDT` quants (UD-Turbo)** — optional GGUFs built with Unsloth’s public MTP-aware imatrix plus fork-specific tensor masks for NextN + TurboQuant3 (`scripts/quantize-masks/qwen36-ud-*.txt`). See **[docs/qwen-udt/RUNBOOK.md](docs/qwen-udt/RUNBOOK.md)** and [release/qwen-udt/HF_REPOS.md](release/qwen-udt/HF_REPOS.md).
216+
Collection: [AtomicChat — Qwen 3.6 UDT](https://huggingface.co/collections/AtomicChat/qwen-36-udt-atomicchat-6a0481f5cc5a057c07759176). Full recipe & runbook: [docs/qwen-udt/RUNBOOK.md](docs/qwen-udt/RUNBOOK.md). Mask files: [`scripts/quantize-masks/qwen36-ud-{base,v1-nextn,v2-turbo3,v3-combined}.txt`](scripts/quantize-masks).
211217

212218
### Quick start
213219

214220
```bash
215221
# Pull both target (-hf) and draft (-hfd) from the same HF combined _MTP.gguf;
216222
# they resolve to the same cached file → the server takes the shared-model branch.
217223
llama-server \
218-
-hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
219-
-hfd unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
224+
-hf AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF:Q4_K_XL \
225+
-hfd AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF:Q4_K_XL \
220226
--spec-type nextn \
221227
--draft-max 2 --draft-min 1 \
222228
-c 8192 \
@@ -762,13 +768,15 @@ To learn more about model quantization, [read this documentation](tools/quantize
762768
35B-A3B MoE the combination is **+24-36 % tps** vs the same target
763769
without speculation.
764770

765-
Pre-built combined `_MTP.gguf` quants (recommended **`UD-Q4_K_XL`**,
771+
Pre-built combined `_MTP.gguf` quants (recommended **`Q4_K_XL`**,
766772
matches the matrix bench cells):
767773

768774
| Target | Combined `_MTP.gguf` |
769775
|---|---|
770-
| Qwen 3.6 35B-A3B (MoE) | [`unsloth/Qwen3.6-35B-A3B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF) |
771-
| Qwen 3.6 27B (dense) | [`unsloth/Qwen3.6-27B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF) |
776+
| Qwen 3.6 35B-A3B (MoE) — AtomicChat UDT | [`AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF`](https://huggingface.co/AtomicChat/Qwen3.6-35B-A3B-UDT-MTP-GGUF) |
777+
| Qwen 3.6 27B (dense) — AtomicChat UDT | [`AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF`](https://huggingface.co/AtomicChat/Qwen3.6-27B-UDT-MTP-GGUF) |
778+
| Qwen 3.6 35B-A3B (MoE) — Unsloth | [`unsloth/Qwen3.6-35B-A3B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF) |
779+
| Qwen 3.6 27B (dense) — Unsloth | [`unsloth/Qwen3.6-27B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF) |
772780

773781
```bash
774782
# Pull both target (-hf) and draft (-hfd) from the same HF combined _MTP.gguf.

scripts/quantize-qwen-udt.sh

Lines changed: 10 additions & 5 deletions
@@ -44,7 +44,7 @@ case "$MODEL" in
4444
INP="${BF16_INPUT:-}"
4545
if [[ -z "$INP" ]]; then
4646
shopt -s nullglob
47-
cand=( "${SOURCES}/${SUB}"/Qwen3.6-27B-BF16-*.gguf )
47+
cand=( "${SOURCES}/${SUB}/BF16"/Qwen3.6-27B-BF16-*.gguf "${SOURCES}/${SUB}"/Qwen3.6-27B-BF16-*.gguf )
4848
shopt -u nullglob
4949
INP="${cand[0]:-}"
5050
fi
@@ -53,9 +53,10 @@ case "$MODEL" in
5353
PREFIX="Qwen3.6-35B-A3B"
5454
SUB="${QWEN_UDT_SUBDIR_35:-35a3b}"
5555
IMT="${IMATRIX_FILE:-${SOURCES}/${SUB}/imatrix_unsloth.gguf_file}"
56+
INP="${BF16_INPUT:-}"
5657
if [[ -z "$INP" ]]; then
5758
shopt -s nullglob
58-
cand=( "${SOURCES}/${SUB}"/Qwen3.6-35B-A3B-BF16-*.gguf )
59+
cand=( "${SOURCES}/${SUB}/BF16"/Qwen3.6-35B-A3B-BF16-*.gguf "${SOURCES}/${SUB}"/Qwen3.6-35B-A3B-BF16-*.gguf )
5960
shopt -u nullglob
6061
INP="${cand[0]:-}"
6162
fi
@@ -66,11 +67,15 @@ case "$MODEL" in
6667
esac
6768

6869
case "$FTYPE" in
69-
Q3_K_M|Q4_K_M|Q5_K_M|Q6_K) ;;
70-
*) echo "error: unsupported ftype '$FTYPE' (expected Q3_K_M|Q4_K_M|Q5_K_M|Q6_K)" >&2; exit 1 ;;
70+
Q3_K_M|Q4_K_M|Q5_K_M|Q6_K|Q8_0) ;;
71+
*) echo "error: unsupported ftype '$FTYPE' (expected Q3_K_M|Q4_K_M|Q5_K_M|Q6_K|Q8_0)" >&2; exit 1 ;;
7172
esac
7273

73-
XL_TAG="${FTYPE/_K_M/_K_XL}"
74+
case "$FTYPE" in
75+
Q3_K_M|Q4_K_M|Q5_K_M) XL_TAG="${FTYPE/_K_M/_K_XL}" ;;
76+
Q6_K) XL_TAG="Q6_K" ;;
77+
Q8_0) XL_TAG="Q8_K_XL" ;;
78+
esac
7479
case "$VARIANT" in
7580
base) SUFFIX="-base" ;;
7681
v1) SUFFIX="-V1" ;;
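The new `XL_TAG` case in this hunk maps the `llama-quantize` family type onto the product tag used in released filenames. A standalone sketch of that mapping follows, for checking a tag without running the full script. The function name `udt_xl_tag` is hypothetical, and the sketch uses POSIX suffix removal (`${1%_K_M}`) where the script uses the bash substitution `${FTYPE/_K_M/_K_XL}`; the results should be the same for the five supported types.

```bash
# Hypothetical helper mirroring the XL_TAG case statement in quantize-qwen-udt.sh.
# Prints the filename tag for a llama-quantize family type; fails on anything else.
udt_xl_tag() {
  case "$1" in
    Q3_K_M|Q4_K_M|Q5_K_M) printf '%s\n' "${1%_K_M}_K_XL" ;;  # e.g. Q4_K_M -> Q4_K_XL
    Q6_K)                 printf '%s\n' "Q6_K" ;;            # Q6 keeps its plain name
    Q8_0)                 printf '%s\n' "Q8_K_XL" ;;         # Q8_0 is published as Q8_K_XL
    *) echo "error: unsupported ftype '$1'" >&2; return 1 ;;
  esac
}

udt_xl_tag Q4_K_M
udt_xl_tag Q8_0
```

Keeping the mapping in one function means the validation list and the tag list cannot drift apart, which is the failure mode the two separate `case "$FTYPE"` blocks would otherwise allow.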

scripts/qwen-udt/hf-download-sources.sh

Lines changed: 21 additions & 17 deletions
@@ -6,38 +6,42 @@
66
#
77
# Default DEST_DIR: ${REPO}/.scratch/qwen-ud-sources with per-model subdirs 27b/ and 35a3b/
88
#
9-
# Requires: huggingface-cli (pip install -U "huggingface_hub[cli]") and HF read access.
9+
# Requires: `hf` (huggingface_hub>=1.0) or the older `huggingface-cli`.
1010

1111
set -euo pipefail
1212

1313
ROOT="$(cd "$(dirname "$0")/../.." && pwd)"
1414
DEST="${1:-${ROOT}/.scratch/qwen-ud-sources}"
1515

16-
if ! command -v huggingface-cli >/dev/null 2>&1; then
17-
echo "error: huggingface-cli not found (pip install -U \"huggingface_hub[cli]\")" >&2
16+
if command -v hf >/dev/null 2>&1; then
17+
HF=hf
18+
elif command -v huggingface-cli >/dev/null 2>&1; then
19+
HF=huggingface-cli
20+
else
21+
echo 'error: neither `hf` nor `huggingface-cli` found (pip install -U "huggingface_hub[cli]")' >&2
1822
exit 1
1923
fi
2024

2125
mkdir -p "$DEST/27b" "$DEST/35a3b"
2226

27+
dl() {
28+
local repo="$1"; shift
29+
local local_dir="$1"; shift
30+
"$HF" download "$repo" "$@" --local-dir "$local_dir"
31+
}
32+
2333
echo "info: 27B — imatrix..."
24-
huggingface-cli download unsloth/Qwen3.6-27B-MTP-GGUF imatrix_unsloth.gguf_file \
25-
--local-dir "$DEST/27b" --local-dir-use-symlinks False
26-
echo "info: 27B — reference quant..."
27-
huggingface-cli download unsloth/Qwen3.6-27B-MTP-GGUF Qwen3.6-27B-UD-Q4_K_XL.gguf \
28-
--local-dir "$DEST/27b" --local-dir-use-symlinks False
34+
dl unsloth/Qwen3.6-27B-MTP-GGUF "$DEST/27b" imatrix_unsloth.gguf_file
35+
echo "info: 27B — reference UD-Q4_K_XL..."
36+
dl unsloth/Qwen3.6-27B-MTP-GGUF "$DEST/27b" Qwen3.6-27B-UD-Q4_K_XL.gguf
2937
echo "info: 27B — BF16 shards..."
30-
huggingface-cli download unsloth/Qwen3.6-27B-MTP-GGUF --include "BF16/*" \
31-
--local-dir "$DEST/27b" --local-dir-use-symlinks False
38+
dl unsloth/Qwen3.6-27B-MTP-GGUF "$DEST/27b" --include "BF16/*"
3239

3340
echo "info: 35B-A3B — imatrix..."
34-
huggingface-cli download unsloth/Qwen3.6-35B-A3B-MTP-GGUF imatrix_unsloth.gguf_file \
35-
--local-dir "$DEST/35a3b" --local-dir-use-symlinks False
36-
echo "info: 35B-A3B — reference quant..."
37-
huggingface-cli download unsloth/Qwen3.6-35B-A3B-MTP-GGUF Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
38-
--local-dir "$DEST/35a3b" --local-dir-use-symlinks False
41+
dl unsloth/Qwen3.6-35B-A3B-MTP-GGUF "$DEST/35a3b" imatrix_unsloth.gguf_file
42+
echo "info: 35B-A3B — reference UD-Q4_K_XL..."
43+
dl unsloth/Qwen3.6-35B-A3B-MTP-GGUF "$DEST/35a3b" Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf
3944
echo "info: 35B-A3B — BF16 shards..."
40-
huggingface-cli download unsloth/Qwen3.6-35B-A3B-MTP-GGUF --include "BF16/*" \
41-
--local-dir "$DEST/35a3b" --local-dir-use-symlinks False
45+
dl unsloth/Qwen3.6-35B-A3B-MTP-GGUF "$DEST/35a3b" --include "BF16/*"
4246

4347
echo "ok: sources under $DEST/{27b,35a3b}"
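The `hf` / `huggingface-cli` fallback in this script is an instance of a general "first available command" lookup. A minimal sketch of that pattern is shown below; the function name `pick_cli` is hypothetical and is not part of the committed script, which inlines the check with `if` / `elif`.

```bash
# Hypothetical helper: print the first argument that resolves to a command on PATH.
# Returns non-zero (and prints nothing) when none of the candidates exist.
pick_cli() {
  for candidate in "$@"; do
    if command -v "$candidate" >/dev/null 2>&1; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Possible use, in the spirit of the script above (not executed here):
#   HF="$(pick_cli hf huggingface-cli)" || { echo "error: no Hugging Face CLI found" >&2; exit 1; }
```

Listing the newer `hf` entry point first preserves the preference order of the committed script while leaving room for further fallbacks.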
