
Commit 4002a3d

feat: add Qwen3.6-35B-A3B local model support
Add Qwen3.6 model documentation with optimized settings. Key finding: q8_0 KV cache kills tg from ~35 to ~12 tok/s on this model — use default f16 cache instead. Performance: pp ~575 tok/s, tg ~35 tok/s on M1 Max 64GB.
1 parent b12511f

2 files changed: 121 additions & 0 deletions

docs-site/src/content/docs/integrations/local-llms.mdx

Lines changed: 64 additions & 0 deletions
@@ -187,6 +187,7 @@ prompt). All models served via `llama-server`.

| Model | Active Params | tg (tok/s) |
|-------|--------------|------------|
| **Gemma-4-26B-A4B** | **4B** | **~40** |
| **Qwen3.6-35B-A3B** | **3B** | **~35** |
| GPT-OSS-20B | 3.6B | ~17--38 |
| Qwen3-30B-A3B | 3B | ~15--27 |
| GLM-4.7-Flash | 3B | ~12--13 |
@@ -476,6 +477,68 @@ feed visible answers back -- exclude prior thought

blocks.
:::

### Qwen3.6-35B-A3B -- Fast Qwen MoE

A 35B MoE model with 3B active parameters. Successor to Qwen3.5-35B-A3B, adding vision support. Uses sliding window attention (SWA).
```bash
llama-server \
  -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL \
  --port 8133 \
  -ngl 999 \
  --threads 8 \
  -c 65536 \
  -b 2048 \
  -ub 1024 \
  --parallel 1 \
  -fa on \
  --jinja \
  --keep 1024 \
  --swa-full \
  --no-context-shift \
  --chat-template-kwargs '{"enable_thinking": false}' \
  --temp 0.7 \
  --top-p 0.8 \
  --top-k 20 \
  --min-p 0.00 \
  --no-mmap
```

**Critical settings:**

| Setting | Why |
|---------|-----|
| No `--cache-type-k/v` | **Do not** use `q8_0` KV cache -- it kills tg from ~35 to ~12 tok/s. Use default f16 cache. |
| `--swa-full` | Expands SWA cache to full context, enabling prompt caching |
| `--no-context-shift` | Required -- context shift is incompatible with SWA |
| `--chat-template-kwargs ...` | Disables thinking mode for agentic workflows |
| `-c 65536` | 64K context -- enough for Claude Code, avoids the RAM cost of 128K |
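
With the server up, a quick way to confirm the chat template and settings took effect is a request against llama-server's OpenAI-compatible endpoint. A minimal sketch -- the `model` value is arbitrary for a single-model server, and no API key is needed unless the server was started with `--api-key`:

```bash
# Smoke test against the OpenAI-compatible chat endpoint on port 8133.
curl -s http://localhost:8133/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.6-35b-a3b",
    "messages": [{"role": "user", "content": "Reply with one short sentence."}],
    "max_tokens": 64
  }'
```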

**Performance (M1 Max 64 GB, ~41K input tokens):**

pp = prompt processing, tg = token generation.

- Cold start: pp 575 tok/s, tg 35 tok/s (~79s total)
- Cached follow-up: tg 35 tok/s (~8s total)
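
These figures are internally consistent: ~41K prompt tokens at ~575 tok/s is roughly 71 s of prompt processing, leaving ~8 s of generation (about 280 tokens at ~35 tok/s) in the ~79 s cold-start total. The cached follow-up skips prompt processing entirely, which is why it matches the generation-only time.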

| Quant | Size | Notes |
|-------|------|-------|
| UD-Q4\_K\_XL | ~23 GB | Recommended |
| UD-Q4\_K\_M | ~22 GB | Slightly smaller |
| UD-Q4\_K\_S | ~21 GB | Smallest Q4, marginal quality loss |

:::caution[KV Cache Quantization]
Using `--cache-type-k q8_0 --cache-type-v q8_0` reduces RAM usage but **drops token generation from ~35 tok/s to ~12 tok/s** on this model. The dequantization overhead per decode step is severe. This does not affect all models equally -- Qwen3.5 showed the same penalty, but other architectures may not.
:::
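
To reproduce the comparison on your own hardware, an A/B run with `llama-bench` is the simplest check. A sketch, assuming a locally downloaded GGUF -- the filename is illustrative, and flag spellings vary slightly between llama.cpp releases:

```bash
# Baseline: default f16 KV cache.
llama-bench -m qwen3.6-35b-a3b-ud-q4_k_xl.gguf -ngl 999 -p 2048 -n 256

# Quantized KV cache -- expect tg to collapse on this model.
# (A quantized V cache requires flash attention in llama.cpp.)
llama-bench -m qwen3.6-35b-a3b-ud-q4_k_xl.gguf -ngl 999 -p 2048 -n 256 \
  -fa 1 -ctk q8_0 -ctv q8_0
```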

## Quick Reference

| Model | Port | Command |

@@ -490,6 +553,7 @@ blocks.

| Qwen3-Coder-Next | 8130 | See full command above (~46 GB) |
| Qwen3.5-35B-A3B | 8131 | See full command above (needs `--swa-full`) |
| Gemma-4-26B-A4B | 8132 | See full command above |
| Qwen3.6-35B-A3B | 8133 | See full command above (no q8\_0 cache!) |

## Vision Models
docs/local-llm-setup.md

Lines changed: 57 additions & 0 deletions
@@ -99,6 +99,7 @@ prompt). All models served via `llama-server`.

| Model | Active Params | tg (tok/s) |
|-------|--------------|------------|
| **Gemma-4-26B-A4B** | **4B** | **~40** |
| **Qwen3.6-35B-A3B** | **3B** | **~35** |
| GPT-OSS-20B | 3.6B | ~17--38 |
| Qwen3-30B-A3B | 3B | ~15--27 |
| GLM-4.7-Flash | 3B | ~12--13 |
@@ -313,6 +314,61 @@ pp = prompt processing, tg = token generation.

> before the final answer. For multi-turn conversations, only feed visible
> answers back -- exclude prior thought blocks.

### Qwen3.6-35B-A3B (Fast Qwen MoE)

A 35B MoE model with 3B active parameters. Successor to Qwen3.5, adding vision support. Uses sliding window attention (SWA).

```bash
llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL \
  --port 8133 \
  -ngl 999 \
  --threads 8 \
  -c 65536 \
  -b 2048 \
  -ub 1024 \
  --parallel 1 \
  -fa on \
  --jinja \
  --keep 1024 \
  --swa-full \
  --no-context-shift \
  --chat-template-kwargs '{"enable_thinking": false}' \
  --temp 0.7 \
  --top-p 0.8 \
  --top-k 20 \
  --min-p 0.00 \
  --no-mmap
```
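
Any OpenAI-compatible client can then target the server. As a sketch using the official OpenAI SDKs' standard environment variables -- llama-server ignores the key unless it was started with `--api-key`:

```bash
# Point an OpenAI-SDK-based client at the local server.
export OPENAI_BASE_URL="http://localhost:8133/v1"
export OPENAI_API_KEY="sk-local-placeholder"  # any non-empty value works
```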

**Critical settings:**

| Setting | Why |
|---------|-----|
| No `--cache-type-k/v` | **Do not** use `q8_0` KV cache -- drops tg from ~35 to ~12 tok/s |
| `--swa-full` | Expands SWA cache for prompt caching |
| `--no-context-shift` | Required with SWA |
| `--chat-template-kwargs ...` | Disables thinking mode for agentic workflows |
| `-c 65536` | 64K context -- enough for Claude Code |

**Performance (M1 Max 64 GB, ~41K input tokens):**

pp = prompt processing, tg = token generation.

- Cold start: pp 575 tok/s, tg 35 tok/s (~79s total)
- Cached follow-up: tg 35 tok/s (~8s total)

**Quantization options:**

| Quant | Size | Notes |
|-------|------|-------|
| UD-Q4_K_XL | ~23 GB | Recommended |
| UD-Q4_K_M | ~22 GB | Slightly smaller |
| UD-Q4_K_S | ~21 GB | Smallest Q4, marginal quality loss |
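
The `-hf` flag downloads the GGUF on first start. To prefetch it instead (e.g. for offline use), a `huggingface-cli` sketch -- the include pattern is assumed from the quant names above:

```bash
# Prefetch the recommended quant; then point llama-server at it
# with -m models/<file>.gguf instead of -hf.
huggingface-cli download unsloth/Qwen3.6-35B-A3B-GGUF \
  --include "*UD-Q4_K_XL*" --local-dir models/
```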

> **Warning:** Using `--cache-type-k q8_0 --cache-type-v q8_0` reduces RAM but
> **drops token generation from ~35 tok/s to ~12 tok/s** on this model.

## Quick Reference

| Model | Port | Command |

@@ -325,6 +381,7 @@ pp = prompt processing, tg = token generation.

| Qwen3-Coder-Next | 8130 | See full command above (~46GB RAM) |
| GLM-4.7-Flash | 8129 | See full command above (requires chat template) |
| Gemma-4-26B-A4B | 8132 | See full command above |
| Qwen3.6-35B-A3B | 8133 | See full command above (no q8_0 cache!) |

## Usage