
Commit bcbeba3

docs: add Qwen3.5-35B-A3B to Starlight local LLM docs

- Full llama-server command with tested settings
- SWA cache fix (--swa-full --no-context-shift)
- KV cache q8_0, mlock, no-mmap optimizations
- Performance numbers from M1 Max 64GB testing
- Caution admonition for SWA cache gotcha

1 parent 8238256 commit bcbeba3

1 file changed: docs-site/src/content/docs/integrations/local-llms.mdx (61 additions, 0 deletions)
@@ -238,6 +238,66 @@ llama-server \
|-------|------|-------|
| UD-Q4\_K\_XL | ~46 GB | Recommended for 64 GB systems |

### Qwen3.5-35B-A3B -- Smart General-Purpose MoE

A 35B MoE model with 3B active parameters. It uses sliding window attention (SWA), which requires the `--swa-full` flag to enable prompt caching across follow-up requests; without it, every request reprocesses the full prompt from scratch.

```bash
llama-server \
  -hf unsloth/Qwen3.5-35B-A3B-GGUF:Q4_K_M \
  --port 8131 \
  -c 131072 \
  -b 2048 \
  -ub 1024 \
  --parallel 1 \
  -fa on \
  --jinja \
  --keep 1024 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --swa-full \
  --no-context-shift \
  --mlock \
  --no-mmap
```
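
Once the server is up, you can exercise it through the OpenAI-compatible API that llama-server exposes under `/v1`. A minimal smoke-test sketch; the port matches the command above, and the `model` field can usually be omitted when a single model is loaded:

```bash
# Minimal smoke test against the server started above (port 8131).
# llama-server exposes an OpenAI-compatible chat endpoint under /v1.
curl -s http://localhost:8131/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Say hello in five words."}],
    "max_tokens": 32
  }'
```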

**Critical settings:**

| Setting | Why |
|---------|-----|
| `--swa-full` | Expands the SWA cache to the full context, enabling prompt caching (uses more RAM) |
| `--no-context-shift` | Required -- context shift is incompatible with SWA |
| `--cache-type-k/v q8_0` | "Basically free" quality-wise, boosts throughput |
| `--keep 1024` | Keeps the system prompt prefix in cache |
| `--mlock --no-mmap` | macOS memory optimization |

**Performance (M1 Max 64 GB):**

- Cold start: ~93 seconds (processing a ~35k-token system prompt)
- Cached follow-ups: ~10 seconds
- Prompt eval: ~245--375 tok/s
- Generation: ~12 tok/s
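
As a sanity check, the cold-start figure lines up with the prompt-eval rate: ~35,000 tokens at ~375 tok/s is roughly 93 seconds. A rough way to observe the cache effect yourself is to send the same large prompt twice and compare wall-clock times; this is a sketch, with the filler payload size and the temp-file path as arbitrary choices:

```bash
# Build a large throwaway payload; "word " * 20000 is filler standing in
# for a big system prompt (size and path are placeholder choices).
python3 -c 'import json; print(json.dumps({
    "messages": [{"role": "user",
                  "content": "word " * 20000 + "How many words was that?"}],
    "max_tokens": 16}))' > /tmp/swa-test.json

# The first request pays the cold-start cost; with --swa-full the second
# should return much faster because the shared prefix is served from cache.
for i in 1 2; do
  time curl -s http://localhost:8131/v1/chat/completions \
    -H "Content-Type: application/json" \
    --data @/tmp/swa-test.json > /dev/null
done
```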

| Quant | Size | Notes |
|-------|------|-------|
| Q4\_K\_M | ~20 GB | Good balance, recommended |
| UD-Q4\_K\_XL | ~21 GB | Slightly better quality |
| UD-Q5\_K\_XL | ~25 GB | Higher quality, slower |
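
Switching quants only changes the tag after the colon in the `-hf` argument; for example, to try the higher-quality UD-Q5\_K\_XL build (assuming the Unsloth repo publishes the tags listed above):

```bash
# Same settings as the full command above, pointing at a different quant tag.
llama-server -hf unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q5_K_XL --port 8131 \
  -c 131072 -b 2048 -ub 1024 --parallel 1 -fa on --jinja --keep 1024 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --swa-full --no-context-shift --mlock --no-mmap
```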

:::caution[SWA Cache Gotcha]
Without `--swa-full`, this model reprocesses the entire ~35k-token prompt on every request, turning a 10-second follow-up into a 93-second wait. This is due to the sliding window attention architecture invalidating the KV cache. See [llama.cpp #19858](https://github.com/ggml-org/llama.cpp/issues/19858) for details.
:::

### Qwen3-Next-80B-A3B -- Better Long Context

Slower generation but performance does not degrade
@@ -349,6 +409,7 @@

| Qwen3-VL-30B | 8128 | See [Vision Models](#vision-models) |
| GLM-4.7-Flash | 8129 | See full command above |
| Qwen3-Coder-Next | 8130 | See full command above (~46 GB) |
| Qwen3.5-35B-A3B | 8131 | See full command above (needs `--swa-full`) |

## Vision Models