@@ -238,6 +238,101 @@ llama-server \
 |-------|------|-------|
 | UD-Q4\_K\_XL | ~46 GB | Recommended for 64 GB systems |
 
+### Qwen3.5-35B-A3B -- Smart General-Purpose MoE
+
+A 35B MoE model with 3B active parameters. It uses
+sliding window attention (SWA), which requires the
+`--swa-full` flag to enable prompt caching across
+follow-up requests; without it, every request
+reprocesses the full prompt from scratch.
+
+```bash
+llama-server \
+  -hf unsloth/Qwen3.5-35B-A3B-GGUF:Q4_K_M \
+  --port 8131 \
+  -c 131072 \
+  -b 2048 \
+  -ub 1024 \
+  --parallel 1 \
+  -fa on \
+  --jinja \
+  --keep 1024 \
+  --cache-type-k q8_0 \
+  --cache-type-v q8_0 \
+  --swa-full \
+  --no-context-shift \
+  --mlock \
+  --no-mmap
+```
+
+**Critical settings:**
+
+| Setting | Why |
+|---------|-----|
+| `--swa-full` | Expands the SWA cache to the full context, enabling prompt caching (uses more RAM) |
+| `--no-context-shift` | Required -- context shift is incompatible with SWA |
+| `--cache-type-k/v q8_0` | "Basically free" quality-wise; boosts throughput |
+| `--keep 1024` | Keeps the system-prompt prefix in the cache |
+| `--mlock --no-mmap` | Locks the weights in RAM (a macOS memory optimization) |
+
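+To exercise the cache, send requests to the server's
+OpenAI-compatible endpoint with an identical system
+prompt each time -- only a shared prefix can be reused.
+A minimal sketch (the prompt contents are placeholders):
+
+```bash
+# First call pays the ~93 s cold start; follow-ups that
+# share the cached prefix return in ~10 s (with --swa-full).
+curl -s http://localhost:8131/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+        "messages": [
+          {"role": "system", "content": "<long system prompt>"},
+          {"role": "user", "content": "Summarize the repo layout."}
+        ],
+        "max_tokens": 256
+      }'
+```
+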
+**Performance (M1 Max 64 GB):**
+
+- Cold start: ~93 seconds for the ~35k-token system
+  prompt (35,000 tokens / ~375 tok/s ≈ 93 s)
+- Cached follow-ups: ~10 seconds
+- Prompt eval: ~245--375 tok/s
+- Generation: ~12 tok/s
+
+| Quant | Size | Notes |
+|-------|------|-------|
+| Q4\_K\_M | ~20 GB | Good balance, recommended |
+| UD-Q4\_K\_XL | ~21 GB | Slightly better quality |
+| UD-Q5\_K\_XL | ~25 GB | Higher quality, slower |
+
+:::caution[SWA Cache Gotcha]
+Without `--swa-full`, this model reprocesses the
+entire ~35k-token prompt on every request, turning
+a 10-second follow-up into a 93-second wait: the
+sliding-window attention cache discards tokens that
+fall outside the window, invalidating the cached
+prefix. See
+[llama.cpp #19858](https://github.com/ggml-org/llama.cpp/issues/19858)
+for details.
+:::
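+
+One way to confirm caching is working: time the same
+request twice and compare wall-clock latency (a rough
+sketch, assuming the payload above is saved as request.json):
+
+```bash
+# request.json: the JSON body from the example above.
+# The second run should be dramatically faster with --swa-full.
+for i in 1 2; do
+  time curl -s http://localhost:8131/v1/chat/completions \
+    -H "Content-Type: application/json" \
+    -d @request.json > /dev/null
+done
+```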
+
 ### Qwen3-Next-80B-A3B -- Better Long Context
 
 Slower generation but performance does not degrade
@@ -349,6 +444,7 @@ llama-server \
 | Qwen3-VL-30B | 8128 | See [Vision Models](#vision-models) |
 | GLM-4.7-Flash | 8129 | See full command above |
 | Qwen3-Coder-Next | 8130 | See full command above (~46 GB) |
+| Qwen3.5-35B-A3B | 8131 | See full command above (needs `--swa-full`) |
 
 ## Vision Models
 