Commit 1e5daa2

docs: update Qwen3.5-35B settings with thinking-disabled flag
- Add --chat-template-kwargs to disable thinking mode
- Generation speed 21-23 tok/s (up from 12 tok/s)
- Cached follow-ups ~3s, prompt eval 374-408 tok/s
1 parent 6f345aa commit 1e5daa2

1 file changed: 5 additions & 4 deletions

docs-site/src/content/docs/integrations/local-llms.mdx
````diff
@@ -261,6 +261,7 @@ llama-server \
   --cache-type-v q8_0 \
   --swa-full \
   --no-context-shift \
+  --chat-template-kwargs '{"enable_thinking": false}' \
   --mlock \
   --no-mmap
 ```
@@ -269,6 +270,7 @@ llama-server \
 
 | Setting | Why |
 |---------|-----|
+| `--chat-template-kwargs ...` | Disables thinking mode -- nearly 2x faster generation without meaningful quality loss for agentic workflows |
 | `--swa-full` | Expands SWA cache to full context, enabling prompt caching (uses more RAM) |
 | `--no-context-shift` | Required -- context shift is incompatible with SWA |
 | `--cache-type-k/v q8_0` | "Basically free" quality-wise, boosts throughput |
@@ -277,10 +279,9 @@ llama-server \
 
 **Performance (M1 Max 64 GB):**
 
-- Cold start: ~93 seconds (processing ~35k token system prompt)
-- Cached follow-ups: ~10 seconds
-- Prompt eval: ~245--375 tok/s
-- Generation: ~12 tok/s
+- Cached follow-ups: ~3 seconds
+- Prompt eval: ~374--408 tok/s
+- Generation: ~21--23 tok/s
 
 | Quant | Size | Notes |
 |-------|------|-------|
````
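The value passed to `--chat-template-kwargs` in the first hunk must be a valid JSON object. A quick sanity check (a minimal sketch, not part of this commit) catches quoting mistakes before a server restart:

```python
import json

# The argument string exactly as it appears in the diff above.
kwargs = '{"enable_thinking": false}'

# json.loads fails loudly on malformed input (e.g. single quotes,
# a trailing comma, or Python-style True/False).
parsed = json.loads(kwargs)
print(parsed)  # → {'enable_thinking': False}
```

Typical failure mode: wrapping the object in double quotes on the shell line and then using double quotes inside it too, which yields invalid JSON after shell unquoting.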

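The commit message's "up from 12 tok/s" claim and the removed ~93-second cold start are easy to sanity-check from the numbers in the diff (the ~35k-token prompt size comes from the deleted bullet; the rest is arithmetic):

```python
# Generation speedup: 12 tok/s before, 21-23 tok/s after.
old_gen = 12.0
speedup = tuple(g / old_gen for g in (21.0, 23.0))
print(f"speedup: {speedup[0]:.2f}x-{speedup[1]:.2f}x")  # → speedup: 1.75x-1.92x

# Old cold start: a ~35k-token system prompt at the old upper
# prompt-eval rate of ~375 tok/s.
cold_start_s = 35_000 / 375
print(f"estimated cold start: {cold_start_s:.0f} s")  # → estimated cold start: 93 s
```

So "nearly 2x faster generation" in the new table row is consistent with the before/after throughput figures.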