docs-site/src/content/docs/integrations

@@ -261,6 +261,7 @@ llama-server \
   --cache-type-v q8_0 \
   --swa-full \
   --no-context-shift \
+  --chat-template-kwargs '{"enable_thinking": false}' \
   --mlock \
   --no-mmap
 ```
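The `--chat-template-kwargs` value is a raw JSON object, so a shell-quoting slip (a stray quote, `True` instead of `true`) only surfaces when the server tries to parse it. A minimal sketch, plain Python with no llama.cpp dependency, that builds the argument programmatically so the JSON is well-formed by construction:

```python
import json
import shlex

# Build the kwargs as a Python dict; json.dumps guarantees valid JSON
# (lowercase booleans, double quotes), and shlex.quote makes the result
# safe to splice into a shell command line.
kwargs = {"enable_thinking": False}
arg = json.dumps(kwargs)
print(f"--chat-template-kwargs {shlex.quote(arg)}")
# → --chat-template-kwargs '{"enable_thinking": false}'
```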
@@ -269,6 +270,7 @@ llama-server \
 
 | Setting | Why |
 | --------- | ----- |
+| `--chat-template-kwargs ...` | Disables thinking mode -- nearly 2x faster generation without meaningful quality loss for agentic workflows |
 | `--swa-full` | Expands SWA cache to full context, enabling prompt caching (uses more RAM) |
 | `--no-context-shift` | Required -- context shift is incompatible with SWA |
 | `--cache-type-k/v q8_0` | "Basically free" quality-wise, boosts throughput |
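Recent llama.cpp server builds also accept a per-request `chat_template_kwargs` field in the OpenAI-compatible chat endpoint, which would let one server default to non-thinking while individual requests opt back in -- verify your build supports this before relying on it. A sketch of such a request body (the model name and message are placeholders):

```python
import json

# Per-request override (assumes server-side support for the
# "chat_template_kwargs" request field; check your llama.cpp version).
payload = {
    "model": "qwen3",  # placeholder model name
    "messages": [{"role": "user", "content": "Summarize the build errors."}],
    "chat_template_kwargs": {"enable_thinking": True},  # opt back in
}
body = json.dumps(payload)
```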
@@ -277,10 +279,9 @@ llama-server \
 
 **Performance (M1 Max 64 GB):**
 
-- Cold start: ~93 seconds (processing ~35k token system prompt)
-- Cached follow-ups: ~10 seconds
-- Prompt eval: ~245--375 tok/s
-- Generation: ~12 tok/s
+- Cached follow-ups: ~3 seconds
+- Prompt eval: ~374--408 tok/s
+- Generation: ~21--23 tok/s
 
 | Quant | Size | Notes |
 | ------- | ------ | ------- |
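The generation rate dominates a typical cached turn, which is why disabling thinking (and roughly doubling tok/s) matters more than further prompt-eval gains. A back-of-envelope check using the midpoints above and assumed token counts (the 2,000/300 split is illustrative, not measured):

```python
# Illustrative follow-up turn: 2,000 uncached prompt tokens plus a
# 300-token reply, at the midpoint rates reported above.
prompt_tokens = 2000   # assumed new tokens beyond the cached prefix
gen_tokens = 300       # assumed response length
prompt_rate = 390      # tok/s, midpoint of ~374--408
gen_rate = 22          # tok/s, midpoint of ~21--23

total = prompt_tokens / prompt_rate + gen_tokens / gen_rate
print(f"~{total:.1f} s per turn")  # generation accounts for ~73% of it
```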