tools/server/README.md: 29 additions & 0 deletions
@@ -213,6 +213,7 @@ For the full list of features, please refer to [server's changelog](https://gith
| `--models-preset PATH` | path to INI file containing model presets for the router server (default: disabled)<br/>(env: LLAMA_ARG_MODELS_PRESET) |
| `--models-max N` | for router server, maximum number of models to load simultaneously (default: 4, 0 = unlimited)<br/>(env: LLAMA_ARG_MODELS_MAX) |
| `--models-autoload, --no-models-autoload` | for router server, whether to automatically load models (default: enabled)<br/>(env: LLAMA_ARG_MODELS_AUTOLOAD) |
+| `--models-cache [LIST]` | cache GGUF files in the OS page cache for fast model swapping (non-router mode); with no argument, cache all models; with a comma-separated list, cache only the listed models<br/>(env: LLAMA_ARG_MODELS_CACHE) |
| `--jinja, --no-jinja` | whether to use jinja template engine for chat (default: enabled)<br/>(env: LLAMA_ARG_JINJA) |
| `--reasoning-format FORMAT` | controls whether thought tags are allowed and/or extracted from the response, and in which format they're returned; one of:<br/>- none: leaves thoughts unparsed in `message.content`<br/>- deepseek: puts thoughts in `message.reasoning_content`<br/>- deepseek-legacy: keeps `<think>` tags in `message.content` while also populating `message.reasoning_content`<br/>(default: auto)<br/>(env: LLAMA_ARG_THINK) |
| `-rea, --reasoning [on\|off\|auto]` | Use reasoning/thinking in the chat ('on', 'off', or 'auto', default: 'auto' (detect from template))<br/>(env: LLAMA_ARG_REASONING) |
@@ -1599,6 +1600,7 @@ The precedence rule for preset options is as follows:
We also offer additional options that are exclusive to presets (these aren't treated as command-line arguments):
- `load-on-startup` (boolean): Controls whether the model loads automatically when the server starts
+- `cache-on-startup` (boolean): Controls whether the model's GGUF file is cached in the OS page cache on startup
- `stop-timeout` (int, seconds): After requested unload, wait for this many seconds before forcing termination (default: 10)
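
To show where these preset-only keys might sit in a preset file, here is a minimal Python sketch that parses a small INI snippet with the standard `configparser` module. The section name, the values, and the helper `read_preset_only_options` are made up for illustration, and `configparser` is only a stand-in: the server's own INI parsing may differ in details.

```python
import configparser

# Hypothetical preset text; the section name and values are illustrative only.
PRESET_TEXT = """
[ggml-org/gemma-3-4b-it-GGUF:Q4_K_M]
load-on-startup = false
cache-on-startup = true
stop-timeout = 10
"""


def read_preset_only_options(text):
    """Return the preset-exclusive options for every section of an INI string."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    options = {}
    for name in parser.sections():
        section = parser[name]
        options[name] = {
            # None means "not set in the preset"; the server's defaults are not assumed here.
            "load-on-startup": section.getboolean("load-on-startup", fallback=None),
            "cache-on-startup": section.getboolean("cache-on-startup", fallback=None),
            # 10 seconds is the documented default for stop-timeout.
            "stop-timeout": section.getint("stop-timeout", fallback=10),
        }
    return options
```

Calling `read_preset_only_options(PRESET_TEXT)` returns one dictionary per section, which makes it easy to check a preset before handing it to `--models-preset`.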
### Routing requests
@@ -1732,6 +1734,33 @@ Response:
}
```
+### POST `/models/cache`: Cache a model's GGUF file
+
+Preload a model's GGUF file into the OS page cache (RAM) so that later model swaps can read it from memory instead of disk. The file is mapped with `mmap` and the kernel is asked to prefetch it via `madvise(POSIX_MADV_WILLNEED)` (Linux/macOS) or `PrefetchVirtualMemory` (Windows). The model itself is not loaded.
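
The prefetch step can be approximated in Python as a rough sketch. This is an analogue, not the server's implementation: it uses Python's `mmap.madvise` with `mmap.MADV_WILLNEED` where the platform exposes them, it has no Windows `PrefetchVirtualMemory` equivalent, and the helper name `warm_page_cache` is hypothetical.

```python
import mmap
import os


def warm_page_cache(path):
    """Map a file read-only and hint the kernel to prefetch it; return its size in bytes."""
    size = os.path.getsize(path)
    if size == 0:
        # An empty file cannot be mapped and has nothing to prefetch.
        return 0
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # madvise() and MADV_WILLNEED only exist on platforms that support them.
            # The hint is advisory: the kernel decides how much to read ahead.
            if hasattr(mapped, "madvise") and hasattr(mmap, "MADV_WILLNEED"):
                mapped.madvise(mmap.MADV_WILLNEED)
    return size
```

Because the call only issues a hint, it returns quickly and does not guarantee that every page stays resident; the OS may evict pages under memory pressure.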
+
+Payload:
+
+```json
+{
+  "model": "ggml-org/gemma-3-4b-it-GGUF:Q4_K_M"
+}
+```
+
+Response:
+
+```json
+{
+  "success": true
+}
+```
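
A client call for the payload and response above could look like the following Python sketch, using only the standard library. The helper names `build_cache_request` and `cache_model` are hypothetical, the base URL is whatever address your server listens on, and error handling is omitted.

```python
import json
import urllib.request


def build_cache_request(base_url, model):
    """Build a POST /models/cache request carrying the model name as JSON."""
    body = json.dumps({"model": model}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/models/cache",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def cache_model(base_url, model, timeout=60.0):
    """Send the request; return True if the response body has "success": true."""
    request = build_cache_request(base_url, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response).get("success") is True


# Example (needs a running server; the address is an assumption):
# cache_model("http://localhost:8080", "ggml-org/gemma-3-4b-it-GGUF:Q4_K_M")
```

Splitting request construction from sending keeps the payload easy to inspect without a live server.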
+
+**Notes:**
+- The `cached` field in the `/models` response indicates whether a model's file has been cached in the page cache.
+- Use the `--models-cache` CLI flag or the `cache-on-startup` preset option to cache models automatically on startup.
+  - `--models-cache` (no argument): caches all registered models.
+  - `--models-cache modelA,modelB`: caches only the specified models.
+- Page cache warming uses `mmap` + `madvise(POSIX_MADV_WILLNEED)` (Linux/macOS) or `PrefetchVirtualMemory` (Windows); the model weights are not loaded into memory.
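
The two forms of `--models-cache` described in the notes can be modelled with a small Python sketch. The function `select_models_to_cache` is hypothetical and only mirrors the documented behaviour. Two points are assumptions of this example: an absent flag caches nothing, and names that are not registered are ignored silently.

```python
def select_models_to_cache(flag_value, registered):
    """Resolve which registered models to cache.

    flag_value is None when the flag is absent, "" when it is given without
    an argument, or a comma-separated list of model names.
    """
    if flag_value is None:
        return []  # flag not passed: nothing is cached (assumption)
    if flag_value.strip() == "":
        return list(registered)  # no argument: cache every registered model
    wanted = [name.strip() for name in flag_value.split(",") if name.strip()]
    # Keep registration order and skip names that are not registered (assumption).
    return [name for name in registered if name in wanted]
```

For example, `select_models_to_cache("modelA,modelB", ["modelA", "modelB", "modelC"])` selects only the first two models, while an empty string selects all three.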
+
## API errors
`llama-server` returns errors in the same format as OAI: https://github.com/openai/openai-openapi