|`-np, --parallel N`| number of parallel sequences to decode (default: 1)<br/>(env: LLAMA_ARG_N_PARALLEL) |
+|`--rpc SERVERS`| comma-separated list of RPC servers (host:port)<br/>(env: LLAMA_ARG_RPC) |
|`--mlock`| force system to keep model in RAM rather than swapping or compressing<br/>(env: LLAMA_ARG_MLOCK) |
|`--mmap, --no-mmap`| whether to memory-map model. (if mmap disabled, slower load but may reduce pageouts if not using mlock) (default: enabled)<br/>(env: LLAMA_ARG_MMAP) |
|`-dio, --direct-io, -ndio, --no-direct-io`| use DirectIO if available. (default: disabled)<br/>(env: LLAMA_ARG_DIO) |
@@ -198,7 +199,7 @@
|`--spec-draft-device, -devd, --device-draft <dev1,dev2,..>`| comma-separated list of devices to use for offloading the draft model (none = don't offload)<br/>use --list-devices to see a list of available devices |
|`--spec-draft-ngl, -ngld, --gpu-layers-draft, --n-gpu-layers-draft N`| max. number of draft model layers to store in VRAM, either an exact number, 'auto', or 'all' (default: auto)<br/>(env: LLAMA_ARG_N_GPU_LAYERS_DRAFT) |
|`--spec-draft-model, -md, --model-draft FNAME`| draft model for speculative decoding (default: unused)<br/>(env: LLAMA_ARG_SPEC_DRAFT_MODEL) |
-|`--spec-type [none\|ngram-cache\|ngram-simple\|ngram-map-k\|ngram-map-k4v\|ngram-mod]`|type of speculative decoding to use when no draft model is provided (default: none)<br/><br/>(env: LLAMA_ARG_SPEC_TYPE) |
+|`--spec-type none,draft-simple,draft-eagle3,ngram-simple,ngram-map-k,ngram-map-k4v,ngram-mod,ngram-cache`|comma-separated list of types of speculative decoding to use (default: none)<br/><br/>(env: LLAMA_ARG_SPEC_TYPE) |
|`--spec-ngram-mod-n-min N`| minimum number of ngram tokens to use for ngram-based speculative decoding (default: 48) |
|`--spec-ngram-mod-n-max N`| maximum number of ngram tokens to use for ngram-based speculative decoding (default: 64) |
|`-np, --parallel N`| number of parallel sequences to decode (default: 1)<br/>(env: LLAMA_ARG_N_PARALLEL) |
+|`--rpc SERVERS`| comma-separated list of RPC servers (host:port)<br/>(env: LLAMA_ARG_RPC) |
|`--mlock`| force system to keep model in RAM rather than swapping or compressing<br/>(env: LLAMA_ARG_MLOCK) |
|`--mmap, --no-mmap`| whether to memory-map model. (if mmap disabled, slower load but may reduce pageouts if not using mlock) (default: enabled)<br/>(env: LLAMA_ARG_MMAP) |
|`-dio, --direct-io, -ndio, --no-direct-io`| use DirectIO if available. (default: disabled)<br/>(env: LLAMA_ARG_DIO) |
+|`--rpc SERVERS`| comma-separated list of RPC servers (host:port)<br/>(env: LLAMA_ARG_RPC) |
|`--mlock`| force system to keep model in RAM rather than swapping or compressing<br/>(env: LLAMA_ARG_MLOCK) |
|`--mmap, --no-mmap`| whether to memory-map model. (if mmap disabled, slower load but may reduce pageouts if not using mlock) (default: enabled)<br/>(env: LLAMA_ARG_MMAP) |
|`-dio, --direct-io, -ndio, --no-direct-io`| use DirectIO if available. (default: disabled)<br/>(env: LLAMA_ARG_DIO) |
@@ -247,7 +248,7 @@ For the full list of features, please refer to [server's changelog](https://gith
|`--spec-draft-device, -devd, --device-draft <dev1,dev2,..>`| comma-separated list of devices to use for offloading the draft model (none = don't offload)<br/>use --list-devices to see a list of available devices |
|`--spec-draft-ngl, -ngld, --gpu-layers-draft, --n-gpu-layers-draft N`| max. number of draft model layers to store in VRAM, either an exact number, 'auto', or 'all' (default: auto)<br/>(env: LLAMA_ARG_N_GPU_LAYERS_DRAFT) |
|`--spec-draft-model, -md, --model-draft FNAME`| draft model for speculative decoding (default: unused)<br/>(env: LLAMA_ARG_SPEC_DRAFT_MODEL) |
-|`--spec-type [none\|ngram-cache\|ngram-simple\|ngram-map-k\|ngram-map-k4v\|ngram-mod]`|type of speculative decoding to use when no draft model is provided (default: none)<br/><br/>(env: LLAMA_ARG_SPEC_TYPE) |
+|`--spec-type none,draft-simple,draft-eagle3,ngram-simple,ngram-map-k,ngram-map-k4v,ngram-mod,ngram-cache`|comma-separated list of types of speculative decoding to use (default: none)<br/><br/>(env: LLAMA_ARG_SPEC_TYPE) |
|`--spec-ngram-mod-n-min N`| minimum number of ngram tokens to use for ngram-based speculative decoding (default: 48) |
|`--spec-ngram-mod-n-max N`| maximum number of ngram tokens to use for ngram-based speculative decoding (default: 64) |
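
The updated `--spec-type` row changes the value from a single choice to a comma-separated list. As a rough illustration of what that format implies for a caller, here is a minimal Python sketch that splits and checks such a value. The function name `parse_spec_types`, the rejection of unknown names, and the fallback to `none` for an empty string are assumptions made for this example; this is not the parser llama.cpp itself uses.

```python
# Type names copied from the updated --spec-type row in the diff above.
KNOWN_SPEC_TYPES = (
    "none", "draft-simple", "draft-eagle3", "ngram-simple",
    "ngram-map-k", "ngram-map-k4v", "ngram-mod", "ngram-cache",
)


def parse_spec_types(value: str) -> list[str]:
    """Split a comma-separated --spec-type value and reject unknown names.

    Hypothetical helper: the validation rules here are assumptions.
    """
    # Drop surrounding whitespace and empty segments such as "a,,b".
    names = [part.strip() for part in value.split(",") if part.strip()]
    unknown = [name for name in names if name not in KNOWN_SPEC_TYPES]
    if unknown:
        raise ValueError(f"unknown spec type(s): {', '.join(unknown)}")
    # Assumed fallback: the documented default is "none".
    return names or ["none"]


print(parse_spec_types("ngram-mod,ngram-cache"))
```

The same string could be passed through the `LLAMA_ARG_SPEC_TYPE` environment variable named in the table row.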