@@ -33,18 +33,18 @@ An example to use this approach can be the rewriting of source code by a LLM.
3333This implementation looks for the last n-gram in history that matches the current n-gram and creates a draft using the m tokens following the matched n-gram. It is the simplest self-speculative approach with minimal overhead.
3434
3535```
36- llama-server [...] --spec-type ngram-simple --draft-max 64
36+ llama-server [...] --spec-type ngram-simple --spec-draft-n-max 64
3737```
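
The lookup described above can be illustrated with a short Python sketch. It is not the llama.cpp implementation (which is written in C++); the function name `draft_ngram_simple` and the list-based token history are assumptions made only for illustration. It scans backwards for the most recent earlier occurrence of the current n-gram and returns up to m tokens that followed it.

```python
def draft_ngram_simple(history, n, m):
    """Return up to m tokens that followed the last earlier match of the final n tokens.

    Illustrative only: `history` is a plain list of token ids, and the
    function name is hypothetical, not part of llama.cpp.
    """
    if len(history) <= n:
        return []
    key = history[-n:]
    # Walk backwards so the most recent earlier match wins.
    # Starting at len - n - 1 skips the current n-gram itself.
    for start in range(len(history) - n - 1, -1, -1):
        if history[start:start + n] == key:
            return history[start + n:start + n + m]
    return []


tokens = [1, 2, 3, 4, 5, 6, 1, 2, 3]
print(draft_ngram_simple(tokens, n=3, m=2))  # prints [4, 5]
```

The draft is only a proposal: the main model still verifies every drafted token, so a wrong match costs time but never changes the output.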
3838
3939#### n-gram Map Key (`ngram-map-k`)
4040
41- This implementation looks for the current n-gram of size n (called the _key_) in the token history. If the key n-gram is followed by the same m tokens (called the _mgram_) multiple times, it creates a draft using these m tokens. This approach requires a minimum number of occurrences (argument `--spec-ngram-min-hits`, default is 1) before generating drafts.
41+ This implementation looks for the current n-gram of size n (called the _key_) in the token history. If the key n-gram is followed by the same m tokens (called the _mgram_) multiple times, it creates a draft using these m tokens. This approach requires a minimum number of occurrences (argument `--spec-ngram-map-k-min-hits`, default is 1) before generating drafts.
4242
4343The number of accepted tokens is stored for each used n-gram.
4444
4545**Example:**
4646```
47- llama-server [...] --spec-type ngram-map-k --draft-max 64
47+ llama-server [...] --spec-type ngram-map-k --spec-draft-n-max 64
4848```
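
A minimal Python sketch of the key/mgram idea follows. It is an illustration under stated assumptions, not the actual C++ implementation: the function name `draft_ngram_map_k` is hypothetical, and the sketch assumes that "hits" means the number of times the same m tokens followed the key in the history. The per-n-gram bookkeeping of accepted tokens is left out.

```python
from collections import Counter


def draft_ngram_map_k(history, n, m, min_hits=1):
    """Draft the m-gram that most often followed the current key n-gram.

    Illustrative only: the name and the exact meaning of `min_hits`
    are assumptions, not taken from llama.cpp.
    """
    if m < 1 or len(history) < n + m:
        return []
    key = tuple(history[-n:])
    followers = Counter()
    # Count every earlier occurrence of the key that has m full tokens after it.
    # The upper bound excludes the current key at the end of the history.
    for start in range(len(history) - n - m + 1):
        if tuple(history[start:start + n]) == key:
            followers[tuple(history[start + n:start + n + m])] += 1
    if not followers:
        return []
    mgram, hits = followers.most_common(1)[0]
    # Only draft once the same continuation has been seen often enough.
    return list(mgram) if hits >= min_hits else []


tokens = [7, 8, 9, 1, 7, 8, 9, 1, 7, 8]
print(draft_ngram_map_k(tokens, n=2, m=2, min_hits=2))  # prints [9, 1]
```

Raising `min_hits` in this sketch trades fewer drafts for drafts that are backed by more repetitions in the history.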
4949
5050#### n-gram Map Key-4-Values (`ngram-map-k4v`)
@@ -55,7 +55,7 @@ The number of accepted tokens is stored for each used n-gram.
5555
5656**Example:** Server options to be used if there are a lot of longer repetitions.
5757```
58- llama-server [...] --spec-type ngram-map-k4v --spec-ngram-size-n 8 --spec-ngram-size-m 8 --spec-ngram-min-hits 2 --draft-max 64
58+ llama-server [...] --spec-type ngram-map-k4v --spec-ngram-map-k4v-size-n 8 --spec-ngram-map-k4v-size-m 8 --spec-ngram-map-k4v-min-hits 2 --spec-draft-n-max 64
5959```
6060
6161### n-gram Mod (`ngram-mod`)
@@ -80,9 +80,9 @@ Currently, a single hash pool is shared across all server slots, so different re
8080# notes:
8181# - small `n` are not recommended
8282# - MoEs require long drafts
83- # - dense models: can reduce `--draft-min` and `--draft-max`
83+ # - dense models: can reduce `--spec-ngram-mod-n-min` and `--spec-ngram-mod-n-max`
8484
85- llama-server ... --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64
85+ llama-server ... --spec-type ngram-mod --spec-ngram-mod-n-match 24 --spec-ngram-mod-n-min 48 --spec-ngram-mod-n-max 64
8686```
8787
8888Applications:
@@ -105,21 +105,90 @@ Example Video:
105105
106106If a draft model is combined with a draftless decoding the draftless decoding has higher precedence.
107107
108+ ### General Speculative Parameters
109+
108110```
109- --draft, --draft-n, --draft-max N number of tokens to draft for speculative decoding (default: 16)
110- (env: LLAMA_ARG_DRAFT_MAX)
111- --draft-min, --draft-n-min N minimum number of draft tokens to use for speculative decoding
112- (default: 0)
113- (env: LLAMA_ARG_DRAFT_MIN)
114- [...]
115111--spec-type [none|ngram-cache|ngram-simple|ngram-map-k|ngram-map-k4v|ngram-mod]
116112 type of speculative decoding to use when no draft model is provided
117113 (default: none)
118- --spec-ngram-size-n N ngram size N for ngram-simple/ngram-map speculative decoding, length
119- of lookup n-gram (default: 12)
120- --spec-ngram-size-m N ngram size M for ngram-simple/ngram-map speculative decoding, length
121- of draft m-gram (default: 48)
122- --spec-ngram-min-hits N minimum hits for ngram-map speculative decoding (default: 1)
114+ (env: LLAMA_ARG_SPEC_TYPE)
115+ --spec-default use default speculative decoding
116+ ```
117+
118+ ### Draft Model Parameters
119+
120+ ```
121+ --spec-draft-model, -md, --model-draft FNAME
122+ draft model for speculative decoding (default: unused)
123+ (env: LLAMA_ARG_SPEC_DRAFT_MODEL)
124+ --spec-draft-hf, -hfd, -hfrd, --hf-repo-draft <user>/<model>[:quant]
125+ HuggingFace repository for the draft model
126+ --spec-draft-n-max N
127+ number of tokens to draft for speculative decoding (default: 16)
128+ (env: LLAMA_ARG_SPEC_DRAFT_N_MAX)
129+ --spec-draft-n-min N
130+ minimum number of draft tokens to use for speculative decoding (default: 0)
131+ (env: LLAMA_ARG_SPEC_DRAFT_N_MIN)
132+ --spec-draft-p-split, --draft-p-split P
133+ speculative decoding split probability (default: 0.10)
134+ (env: LLAMA_ARG_SPEC_DRAFT_P_SPLIT)
135+ --spec-draft-p-min, --draft-p-min P
136+ minimum speculative decoding probability (greedy) (default: 0.75)
137+ (env: LLAMA_ARG_SPEC_DRAFT_P_MIN)
138+ --spec-draft-ctx-size, -cd, --ctx-size-draft N
139+ size of the prompt context for the draft model (default: 0, 0 = loaded from model)
140+ (env: LLAMA_ARG_SPEC_DRAFT_CTX_SIZE)
141+ --spec-draft-ngl, -ngld, --gpu-layers-draft, --n-gpu-layers-draft N
142+ max. number of draft model layers to store in VRAM, either an exact number, 'auto', or 'all' (default: auto)
143+ (env: LLAMA_ARG_N_GPU_LAYERS_DRAFT)
144+ --spec-draft-device, -devd, --device-draft <dev1,dev2,..>
145+ comma-separated list of devices to use for offloading the draft model
146+ --spec-draft-replace, --spec-replace TARGET DRAFT
147+ translate the string in TARGET into DRAFT if the draft model and main model are not compatible
148+ ```
149+
150+ ### n-gram Mod Parameters
151+
152+ ```
153+ --spec-ngram-mod-n-match N
154+ ngram-mod lookup length (default: 24)
155+ --spec-ngram-mod-n-min N
156+ minimum number of ngram tokens to use for ngram-based speculative decoding (default: 48)
157+ --spec-ngram-mod-n-max N
158+ maximum number of ngram tokens to use for ngram-based speculative decoding (default: 64)
159+ ```
160+
161+ ### n-gram Simple Parameters
162+
163+ ```
164+ --spec-ngram-simple-size-n N
165+ ngram size N for ngram-simple speculative decoding, length of lookup n-gram (default: 12)
166+ --spec-ngram-simple-size-m N
167+ ngram size M for ngram-simple speculative decoding, length of draft m-gram (default: 48)
168+ --spec-ngram-simple-min-hits N
169+ minimum hits for ngram-simple speculative decoding (default: 1)
170+ ```
171+
172+ ### n-gram Map Key Parameters
173+
174+ ```
175+ --spec-ngram-map-k-size-n N
176+ ngram size N for ngram-map-k speculative decoding, length of lookup n-gram (default: 12)
177+ --spec-ngram-map-k-size-m N
178+ ngram size M for ngram-map-k speculative decoding, length of draft m-gram (default: 48)
179+ --spec-ngram-map-k-min-hits N
180+ minimum hits for ngram-map-k speculative decoding (default: 1)
181+ ```
182+
183+ ### n-gram Map Key-4-Values Parameters
184+
185+ ```
186+ --spec-ngram-map-k4v-size-n N
187+ ngram size N for ngram-map-k4v speculative decoding, length of lookup n-gram (default: 12)
188+ --spec-ngram-map-k4v-size-m N
189+ ngram size M for ngram-map-k4v speculative decoding, length of draft m-gram (default: 48)
190+ --spec-ngram-map-k4v-min-hits N
191+ minimum hits for ngram-map-k4v speculative decoding (default: 1)
123192```
124193
125194### `--spec-type TYPE`
@@ -140,21 +209,40 @@ Specifies a type of speculative decoding without draft model.
140209./llama-server [...] --spec-type ngram-simple
141210```
142211
143- ### `--spec-ngram-size-n N`
212+ ### `--spec-ngram-*-size-n N`
144213
145214Sets the size N of the lookup n-gram for n-gram map based speculative decoding.
146215The n-gram size N determines how many tokens in a row to look back when searching for matching patterns.
147216
148- ### `--spec-ngram-size-m M`
217+ Each n-gram implementation has its own parameter:
218+
219+ - `--spec-ngram-simple-size-n` for `ngram-simple`
220+ - `--spec-ngram-map-k-size-n` for `ngram-map-k`
221+ - `--spec-ngram-map-k4v-size-n` for `ngram-map-k4v`
222+ - `--spec-ngram-mod-n-match` for `ngram-mod`
223+
224+ ### `--spec-ngram-*-size-m M`
149225
150226Sets the size M of the draft m-gram for n-gram map based speculative decoding.
151227The m-gram size determines how many tokens to draft when a match is found.
152228Larger values can provide more speedup but may reduce acceptance rate.
153229
154- ### `--spec-ngram-min-hits H`
230+ Each n-gram implementation has its own parameter:
231+
232+ - `--spec-ngram-simple-size-m` for `ngram-simple`
233+ - `--spec-ngram-map-k-size-m` for `ngram-map-k`
234+ - `--spec-ngram-map-k4v-size-m` for `ngram-map-k4v`
235+
236+ ### `--spec-ngram-*-min-hits H`
155237
156238This option defines how often a key has to appear in the token history to be used as a draft (default is 1).
157239
240+ Each n-gram implementation has its own parameter:
241+
242+ - `--spec-ngram-simple-min-hits` for `ngram-simple`
243+ - `--spec-ngram-map-k-min-hits` for `ngram-map-k`
244+ - `--spec-ngram-map-k4v-min-hits` for `ngram-map-k4v`
245+
158246## Statistics
159247Each speculative decoding implementation prints statistics.
160248
@@ -180,4 +268,3 @@ statistics ngram_map_k: #calls(b,g,a) = 6 1690 26, #gen drafts = 26, #acc drafts
180268- `#gen tokens`: number of tokens generated by this implementation (including rejected tokens)
181269- `#acc tokens`: number of tokens accepted by the main model
182270- `dur(b,g,a)`: durations of begin (new prompt), generation and accumulation (process acceptance).
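
As a rough way to read these counters, the share of drafted tokens that the main model accepted can be computed from `#gen tokens` and `#acc tokens`. The Python sketch below is only an illustration: the helper name `acceptance_rate` is hypothetical, and the numbers in the usage line are made up, not taken from a real run.

```python
def acceptance_rate(gen_tokens, acc_tokens):
    """Fraction of drafted tokens (#gen tokens) accepted by the main model (#acc tokens)."""
    if gen_tokens == 0:
        # No drafts were generated, so there is nothing to accept.
        return 0.0
    return acc_tokens / gen_tokens


# Hypothetical counter values, chosen only for illustration.
print(f"{acceptance_rate(200, 150):.2%}")  # prints 75.00%
```

A low ratio suggests that drafts are often rejected, in which case shorter drafts or a higher minimum hit count may be worth trying.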
183-