Commit 846262d

docs : update speculative decoding parameters after refactor (ggml-org#22397) (ggml-org#22539)
* docs : update speculative decoding parameters after refactor (ggml-org#22397)

  Update docs/speculative.md to reflect the new parameter naming scheme introduced in PR ggml-org#22397:
  - Replace --draft-max/--draft-min with --spec-draft-n-max/--spec-draft-n-min
  - Replace --spec-ngram-size-n/m with per-implementation variants
  - Add documentation for all new --spec-ngram-*- parameters
  - Update all example commands

  Assisted-by: llama.cpp:local pi

* pi : add rule to use gh CLI for GitHub resources

  Assisted-by: llama.cpp:local pi

* docs : run llama-gen-docs

* arg : fix typo
1 parent 6dcd824 commit 846262d

6 files changed

Lines changed: 209 additions & 69 deletions

.pi/gg/SYSTEM.md

Lines changed: 1 addition & 0 deletions
@@ -4,6 +4,7 @@ General:
 - By very precise and concise when writing code, comments, explanations, etc.
 - PR and commit titles format: `<module> : <title>`. Lookup recents for examples
 - Don't try to build or run the code unless you are explicitly asked to do so
+- Use the `gh` CLI tool when querying PRs, issues, or other GitHub resources
 
 Coding:
 - When in doubt, always refer to the CONTRIBUTING.md file of the project

common/arg.cpp

Lines changed: 1 addition & 1 deletion
@@ -3380,7 +3380,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
     ).set_spec().set_examples({LLAMA_EXAMPLE_SPECULATIVE, LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_CLI}));
     add_opt(common_arg(
         {"--spec-draft-poll", "--poll-draft"}, "<0|1>",
-        "Use polling to wait for draft model work (default: same as --poll])",
+        "Use polling to wait for draft model work (default: same as --poll)",
         [](common_params & params, int value) {
             params.speculative.draft.cpuparams.poll = value;
         }

docs/speculative.md

Lines changed: 108 additions & 21 deletions
@@ -33,18 +33,18 @@ An example to use this approach can be the rewriting of source code by a LLM.
 This implementation looks for the last n-gram in history that matches the current n-gram and creates a draft using the m tokens following the matched n-gram. It is the simplest self-speculative approach with minimal overhead.
 
 ```
-llama-server [...] --spec-type ngram-simple --draft-max 64
+llama-server [...] --spec-type ngram-simple --spec-draft-n-max 64
 ```
 
 #### n-gram Map Key (`ngram-map-k`)
 
-This implementation looks for the current n-gram of size n (called the _key_) in the token history. If the key n-gram is followed by the same m tokens (called the _mgram_) multiple times, it creates a draft using these m tokens. This approach requires a minimum number of occurrences (argument `--spec-ngram-min-hits`, default is 1) before generating drafts.
+This implementation looks for the current n-gram of size n (called the _key_) in the token history. If the key n-gram is followed by the same m tokens (called the _mgram_) multiple times, it creates a draft using these m tokens. This approach requires a minimum number of occurrences (argument `--spec-ngram-map-k-min-hits`, default is 1) before generating drafts.
 
 The number of accepted tokens is stored for each used n-gram.
 
 **Example:**
 ```
-llama-server [...] --spec-type ngram-map-k --draft-max 64
+llama-server [...] --spec-type ngram-map-k --spec-draft-n-max 64
 ```
 
 #### n-gram Map Key-4-Values (`ngram-map-k4v`)
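
The `ngram-simple` description in the hunk above (find the most recent earlier occurrence of the current n-gram, then draft the m tokens that follow it) can be sketched roughly as follows. This is an illustrative Python sketch and not the llama.cpp implementation; the function name `ngram_simple_draft` and the representation of tokens as a list of integers are assumptions made for the example.

```python
from typing import List


def ngram_simple_draft(history: List[int], n: int, m: int) -> List[int]:
    """Hypothetical helper: draft up to m tokens that followed the most
    recent earlier occurrence of the last n tokens in `history`."""
    if n <= 0 or m <= 0 or len(history) <= n:
        return []
    key = history[-n:]  # the current n-gram
    # Walk backwards over earlier start positions, skipping the trailing
    # n-gram itself, so the first hit is the most recent match.
    for start in range(len(history) - n - 1, -1, -1):
        if history[start:start + n] == key:
            # Fewer than m tokens may be available near the end of history.
            return history[start + n:start + n + m]
    return []
```

With `history = [1, 2, 3, 4, 5, 1, 2, 3]`, `n = 3` and `m = 2`, the current n-gram `[1, 2, 3]` matches at the start of the history, so the sketch drafts `[4, 5]`.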
@@ -55,7 +55,7 @@ The number of accepted tokens is stored for each used n-gram.
 
 **Example:** Server options to be used if there are a lot of longer repetitions.
 ```
-llama-server [...] --spec-type ngram-map-k4v --spec-ngram-size-n 8 --spec-ngram-size-m 8 --spec-ngram-min-hits 2 --draft-max 64
+llama-server [...] --spec-type ngram-map-k4v --spec-ngram-map-k4v-size-n 8 --spec-ngram-map-k4v-size-m 8 --spec-ngram-map-k4v-min-hits 2 --spec-draft-n-max 64
 ```
 
 ### n-gram Mod (`ngram-mod`)
@@ -80,9 +80,9 @@ Currently, a single hash pool is shared across all server slots, so different re
 # notes:
 # - small `n` are not recommended
 # - MoEs require long drafts
-# - dense models: can reduce `--draft-min` and `--draft-max`
+# - dense models: can reduce `--spec-ngram-mod-n-min` and `--spec-ngram-mod-n-max`
 
-llama-server ... --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64
+llama-server ... --spec-type ngram-mod --spec-ngram-mod-n-match 24 --spec-ngram-mod-n-min 48 --spec-ngram-mod-n-max 64
 ```
 
 Applications:
@@ -105,21 +105,90 @@ Example Video:
 
 If a draft model is combined with a draftless decoding the draftless decoding has higher precedence.
 
+### General Speculative Parameters
+
 ```
---draft, --draft-n, --draft-max N number of tokens to draft for speculative decoding (default: 16)
-(env: LLAMA_ARG_DRAFT_MAX)
---draft-min, --draft-n-min N minimum number of draft tokens to use for speculative decoding
-(default: 0)
-(env: LLAMA_ARG_DRAFT_MIN)
-[...]
 --spec-type [none|ngram-cache|ngram-simple|ngram-map-k|ngram-map-k4v|ngram-mod]
 type of speculative decoding to use when no draft model is provided
 (default: none)
---spec-ngram-size-n N ngram size N for ngram-simple/ngram-map speculative decoding, length
-of lookup n-gram (default: 12)
---spec-ngram-size-m N ngram size M for ngram-simple/ngram-map speculative decoding, length
-of draft m-gram (default: 48)
---spec-ngram-min-hits N minimum hits for ngram-map speculative decoding (default: 1)
+(env: LLAMA_ARG_SPEC_TYPE)
+--spec-default use default speculative decoding
+```
+
+### Draft Model Parameters
+
+```
+--spec-draft-model, -md, --model-draft FNAME
+draft model for speculative decoding (default: unused)
+(env: LLAMA_ARG_SPEC_DRAFT_MODEL)
+--spec-draft-hf, -hfd, -hfrd, --hf-repo-draft <user>/<model>[:quant]
+HuggingFace repository for the draft model
+--spec-draft-n-max N
+number of tokens to draft for speculative decoding (default: 16)
+(env: LLAMA_ARG_SPEC_DRAFT_N_MAX)
+--spec-draft-n-min N
+minimum number of draft tokens to use for speculative decoding (default: 0)
+(env: LLAMA_ARG_SPEC_DRAFT_N_MIN)
+--spec-draft-p-split, --draft-p-split P
+speculative decoding split probability (default: 0.10)
+(env: LLAMA_ARG_SPEC_DRAFT_P_SPLIT)
+--spec-draft-p-min, --draft-p-min P
+minimum speculative decoding probability (greedy) (default: 0.75)
+(env: LLAMA_ARG_SPEC_DRAFT_P_MIN)
+--spec-draft-ctx-size, -cd, --ctx-size-draft N
+size of the prompt context for the draft model (default: 0, 0 = loaded from model)
+(env: LLAMA_ARG_SPEC_DRAFT_CTX_SIZE)
+--spec-draft-ngl, -ngld, --gpu-layers-draft, --n-gpu-layers-draft N
+max. number of draft model layers to store in VRAM, either an exact number, 'auto', or 'all' (default: auto)
+(env: LLAMA_ARG_N_GPU_LAYERS_DRAFT)
+--spec-draft-device, -devd, --device-draft <dev1,dev2,..>
+comma-separated list of devices to use for offloading the draft model
+--spec-draft-replace, --spec-replace TARGET DRAFT
+translate the string in TARGET into DRAFT if the draft model and main model are not compatible
+```
+
+### n-gram Mod Parameters
+
+```
+--spec-ngram-mod-n-match N
+ngram-mod lookup length (default: 24)
+--spec-ngram-mod-n-min N
+minimum number of ngram tokens to use for ngram-based speculative decoding (default: 48)
+--spec-ngram-mod-n-max N
+maximum number of ngram tokens to use for ngram-based speculative decoding (default: 64)
+```
+
+### n-gram Simple Parameters
+
+```
+--spec-ngram-simple-size-n N
+ngram size N for ngram-simple speculative decoding, length of lookup n-gram (default: 12)
+--spec-ngram-simple-size-m N
+ngram size M for ngram-simple speculative decoding, length of draft m-gram (default: 48)
+--spec-ngram-simple-min-hits N
+minimum hits for ngram-simple speculative decoding (default: 1)
+```
+
+### n-gram Map Key Parameters
+
+```
+--spec-ngram-map-k-size-n N
+ngram size N for ngram-map-k speculative decoding, length of lookup n-gram (default: 12)
+--spec-ngram-map-k-size-m N
+ngram size M for ngram-map-k speculative decoding, length of draft m-gram (default: 48)
+--spec-ngram-map-k-min-hits N
+minimum hits for ngram-map-k speculative decoding (default: 1)
+```
+
+### n-gram Map Key-4-Values Parameters
+
+```
+--spec-ngram-map-k4v-size-n N
+ngram size N for ngram-map-k4v speculative decoding, length of lookup n-gram (default: 12)
+--spec-ngram-map-k4v-size-m N
+ngram size M for ngram-map-k4v speculative decoding, length of draft m-gram (default: 48)
+--spec-ngram-map-k4v-min-hits N
+minimum hits for ngram-map-k4v speculative decoding (default: 1)
 ```
 
 ### `--spec-type TYPE`
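
For anyone updating existing command lines, the renames visible in this diff can be collected into a small lookup. The sketch below is a hypothetical helper (`migrate_args` is not part of llama.cpp); its mapping is inferred only from the before/after example commands and the commit message here. Mapping the old `--spec-ngram-min-hits` to the `ngram-simple` variant is an assumption, since the removed help text mentioned only ngram-map for that flag.

```python
from typing import Dict, List

# Renames named in the commit message (draft-length flags).
COMMON_RENAMES: Dict[str, str] = {
    "--draft-max": "--spec-draft-n-max",
    "--draft-min": "--spec-draft-n-min",
}


def per_type_renames(spec_type: str) -> Dict[str, str]:
    """Per-implementation renames, inferred from the example commands in the diff."""
    if spec_type == "ngram-mod":
        # ngram-mod uses its own match/min/max flags in the updated example.
        return {
            "--spec-ngram-size-n": "--spec-ngram-mod-n-match",
            "--draft-min": "--spec-ngram-mod-n-min",
            "--draft-max": "--spec-ngram-mod-n-max",
        }
    if spec_type in ("ngram-simple", "ngram-map-k", "ngram-map-k4v"):
        prefix = "--spec-ngram-" + spec_type[len("ngram-"):]
        return {
            "--spec-ngram-size-n": prefix + "-size-n",
            "--spec-ngram-size-m": prefix + "-size-m",
            "--spec-ngram-min-hits": prefix + "-min-hits",  # assumption for ngram-simple
        }
    return {}


def migrate_args(argv: List[str]) -> List[str]:
    """Rewrite old flag names to the new scheme, based on the --spec-type value."""
    spec_type = ""
    if "--spec-type" in argv:
        i = argv.index("--spec-type")
        if i + 1 < len(argv):
            spec_type = argv[i + 1]
    renames = dict(COMMON_RENAMES)
    renames.update(per_type_renames(spec_type))  # per-type entries take precedence
    return [renames.get(arg, arg) for arg in argv]
```

Applied to the old `ngram-mod` example (`--spec-ngram-size-n 24 --draft-min 48 --draft-max 64`), this yields the flags used in the updated example command. Whether the old names are still accepted as aliases is not stated in this diff.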
@@ -140,21 +209,40 @@ Specifies a type of speculative decoding without draft model.
 ./llama-server [...] --spec-type ngram-simple
 ```
 
-### `--spec-ngram-size-n N`
+### `--spec-ngram-*-size-n N`
 
 Sets the size N of the lookup n-gram for n-gram map based speculative decoding.
 The n-gram size N determines how many tokens in a row to look back when searching for matching patterns.
 
-### `--spec-ngram-size-m M`
+Each n-gram implementation has its own parameter:
+
+- `--spec-ngram-simple-size-n` for `ngram-simple`
+- `--spec-ngram-map-k-size-n` for `ngram-map-k`
+- `--spec-ngram-map-k4v-size-n` for `ngram-map-k4v`
+- `--spec-ngram-mod-n-match` for `ngram-mod`
+
+### `--spec-ngram-*-size-m M`
 
 Sets the size M of the draft m-gram for n-gram map based speculative decoding.
 The m-gram size determines how many tokens to draft when a match is found.
 Larger values can provide more speedup but may reduce acceptance rate.
 
-### `--spec-ngram-min-hits H`
+Each n-gram implementation has its own parameter:
+
+- `--spec-ngram-simple-size-m` for `ngram-simple`
+- `--spec-ngram-map-k-size-m` for `ngram-map-k`
+- `--spec-ngram-map-k4v-size-m` for `ngram-map-k4v`
+
+### `--spec-ngram-*-min-hits H`
 
 This option defines how often a key has to appear in the token history to be used as a draft (default is 1).
 
+Each n-gram implementation has its own parameter:
+
+- `--spec-ngram-simple-min-hits` for `ngram-simple`
+- `--spec-ngram-map-k-min-hits` for `ngram-map-k`
+- `--spec-ngram-map-k4v-min-hits` for `ngram-map-k4v`
+
 ## Statistics
 Each speculative decoding implementation prints statistics.
 
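
To illustrate how the size-n, size-m and min-hits settings interact for `ngram-map-k`, here is a simplified Python sketch. It follows the wording of the documentation in this diff (a key n-gram followed by the same m tokens at least a minimum number of times) and is not the llama.cpp implementation. The function name `ngram_map_k_draft` is hypothetical, and the real code may count hits differently.

```python
from collections import Counter
from typing import List


def ngram_map_k_draft(history: List[int], n: int, m: int, min_hits: int = 1) -> List[int]:
    """Hypothetical helper: draft the m-gram that most often followed the
    current key n-gram, but only if it was seen at least `min_hits` times."""
    if n <= 0 or m <= 0 or len(history) < n:
        return []
    key = tuple(history[-n:])  # current n-gram (the "key")
    counts: Counter = Counter()
    # Only earlier occurrences with a complete m-gram after them are counted;
    # the upper bound also excludes the trailing key itself.
    for start in range(0, len(history) - n - m + 1):
        if tuple(history[start:start + n]) == key:
            counts[tuple(history[start + n:start + n + m])] += 1
    if not counts:
        return []
    mgram, hits = counts.most_common(1)[0]
    return list(mgram) if hits >= min_hits else []
```

In this sketch a larger `min_hits` suppresses drafts until the same continuation has been observed repeatedly, which trades fewer drafts for a better chance that each draft is accepted.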
@@ -180,4 +268,3 @@ statistics ngram_map_k: #calls(b,g,a) = 6 1690 26, #gen drafts = 26, #acc drafts
 - `#gen tokens`: number of tokens generated by this implementation (including rejected tokens)
 - `#acc tokens`: number of tokens accepted by the main model
 - `dur(b,g,a): durations of begin (new prompt), generation and accumulation (process acceptance).
-
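
The `#gen tokens` and `#acc tokens` counters described above can be combined into a token acceptance rate. The snippet below is a small illustrative sketch; the function name `acceptance_rate` is an assumption, and the counter values have to be read off the printed statistics line by hand.

```python
def acceptance_rate(gen_tokens: int, acc_tokens: int) -> float:
    """Fraction of drafted tokens (including rejected ones) that the main model accepted."""
    if gen_tokens <= 0:
        return 0.0  # nothing was drafted, so there is nothing to accept
    return acc_tokens / gen_tokens
```

A rate close to 1.0 means most drafted tokens were kept. A low rate suggests the drafts are mostly wasted work for that workload.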