Multinomial defaults change + min_p support (#4194)
Introduces new default behavior for multinomial sampling. When `temperature > 0`:
- default `top_k` is set to `40` (previously inactive: all tokens were considered)
- randomization is on by default (a random seed is drawn for every request unless one is specified)
- the new default behavior is non-deterministic; it used to be deterministic (the same hardcoded seed was applied to every request)

Additionally, this adds support for the `min_p` sampling option.
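To illustrate the new behavior, here is a minimal request sketch. The endpoint path and model name are placeholders for an OpenAI-compatible OVMS deployment; adjust them to your setup:

```python
import requests

# Placeholder endpoint and model name; adjust to your deployment.
url = "http://localhost:8000/v3/chat/completions"
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Write a haiku about sampling."}],
    "temperature": 0.8,  # > 0.0 enables multinomial sampling; top_k now defaults to 40
    "min_p": 0.05,       # drop tokens below 5% of the top token's probability
    # "seed": 42,        # omit for a fresh random seed per request; set for reproducibility
}
response = requests.post(url, json=payload).json()
print(response["choices"][0]["message"]["content"])
```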
| temperature | ✅ | ✅ | ✅ | float (default: `1.0`) | The value is used to modulate token probabilities for multinomial sampling. It enables multinomial sampling when set to `> 0.0`. |
| top_p | ✅ | ✅ | ✅ | float (default: `1.0`) | Controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. |
- | top_k | ✅ | ❌ | ✅ | int (default: all tokens) | Controls the number of top tokens to consider. Set to empty or -1 to consider all tokens. |
+ | min_p | ✅ | ❌ | ✅ | float (default: `0.0`) | Minimum probability threshold relative to the most likely token. Tokens with probability below `min_p` × the top token probability are filtered out. `0.0` (default) disables the filter. Typical values: `0.05`–`0.1`. Must be in `[0.0, 1.0)`. |
+ | top_k | ✅ | ❌ | ✅ | int (default: `40`) | Controls the number of top tokens to consider. When multinomial sampling is active, defaults to `40` if not set. Set to `-1` to consider all tokens. |
| repetition_penalty | ✅ | ❌ | ✅ | float (default: `1.0`) | Penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > `1.0` encourage the model to use new tokens, while values < `1.0` encourage the model to repeat tokens. `1.0` means no penalty. |
| frequency_penalty | ✅ | ✅ | ✅ | float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. |
| presence_penalty | ✅ | ✅ | ✅ | float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. |
- | seed | ✅ | ✅ | ✅ | integer (default: `0`) | Random seed to use for the generation. |
+ | seed | ✅ | ✅ | ✅ | integer (default: random) | Random seed for generation in range `[0, 4294967295]`. Omit to use a random seed (non-deterministic). Set explicitly to get reproducible output. Note: `rng_seed` set in `generation_config.json` is not honoured for multinomial sampling — only a per-request seed is applied. |
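As a rough illustration of the `min_p` row above, the filter can be sketched as follows (illustrative only, assuming probabilities are already computed; not the server's actual implementation):

```python
import numpy as np

def min_p_filter(probs: np.ndarray, min_p: float) -> np.ndarray:
    """Drop tokens whose probability is below min_p * max(probs), then renormalize."""
    if min_p <= 0.0:
        return probs  # 0.0 disables the filter
    threshold = min_p * probs.max()
    kept = np.where(probs >= threshold, probs, 0.0)
    return kept / kept.sum()

probs = np.array([0.50, 0.30, 0.15, 0.04, 0.01])
print(min_p_filter(probs, 0.1))  # threshold 0.05: the last two tokens are removed
```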
#### Speculative decoding specific
@@ -275,7 +276,6 @@ If any of those parameters is not specified and request is made to Prompt Lookup
- functions
#### Unsupported params from vLLM:
- - min_p
- use_beam_search (**In OpenVINO Model Server just simply increase _best_of_ param to enable beam search**)
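For example, a hypothetical completion payload that enables beam search this way (model name and field values are illustrative):

```python
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # placeholder
    "prompt": "The capital of France is",
    "best_of": 4,  # > 1 widens the search; the best-scoring candidate is returned
    "n": 1,        # number of candidates actually returned (must be <= best_of)
}
```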
| temperature | ✅ | ✅ | ✅ | float (default: `1.0`) | The value is used to modulate token probabilities for multinomial sampling. It enables multinomial sampling when set to `> 0.0`. |
| top_p | ✅ | ✅ | ✅ | float (default: `1.0`) | Controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. |
- | top_k | ✅ | ❌ | ✅ | int (default: all tokens) | Controls the number of top tokens to consider. Set to empty or -1 to consider all tokens. |
+ | min_p | ✅ | ❌ | ✅ | float (default: `0.0`) | Minimum probability threshold relative to the most likely token. Tokens with probability below `min_p` × the top token probability are filtered out. `0.0` (default) disables the filter. Typical values: `0.05`–`0.1`. Must be in `[0.0, 1.0)`. |
+ | top_k | ✅ | ❌ | ✅ | int (default: `40`) | Controls the number of top tokens to consider. When multinomial sampling is active, defaults to `40` if not set. Set to `-1` to consider all tokens. |
| repetition_penalty | ✅ | ❌ | ✅ | float (default: `1.0`) | Penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > `1.0` encourage the model to use new tokens, while values < `1.0` encourage the model to repeat tokens. `1.0` means no penalty. |
| frequency_penalty | ✅ | ✅ | ✅ | float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. |
| presence_penalty | ✅ | ✅ | ✅ | float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. |
- | seed | ✅ | ✅ | ✅ | integer (default: `0`) | Random seed to use for the generation. |
+ | seed | ✅ | ✅ | ✅ | integer (default: random) | Random seed for generation in range `[0, 4294967295]`. Omit to use a random seed (non-deterministic). Set explicitly to get reproducible output. Note: `rng_seed` set in `generation_config.json` is not honoured for multinomial sampling — only a per-request seed is applied. |
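To demonstrate the seed semantics above, a small sketch (placeholder endpoint and model name; assumes the same server and model state for both calls):

```python
import requests

url = "http://localhost:8000/v3/chat/completions"  # placeholder endpoint
body = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",   # placeholder model name
    "messages": [{"role": "user", "content": "Name a random animal."}],
    "temperature": 1.0,
    "seed": 42,  # identical seed + identical request -> identical sampling
}
first = requests.post(url, json=body).json()
second = requests.post(url, json=body).json()
assert (first["choices"][0]["message"]["content"]
        == second["choices"][0]["message"]["content"])
# Drop "seed" from the body and repeated calls will generally differ.
```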
#### Speculative decoding specific
@@ -106,7 +107,6 @@ Note that below parameters are valid only for prompt lookup pipeline. Add `"prom
#### Unsupported params from vLLM:
- - min_p
- use_beam_search (**In OpenVINO Model Server just simply increase _best_of_ param to enable beam search**)
| temperature | ✅ | ✅ | float (default: `1.0`) | The value is used to modulate token probabilities for multinomial sampling. It enables multinomial sampling when set to `> 0.0`. |
| top_p | ✅ | ✅ | float (default: `1.0`) | Controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. |
- | top_k | ✅ | ❌ | int (default: all tokens) | Controls the number of top tokens to consider. Set to empty or -1 to consider all tokens. |
+ | min_p | ✅ | ❌ | float (default: `0.0`) | Minimum probability threshold relative to the most likely token. Tokens with probability below `min_p` × the top token probability are filtered out. `0.0` (default) disables the filter. Typical values: `0.05`–`0.1`. Must be in `[0.0, 1.0)`. |
+ | top_k | ✅ | ❌ | int (default: `40`) | Controls the number of top tokens to consider. When multinomial sampling is active, defaults to `40` if not set. Set to `-1` to consider all tokens. |
| repetition_penalty | ✅ | ❌ | float (default: `1.0`) | Penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > `1.0` encourage the model to use new tokens, while values < `1.0` encourage the model to repeat tokens. `1.0` means no penalty. |
| frequency_penalty | ✅ | ❌ | float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. |
| presence_penalty | ✅ | ❌ | float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. |
- | seed | ✅ | ❌ | integer (default: `0`) | Random seed to use for the generation. |
+ | seed | ✅ | ❌ | integer (default: random) | Random seed for generation in range `[0, 4294967295]`. Omit to use a random seed (non-deterministic). Set explicitly to get reproducible output. Note: `rng_seed` set in `generation_config.json` is not honoured for multinomial sampling — only a per-request seed is applied. |
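To make the interplay of `temperature` and the new `top_k` default concrete, a minimal multinomial-sampling sketch (illustrative only, not the server's implementation):

```python
import numpy as np

def sample_token(logits: np.ndarray, temperature: float = 1.0,
                 top_k: int = 40, rng: np.random.Generator | None = None) -> int:
    """Temperature-scaled multinomial sampling restricted to the top_k tokens."""
    rng = rng or np.random.default_rng()  # fresh seed per call, like the new default
    scaled = logits / temperature         # temperature > 0 modulates sharpness
    if 0 < top_k < len(scaled):           # -1 would mean "consider all tokens"
        cutoff = np.sort(scaled)[-top_k]
        scaled = np.where(scaled >= cutoff, scaled, -np.inf)
    probs = np.exp(scaled - scaled.max())  # softmax over the surviving tokens
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```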