
Commit 865379f

Multinomial defaults change + min_p support (#4194)
Introducing new default behavior for multinomial sampling. When temperature > 0:
- default top_k is set to 40 (used to be inactive, i.e. all tokens considered)
- default randomization (random seed for every request if not specified)
- the new default behavior is non-deterministic; it used to be deterministic (same hardcoded seed with every request)

Additionally adds support for the `min_p` sampling option.
1 parent 8cc70f4 commit 865379f

8 files changed

Lines changed: 260 additions & 27 deletions

docs/model_server_rest_api_chat.md

Lines changed: 3 additions & 3 deletions
@@ -235,11 +235,12 @@ Some parameters, especially related to sampling (like `temperature`, `top_p` etc
 |-------|----------|----------|----------|---------|-----|
 | temperature |||| float (default: `1.0`) | The value is used to modulate token probabilities for multinomial sampling. It enables multinomial sampling when set to `> 0.0`. |
 | top_p |||| float (default: `1.0`) | Controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. |
-| top_k |||| int (default: all tokens) | Controls the number of top tokens to consider. Set to empty or -1 to consider all tokens. |
+| min_p |||| float (default: `0.0`) | Minimum probability threshold relative to the most likely token. Tokens with probability below `min_p` × the top token probability are filtered out. `0.0` (default) disables the filter. Typical values: `0.05`–`0.1`. Must be in `[0.0, 1.0)`. |
+| top_k |||| int (default: `40`) | Controls the number of top tokens to consider. When multinomial sampling is active, defaults to `40` if not set. Set to `-1` to consider all tokens. |
 | repetition_penalty |||| float (default: `1.0`) | Penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > `1.0` encourage the model to use new tokens, while values < `1.0` encourage the model to repeat tokens. `1.0` means no penalty. |
 | frequency_penalty |||| float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. |
 | presence_penalty |||| float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. |
-| seed |||| integer (default: `0`) | Random seed to use for the generation. |
+| seed |||| integer (default: random) | Random seed for generation in range `[0, 4294967295]`. Omit to use a random seed (non-deterministic). Set explicitly to get reproducible output. Note: `rng_seed` set in `generation_config.json` is not honoured for multinomial sampling — only a per-request seed is applied. |

 #### Speculative decoding specific

@@ -275,7 +276,6 @@ If any of those parameters is not specified and request is made to Prompt Lookup
 - functions

 #### Unsupported params from vLLM:
-- min_p
 - use_beam_search (**In OpenVINO Model Server just simply increase _best_of_ param to enable beam search**)
 - early_stopping
 - stop_token_ids
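The `min_p` row added above defines filtering relative to the most likely token. As a rough sketch of that rule (illustrative code only, with made-up names; not part of this commit or of the GenAI sampler):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative only: keep the token indices whose probability is at least
// min_p * max(probabilities). min_p == 0.0 keeps everything (filter disabled).
std::vector<std::size_t> minPFilter(const std::vector<float>& probs, float minP) {
    std::vector<std::size_t> kept;
    if (probs.empty())
        return kept;
    const float top = *std::max_element(probs.begin(), probs.end());
    const float threshold = minP * top;
    for (std::size_t i = 0; i < probs.size(); ++i) {
        if (probs[i] >= threshold)
            kept.push_back(i);
    }
    return kept;
}

// Example: with probabilities {0.5, 0.3, 0.15, 0.05} and min_p = 0.2 the
// threshold is 0.2 * 0.5 = 0.1, so the last token (0.05) is filtered out;
// with min_p = 0.05 the threshold is 0.025 and every token survives.
```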

docs/model_server_rest_api_completions.md

Lines changed: 3 additions & 3 deletions
@@ -76,11 +76,12 @@ curl http://localhost/v3/completions \
 |-------|----------|----------|----------|---------|-----|
 | temperature |||| float (default: `1.0`) | The value is used to modulate token probabilities for multinomial sampling. It enables multinomial sampling when set to `> 0.0`. |
 | top_p |||| float (default: `1.0`) | Controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. |
-| top_k |||| int (default: all tokens) | Controls the number of top tokens to consider. Set to empty or -1 to consider all tokens. |
+| min_p |||| float (default: `0.0`) | Minimum probability threshold relative to the most likely token. Tokens with probability below `min_p` × the top token probability are filtered out. `0.0` (default) disables the filter. Typical values: `0.05`–`0.1`. Must be in `[0.0, 1.0)`. |
+| top_k |||| int (default: `40`) | Controls the number of top tokens to consider. When multinomial sampling is active, defaults to `40` if not set. Set to `-1` to consider all tokens. |
 | repetition_penalty |||| float (default: `1.0`) | Penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > `1.0` encourage the model to use new tokens, while values < `1.0` encourage the model to repeat tokens. `1.0` means no penalty. |
 | frequency_penalty |||| float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. |
 | presence_penalty |||| float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. |
-| seed |||| integer (default: `0`) | Random seed to use for the generation. |
+| seed |||| integer (default: random) | Random seed for generation in range `[0, 4294967295]`. Omit to use a random seed (non-deterministic). Set explicitly to get reproducible output. Note: `rng_seed` set in `generation_config.json` is not honoured for multinomial sampling — only a per-request seed is applied. |

 #### Speculative decoding specific

@@ -106,7 +107,6 @@ Note that below parameters are valid only for prompt lookup pipeline. Add `"prom


 #### Unsupported params from vLLM:
-- min_p
 - use_beam_search (**In OpenVINO Model Server just simply increase _best_of_ param to enable beam search**)
 - early_stopping
 - stop_token_ids
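The new defaults above only take effect when `temperature > 0.0`, since that is what enables multinomial sampling in the first place. For reference, a minimal illustrative sketch of temperature scaling (not the GenAI implementation; names are made up):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative only: softmax over logits scaled by 1/temperature.
// Assumes temperature > 0 (with temperature == 0 the server decodes greedily
// and none of the multinomial defaults apply). Lower values sharpen the
// distribution, higher values flatten it.
std::vector<float> temperatureSoftmax(const std::vector<float>& logits, float temperature) {
    std::vector<float> probs(logits.size());
    if (logits.empty())
        return probs;
    const float maxLogit = *std::max_element(logits.begin(), logits.end());
    float sum = 0.0f;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp((logits[i] - maxLogit) / temperature);
        sum += probs[i];
    }
    for (float& p : probs)
        p /= sum;
    return probs;
}
```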

docs/model_server_rest_api_responses.md

Lines changed: 3 additions & 2 deletions
@@ -120,11 +120,12 @@ curl http://localhost/v3/responses \
 |-------|----------|----------|---------|-----|
 | temperature ||| float (default: `1.0`) | The value is used to modulate token probabilities for multinomial sampling. It enables multinomial sampling when set to `> 0.0`. |
 | top_p ||| float (default: `1.0`) | Controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. |
-| top_k ||| int (default: all tokens) | Controls the number of top tokens to consider. Set to empty or -1 to consider all tokens. |
+| min_p ||| float (default: `0.0`) | Minimum probability threshold relative to the most likely token. Tokens with probability below `min_p` × the top token probability are filtered out. `0.0` (default) disables the filter. Typical values: `0.05`–`0.1`. Must be in `[0.0, 1.0)`. |
+| top_k ||| int (default: `40`) | Controls the number of top tokens to consider. When multinomial sampling is active, defaults to `40` if not set. Set to `-1` to consider all tokens. |
 | repetition_penalty ||| float (default: `1.0`) | Penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > `1.0` encourage the model to use new tokens, while values < `1.0` encourage the model to repeat tokens. `1.0` means no penalty. |
 | frequency_penalty ||| float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. |
 | presence_penalty ||| float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. |
-| seed ||| integer (default: `0`) | Random seed to use for the generation. |
+| seed ||| integer (default: random) | Random seed for generation in range `[0, 4294967295]`. Omit to use a random seed (non-deterministic). Set explicitly to get reproducible output. Note: `rng_seed` set in `generation_config.json` is not honoured for multinomial sampling — only a per-request seed is applied. |

 #### Speculative decoding specific

src/llm/apis/openai_api_handler.cpp

Lines changed: 34 additions & 7 deletions
@@ -740,21 +740,48 @@ absl::Status OpenAIApiHandler::parseCommonPart(std::optional<uint32_t> maxTokens
             return absl::InvalidArgumentError("top_p out of range(0.0, 1.0)");
     }

-    // top_k: int; optional - defaults to 0
-    // Extension, unsupported by OpenAI API, however supported by vLLM and CB lib
+    // min_p: float; optional - defaults to 0 (disabled)
+    // Extension, unsupported by OpenAI API, however supported by vLLM and GenAI
+    it = doc.FindMember("min_p");
+    if (it != doc.MemberEnd() && !it->value.IsNull()) {
+        if (!it->value.IsDouble() && !it->value.IsInt())
+            return absl::InvalidArgumentError("min_p is not a valid number");
+        const float minPValue = static_cast<float>(it->value.GetDouble());
+        if (minPValue < 0.0f || minPValue >= 1.0f)
+            return absl::InvalidArgumentError("min_p out of range [0.0, 1.0)");
+        request.minP = minPValue;
+    }
+
+    // top_k: int; optional - when multinomial sampling is active, defaults to 40 if not set. Pass -1 to consider all tokens.
+    // Extension, unsupported by OpenAI API, however supported by vLLM and GenAI
     it = doc.FindMember("top_k");
     if (it != doc.MemberEnd() && !it->value.IsNull()) {
         if (!it->value.IsInt())
             return absl::InvalidArgumentError("top_k is not an integer");
-        request.topK = it->value.GetInt();
+        const int topKValue = it->value.GetInt();
+        if (topKValue < -1 || topKValue == 0)
+            return absl::InvalidArgumentError("top_k must be -1 (all tokens) or a positive integer");
+        request.topK = topKValue;
     }

-    // seed: int; optional - defaults to 0 (not set)
+    // seed: uint32; optional - omit to use a random seed
     it = doc.FindMember("seed");
     if (it != doc.MemberEnd() && !it->value.IsNull()) {
-        if (!it->value.IsUint())
-            return absl::InvalidArgumentError("seed is not an unsigned integer");
-        request.seed = it->value.GetUint();
+        if (!it->value.IsInt() && !it->value.IsUint() && !it->value.IsInt64() && !it->value.IsUint64())
+            return absl::InvalidArgumentError("seed is not an integer");
+        if (it->value.IsUint64()) {
+            const uint64_t raw = it->value.GetUint64();
+            if (raw > std::numeric_limits<uint32_t>::max())
+                return absl::InvalidArgumentError("seed out of range [0, 4294967295]");
+            request.seed = static_cast<uint32_t>(raw);
+        } else if (it->value.IsUint()) {
+            request.seed = it->value.GetUint();
+        } else {
+            const int64_t raw = it->value.GetInt64();
+            if (raw < 0 || raw > static_cast<int64_t>(std::numeric_limits<uint32_t>::max()))
+                return absl::InvalidArgumentError("seed out of range [0, 4294967295]");
+            request.seed = static_cast<uint32_t>(raw);
+        }
     }

     // stop: string or array; optional - defaults to null (not set)
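The reworked `seed` parsing accepts any JSON integer representation and range-checks it into `uint32`. A small self-contained sketch (illustrative only, simplified relative to the handler above) showing the same style of RapidJSON checks applied to a few example bodies:

```cpp
#include <cstdint>
#include <iostream>
#include <limits>
#include <optional>
#include <stdexcept>
#include <string>
#include <rapidjson/document.h>

// Illustrative only: mirrors the commit's type and range checks for "seed".
std::optional<uint32_t> parseSeed(const rapidjson::Value& body) {
    auto it = body.FindMember("seed");
    if (it == body.MemberEnd() || it->value.IsNull())
        return std::nullopt;  // omitted -> the server randomizes the seed per request
    if (!it->value.IsInt() && !it->value.IsUint() && !it->value.IsInt64() && !it->value.IsUint64())
        throw std::invalid_argument("seed is not an integer");
    if (it->value.IsUint64()) {  // RapidJSON sets this flag for every non-negative integer
        const uint64_t raw = it->value.GetUint64();
        if (raw > std::numeric_limits<uint32_t>::max())
            throw std::invalid_argument("seed out of range [0, 4294967295]");
        return static_cast<uint32_t>(raw);
    }
    // Only negative integers remain (Int/Int64 flag set but not Uint64).
    throw std::invalid_argument("seed out of range [0, 4294967295]");
}

int main() {
    const char* bodies[] = {R"({"seed": 42})", R"({"seed": -1})",
                            R"({"seed": 4294967296})", R"({"temperature": 0.7})"};
    for (const char* body : bodies) {
        rapidjson::Document doc;
        doc.Parse(body);
        try {
            std::optional<uint32_t> seed = parseSeed(doc);
            std::cout << body << " -> " << (seed ? std::to_string(*seed) : "randomized") << "\n";
        } catch (const std::exception& e) {
            std::cout << body << " -> rejected: " << e.what() << "\n";
        }
    }
}
```

Negative values and values above `4294967295` are rejected, integers in range are kept, and an absent `seed` falls through to the per-request randomization added in this commit.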

src/llm/apis/openai_request.hpp

Lines changed: 3 additions & 1 deletion
@@ -17,6 +17,7 @@
 // Type that holds vector of pairs where first element is chat turn index and second is image tensor
 // this way we store information about which image is associated with which chat turn
 #pragma once
+#include <cstdint>
 #include <map>
 #include <optional>
 #include <string>
@@ -57,8 +58,9 @@ struct OpenAIRequest {
     // Multinomial decoding specific
     std::optional<float> temperature{std::nullopt};
     std::optional<float> topP{std::nullopt};
+    std::optional<float> minP{std::nullopt};
     std::optional<int> topK{std::nullopt};
-    std::optional<int> seed{std::nullopt};
+    std::optional<uint32_t> seed{std::nullopt};
     std::optional<float> frequencyPenalty{std::nullopt};
     std::optional<float> presencePenalty{std::nullopt};
     std::optional<float> repetitionPenalty{std::nullopt};

src/llm/apis/openai_responses.cpp

Lines changed: 4 additions & 0 deletions
@@ -406,6 +406,10 @@ void OpenAIResponsesHandler::serializeCommonResponseParameters(Writer<StringBuff
         writer.String("top_p");
         writer.Double(static_cast<double>(request.topP.value()));
     }
+    if (request.minP.has_value()) {
+        writer.String("min_p");
+        writer.Double(static_cast<double>(request.minP.value()));
+    }
     writer.String("truncation");
     writer.String("disabled");
     // TODO: user not supported
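For illustration, a standalone RapidJSON writer producing the same two echoed fields as the serialization above would for a request that set both `top_p` and `min_p` (values and the surrounding keys here are made up, not the full OVMS response):

```cpp
#include <iostream>
#include <rapidjson/stringbuffer.h>
#include <rapidjson/writer.h>

int main() {
    rapidjson::StringBuffer buffer;
    rapidjson::Writer<rapidjson::StringBuffer> writer(buffer);
    writer.StartObject();
    writer.String("top_p");
    writer.Double(0.9);   // echoed only when request.topP was set
    writer.String("min_p");
    writer.Double(0.05);  // new: echoed only when request.minP was set
    writer.String("truncation");
    writer.String("disabled");
    writer.EndObject();
    std::cout << buffer.GetString() << "\n";  // {"top_p":0.9,"min_p":0.05,"truncation":"disabled"}
}
```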

src/llm/io_processing/base_generation_config_builder.cpp

Lines changed: 24 additions & 1 deletion
@@ -16,6 +16,7 @@

 #include "../../logging.hpp"
 #include <limits>
+#include <random>
 #include <string>
 #include <openvino/genai/generation_config.hpp>
 #include "base_generation_config_builder.hpp"
@@ -118,9 +119,11 @@ void BaseGenerationConfigBuilder::parseConfigFromRequest(const OpenAIRequest& re
     if (request.temperature.has_value())
         config.temperature = request.temperature.value();
     if (request.topK.has_value())
-        config.top_k = request.topK.value();
+        config.top_k = (request.topK.value() == -1) ? std::numeric_limits<size_t>::max() : static_cast<size_t>(request.topK.value());
     if (request.topP.has_value())
         config.top_p = request.topP.value();
+    if (request.minP.has_value())
+        config.min_p = request.minP.value();
     if (request.seed.has_value())
         config.rng_seed = request.seed.value();
     if (request.stop.has_value())
@@ -133,6 +136,26 @@ void BaseGenerationConfigBuilder::parseConfigFromRequest(const OpenAIRequest& re
         config.presence_penalty = request.presencePenalty.value();
     config.do_sample = config.temperature > 0.0f && config.num_beams == 1;

+    // Apply multinomial sampling defaults when not explicitly set
+    if (config.do_sample) {
+        if (!request.topK.has_value() && config.top_k == std::numeric_limits<size_t>::max()) {
+            config.top_k = 40;
+            SPDLOG_LOGGER_DEBUG(llm_calculator_logger, "Defaulting top_k to 40 for multinomial sampling.");
+        }
+        // Use random seed for multinomial sampling to ensure non-deterministic behavior by default.
+        // Note: rng_seed from generation_config.json is not honoured — only an explicit per-request
+        // seed produces deterministic output.
+        // Use a thread_local mt19937 seeded once via std::random_device to avoid per-request overhead.
+        if (!request.seed.has_value()) {
+            static thread_local std::mt19937 rng{std::random_device{}()};
+            size_t seed = 0;
+            while (seed == 0)
+                seed = rng();
+            config.rng_seed = seed;
+            SPDLOG_LOGGER_DEBUG(llm_calculator_logger, "Randomizing rng_seed for multinomial sampling: {}.", config.rng_seed);
+        }
+    }
+
     if (request.logprobschat || request.logprobs)
         config.logprobs = 1;
     // Assisted decoding specific
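Taken together, the new defaults can be summarized in a small self-contained sketch (simplified, with illustrative types and names, not the actual `BaseGenerationConfigBuilder`): when sampling is enabled and the request did not set `top_k`, it becomes 40; when the request did not set `seed`, a fresh non-zero seed is drawn per request, so repeated identical requests give different outputs unless a seed is supplied.

```cpp
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <limits>
#include <optional>
#include <random>

struct SamplingRequest {  // simplified stand-in for the relevant OpenAIRequest fields
    float temperature = 1.0f;
    std::optional<int> topK;
    std::optional<uint32_t> seed;
};

struct EffectiveConfig {  // simplified stand-in for the GenAI generation config fields
    bool doSample = false;
    size_t topK = std::numeric_limits<size_t>::max();  // "all tokens"
    size_t rngSeed = 0;
};

EffectiveConfig resolve(const SamplingRequest& req) {
    EffectiveConfig cfg;
    cfg.doSample = req.temperature > 0.0f;  // beam search is ignored in this sketch
    if (req.topK.has_value())
        cfg.topK = (*req.topK == -1) ? std::numeric_limits<size_t>::max()
                                     : static_cast<size_t>(*req.topK);
    if (req.seed.has_value())
        cfg.rngSeed = *req.seed;
    if (cfg.doSample) {
        if (!req.topK.has_value())
            cfg.topK = 40;  // new default for multinomial sampling
        if (!req.seed.has_value()) {
            static thread_local std::mt19937 rng{std::random_device{}()};
            while (cfg.rngSeed == 0)  // draw until non-zero, as in the commit
                cfg.rngSeed = rng();
        }
    }
    return cfg;
}

int main() {
    EffectiveConfig a = resolve({0.7f, std::nullopt, std::nullopt});
    EffectiveConfig b = resolve({0.7f, std::nullopt, std::nullopt});
    std::cout << "default top_k: " << a.topK
              << ", seeds differ across requests: " << (a.rngSeed != b.rngSeed) << "\n";
    EffectiveConfig c = resolve({0.7f, -1, 123u});
    std::cout << "top_k=-1 maps to: " << c.topK << ", explicit seed kept: " << c.rngSeed << "\n";
}
```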
