
Commit 865379f

Multinomial defaults change + min_p support (#4194)
Introducing new default behavior for multinomial sampling. When temperature > 0:
- default top_k is set to 40 (used to be inactive, i.e. all tokens considered)
- default randomization (random seed for every request if not specified)
- the new default behavior is non-deterministic; it used to be deterministic (same hardcoded seed with every request)

Additionally adds support for the `min_p` sampling option.
1 parent 8cc70f4 commit 865379f

8 files changed

Lines changed: 260 additions & 27 deletions

docs/model_server_rest_api_chat.md

Lines changed: 3 additions & 3 deletions
@@ -235,11 +235,12 @@ Some parameters, especially related to sampling (like `temperature`, `top_p` etc
 |-------|----------|----------|----------|---------|-----|
 | temperature |||| float (default: `1.0`) | The value is used to modulate token probabilities for multinomial sampling. It enables multinomial sampling when set to `> 0.0`. |
 | top_p |||| float (default: `1.0`) | Controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. |
-| top_k |||| int (default: all tokens) | Controls the number of top tokens to consider. Set to empty or -1 to consider all tokens. |
+| min_p |||| float (default: `0.0`) | Minimum probability threshold relative to the most likely token. Tokens with probability below `min_p` × the top token probability are filtered out. `0.0` (default) disables the filter. Typical values: `0.05`–`0.1`. Must be in `[0.0, 1.0)`. |
+| top_k |||| int (default: `40`) | Controls the number of top tokens to consider. When multinomial sampling is active, defaults to `40` if not set. Set to `-1` to consider all tokens. |
 | repetition_penalty |||| float (default: `1.0`) | Penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > `1.0` encourage the model to use new tokens, while values < `1.0` encourage the model to repeat tokens. `1.0` means no penalty. |
 | frequency_penalty |||| float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. |
 | presence_penalty |||| float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. |
-| seed |||| integer (default: `0`) | Random seed to use for the generation. |
+| seed |||| integer (default: random) | Random seed for generation in range `[0, 4294967295]`. Omit to use a random seed (non-deterministic). Set explicitly to get reproducible output. Note: `rng_seed` set in `generation_config.json` is not honoured for multinomial sampling — only a per-request seed is applied. |

 #### Speculative decoding specific

@@ -275,7 +276,6 @@ If any of those parameters is not specified and request is made to Prompt Lookup
 - functions

 #### Unsupported params from vLLM:
-- min_p
 - use_beam_search (**In OpenVINO Model Server just simply increase _best_of_ param to enable beam search**)
 - early_stopping
 - stop_token_ids
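The `min_p` row added above defines filtering relative to the most likely token. As a rough sketch of that rule (illustrative code only, with made-up names; not part of this commit or of the GenAI sampler):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative only: keep the token indices whose probability is at least
// min_p * max(probabilities). min_p == 0.0 keeps everything (filter disabled).
std::vector<std::size_t> minPFilter(const std::vector<float>& probs, float minP) {
    std::vector<std::size_t> kept;
    if (probs.empty())
        return kept;
    const float top = *std::max_element(probs.begin(), probs.end());
    const float threshold = minP * top;
    for (std::size_t i = 0; i < probs.size(); ++i) {
        if (probs[i] >= threshold)
            kept.push_back(i);
    }
    return kept;
}

// Example: with probabilities {0.5, 0.3, 0.15, 0.05} and min_p = 0.2 the
// threshold is 0.2 * 0.5 = 0.1, so the last token (0.05) is filtered out;
// with min_p = 0.05 the threshold is 0.025 and every token survives.
```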

docs/model_server_rest_api_completions.md

Lines changed: 3 additions & 3 deletions
@@ -76,11 +76,12 @@ curl http://localhost/v3/completions \
 |-------|----------|----------|----------|---------|-----|
 | temperature |||| float (default: `1.0`) | The value is used to modulate token probabilities for multinomial sampling. It enables multinomial sampling when set to `> 0.0`. |
 | top_p |||| float (default: `1.0`) | Controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. |
-| top_k |||| int (default: all tokens) | Controls the number of top tokens to consider. Set to empty or -1 to consider all tokens. |
+| min_p |||| float (default: `0.0`) | Minimum probability threshold relative to the most likely token. Tokens with probability below `min_p` × the top token probability are filtered out. `0.0` (default) disables the filter. Typical values: `0.05`–`0.1`. Must be in `[0.0, 1.0)`. |
+| top_k |||| int (default: `40`) | Controls the number of top tokens to consider. When multinomial sampling is active, defaults to `40` if not set. Set to `-1` to consider all tokens. |
 | repetition_penalty |||| float (default: `1.0`) | Penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > `1.0` encourage the model to use new tokens, while values < `1.0` encourage the model to repeat tokens. `1.0` means no penalty. |
 | frequency_penalty |||| float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. |
 | presence_penalty |||| float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. |
-| seed |||| integer (default: `0`) | Random seed to use for the generation. |
+| seed |||| integer (default: random) | Random seed for generation in range `[0, 4294967295]`. Omit to use a random seed (non-deterministic). Set explicitly to get reproducible output. Note: `rng_seed` set in `generation_config.json` is not honoured for multinomial sampling — only a per-request seed is applied. |

 #### Speculative decoding specific

@@ -106,7 +107,6 @@ Note that below parameters are valid only for prompt lookup pipeline. Add `"prom


 #### Unsupported params from vLLM:
-- min_p
 - use_beam_search (**In OpenVINO Model Server just simply increase _best_of_ param to enable beam search**)
 - early_stopping
 - stop_token_ids
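The new defaults above only take effect when `temperature > 0.0`, since that is what enables multinomial sampling in the first place. For reference, a minimal illustrative sketch of temperature scaling (not the GenAI implementation; names are made up):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative only: softmax over logits scaled by 1/temperature.
// Assumes temperature > 0 (with temperature == 0 the server decodes greedily
// and none of the multinomial defaults apply). Lower values sharpen the
// distribution, higher values flatten it.
std::vector<float> temperatureSoftmax(const std::vector<float>& logits, float temperature) {
    std::vector<float> probs(logits.size());
    if (logits.empty())
        return probs;
    const float maxLogit = *std::max_element(logits.begin(), logits.end());
    float sum = 0.0f;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp((logits[i] - maxLogit) / temperature);
        sum += probs[i];
    }
    for (float& p : probs)
        p /= sum;
    return probs;
}
```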

docs/model_server_rest_api_responses.md

Lines changed: 3 additions & 2 deletions
@@ -120,11 +120,12 @@ curl http://localhost/v3/responses \
 |-------|----------|----------|---------|-----|
 | temperature ||| float (default: `1.0`) | The value is used to modulate token probabilities for multinomial sampling. It enables multinomial sampling when set to `> 0.0`. |
 | top_p ||| float (default: `1.0`) | Controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. |
-| top_k ||| int (default: all tokens) | Controls the number of top tokens to consider. Set to empty or -1 to consider all tokens. |
+| min_p ||| float (default: `0.0`) | Minimum probability threshold relative to the most likely token. Tokens with probability below `min_p` × the top token probability are filtered out. `0.0` (default) disables the filter. Typical values: `0.05`–`0.1`. Must be in `[0.0, 1.0)`. |
+| top_k ||| int (default: `40`) | Controls the number of top tokens to consider. When multinomial sampling is active, defaults to `40` if not set. Set to `-1` to consider all tokens. |
 | repetition_penalty ||| float (default: `1.0`) | Penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > `1.0` encourage the model to use new tokens, while values < `1.0` encourage the model to repeat tokens. `1.0` means no penalty. |
 | frequency_penalty ||| float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. |
 | presence_penalty ||| float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. |
-| seed ||| integer (default: `0`) | Random seed to use for the generation. |
+| seed ||| integer (default: random) | Random seed for generation in range `[0, 4294967295]`. Omit to use a random seed (non-deterministic). Set explicitly to get reproducible output. Note: `rng_seed` set in `generation_config.json` is not honoured for multinomial sampling — only a per-request seed is applied. |

 #### Speculative decoding specific

src/llm/apis/openai_api_handler.cpp

Lines changed: 34 additions & 7 deletions
@@ -740,21 +740,48 @@ absl::Status OpenAIApiHandler::parseCommonPart(std::optional<uint32_t> maxTokens
             return absl::InvalidArgumentError("top_p out of range(0.0, 1.0)");
     }

-    // top_k: int; optional - defaults to 0
-    // Extension, unsupported by OpenAI API, however supported by vLLM and CB lib
+    // min_p: float; optional - defaults to 0 (disabled)
+    // Extension, unsupported by OpenAI API, however supported by vLLM and GenAI
+    it = doc.FindMember("min_p");
+    if (it != doc.MemberEnd() && !it->value.IsNull()) {
+        if (!it->value.IsDouble() && !it->value.IsInt())
+            return absl::InvalidArgumentError("min_p is not a valid number");
+        const float minPValue = static_cast<float>(it->value.GetDouble());
+        if (minPValue < 0.0f || minPValue >= 1.0f)
+            return absl::InvalidArgumentError("min_p out of range [0.0, 1.0)");
+        request.minP = minPValue;
+    }
+
+    // top_k: int; optional - when multinomial sampling is active, defaults to 40 if not set. Pass -1 to consider all tokens.
+    // Extension, unsupported by OpenAI API, however supported by vLLM and GenAI
     it = doc.FindMember("top_k");
     if (it != doc.MemberEnd() && !it->value.IsNull()) {
         if (!it->value.IsInt())
             return absl::InvalidArgumentError("top_k is not an integer");
-        request.topK = it->value.GetInt();
+        const int topKValue = it->value.GetInt();
+        if (topKValue < -1 || topKValue == 0)
+            return absl::InvalidArgumentError("top_k must be -1 (all tokens) or a positive integer");
+        request.topK = topKValue;
     }

-    // seed: int; optional - defaults to 0 (not set)
+    // seed: uint32; optional - omit to use a random seed
     it = doc.FindMember("seed");
     if (it != doc.MemberEnd() && !it->value.IsNull()) {
-        if (!it->value.IsUint())
-            return absl::InvalidArgumentError("seed is not an unsigned integer");
-        request.seed = it->value.GetUint();
+        if (!it->value.IsInt() && !it->value.IsUint() && !it->value.IsInt64() && !it->value.IsUint64())
+            return absl::InvalidArgumentError("seed is not an integer");
+        if (it->value.IsUint64()) {
+            const uint64_t raw = it->value.GetUint64();
+            if (raw > std::numeric_limits<uint32_t>::max())
+                return absl::InvalidArgumentError("seed out of range [0, 4294967295]");
+            request.seed = static_cast<uint32_t>(raw);
+        } else if (it->value.IsUint()) {
+            request.seed = it->value.GetUint();
+        } else {
+            const int64_t raw = it->value.GetInt64();
+            if (raw < 0 || raw > static_cast<int64_t>(std::numeric_limits<uint32_t>::max()))
+                return absl::InvalidArgumentError("seed out of range [0, 4294967295]");
+            request.seed = static_cast<uint32_t>(raw);
+        }
     }

     // stop: string or array; optional - defaults to null (not set)
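The reworked `seed` parsing accepts any JSON integer representation and range-checks it into `uint32`. A small self-contained sketch (illustrative only, simplified relative to the handler above) showing the same style of RapidJSON checks applied to a few example bodies:

```cpp
#include <cstdint>
#include <iostream>
#include <limits>
#include <optional>
#include <stdexcept>
#include <string>
#include <rapidjson/document.h>

// Illustrative only: mirrors the commit's type and range checks for "seed".
std::optional<uint32_t> parseSeed(const rapidjson::Value& body) {
    auto it = body.FindMember("seed");
    if (it == body.MemberEnd() || it->value.IsNull())
        return std::nullopt;  // omitted -> the server randomizes the seed per request
    if (!it->value.IsInt() && !it->value.IsUint() && !it->value.IsInt64() && !it->value.IsUint64())
        throw std::invalid_argument("seed is not an integer");
    if (it->value.IsUint64()) {  // RapidJSON sets this flag for every non-negative integer
        const uint64_t raw = it->value.GetUint64();
        if (raw > std::numeric_limits<uint32_t>::max())
            throw std::invalid_argument("seed out of range [0, 4294967295]");
        return static_cast<uint32_t>(raw);
    }
    // Only negative integers remain (Int/Int64 flag set but not Uint64).
    throw std::invalid_argument("seed out of range [0, 4294967295]");
}

int main() {
    const char* bodies[] = {R"({"seed": 42})", R"({"seed": -1})",
                            R"({"seed": 4294967296})", R"({"temperature": 0.7})"};
    for (const char* body : bodies) {
        rapidjson::Document doc;
        doc.Parse(body);
        try {
            std::optional<uint32_t> seed = parseSeed(doc);
            std::cout << body << " -> " << (seed ? std::to_string(*seed) : "randomized") << "\n";
        } catch (const std::exception& e) {
            std::cout << body << " -> rejected: " << e.what() << "\n";
        }
    }
}
```

Negative values and values above `4294967295` are rejected, integers in range are kept, and an absent `seed` falls through to the per-request randomization added in this commit.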

src/llm/apis/openai_request.hpp

Lines changed: 3 additions & 1 deletion
@@ -17,6 +17,7 @@
 // Type that holds vector of pairs where first element is chat turn index and second is image tensor
 // this way we store information about which image is associated with which chat turn
 #pragma once
+#include <cstdint>
 #include <map>
 #include <optional>
 #include <string>
@@ -57,8 +58,9 @@ struct OpenAIRequest {
     // Multinomial decoding specific
     std::optional<float> temperature{std::nullopt};
     std::optional<float> topP{std::nullopt};
+    std::optional<float> minP{std::nullopt};
     std::optional<int> topK{std::nullopt};
-    std::optional<int> seed{std::nullopt};
+    std::optional<uint32_t> seed{std::nullopt};
     std::optional<float> frequencyPenalty{std::nullopt};
     std::optional<float> presencePenalty{std::nullopt};
     std::optional<float> repetitionPenalty{std::nullopt};

src/llm/apis/openai_responses.cpp

Lines changed: 4 additions & 0 deletions
@@ -406,6 +406,10 @@ void OpenAIResponsesHandler::serializeCommonResponseParameters(Writer<StringBuff
         writer.String("top_p");
         writer.Double(static_cast<double>(request.topP.value()));
     }
+    if (request.minP.has_value()) {
+        writer.String("min_p");
+        writer.Double(static_cast<double>(request.minP.value()));
+    }
     writer.String("truncation");
     writer.String("disabled");
     // TODO: user not supported
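For illustration, a standalone RapidJSON writer producing the same two echoed fields as the serialization above would for a request that set both `top_p` and `min_p` (values and the surrounding keys here are made up, not the full OVMS response):

```cpp
#include <iostream>
#include <rapidjson/stringbuffer.h>
#include <rapidjson/writer.h>

int main() {
    rapidjson::StringBuffer buffer;
    rapidjson::Writer<rapidjson::StringBuffer> writer(buffer);
    writer.StartObject();
    writer.String("top_p");
    writer.Double(0.9);   // echoed only when request.topP was set
    writer.String("min_p");
    writer.Double(0.05);  // new: echoed only when request.minP was set
    writer.String("truncation");
    writer.String("disabled");
    writer.EndObject();
    std::cout << buffer.GetString() << "\n";  // {"top_p":0.9,"min_p":0.05,"truncation":"disabled"}
}
```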

src/llm/io_processing/base_generation_config_builder.cpp

Lines changed: 24 additions & 1 deletion
@@ -16,6 +16,7 @@

 #include "../../logging.hpp"
 #include <limits>
+#include <random>
 #include <string>
 #include <openvino/genai/generation_config.hpp>
 #include "base_generation_config_builder.hpp"
@@ -118,9 +119,11 @@ void BaseGenerationConfigBuilder::parseConfigFromRequest(const OpenAIRequest& re
     if (request.temperature.has_value())
         config.temperature = request.temperature.value();
     if (request.topK.has_value())
-        config.top_k = request.topK.value();
+        config.top_k = (request.topK.value() == -1) ? std::numeric_limits<size_t>::max() : static_cast<size_t>(request.topK.value());
     if (request.topP.has_value())
         config.top_p = request.topP.value();
+    if (request.minP.has_value())
+        config.min_p = request.minP.value();
     if (request.seed.has_value())
         config.rng_seed = request.seed.value();
     if (request.stop.has_value())
@@ -133,6 +136,26 @@ void BaseGenerationConfigBuilder::parseConfigFromRequest(const OpenAIRequest& re
         config.presence_penalty = request.presencePenalty.value();
     config.do_sample = config.temperature > 0.0f && config.num_beams == 1;

+    // Apply multinomial sampling defaults when not explicitly set
+    if (config.do_sample) {
+        if (!request.topK.has_value() && config.top_k == std::numeric_limits<size_t>::max()) {
+            config.top_k = 40;
+            SPDLOG_LOGGER_DEBUG(llm_calculator_logger, "Defaulting top_k to 40 for multinomial sampling.");
+        }
+        // Use random seed for multinomial sampling to ensure non-deterministic behavior by default.
+        // Note: rng_seed from generation_config.json is not honoured — only an explicit per-request
+        // seed produces deterministic output.
+        // Use a thread_local mt19937 seeded once via std::random_device to avoid per-request overhead.
+        if (!request.seed.has_value()) {
+            static thread_local std::mt19937 rng{std::random_device{}()};
+            size_t seed = 0;
+            while (seed == 0)
+                seed = rng();
+            config.rng_seed = seed;
+            SPDLOG_LOGGER_DEBUG(llm_calculator_logger, "Randomizing rng_seed for multinomial sampling: {}.", config.rng_seed);
+        }
+    }
+
     if (request.logprobschat || request.logprobs)
         config.logprobs = 1;
     // Assisted decoding specific
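Taken together, the new defaults can be summarized in a small self-contained sketch (simplified, with illustrative types and names, not the actual `BaseGenerationConfigBuilder`): when sampling is enabled and the request did not set `top_k`, it becomes 40; when the request did not set `seed`, a fresh non-zero seed is drawn per request, so repeated identical requests give different outputs unless a seed is supplied.

```cpp
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <limits>
#include <optional>
#include <random>

struct SamplingRequest {  // simplified stand-in for the relevant OpenAIRequest fields
    float temperature = 1.0f;
    std::optional<int> topK;
    std::optional<uint32_t> seed;
};

struct EffectiveConfig {  // simplified stand-in for the GenAI generation config fields
    bool doSample = false;
    size_t topK = std::numeric_limits<size_t>::max();  // "all tokens"
    size_t rngSeed = 0;
};

EffectiveConfig resolve(const SamplingRequest& req) {
    EffectiveConfig cfg;
    cfg.doSample = req.temperature > 0.0f;  // beam search is ignored in this sketch
    if (req.topK.has_value())
        cfg.topK = (*req.topK == -1) ? std::numeric_limits<size_t>::max()
                                     : static_cast<size_t>(*req.topK);
    if (req.seed.has_value())
        cfg.rngSeed = *req.seed;
    if (cfg.doSample) {
        if (!req.topK.has_value())
            cfg.topK = 40;  // new default for multinomial sampling
        if (!req.seed.has_value()) {
            static thread_local std::mt19937 rng{std::random_device{}()};
            while (cfg.rngSeed == 0)  // draw until non-zero, as in the commit
                cfg.rngSeed = rng();
        }
    }
    return cfg;
}

int main() {
    EffectiveConfig a = resolve({0.7f, std::nullopt, std::nullopt});
    EffectiveConfig b = resolve({0.7f, std::nullopt, std::nullopt});
    std::cout << "default top_k: " << a.topK
              << ", seeds differ across requests: " << (a.rngSeed != b.rngSeed) << "\n";
    EffectiveConfig c = resolve({0.7f, -1, 123u});
    std::cout << "top_k=-1 maps to: " << c.topK << ", explicit seed kept: " << c.rngSeed << "\n";
}
```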
