PaddlePaddle · ckl117 · May 21, 2026
diff --git a/docs/features/sampling.md b/docs/features/sampling.md
@@ -19,6 +19,10 @@ Sampling strategies are used to determine how to select the next token from the
    * Min-p sampling calculates `pivot=max_prob * min_p`, then retains only tokens with probabilities greater than the `pivot` (setting others to zero) for subsequent sampling.
    * It filters out tokens with relatively low probabilities, sampling only from high-probability tokens to improve generation quality.
 
+4. Sampling Threshold
+
+   * Only tokens with probability greater than or equal to `sampling_threshold` are retained (others are set to zero) for subsequent sampling.
+   * Filters out tokens whose absolute probability falls below the threshold, sampling only from sufficiently high-probability tokens to improve generation quality.
 ## Usage Instructions
 
 During deployment, you can choose the sampling algorithm by setting the environment variable `FD_SAMPLING_CLASS`. Available values are `base`, `base_non_truncated`, `air`, or `rejection`.
@@ -211,6 +215,49 @@ for chunk in response:
 print('\n')
 ```
 
+### Setting Sampling Threshold
+
+If you want to apply a sampling threshold after min-p filtering but before top-p or `top_k_top_p` sampling, specify the following parameters when sending a request:
+
+* Example request with curl:
+
+```bash
+curl -X POST "http://0.0.0.0:9222/v1/chat/completions" \
+-H "Content-Type: application/json" \
+-d '{
+  "messages": [
+    {"role": "user", "content": "How old are you"}
+  ],
+  "min_p": 0.1,
+  "sampling_threshold": 0.00002,
+  "top_p": 0.8,
+  "top_k": 20
+}'
+```
+
+* Example request with Python:
+
+```python
+import openai
+host = "0.0.0.0"
+port = "8170"
+client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")
+
+response = client.chat.completions.create(
+    model="null",
+    messages=[
+        {"role": "system", "content": "I'm a helpful AI assistant."},
+    ],
+    stream=True,
+    top_p=0.8,
+    extra_body={"top_k": 20, "min_p": 0.1, "sampling_threshold": 0.00002}
+)
+for chunk in response:
+    if chunk.choices[0].delta:
+        print(chunk.choices[0].delta.content, end='')
+print('\n')
+```
+
 With the above configurations, you can flexibly choose and use the appropriate sampling strategy according to the needs of specific generation tasks.
 
 ## Parameter Description
@@ -221,6 +268,8 @@ With the above configurations, you can flexibly choose and use the appropriate s
 
 `min_p`: Low probability filtering threshold, considering only the token set with probability greater than or equal to (`max_prob*min_p`). It is a float type, with a range of [0.0, 1.0].
 
+`sampling_threshold`: Absolute probability filtering threshold, only retaining tokens with probability greater than or equal to `sampling_threshold`. It is a float type, with a range of [0.0, 1.0).
+
 # Bad Words
 
 Used to prevent the model from generating certain specific words during the inference process. Commonly applied in safety control, content filtering, and behavioral constraints of the model.
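
The documented order (min-p first, then the absolute threshold, then top-p or top-k) can be sketched in plain Python. This is only an illustration under stated assumptions, not the kernel the PR calls: `filter_probs` is a hypothetical helper, the `>=` comparison for min-p follows the parameter description, and the renormalization and the fallback when every token is filtered are assumptions that the diff does not confirm.

```python
from typing import List


def filter_probs(probs: List[float], min_p: float = 0.0, sampling_threshold: float = 0.0) -> List[float]:
    """Zero out low-probability tokens, then renormalize (renormalizing is an assumption)."""
    # Min-p is a relative cutoff: pivot = max_prob * min_p.
    pivot = max(probs) * min_p
    kept = [p if p >= pivot else 0.0 for p in probs]
    # The sampling threshold is an absolute cutoff, applied after min-p.
    kept = [p if p >= sampling_threshold else 0.0 for p in kept]
    total = sum(kept)
    if total == 0.0:
        # Every token fell below the threshold; keeping the original
        # distribution here is an assumption, not documented behavior.
        return list(probs)
    return [p / total for p in kept]


probs = [0.5, 0.3, 0.15, 0.04, 0.01]
# min_p=0.1 gives pivot 0.05 and drops 0.04 and 0.01;
# sampling_threshold=0.2 then drops 0.15 as well.
filtered = filter_probs(probs, min_p=0.1, sampling_threshold=0.2)
```

Because the threshold is absolute, a flat distribution over a large vocabulary can fall entirely below it, which is why the sketch needs an explicit fallback.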

diff --git a/docs/offline_inference.md b/docs/offline_inference.md
@@ -185,6 +185,7 @@ For ``LLM`` configuration, refer to [Parameter Documentation](parameters.md).
 * top_p(float): Probability threshold for token selection
 * top_k(int): Number of tokens considered for sampling
 * min_p(float): Minimum probability relative to the maximum probability for a token to be considered (>0 filters low-probability tokens to improve quality)
+* sampling_threshold(float): Minimum absolute probability threshold for a token to be considered (>0 filters low-probability tokens by absolute value to improve generation quality). Range [0.0, 1.0).
 * max_tokens(int): Maximum generated tokens (input + output)
 * min_tokens(int): Minimum forced generation length
 * bad_words(list[str]): Prohibited words

diff --git a/docs/online_serving/README.md b/docs/online_serving/README.md
@@ -157,6 +157,9 @@ top_k: Optional[int] = None
 min_p: Optional[float] = None
 # Nucleus sampling threshold, only retaining tokens whose cumulative probability exceeds min_p (default None means disabled).
 
+sampling_threshold: Optional[float] = None
+# Absolute probability filtering threshold, only retaining tokens with probability greater than or equal to sampling_threshold (default None means disabled). Range [0.0, 1.0).
+
 min_tokens: Optional[int] = None
 # Forces a minimum number of tokens to be generated, avoiding premature truncation (default None means no limit).
 
@@ -434,6 +437,9 @@ top_k: Optional[int] = None
 min_p: Optional[float] = None
 # Nucleus sampling threshold, only retaining tokens whose cumulative probability exceeds min_p (default None means disabled).
 
+sampling_threshold: Optional[float] = None
+# Absolute probability filtering threshold, only retaining tokens with probability greater than or equal to sampling_threshold (default None means disabled). Range [0.0, 1.0).
+
 min_tokens: Optional[int] = None
 # Forces a minimum number of tokens to be generated, avoiding premature truncation (default None means no limit).
 

diff --git a/docs/zh/features/sampling.md b/docs/zh/features/sampling.md
@@ -17,6 +17,11 @@
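    * Min-p sampling first calculates `pivot=max_prob * min_p`, then retains only tokens with probabilities greater than `pivot` (setting others to zero) for subsequent sampling.
    * It filters out tokens with relatively low probabilities, sampling only from high-probability tokens to improve generation quality.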
 
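+4. Sampling Threshold
+
+   * Only tokens with probability greater than or equal to `sampling_threshold` are sampled (others are set to zero).
+   * Filters out tokens whose absolute probability is too low, sampling only from high-probability tokens to improve generation quality.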
+
 ## Usage Instructions
 
 During deployment, you can choose the sampling algorithm by setting the environment variable `FD_SAMPLING_CLASS`. Available values are `base`, `base_non_truncated`, `air`, or `rejection`.
@@ -214,13 +219,58 @@ for chunk in response:
 print('\n')
 ```
 
+### Setting Sampling Threshold
+
+If you want to apply a sampling threshold after min_p but before top_p or top_k_top_p, specify the following parameters when sending a request:
+
+* Example request with curl:
+
+```bash
+curl -X POST "http://0.0.0.0:9222/v1/chat/completions" \
+-H "Content-Type: application/json" \
+-d '{
+  "messages": [
+    {"role": "user", "content": "How old are you"}
+  ],
+  "min_p": 0.1,
+  "sampling_threshold": 0.00002,
+  "top_p": 0.8,
+  "top_k": 20
+}'
+```
+
+* Example request with Python:
+
+```python
+import openai
+host = "0.0.0.0"
+port = "8170"
+client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")
+
+response = client.chat.completions.create(
+    model="null",
+    messages=[
+        {"role": "system", "content": "I'm a helpful AI assistant."},
+        {"role": "user", "content": "Rewrite Li Bai's 'Quiet Night Thoughts' as a modern poem"},
+    ],
+    stream=True,
+    top_p=0.8,
+    extra_body={"top_k": 20, "min_p": 0.1, "sampling_threshold": 0.00002}
+)
+for chunk in response:
+    if chunk.choices[0].delta:
+        print(chunk.choices[0].delta.content, end='')
+print('\n')
+```
+
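 With the above configurations, you can flexibly choose and use the appropriate sampling strategy according to the needs of specific generation tasks.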
 
 ## Parameter Description
 
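 * `top_p`: Cumulative probability truncation threshold; only the most likely tokens whose cumulative probability reaches this threshold are considered. It is a float type, with a range of [0.0, 1.0]. When top_p=1.0, all tokens are considered; when top_p=0.0, it degenerates to greedy search.
 * `top_k`: Number of highest-probability tokens to sample from; sampling is restricted to the k most probable tokens. It is an int type, with a range of [0, vocab_size].
 * `min_p`: Low-probability filtering threshold; only tokens with probability greater than or equal to (max_prob*min_p) are considered. It is a float type, with a range of [0.0, 1.0].
+* `sampling_threshold`: Low-probability filtering threshold; only tokens with probability greater than or equal to sampling_threshold are considered. It is a float type, with a range of [0.0, 1.0).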
 
 # Bad Words
 

diff --git a/docs/zh/offline_inference.md b/docs/zh/offline_inference.md
@@ -185,6 +185,7 @@ for output in outputs:
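 * top_p(float): Cumulative probability truncation threshold; only the most likely tokens whose cumulative probability reaches this threshold are considered
 * top_k(int): Number of highest-probability tokens; only the k most probable tokens are considered for sampling
 * min_p(float): Minimum probability threshold for a token to be selected (relative to the highest token probability; setting >0 filters low-probability tokens to improve generation quality)
+* sampling_threshold(float): Minimum probability threshold for a token to be selected (absolute probability; setting >0 filters low-probability tokens to improve generation quality)
 * max_tokens(int): Maximum number of tokens the model may generate (including input and output)
 * min_tokens(int): Minimum number of tokens the model is forced to generate, avoiding premature termination
 * bad_words(list[str]): List of prohibited words, preventing the model from generating unwanted words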

diff --git a/docs/zh/online_serving/README.md b/docs/zh/online_serving/README.md
@@ -155,6 +155,9 @@ top_k: Optional[int] = None
 min_p: Optional[float] = None
 # Nucleus sampling threshold, only retaining tokens whose cumulative probability exceeds min_p (default None means disabled).
 
+sampling_threshold: Optional[float] = None
+# Sampling threshold, only retaining tokens with probability greater than or equal to sampling_threshold (default None means disabled). Range [0.0, 1.0).
+
 min_tokens: Optional[int] = None
 # Forces a minimum number of tokens to be generated, avoiding premature truncation (default None means no limit).
 
@@ -425,6 +428,9 @@ top_k: Optional[int] = None
 min_p: Optional[float] = None
 # Nucleus sampling threshold, only retaining tokens whose cumulative probability exceeds min_p (default None means disabled).
 
+sampling_threshold: Optional[float] = None
+# Sampling threshold, only retaining tokens with probability greater than or equal to sampling_threshold (default None means disabled). Range [0.0, 1.0).
+
 min_tokens: Optional[int] = None
 # Forces a minimum number of tokens to be generated, avoiding premature truncation (default None means no limit).
 

diff --git a/fastdeploy/engine/sampling_params.py b/fastdeploy/engine/sampling_params.py
@@ -94,6 +94,7 @@ class SamplingParams:
     top_p: float = None
     top_k: int = 0
     min_p: float = 0.0
+    sampling_threshold: float = 0.0
     seed: Optional[int] = None
     stop: Optional[Union[str, List[str]]] = None
     stop_token_ids: Optional[List[int]] = None
@@ -160,6 +161,11 @@ def from_generic_request(cls, req: T) -> SamplingParams:
             top_p=getattr(req, "top_p", None) if getattr(req, "top_p", None) is not None else cls.top_p,
             top_k=getattr(req, "top_k", None) if getattr(req, "top_k", None) is not None else cls.top_k,
             min_p=getattr(req, "min_p", None) if getattr(req, "min_p", None) is not None else cls.min_p,
+            sampling_threshold=(
+                getattr(req, "sampling_threshold", None)
+                if getattr(req, "sampling_threshold", None) is not None
+                else cls.sampling_threshold
+            ),
             seed=getattr(req, "seed", None) if getattr(req, "seed", None) is not None else cls.seed,
             stop=getattr(req, "stop", None) if getattr(req, "stop", None) is not None else cls.stop,
             stop_token_ids=(
@@ -234,6 +240,7 @@ def from_optional(
         top_p,
         top_k,
         min_p,
+        sampling_threshold=None,
         seed=None,
         stop=None,
         stop_token_ids=None,
@@ -259,6 +266,7 @@ def from_optional(
             top_p=top_p,
             top_k=top_k if top_k is not None else 0,
             min_p=min_p if min_p is not None else 0.0,
+            sampling_threshold=sampling_threshold if sampling_threshold is not None else 0.0,
             seed=seed,
             stop=stop,
             stop_token_ids=stop_token_ids,
@@ -305,6 +313,10 @@ def _verify_args(self) -> None:
             raise TypeError(f"top_k must be an integer, got {type(self.top_k).__name__}")
         if not 0.0 <= self.min_p <= 1.0:
             raise ValueError("min_p must be in [0,1],got f{self.min_p}")
+        if not isinstance(self.sampling_threshold, float):
+            raise TypeError(f"sampling_threshold must be a float, got {type(self.sampling_threshold).__name__}")
+        if not 0.0 <= self.sampling_threshold < 1.0:
+            raise ValueError(f"sampling_threshold must be in [0.0, 1.0), got {self.sampling_threshold}.")
 
         if self.max_tokens is not None and self.max_tokens < 1:
             raise ValueError(f"max_tokens must be at least 1, got {self.max_tokens}.")
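
A note on the two new checks in `_verify_args`: the type check runs before the range check, so an integer `0` is rejected with `TypeError` even though `0.0` is valid, while `None` passed through `from_optional` becomes `0.0`. The sketch below reproduces only that behavior. `ThresholdParams` and `check` are hypothetical stand-ins, and running the checks from `__post_init__` is an assumption about when the real class validates.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThresholdParams:
    """Hypothetical stand-in for SamplingParams; only the new field is modelled."""

    sampling_threshold: float = 0.0

    @classmethod
    def from_optional(cls, sampling_threshold: Optional[float] = None) -> "ThresholdParams":
        # None falls back to 0.0, which leaves the filter disabled.
        return cls(sampling_threshold=sampling_threshold if sampling_threshold is not None else 0.0)

    def __post_init__(self) -> None:
        # Same two checks as the diff: type first, then the half-open range.
        if not isinstance(self.sampling_threshold, float):
            raise TypeError(f"sampling_threshold must be a float, got {type(self.sampling_threshold).__name__}")
        if not 0.0 <= self.sampling_threshold < 1.0:
            raise ValueError(f"sampling_threshold must be in [0.0, 1.0), got {self.sampling_threshold}.")


def check(value) -> str:
    """Return 'ok', or the name of the exception raised for a candidate value."""
    try:
        ThresholdParams.from_optional(value)
        return "ok"
    except (TypeError, ValueError) as exc:
        return type(exc).__name__


results = {repr(v): check(v) for v in (None, 0.00002, 0, 1.0, -0.1)}
```

Callers that build parameters programmatically may want to pass `float(value)` so that integer literals do not trip the type check.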

diff --git a/fastdeploy/entrypoints/openai/protocol.py b/fastdeploy/entrypoints/openai/protocol.py
@@ -524,6 +524,7 @@ class CompletionRequest(BaseModel):
     # doc: begin-completion-sampling-params
     top_k: Optional[int] = None
     min_p: Optional[float] = None
+    sampling_threshold: Optional[float] = Field(default=None, ge=0.0, lt=1.0)
     repetition_penalty: Optional[float] = None
     stop_token_ids: Optional[List[int]] = Field(default_factory=list)
     min_tokens: Optional[int] = None
@@ -703,6 +704,7 @@ class ChatCompletionRequest(BaseModel):
     # doc: begin-chat-completion-sampling-params
     top_k: Optional[int] = None
     min_p: Optional[float] = None
+    sampling_threshold: Optional[float] = Field(default=None, ge=0.0, lt=1.0)
     min_tokens: Optional[int] = None
     include_stop_str_in_output: Optional[bool] = False
     bad_words: Optional[List[str]] = None

diff --git a/fastdeploy/model_executor/layers/sample/meta_data.py b/fastdeploy/model_executor/layers/sample/meta_data.py
@@ -48,6 +48,7 @@ class SamplingMetadata:
     top_k_list: Optional[list] = None
     min_p: Optional[paddle.Tensor] = None
     min_p_list: Optional[list] = None
+    sampling_threshold: Optional[paddle.Tensor] = None
     seed: Optional[paddle.Tensor] = None
     max_num_logprobs: Optional[int] = None
     enable_early_stop: Optional[int] = False

diff --git a/fastdeploy/model_executor/layers/sample/sampler.py b/fastdeploy/model_executor/layers/sample/sampler.py
@@ -646,6 +646,7 @@ def forward_cuda(
                 sampling_metadata.top_p,
                 sampling_metadata.top_k,
                 sampling_metadata.top_k_list,
+                threshold=sampling_metadata.sampling_threshold,
                 topp_seed=sampling_metadata.seed,
             )
 
@@ -926,7 +927,12 @@ def _verify_and_sample(
                     increment_value,
                 )
                 _, target_tokens = top_k_top_p_sampling(
-                    probs, top_p=top_p, top_k=top_k, top_k_list=sampling_metadata.top_k_list, topp_seed=topp_seed
+                    probs,
+                    top_p=top_p,
+                    top_k=top_k,
+                    top_k_list=sampling_metadata.top_k_list,
+                    threshold=sampling_metadata.sampling_threshold,
+                    topp_seed=topp_seed,
                 )
         elif self.verify_strategy == VerifyStrategy.GREEDY:
             # GREEDY: deterministic argmax in target_tokens, no candidates needed
@@ -1021,6 +1027,7 @@ def _normal_sample(
                 sampling_metadata.top_p,
                 sampling_metadata.top_k,
                 sampling_metadata.top_k_list,
+                threshold=sampling_metadata.sampling_threshold,
                 topp_seed=sampling_metadata.seed,
             )
 
@@ -1193,6 +1200,7 @@ def _normal_sample_xpu(
             top_p=top_p,
             top_k=top_k,
             top_k_list=sampling_metadata.top_k_list,
+            threshold=sampling_metadata.sampling_threshold,
             topp_seed=topp_seed,
         )
         real_bsz = share_inputs["seq_lens_this_time"].shape[0]
@@ -1238,6 +1246,7 @@ def _verify_and_sample_xpu(
                 top_p=top_p,
                 top_k=top_k,
                 top_k_list=sampling_metadata.top_k_list,
+                threshold=sampling_metadata.sampling_threshold,
                 topp_seed=topp_seed,
             )
         elif self.verify_strategy == VerifyStrategy.GREEDY:

diff --git a/fastdeploy/worker/gpu_model_runner.py b/fastdeploy/worker/gpu_model_runner.py
@@ -1045,10 +1045,14 @@ def insert_tasks_v1(self, req_dicts: BatchRequest, num_running_requests: int = N
             assert len(request.eos_token_ids) == self.model_config.eos_tokens_lens
             self.share_inputs["min_p_list"][idx] = request.get("min_p", 0.0)
             self.share_inputs["top_k_list"][idx] = request.get("top_k", 0)
+            self.share_inputs["sampling_threshold_list"][idx] = request.get("sampling_threshold", 0.0)
             async_set_value(self.share_inputs["eos_token_id"][:], request.eos_token_ids)
             async_set_value(self.share_inputs["top_p"][idx : idx + 1], request.get("top_p", 0.7))
             async_set_value(self.share_inputs["top_k"][idx : idx + 1], request.get("top_k", 0))
             async_set_value(self.share_inputs["min_p"][idx : idx + 1], request.get("min_p", 0.0))
+            async_set_value(
+                self.share_inputs["sampling_threshold"][idx : idx + 1], request.get("sampling_threshold", 0.0)
+            )
             async_set_value(self.share_inputs["temperature"][idx : idx + 1], request.get("temperature", 0.95))
             async_set_value(self.share_inputs["penalty_score"][idx : idx + 1], request.get("repetition_penalty", 1.0))
             async_set_value(self.share_inputs["frequency_score"][idx : idx + 1], request.get("frequency_penalty", 0.0))
@@ -1345,6 +1349,11 @@ def _prepare_inputs(self, cached_token_num=-1, cached_real_bsz=-1, is_dummy_or_p
         self.initialize_forward_meta(is_dummy_or_profile_run=is_dummy_or_profile_run)
         self.forward_meta.real_bsz = real_bsz
 
+        sampling_threshold_list = self.share_inputs["sampling_threshold_list"]
+        sampling_threshold_tensor = (
+            self.share_inputs["sampling_threshold"] if any(v > 0.0 for v in sampling_threshold_list) else None
+        )
+
         # Get sampling metadata
         self.sampling_metadata = SamplingMetadata(
             temperature=self.share_inputs["temperature"],
@@ -1353,6 +1362,7 @@ def _prepare_inputs(self, cached_token_num=-1, cached_real_bsz=-1, is_dummy_or_p
             top_k_list=self.share_inputs["top_k_list"],
             min_p=self.share_inputs["min_p"],
             min_p_list=self.share_inputs["min_p_list"],
+            sampling_threshold=sampling_threshold_tensor,
             seed=self.share_inputs["infer_seed"],
             step_idx=self.share_inputs["step_idx"],
             token_ids_all=self.share_inputs["token_ids_all"],
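
The gating in `_prepare_inputs` hands the threshold tensor to the sampler only when at least one slot has a value above zero, and passes `None` otherwise. A minimal sketch of that pattern follows, with nested lists standing in for the `[max_num_seqs, 1]` paddle tensor. `select_threshold_input` and `apply_threshold` are hypothetical names, and treating `None` as "skip the filter" on the consumer side is an assumption, since the sampling kernel itself is not part of this diff.

```python
from typing import List, Optional

Tensor2D = List[List[float]]  # stand-in for a [batch, 1] or [batch, vocab] tensor


def select_threshold_input(threshold_list: List[float], threshold_tensor: Tensor2D) -> Optional[Tensor2D]:
    """Return the per-slot thresholds only if some request enabled the filter."""
    return threshold_tensor if any(v > 0.0 for v in threshold_list) else None


def apply_threshold(probs_batch: Tensor2D, threshold_tensor: Optional[Tensor2D]) -> Tensor2D:
    """Zero out entries below each row's threshold; None means the filter is skipped (assumed)."""
    if threshold_tensor is None:
        return probs_batch
    return [
        [p if p >= row_threshold[0] else 0.0 for p in row]
        for row, row_threshold in zip(probs_batch, threshold_tensor)
    ]


batch = [[0.9, 0.1], [0.9, 0.1]]
# Only the second slot asked for a threshold, so only its row is filtered.
thresholds = select_threshold_input([0.0, 0.2], [[0.0], [0.2]])
filtered = apply_threshold(batch, thresholds)
```

Keeping a plain Python list next to the tensor lets the `any(...)` check run on the host without reading the tensor back.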

diff --git a/fastdeploy/worker/input_batch.py b/fastdeploy/worker/input_batch.py
@@ -127,6 +127,8 @@ def init_share_inputs(self):
         self.top_k_list = [0] * max_num_seqs
         self.min_p = paddle.full([max_num_seqs, 1], 0.0, dtype="float32")
         self.min_p_list = [0.0] * max_num_seqs
+        self.sampling_threshold = paddle.full([max_num_seqs, 1], 0.0, dtype="float32")
+        self.sampling_threshold_list = [0.0] * max_num_seqs
         self.temperature = paddle.full([max_num_seqs, 1], self.model_config.temperature, dtype="float32")
         self.penalty_score = paddle.full([max_num_seqs, 1], self.model_config.penalty_score, dtype="float32")
         self.frequency_score = paddle.full(
@@ -394,6 +396,7 @@ def swap_data(tensor, idx1, idx2):
         swap_data(self.token_ids_all, i1, i2)
         swap_data(self.input_ids, i1, i2)
         swap_data(self.top_p, i1, i2)
+        swap_data(self.sampling_threshold, i1, i2)
         swap_data(self.top_k, i1, i2)
         swap_data(self.min_p, i1, i2)
         swap_data(self.temperature, i1, i2)
@@ -419,6 +422,10 @@ def swap_data(tensor, idx1, idx2):
         # # Swap list-based arrays (lists don't need clone)
         self.top_k_list[i1], self.top_k_list[i2] = self.top_k_list[i2], self.top_k_list[i1]
         self.min_p_list[i1], self.min_p_list[i2] = self.min_p_list[i2], self.min_p_list[i1]
+        self.sampling_threshold_list[i1], self.sampling_threshold_list[i2] = (
+            self.sampling_threshold_list[i2],
+            self.sampling_threshold_list[i1],
+        )
 
         # Swap 1D arrays
         swap_data(self.bad_tokens, i1, i2)
@@ -563,6 +570,7 @@ def reset_share_inputs(self):
             fill_paddle_tensor(self, "top_p", self.model_config.top_p)
             fill_paddle_tensor(self, "top_k", 0)
             fill_paddle_tensor(self, "min_p", 0.0)
+            fill_paddle_tensor(self, "sampling_threshold", 0.0)
             fill_paddle_tensor(self, "temperature", self.model_config.temperature)
             fill_paddle_tensor(self, "penalty_score", self.model_config.penalty_score)
             fill_paddle_tensor(self, "frequency_score", self.model_config.frequency_score)
@@ -573,6 +581,7 @@ def reset_share_inputs(self):
             # Reset list variables (not paddle tensors)
             self.top_k_list = [0] * max_num_seqs
             self.min_p_list = [0.0] * max_num_seqs
+            self.sampling_threshold_list = [0.0] * max_num_seqs
 
             fill_paddle_tensor(self, "min_dec_len", self.model_config.min_length)
             fill_paddle_tensor(self, "max_dec_len", self.model_config.max_model_len)
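
Since the threshold is stored twice per slot (a tensor and a list), the `input_batch.py` changes have to touch both copies on every swap and reset. The sketch below shows that invariant in miniature. `ThresholdSlots` is a hypothetical class, plain lists replace paddle tensors, and the `insert` and `in_sync` helpers exist only for illustration.

```python
from typing import List


class ThresholdSlots:
    """Hypothetical miniature of the per-slot threshold state."""

    def __init__(self, max_num_seqs: int) -> None:
        self.max_num_seqs = max_num_seqs
        self.sampling_threshold: List[List[float]] = []
        self.sampling_threshold_list: List[float] = []
        self.reset()

    def reset(self) -> None:
        # Both copies return to 0.0, which disables the filter for every slot.
        self.sampling_threshold = [[0.0] for _ in range(self.max_num_seqs)]
        self.sampling_threshold_list = [0.0] * self.max_num_seqs

    def insert(self, idx: int, value: float) -> None:
        # A new request writes its threshold into both copies.
        self.sampling_threshold[idx][0] = value
        self.sampling_threshold_list[idx] = value

    def swap(self, i1: int, i2: int) -> None:
        # Reordering slots must move the tensor row and the list entry together.
        t, l = self.sampling_threshold, self.sampling_threshold_list
        t[i1], t[i2] = t[i2], t[i1]
        l[i1], l[i2] = l[i2], l[i1]

    def in_sync(self) -> bool:
        return [row[0] for row in self.sampling_threshold] == self.sampling_threshold_list


slots = ThresholdSlots(max_num_seqs=3)
slots.insert(0, 0.2)
slots.swap(0, 2)
```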