49 changes: 49 additions & 0 deletions docs/features/sampling.md
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,10 @@ Sampling strategies are used to determine how to select the next token from the
* Min-p sampling calculates `pivot=max_prob * min_p`, then retains only tokens with probabilities greater than the `pivot` (setting others to zero) for subsequent sampling.
* It filters out tokens with relatively low probabilities, sampling only from high-probability tokens to improve generation quality.

4. Sampling Threshold

* Only tokens with probability greater than or equal to `sampling_threshold` are retained (others are set to zero) for subsequent sampling.
* Filters out tokens whose absolute probability falls below the threshold, sampling only from sufficiently high-probability tokens to improve generation quality.
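
The absolute filter described in item 4 can be sketched in a few lines of plain Python. This is only an illustration of the rule stated above, not FastDeploy's kernel: the function name `apply_sampling_threshold` is hypothetical, and both the renormalization step and the fallback when every token is filtered out are assumptions that the text does not confirm.

```python
def apply_sampling_threshold(probs, sampling_threshold):
    """Zero out tokens whose probability is below the absolute threshold."""
    kept = [p if p >= sampling_threshold else 0.0 for p in probs]
    total = sum(kept)
    if total == 0.0:
        # Assumption: if every token was filtered out, fall back to the input.
        return list(probs)
    # Assumption: renormalize so the surviving probabilities sum to 1.
    return [p / total for p in kept]


probs = [0.6, 0.3, 0.09999, 0.00001]
filtered = apply_sampling_threshold(probs, sampling_threshold=0.00002)
print(filtered)
```

Here only the last token falls below the threshold, so it is set to zero while the remaining three keep their relative proportions.
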
## Usage Instructions

During deployment, you can choose the sampling algorithm by setting the environment variable `FD_SAMPLING_CLASS`. Available values are `base`, `base_non_truncated`, `air`, or `rejection`.
Expand Down Expand Up @@ -211,6 +215,49 @@ for chunk in response:
print('\n')
```

### Setting Sampling Threshold

To apply a sampling threshold after min-p filtering and before `top_p` or `top_k_top_p` sampling, specify the following parameters when sending a request:

* Example request with curl:

```bash
curl -X POST "http://0.0.0.0:9222/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "How old are you"}
    ],
    "min_p": 0.1,
    "sampling_threshold": 0.00002,
    "top_p": 0.8,
    "top_k": 20
  }'
```

* Example request with Python:

```python
import openai

host = "0.0.0.0"
port = "8170"
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")

response = client.chat.completions.create(
    model="null",
    messages=[
        {"role": "system", "content": "I'm a helpful AI assistant."},
        {"role": "user", "content": "How old are you"},
    ],
    stream=True,
    top_p=0.8,
    # top_k, min_p and sampling_threshold are not standard OpenAI arguments,
    # so they are passed through extra_body.
    extra_body={"top_k": 20, "min_p": 0.1, "sampling_threshold": 0.00002},
)
for chunk in response:
    if chunk.choices[0].delta:
        print(chunk.choices[0].delta.content, end='')
print('\n')
```
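
To make the ordering concrete, the sketch below applies a relative min-p filter and then an absolute threshold to a toy distribution. It is a rough illustration under stated assumptions: the helper names are hypothetical, the `>=` comparison follows the Parameter Description, and the parameter values are chosen only to make the effect visible (they differ from the request examples above). The subsequent `top_p` / `top_k_top_p` step is not modeled.

```python
def min_p_filter(probs, min_p):
    """Relative filter: keep tokens with probability >= max_prob * min_p."""
    pivot = max(probs) * min_p
    return [p if p >= pivot else 0.0 for p in probs]


def sampling_threshold_filter(probs, sampling_threshold):
    """Absolute filter: keep tokens with probability >= sampling_threshold."""
    return [p if p >= sampling_threshold else 0.0 for p in probs]


# Toy distribution: three likely tokens plus a long tail of 100 unlikely ones.
probs = [0.4, 0.3, 0.2] + [0.001] * 100

# Step 1: min-p. The pivot is about 0.0004, so the whole tail survives.
after_min_p = min_p_filter(probs, min_p=0.001)
# Step 2: absolute threshold. Every tail token (0.001) is below 0.005.
after_threshold = sampling_threshold_filter(after_min_p, sampling_threshold=0.005)

survivors_min_p = sum(1 for p in after_min_p if p > 0.0)
survivors_threshold = sum(1 for p in after_threshold if p > 0.0)
print(survivors_min_p, survivors_threshold)
```

With a loose `min_p`, the relative filter lets the entire tail through, while the absolute threshold removes it regardless of how large the top probability is.
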

With the above configurations, you can flexibly choose and use the appropriate sampling strategy according to the needs of specific generation tasks.

## Parameter Description
Expand All @@ -221,6 +268,8 @@ With the above configurations, you can flexibly choose and use the appropriate s

`min_p`: Low probability filtering threshold, considering only the token set with probability greater than or equal to (`max_prob*min_p`). It is a float type, with a range of [0.0, 1.0].

`sampling_threshold`: Absolute probability filtering threshold, only retaining tokens with probability greater than or equal to `sampling_threshold`. It is a float type, with a range of [0.0, 1.0).

# Bad Words

Used to prevent the model from generating certain specific words during the inference process. Commonly applied in safety control, content filtering, and behavioral constraints of the model.
Expand Down
1 change: 1 addition & 0 deletions docs/offline_inference.md
Original file line number Diff line number Diff line change
Expand Up @@ -185,6 +185,7 @@ For ``LLM`` configuration, refer to [Parameter Documentation](parameters.md).
* top_p(float): Probability threshold for token selection
* top_k(int): Number of tokens considered for sampling
* min_p(float): Minimum probability relative to the maximum probability for a token to be considered (>0 filters low-probability tokens to improve quality)
* sampling_threshold(float): Minimum absolute probability threshold for a token to be considered (>0 filters low-probability tokens by absolute value to improve generation quality). Range [0.0, 1.0).
* max_tokens(int): Maximum generated tokens (input + output)
* min_tokens(int): Minimum forced generation length
* bad_words(list[str]): Prohibited words
Expand Down
6 changes: 6 additions & 0 deletions docs/online_serving/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -157,6 +157,9 @@ top_k: Optional[int] = None
min_p: Optional[float] = None
# Min-p sampling threshold, only retaining tokens whose probability is greater than or equal to max_prob * min_p (default None means disabled).

sampling_threshold: Optional[float] = None
# Absolute probability filtering threshold, only retaining tokens with probability greater than or equal to sampling_threshold (default None means disabled). Range [0.0, 1.0).

min_tokens: Optional[int] = None
# Forces a minimum number of tokens to be generated, avoiding premature truncation (default None means no limit).

Expand Down Expand Up @@ -434,6 +437,9 @@ top_k: Optional[int] = None
min_p: Optional[float] = None
# Min-p sampling threshold, only retaining tokens whose probability is greater than or equal to max_prob * min_p (default None means disabled).

sampling_threshold: Optional[float] = None
# Absolute probability filtering threshold, only retaining tokens with probability greater than or equal to sampling_threshold (default None means disabled). Range [0.0, 1.0).

min_tokens: Optional[int] = None
# Forces a minimum number of tokens to be generated, avoiding premature truncation (default None means no limit).

Expand Down
50 changes: 50 additions & 0 deletions docs/zh/features/sampling.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,11 @@
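* Min-p sampling first calculates `pivot=max_prob * min_p`, then retains only tokens with probabilities greater than the `pivot` (setting others to zero) for subsequent sampling.
* It filters out tokens with relatively low probabilities, sampling only from high-probability tokens to improve generation quality.

4. Sampling Threshold

* Only tokens with probability greater than or equal to `sampling_threshold` are retained (others are set to zero) for subsequent sampling.
* It filters out tokens whose absolute probability is too low, sampling only from high-probability tokens to improve generation quality.

## Usage Instructions

During deployment, you can choose the sampling algorithm by setting the environment variable `FD_SAMPLING_CLASS`. Available values are `base`, `base_non_truncated`, `air`, or `rejection`.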
Expand Down Expand Up @@ -214,13 +219,58 @@ for chunk in response:
print('\n')
```

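### Setting Sampling Threshold

To apply a sampling threshold after `min_p` filtering and before `top_p` or `top_k_top_p` sampling, specify the following parameters when sending a request:

* Example request with curl: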

```bash
curl -X POST "http://0.0.0.0:9222/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "How old are you"}
    ],
    "min_p": 0.1,
    "sampling_threshold": 0.00002,
    "top_p": 0.8,
    "top_k": 20
  }'
```

* Example request with Python:

```python
import openai

host = "0.0.0.0"
port = "8170"
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")

response = client.chat.completions.create(
    model="null",
    messages=[
        {"role": "system", "content": "I'm a helpful AI assistant."},
        {"role": "user", "content": "Rewrite Li Bai's 'Quiet Night Thoughts' as a modern poem"},
    ],
    stream=True,
    top_p=0.8,
    extra_body={"top_k": 20, "min_p": 0.1, "sampling_threshold": 0.00002},
)
for chunk in response:
    if chunk.choices[0].delta:
        print(chunk.choices[0].delta.content, end='')
print('\n')
```

With the above configurations, you can flexibly choose and use the appropriate sampling strategy according to the needs of specific generation tasks.

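## Parameter Description

* `top_p`: Cumulative probability truncation threshold; only the set of most probable tokens whose cumulative probability reaches this threshold is considered. It is a float type, with a range of [0.0, 1.0]. When top_p=1.0, all tokens are considered; when top_p=0.0, it degenerates to greedy search.
* `top_k`: Number of highest-probability tokens to sample from; sampling is restricted to the k most probable tokens. It is an int type, with a range of [0, vocab_size].
* `min_p`: Low probability filtering threshold, considering only the token set with probability greater than or equal to (`max_prob*min_p`). It is a float type, with a range of [0.0, 1.0].
* `sampling_threshold`: Absolute probability filtering threshold, considering only the token set with probability greater than or equal to `sampling_threshold`. It is a float type, with a range of [0.0, 1.0).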

# Bad Words

Expand Down
1 change: 1 addition & 0 deletions docs/zh/offline_inference.md
Original file line number Diff line number Diff line change
Expand Up @@ -185,6 +185,7 @@ for output in outputs:
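* top_p(float): Cumulative probability truncation threshold; only the set of most probable tokens whose cumulative probability reaches this threshold is considered
* top_k(int): Number of highest-probability tokens; only the k most probable tokens are considered for sampling
* min_p(float): Minimum probability threshold for a token to be considered, relative to the probability of the most likely token (>0 filters low-probability tokens to improve generation quality)
* sampling_threshold(float): Minimum probability threshold for a token to be considered, as an absolute probability value (>0 filters low-probability tokens to improve generation quality)
* max_tokens(int): Maximum number of tokens the model may generate (including input and output)
* min_tokens(int): Minimum number of tokens the model is forced to generate, avoiding premature termination
* bad_words(list[str]): List of prohibited words, preventing the model from generating unwanted words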
Expand Down
6 changes: 6 additions & 0 deletions docs/zh/online_serving/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -155,6 +155,9 @@ top_k: Optional[int] = None
min_p: Optional[float] = None
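# Min-p sampling threshold, only retaining tokens whose probability is greater than or equal to max_prob * min_p (default None means disabled).

sampling_threshold: Optional[float] = None
# Sampling threshold, only retaining tokens with probability greater than or equal to sampling_threshold (default None means disabled). Range [0.0, 1.0).

min_tokens: Optional[int] = None
# Forces a minimum number of tokens to be generated, avoiding premature truncation (default None means no limit).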

Expand Down Expand Up @@ -425,6 +428,9 @@ top_k: Optional[int] = None
min_p: Optional[float] = None
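# Min-p sampling threshold, only retaining tokens whose probability is greater than or equal to max_prob * min_p (default None means disabled).

sampling_threshold: Optional[float] = None
# Sampling threshold, only retaining tokens with probability greater than or equal to sampling_threshold (default None means disabled). Range [0.0, 1.0).

min_tokens: Optional[int] = None
# Forces a minimum number of tokens to be generated, avoiding premature truncation (default None means no limit).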

Expand Down
12 changes: 12 additions & 0 deletions fastdeploy/engine/sampling_params.py
Original file line number Diff line number Diff line change
Expand Up @@ -94,6 +94,7 @@ class SamplingParams:
top_p: float = None
top_k: int = 0
min_p: float = 0.0
sampling_threshold: float = 0.0
seed: Optional[int] = None
stop: Optional[Union[str, List[str]]] = None
stop_token_ids: Optional[List[int]] = None
Expand Down Expand Up @@ -160,6 +161,11 @@ def from_generic_request(cls, req: T) -> SamplingParams:
top_p=getattr(req, "top_p", None) if getattr(req, "top_p", None) is not None else cls.top_p,
top_k=getattr(req, "top_k", None) if getattr(req, "top_k", None) is not None else cls.top_k,
min_p=getattr(req, "min_p", None) if getattr(req, "min_p", None) is not None else cls.min_p,
sampling_threshold=(
getattr(req, "sampling_threshold", None)
if getattr(req, "sampling_threshold", None) is not None
else cls.sampling_threshold
),
seed=getattr(req, "seed", None) if getattr(req, "seed", None) is not None else cls.seed,
stop=getattr(req, "stop", None) if getattr(req, "stop", None) is not None else cls.stop,
stop_token_ids=(
Expand Down Expand Up @@ -234,6 +240,7 @@ def from_optional(
top_p,
top_k,
min_p,
sampling_threshold=None,
seed=None,
stop=None,
stop_token_ids=None,
Expand All @@ -259,6 +266,7 @@ def from_optional(
top_p=top_p,
top_k=top_k if top_k is not None else 0,
min_p=min_p if min_p is not None else 0.0,
sampling_threshold=sampling_threshold if sampling_threshold is not None else 0.0,
seed=seed,
stop=stop,
stop_token_ids=stop_token_ids,
Expand Down Expand Up @@ -305,6 +313,10 @@ def _verify_args(self) -> None:
raise TypeError(f"top_k must be an integer, got {type(self.top_k).__name__}")
if not 0.0 <= self.min_p <= 1.0:
raise ValueError(f"min_p must be in [0,1], got {self.min_p}")
if not isinstance(self.sampling_threshold, float):
raise TypeError(f"sampling_threshold must be a float, got {type(self.sampling_threshold).__name__}")
if not 0.0 <= self.sampling_threshold < 1.0:
raise ValueError(f"sampling_threshold must be in [0.0, 1.0), got {self.sampling_threshold}.")

if self.max_tokens is not None and self.max_tokens < 1:
raise ValueError(f"max_tokens must be at least 1, got {self.max_tokens}.")
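
The two `sampling_threshold` checks in this hunk can be exercised in isolation. The sketch below copies them into a hypothetical standalone function (`verify_sampling_threshold` and `outcome` are illustrative names, not part of FastDeploy) to show the half-open range and the strict float check. Whether an integer can actually reach this check depends on upstream conversion that the diff does not show.

```python
def verify_sampling_threshold(value):
    """Hypothetical standalone copy of the two checks added to _verify_args."""
    if not isinstance(value, float):
        raise TypeError(f"sampling_threshold must be a float, got {type(value).__name__}")
    if not 0.0 <= value < 1.0:
        raise ValueError(f"sampling_threshold must be in [0.0, 1.0), got {value}.")
    return value


def outcome(value):
    """Return 'ok' or the name of the exception raised for this value."""
    try:
        verify_sampling_threshold(value)
        return "ok"
    except (TypeError, ValueError) as exc:
        return type(exc).__name__


# 1.0 is excluded by the half-open range; the integer 0 fails the float check.
for candidate in (0.0, 0.00002, 1.0, -0.1, 0):
    print(repr(candidate), outcome(candidate))
```

Note that `isinstance(0, float)` is `False` in Python, so an integer zero is rejected by the type check even though `0.0` is the documented default.
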
Expand Down
2 changes: 2 additions & 0 deletions fastdeploy/entrypoints/openai/protocol.py
Original file line number Diff line number Diff line change
Expand Up @@ -524,6 +524,7 @@ class CompletionRequest(BaseModel):
# doc: begin-completion-sampling-params
top_k: Optional[int] = None
min_p: Optional[float] = None
sampling_threshold: Optional[float] = Field(default=None, ge=0.0, lt=1.0)
repetition_penalty: Optional[float] = None
stop_token_ids: Optional[List[int]] = Field(default_factory=list)
min_tokens: Optional[int] = None
Expand Down Expand Up @@ -703,6 +704,7 @@ class ChatCompletionRequest(BaseModel):
# doc: begin-chat-completion-sampling-params
top_k: Optional[int] = None
min_p: Optional[float] = None
sampling_threshold: Optional[float] = Field(default=None, ge=0.0, lt=1.0)
min_tokens: Optional[int] = None
include_stop_str_in_output: Optional[bool] = False
bad_words: Optional[List[str]] = None
Expand Down
1 change: 1 addition & 0 deletions fastdeploy/model_executor/layers/sample/meta_data.py
Original file line number Diff line number Diff line change
Expand Up @@ -48,6 +48,7 @@ class SamplingMetadata:
top_k_list: Optional[list] = None
min_p: Optional[paddle.Tensor] = None
min_p_list: Optional[list] = None
sampling_threshold: Optional[paddle.Tensor] = None
seed: Optional[paddle.Tensor] = None
max_num_logprobs: Optional[int] = None
enable_early_stop: Optional[int] = False
Expand Down
11 changes: 10 additions & 1 deletion fastdeploy/model_executor/layers/sample/sampler.py
Original file line number Diff line number Diff line change
Expand Up @@ -646,6 +646,7 @@ def forward_cuda(
sampling_metadata.top_p,
sampling_metadata.top_k,
sampling_metadata.top_k_list,
threshold=sampling_metadata.sampling_threshold,

topp_seed=sampling_metadata.seed,
)

Expand Down Expand Up @@ -926,7 +927,12 @@ def _verify_and_sample(
increment_value,
)
_, target_tokens = top_k_top_p_sampling(
probs, top_p=top_p, top_k=top_k, top_k_list=sampling_metadata.top_k_list, topp_seed=topp_seed
probs,
top_p=top_p,
top_k=top_k,
top_k_list=sampling_metadata.top_k_list,
threshold=sampling_metadata.sampling_threshold,
topp_seed=topp_seed,
)
elif self.verify_strategy == VerifyStrategy.GREEDY:
# GREEDY: deterministic argmax in target_tokens, no candidates needed
Expand Down Expand Up @@ -1021,6 +1027,7 @@ def _normal_sample(
sampling_metadata.top_p,
sampling_metadata.top_k,
sampling_metadata.top_k_list,
threshold=sampling_metadata.sampling_threshold,
topp_seed=sampling_metadata.seed,
)

Expand Down Expand Up @@ -1193,6 +1200,7 @@ def _normal_sample_xpu(
top_p=top_p,
top_k=top_k,
top_k_list=sampling_metadata.top_k_list,
threshold=sampling_metadata.sampling_threshold,
topp_seed=topp_seed,
)
real_bsz = share_inputs["seq_lens_this_time"].shape[0]
Expand Down Expand Up @@ -1238,6 +1246,7 @@ def _verify_and_sample_xpu(
top_p=top_p,
top_k=top_k,
top_k_list=sampling_metadata.top_k_list,
threshold=sampling_metadata.sampling_threshold,
topp_seed=topp_seed,
)
elif self.verify_strategy == VerifyStrategy.GREEDY:
Expand Down
10 changes: 10 additions & 0 deletions fastdeploy/worker/gpu_model_runner.py
Original file line number Diff line number Diff line change
Expand Up @@ -1045,10 +1045,14 @@ def insert_tasks_v1(self, req_dicts: BatchRequest, num_running_requests: int = N
assert len(request.eos_token_ids) == self.model_config.eos_tokens_lens
self.share_inputs["min_p_list"][idx] = request.get("min_p", 0.0)
self.share_inputs["top_k_list"][idx] = request.get("top_k", 0)
self.share_inputs["sampling_threshold_list"][idx] = request.get("sampling_threshold", 0.0)
async_set_value(self.share_inputs["eos_token_id"][:], request.eos_token_ids)
async_set_value(self.share_inputs["top_p"][idx : idx + 1], request.get("top_p", 0.7))
async_set_value(self.share_inputs["top_k"][idx : idx + 1], request.get("top_k", 0))
async_set_value(self.share_inputs["min_p"][idx : idx + 1], request.get("min_p", 0.0))
async_set_value(
self.share_inputs["sampling_threshold"][idx : idx + 1], request.get("sampling_threshold", 0.0)
)
async_set_value(self.share_inputs["temperature"][idx : idx + 1], request.get("temperature", 0.95))
async_set_value(self.share_inputs["penalty_score"][idx : idx + 1], request.get("repetition_penalty", 1.0))
async_set_value(self.share_inputs["frequency_score"][idx : idx + 1], request.get("frequency_penalty", 0.0))
Expand Down Expand Up @@ -1345,6 +1349,11 @@ def _prepare_inputs(self, cached_token_num=-1, cached_real_bsz=-1, is_dummy_or_p
self.initialize_forward_meta(is_dummy_or_profile_run=is_dummy_or_profile_run)
self.forward_meta.real_bsz = real_bsz

sampling_threshold_list = self.share_inputs["sampling_threshold_list"]
sampling_threshold_tensor = (
self.share_inputs["sampling_threshold"] if any(v > 0.0 for v in sampling_threshold_list) else None
)

# Get sampling metadata
self.sampling_metadata = SamplingMetadata(
temperature=self.share_inputs["temperature"],
Expand All @@ -1353,6 +1362,7 @@ def _prepare_inputs(self, cached_token_num=-1, cached_real_bsz=-1, is_dummy_or_p
top_k_list=self.share_inputs["top_k_list"],
min_p=self.share_inputs["min_p"],
min_p_list=self.share_inputs["min_p_list"],
sampling_threshold=sampling_threshold_tensor,

seed=self.share_inputs["infer_seed"],
step_idx=self.share_inputs["step_idx"],
token_ids_all=self.share_inputs["token_ids_all"],
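
The `_prepare_inputs` hunk above hands the threshold tensor to `SamplingMetadata` only when at least one slot in `sampling_threshold_list` is positive, and `None` otherwise. A minimal sketch of that selection follows, with a nested Python list standing in for the `[max_num_seqs, 1]` paddle tensor; the helper name `select_threshold_input` is hypothetical. What the sampler does when it receives `None` is not shown in this diff; presumably it skips the threshold step.

```python
def select_threshold_input(threshold_tensor, threshold_list):
    """Return the tensor only if some request enabled the threshold, else None."""
    return threshold_tensor if any(v > 0.0 for v in threshold_list) else None


# Stand-ins for share_inputs["sampling_threshold"] and its Python-side list mirror.
threshold_tensor = [[0.0], [0.0], [0.0]]
threshold_list = [0.0, 0.0, 0.0]

# No request set a threshold: nothing is passed to the sampler.
disabled = select_threshold_input(threshold_tensor, threshold_list)

# One request sets sampling_threshold=0.00002: both the tensor and the list are updated.
threshold_tensor[1][0] = 0.00002
threshold_list[1] = 0.00002
enabled = select_threshold_input(threshold_tensor, threshold_list)

print(disabled is None, enabled is threshold_tensor)
```

The check only reads the plain Python list, which is why the tensor and the list have to be kept in sync wherever slots are filled, swapped, or reset.
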
Expand Down
9 changes: 9 additions & 0 deletions fastdeploy/worker/input_batch.py
Original file line number Diff line number Diff line change
Expand Up @@ -127,6 +127,8 @@ def init_share_inputs(self):
self.top_k_list = [0] * max_num_seqs
self.min_p = paddle.full([max_num_seqs, 1], 0.0, dtype="float32")
self.min_p_list = [0.0] * max_num_seqs
self.sampling_threshold = paddle.full([max_num_seqs, 1], 0.0, dtype="float32")
self.sampling_threshold_list = [0.0] * max_num_seqs
self.temperature = paddle.full([max_num_seqs, 1], self.model_config.temperature, dtype="float32")
self.penalty_score = paddle.full([max_num_seqs, 1], self.model_config.penalty_score, dtype="float32")
self.frequency_score = paddle.full(
Expand Down Expand Up @@ -394,6 +396,7 @@ def swap_data(tensor, idx1, idx2):
swap_data(self.token_ids_all, i1, i2)
swap_data(self.input_ids, i1, i2)
swap_data(self.top_p, i1, i2)
swap_data(self.sampling_threshold, i1, i2)
swap_data(self.top_k, i1, i2)
swap_data(self.min_p, i1, i2)
swap_data(self.temperature, i1, i2)
Expand All @@ -419,6 +422,10 @@ def swap_data(tensor, idx1, idx2):
# # Swap list-based arrays (lists don't need clone)
self.top_k_list[i1], self.top_k_list[i2] = self.top_k_list[i2], self.top_k_list[i1]
self.min_p_list[i1], self.min_p_list[i2] = self.min_p_list[i2], self.min_p_list[i1]
self.sampling_threshold_list[i1], self.sampling_threshold_list[i2] = (
self.sampling_threshold_list[i2],
self.sampling_threshold_list[i1],
)

# Swap 1D arrays
swap_data(self.bad_tokens, i1, i2)
Expand Down Expand Up @@ -563,6 +570,7 @@ def reset_share_inputs(self):
fill_paddle_tensor(self, "top_p", self.model_config.top_p)
fill_paddle_tensor(self, "top_k", 0)
fill_paddle_tensor(self, "min_p", 0.0)
fill_paddle_tensor(self, "sampling_threshold", 0.0)
fill_paddle_tensor(self, "temperature", self.model_config.temperature)
fill_paddle_tensor(self, "penalty_score", self.model_config.penalty_score)
fill_paddle_tensor(self, "frequency_score", self.model_config.frequency_score)
Expand All @@ -573,6 +581,7 @@ def reset_share_inputs(self):
# Reset list variables (not paddle tensors)
self.top_k_list = [0] * max_num_seqs
self.min_p_list = [0.0] * max_num_seqs
self.sampling_threshold_list = [0.0] * max_num_seqs

fill_paddle_tensor(self, "min_dec_len", self.model_config.min_length)
fill_paddle_tensor(self, "max_dec_len", self.model_config.max_model_len)
Expand Down