Commit 42c66a7

[Feature] Add server-level token limits and prompt truncation control (#7842)
* add length control parameters
* rename parameters
* update parameter validation
* add docs
* fix default value
* fix review
* fix error messages
* add truncate_prompt_tokens
* fix unit test
* fix review
* add unit test & fix review
* update processor
* remove test_processor.py
* fix doc
* fix
* fix unit test
* fix unit test
* remove truncate_prompt_tokens
* remove truncate_prompt_tokens
* fix review
* fix unit test
* fix review

---------

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
1 parent 4474188 commit 42c66a7

19 files changed

Lines changed: 832 additions & 27 deletions

docs/parameters.md

Lines changed: 5 additions & 0 deletions
@@ -15,6 +15,11 @@ When using FastDeploy to deploy models (including offline inference and service
 | ```engine_worker_queue_port``` | `list[int]` | FastDeploy internal engine communication port list, auto-allocated based on data_parallel_size |
 | ```cache_queue_port``` | `list[int]` | FastDeploy internal KVCache process communication port list, auto-allocated based on data_parallel_size |
 | ```max_model_len``` | `int` | Default maximum supported context length for inference, default: 2048 |
+| ```max_completion_tokens``` | `int` | Server-level maximum allowed completion token length (hard cap). Per-request max_tokens will be clamped to this value. Default: None (bounded by max_model_len - input_len) |
+| ```reasoning_max_tokens``` | `int` | Server-level maximum allowed reasoning/thinking token length (hard cap). Per-request value will be clamped to this value. Default: None (no cap) |
+| ```response_max_tokens``` | `int` | Server-level maximum allowed response token length (hard cap). Per-request value will be clamped to this value. Default: None (no cap) |
+| ```min_completion_tokens``` | `int` | Server-level minimum generation length floor. Effective min_tokens = max(server_value, per-request value). Default: None (no floor) |
+| ```input_max_tokens``` | `int` | Server-level maximum input token length. Requests with prompt longer than this will be rejected. Default: None (no limit, bounded by max_model_len) |
 | ```tensor_parallel_size``` | `int` | Default tensor parallelism degree for model, default: 1 |
 | ```data_parallel_size``` | `int` | Default data parallelism degree for model, default: 1 |
 | ```block_size``` | `int` | KVCache management granularity (Token count), recommended default: 64 |
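
The clamping rule described in the `max_completion_tokens` row can be illustrated with a short sketch. The helper name `resolve_max_tokens` and its arguments are hypothetical and chosen only for illustration; the sketch mirrors the documented behaviour (request value clamped to the server cap, everything bounded by `max_model_len - input_len`) and is not FastDeploy's own code path.

```python
from typing import Optional


def resolve_max_tokens(
    requested: Optional[int],
    server_cap: Optional[int],
    max_model_len: int,
    input_len: int,
) -> int:
    """Return the completion budget after applying the server-level cap.

    Hypothetical helper that only illustrates the documented rule.
    """
    # Room left in the context window once the prompt is accounted for.
    context_remaining = max(1, max_model_len - input_len)
    candidates = [context_remaining]
    if server_cap is not None:
        candidates.append(server_cap)
    if requested is not None:
        candidates.append(requested)
    # The tightest of the available bounds wins.
    return min(candidates)
```

With `max_model_len=2048` and a 100-token prompt, a request asking for 4096 tokens under a server cap of 1024 would end up with 1024, while a request with no `max_tokens` and no server cap would get the remaining 1948.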

docs/zh/parameters.md

Lines changed: 5 additions & 0 deletions
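@@ -13,6 +13,11 @@
 | ```engine_worker_queue_port``` | `list[int]` | FastDeploy internal engine process communication port list, auto-allocated based on data_parallel_size |
 | ```cache_queue_port``` | `list[int]` | FastDeploy internal KVCache process communication port list, auto-allocated based on data_parallel_size |
 | ```max_model_len``` | `int` | Default maximum supported context length for inference, default: 2048 |
+| ```max_completion_tokens``` | `int` | Server-level hard cap on the number of generated tokens. The request's max_tokens is clamped to this value. Default: None (bounded by max_model_len - input_len) |
+| ```reasoning_max_tokens``` | `int` | Server-level hard cap on reasoning/thinking tokens. The request's reasoning_max_tokens is clamped to this value. Default: None (no limit) |
+| ```response_max_tokens``` | `int` | Server-level hard cap on response tokens. The request's response_max_tokens is clamped to this value. Default: None (no limit) |
+| ```min_completion_tokens``` | `int` | Server-level minimum generation length floor. Effective min_tokens = max(server value, request value); requests cannot go below this floor. Default: None (no limit) |
+| ```input_max_tokens``` | `int` | Server-level cap on input tokens. Requests exceeding this value are rejected. Default: None (no limit, bounded by max_model_len) |
 | ```tensor_parallel_size``` | `int` | Default tensor parallelism degree for the model, default: 1 |
 | ```data_parallel_size``` | `int` | Default data parallelism degree for the model, default: 1 |
 | ```block_size``` | `int` | KVCache management granularity (token count), recommended default: 64 |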

fastdeploy/config.py

Lines changed: 55 additions & 0 deletions
@@ -1886,6 +1886,59 @@ def __str__(self):
         return self.to_json_string()
 
 
+class ServingLimitsConfig:
+    """Server-level request length limits and policies."""
+
+    def __init__(self, args):
+        self.max_completion_tokens = None
+        self.reasoning_max_tokens = None
+        self.response_max_tokens = None
+        self.min_completion_tokens = None
+        self.input_max_tokens = None
+
+        for key, value in args.items():
+            if hasattr(self, key) and value != "None":
+                setattr(self, key, value)
+
+    def validate(self, max_model_len):
+        """Validate serving limits against max_model_len at startup."""
+        for name in ("max_completion_tokens", "input_max_tokens", "response_max_tokens"):
+            value = getattr(self, name)
+            if value is not None and value <= 0:
+                flag = name.replace("_", "-")
+                raise ValueError(f"--{flag} ({value}) must be greater than 0.")
+
+        for name in ("reasoning_max_tokens", "min_completion_tokens"):
+            value = getattr(self, name)
+            if value is not None and value < 0:
+                flag = name.replace("_", "-")
+                raise ValueError(f"--{flag} ({value}) must be greater than or equal to 0.")
+
+        if self.min_completion_tokens is not None:
+            if self.min_completion_tokens >= max_model_len:
+                raise ValueError(
+                    f"--min-completion-tokens ({self.min_completion_tokens}) must be less than "
+                    f"--max-model-len ({max_model_len}). All requests would be rejected."
+                )
+            if self.max_completion_tokens is not None and self.min_completion_tokens > self.max_completion_tokens:
+                raise ValueError(
+                    f"--min-completion-tokens ({self.min_completion_tokens}) must not exceed "
+                    f"--max-completion-tokens ({self.max_completion_tokens})."
+                )
+
+        if self.max_completion_tokens is not None and self.max_completion_tokens > max_model_len:
+            logger.warning(
+                f"--max-completion-tokens ({self.max_completion_tokens}) > "
+                f"--max-model-len ({max_model_len}), it will have no effect."
+            )
+
+        if self.input_max_tokens is not None and self.input_max_tokens > max_model_len:
+            logger.warning(
+                f"--input-max-tokens ({self.input_max_tokens}) > "
+                f"--max-model-len ({max_model_len}), it will have no effect."
+            )
+
+
 class BenchmarkMetricsConfig:
     """Configuration for in-process benchmark metrics logger.
@@ -1981,6 +2034,7 @@ def __init__(
         routing_replay_config: Optional[RoutingReplayConfig] = None,
         benchmark_metrics_config=None,
         deploy_modality: DeployModality = DeployModality.MIXED,
+        serving_limits_config: ServingLimitsConfig = None,  # resolved below
     ):
         self.model_config: ModelConfig = model_config  # type: ignore
         self.cache_config: CacheConfig = cache_config  # type: ignore
@@ -1999,6 +2053,7 @@ def __init__(
         self.routing_replay_config = routing_replay_config
         self.benchmark_metrics_config = benchmark_metrics_config
         self.deploy_modality: DeployModality = deploy_modality
+        self.serving_limits_config: ServingLimitsConfig = serving_limits_config or ServingLimitsConfig({})
         # Initialize cuda graph capture list
         max_capture_shape = self.scheduler_config.max_num_seqs
         if self.graph_opt_config.cudagraph_only_prefill:
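
The startup validation in `ServingLimitsConfig.validate` separates contradictory settings, which raise `ValueError`, from settings that merely have no effect, which are only logged. A minimal stand-alone sketch of that split follows. The function `check_limits` is hypothetical, covers only three of the five limits, and returns its warnings as a list instead of calling a logger so the behaviour is easy to inspect.

```python
from typing import List, Optional


def check_limits(
    max_model_len: int,
    max_completion_tokens: Optional[int] = None,
    min_completion_tokens: Optional[int] = None,
    input_max_tokens: Optional[int] = None,
) -> List[str]:
    """Raise on contradictory limits; return warnings for ineffective ones.

    Hypothetical helper modelled on the rules in the diff above.
    """
    warnings: List[str] = []
    caps = (
        ("max_completion_tokens", max_completion_tokens),
        ("input_max_tokens", input_max_tokens),
    )
    for name, value in caps:
        if value is None:
            continue
        if value <= 0:
            raise ValueError(f"{name} ({value}) must be greater than 0.")
        if value > max_model_len:
            # A cap above the context length can never be reached.
            warnings.append(f"{name} ({value}) > max_model_len ({max_model_len}), it will have no effect.")

    if min_completion_tokens is not None:
        if min_completion_tokens < 0:
            raise ValueError("min_completion_tokens must be greater than or equal to 0.")
        if min_completion_tokens >= max_model_len:
            raise ValueError("min_completion_tokens must be less than max_model_len.")
        if max_completion_tokens is not None and min_completion_tokens > max_completion_tokens:
            raise ValueError("min_completion_tokens must not exceed max_completion_tokens.")
    return warnings
```

Failing fast on a floor that exceeds the cap avoids a server that starts successfully but rejects every request at runtime.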

fastdeploy/engine/args_utils.py

Lines changed: 68 additions & 0 deletions
@@ -38,6 +38,7 @@
     RouterConfig,
     RoutingReplayConfig,
     RunnerOption,
+    ServingLimitsConfig,
     SpeculativeConfig,
     StructuredOutputsConfig,
     TaskOption,
@@ -111,6 +112,33 @@ class EngineArgs:
     """
     Maximum context length supported by the model.
     """
+    max_completion_tokens: Optional[int] = None
+    """
+    Server-level maximum allowed completion token length (hard cap).
+    Per-request max_tokens will be clamped to this value. None means no server-level cap
+    (bounded by max_model_len - input_len).
+    """
+    reasoning_max_tokens: Optional[int] = None
+    """
+    Server-level maximum allowed reasoning/thinking token length (hard cap).
+    Per-request reasoning_max_tokens will be clamped to this value. None means no server-level cap.
+    """
+    response_max_tokens: Optional[int] = None
+    """
+    Server-level maximum allowed response token length (hard cap).
+    Per-request response_max_tokens will be clamped to this value. None means no server-level cap.
+    """
+    min_completion_tokens: Optional[int] = None
+    """
+    Server-level minimum generation length floor.
+    Effective min_tokens = max(server_value, per-request value). Requests cannot set min_tokens
+    below this floor. None means no server-level floor.
+    """
+    input_max_tokens: Optional[int] = None
+    """
+    Server-level maximum input token length.
+    Requests with prompt longer than this will be rejected. None means no limit (bounded by max_model_len).
+    """
     tensor_parallel_size: int = 1
     """
     Degree of tensor parallelism.
@@ -768,6 +796,43 @@ def add_cli_args(parser: FlexibleArgumentParser) -> FlexibleArgumentParser:
             default=EngineArgs.max_model_len,
             help="Maximum context length supported by the model.",
         )
+        model_group.add_argument(
+            "--max-completion-tokens",
+            type=int,
+            default=EngineArgs.max_completion_tokens,
+            help="Server-level maximum allowed completion token length (hard cap). "
+            "Per-request max_tokens will be clamped to this value. "
+            "Default: None (bounded by max_model_len - input_len).",
+        )
+        model_group.add_argument(
+            "--reasoning-max-tokens",
+            type=int,
+            default=EngineArgs.reasoning_max_tokens,
+            help="Server-level maximum allowed reasoning/thinking token length (hard cap). "
+            "Per-request reasoning_max_tokens will be clamped to this value. Default: None (no cap).",
+        )
+        model_group.add_argument(
+            "--response-max-tokens",
+            type=int,
+            default=EngineArgs.response_max_tokens,
+            help="Server-level maximum allowed response token length (hard cap). "
+            "Per-request response_max_tokens will be clamped to this value. Default: None (no cap).",
+        )
+        model_group.add_argument(
+            "--min-completion-tokens",
+            type=int,
+            default=EngineArgs.min_completion_tokens,
+            help="Server-level minimum generation length floor. "
+            "Effective min_tokens = max(server_value, per-request value). Default: None (no floor).",
+        )
+        model_group.add_argument(
+            "--input-max-tokens",
+            type=int,
+            default=EngineArgs.input_max_tokens,
+            help="Server-level maximum input token length. "
+            "Requests with prompt longer than this will be rejected. "
+            "Default: None (no limit, bounded by max_model_len).",
+        )
         model_group.add_argument(
             "--block-size",
             type=int,
@@ -1577,6 +1642,8 @@ def create_engine_config(self) -> FDConfig:
         cache_cfg = CacheConfig(all_dict)
         load_cfg = LoadConfig(all_dict)
         parallel_cfg = ParallelConfig(all_dict)
+        serving_limits_cfg = ServingLimitsConfig(all_dict)
+        serving_limits_cfg.validate(model_cfg.max_model_len)
         scheduler_cfg = self.create_scheduler_config()
         graph_opt_cfg = self.create_graph_optimization_config()
         plas_attention_config = self.create_plas_attention_config()
@@ -1613,4 +1680,5 @@ def create_engine_config(self) -> FDConfig:
             routing_replay_config=routing_replay_config,
             benchmark_metrics_config=benchmark_metrics_cfg,
             deploy_modality=DeployModality.from_str(self.deploy_modality),
+            serving_limits_config=serving_limits_cfg,
         )
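
The new flags use dashes on the command line while `ServingLimitsConfig` looks up underscore attribute names. The sketch below shows that mapping with the standard library's `argparse.ArgumentParser`; it assumes `FlexibleArgumentParser` follows the same `dest` convention for these options, which the diff does not show. The argument list passed to `parse_args` is made up for illustration.

```python
import argparse

# Stand-in parser; FastDeploy's FlexibleArgumentParser is assumed to behave
# like argparse.ArgumentParser for these options.
parser = argparse.ArgumentParser()
parser.add_argument("--max-completion-tokens", type=int, default=None)
parser.add_argument("--min-completion-tokens", type=int, default=None)
parser.add_argument("--input-max-tokens", type=int, default=None)

# argparse derives the attribute name by replacing dashes with underscores.
args = parser.parse_args(["--max-completion-tokens", "1024", "--input-max-tokens", "8192"])
limits = vars(args)
```

Here `limits` maps `max_completion_tokens` to 1024, `input_max_tokens` to 8192, and leaves `min_completion_tokens` as `None`, which is the shape of mapping that `ServingLimitsConfig.__init__` copies its attributes from.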

fastdeploy/engine/async_llm.py

Lines changed: 1 addition & 0 deletions
@@ -299,6 +299,7 @@ def __init__(self, cfg, pid):
         )
         # Create data processor
         self.data_processor = self.input_processor.create_processor()
+        self.data_processor.set_server_defaults(cfg.serving_limits_config)
 
         # Create high-performance async connection manager
         self.connection_manager = None

fastdeploy/engine/common_engine.py

Lines changed: 1 addition & 0 deletions
@@ -365,6 +365,7 @@ def create_data_processor(self):
             enable_mm_runtime=self.cfg.enable_mm_runtime,
         )
         self.data_processor = self.input_processor.create_processor()
+        self.data_processor.set_server_defaults(self.cfg.serving_limits_config)
         self.mm_max_tokens_per_item = self.data_processor.get_mm_max_tokens_per_item(
             self.cfg.model_config.max_model_len
         )

fastdeploy/engine/engine.py

Lines changed: 4 additions & 1 deletion
@@ -746,7 +746,10 @@ def _format_and_add_data(self, prompts: dict):
                     prompts["prompt"] = query_list
 
         if "max_tokens" not in prompts:
-            prompts["max_tokens"] = self.cfg.model_config.max_model_len
+            if self.cfg.serving_limits_config.max_completion_tokens is not None:
+                prompts["max_tokens"] = self.cfg.serving_limits_config.max_completion_tokens
+            else:
+                prompts["max_tokens"] = self.cfg.model_config.max_model_len
 
         self.add_requests(prompts)
         return prompts["request_id"]

fastdeploy/entrypoints/engine_client.py

Lines changed: 6 additions & 1 deletion
@@ -104,12 +104,14 @@ def __init__(self, pid: int | str, port: int | str, fd_config: FDConfig, workers
         )
         self.enable_logprob = self.fd_config.model_config.enable_logprob
         self.data_processor = input_processor.create_processor()
+        self.data_processor.set_server_defaults(self.fd_config.serving_limits_config)
         self.ori_vocab_size = (
             len(self.data_processor.tokenizer.sp_model)
             if hasattr(self.data_processor.tokenizer, "sp_model")
             else len(self.data_processor.tokenizer.vocab)
         )
         self.max_model_len = self.fd_config.model_config.max_model_len
+        self.max_completion_tokens = self.fd_config.serving_limits_config.max_completion_tokens
         self.enable_prefix_caching = self.fd_config.cache_config.enable_prefix_caching
         self.enable_cache_transfer = (
             self.fd_config.cache_config.swap_space or self.fd_config.cache_config.kvcache_storage_backend
@@ -297,7 +299,10 @@ async def format_and_add_data(self, request: Request | dict):
             request["request_id"] = request_id
 
         if "max_tokens" not in request:
-            request["max_tokens"] = self.max_model_len - 1
+            if self.max_completion_tokens is not None:
+                request["max_tokens"] = self.max_completion_tokens
+            else:
+                request["max_tokens"] = self.max_model_len - 1
 
         await self.add_requests(request)
         return request["prompt_token_ids"]
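
The fallback in `format_and_add_data` only applies when the request carries no `max_tokens` at all; an explicit value passes through untouched at this stage. The following sketch isolates that precedence. The function name `fill_default_max_tokens` is hypothetical, and it takes the two settings as plain arguments rather than reading them from a client object.

```python
from typing import Optional


def fill_default_max_tokens(
    request: dict,
    max_model_len: int,
    max_completion_tokens: Optional[int],
) -> dict:
    """Fill in max_tokens only when the request omitted it.

    Hypothetical helper mirroring the fallback order in the hunk above.
    """
    if "max_tokens" not in request:
        if max_completion_tokens is not None:
            # The server-level cap doubles as the default.
            request["max_tokens"] = max_completion_tokens
        else:
            # Previous behaviour: leave one token of the context for the prompt.
            request["max_tokens"] = max_model_len - 1
    return request
```

Note that the corresponding fallback in `engine.py` uses `max_model_len` rather than `max_model_len - 1`, so the two entry points differ by one token when no server cap is set.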

fastdeploy/input/base_processor.py

Lines changed: 69 additions & 5 deletions
@@ -110,6 +110,27 @@ def __init__(self, model_name_or_path, tokenizer_type="auto", reasoning_parser_o
         self.tokenizer.pad_token_id = self.pad_token_id
         self._init_parsers(reasoning_parser_obj, tool_parser_obj)
 
+        # Server-level defaults (set via set_server_defaults after construction)
+        self.max_completion_tokens = None
+        self.reasoning_max_tokens = None
+        self.response_max_tokens = None
+        self.min_completion_tokens = None
+        self.input_max_tokens = None
+
+    def set_server_defaults(self, serving_limits_config):
+        """Set server-level default values from serving limits config.
+
+        These defaults are applied in process_request_dict when per-request
+        values are not specified.
+        """
+        if serving_limits_config is None:
+            return
+        self.max_completion_tokens = serving_limits_config.max_completion_tokens
+        self.reasoning_max_tokens = serving_limits_config.reasoning_max_tokens
+        self.response_max_tokens = serving_limits_config.response_max_tokens
+        self.min_completion_tokens = serving_limits_config.min_completion_tokens
+        self.input_max_tokens = serving_limits_config.input_max_tokens
+
     # ------------------------------------------------------------------
     # Abstract interface
     # ------------------------------------------------------------------
@@ -438,20 +459,63 @@ def process_request_dict(self, request, max_model_len=None, **kwargs):
         if request.get("completion_token_ids"):
             request["prompt_token_ids"].extend(request["completion_token_ids"])
 
-        # truncate prompts that exceed the length limit
+        # Reject requests exceeding input_max_tokens
+        if self.input_max_tokens is not None and len(request["prompt_token_ids"]) > self.input_max_tokens:
+            raise ValueError(
+                f"Input token length {len(request['prompt_token_ids'])} exceeds the configured input_max_tokens limit {self.input_max_tokens}"
+            )
+
         if max_model_len is not None and len(request["prompt_token_ids"]) > max_model_len:
-            request["prompt_token_ids"] = request["prompt_token_ids"][: max_model_len - 1]
+            raise ValueError(
+                f"Input token length {len(request['prompt_token_ids'])} exceeds "
+                f"the configured max_model_len {max_model_len}"
+            )
 
         logits_processors_args = self._update_thinking_prompt_state(
             request["prompt_token_ids"], request.get("logits_processors_args") or {}
         )
         request["logits_processors_args"] = logits_processors_args
 
-        max_tokens = max_model_len - len(request["prompt_token_ids"])
+        # Compute effective length limits
+        def _min_non_none(*values):
+            return min(v for v in values if v is not None)
+
+        context_remaining = max(1, max_model_len - len(request["prompt_token_ids"]))
+
         if request.get("max_tokens") is None:
-            request["max_tokens"] = max(1, max_tokens)
+            # User didn't specify: default to min(context_remaining, server_default)
+            if self.max_completion_tokens is not None:
+                request["max_tokens"] = max(1, min(context_remaining, self.max_completion_tokens))
+            else:
+                request["max_tokens"] = context_remaining
         else:
-            request["max_tokens"] = min(max_tokens, request["max_tokens"])
+            # User specified: clamp to min(context_remaining, max_completion_tokens)
+            request["max_tokens"] = _min_non_none(context_remaining, self.max_completion_tokens, request["max_tokens"])
+
+        max_tokens = request["max_tokens"]
+        if self.reasoning_max_tokens is not None or request.get("reasoning_max_tokens") is not None:
+            request["reasoning_max_tokens"] = _min_non_none(
+                max_tokens, self.reasoning_max_tokens, request.get("reasoning_max_tokens")
+            )
+        if self.response_max_tokens is not None or request.get("response_max_tokens") is not None:
+            request["response_max_tokens"] = _min_non_none(
+                max_tokens, self.response_max_tokens, request.get("response_max_tokens")
+            )
+
+        # min_tokens: take the larger of server-level and user value, reject if > max_tokens
+        server_min = self.min_completion_tokens
+        user_min = request.get("min_tokens")
+        if server_min is None:
+            effective_min = user_min
+        elif user_min is None:
+            effective_min = server_min
+        else:
+            effective_min = max(server_min, user_min)
+        if effective_min is not None:
+            if effective_min > max_tokens:
+                raise ValueError(f"min_tokens ({effective_min}) must not exceed max_tokens ({max_tokens})")
+            request["min_tokens"] = effective_min
+
         if request.get("temperature") < _SAMPLING_EPS:
             # zero temperature means greedy decoding: set top_k=1 to force argmax
             request["temperature"] = 1
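
Two rules in `process_request_dict` pull in opposite directions: caps take the smallest value that is set, while the `min_tokens` floor takes the largest and is rejected if it exceeds the resolved `max_tokens`. The sketch below isolates both. `min_non_none` follows the nested helper in the diff; `effective_min_tokens` is a hypothetical name for the floor logic, restated as a stand-alone function.

```python
from typing import Optional


def min_non_none(*values: Optional[int]) -> int:
    """Smallest of the values that are set (at least one must not be None)."""
    return min(v for v in values if v is not None)


def effective_min_tokens(
    server_min: Optional[int],
    user_min: Optional[int],
    max_tokens: int,
) -> Optional[int]:
    """Resolve the min_tokens floor, or None when neither side sets one.

    Hypothetical helper restating the floor rule from the hunk above.
    """
    present = [v for v in (server_min, user_min) if v is not None]
    if not present:
        return None
    # The floor is the larger of the server-level and per-request values.
    effective = max(present)
    if effective > max_tokens:
        raise ValueError(f"min_tokens ({effective}) must not exceed max_tokens ({max_tokens})")
    return effective
```

For example, with a resolved `max_tokens` of 512 and a server-level `reasoning_max_tokens` of 2048, `min_non_none(512, 2048, None)` yields 512, so a sub-budget can never exceed the overall completion budget.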
