Commit f821677

Add window_mode
1 parent 2eb10cc commit f821677

6 files changed

Lines changed: 496 additions & 33 deletions

docs/benchmark.md

Lines changed: 97 additions & 0 deletions
@@ -40,3 +40,100 @@ python benchmark_serving.py \
     --max-concurrency 1 \
     --save-result
 ```
+
+## In-Process Benchmark Metrics Logger
+
+FastDeploy provides a built-in performance monitoring module that runs inside the inference process. It collects per-request timing data and computes rolling statistics aligned with `benchmark_serving.py`, writing results to a JSONL file for real-time monitoring and post-hoc analysis.
+
+### Enable
+
+Add `--benchmark-metrics-config` with a JSON string to the service startup command:
+
+```bash
+python -m fastdeploy.entrypoints.openai.api_server \
+    --model baidu/ERNIE-4.5-0.3B-Base-Paddle \
+    --benchmark-metrics-config '{"enable": true}'
+```
+
+### Configuration Parameters
+
+| Parameter | Type | Default | Description |
+| :-------- | :--- | :------ | :---------- |
+| `enable` | bool | `false` | Whether to enable the benchmark metrics logger. Must be set to `true` to activate. |
+| `window_size` | int | `0` | Number of recent requests to aggregate. `0` = cumulative (all requests since start). |
+| `window_mode` | str | `"sliding"` | Window aggregation mode. `"sliding"` = sliding window (keeps last N records, oldest automatically dropped). `"tumbling"` = tumbling window (clears and restarts after every N records). |
+| `percentiles` | str | `"50,90,95,99"` | Comma-separated percentile values to compute. |
+| `metrics` | str | `"all"` | Comma-separated metric names to report, or `"all"` for all metrics. |
+
+### Available Metrics
+
+Metrics are aligned with `benchmark_serving.py --percentile-metrics`:
+
+| Metric Name | Description | Unit |
+| :---------- | :---------- | :--- |
+| `ttft` | Time to First Token (client arrival → first token) | ms |
+| `s_ttft` | Server TTFT (inference start → first token) | ms |
+| `tpot` | Time per Output Token (excluding first token) | ms |
+| `itl` | Inter-token Latency | ms |
+| `e2el` | End-to-end Latency (client arrival → last token) | ms |
+| `s_e2el` | Server E2EL (inference start → last token) | ms |
+| `s_decode` | Decode speed (excluding first token) | tok/s |
+| `input_len` | Prefix cache hit token count ("Cached Tokens") | tokens |
+| `s_input_len` | Infer input length (total prompt tokens) | tokens |
+| `output_len` | Output token length per request | tokens |
+
+In addition, the following throughput metrics are always computed (not user-selectable) when there are 2+ records:
+
+| Metric | Description | Unit |
+| :----- | :---------- | :--- |
+| `request_throughput` | Request throughput | req/s |
+| `output_throughput` | Output token throughput | tok/s |
+| `total_throughput` | Total token throughput (input + output) | tok/s |
+
+### Window Modes
+
+**Sliding Window** (`"sliding"`, default):
+
+The window keeps the most recent N records. When a new record arrives and the window is full, the oldest record is automatically dropped. Each output line reflects the statistics of the latest N requests.
+
+```bash
+--benchmark-metrics-config '{"enable": true, "window_size": 64, "window_mode": "sliding"}'
+```
+
+**Tumbling Window** (`"tumbling"`):
+
+The window accumulates records up to N, then clears and starts fresh. Each output line still reflects the current window's accumulated statistics, but the window resets at every boundary. This is useful for RL training scenarios where each step has a fixed batch size and you want per-step independent analysis.
+
+```bash
+--benchmark-metrics-config '{"enable": true, "window_size": 64, "window_mode": "tumbling"}'
+```
+
+**No Window** (`window_size: 0`):
+
+All completed requests are accumulated. Statistics reflect the entire lifetime of the service.
+
+```bash
+--benchmark-metrics-config '{"enable": true, "window_size": 0}'
+```
+
+### Output
+
+Results are written to `{FD_LOG_DIR}/benchmark_metrics.jsonl` (default: `./log/benchmark_metrics.jsonl`). Each line is a JSON object representing the window statistics at the time of a request completion.
+
+Example output line:
+
+```json
+{
+  "timestamp": "2026-05-14T10:30:05.123",
+  "window_size": 64,
+  "window_mode": "sliding",
+  "completed": 64,
+  "total_input_tokens": 8192,
+  "total_output_tokens": 16384,
+  "request_throughput": 5.2,
+  "output_throughput": 1250.0,
+  "total_throughput": 2500.0,
+  "ttft_ms": {"mean": 45.0, "median": 42.1, "p50": 42.1, "p90": 68.5, "p95": 82.3, "p99": 120.5},
+  "s_decode": {"mean": 67.3, "median": 67.5, "p50": 67.5, "p90": 70.1, "p95": 71.2, "p99": 73.0}
+}
+```
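
The JSON value passed to `--benchmark-metrics-config` is easy to get wrong when quoted by hand. As a minimal sketch, assuming a POSIX shell, the flag could be generated from a Python dict instead. The helper name `build_metrics_flag` is hypothetical and not part of FastDeploy; only the five config keys come from the documentation in this commit.

```python
import json
import shlex


def build_metrics_flag(window_size: int = 0, window_mode: str = "sliding",
                       percentiles: str = "50,90,95,99", metrics: str = "all") -> str:
    """Return the CLI fragment for --benchmark-metrics-config (hypothetical helper)."""
    config = {
        "enable": True,
        "window_size": window_size,
        "window_mode": window_mode,
        "percentiles": percentiles,
        "metrics": metrics,
    }
    # json.dumps emits valid JSON (lowercase true/false, double-quoted keys);
    # shlex.quote wraps the result so a POSIX shell passes it as one argument.
    return "--benchmark-metrics-config " + shlex.quote(json.dumps(config))


flag = build_metrics_flag(window_size=64, window_mode="tumbling")
print(flag)
```

The printed fragment can be appended to the `api_server` startup command shown in the documentation.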

docs/zh/benchmark.md

Lines changed: 103 additions & 0 deletions
@@ -40,3 +40,106 @@ python benchmark_serving.py \
     --max-concurrency 1 \
     --save-result
 ```
+
+## In-Process Performance Monitoring (Benchmark Metrics Logger)
+
+FastDeploy provides a built-in performance monitoring module that runs inside the inference process, reuses the existing request timestamp data, computes rolling statistics each time a request completes, and writes them to a JSONL file for real-time monitoring and post-hoc analysis.
+
+### How to Enable
+
+Add the `--benchmark-metrics-config` parameter to the service startup command and pass a JSON configuration string:
+
+```bash
+python -m fastdeploy.entrypoints.openai.api_server \
+    --model baidu/ERNIE-4.5-0.3B-Base-Paddle \
+    --benchmark-metrics-config '{"enable": true}'
+```
+
+### Configuration Parameters
+
+| Parameter | Type | Default | Description |
+| :--- | :--- | :----- | :--- |
+| `enable` | bool | `false` | Whether to enable performance monitoring. Must be set to `true` to activate. |
+| `window_size` | int | `0` | Statistics window size. `0` = cumulative mode (all requests are counted); `>0` = the most recent N requests. |
+| `window_mode` | str | `"sliding"` | Window aggregation mode. `"sliding"` = sliding window (keeps the most recent N records; old records are evicted automatically); `"tumbling"` = tumbling window (clears and starts accumulating again once N records are reached). |
+| `percentiles` | str | `"50,90,95,99"` | Percentile values to compute, comma-separated. |
+| `metrics` | str | `"all"` | Subset of metrics to report, comma-separated, or `"all"` for all metrics. |
+
+### Available Metrics
+
+Metrics are aligned with `benchmark_serving.py --percentile-metrics`:
+
+| Metric Name | Description | Unit |
+| :------- | :--- | :--- |
+| `ttft` | Time to first token (client arrival → first token) | ms |
+| `s_ttft` | Server-side time to first token (inference start → first token) | ms |
+| `tpot` | Time per output token (excluding the first token) | ms |
+| `itl` | Inter-token latency | ms |
+| `e2el` | End-to-end latency (client arrival → last token) | ms |
+| `s_e2el` | Server-side end-to-end latency (inference start → last token) | ms |
+| `s_decode` | Decode speed (excluding the first token) | tok/s |
+| `input_len` | Prefix cache hit token count ("Cached Tokens") | tokens |
+| `s_input_len` | Inference input length (total prompt token count) | tokens |
+| `output_len` | Output token length | tokens |
+
+In addition, the following throughput metrics are computed automatically once 2 or more requests have completed (not controlled by the `metrics` parameter):
+
+| Metric | Description | Unit |
+| :--- | :--- | :--- |
+| `request_throughput` | Request throughput | req/s |
+| `output_throughput` | Output token throughput | tok/s |
+| `total_throughput` | Total token throughput (input + output) | tok/s |
+
+### Window Modes
+
+**Sliding window** (`"sliding"`, default):
+
+The window always keeps the most recent N records. When a new record arrives and the window is full, the oldest record is evicted automatically. Each output line reflects the statistics of the most recent N requests.
+
+```bash
+--benchmark-metrics-config '{"enable": true, "window_size": 64, "window_mode": "sliding"}'
+```
+
+**Tumbling window** (`"tumbling"`):
+
+The window accumulates N records, then clears and starts over. Each output line reflects the statistics of the requests accumulated in the current window so far, and the window resets at each boundary. This suits RL training scenarios where every step has a fixed batch size and each step needs to be analyzed independently.
+
+```bash
+--benchmark-metrics-config '{"enable": true, "window_size": 64, "window_mode": "tumbling"}'
+```
+
+**No window** (`window_size: 0`):
+
+All completed requests keep accumulating; the statistics reflect all data since the service started.
+
+```bash
+--benchmark-metrics-config '{"enable": true, "window_size": 0}'
+```
+
+### Output
+
+Results are written to `{FD_LOG_DIR}/benchmark_metrics.jsonl` (default path: `./log/benchmark_metrics.jsonl`). Each line is a JSON object representing a snapshot of the window statistics at the moment a request completes.
+
+Example output:
+
+```json
+{
+  "timestamp": "2026-05-14T10:30:05.123",
+  "window_size": 64,
+  "window_mode": "sliding",
+  "completed": 64,
+  "total_input_tokens": 8192,
+  "total_output_tokens": 16384,
+  "request_throughput": 5.2,
+  "output_throughput": 1250.0,
+  "total_throughput": 2500.0,
+  "ttft_ms": {"mean": 45.0, "median": 42.1, "p50": 42.1, "p90": 68.5, "p95": 82.3, "p99": 120.5},
+  "s_decode": {"mean": 67.3, "median": 67.5, "p50": 67.5, "p90": 70.1, "p95": 71.2, "p99": 73.0}
+}
+```
+
+Read the last line to get the latest performance snapshot:
+
+```bash
+tail -1 log/benchmark_metrics.jsonl | python -m json.tool
+```
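
Where a shell pipeline is inconvenient, the same "latest snapshot" lookup can be done from Python. The following is a rough sketch only: `read_latest_snapshot` is a hypothetical helper, and it assumes the file holds one complete JSON object per line, as the output description above states.

```python
import json
from pathlib import Path


def read_latest_snapshot(path):
    """Return the last non-empty JSONL line as a dict, or None if there is none."""
    p = Path(path)
    if not p.exists():
        return None
    last = None
    with p.open("r", encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines, remember the newest record
                last = line
    return json.loads(last) if last is not None else None


snapshot = read_latest_snapshot("log/benchmark_metrics.jsonl")
if snapshot is not None:
    # Throughput keys may be absent early on (they need 2+ records), hence .get().
    print(snapshot.get("completed"), snapshot.get("request_throughput"))
```

This scans the whole file, which is fine for occasional polling; for very large logs, seeking from the end of the file would be cheaper.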

fastdeploy/config.py

Lines changed: 33 additions & 0 deletions
@@ -1900,7 +1900,11 @@ class BenchmarkMetricsConfig:
     """Configuration for in-process benchmark metrics logger.
 
     Args (passed as JSON dict via --benchmark-metrics-config):
+        enable: Whether to enable the benchmark metrics logger. Default: False.
         window_size: Number of recent requests to aggregate. 0 = all requests (cumulative).
+        window_mode: Window aggregation mode. Default: "sliding".
+            "sliding" = sliding window (keep last N records),
+            "tumbling" = tumbling window (clear and restart after every N records).
         percentiles: Comma-separated percentile values to compute, e.g. "50,90,95,99".
         metrics: Comma-separated metric names to report, or "all".
             Available metrics (aligned with benchmark_serving.py --percentile-metrics):
@@ -1917,7 +1921,9 @@ class BenchmarkMetricsConfig:
     """
 
     _DEFAULTS = {
+        "enable": False,
         "window_size": 0,
+        "window_mode": "sliding",
         "percentiles": "50,90,95,99",
         "metrics": "all",
     }
@@ -2450,6 +2456,33 @@ def check(self):
                 " CUDA 12.x → pip install cuda-python==12.*\n"
             )
 
+        if self.benchmark_metrics_config is not None:
+            cfg = self.benchmark_metrics_config
+            assert isinstance(
+                cfg.enable, bool
+            ), f"BenchmarkMetricsConfig: 'enable' must be a bool, got {type(cfg.enable).__name__}"
+            assert (
+                isinstance(cfg.window_size, int) and cfg.window_size >= 0
+            ), f"BenchmarkMetricsConfig: 'window_size' must be a non-negative integer, got {cfg.window_size!r}"
+            assert cfg.window_mode in (
+                "sliding",
+                "tumbling",
+            ), f"BenchmarkMetricsConfig: 'window_mode' must be 'sliding' or 'tumbling', got {cfg.window_mode!r}"
+            assert (
+                isinstance(cfg.percentiles, str) and cfg.percentiles.strip()
+            ), f"BenchmarkMetricsConfig: 'percentiles' must be a non-empty string, got {cfg.percentiles!r}"
+            for p in cfg.percentile_values:
+                assert 0 <= p <= 100, f"BenchmarkMetricsConfig: percentile value {p} out of range [0, 100]"
+            assert (
+                isinstance(cfg.metrics, str) and cfg.metrics.strip()
+            ), f"BenchmarkMetricsConfig: 'metrics' must be a non-empty string, got {cfg.metrics!r}"
+            if cfg.metrics != "all":
+                invalid = cfg.selected_metrics - set(BenchmarkMetricsConfig._ALL_METRICS)
+                assert not invalid, (
+                    f"BenchmarkMetricsConfig: unknown metric(s): {invalid}. "
+                    f"Valid metrics: {BenchmarkMetricsConfig._ALL_METRICS}"
+                )
+
     def print(self):
         """
         print all config
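
To try the same rules outside FastDeploy, for example before launching a service, the checks in `check()` can be approximated on a plain dict. This is a standalone sketch, not the project's code: `validate_metrics_config` and `ALLOWED_MODES` are hypothetical names, the `metrics` check is omitted, and the percentile parsing is an assumption because the `percentile_values` property is not shown in this diff.

```python
ALLOWED_MODES = ("sliding", "tumbling")


def validate_metrics_config(cfg: dict) -> list:
    """Collect human-readable problems instead of stopping at the first one."""
    errors = []

    if not isinstance(cfg.get("enable", False), bool):
        errors.append("'enable' must be a bool")

    window_size = cfg.get("window_size", 0)
    # bool is a subclass of int in Python, so it is rejected explicitly here.
    if isinstance(window_size, bool) or not isinstance(window_size, int) or window_size < 0:
        errors.append("'window_size' must be a non-negative integer")

    if cfg.get("window_mode", "sliding") not in ALLOWED_MODES:
        errors.append("'window_mode' must be 'sliding' or 'tumbling'")

    percentiles = cfg.get("percentiles", "50,90,95,99")
    if not isinstance(percentiles, str) or not percentiles.strip():
        errors.append("'percentiles' must be a non-empty string")
    else:
        for part in percentiles.split(","):
            try:
                value = float(part)  # assumed parsing; tolerates surrounding spaces
            except ValueError:
                errors.append(f"percentile {part!r} is not a number")
                continue
            if not 0 <= value <= 100:
                errors.append(f"percentile {value} out of range [0, 100]")

    return errors


print(validate_metrics_config({"enable": True, "window_size": 64, "window_mode": "tumbling"}))
```

The sketch is stricter than the asserts above in one respect: `isinstance(True, int)` is true in Python, so the diff's `window_size` check would accept a boolean. The sketch also returns errors rather than asserting, since `assert` statements are skipped when Python runs with `-O`.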

fastdeploy/engine/common_engine.py

Lines changed: 1 addition & 1 deletion
@@ -200,7 +200,7 @@ def __init__(self, cfg: FDConfig, start_queue=True, use_async_llm=False):
         self.resource_manager.scheduler_metrics_logger = self.scheduler_metrics_logger
         self.token_processor.set_scheduler_metrics_logger(self.scheduler_metrics_logger)
 
-        if self.cfg.benchmark_metrics_config is not None:
+        if self.cfg.benchmark_metrics_config is not None and self.cfg.benchmark_metrics_config.enable:
             from fastdeploy.metrics.benchmark_metrics_logger import (
                 BenchmarkMetricsLogger,
             )

fastdeploy/metrics/benchmark_metrics_logger.py

Lines changed: 12 additions & 3 deletions
@@ -52,10 +52,13 @@ class BenchmarkMetricsLogger:
 
     def __init__(self, config: BenchmarkMetricsConfig, log_dir: str, dp_rank: int = 0):
         self.config = config
-        self.enabled = True
+        self.enabled = config.enable
         self.dp_rank = dp_rank
 
-        self._window: deque = deque()
+        if config.window_mode == "sliding" and config.window_size > 0:
+            self._window: deque = deque(maxlen=config.window_size)
+        else:
+            self._window: deque = deque()
 
         self._pending: deque = deque()
         self._condition = threading.Condition()
@@ -96,7 +99,12 @@ def _process_pending(self) -> None:
             stats = self._compute_rolling_stats()
             line = json.dumps(stats, ensure_ascii=False)
             self._file.write(line + "\n")
-            if self.config.window_size > 0 and len(self._window) >= self.config.window_size:
+            # Tumbling window: clear after reaching window_size
+            if (
+                self.config.window_mode == "tumbling"
+                and self.config.window_size > 0
+                and len(self._window) >= self.config.window_size
+            ):
                 self._window.clear()
             self._file.flush()
 
@@ -155,6 +163,7 @@ def _compute_rolling_stats(self) -> dict:
         result: dict[str, Any] = {
             "timestamp": datetime.now().isoformat(),
             "window_size": self.config.window_size,
+            "window_mode": self.config.window_mode,
             "completed": n,
             "total_input_tokens": total_input,
             "total_output_tokens": total_output,
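
The difference between the two modes is easiest to see on a tiny window. The sketch below mirrors the logic in this diff (a bounded `deque` for sliding, an explicit `clear()` for tumbling) but is not the logger itself: `WindowSketch` is a hypothetical class, the records are plain integers, and it assumes each record is appended before its snapshot is taken.

```python
from collections import deque


class WindowSketch:
    """Toy model of the two window modes; not the FastDeploy logger."""

    def __init__(self, window_size, window_mode="sliding"):
        self.window_size = window_size
        self.window_mode = window_mode
        if window_mode == "sliding" and window_size > 0:
            # A bounded deque drops the oldest item automatically when full.
            self.records = deque(maxlen=window_size)
        else:
            self.records = deque()

    def add(self, record):
        """Append a record and return the window contents a snapshot would see."""
        self.records.append(record)
        snapshot = list(self.records)
        # Tumbling: reset only after the full window has been reported.
        if (
            self.window_mode == "tumbling"
            and self.window_size > 0
            and len(self.records) >= self.window_size
        ):
            self.records.clear()
        return snapshot


sliding = WindowSketch(3, "sliding")
tumbling = WindowSketch(3, "tumbling")
sliding_snapshots = [sliding.add(i) for i in range(1, 6)]
tumbling_snapshots = [tumbling.add(i) for i in range(1, 6)]
print(sliding_snapshots)   # [[1], [1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5]]
print(tumbling_snapshots)  # [[1], [1, 2], [1, 2, 3], [4], [4, 5]]
```

With `window_size` set to 0, neither branch bounds or clears the deque, which corresponds to the cumulative "no window" behavior described in the docs.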
