Add window_mode

Deleter-D · Deleter-D · commit f8216775ee8f · 2026-05-18T16:09:16.000+08:00
diff --git a/docs/benchmark.md b/docs/benchmark.md
@@ -40,3 +40,100 @@ python benchmark_serving.py \
   --max-concurrency 1 \
   --save-result
 ```
+
+## In-Process Benchmark Metrics Logger
+
+FastDeploy provides a built-in performance monitoring module that runs inside the inference process. It collects per-request timing data and computes rolling statistics aligned with `benchmark_serving.py`, writing results to a JSONL file for real-time monitoring and post-hoc analysis.
+
+### Enable
+
+Add `--benchmark-metrics-config` with a JSON string to the service startup command:
+
+```bash
+python -m fastdeploy.entrypoints.openai.api_server \
+       --model baidu/ERNIE-4.5-0.3B-Base-Paddle \
+       --benchmark-metrics-config '{"enable": true}'
+```
+
+### Configuration Parameters
+
+| Parameter | Type | Default | Description |
+| :-------- | :--- | :------ | :---------- |
+| `enable` | bool | `false` | Whether to enable the benchmark metrics logger. Must be set to `true` to activate. |
+| `window_size` | int | `0` | Number of recent requests to aggregate. `0` = cumulative (all requests since start). |
+| `window_mode` | str | `"sliding"` | Window aggregation mode. `"sliding"` = sliding window (keeps last N records, oldest automatically dropped). `"tumbling"` = tumbling window (clears and restarts after every N records). |
+| `percentiles` | str | `"50,90,95,99"` | Comma-separated percentile values to compute. |
+| `metrics` | str | `"all"` | Comma-separated metric names to report, or `"all"` for all metrics. |
+
+### Available Metrics
+
+Metrics are aligned with `benchmark_serving.py --percentile-metrics`:
+
+| Metric Name | Description | Unit |
+| :---------- | :---------- | :--- |
+| `ttft` | Time to First Token (client arrival → first token) | ms |
+| `s_ttft` | Server TTFT (inference start → first token) | ms |
+| `tpot` | Time per Output Token (excluding first token) | ms |
+| `itl` | Inter-token Latency | ms |
+| `e2el` | End-to-end Latency (client arrival → last token) | ms |
+| `s_e2el` | Server E2EL (inference start → last token) | ms |
+| `s_decode` | Decode speed (excluding first token) | tok/s |
+| `input_len` | Prefix cache hit token count ("Cached Tokens") | tokens |
+| `s_input_len` | Infer input length (total prompt tokens) | tokens |
+| `output_len` | Output token length per request | tokens |
+
+In addition, the following throughput metrics are always computed once at least two records are available. They are not controlled by the `metrics` parameter:
+
+| Metric | Description | Unit |
+| :----- | :---------- | :--- |
+| `request_throughput` | Request throughput | req/s |
+| `output_throughput` | Output token throughput | tok/s |
+| `total_throughput` | Total token throughput (input + output) | tok/s |
+
+### Window Modes
+
+**Sliding Window** (`"sliding"`, default):
+
+The window keeps the most recent N records. When a new record arrives and the window is full, the oldest record is automatically dropped. Each output line reflects the statistics of the latest N requests.
+
+```bash
+--benchmark-metrics-config '{"enable": true, "window_size": 64, "window_mode": "sliding"}'
+```
+
+**Tumbling Window** (`"tumbling"`):
+
+The window accumulates records until it holds N, then clears and starts over. Each output line reflects the records accumulated in the current window so far, so only the line written at a window boundary covers a full N requests. This is useful for RL training scenarios where each step has a fixed batch size and you want to analyze each step independently.
+
+```bash
+--benchmark-metrics-config '{"enable": true, "window_size": 64, "window_mode": "tumbling"}'
+```
+
+**No Window** (`window_size: 0`):
+
+All completed requests are accumulated. Statistics reflect the entire lifetime of the service.
+
+```bash
+--benchmark-metrics-config '{"enable": true, "window_size": 0}'
+```
+
+### Output
+
+Results are written to `{FD_LOG_DIR}/benchmark_metrics.jsonl` (default: `./log/benchmark_metrics.jsonl`). Each line is a JSON object representing the window statistics at the time of a request completion.
+
+Example output line:
+
+```json
+{
+  "timestamp": "2026-05-14T10:30:05.123",
+  "window_size": 64,
+  "window_mode": "sliding",
+  "completed": 64,
+  "total_input_tokens": 8192,
+  "total_output_tokens": 16384,
+  "request_throughput": 5.2,
+  "output_throughput": 1250.0,
+  "total_throughput": 2500.0,
+  "ttft_ms": {"mean": 45.0, "median": 42.1, "p50": 42.1, "p90": 68.5, "p95": 82.3, "p99": 120.5},
+  "s_decode": {"mean": 67.3, "median": 67.5, "p50": 67.5, "p90": 70.1, "p95": 71.2, "p99": 73.0}
+}
+```
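The per-metric objects in the example output (`mean`, `median`, `p50`, ...) can be reproduced offline with a sketch like the one below. The function names are hypothetical, and linear interpolation between ranks is an assumption: the diff does not show which percentile method the logger actually uses.

```python
from __future__ import annotations

import statistics


def parse_percentiles(spec: str) -> list[float]:
    """Parse a comma-separated spec such as "50,90,95,99" into floats."""
    values = [float(part) for part in spec.split(",") if part.strip()]
    for value in values:
        if not 0 <= value <= 100:
            raise ValueError(f"percentile {value} out of range [0, 100]")
    return values


def percentile(sorted_samples: list[float], p: float) -> float:
    """Percentile of an already sorted list, using linear interpolation (assumed method)."""
    if not sorted_samples:
        raise ValueError("no samples")
    rank = (len(sorted_samples) - 1) * p / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(sorted_samples) - 1)
    frac = rank - lo
    return sorted_samples[lo] + (sorted_samples[hi] - sorted_samples[lo]) * frac


def summarize(samples: list[float], spec: str = "50,90,95,99") -> dict[str, float]:
    """Build a dict shaped like the per-metric objects in the example output."""
    ordered = sorted(samples)
    summary = {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
    }
    for p in parse_percentiles(spec):
        # "{:g}" renders 50.0 as "50", giving keys such as "p50"
        summary[f"p{p:g}"] = percentile(ordered, p)
    return summary


if __name__ == "__main__":
    # Hypothetical TTFT samples in milliseconds
    print(summarize([38.2, 41.0, 42.1, 47.5, 120.5]))
```

Comparing such an offline summary with a logged line is only meaningful when both cover the same set of requests, that is, the same window.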
diff --git a/docs/zh/benchmark.md b/docs/zh/benchmark.md
@@ -40,3 +40,106 @@ python benchmark_serving.py \
   --max-concurrency 1 \
   --save-result
 ```
+
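+## In-Process Performance Monitoring (Benchmark Metrics Logger)
+
+FastDeploy provides a built-in, in-process performance monitoring module. It runs inside the inference process, reuses the request timestamps that are already recorded, and computes rolling statistics each time a request completes, writing them to a JSONL file for real-time monitoring and post-hoc analysis.
+
+### Enabling
+
+Add the `--benchmark-metrics-config` parameter to the service startup command and pass a JSON configuration string: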
+
+```bash
+python -m fastdeploy.entrypoints.openai.api_server \
+       --model baidu/ERNIE-4.5-0.3B-Base-Paddle \
+       --benchmark-metrics-config '{"enable": true}'
+```
+
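+### Configuration Parameters
+
+| Parameter | Type | Default | Description |
+| :--- | :--- | :----- | :--- |
+| `enable` | bool | `false` | Whether to enable performance monitoring. Must be set to `true` to activate. |
+| `window_size` | int | `0` | Statistics window size. `0` = cumulative mode (all requests are counted); `>0` = the most recent N requests. |
+| `window_mode` | str | `"sliding"` | Window aggregation mode. `"sliding"` = sliding window (keeps the most recent N records; older records are evicted automatically); `"tumbling"` = tumbling window (clears and starts accumulating again once N records are reached). |
+| `percentiles` | str | `"50,90,95,99"` | Percentile values to compute, comma-separated. |
+| `metrics` | str | `"all"` | Subset of metrics to report, comma-separated, or `"all"` for all metrics. |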
+
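+### Available Metrics
+
+Metrics are aligned with `benchmark_serving.py --percentile-metrics`:
+
+| Metric Name | Description | Unit |
+| :------- | :--- | :--- |
+| `ttft` | Time to first token (client arrival → first token) | ms |
+| `s_ttft` | Server-side time to first token (inference start → first token) | ms |
+| `tpot` | Time per output token (excluding the first token) | ms |
+| `itl` | Inter-token latency | ms |
+| `e2el` | End-to-end latency (client arrival → last token) | ms |
+| `s_e2el` | Server-side end-to-end latency (inference start → last token) | ms |
+| `s_decode` | Decode speed (excluding the first token) | tok/s |
+| `input_len` | Number of tokens hit in the prefix cache ("Cached Tokens") | tokens |
+| `s_input_len` | Inference input length (total prompt tokens) | tokens |
+| `output_len` | Output token length | tokens |
+
+In addition, the following throughput metrics are computed automatically once two or more requests have completed (they are not controlled by the `metrics` parameter):
+
+| Metric | Description | Unit |
+| :--- | :--- | :--- |
+| `request_throughput` | Request throughput | req/s |
+| `output_throughput` | Output token throughput | tok/s |
+| `total_throughput` | Total token throughput (input + output) | tok/s |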
+
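+### Window Modes
+
+**Sliding window** (`"sliding"`, default):
+
+The window always keeps the most recent N records. When a new record arrives and the window is full, the oldest record is evicted automatically. Each output line reflects the statistics of the most recent N requests.
+
+```bash
+--benchmark-metrics-config '{"enable": true, "window_size": 64, "window_mode": "sliding"}'
+```
+
+**Tumbling window** (`"tumbling"`):
+
+The window accumulates up to N records, then clears and starts over. Each output line reflects the statistics of the requests accumulated in the current window so far, and the window resets at each boundary. This suits RL training scenarios where every step has a fixed batch size and each step needs to be analyzed independently.
+
+```bash
+--benchmark-metrics-config '{"enable": true, "window_size": 64, "window_mode": "tumbling"}'
+```
+
+**No window** (`window_size: 0`):
+
+All completed requests keep accumulating, and the statistics reflect all data since the service started.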
+
+```bash
+--benchmark-metrics-config '{"enable": true, "window_size": 0}'
+```
+
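+### Output
+
+Results are written to `{FD_LOG_DIR}/benchmark_metrics.jsonl` (default path: `./log/benchmark_metrics.jsonl`). Each line is a JSON object holding a snapshot of the window statistics at the moment a request completes.
+
+Example output: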
+
+```json
+{
+  "timestamp": "2026-05-14T10:30:05.123",
+  "window_size": 64,
+  "window_mode": "sliding",
+  "completed": 64,
+  "total_input_tokens": 8192,
+  "total_output_tokens": 16384,
+  "request_throughput": 5.2,
+  "output_throughput": 1250.0,
+  "total_throughput": 2500.0,
+  "ttft_ms": {"mean": 45.0, "median": 42.1, "p50": 42.1, "p90": 68.5, "p95": 82.3, "p99": 120.5},
+  "s_decode": {"mean": 67.3, "median": 67.5, "p50": 67.5, "p90": 70.1, "p95": 71.2, "p99": 73.0}
+}
+```
+
+Read the last line to get the most recent performance snapshot:
+
+```bash
+tail -1 log/benchmark_metrics.jsonl | python -m json.tool
+```
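For monitoring scripts, the same "read the last line" step can be done in Python. This is a minimal sketch: `latest_snapshot` is a hypothetical helper, not part of FastDeploy, and it assumes every non-empty line in the file is a complete JSON object.

```python
import json
from pathlib import Path


def latest_snapshot(path):
    """Return the last non-empty JSONL line as a dict, or None if the file has none."""
    last = None
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                last = line
    return json.loads(last) if last is not None else None


if __name__ == "__main__":
    # Default location according to the docs above; adjust if FD_LOG_DIR is set.
    log_file = Path("log/benchmark_metrics.jsonl")
    if log_file.exists():
        snapshot = latest_snapshot(log_file)
        if snapshot is not None:
            print(snapshot.get("completed"), snapshot.get("request_throughput"))
```

The helper scans the whole file, which is fine for occasional polling. For very large logs, seeking from the end of the file would be cheaper.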
diff --git a/fastdeploy/config.py b/fastdeploy/config.py
@@ -1900,7 +1900,11 @@ class BenchmarkMetricsConfig:
     """Configuration for in-process benchmark metrics logger.
 
     Args (passed as JSON dict via --benchmark-metrics-config):
+        enable: Whether to enable the benchmark metrics logger. Default: False.
         window_size: Number of recent requests to aggregate. 0 = all requests (cumulative).
+        window_mode: Window aggregation mode. Default: "sliding".
+            "sliding" = sliding window (keep last N records),
+            "tumbling" = tumbling window (clear and restart after every N records).
         percentiles: Comma-separated percentile values to compute, e.g. "50,90,95,99".
         metrics: Comma-separated metric names to report, or "all".
             Available metrics (aligned with benchmark_serving.py --percentile-metrics):
@@ -1917,7 +1921,9 @@ class BenchmarkMetricsConfig:
     """
 
     _DEFAULTS = {
+        "enable": False,
         "window_size": 0,
+        "window_mode": "sliding",
         "percentiles": "50,90,95,99",
         "metrics": "all",
     }
@@ -2450,6 +2456,33 @@ def check(self):
                     "  CUDA 12.x → pip install cuda-python==12.*\n"
                 )
 
+        if self.benchmark_metrics_config is not None:
+            cfg = self.benchmark_metrics_config
+            assert isinstance(
+                cfg.enable, bool
+            ), f"BenchmarkMetricsConfig: 'enable' must be a bool, got {type(cfg.enable).__name__}"
+            assert (
+                isinstance(cfg.window_size, int) and cfg.window_size >= 0
+            ), f"BenchmarkMetricsConfig: 'window_size' must be a non-negative integer, got {cfg.window_size!r}"
+            assert cfg.window_mode in (
+                "sliding",
+                "tumbling",
+            ), f"BenchmarkMetricsConfig: 'window_mode' must be 'sliding' or 'tumbling', got {cfg.window_mode!r}"
+            assert (
+                isinstance(cfg.percentiles, str) and cfg.percentiles.strip()
+            ), f"BenchmarkMetricsConfig: 'percentiles' must be a non-empty string, got {cfg.percentiles!r}"
+            for p in cfg.percentile_values:
+                assert 0 <= p <= 100, f"BenchmarkMetricsConfig: percentile value {p} out of range [0, 100]"
+            assert (
+                isinstance(cfg.metrics, str) and cfg.metrics.strip()
+            ), f"BenchmarkMetricsConfig: 'metrics' must be a non-empty string, got {cfg.metrics!r}"
+            if cfg.metrics != "all":
+                invalid = cfg.selected_metrics - set(BenchmarkMetricsConfig._ALL_METRICS)
+                assert not invalid, (
+                    f"BenchmarkMetricsConfig: unknown metric(s): {invalid}. "
+                    f"Valid metrics: {BenchmarkMetricsConfig._ALL_METRICS}"
+                )
+
     def print(self):
         """
         print all config
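The checks added to `check()` in this file use `assert`, which Python skips when run with `-O`. A standalone variant that raises `ValueError` instead could look like the sketch below. `validate_benchmark_config` and its dict-based input are hypothetical and not part of this commit. The metric names are taken from the docs table, and rejecting `bool` values for `window_size` is an extra strictness the original check does not have.

```python
ALL_METRICS = (
    "ttft", "s_ttft", "tpot", "itl", "e2el", "s_e2el",
    "s_decode", "input_len", "s_input_len", "output_len",
)
WINDOW_MODES = ("sliding", "tumbling")
DEFAULTS = {
    "enable": False,
    "window_size": 0,
    "window_mode": "sliding",
    "percentiles": "50,90,95,99",
    "metrics": "all",
}


def validate_benchmark_config(cfg: dict) -> dict:
    """Merge cfg over the defaults and raise ValueError on any invalid field."""
    merged = {**DEFAULTS, **cfg}

    if not isinstance(merged["enable"], bool):
        raise ValueError(f"'enable' must be a bool, got {type(merged['enable']).__name__}")

    size = merged["window_size"]
    # bool is a subclass of int, so it is rejected explicitly here
    if isinstance(size, bool) or not isinstance(size, int) or size < 0:
        raise ValueError(f"'window_size' must be a non-negative integer, got {size!r}")

    if merged["window_mode"] not in WINDOW_MODES:
        raise ValueError(f"'window_mode' must be one of {WINDOW_MODES}, got {merged['window_mode']!r}")

    pct = merged["percentiles"]
    if not isinstance(pct, str) or not pct.strip():
        raise ValueError(f"'percentiles' must be a non-empty string, got {pct!r}")
    for part in pct.split(","):
        value = float(part)  # raises ValueError for non-numeric entries
        if not 0 <= value <= 100:
            raise ValueError(f"percentile value {value} out of range [0, 100]")

    metrics = merged["metrics"]
    if not isinstance(metrics, str) or not metrics.strip():
        raise ValueError(f"'metrics' must be a non-empty string, got {metrics!r}")
    if metrics != "all":
        unknown = {name.strip() for name in metrics.split(",")} - set(ALL_METRICS)
        if unknown:
            raise ValueError(f"unknown metric(s): {sorted(unknown)}")

    return merged
```

Raising exceptions keeps the validation active under `python -O`. Whether that matters depends on how the service is launched.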
diff --git a/fastdeploy/engine/common_engine.py b/fastdeploy/engine/common_engine.py
@@ -200,7 +200,7 @@ def __init__(self, cfg: FDConfig, start_queue=True, use_async_llm=False):
         self.resource_manager.scheduler_metrics_logger = self.scheduler_metrics_logger
         self.token_processor.set_scheduler_metrics_logger(self.scheduler_metrics_logger)
 
-        if self.cfg.benchmark_metrics_config is not None:
+        if self.cfg.benchmark_metrics_config is not None and self.cfg.benchmark_metrics_config.enable:
             from fastdeploy.metrics.benchmark_metrics_logger import (
                 BenchmarkMetricsLogger,
             )
diff --git a/fastdeploy/metrics/benchmark_metrics_logger.py b/fastdeploy/metrics/benchmark_metrics_logger.py
@@ -52,10 +52,13 @@ class BenchmarkMetricsLogger:
 
     def __init__(self, config: BenchmarkMetricsConfig, log_dir: str, dp_rank: int = 0):
         self.config = config
-        self.enabled = True
+        self.enabled = config.enable
         self.dp_rank = dp_rank
 
-        self._window: deque = deque()
+        if config.window_mode == "sliding" and config.window_size > 0:
+            self._window: deque = deque(maxlen=config.window_size)
+        else:
+            self._window: deque = deque()
 
         self._pending: deque = deque()
         self._condition = threading.Condition()
@@ -96,7 +99,12 @@ def _process_pending(self) -> None:
             stats = self._compute_rolling_stats()
             line = json.dumps(stats, ensure_ascii=False)
             self._file.write(line + "\n")
-            if self.config.window_size > 0 and len(self._window) >= self.config.window_size:
+            # Tumbling window: clear after reaching window_size
+            if (
+                self.config.window_mode == "tumbling"
+                and self.config.window_size > 0
+                and len(self._window) >= self.config.window_size
+            ):
                 self._window.clear()
         self._file.flush()
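The effect of the two modes on the `completed` count of each output line can be illustrated with a small simulation. This is a sketch rather than the logger itself: `simulate` is a hypothetical function, and it assumes each record is appended to the window before statistics are computed, mirroring the order of the stats call and `clear()` in the hunk above.

```python
from collections import deque


def simulate(window_mode: str, window_size: int, n_records: int) -> list:
    """Return the window length that would be reported after each completed request."""
    sliding = window_mode == "sliding" and window_size > 0
    # A bounded deque drops its oldest item automatically once maxlen is reached.
    window = deque(maxlen=window_size) if sliding else deque()

    reported = []
    for record in range(n_records):
        window.append(record)
        reported.append(len(window))  # statistics are taken before any reset
        if window_mode == "tumbling" and window_size > 0 and len(window) >= window_size:
            window.clear()  # tumbling: start a fresh window at the boundary
    return reported


if __name__ == "__main__":
    print(simulate("sliding", 3, 5))   # [1, 2, 3, 3, 3]
    print(simulate("tumbling", 3, 7))  # [1, 2, 3, 1, 2, 3, 1]
    print(simulate("sliding", 0, 4))   # [1, 2, 3, 4]
```

In tumbling mode, only the lines where the count reaches `window_size` describe a complete batch, which is what a per-step analysis would filter on.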
 
@@ -155,6 +163,7 @@ def _compute_rolling_stats(self) -> dict:
         result: dict[str, Any] = {
             "timestamp": datetime.now().isoformat(),
             "window_size": self.config.window_size,
+            "window_mode": self.config.window_mode,
             "completed": n,
             "total_input_tokens": total_input,
             "total_output_tokens": total_output,
diff --git a/tests/metrics/test_benchmark_metrics_logger.py b/tests/metrics/test_benchmark_metrics_logger.py
