docs/benchmark.md: 97 additions, 0 deletions
@@ -40,3 +40,100 @@ python benchmark_serving.py \
--max-concurrency 1 \
--save-result
```
## In-Process Benchmark Metrics Logger
FastDeploy provides a built-in performance monitoring module that runs inside the inference process. It collects per-request timing data and computes rolling statistics aligned with `benchmark_serving.py`, writing results to a JSONL file for real-time monitoring and post-hoc analysis.
### Enable
Add `--benchmark-metrics-config` with a JSON string to the service startup command:

| Parameter | Type | Default | Description |
| :-------- | :--- | :------ | :---------- |
|`enable`| bool |`false`| Whether to enable the benchmark metrics logger. Must be set to `true` to activate. |
|`window_size`| int |`0`| Number of recent requests to aggregate. `0` = cumulative (all requests since start). |
|`window_mode`| str |`"sliding"`| Window aggregation mode. `"sliding"` = sliding window (keeps the last N records, oldest automatically dropped). `"tumbling"` = tumbling window (clears and restarts after every N records). |
|`percentiles`| str |`"50,90,95,99"`| Comma-separated percentile values to compute. |
|`metrics`| str |`"all"`| Comma-separated metric names to report, or `"all"` for all metrics. |
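
As a rough illustration of the JSON string described above, the following sketch builds one from the documented fields and quotes it for a shell. The `window_size` value of 100 is an arbitrary example, and `parse_percentiles` is a hypothetical helper showing one way a comma-separated percentile string could be interpreted; neither is taken from the FastDeploy source.

```python
import json
import shlex

# Field names and defaults come from the parameter table above.
# The chosen values are illustrative only.
config = {
    "enable": True,             # default is false, so it must be set explicitly
    "window_size": 100,         # example value; 0 would mean cumulative
    "window_mode": "sliding",   # or "tumbling"
    "percentiles": "50,90,95,99",
    "metrics": "all",
}

# Compact JSON, shell-quoted so it is passed as a single argument.
json_arg = json.dumps(config, separators=(",", ":"))
flag = "--benchmark-metrics-config " + shlex.quote(json_arg)
print(flag)


def parse_percentiles(spec: str) -> list[float]:
    """Hypothetical helper: split a comma-separated percentile string."""
    return [float(p) for p in spec.split(",") if p.strip()]
```

The printed flag can be appended to whatever startup command the service already uses.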
### Available Metrics
Metrics are aligned with `benchmark_serving.py --percentile-metrics`:

| Metric Name | Description | Unit |
| :---------- | :---------- | :--- |
|`ttft`| Time to First Token (client arrival → first token) | ms |
|`s_ttft`| Server TTFT (inference start → first token) | ms |
|`tpot`| Time per Output Token (excluding first token) | ms |
|`itl`| Inter-token Latency | ms |
|`e2el`| End-to-end Latency (client arrival → last token) | ms |
|`s_e2el`| Server E2EL (inference start → last token) | ms |
|`s_decode`| Decode speed (excluding first token) | tok/s |
|`total_throughput`| Total token throughput (input + output) | tok/s |
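
To make the timing definitions concrete, here is a minimal sketch of how the per-request metrics could be derived from raw timestamps. The `RequestTiming` class and `per_request_metrics` function are hypothetical, and the formulas for `tpot` and `s_decode` assume the common convention of dividing the time after the first token by the number of remaining tokens; the actual FastDeploy implementation may differ.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Hypothetical record of one request; all times are in seconds."""
    arrival: float            # when the request arrived from the client
    inference_start: float    # when inference began on the server
    token_times: list[float]  # timestamp of each output token


def per_request_metrics(t: RequestTiming) -> dict:
    first, last = t.token_times[0], t.token_times[-1]
    n_tokens = len(t.token_times)
    metrics = {
        "ttft": (first - t.arrival) * 1000,            # ms
        "s_ttft": (first - t.inference_start) * 1000,  # ms
        "e2el": (last - t.arrival) * 1000,             # ms
        "s_e2el": (last - t.inference_start) * 1000,   # ms
        # One inter-token gap per pair of consecutive tokens, in ms.
        "itl": [(b - a) * 1000
                for a, b in zip(t.token_times, t.token_times[1:])],
    }
    decode_seconds = last - first
    if n_tokens > 1 and decode_seconds > 0:
        # Assumed convention: both exclude the first token.
        metrics["tpot"] = decode_seconds * 1000 / (n_tokens - 1)  # ms
        metrics["s_decode"] = (n_tokens - 1) / decode_seconds     # tok/s
    return metrics
```

Requests with a single output token have no decode phase in this sketch, so `tpot` and `s_decode` are simply omitted for them.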
### Window Modes
**Sliding Window** (`"sliding"`, default):

The window keeps the most recent N records. When a new record arrives and the window is full, the oldest record is automatically dropped. Each output line reflects the statistics of the latest N requests.

**Tumbling Window** (`"tumbling"`):

The window accumulates records until it holds N, then clears and starts fresh. Each output line still reflects the statistics accumulated in the current window, but the window resets at every N-record boundary. This suits RL training scenarios where each step has a fixed batch size and you want independent per-step analysis.
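
The difference between the two modes can be sketched in a few lines. The class names below are hypothetical, and the exact moment at which the tumbling window clears (here: just before the record that would overflow it) is an assumption, not something confirmed by the FastDeploy source.

```python
from collections import deque


class SlidingWindow:
    """Keeps the most recent `size` records; size 0 keeps everything."""

    def __init__(self, size: int):
        # deque(maxlen=None) is unbounded, which models cumulative mode.
        self.records = deque(maxlen=size or None)

    def add(self, value):
        self.records.append(value)  # the oldest record drops when full
        return list(self.records)   # snapshot used for this output line


class TumblingWindow:
    """Accumulates up to `size` records, then starts over."""

    def __init__(self, size: int):
        self.size = size
        self.records = []

    def add(self, value):
        # Assumed behavior: reset once the previous window is complete.
        if self.size and len(self.records) >= self.size:
            self.records.clear()
        self.records.append(value)
        return list(self.records)


sliding, tumbling = SlidingWindow(3), TumblingWindow(3)
for request_id in range(1, 6):
    print(request_id, sliding.add(request_id), tumbling.add(request_id))
```

With N = 3, the sliding snapshot after the fifth request holds requests 3, 4 and 5, while the tumbling snapshot holds only 4 and 5 because it reset after request 3.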

Results are written to `{FD_LOG_DIR}/benchmark_metrics.jsonl` (default: `./log/benchmark_metrics.jsonl`). Each line is a JSON object holding the window statistics computed when a request completes.
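
For post-hoc analysis, the file can be read line by line. The sketch below assumes that `FD_LOG_DIR` is an environment variable and makes no assumption about the keys inside each JSON object; `metrics_path` and `read_snapshots` are hypothetical helper names.

```python
import json
import os
from pathlib import Path


def metrics_path() -> Path:
    # Assumption: FD_LOG_DIR is an environment variable, defaulting to ./log.
    log_dir = os.environ.get("FD_LOG_DIR", "./log")
    return Path(log_dir) / "benchmark_metrics.jsonl"


def read_snapshots(path: Path) -> list[dict]:
    """Parse every complete JSON line; skip blank or partial lines."""
    snapshots = []
    with path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                snapshots.append(json.loads(line))
            except json.JSONDecodeError:
                # The last line may be half-written while the service runs.
                continue
    return snapshots
```

Calling `read_snapshots(metrics_path())[-1]` would then give the most recent window statistics, provided the file exists and contains at least one complete line.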