[benchmark] Enhance benchmark metrics with ITL aggregation details #8063
Added metrics for cleaned ITL aggregation including decode speed and counts for different ITL categories.
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-06-17 16:38:18
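📋 Review summary
PR overview: Adds cleaned, globally aggregated ITL decode speed and sample counts to the output of benchmark_serving.py.
Scope of change: benchmarks/benchmark_serving.py
Impact tag: [Benchmark]
Issues
| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | benchmarks/benchmark_serving.py:304 | The clean ITL aggregation uses a different valid-sample criterion than the main statistics, so responses with output_tokens == 0 may be counted in the new metrics |
| 🔴 Security | benchmarks/benchmark_serving.py:471 | The warmup success path unconditionally prints the full response object, which may leak generated content and token ids |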
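📝 PR convention check
The capitalization of the title tag does not match the official tag, and the PR description still contains template placeholder content with key sections left unfilled.
Suggested title (can be copied directly):
[Benchmark] Add cleaned ITL aggregation metrics
Suggested PR description (can be copied directly):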
## Motivation
Add benchmark metrics that report cleaned global ITL aggregation, so decode speed can be inspected after filtering burst and preemption intervals.
## Modifications
- Updated `benchmarks/benchmark_serving.py` to collect global ITL samples, count burst/preemption/clean intervals, and report clean decode speed.
- Added the clean ITL aggregate fields to benchmark JSON results and console output.
## Usage or Command
N/A
## Accuracy Tests
N/A
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
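- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
Overall assessment
The new metrics are a reasonable addition, but the current implementation has an inconsistent sample criterion and a log exposure risk. Both should be fixed before merging.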
PREEMPT_THRESHOLD_S = 0.5  # 500 ms
all_itls_flat: list[float] = []
for o in outputs:
    if o.success:
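🔴 Bug The new clean ITL aggregation collects samples based only on o.success, so it re-includes responses with output_tokens == 0 that the main statistics already skip.
The main loop earlier in calculate_metrics() reads output_len = outputs[i].output_tokens and then hits continue for empty outputs, so those requests never enter completed, total_output, or the regular itl/s_decode metrics. Iterating over outputs again here and checking only o.success makes s_decode_clean and n_itls_* use a different sample set from the main metrics in the same result. For some streaming backends that successfully return text but no usage/completion_tokens, the existing code takes exactly this skip path.
Suggested fix: reuse the itls samples that the main loop has already validated, for example by building _arr directly from itls, or move the clean aggregation to after if not output_len: continue, so that it stays consistent with completed, total_output, and the regular ITL metrics.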
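The suggested fix can be illustrated with a minimal, self-contained sketch. It is not the PR's actual code: `Output` is a simplified stand-in for `RequestFuncOutput`, `BURST_THRESHOLD_S` and the exact `n_itls_*` key names are hypothetical, and defining `s_decode_clean` as clean intervals divided by their total time is an assumption. Only `PREEMPT_THRESHOLD_S = 0.5` and the `success` / `output_tokens` checks come from the review.

```python
from dataclasses import dataclass, field

PREEMPT_THRESHOLD_S = 0.5   # from the PR diff: 500 ms
BURST_THRESHOLD_S = 0.001   # hypothetical value, not taken from the PR


@dataclass
class Output:
    """Simplified stand-in for RequestFuncOutput (only the fields used here)."""
    success: bool
    output_tokens: int
    itl: list[float] = field(default_factory=list)


def aggregate_clean_itl(outputs: list[Output]) -> dict[str, float]:
    all_itls_flat: list[float] = []
    for o in outputs:
        # Apply the same validity rule as the main loop:
        # skip failed requests AND requests with no output tokens.
        if not o.success or not o.output_tokens:
            continue
        all_itls_flat.extend(o.itl)

    # Classify each inter-token latency sample.
    n_burst = sum(1 for x in all_itls_flat if x < BURST_THRESHOLD_S)
    n_preempt = sum(1 for x in all_itls_flat if x > PREEMPT_THRESHOLD_S)
    clean = [x for x in all_itls_flat
             if BURST_THRESHOLD_S <= x <= PREEMPT_THRESHOLD_S]
    clean_time = sum(clean)

    return {
        "n_itls_total": len(all_itls_flat),
        "n_itls_burst": n_burst,
        "n_itls_preempt": n_preempt,
        "n_itls_clean": len(clean),
        # Assumed definition: clean intervals per second of clean decode time.
        "s_decode_clean": len(clean) / clean_time if clean_time > 0 else 0.0,
    }
```

With this filter, a successful response that reports zero output tokens contributes nothing to the clean aggregate, matching how the main loop treats it.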
out_list, metrics = test_output
test_output = out_list[0]

print("test_output:", test_output, flush=True)
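🔴 Security This now prints the warmup response to the log unconditionally, including on success, which leaks the full RequestFuncOutput.
The repr of RequestFuncOutput includes fields such as generated_text, reasoning_content, output_ids, and tool_calls. Previously it was printed only when warmup failed, mainly for troubleshooting. Now successful requests also go to stdout, which exposes generated content and token ids whenever benchmark logs are collected or shared.
Suggested fix: go back to printing the full test_output only on failure. If the success path really needs debugging information, put it behind if debug: or print only non-content fields such as request id, latency, and output_tokens.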
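One possible shape for that fix is sketched below. This is an illustration under stated assumptions, not the benchmark's real code: `Output` is a reduced stand-in for `RequestFuncOutput` with only some of its fields, and `describe_warmup` and its `debug` parameter are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Output:
    """Reduced stand-in for RequestFuncOutput; the field set is partly assumed."""
    success: bool = False
    latency: float = 0.0
    output_tokens: int = 0
    generated_text: str = ""
    output_ids: list[int] = field(default_factory=list)
    error: str = ""


def describe_warmup(output: Output, debug: bool = False) -> str:
    """Build the warmup log line without leaking content on success."""
    if not output.success or debug:
        # Failure, or explicit debug mode: the full repr helps troubleshooting.
        return f"test_output: {output!r}"
    # Success: report only non-content fields.
    return (
        f"warmup ok: latency={output.latency:.3f}s "
        f"output_tokens={output.output_tokens}"
    )
```

A caller would then use `print(describe_warmup(test_output), flush=True)`, so generated text and token ids reach stdout only when the warmup fails or debugging is explicitly enabled.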