[benchmark] Enhance benchmark metrics with ITL aggregation details#8063

Merged
EmmonsCurse merged 1 commit into develop from ZhangYulongg-patch-1 on Jun 17, 2026

@ZhangYulongg

Collaborator

Added metrics for cleaned ITL aggregation including decode speed and counts for different ITL categories.

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Added metrics for cleaned ITL aggregation including decode speed and counts for different ITL categories.
@EmmonsCurse EmmonsCurse merged commit 5372fe5 into develop Jun 17, 2026
41 of 43 checks passed
@EmmonsCurse EmmonsCurse deleted the ZhangYulongg-patch-1 branch June 17, 2026 08:32

@PaddlePaddle-bot PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-06-17 16:38:18

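📋 Review Summary

PR overview: Adds the cleaned, globally aggregated ITL decode speed and the ITL sample counts to the output of benchmark_serving.py.
Change scope: benchmarks/benchmark_serving.py
Impact tag: [Benchmark]

Issues

| Level | File | Summary |
|---|---|---|
| 🔴 Bug | benchmarks/benchmark_serving.py:304 | The clean ITL aggregation uses a different valid-sample criterion than the main statistics, so responses with output_tokens == 0 may be counted in the new metrics |
| 🔴 Security | benchmarks/benchmark_serving.py:471 | The warmup success path unconditionally prints the full response object, which may leak generated content and token ids |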

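📝 PR Convention Check

The capitalization of the title tag does not match the official tag, and the PR description still contains template placeholder text with key sections left unfilled.

Suggested title (can be copied directly):

  • [Benchmark] Add cleaned ITL aggregation metrics

Suggested PR description (can be copied directly):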
## Motivation
Add benchmark metrics that report cleaned global ITL aggregation, so decode speed can be inspected after filtering burst and preemption intervals.

## Modifications
- Updated `benchmarks/benchmark_serving.py` to collect global ITL samples, count burst/preemption/clean intervals, and report clean decode speed.
- Added the clean ITL aggregate fields to benchmark JSON results and console output.

## Usage or Command
N/A

## Accuracy Tests
N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

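Overall assessment

The new metrics are a reasonable addition, but the current implementation has an inconsistent sample definition and a log exposure risk. Fixing both before merging is recommended.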

PREEMPT_THRESHOLD_S = 0.5  # 500 ms
all_itls_flat: list[float] = []
for o in outputs:
    if o.success:

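🔴 Bug The new clean ITL aggregation collects samples based only on o.success, so it re-includes responses with output_tokens == 0 that the main statistics already skip.

The main loop earlier in calculate_metrics() reads output_len = outputs[i].output_tokens and then hits continue for empty outputs, so those requests never enter completed, total_output, or the regular itl/s_decode metrics. Iterating over outputs again here and checking only o.success makes s_decode_clean and n_itls_* use a different sample set from the main metrics in the same result. For some streaming backends, a response that returns text successfully but carries no usage/completion_tokens takes exactly this skip path in the existing code.

Suggested fix: reuse the itls samples that the main loop has already validated, for example by building _arr directly from itls, or move the clean aggregation to after if not output_len: continue, so that it uses the same sample definition as completed, total_output, and the regular ITL metrics.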
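The second option in the suggested fix can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in benchmark_serving.py: the `Output` class is a simplified stand-in for RequestFuncOutput, the `itl` field name, the `BURST_THRESHOLD_S` value, the exact `n_itls_*` key names, and the definition of `s_decode_clean` as clean intervals per second of clean decode time are all hypothetical. Only `PREEMPT_THRESHOLD_S = 0.5` comes from the diff.

```python
from dataclasses import dataclass, field

PREEMPT_THRESHOLD_S = 0.5  # 500 ms, the value shown in the diff
BURST_THRESHOLD_S = 0.001  # hypothetical cutoff, not taken from the PR


@dataclass
class Output:
    """Simplified, hypothetical stand-in for RequestFuncOutput."""
    success: bool
    output_tokens: int
    itl: list[float] = field(default_factory=list)  # assumed field name


def aggregate_clean_itl(outputs):
    """Aggregate ITL samples with the same validity filter as the main loop."""
    valid_itls = []
    for o in outputs:
        # Skip failed requests and empty outputs, mirroring
        # `if not output_len: continue` in the main statistics.
        if not o.success or not o.output_tokens:
            continue
        valid_itls.extend(o.itl)

    n_burst = sum(1 for x in valid_itls if x < BURST_THRESHOLD_S)
    n_preempt = sum(1 for x in valid_itls if x > PREEMPT_THRESHOLD_S)
    clean = [x for x in valid_itls
             if BURST_THRESHOLD_S <= x <= PREEMPT_THRESHOLD_S]
    clean_time = sum(clean)
    # Assumed definition: clean intervals per second of clean decode time.
    s_decode_clean = len(clean) / clean_time if clean_time > 0 else 0.0
    return {
        "n_itls_burst": n_burst,
        "n_itls_preempt": n_preempt,
        "n_itls_clean": len(clean),
        "s_decode_clean": s_decode_clean,
    }


outputs = [
    Output(success=True, output_tokens=3, itl=[0.02, 0.03, 0.9]),
    Output(success=True, output_tokens=0, itl=[0.01, 0.01]),  # skipped
    Output(success=False, output_tokens=5, itl=[0.02]),       # skipped
]
print(aggregate_clean_itl(outputs))
```

With this filter, the second request contributes nothing to the clean aggregate even though it succeeded, which matches how the main loop treats empty outputs.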

out_list, metrics = test_output
test_output = out_list[0]

print("test_output:", test_output, flush=True)

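🔴 Security This line unconditionally prints successful warmup responses to the log as well, which leaks the full RequestFuncOutput.

The repr of RequestFuncOutput includes fields such as generated_text, reasoning_content, output_ids, and tool_calls. Previously the object was printed only when warmup failed, mainly for troubleshooting. Now successful requests also go to stdout, so generated content and token ids are exposed whenever benchmark logs are collected or shared.

Suggested fix: go back to printing the full test_output only on failure. If the success path really needs debug information, put it behind if debug: or print only non-content fields such as request id, latency, and output_tokens.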
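One way to express that suggestion is sketched below. It is an illustration only: the `RequestFuncOutput` class here is a reduced, hypothetical stand-in (the `error` field and the default values are assumptions), and `warmup_log_line` and its `debug` parameter are names invented for this example rather than anything in benchmark_serving.py.

```python
from dataclasses import dataclass, field


@dataclass
class RequestFuncOutput:
    """Reduced, hypothetical stand-in; the real class has more fields."""
    success: bool = False
    latency: float = 0.0
    output_tokens: int = 0
    generated_text: str = ""
    output_ids: list[int] = field(default_factory=list)
    error: str = ""  # assumed field


def warmup_log_line(test_output, debug=False):
    """Build the warmup log line without leaking content on success."""
    if not test_output.success or debug:
        # Full repr only for failures or explicit debugging.
        return f"test_output: {test_output!r}"
    # Success path: non-content fields only.
    return (
        f"warmup ok: latency={test_output.latency:.3f}s "
        f"output_tokens={test_output.output_tokens}"
    )


ok = RequestFuncOutput(success=True, latency=0.5, output_tokens=4,
                       generated_text="secret", output_ids=[1, 2, 3, 4])
failed = RequestFuncOutput(success=False, error="boom")

print(warmup_log_line(ok), flush=True)
print(warmup_log_line(failed), flush=True)
```

Returning the line instead of printing it inside the helper keeps the redaction rule easy to unit test.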
