Add benchmark-compare skill from #7803 #7847
Merged
Jiang-Jia-Jun approved these changes on May 19, 2026
Motivation
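In day-to-day performance evaluation we frequently need to compare two inference frameworks, FastDeploy and SGLang. Doing this by hand involves environment installation, service startup, health checks, benchmark execution, metric extraction, and report generation, which is tedious and error-prone. This PR adds an Agent Skill (.claude/skills/benchmark-compare/) that orchestrates the whole workflow automatically: a natural-language request or the /benchmark command runs the performance comparison end to end and generates a visual HTML report.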
Modifications
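Adds the .claude/skills/benchmark-compare/ directory, containing the following files:
SKILL.md: main skill definition, including the full 12-step workflow orchestration, a parameter table, a decision tree, and two working modes (fully automated testing / report generation only)
README.md: usage documentation
scripts/launch_service.sh: generic service launch script, supporting both frameworks (FD for FastDeploy, SG for SGLang) and the single / TP / PD deployment modes
scripts/health_check.sh: service health check script that polls the /v1/models endpoint
scripts/run_benchmark.sh: wrapper script that runs the benchmark
scripts/extract_metrics.py: extracts the core metrics (throughput, latency, TTFT, etc.) from benchmark result files and writes them out as JSON
scripts/generate_report.py: generates the multi-mode visual HTML comparison report
references/html_template.md: HTML report template (with CSS/JS and placeholders)
references/model_profiles.md: table of recommended deployment parameters per model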
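The health check described above (polling /v1/models until the service answers) can be sketched in Python as follows. This is only an illustration of the polling idea, not the contents of scripts/health_check.sh: the function names, the timeout and interval defaults, and the port in the usage note are all assumptions.

```python
import http.client
import time
import urllib.error
import urllib.request


def wait_until_healthy(base_url, timeout_s=600.0, interval_s=5.0, probe=None):
    """Poll `{base_url}/v1/models` until it answers HTTP 200 or the timeout expires.

    `probe` is a hypothetical hook (url -> bool) so the loop can be exercised
    without a running server; by default a real HTTP GET is issued.
    """
    probe = probe or _http_probe
    url = f"{base_url.rstrip('/')}/v1/models"
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        # Give up if another sleep would take us past the deadline.
        if time.monotonic() + interval_s > deadline:
            return False
        time.sleep(interval_s)


def _http_probe(url):
    """Return True only when the endpoint responds with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False
```

A caller would typically run something like `wait_until_healthy("http://127.0.0.1:8000")` right after launching a service and abort the benchmark if it returns `False`; the address here is a placeholder.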
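Supported features:
Multiple deployment modes: single GPU, multi-GPU TP, PD disaggregation, and others
Quantization options such as BF16 and FP8
Automatic detection and allocation of idle GPUs
Automatic matching of the hyperparameter YAML configuration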
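One way idle-GPU detection could work is to query `nvidia-smi` and keep the devices whose memory use and utilization are below a threshold. The sketch below is an assumption about the approach, not the skill's actual logic: the thresholds, function names, and the decision to take the lowest-numbered idle devices are all hypothetical.

```python
import subprocess

# Standard nvidia-smi query; output is one "index, memory.used, utilization.gpu" row per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_idle_gpus(csv_text, max_mem_mib=500, max_util_pct=5):
    """Return indices of GPUs at or below both (hypothetical) idle thresholds."""
    idle = []
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3:
            continue
        try:
            index, mem, util = (int(p) for p in parts)
        except ValueError:
            continue  # skip rows with non-numeric fields such as "[N/A]"
        if mem <= max_mem_mib and util <= max_util_pct:
            idle.append(index)
    return idle


def pick_gpus(needed, csv_text=None):
    """Pick `needed` idle GPUs; runs nvidia-smi unless sample text is supplied."""
    if csv_text is None:
        csv_text = subprocess.run(
            QUERY, capture_output=True, text=True, check=True
        ).stdout
    idle = parse_idle_gpus(csv_text)
    if len(idle) < needed:
        raise RuntimeError(f"need {needed} idle GPUs, found {len(idle)}: {idle}")
    return idle[:needed]
```

The result can be joined into `CUDA_VISIBLE_DEVICES`, for example `",".join(map(str, pick_gpus(2)))` for a TP=2 run, before the service is launched.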
Usage or Command
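Use as an Agent Skill (in Claude Code / Ducc):
Option 1: slash command
/benchmark
Option 2: natural language
Run a benchmark for me: model /path/to/GLM-4.7-Flash, TP=2, concurrency 64, with fp8 quantization enabled
Option 3: generate a report from existing data only
Generate an HTML comparison report from these logs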