[Optimization][Speculative Decoding] opt mtp logprob #7883
|
Thanks for your contribution! |
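CI report generated from the following code (updated every 30 minutes): 1 Task overview: no Required tasks are currently failing, running, or pending; the Required pass count is
2 Task status summary 2.1 Required tasks: 7/10 passed
2.2 Optional tasks: 30/36 passed
3 Failure details (Required only): no Required tasks failed, so deep failure analysis was not invoked this round. Optional failure summary (non-blocking, for reference only)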
|
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## develop #7883 +/- ##
==========================================
Coverage ? 63.72%
==========================================
Files ? 467
Lines ? 65039
Branches ? 8890
==========================================
Hits ? 41444
Misses ? 20790
Partials ? 2805
|
…into opt_mtp_logprob
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-05-26 16:01:44
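📋 Review summary
PR overview: a performance optimization for MTP speculative decoding with logprobs. max_logprobs is now computed dynamically (the hard-coded cap of 20 is removed), and message_flag/max_num_logprobs are packed into meta[1], which cuts the write and read overhead of unused topk slots.
Scope of changes: custom_ops/gpu_ops/speculate_decoding/, fastdeploy/output/token_processor.py, fastdeploy/worker/gpu_model_runner.py
Impact tags: [Speculative Decoding] [OP] [DataProcessor]
Issues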
| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | custom_ops/gpu_ops/speculate_decoding/speculate_save_output_with_topk.cc:160 | After the else 20 cap was removed, no max_num_logprobs <= SPEC_LOGPROB_K+1 guard was added; when a user requests top_logprobs > SPEC_LOGPROB_K, the loop writes out of bounds |
| 🟡 Suggestion | fastdeploy/output/token_processor.py:866 | When actual_topk=0 (top_logprobs:0), tokens_lists[i][j] is an empty list, so both row[0] and scores_lists[i][j][0] raise IndexError |
| ❓ Question | custom_ops/gpu_ops/speculate_decoding/draft_model/mtp_save_first_token_with_topk.cc:122 | The PR's Modifications section says "high 16 bits", but the code shifts by 8 bits (so the high 24 bits store max_num_logprobs); the description is inconsistent with the code comment's "high 24 bits" |
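📝 PR convention check
The PR description is structurally complete (Motivation / Modifications / Usage or Command / Accuracy Tests / Checklist), and the title contains the valid tags [Optimization][Speculative Decoding], so it meets the conventions. ✓
Overall assessment
The bit-packing scheme and the batched .tolist() optimization are clearly reasoned, and the changes to the core path are correct. However, with the else 20 cap removed there is no explicit guard on the struct's capacity limit, and the Python side does not handle the actual_topk=0 edge case. Recommend fixing both before merging.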
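The two suggested guards could look roughly like the sketch below. It is an illustration only, written in Python for both sides even though the capacity check would live in the C++ op. `check_max_num_logprobs` and `first_slot` are hypothetical helper names, and the value given to `SPEC_LOGPROB_K` here is an assumption; the real constant is defined in the custom op sources.

```python
from typing import List, Optional, Tuple

SPEC_LOGPROB_K = 20  # assumed value, for illustration only


def check_max_num_logprobs(max_num_logprobs: int,
                           capacity: int = SPEC_LOGPROB_K + 1) -> int:
    """Reject a loop bound that would exceed the per-token slot capacity."""
    if not 0 <= max_num_logprobs <= capacity:
        raise ValueError(
            f"max_num_logprobs={max_num_logprobs} exceeds capacity {capacity}"
        )
    return max_num_logprobs


def first_slot(tokens_row: List[int],
               scores_row: List[float]) -> Optional[Tuple[int, float]]:
    """Return the first (token, score) pair, or None when the row is empty.

    An empty row is what slicing with actual_topk == 0 produces, so indexing
    [0] directly would raise IndexError.
    """
    if not tokens_row or not scores_row:
        return None
    return tokens_row[0], scores_row[0]


print(check_max_num_logprobs(21))
print(first_slot([], []))
print(first_slot([5, 7], [-0.1, -0.2]))
```

Whether an out-of-range request should be rejected or clamped is a design choice left to the PR author; the sketch rejects it so the problem is visible to the caller.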
|
/skip-ci pre_ce_test |
…) (#7884) * opt mtp logprob * fix * fix test and log * fix bits * Adapt logprobs baseline update in test_ernie_21b_mtp_multistep.py --------- Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
Motivation
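Improves the performance of MTP speculative decoding with logprobs (top_logprobs:0) by about 10%.
The hard-coded cap of 20 on max_logprobs is replaced by a value computed dynamically from the actual requests, which saves one topk computation when top_logprobs:0. message_flag (low 8 bits) and max_num_logprobs (high 24 bits) are packed into the message header meta[1], so the receiver copies only the logprob slots that are actually valid. This avoids the wasted cost of writing and reading all SPEC_LOGPROB_K+1 slots when top_logprobs=0.
Modifications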
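custom_ops/gpu_ops/speculate_decoding/speculate_save_output_with_topk.cc / draft_model/mtp_save_first_token_with_topk.cc: pack message_flag (low 8 bits) and max_num_logprobs (high 16 bits) into meta[1]; the inner loop's upper bound changes from SPEC_LOGPROB_K+1 to max_num_logprobs, so only the columns actually needed are written
custom_ops/gpu_ops/speculate_decoding/speculate_get_output_with_topk.cc: the receiver unpacks actual_topk = (meta[1] >> 8) & 0xFFFF, and the copy loop's upper bound becomes actual_topk
fastdeploy/worker/gpu_model_runner.py: under speculative decoding, remove the hard-coded cap of 20 on max_logprobs and compute it dynamically from the actual requests
fastdeploy/output/token_processor.py: unpack mtype/actual_topk and slice tokens/scores[:, :, :actual_topk]; in the hot path, use batched .tolist() to reduce per-element Python overhead
tests/output/: update tests to reflect the meta[1] packing format
Usage or Command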
N/A
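The meta[1] layout described under Modifications can be sketched in Python as follows. This is a minimal illustration, not the code from this PR: `pack_meta1`, `unpack_meta1`, and `slice_topk` are hypothetical helper names, the token ids are made up, and the 16-bit mask follows the unpack expression quoted in the Modifications list (the Motivation text mentions 24 bits, so the real field width may differ).

```python
from typing import List, Tuple

FLAG_BITS = 8        # message_flag occupies the low 8 bits of meta[1]
FLAG_MASK = 0xFF
TOPK_MASK = 0xFFFF   # assumption: taken from the quoted unpack expression


def pack_meta1(message_flag: int, max_num_logprobs: int) -> int:
    """Pack both fields into one integer (hypothetical helper)."""
    if not 0 <= message_flag <= FLAG_MASK:
        raise ValueError("message_flag does not fit in 8 bits")
    if not 0 <= max_num_logprobs <= TOPK_MASK:
        raise ValueError("max_num_logprobs does not fit in the assumed field")
    return (max_num_logprobs << FLAG_BITS) | message_flag


def unpack_meta1(meta1: int) -> Tuple[int, int]:
    """Return (message_flag, actual_topk), mirroring (meta[1] >> 8) & 0xFFFF."""
    return meta1 & FLAG_MASK, (meta1 >> FLAG_BITS) & TOPK_MASK


def slice_topk(rows: List[List[List[int]]], actual_topk: int) -> List[List[List[int]]]:
    """Pure-Python equivalent of rows[:, :, :actual_topk] on nested lists."""
    return [[slots[:actual_topk] for slots in row] for row in rows]


meta1 = pack_meta1(message_flag=3, max_num_logprobs=2)
flag, actual_topk = unpack_meta1(meta1)
tokens = [[[11, 12, 13, 14], [21, 22, 23, 24]]]  # made-up token ids
print(flag, actual_topk, slice_topk(tokens, actual_topk))
```

Keeping the flag in the low byte means a receiver that only cares about the message type can still read it with a single mask, while the slot count rides along in the same header word.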
Accuracy Tests
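The logprob results of tests/e2e/test_ernie_21b_mtp_multistep.py differ from the baseline, for the following reason:
When k differs, the relative order of equal values is not deterministic; this is fundamentally the non-determinism of GPU parallel reduction.
When the third- and fourth-largest values are equal, paddle.topk(logprobs, 3, axis=-1)[1][:3] and paddle.topk(logprobs, 20, axis=-1)[1][:3] can return different indices. This does not affect the correctness of the results, but the baseline needs to be updated.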
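The tie effect can be reproduced deterministically in plain Python. The sketch below does not call paddle.topk and says nothing about its internals; it only shows that two equally valid tie-breaking orders yield different indices but identical values. `topk_indices` and `same_topk_values` are hypothetical helpers, and the logprob values are made up.

```python
from typing import List


def topk_indices(values: List[float], k: int,
                 prefer_last_on_tie: bool = False) -> List[int]:
    """Indices of the k largest values; the flag only changes tie order."""
    order = list(range(len(values)))
    if prefer_last_on_tie:
        order.reverse()
    # sorted() is stable, so tied values keep the order they had in `order`.
    return sorted(order, key=lambda i: values[i], reverse=True)[:k]


def same_topk_values(values: List[float],
                     idx_a: List[int], idx_b: List[int]) -> bool:
    """Tie-insensitive comparison: compare selected values, not indices."""
    return [values[i] for i in idx_a] == [values[i] for i in idx_b]


logprobs = [-0.1, -0.5, -1.2, -1.2, -3.0]  # positions 2 and 3 are tied
first = topk_indices(logprobs, 3)
second = topk_indices(logprobs, 3, prefer_last_on_tie=True)
print(first, second, same_topk_values(logprobs, first, second))
```

A baseline check that compares the selected logprob values instead of token indices would be insensitive to this kind of tie, though updating the baseline as done here is the simpler fix.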
Checklist
[FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
pre-commit before commit.
release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.