[Optimization][Speculative Decoding]opt mtp logprob by Sunny-bot1 · Pull Request #7883 · PaddlePaddle/FastDeploy

Sunny-bot1 · 2026-05-21T12:00:57Z

Motivation

MTP speculative decoding with logprob enabled (top_logprobs: 0) is about 10% faster.

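The hard-coded upper limit of 20 on max_logprobs is removed and the value is now computed dynamically from the actual requests, which saves one topk computation when top_logprobs is 0.
message_flag (low 8 bits) and max_num_logprobs (high 24 bits) are packed into the message header field meta[1], so the receiver copies only the logprob slots that are actually valid. This avoids the wasted cost of writing and reading all SPEC_LOGPROB_K+1 slots when top_logprobs=0.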
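
A minimal sketch of the meta[1] layout described above, assuming an 8-bit flag in the low bits and the slot count in the bits above it. The helper names `pack_meta` and `unpack_meta` and the sample values are hypothetical; the real packing lives in the C++ custom ops and may differ in detail.

```python
from typing import Tuple

FLAG_BITS = 8
FLAG_MASK = (1 << FLAG_BITS) - 1   # 0xFF, low 8 bits hold message_flag
TOPK_MASK = (1 << 24) - 1          # 0xFFFFFF, next 24 bits hold max_num_logprobs


def pack_meta(message_flag: int, max_num_logprobs: int) -> int:
    """Pack both fields into one integer (hypothetical helper)."""
    if not 0 <= message_flag <= FLAG_MASK:
        raise ValueError("message_flag does not fit in 8 bits")
    if not 0 <= max_num_logprobs <= TOPK_MASK:
        raise ValueError("max_num_logprobs does not fit in 24 bits")
    return (max_num_logprobs << FLAG_BITS) | message_flag


def unpack_meta(meta1: int) -> Tuple[int, int]:
    """Return (message_flag, max_num_logprobs) from a packed value."""
    return meta1 & FLAG_MASK, (meta1 >> FLAG_BITS) & TOPK_MASK


packed = pack_meta(message_flag=3, max_num_logprobs=1)
print(hex(packed))          # 0x103
print(unpack_meta(packed))  # (3, 1)
```

For slot counts below 65536, masking with 0xFFFF or 0xFFFFFF after the shift yields the same value, so a 16-bit and a 24-bit reading of the field agree for any realistic top_logprobs.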

Modifications

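custom_ops/gpu_ops/speculate_decoding/speculate_save_output_with_topk.cc / draft_model/mtp_save_first_token_with_topk.cc: pack message_flag (low 8 bits) and max_num_logprobs (high 16 bits) into meta[1]; change the inner loop's upper bound from SPEC_LOGPROB_K+1 to max_num_logprobs so that only the columns actually needed are written
custom_ops/gpu_ops/speculate_decoding/speculate_get_output_with_topk.cc: the receiver unpacks actual_topk = (meta[1] >> 8) & 0xFFFF, and the copy loop's upper bound becomes actual_topk
fastdeploy/worker/gpu_model_runner.py: under speculative decoding, remove the hard-coded cap of 20 on max_logprobs and compute it dynamically from the actual requests
fastdeploy/output/token_processor.py: unpack mtype/actual_topk and slice tokens/scores[:, :, :actual_topk]; use batched .tolist() on the hot path to reduce per-element overhead on the Python side
tests/output/: update the tests to reflect the meta[1] packing format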
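
The receiver-side change can be pictured with a small pure-Python sketch. It assumes a flat buffer in which every token owns a fixed stride of SPEC_LOGPROB_K+1 slots, and that SPEC_LOGPROB_K is 20 (inferred from the old cap, not confirmed by this PR text). `copy_valid_slots` is a hypothetical name, not a FastDeploy function.

```python
from typing import List

SPEC_LOGPROB_K = 20            # assumed capacity, inferred from the old cap of 20
STRIDE = SPEC_LOGPROB_K + 1    # slots reserved per token in the flat buffer


def copy_valid_slots(flat: List[int], num_tokens: int, actual_topk: int) -> List[List[int]]:
    """Copy only the first actual_topk slots of each token's stride."""
    if not 0 <= actual_topk <= STRIDE:
        raise ValueError("actual_topk exceeds the per-token capacity")
    return [
        flat[t * STRIDE : t * STRIDE + actual_topk]
        for t in range(num_tokens)
    ]


# Two tokens, each with a full stride of slots; only one slot per token is valid.
flat = list(range(2 * STRIDE))
print(copy_valid_slots(flat, num_tokens=2, actual_topk=1))  # [[0], [21]]
```

With actual_topk=1 the loop touches 2 slots instead of 42, which is the kind of saving the list above describes for small top_logprobs values.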

Usage or Command

N/A

Accuracy Tests

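The logprob results of tests/e2e/test_ernie_21b_mtp_multistep.py differ from the baseline, for the following reason:

When k differs, the relative order of equal values is not deterministic; this is inherent to the non-determinism of parallel reduction on the GPU.

When the top-3 and top-4 values are equal, paddle.topk(logprobs, 3, axis=-1)[1][:3] and paddle.topk(logprobs, 20, axis=-1)[1][:3] can return different indices. This does not affect correctness, but the baseline needs to be updated.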

>>> logprobs
Tensor(shape=[100], dtype=bfloat16, place=Place(gpu:0), stop_gradient=True,
       [0.99218750, 0.51562500, 0.52734375, 0.20312500, 0.81250000, 0.63671875,
        0.18164062, 0.22265625, 0.73828125, 0.91015625, 0.47851562, 0.62890625,
        0.03686523, 0.42382812, 0.68359375, 0.27929688, 0.70703125, 0.98437500,
        0.81640625, 0.19140625, 0.44726562, 0.36914062, 0.44335938, 0.98437500,
        0.56250000, 0.13476562, 0.97656250, 0.29687500, 0.89453125, 0.21777344,
        0.31445312, 0.10498047, 0.60156250, 0.23632812, 0.92968750, 0.88671875,
        0.61328125, 0.17480469, 0.80468750, 0.28906250, 0.87500000, 0.43359375,
        0.30273438, 0.50000000, 0.40039062, 0.55859375, 0.21972656, 0.41015625,
        0.41015625, 0.72656250, 0.97656250, 0.56640625, 0.25781250, 0.29296875,
        0.30273438, 0.25195312, 0.76171875, 0.03662109, 0.25195312, 0.55468750,
        0.86718750, 0.04736328, 0.35937500, 0.92187500, 0.34179688, 0.81250000,
        0.86718750, 0.58203125, 0.10742188, 0.90625000, 0.03784180, 0.41210938,
        0.57421875, 0.35937500, 0.28515625, 0.49023438, 0.08740234, 0.96484375,
        0.74218750, 0.29687500, 0.28515625, 0.41210938, 0.22460938, 0.76171875,
        0.77734375, 0.99218750, 0.47851562, 0.52734375, 0.75781250, 0.66406250,
        0.31054688, 0.13867188, 0.40820312, 0.53515625, 0.78125000, 0.86718750,
        0.25390625, 0.84375000, 0.35546875, 0.30859375])
>>> paddle.topk(logprobs, 3, axis=-1)[1]
Tensor(shape=[3], dtype=int64, place=Place(gpu:0), stop_gradient=True,
       [85, 0 , 17])
>>> paddle.topk(logprobs, 20, axis=-1)[1][:3]
Tensor(shape=[3], dtype=int64, place=Place(gpu:0), stop_gradient=True,
       [85, 0 , 23])
>>> paddle.topk(logprobs, 20, axis=-1)[1]
Tensor(shape=[20], dtype=int64, place=Place(gpu:0), stop_gradient=True,
       [85, 0 , 23, 17, 50, 26, 77, 34, 63, 9 , 69, 28, 35, 40, 60, 95, 66, 97, 18, 4 ])
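
One way to check that two top-k results differ only by tie order is to compare the values at the returned indices instead of the indices themselves. The sketch below uses the four relevant entries from the tensor above; `same_topk_up_to_ties` is a hypothetical helper and is not part of the test suite in this PR.

```python
from typing import Mapping, Sequence


def same_topk_up_to_ties(
    logprobs: Mapping[int, float],
    idx_a: Sequence[int],
    idx_b: Sequence[int],
) -> bool:
    """True if both index lists select the same values, rank by rank."""
    if len(idx_a) != len(idx_b):
        return False
    return [logprobs[i] for i in idx_a] == [logprobs[i] for i in idx_b]


# Values copied from the tensor above: indices 0 and 85 tie, as do 17 and 23.
logprobs = {0: 0.9921875, 17: 0.984375, 23: 0.984375, 85: 0.9921875}

print(same_topk_up_to_ties(logprobs, [85, 0, 17], [85, 0, 23]))  # True
```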

Checklist

Add at least a tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot · 2026-05-21T12:01:05Z

Thanks for your contribution!

PaddlePaddle-bot · 2026-05-21T12:17:05Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-26 17:44:27

The CI report is generated from the following code (updated every 30 minutes):

PR commit: cf9aaa8
Merge base: b336db7 (branch: develop)
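View full diff
CI details

1 Task overview

Currently no Required task is failing, running, or waiting. 7/10 Required tasks passed; the remaining Required tasks do not appear in the failed/running/waiting lists (they may be in a skipped/neutral state). In addition, 2 Optional tasks failed and 1 Optional task is waiting; these are for reference only.

Total runs (reruns)	Total tasks	✅ Passed	❌ Failed	⏳ Running	⏸️ Waiting	Skipped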
78(32)	46	37	2	0	1	6

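Note: action_required workflows are not counted in the task statistics in the table above; currently action_required_count=0.

2 Task status summary

2.1 Required tasks: 7/10 passed

Required tasks block merging, and their failures should be handled first. In this round no Required task failed, is running, or is waiting.

Status	Task	Duration	Root cause	Suggested fix	Log	Rerun
✅	7 Required tasks passed	-	-	-	-	-
⏭️	Remaining 3 Required tasks are not in a failed state	-	skipped/neutral	No deep analysis needed	-	-

2.2 Optional tasks: 30/36 passed

Optional tasks do not block merging; failures are for reference only, and per the process no deep log analysis is performed.

Status	Task	Duration	Log	Rerun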
❌	`Run iluvatar Tests / run_iluvatar_cases`	2m20s	Job	-
❌	`Trigger Jenkins for PR`	19m53s	Job	-
⏸️	`CI_HPU`	-	-	-
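✅	Remaining 30 Optional tasks passed	-	-	-

3 Failure details (Required only)

No Required task failed, so deep failure analysis was not invoked in this round.

Optional failure summary (non-blocking, for reference only)

Run iluvatar Tests / run_iluvatar_cases: the quick status shows that the custom container on the self-hosted runner failed to execute, which points to an environment/runner issue. If it matters, rerun the job or contact the Iluvatar CI maintainers.
Trigger Jenkins for PR: the quick status shows "Docker build failed", a failure in the Optional METAX Jenkins trigger chain. If it matters, rerun the job or contact the METAX CI maintainers.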

codecov-commenter · 2026-05-21T12:59:38Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@b336db7).

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #7883   +/-   ##
==========================================
  Coverage           ?   63.72%           
==========================================
  Files              ?      467           
  Lines              ?    65039           
  Branches           ?     8890           
==========================================
  Hits               ?    41444           
  Misses             ?    20790           
  Partials           ?     2805

Flag	Coverage Δ
GPU	`72.91% <100.00%> (?)`
XPU	`0.00% <0.00%> (?)`


…into opt_mtp_logprob

PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-05-26 16:01:44

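📋 Review summary

PR overview: performance optimization for the MTP speculative decoding + logprob scenario. It computes max_logprobs dynamically (removing the hard-coded cap of 20) and packs message_flag/max_num_logprobs into meta[1], reducing the cost of writing and reading unused topk slots.

Scope of changes: custom_ops/gpu_ops/speculate_decoding/, fastdeploy/output/token_processor.py, fastdeploy/worker/gpu_model_runner.py

Impact tags: [Speculative Decoding] [OP] [DataProcessor]

Issues

Level	File	Summary
🟡 Suggestion	`custom_ops/gpu_ops/speculate_decoding/speculate_save_output_with_topk.cc:160`	After the `else 20` cap was removed, no `max_num_logprobs <= SPEC_LOGPROB_K+1` guard was added; when a user requests `top_logprobs > SPEC_LOGPROB_K`, the loop writes out of bounds
🟡 Suggestion	`fastdeploy/output/token_processor.py:866`	When `actual_topk=0` (top_logprobs:0), `tokens_lists[i][j]` is an empty list, so both `row[0]` and `scores_lists[i][j][0]` raise IndexError
❓ Question	`custom_ops/gpu_ops/speculate_decoding/draft_model/mtp_save_first_token_with_topk.cc:122`	The PR's Modifications section says "high 16 bits", whereas the code shifts by 8 bits (so the high 24 bits store `max_num_logprobs`); the description does not match the "high 24 bits" comment

📝 PR convention check

The PR description is structurally complete (Motivation / Modifications / Usage or Command / Accuracy Tests / Checklist), and the title contains the valid tags [Optimization][Speculative Decoding]. It meets the conventions. ✓

Overall assessment

The bit-packing scheme and the batched .tolist() optimization are clearly reasoned, and the core path changes are correct. However, after the else 20 cap was removed there is no explicit guard on the struct's capacity bound, and the Python side lacks boundary handling for actual_topk=0. Recommend merging after these are fixed.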
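
The two guards suggested above could look roughly like the following Python sketch. It assumes SPEC_LOGPROB_K is 20 (inferred from the removed cap, not confirmed here); `check_max_num_logprobs` and `first_token_and_score` are hypothetical helpers, not code from this PR.

```python
from typing import List, Optional, Tuple

SPEC_LOGPROB_K = 20  # assumed struct capacity, inferred from the removed cap


def check_max_num_logprobs(max_num_logprobs: int) -> int:
    """Reject slot counts that would overflow the fixed-size message struct."""
    if not 0 <= max_num_logprobs <= SPEC_LOGPROB_K + 1:
        raise ValueError(
            f"max_num_logprobs={max_num_logprobs} exceeds capacity {SPEC_LOGPROB_K + 1}"
        )
    return max_num_logprobs


def first_token_and_score(
    token_row: List[int], score_row: List[float]
) -> Optional[Tuple[int, float]]:
    """Return the leading (token, score) pair, or None when the row is empty."""
    if not token_row or not score_row:
        return None  # actual_topk == 0: nothing to index, avoid IndexError
    return token_row[0], score_row[0]


print(first_token_and_score([], []))               # None
print(first_token_and_score([5, 6], [-0.1, -2.0])) # (5, -0.1)
```

Whether an oversized request should be rejected, as here, or silently clamped is a design choice for the maintainers; the sketch only shows where the bound would be enforced.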

EmmonsCurse · 2026-05-26T09:06:03Z

/skip-ci pre_ce_test
/skip-ci build_xpu

…) (#7884) * opt mtp logprob * fix * fix test and log * fix bits * Adapt logprobs baseline update in test_ernie_21b_mtp_multistep.py --------- Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>

opt mtp logprob

2aa8004

Sunny-bot1 had a problem deploying to Metax_ci May 21, 2026 12:01 — with GitHub Actions Failure

Sunny-bot1 changed the title ~~[Optimization][[Speculative Decoding]]opt mtp logprob~~ [Optimization][Speculative Decoding]opt mtp logprob May 21, 2026