[Optimization][Speculative Decoding]opt mtp logprob #7883

Merged: Sunny-bot1 merged 7 commits into PaddlePaddle:develop from Sunny-bot1:opt_mtp_logprob on May 26, 2026

Sunny-bot1 (Collaborator) commented May 21, 2026

Motivation

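MTP speculative decoding with logprob (top_logprobs:0) is about 10% faster.

  1. The hardcoded max_logprobs cap of 20 is removed and the value is now computed dynamically from the actual requests, which saves one topk computation when top_logprobs:0;
  2. message_flag (low 8 bits) and max_num_logprobs (high 24 bits) are packed into the message header field meta[1], so the receiver copies only the logprob slots that are actually valid. This avoids writing and reading all SPEC_LOGPROB_K+1 slots when top_logprobs=0.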
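
The packing in item 2 can be sketched in a few lines of Python. This is only an illustration of the stated layout (flag in the low 8 bits, count in the bits above it); the helper names `pack_meta` and `unpack_meta` are hypothetical and do not appear in the PR, and the real code operates on an integer field in C++ rather than on Python ints.

```python
from typing import Tuple

FLAG_BITS = 8
FLAG_MASK = (1 << FLAG_BITS) - 1      # 0xFF, low 8 bits hold message_flag
COUNT_BITS = 24                       # assumed width, taken from the "high 24 bits" wording
COUNT_MASK = (1 << COUNT_BITS) - 1    # 0xFFFFFF


def pack_meta(message_flag: int, max_num_logprobs: int) -> int:
    """Pack the flag into the low 8 bits and the logprob count above it."""
    if not 0 <= message_flag <= FLAG_MASK:
        raise ValueError("message_flag must fit in 8 bits")
    if not 0 <= max_num_logprobs <= COUNT_MASK:
        raise ValueError("max_num_logprobs must fit in 24 bits")
    return (max_num_logprobs << FLAG_BITS) | message_flag


def unpack_meta(packed: int) -> Tuple[int, int]:
    """Return (message_flag, max_num_logprobs) from a packed value."""
    return packed & FLAG_MASK, (packed >> FLAG_BITS) & COUNT_MASK


if __name__ == "__main__":
    packed = pack_meta(message_flag=3, max_num_logprobs=1)
    print(packed, unpack_meta(packed))
```

Because the receiver learns the count from the header itself, it can bound its copy loop without any extra message field.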

Modifications

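  • custom_ops/gpu_ops/speculate_decoding/speculate_save_output_with_topk.cc / draft_model/mtp_save_first_token_with_topk.cc: pack message_flag (low 8 bits) and max_num_logprobs (high 16 bits) into meta[1]; change the inner loop's upper bound from SPEC_LOGPROB_K+1 to max_num_logprobs so that only the columns actually needed are written
  • custom_ops/gpu_ops/speculate_decoding/speculate_get_output_with_topk.cc: the receiver unpacks actual_topk = (meta[1] >> 8) & 0xFFFF and uses actual_topk as the upper bound of the copy loop
  • fastdeploy/worker/gpu_model_runner.py: under speculative decoding, remove the hardcoded max_logprobs cap of 20 and compute the value dynamically from the actual requests
  • fastdeploy/output/token_processor.py: unpack mtype/actual_topk and slice tokens/scores[:, :, :actual_topk]; use batched .tolist() on the hot path to reduce per-element overhead on the Python side
  • tests/output/: update the tests to reflect the meta[1] packing format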
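
As a rough illustration of the Python-side changes listed above, the sketch below computes a per-batch logprob count and slices the output buffers before a single batched `.tolist()`. It uses NumPy rather than Paddle, and the function names, the buffer shapes, and the value `SPEC_LOGPROB_K = 20` are assumptions made for the example, not details taken from the PR.

```python
from typing import List, Optional, Tuple

import numpy as np

SPEC_LOGPROB_K = 20  # assumed capacity; the PR only names the constant


def dynamic_max_logprobs(requested: List[Optional[int]]) -> Optional[int]:
    """Largest top_logprobs value in the batch, or None if no request wants logprobs."""
    active = [k for k in requested if k is not None]
    return max(active) if active else None


def slice_valid_slots(
    tokens: np.ndarray, scores: np.ndarray, actual_topk: int
) -> Tuple[list, list]:
    """Keep only the first actual_topk columns, then convert once with .tolist()."""
    tokens_lists = tokens[:, :, :actual_topk].tolist()
    scores_lists = scores[:, :, :actual_topk].tolist()
    return tokens_lists, scores_lists


if __name__ == "__main__":
    # Hypothetical buffers: [batch, draft tokens, SPEC_LOGPROB_K + 1 slots].
    tokens = np.arange(2 * 3 * (SPEC_LOGPROB_K + 1)).reshape(2, 3, SPEC_LOGPROB_K + 1)
    scores = np.zeros(tokens.shape, dtype=np.float32)
    tok, sc = slice_valid_slots(tokens, scores, actual_topk=1)
    print(dynamic_max_logprobs([0, None, 3]), tok[1][2], sc[1][2])
```

Converting the sliced array in one call avoids indexing individual tensor elements from Python, which is the per-element overhead the PR description refers to.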

Usage or Command

N/A

Accuracy Tests

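The logprob results of tests/e2e/test_ernie_21b_mtp_multistep.py differ from the baseline, for the following reason:

When k differs, the relative order of equal values is not deterministic; this is essentially the non-determinism of parallel reduction on the GPU.

When the top-3 and top-4 values are equal, paddle.topk(logprobs, 3, axis=-1)[1][:3] and paddle.topk(logprobs, 20, axis=-1)[1][:3] can return different indices. This does not affect the correctness of the results, but the baseline needs to be updated.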

>>> logprobs
Tensor(shape=[100], dtype=bfloat16, place=Place(gpu:0), stop_gradient=True,
       [0.99218750, 0.51562500, 0.52734375, 0.20312500, 0.81250000, 0.63671875,
        0.18164062, 0.22265625, 0.73828125, 0.91015625, 0.47851562, 0.62890625,
        0.03686523, 0.42382812, 0.68359375, 0.27929688, 0.70703125, 0.98437500,
        0.81640625, 0.19140625, 0.44726562, 0.36914062, 0.44335938, 0.98437500,
        0.56250000, 0.13476562, 0.97656250, 0.29687500, 0.89453125, 0.21777344,
        0.31445312, 0.10498047, 0.60156250, 0.23632812, 0.92968750, 0.88671875,
        0.61328125, 0.17480469, 0.80468750, 0.28906250, 0.87500000, 0.43359375,
        0.30273438, 0.50000000, 0.40039062, 0.55859375, 0.21972656, 0.41015625,
        0.41015625, 0.72656250, 0.97656250, 0.56640625, 0.25781250, 0.29296875,
        0.30273438, 0.25195312, 0.76171875, 0.03662109, 0.25195312, 0.55468750,
        0.86718750, 0.04736328, 0.35937500, 0.92187500, 0.34179688, 0.81250000,
        0.86718750, 0.58203125, 0.10742188, 0.90625000, 0.03784180, 0.41210938,
        0.57421875, 0.35937500, 0.28515625, 0.49023438, 0.08740234, 0.96484375,
        0.74218750, 0.29687500, 0.28515625, 0.41210938, 0.22460938, 0.76171875,
        0.77734375, 0.99218750, 0.47851562, 0.52734375, 0.75781250, 0.66406250,
        0.31054688, 0.13867188, 0.40820312, 0.53515625, 0.78125000, 0.86718750,
        0.25390625, 0.84375000, 0.35546875, 0.30859375])
>>> paddle.topk(logprobs, 3, axis=-1)[1]
Tensor(shape=[3], dtype=int64, place=Place(gpu:0), stop_gradient=True,
       [85, 0 , 17])
>>> paddle.topk(logprobs, 20, axis=-1)[1][:3]
Tensor(shape=[3], dtype=int64, place=Place(gpu:0), stop_gradient=True,
       [85, 0 , 23])
>>> paddle.topk(logprobs, 20, axis=-1)[1]
Tensor(shape=[20], dtype=int64, place=Place(gpu:0), stop_gradient=True,
       [85, 0 , 23, 17, 50, 26, 77, 34, 63, 9 , 69, 28, 35, 40, 60, 95, 66, 97, 18, 4 ])
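
In the transcript above, indices 17 and 23 both hold 0.984375, so either is a valid third entry. One way to check that two top-k results agree up to such ties is to compare the selected values instead of the indices. The sketch below uses NumPy and a hypothetical helper `same_topk_up_to_ties`; it is not how the PR's tests work (the PR updates the baseline instead).

```python
import numpy as np


def same_topk_up_to_ties(values: np.ndarray, idx_a, idx_b) -> bool:
    """True if both index lists select the same sequence of values.

    Top-k results are sorted by value, so two correct results that differ
    only in how ties were broken pick identical values position by position.
    """
    idx_a = np.asarray(idx_a)
    idx_b = np.asarray(idx_b)
    if idx_a.shape != idx_b.shape:
        return False
    return bool(np.array_equal(values[idx_a], values[idx_b]))


if __name__ == "__main__":
    # Small stand-in array: positions 1 and 3 tie at 0.984375.
    values = np.array([0.5, 0.984375, 0.25, 0.984375, 0.9921875], dtype=np.float32)
    print(same_topk_up_to_ties(values, [4, 1], [4, 3]))  # tie broken differently
    print(same_topk_up_to_ties(values, [4, 1], [4, 0]))  # genuinely different result
```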

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot Bot commented May 21, 2026

Thanks for your contribution!

@Sunny-bot1 Sunny-bot1 changed the title from "[Optimization][[Speculative Decoding]]opt mtp logprob" to "[Optimization][Speculative Decoding]opt mtp logprob" May 21, 2026

PaddlePaddle-bot commented May 21, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-26 17:44:27

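The CI report is generated from the following code (updated every 30 minutes):

1 Task overview

No Required task is currently failing, running, or waiting. 7/10 Required tasks have passed; the remaining Required tasks do not appear in the failed/running/waiting lists (they may be in a skipped/neutral state). In addition, 2 Optional tasks failed and 1 Optional task is waiting; these are for reference only.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
| --- | --- | --- | --- | --- | --- | --- |
| 78(32) | 46 | 37 | 2 | 0 | 1 | 6 |

Note: action_required workflows are not counted in the table above; currently action_required_count=0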

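2 Task status summary

2.1 Required tasks: 7/10 passed

Required tasks block merging, and their failures should be handled first. No failed, running, or waiting Required tasks were found in this round.

| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
| --- | --- | --- | --- | --- | --- | --- |
| | 7 required tasks passed | - | - | - | - | - |
| ⏭️ | The other 3 required tasks are in a non-failed state | - | skipped/neutral | No deep analysis needed | - | - |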

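2.2 Optional tasks: 30/36 passed

Optional tasks do not block merging; their failures are for reference only, and per the process no deep log analysis is performed.

| Status | Task | Duration | Log | Rerun |
| --- | --- | --- | --- | --- |
| | Run iluvatar Tests / run_iluvatar_cases | 2m20s | Job | - |
| | Trigger Jenkins for PR | 19m53s | Job | - |
| ⏸️ | CI_HPU | - | - | - |
| | The other 30 optional tasks passed | - | - | - |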

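3 Failure details (Required only)

No Required tasks failed, so deep failure analysis was not invoked in this round.

Optional failure summary (non-blocking, for reference only)
  • Run iluvatar Tests / run_iluvatar_cases: the quick status shows that the custom container on the self-hosted runner failed to execute, which points to an environment/runner problem. If it needs attention, rerun it or contact the Iluvatar CI maintainers.
  • Trigger Jenkins for PR: the quick status shows "Docker build failed", a failure in the Optional METAX Jenkins trigger chain. If it needs attention, rerun it or contact the METAX CI maintainers.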

PaddlePaddle-bot

This comment was marked as outdated.

codecov-commenter commented May 21, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@b336db7). Learn more about missing BASE report.

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7883   +/-   ##
==========================================
  Coverage           ?   63.72%           
==========================================
  Files              ?      467           
  Lines              ?    65039           
  Branches           ?     8890           
==========================================
  Hits               ?    41444           
  Misses             ?    20790           
  Partials           ?     2805           
Flag Coverage Δ
GPU 72.91% <100.00%> (?)
XPU 0.00% <0.00%> (?)

qingqing01 previously approved these changes May 26, 2026

@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-05-26 16:01:44

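📋 Review summary

PR overview: Performance optimization for the MTP speculative decoding + logprob scenario. It computes max_logprobs dynamically (removing the hardcoded cap of 20) and packs message_flag/max_num_logprobs into meta[1], reducing the overhead of writing and reading unused topk slots.

Change scope: custom_ops/gpu_ops/speculate_decoding/, fastdeploy/output/token_processor.py, fastdeploy/worker/gpu_model_runner.py

Impact tags: [Speculative Decoding] [OP] [DataProcessor]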

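Issues

| Level | File | Summary |
| --- | --- | --- |
| 🟡 Suggestion | custom_ops/gpu_ops/speculate_decoding/speculate_save_output_with_topk.cc:160 | After the else 20 cap was removed, no max_num_logprobs <= SPEC_LOGPROB_K+1 guard was added; when a user requests top_logprobs > SPEC_LOGPROB_K, the loop writes out of bounds |
| 🟡 Suggestion | fastdeploy/output/token_processor.py:866 | When actual_topk=0 (top_logprobs:0), tokens_lists[i][j] is an empty list, so both row[0] and scores_lists[i][j][0] raise IndexError |
| ❓ Question | custom_ops/gpu_ops/speculate_decoding/draft_model/mtp_save_first_token_with_topk.cc:122 | The PR's Modifications section says "high 16 bits", but the code actually shifts by 8 bits (the high 24 bits store max_num_logprobs), which is inconsistent with the comment "high 24 bits" |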

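📝 PR convention check

The PR description is structurally complete (Motivation / Modifications / Usage or Command / Accuracy Tests / Checklist), and the title contains the valid tags [Optimization][Speculative Decoding]. It meets the conventions. ✓

Overall assessment

The bit-packing scheme and the batched .tolist() optimization are clearly thought out, and the changes to the core path are correct. However, after the else 20 cap was removed there is no explicit guard against the struct's capacity limit, and the Python side lacks boundary handling for actual_topk=0. Fixing both before merging is recommended.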
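
The two suggestions in this review can be pictured with a small Python sketch. It is not the fix applied in the PR; the helper names, the value `SPEC_LOGPROB_K = 20`, and the choice to raise an error rather than clamp are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

SPEC_LOGPROB_K = 20  # assumed capacity; the review only names the constant


def checked_num_logprobs(max_num_logprobs: int, capacity: int = SPEC_LOGPROB_K + 1) -> int:
    """Reject counts that would index past the fixed-size logprob slots."""
    if not 0 <= max_num_logprobs <= capacity:
        raise ValueError(
            f"max_num_logprobs={max_num_logprobs} exceeds capacity {capacity}"
        )
    return max_num_logprobs


def first_token_and_score(
    row_tokens: List[int], row_scores: List[float]
) -> Optional[Tuple[int, float]]:
    """Return the first (token, score) pair, or None when the row is empty."""
    if not row_tokens or not row_scores:
        return None  # actual_topk == 0: nothing to read, so avoid row[0]
    return row_tokens[0], row_scores[0]


if __name__ == "__main__":
    print(checked_num_logprobs(1))
    print(first_token_and_score([], []))
    print(first_token_and_score([42], [-0.5]))
```

Raising on an oversized count surfaces the problem at the sender; clamping to the capacity would be an alternative if silently truncating the returned logprobs were acceptable.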

@EmmonsCurse
Collaborator

/skip-ci pre_ce_test
/skip-ci build_xpu

@Sunny-bot1 Sunny-bot1 merged commit 592b992 into PaddlePaddle:develop May 26, 2026
66 of 74 checks passed
Sunny-bot1 added a commit that referenced this pull request May 26, 2026
…) (#7884)

* opt mtp logprob

* fix

* fix test and log

* fix bits

* Adapt logprobs baseline update in test_ernie_21b_mtp_multistep.py

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>