
[Cherry-Pick][Optimization][Speculative Decoding] opt mtp logprob (#7883) #7884

Merged
Sunny-bot1 merged 7 commits into PaddlePaddle:release/2.6 from Sunny-bot1:opt_mtp_logprob_26 on May 26, 2026

Sunny-bot1 (Collaborator) commented May 21, 2026

Motivation

MTP speculative decoding with logprob (top_logprobs: 0) gets about a 10% performance improvement.

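  1. Remove the upper limit that hard-coded max_logprobs to 20 and compute it dynamically from the actual requests instead, which saves one topk computation when top_logprobs is 0;
  2. Pack message_flag (low 8 bits) and max_num_logprobs (high 24 bits) into the message header meta[1], so that the receiver copies only the logprob slots that are actually valid. This avoids the wasted cost of writing and reading all SPEC_LOGPROB_K+1 slots when top_logprobs=0.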
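
The packing scheme described in item 2 can be sketched in a few lines of Python. This is only an illustration of the bit layout, not the C++ code from the PR; the helper names pack_meta and unpack_meta are hypothetical, and the 24-bit width follows the description above.

```python
FLAG_BITS = 8                      # low 8 bits carry message_flag
FLAG_MASK = (1 << FLAG_BITS) - 1   # 0xFF
COUNT_MASK = (1 << 24) - 1         # upper 24 bits carry max_num_logprobs


def pack_meta(message_flag, max_num_logprobs):
    """Pack the flag and the slot count into one 32-bit header word."""
    return (message_flag & FLAG_MASK) | ((max_num_logprobs & COUNT_MASK) << FLAG_BITS)


def unpack_meta(meta):
    """Return (message_flag, actual_topk) from a packed header word."""
    return meta & FLAG_MASK, (meta >> FLAG_BITS) & COUNT_MASK


packed = pack_meta(3, 1)
print(hex(packed), unpack_meta(packed))  # 0x103 (3, 1)
```

Because the count travels in the header, the receiver learns how many slots per token are meaningful before it touches the payload.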

Modifications

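  • custom_ops/gpu_ops/speculate_decoding/speculate_save_output_with_topk.cc / draft_model/mtp_save_first_token_with_topk.cc: pack message_flag (low 8 bits) and max_num_logprobs (high 16 bits) into meta[1]; change the inner loop bound from SPEC_LOGPROB_K+1 to max_num_logprobs so that only the columns actually needed are written
  • custom_ops/gpu_ops/speculate_decoding/speculate_get_output_with_topk.cc: the receiver unpacks actual_topk = (meta[1] >> 8) & 0xFFFF and uses actual_topk as the copy loop bound
  • fastdeploy/worker/gpu_model_runner.py: under speculative decoding, remove the upper limit that hard-coded max_logprobs to 20 and compute it dynamically from the actual requests
  • fastdeploy/output/token_processor.py: unpack mtype/actual_topk and slice tokens/scores[:, :, :actual_topk]; use batched .tolist() in the hot path to reduce per-element overhead on the Python side
  • tests/output/: update the tests to reflect the packed meta[1] format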
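
The effect of bounding the copy loop by actual_topk can be sketched as follows. This is a simplified model in plain Python, assuming the message buffer keeps a fixed stride of SPEC_LOGPROB_K+1 slots per token and that SPEC_LOGPROB_K is 20; both the layout and the helper name copy_valid_slots are assumptions for illustration, not the PR's C++ code.

```python
SPEC_LOGPROB_K = 20  # assumed value, matching the old hard-coded cap of 20


def copy_valid_slots(msg_buf, num_tokens, actual_topk, stride=SPEC_LOGPROB_K + 1):
    """Copy only the first actual_topk slots of each token's row.

    msg_buf is a flat list in which every token owns `stride` slots;
    slots beyond actual_topk were never written and are skipped.
    """
    rows = []
    for t in range(num_tokens):
        base = t * stride
        rows.append(msg_buf[base:base + actual_topk])
    return rows


buf = list(range(2 * (SPEC_LOGPROB_K + 1)))  # two tokens, 21 slots each
print(copy_valid_slots(buf, 2, 1))  # [[0], [21]]
```

With one valid slot per token the loop touches 2 entries instead of 42, which is where the saving comes from when few or no top logprobs are requested.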

Usage or Command

N/A

Accuracy Tests

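The logprob results of tests/e2e/test_ernie_21b_mtp_multistep.py differ from the baseline, for the following reason:

When k differs, the relative order of equal values is not deterministic; this is essentially the nondeterminism of parallel reduction on the GPU.

When the top-3 and top-4 values are equal, paddle.topk(logprobs, 3, axis=-1)[1][:3] and paddle.topk(logprobs, 20, axis=-1)[1][:3] can return different indices. This does not affect the correctness of the result, but the baseline needs to be updated.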

>>> logprobs
Tensor(shape=[100], dtype=bfloat16, place=Place(gpu:0), stop_gradient=True,
       [0.99218750, 0.51562500, 0.52734375, 0.20312500, 0.81250000, 0.63671875,
        0.18164062, 0.22265625, 0.73828125, 0.91015625, 0.47851562, 0.62890625,
        0.03686523, 0.42382812, 0.68359375, 0.27929688, 0.70703125, 0.98437500,
        0.81640625, 0.19140625, 0.44726562, 0.36914062, 0.44335938, 0.98437500,
        0.56250000, 0.13476562, 0.97656250, 0.29687500, 0.89453125, 0.21777344,
        0.31445312, 0.10498047, 0.60156250, 0.23632812, 0.92968750, 0.88671875,
        0.61328125, 0.17480469, 0.80468750, 0.28906250, 0.87500000, 0.43359375,
        0.30273438, 0.50000000, 0.40039062, 0.55859375, 0.21972656, 0.41015625,
        0.41015625, 0.72656250, 0.97656250, 0.56640625, 0.25781250, 0.29296875,
        0.30273438, 0.25195312, 0.76171875, 0.03662109, 0.25195312, 0.55468750,
        0.86718750, 0.04736328, 0.35937500, 0.92187500, 0.34179688, 0.81250000,
        0.86718750, 0.58203125, 0.10742188, 0.90625000, 0.03784180, 0.41210938,
        0.57421875, 0.35937500, 0.28515625, 0.49023438, 0.08740234, 0.96484375,
        0.74218750, 0.29687500, 0.28515625, 0.41210938, 0.22460938, 0.76171875,
        0.77734375, 0.99218750, 0.47851562, 0.52734375, 0.75781250, 0.66406250,
        0.31054688, 0.13867188, 0.40820312, 0.53515625, 0.78125000, 0.86718750,
        0.25390625, 0.84375000, 0.35546875, 0.30859375])
>>> paddle.topk(logprobs, 3, axis=-1)[1]
Tensor(shape=[3], dtype=int64, place=Place(gpu:0), stop_gradient=True,
       [85, 0 , 17])
>>> paddle.topk(logprobs, 20, axis=-1)[1][:3]
Tensor(shape=[3], dtype=int64, place=Place(gpu:0), stop_gradient=True,
       [85, 0 , 23])
>>> paddle.topk(logprobs, 20, axis=-1)[1]
Tensor(shape=[20], dtype=int64, place=Place(gpu:0), stop_gradient=True,
       [85, 0 , 23, 17, 50, 26, 77, 34, 63, 9 , 69, 28, 35, 40, 60, 95, 66, 97, 18, 4 ])
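
In the session above, indices 17 and 23 both hold 0.984375, so either one is a valid third entry. A comparison that tolerates such ties can check the selected values rather than the indices. The helper below is a hypothetical sketch in plain Python, not something the PR adds (the PR updates the baseline instead).

```python
def topk_equivalent(values, idx_a, idx_b):
    """True if two top-k index lists pick the same values in the same order.

    Indices may differ where values tie, which is acceptable.
    """
    if len(idx_a) != len(idx_b):
        return False
    return [values[i] for i in idx_a] == [values[i] for i in idx_b]


# Small stand-in for the tensor above: positions 2 and 3 tie for third place.
values = [0.9921875, 0.5, 0.984375, 0.984375, 0.9921875]
print(topk_equivalent(values, [4, 0, 2], [4, 0, 3]))  # True
print(topk_equivalent(values, [4, 0, 2], [4, 0, 1]))  # False
```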

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.


paddle-bot Bot commented May 21, 2026

Thanks for your contribution!



codecov-commenter commented May 21, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (release/2.6@e7a02e2). Learn more about missing BASE report.

Additional details and impacted files
@@              Coverage Diff               @@
##             release/2.6    #7884   +/-   ##
==============================================
  Coverage               ?   72.46%           
==============================================
  Files                  ?      382           
  Lines                  ?    54470           
  Branches               ?     8522           
==============================================
  Hits                   ?    39474           
  Misses                 ?    12228           
  Partials               ?     2768           
Flag Coverage Δ
GPU 72.46% <100.00%> (?)



PaddlePaddle-bot commented May 22, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-26 16:01:44

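The CI report is generated from the following code (updated every 30 minutes):


1 Task overview

Required CI has not finished yet: 1 required job has failed and 5 required jobs are waiting or running. Complete the Approval first, then wait for the main tests and the remaining required jobs to finish before deciding whether the PR can be merged.

| Total runs (reruns) | Total jobs | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
| --- | --- | --- | --- | --- | --- | --- |
| 35 (0) | 35 | 27 | 2 | 4 | 2 | 0 |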

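2 Job status summary

Note on the Log column: failed jobs link directly to the log; running jobs link to the Job page.

2.1 Required jobs: 4/10 passed

Required jobs block merging; failures must be handled first.

| Status | Job | Duration | Root cause | Suggested fix | Log | Rerun |
| --- | --- | --- | --- | --- | --- | --- |
| ❌ | Approval | 18s | Approval required | Approve manually | Job | - |
| ⏳ | Extracted partial CE model tasks to run in CI. / run_ce_cases | - | Running | - | Job | - |
| ⏳ | Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | - | Running | - | Job | - |
| ⏳ | xpu_4cards_case_test / run_xpu_4cards_cases | - | Running | - | Job | - |
| ⏳ | xpu_8cards_case_test / run_xpu_8cards_cases | - | Running | - | Job | - |
| ⏸️ | Run Four Cards Tests / run_4_cards_tests | - | Waiting | - | - | - |
| ✅ | The other 4 required jobs passed | - | - | - | - | - |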

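2.2 Optional jobs: 23/25 passed

Optional jobs do not block merging; failures are for reference only.

| Status | Job | Duration | Log | Rerun |
| --- | --- | --- | --- | --- |
| ❌ | Run iluvatar Tests / run_iluvatar_cases | 1m51s | Job | - |
| ⏸️ | CI_HPU | - | - | - |
| ✅ | The other 23 optional jobs passed | - | - | - |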

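3 Failure details (required only)

Approval: manual approval required (confidence: high)
  • Root cause summary: this job requires manual Approval; CI continues only after the approval is completed.
  • Suggested fix summary: approve manually, then wait for the remaining required CI to continue running.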

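4 Code and log context check for this round

  • Read the PR change summary and the CI quick status; this round has no required test or build failure logs that need in-depth analysis.
  • Spot-checked the change context as required: fastdeploy/output/token_processor.py, custom_ops/gpu_ops/speculate_decoding/speculate_save_output_with_topk.cc, custom_ops/gpu_ops/speculate_decoding/speculate_get_output_with_topk.cc, and searched for references to meta[1] / actual_topk / max_logprobs.
  • Conclusion: the required failure currently blocking the merge is the approval status, not a test failure triggered by the PR code; the main test job Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage is still running.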

qingqing01 previously approved these changes May 26, 2026
PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-05-26 16:00:56

📋 Review summary

PR overview: by bit-packing message_flag (low 8 bits) and max_num_logprobs (high 24 bits) into meta[1], and by removing the upper limit that hard-coded max_logprobs to 20 under speculative decoding, the PR achieves about a 10% performance improvement for MTP logprob with top_logprobs: 0.

Scope of changes: custom_ops/gpu_ops/speculate_decoding/ (3 C++ files), fastdeploy/output/token_processor.py, fastdeploy/worker/gpu_model_runner.py, tests

Affected area tags: [Speculative Decoding] [OP] [DataProcessor]

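Issues

| Severity | File | Summary |
| --- | --- | --- |
| 🟡 Suggestion | mtp_save_first_token_with_topk.cc:125 | message_flag is not masked to its low 8 bits before packing |
| ❓ Question | speculate_get_output_with_topk.cc:81 | Is the actual_topk=0 edge case guarded? |
| ❓ Question | tests/output/test_process_batch_output.py:213 | Bit-width comments disagree (C++ says 24 bits, the test says 16 bits) |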

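📝 PR convention check

The title contains two official tags ([Optimization] + [Speculative Decoding]), whereas §D1 requires the Cherry-Pick format to contain only one official tag. Consider reducing it to the single, more specific tag.

Suggested title (can be copied directly):

  • [Cherry-Pick][Speculative Decoding] opt mtp logprob (#7883)

Overall assessment

The optimization approach is clear and the bit-packing design is sound. Changing the source tensor stride from SPEC_LOGPROB_K+1 to max_num_logprobs is consistent with the upstream tensor allocation. The actual_topk=0 edge case and the message_flag truncation need to be confirmed or fixed by the author before the PR can be merged.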

// Pack message_flag (low 8 bits) and max_num_logprobs (high 24 bits) into
// meta[1]. Receiver unpacks both to avoid reading unused topk slots.
int max_num_logprobs = logprob_token_ids.shape()[1];
msg_sed.meta[1] = message_flag | (max_num_logprobs << 8);

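🟡 Suggestion: message_flag is not masked to its low bits before packing

message_flag | (max_num_logprobs << 8) assumes message_flag < 256 but never masks it explicitly. mtype currently takes only the values 3 and 4, so this is safe in practice; but if a future flag value is ≥ 256, its high bits would corrupt max_num_logprobs and the receiver would unpack the wrong value.

Suggest adding a defensive mask:

msg_sed.meta[1] = (message_flag & 0xFF) | (max_num_logprobs << 8);

The same applies at the corresponding place in speculate_save_output_with_topk.cc.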
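
The hazard can be reproduced with plain integers. The sketch below mirrors the two C++ expressions in Python; the flag value 0x103 is hypothetical (current flags are 3 and 4) and the function names are made up for illustration.

```python
def pack_unmasked(flag, count):
    """Mirror of message_flag | (max_num_logprobs << 8)."""
    return flag | (count << 8)


def pack_masked(flag, count):
    """Mirror of (message_flag & 0xFF) | (max_num_logprobs << 8)."""
    return (flag & 0xFF) | (count << 8)


def unpack_count(meta):
    """Receiver side: the count lives above the low 8 bits."""
    return meta >> 8


# With today's flag values both variants agree.
print(unpack_count(pack_unmasked(3, 4)), unpack_count(pack_masked(3, 4)))  # 4 4

# A hypothetical flag >= 256 leaks into the count unless it is masked.
print(unpack_count(pack_unmasked(0x103, 4)))  # 5
print(unpack_count(pack_masked(0x103, 4)))    # 4
```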

// Unpack message_flag (low 8 bits) and actual_topk (high 24 bits) from
// meta[1]. Keep packed value; Python unpacks message_flag and actual_topk.
output_tokens_data[1] = (int64_t)msg_rcv.meta[1];
output_tokens_data[2] = (int64_t)msg_rcv.meta[2];

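❓ Question: is the actual_topk = 0 edge case guarded?

When a request with top_logprobs=0 enters logprobs_reqs, max_logprobs = max([0]) = 0, so max_num_logprobs = 0; the sender's inner loop does not run and the sampled token is never written into the message struct.

On the receiver, actual_topk = 0, so the copy loop does not run either. Once the data reaches the Python side:

tokens[:, :, :0]  # shape=[batch, MAX_DRAFT_TOKENS, 0]
token_ids = [row[0] for row in tokens_lists[i][:accept_num[i]]]  # IndexError!

Please confirm:

  1. Whether requests with top_logprobs=0 can enter logprobs_reqs (if not, there is no problem)
  2. If they can, either the C++ side must guarantee max_num_logprobs >= 1 (so that at least the sampled token is written), or the Python side must handle the actual_topk == 0 branch specially.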
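
The failure mode and one possible Python-side guard can be shown with nested lists standing in for the sliced tensor. The function first_token_ids is hypothetical and only illustrates the second option; whether returning an empty list is the right behavior depends on how the caller uses it, which the thread does not settle.

```python
def first_token_ids(rows, accept_num):
    """Return column 0 (the sampled token) of each accepted row.

    If no columns were transmitted (actual_topk == 0) the rows are empty,
    so return [] instead of indexing into them.
    """
    accepted = rows[:accept_num]
    if not accepted or not accepted[0]:
        return []
    return [row[0] for row in accepted]


print(first_token_ids([[11, 5], [12, 7]], 2))  # [11, 12]
print(first_token_ids([[], []], 2))            # []

# The unguarded comprehension fails on empty rows:
try:
    [row[0] for row in [[], []][:2]]
except IndexError as exc:
    print("IndexError:", exc)
```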

@@ -211,8 +211,9 @@ def test_speculative_decoding_use_logprobs(self):

# stop_flag
processor.output_tokens[0, 0].set_tensor(paddle.to_tensor(2))

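❓ Question: bit-width comments are inconsistent

The comment here says actual_topk (high 16 bits), while the comments on the C++ side (mtp_save_first_token_with_topk.cc, speculate_save_output_with_topk.cc, speculate_get_output_with_topk.cc) all say high 24 bits.

The actual implementation uses >> 8 on an int32, which takes the high 24 bits. Suggest unifying the comments as high 24 bits so that maintainers do not misjudge the range of max_num_logprobs.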
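
The practical difference between the two readings only shows up for large counts. A small sketch with hypothetical helper names, comparing a 16-bit mask with a 24-bit mask after the shift:

```python
def unpack_16(meta):
    """Count read as 16 bits above the flag byte."""
    return (meta >> 8) & 0xFFFF


def unpack_24(meta):
    """Count read as 24 bits above the flag byte."""
    return (meta >> 8) & 0xFFFFFF


small = 3 | (20 << 8)      # count fits in 16 bits
large = 3 | (70000 << 8)   # count needs more than 16 bits

print(unpack_16(small), unpack_24(small))  # 20 20
print(unpack_16(large), unpack_24(large))  # 4464 70000
```

For counts below 65536 both readings agree, so the mismatch is harmless for realistic top_logprobs values but still worth documenting consistently.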

@Sunny-bot1 Sunny-bot1 merged commit c52b063 into PaddlePaddle:release/2.6 May 26, 2026
35 of 38 checks passed