[Cherry-Pick][Optimization][Speculative Decoding]opt mtp logprob (#7883) by Sunny-bot1 · Pull Request #7884 · PaddlePaddle/FastDeploy

Sunny-bot1 · 2026-05-21T12:05:09Z

Motivation

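MTP speculative decoding with logprob (top_logprobs:0) is about 10% faster.

The hard-coded upper bound of 20 on max_logprobs is removed and the value is now computed dynamically from the actual requests, which saves one topk computation when top_logprobs:0;
message_flag (low 8 bits) and max_num_logprobs (high 24 bits) are packed into the message header meta[1], so the receiver copies only the logprob slots that are actually valid. This avoids the wasted cost of writing and reading all SPEC_LOGPROB_K+1 slots when top_logprobs=0.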

Modifications

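custom_ops/gpu_ops/speculate_decoding/speculate_save_output_with_topk.cc / draft_model/mtp_save_first_token_with_topk.cc: pack message_flag (low 8 bits) and max_num_logprobs (high 16 bits) into meta[1]; the inner loop bound changes from SPEC_LOGPROB_K+1 to max_num_logprobs, so only the columns actually needed are written
custom_ops/gpu_ops/speculate_decoding/speculate_get_output_with_topk.cc: the receiver unpacks actual_topk = (meta[1] >> 8) & 0xFFFF, and the copy loop bound changes to actual_topk
fastdeploy/worker/gpu_model_runner.py: under speculative decoding, remove the hard-coded upper bound of 20 on max_logprobs and compute it dynamically from the actual requests
fastdeploy/output/token_processor.py: unpack mtype/actual_topk and slice tokens/scores[:, :, :actual_topk]; on the hot path, switch to batched .tolist() to reduce per-element overhead on the Python side
tests/output/: update tests to reflect the packed meta[1] format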
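
The packing scheme listed above can be sketched in plain Python. This is a minimal illustration, not the FastDeploy code: the helper names `pack_meta`, `unpack_meta` and `slice_valid` are hypothetical, and the sketch assumes the count occupies every bit above the low 8 (the description mentions both a 16-bit and a 24-bit width).

```python
# Hypothetical helpers; names are illustrative, not taken from the FastDeploy source.
FLAG_BITS = 8
FLAG_MASK = (1 << FLAG_BITS) - 1  # 0xFF


def pack_meta(message_flag: int, max_num_logprobs: int) -> int:
    """Put the flag in the low 8 bits and the logprob count above it."""
    return (message_flag & FLAG_MASK) | (max_num_logprobs << FLAG_BITS)


def unpack_meta(packed: int) -> tuple[int, int]:
    """Return (message_flag, actual_topk) from a packed meta[1] value."""
    return packed & FLAG_MASK, packed >> FLAG_BITS


def slice_valid(rows: list[list[int]], actual_topk: int) -> list[list[int]]:
    """Keep only the first actual_topk columns of each row."""
    return [row[:actual_topk] for row in rows]


packed = pack_meta(message_flag=3, max_num_logprobs=1)
mtype, actual_topk = unpack_meta(packed)
print(packed, mtype, actual_topk)  # 259 3 1
print(slice_valid([[7, 8, 9]], actual_topk))  # [[7]]
```

The receiver only needs the packed integer to know both the message type and how many columns carry real data, which is what allows it to skip the unused slots.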

Usage or Command

N/A

Accuracy Tests

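The logprob results of tests/e2e/test_ernie_21b_mtp_multistep.py differ from the baseline, for the following reason:

When k differs, the relative order of equal values is not deterministic; this is fundamentally the non-determinism of parallel reduction on the GPU.

When the top-3 and top-4 values are equal, paddle.topk(logprobs, 3, axis=-1)[1][:3] and paddle.topk(logprobs, 20, axis=-1)[1][:3] can return different indices. This does not affect the correctness of the result, but the baseline needs to be updated.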

>>> logprobs
Tensor(shape=[100], dtype=bfloat16, place=Place(gpu:0), stop_gradient=True,
       [0.99218750, 0.51562500, 0.52734375, 0.20312500, 0.81250000, 0.63671875,
        0.18164062, 0.22265625, 0.73828125, 0.91015625, 0.47851562, 0.62890625,
        0.03686523, 0.42382812, 0.68359375, 0.27929688, 0.70703125, 0.98437500,
        0.81640625, 0.19140625, 0.44726562, 0.36914062, 0.44335938, 0.98437500,
        0.56250000, 0.13476562, 0.97656250, 0.29687500, 0.89453125, 0.21777344,
        0.31445312, 0.10498047, 0.60156250, 0.23632812, 0.92968750, 0.88671875,
        0.61328125, 0.17480469, 0.80468750, 0.28906250, 0.87500000, 0.43359375,
        0.30273438, 0.50000000, 0.40039062, 0.55859375, 0.21972656, 0.41015625,
        0.41015625, 0.72656250, 0.97656250, 0.56640625, 0.25781250, 0.29296875,
        0.30273438, 0.25195312, 0.76171875, 0.03662109, 0.25195312, 0.55468750,
        0.86718750, 0.04736328, 0.35937500, 0.92187500, 0.34179688, 0.81250000,
        0.86718750, 0.58203125, 0.10742188, 0.90625000, 0.03784180, 0.41210938,
        0.57421875, 0.35937500, 0.28515625, 0.49023438, 0.08740234, 0.96484375,
        0.74218750, 0.29687500, 0.28515625, 0.41210938, 0.22460938, 0.76171875,
        0.77734375, 0.99218750, 0.47851562, 0.52734375, 0.75781250, 0.66406250,
        0.31054688, 0.13867188, 0.40820312, 0.53515625, 0.78125000, 0.86718750,
        0.25390625, 0.84375000, 0.35546875, 0.30859375])
>>> paddle.topk(logprobs, 3, axis=-1)[1]
Tensor(shape=[3], dtype=int64, place=Place(gpu:0), stop_gradient=True,
       [85, 0 , 17])
>>> paddle.topk(logprobs, 20, axis=-1)[1][:3]
Tensor(shape=[3], dtype=int64, place=Place(gpu:0), stop_gradient=True,
       [85, 0 , 23])
>>> paddle.topk(logprobs, 20, axis=-1)[1]
Tensor(shape=[20], dtype=int64, place=Place(gpu:0), stop_gradient=True,
       [85, 0 , 23, 17, 50, 26, 77, 34, 63, 9 , 69, 28, 35, 40, 60, 95, 66, 97, 18, 4 ])
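
In the session above, indices 17 and 23 both hold 0.984375, so either one is a valid third entry. As a minimal sketch of how a test could compare such results by value rather than by index, assuming a hypothetical helper `same_topk_up_to_ties` that is not part of the PR:

```python
def same_topk_up_to_ties(values, idx_a, idx_b):
    """True if both index lists select the same values in the same order."""
    return [values[i] for i in idx_a] == [values[i] for i in idx_b]


# Sparse stand-in for the 100-element tensor above: only the tied entries.
values = {0: 0.9921875, 85: 0.9921875, 17: 0.984375, 23: 0.984375}

top3_from_k3 = [85, 0, 17]   # paddle.topk(logprobs, 3)[1]
top3_from_k20 = [85, 0, 23]  # paddle.topk(logprobs, 20)[1][:3]

print(top3_from_k3 == top3_from_k20)  # False
print(same_topk_up_to_ties(values, top3_from_k3, top3_from_k20))  # True
```

An index-level comparison fails on ties even though the selected log-probabilities are identical, which is why the baseline had to change.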

Checklist

Add at least a tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot · 2026-05-21T12:05:20Z

Thanks for your contribution!

codecov-commenter · 2026-05-21T13:32:56Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (release/2.6@e7a02e2). Learn more about missing BASE report.

Additional details and impacted files

@@              Coverage Diff               @@
##             release/2.6    #7884   +/-   ##
==============================================
  Coverage               ?   72.46%           
==============================================
  Files                  ?      382           
  Lines                  ?    54470           
  Branches               ?     8522           
==============================================
  Hits                   ?    39474           
  Misses                 ?    12228           
  Partials               ?     2768

Flag	Coverage Δ
GPU	`72.46% <100.00%> (?)`

PaddlePaddle-bot · 2026-05-22T07:28:10Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-26 16:01:44

CI report generated from the following code (refreshed every 30 minutes):

PR commit: 3d636b5
Merge base: e7a02e2 (branch: release/2.6)
View full diff
CI details

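1 Task overview

Required CI has not finished yet: 1 required task has failed and 5 required tasks are waiting or running. Complete the Approval first, then wait for the main tests and the remaining required tasks to finish before deciding whether the PR can be merged.

Total runs (reruns)	Total tasks	✅ Passed	❌ Failed	⏳ Running	⏸️ Waiting	Skipped
35(0)	35	27	2	4	2	0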

2 Task status summary

About the log column: failed tasks link directly to the log; running tasks link to the Job.

2.1 Required tasks: 4/10 passed

Required tasks block merging; failures should be handled first.

Status	Task	Duration	Root cause	Suggested fix	Log	Rerun
❌	`Approval`	18s	Approval required	Obtain manual approval	Job	-
⏳	`Extracted partial CE model tasks to run in CI. / run_ce_cases`	-	Running	-	Job	-
⏳	`Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage`	-	Running	-	Job	-
⏳	`xpu_4cards_case_test / run_xpu_4cards_cases`	-	Running	-	Job	-
⏳	`xpu_8cards_case_test / run_xpu_8cards_cases`	-	Running	-	Job	-
⏸️	`Run Four Cards Tests / run_4_cards_tests`	-	Waiting	-	-	-
✅	The other 4 required tasks passed	-	-	-	-	-

2.2 Optional tasks: 23/25 passed

Optional tasks do not block merging; failures are for reference only.

Status	Task	Duration	Log	Rerun
❌	`Run iluvatar Tests / run_iluvatar_cases`	1m51s	Job	-
⏸️	`CI_HPU`	-	-	-
✅	The other 23 optional tasks passed	-	-	-

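3 Failure details (required only)

Approval: manual approval required (confidence: high)

Root cause summary: this Job requires manual Approval; CI continues only after approval is granted.
Fix summary: obtain manual approval, then wait for the remaining Required CI to continue running.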

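4 Code and log context check for this round

Read the PR change summary and the CI quick status; this round has no required test or build failure logs that need in-depth analysis.
Spot-checked the change context as required: fastdeploy/output/token_processor.py, custom_ops/gpu_ops/speculate_decoding/speculate_save_output_with_topk.cc, custom_ops/gpu_ops/speculate_decoding/speculate_get_output_with_topk.cc, and searched for references to meta[1] / actual_topk / max_logprobs.
Conclusion: the required failure currently blocking the merge is the approval status, not a test failure caused by the PR code; the main test task Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage is still running.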

PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-05-26 16:00:56

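📋 Review summary

PR overview: packs message_flag (low 8 bits) and max_num_logprobs (high 24 bits) into meta[1] via bit packing, and removes the hard-coded upper bound of 20 on max_logprobs under speculative decoding, giving about a 10% performance gain for MTP logprob with top_logprobs:0.

Scope of changes: custom_ops/gpu_ops/speculate_decoding/ (3 C++ files), fastdeploy/output/token_processor.py, fastdeploy/worker/gpu_model_runner.py, tests

Impact tags: [Speculative Decoding] [OP] [DataProcessor]

Issues

Severity	File	Summary
🟡 Suggestion	`mtp_save_first_token_with_topk.cc:125`	`message_flag` is not masked to its low 8 bits before packing
❓ Question	`speculate_get_output_with_topk.cc:81`	Is the `actual_topk=0` edge case guarded?
❓ Question	`tests/output/test_process_batch_output.py:213`	Inconsistent bit-width comments (C++ says 24 bits, the test says 16 bits)

📝 PR convention check

The title contains two official tags ([Optimization] + [Speculative Decoding]); §D1 requires the Cherry-Pick format to contain only one official tag. Consider reducing it to a single, more specific tag.

Suggested title (can be copied as-is):

[Cherry-Pick][Speculative Decoding] opt mtp logprob (#7883)

Overall assessment

The optimization is clearly motivated and the bit-packing design is sound. Changing the source tensor stride from SPEC_LOGPROB_K+1 to max_num_logprobs matches the upstream tensor allocation. The actual_topk=0 edge case and the message_flag masking need to be confirmed or fixed by the author before the PR can be merged.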

PaddlePaddle-bot · 2026-05-26T08:08:28Z

+  // Pack message_flag (low 8 bits) and max_num_logprobs (high 24 bits) into
+  // meta[1]. Receiver unpacks both to avoid reading unused topk slots.
  int max_num_logprobs = logprob_token_ids.shape()[1];
+  msg_sed.meta[1] = message_flag | (max_num_logprobs << 8);


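🟡 Suggestion: message_flag is not masked before packing

message_flag | (max_num_logprobs << 8) assumes message_flag < 256 but does not mask it explicitly. mtype currently takes the values 3/4, so this is safe in practice; but if a future flag value is ≥ 256, its high bits would corrupt max_num_logprobs and the receiver would unpack it incorrectly.

Consider adding a defensive mask:

msg_sed.meta[1] = (message_flag & 0xFF) | (max_num_logprobs << 8);

The same applies at the corresponding location in speculate_save_output_with_topk.cc.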
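
The effect of an oversized flag can be reproduced with plain integer arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the flag value 256 is a made-up future value rather than one used by the code today.

```python
def pack_unmasked(message_flag: int, max_num_logprobs: int) -> int:
    """Packing as in the diff: no mask on the flag."""
    return message_flag | (max_num_logprobs << 8)


def pack_masked(message_flag: int, max_num_logprobs: int) -> int:
    """Packing with the defensive mask suggested in the review."""
    return (message_flag & 0xFF) | (max_num_logprobs << 8)


def unpack_topk(packed: int) -> int:
    """Receiver-side count: everything above the low 8 bits."""
    return packed >> 8


# Current flags (3 or 4) fit in 8 bits, so both variants agree.
print(unpack_topk(pack_unmasked(3, 4)))    # 4
# A hypothetical flag of 256 sets bit 8 and leaks into the count.
print(unpack_topk(pack_unmasked(256, 4)))  # 5
print(unpack_topk(pack_masked(256, 4)))    # 4
```

Masking keeps the count intact, but it also silently truncates the flag (256 becomes 0), so an assertion on the flag range may be worth considering alongside it.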

PaddlePaddle-bot · 2026-05-26T08:08:28Z

+  // Unpack message_flag (low 8 bits) and actual_topk (high 24 bits) from
+  // meta[1]. Keep packed value; Python unpacks message_flag and actual_topk.
  output_tokens_data[1] = (int64_t)msg_rcv.meta[1];
  output_tokens_data[2] = (int64_t)msg_rcv.meta[2];


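❓ Question: is the actual_topk = 0 edge case guarded?

When a request with top_logprobs=0 enters logprobs_reqs, max_logprobs = max([0]) = 0, so max_num_logprobs = 0. The sender's inner loop then does not execute, and the sampled token is not written into the message struct.

On the receiver, actual_topk = 0, so the copy loop does not execute either. Once the data reaches the Python side:

tokens[:, :, :0]  # shape=[batch, MAX_DRAFT_TOKENS, 0]
token_ids = [row[0] for row in tokens_lists[i][:accept_num[i]]]  # IndexError!

Please confirm:

Whether requests with top_logprobs=0 can enter logprobs_reqs (if not, there is no problem).

If they can, either guarantee max_num_logprobs >= 1 on the C++ side (write at least the sampled token), or special-case the actual_topk == 0 branch on the Python side.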
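
The failure mode described here can be reproduced with nested lists standing in for the token tensor. This is a sketch under stated assumptions: `sampled_token_ids` and `clamp_num_logprobs` are hypothetical names, and the clamp models only the first option above (the sender always writes at least the sampled-token column).

```python
def sampled_token_ids(rows: list[list[int]], actual_topk: int) -> list[int]:
    """Slice each row to actual_topk columns, then take column 0."""
    sliced = [row[:actual_topk] for row in rows]
    return [row[0] for row in sliced]


def clamp_num_logprobs(requested: int) -> int:
    """Hypothetical sender-side guard: never drop the sampled-token column."""
    return max(1, requested)


rows = [[101, 7, 9], [102, 4, 5]]

try:
    sampled_token_ids(rows, 0)
except IndexError:
    print("IndexError with actual_topk=0")

print(sampled_token_ids(rows, clamp_num_logprobs(0)))  # [101, 102]
```

Clamping only on the receiver would not help, because the sender would not have written any columns; the guard has to sit where the count is computed.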

PaddlePaddle-bot · 2026-05-26T08:08:28Z

@@ -211,8 +211,9 @@ def test_speculative_decoding_use_logprobs(self):

        # stop_flag
        processor.output_tokens[0, 0].set_tensor(paddle.to_tensor(2))


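❓ Question: inconsistent bit-width comments

The comment here says actual_topk (high 16 bits), whereas the comments on the C++ side (mtp_save_first_token_with_topk.cc, speculate_save_output_with_topk.cc, speculate_get_output_with_topk.cc) all say high 24 bits.

The implementation actually uses >> 8 on an int32, which takes the high 24 bits. Consider unifying the comments to high 24 bits so that maintainers do not misjudge the range of max_num_logprobs.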
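
To make the difference between the two widths concrete, the sketch below unpacks the same value with a 16-bit and a 24-bit mask. The function names and the count 70000 are hypothetical, chosen only because 70000 does not fit in 16 bits.

```python
def unpack_topk_16(packed: int) -> int:
    """Count read with a 16-bit mask, as the test comment implies."""
    return (packed >> 8) & 0xFFFF


def unpack_topk_24(packed: int) -> int:
    """Count read with a 24-bit mask, as the C++ comments state."""
    return (packed >> 8) & 0xFFFFFF


packed = 3 | (70000 << 8)  # hypothetical count above 65535

print(unpack_topk_16(packed))  # 4464
print(unpack_topk_24(packed))  # 70000
```

Both readings agree for counts below 65536, so the mismatch affects only how maintainers reason about the allowed range, not today's behaviour.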

opt mtp logprob

a988c1a

Sunny-bot1 had a problem deploying to Metax_ci May 21, 2026 12:05 — with GitHub Actions Failure

fix

4f35c72

Sunny-bot1 had a problem deploying to Metax_ci May 22, 2026 06:57 — with GitHub Actions Failure

fix test and log

f8b507f

Sunny-bot1 had a problem deploying to Metax_ci May 25, 2026 07:09 — with GitHub Actions Error

Merge branch 'release/2.6' of https://github.com/PaddlePaddle/FastDeploy into opt_mtp_logprob_26

f90dd1c

Sunny-bot1 had a problem deploying to Metax_ci May 25, 2026 07:10 — with GitHub Actions Failure

fix bits

2654153

Sunny-bot1 had a problem deploying to Metax_ci May 25, 2026 07:42 — with GitHub Actions Failure

Merge branch 'release/2.6' into opt_mtp_logprob_26

70762af

qingqing01 previously approved these changes May 26, 2026

Adapt logprobs baseline update in test_ernie_21b_mtp_multistep.py

3d636b5

EmmonsCurse dismissed qingqing01’s stale review via 3d636b5 May 26, 2026 07:29

PaddlePaddle-bot reviewed May 26, 2026

Deleter-D approved these changes May 26, 2026

Sunny-bot1 merged commit c52b063 into PaddlePaddle:release/2.6 May 26, 2026
35 of 38 checks passed

Sunny-bot1 merged 7 commits into PaddlePaddle:release/2.6 from Sunny-bot1:opt_mtp_logprob_26

6 participants
