[Feature] support decode unified attention by lizhenyun01 · Pull Request #7688 · PaddlePaddle/FastDeploy

lizhenyun01 · 2026-05-01T16:20:54Z

Motivation

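Adds decode unified attention support for C16, dynamic C8, and static C8 KV caches, to accelerate split-KV attention computation in the decode stage (SM90+ Hopper devices).

Usage:

Centralized deployment (with flash_attn enabled): export USE_DECODE_UNIFIED_ATTENTION=1
PD disaggregation, D node: export FD_ATTENTION_BACKEND=DECODE_UNIFIED_ATTN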

Modifications

New CUDA operator implementations (compiled only for SM90+ with NVCC ≥ 12.0):
- custom_ops/gpu_ops/decode_unified_attention.cu: registers three operators, DecodeUnifiedAttention, DecoderWriteCacheWithRoPE, and ConfigForAttention
- custom_ops/gpu_ops/decode_unified_attention/: core kernel implementations (attention_func.cuh, decode_unified_attention_c16_impl.cuh, decode_unified_attention_c8_impl.cuh, etc.)
New Python backend DecodeUnifiedAttentionBackend (fastdeploy/model_executor/layers/attention/decode_unified_attention_backend.py), which implements the forward_mixed interface. Flow: config_for_attention (layer 0 chunk configuration) → decoder_write_cache_with_rope (write KV cache + RoPE) → decode_unified_attention (split-KV attention computation)
New environment variables added to fastdeploy/envs.py: USE_DECODE_UNIFIED_ATTENTION (switch for centralized deployment, default 0) and FD_ATTENTION_BACKEND=DECODE_UNIFIED_ATTN (switch for PD disaggregation)
Updated fastdeploy/platforms/cuda.py to add dispatch for the DECODE_UNIFIED_ATTN backend
Updated fastdeploy/worker/gpu_model_runner.py to add the buffer allocation path for decode unified attention
Updated custom_ops/setup_ops.py to add the new operator source files to the SM90+ / NVCC ≥ 12.0 branch
New unit tests: tests/operators/attention/test_decode_unified_attention_c16.py, test_decode_unified_attention_c8.py
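
The split-KV computation named in the flow above can be illustrated independently of the CUDA kernels. The following is a minimal pure-Python sketch, not the PR's implementation: it assumes a single query vector and a single head, and the function names (`attend_chunk`, `merge_partials`, `split_kv_attention`) and the chunk size are hypothetical. Each KV chunk produces a partial output together with its own max score and softmax denominator, and the partials are then merged by rescaling to a common max.

```python
import math

def attend_chunk(q, keys, values):
    """Attention of one query over one KV chunk.

    Returns (unnormalized output, chunk max score, chunk sum of exp).
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    out = [sum(w * v[d] for w, v in zip(weights, values))
           for d in range(len(values[0]))]
    return out, m, sum(weights)

def merge_partials(partials):
    """Combine per-chunk partial results into the final attention output."""
    m_all = max(m for _, m, _ in partials)
    dim = len(partials[0][0])
    acc = [0.0] * dim
    denom = 0.0
    for out, m, s in partials:
        factor = math.exp(m - m_all)  # rescale this chunk to the global max
        denom += factor * s
        for d in range(dim):
            acc[d] += factor * out[d]
    return [a / denom for a in acc]

def split_kv_attention(q, keys, values, chunk_size):
    """Split the KV sequence into chunks, attend per chunk, then merge."""
    partials = [
        attend_chunk(q, keys[i:i + chunk_size], values[i:i + chunk_size])
        for i in range(0, len(keys), chunk_size)
    ]
    return merge_partials(partials)

# Illustrative data: 5 cached tokens, head dimension 2.
q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.3, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

full = split_kv_attention(q, keys, values, chunk_size=len(keys))
split = split_kv_attention(q, keys, values, chunk_size=2)
```

With `chunk_size` equal to the sequence length the function reduces to ordinary softmax attention, so comparing `full` and `split` checks that the merge step is exact up to floating-point rounding.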

Usage or Command

Centralized deployment (with flash_attn enabled):

export USE_DECODE_UNIFIED_ATTENTION=1

PD disaggregation, D node:

export FD_ATTENTION_BACKEND=DECODE_UNIFIED_ATTN
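
As a rough illustration of how the two switches could be interpreted, here is a hypothetical helper. `resolve_decode_backend`, its return strings, and the precedence between the two variables are assumptions made for this sketch; they are not FastDeploy's actual dispatch logic in fastdeploy/envs.py or fastdeploy/platforms/cuda.py. The sketch only reads the two environment variables named above.

```python
def resolve_decode_backend(env, flash_attn_enabled):
    """Hypothetical helper: decide whether decode unified attention is requested.

    `env` is a mapping such as os.environ. The precedence used here is an
    assumption made for illustration only.
    """
    # PD disaggregation, D node: the backend is named explicitly.
    if env.get("FD_ATTENTION_BACKEND") == "DECODE_UNIFIED_ATTN":
        return "decode_unified_attn"
    # Centralized deployment: the switch applies when flash_attn is enabled.
    if flash_attn_enabled and env.get("USE_DECODE_UNIFIED_ATTENTION", "0") == "1":
        return "decode_unified_attn"
    return "default"
```

Passing a plain dict instead of `os.environ` keeps the helper easy to exercise without touching the process environment.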

Accuracy Tests

N/A (this PR adds new operators; accuracy comparison data against a reference implementation has not been provided yet)

Checklist

Add at least a tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot · 2026-05-01T16:21:00Z

Thanks for your contribution!

codecov-commenter · 2026-05-01T17:49:01Z

Codecov Report

❌ Patch coverage is 77.45902% with 55 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@dad5a43). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
...yers/attention/decode_unified_attention_backend.py	71.09%	26 Missing and 11 partials ⚠️
fastdeploy/spec_decode/mtp.py	57.57%	12 Missing and 2 partials ⚠️
...cutor/layers/attention/ops/config_for_attention.py	85.71%	0 Missing and 1 partial ⚠️
...r/layers/attention/ops/decode_unified_attention.py	88.88%	0 Missing and 1 partial ⚠️
...ers/attention/ops/decoder_write_cache_with_rope.py	88.88%	0 Missing and 1 partial ⚠️
fastdeploy/worker/gpu_model_runner.py	88.88%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #7688   +/-   ##
==========================================
  Coverage           ?   63.64%           
==========================================
  Files              ?      466           
  Lines              ?    64726           
  Branches           ?     9918           
==========================================
  Hits               ?    41192           
  Misses             ?    20741           
  Partials           ?     2793

Flag	Coverage Δ
GPU	`72.73% <77.45%> (?)`
XPU	`7.09% <0.40%> (?)`

PaddlePaddle-bot · 2026-05-06T13:01:03Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-06 20:58:21

This CI report was generated from the following code (updated every 30 minutes):

PR commit: 3d19ff8
Merge base: 0397ab5 (branch: develop)
View full diff
CI details

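1 Task overview

❌ 1 Required task failed; it must be resolved before the PR can be merged.

Total runs (reruns)	Total tasks	✅ Passed	❌ Failed	⏳ Running	⏸️ Pending	Skipped
19(0)	19	12	2	2	3	0

2 Task status summary

2.1 Required tasks: 1/2 passed

Required tasks block merging; failures must be handled first.

Status	Task	Duration	Root cause	Suggested fix	Log	Rerun
❌	`Approval`	8s	PR issue: the PR is missing 4 required approvals (custom operators and sensitive files)	Contact the relevant RDs for approval (dangqingqing/jeff41404, etc.)	Job	-
✅	The other 1 required task passed	-	-	-	-	-

2.2 Optional tasks: 11/17 passed

Optional tasks do not block merging; failures are for reference only.

Status	Task	Duration	Log	Rerun
❌	`Check PR Template`	12s	Job	-
⏳	`xpu_build_test / xpu-build-test`	-	Job	-
⏳	`Trigger Jenkins for PR`	-	Job	-
⏸️	`CI_HPU`	-	-	-
⏸️	`Run iluvatar Tests / run_iluvatar_cases`	-	-	-
⏸️	`FD-Build-Linux / fd-build`	-	-	-
✅	The other 11 optional tasks passed	-	-	-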

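3 Failure details (required only)

Approval: missing approvals (confidence: high)

Approval

Status: ❌ Failed
Error type: missing approvals
Confidence: high
Root cause summary: the PR is missing 4 required approvals, involving changes to custom operators and sensitive files
Analyzer: generic analysis (fallback)

Root cause details:
The check_approval.sh script detected that this PR is missing 4 required approvals. The PR modifies custom-operator directories (fastdeploy/spec_decode, custom_ops/gpu_ops/speculate_decoding) and a sensitive configuration file (fastdeploy/envs.py); these changes need an explicit Approve from the owners of the corresponding modules before the PR can be merged.

Key log: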

0. You must have one FastDeploy RD (qingqing01, Jiang-Jia-Jun, heavengate) approval for adding custom op.
1. You must have one PaddlePaddle RD (jeff41404, yongqiangma) approval for adding custom op.
2. You must have one FastDeploy RD (freeliuzc, Deleter-D) approval for modifying [fastdeploy/spec_decode, custom_ops/gpu_ops/speculate_decoding].
3. You must have one FastDeploy RD (Jiang-Jia-Jun, yuanlehome, rainyfly, Wanglongzhi2001) approval for modifying [fastdeploy/envs.py].
There are 4 approved errors.
##[error]Process completed with exit code 6.

Suggested fix:

Ask any one of the FastDeploy RDs @dangqingqing / @jiangjiajun / @DENGKAIPENG to approve (custom operators)
Ask any one of the PaddlePaddle RDs @gaoxiang / @mayongqiang to approve (custom operators)
Ask any one of the FastDeploy RDs @liuzichang01 / @wangyanpeng04 to approve (spec_decode/custom_ops directories)
Ask any one of the FastDeploy RDs @jiangjiajun / @liuyuanle / @chenjian26 / @wanglongzhi to approve (envs.py)

Suggested fix summary: ask the relevant RDs to approve (dangqingqing/jeff41404/liuzichang01/jiangjiajun, etc.)

Link: view log

PaddlePaddle-bot · 2026-05-20T06:50:04Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-24 17:30:54

This CI report was generated from the following code (updated every 30 minutes):

PR commit: 0b30335
Merge base: dad5a43 (branch: develop)
View full diff
CI details

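1 Task overview

Required tasks currently stand at 9/10 passed; 1 Required task is still failing: Approval. The main test task, Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage, has passed. The current blocker is the manual approval check; complete Approval, then watch the subsequent status.

Total runs (reruns)	Total tasks	✅ Passed	❌ Failed	⏳ Running	⏸️ Pending	Skipped
43(1)	42	38	4	0	0	0

2 Task status summary

Note on the Log column: failed tasks link directly to their logs; optional tasks do not block merging and are for reference only.

2.1 Required tasks: 9/10 passed

Required tasks block merging; failures must be handled first.

Status	Task	Duration	Root cause	Suggested fix	Log	Rerun
❌	`Approval`	8s	Approval needed: the approval check did not pass	Obtain manual approval	Job	🔄×1
✅	The other 9 required tasks passed	-	-	-	-	-

2.2 Optional tasks: 29/32 passed

Optional tasks do not block merging; failures are for reference only.

Status	Task	Duration	Log	Rerun
❌	`Run iluvatar Tests / run_iluvatar_cases`	3m15s	Job	-
❌	`CI_HPU`	1h25m	Job	-
❌	`Trigger Jenkins for PR`	7m51s	Job	-
✅	The other 29 optional tasks passed	-	-	-

3 Failure details (required only)

Approval: approval_required (confidence: high)

This job requires manual Approval; CI continues only after the approval is complete.

Additional information: the Approval workflow runs scripts/check_approval.sh from .github/workflows/approve.yml to check approvals. The main test task has passed, so the current Required blocker is not a test failure.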

PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-05-21 16:16:46

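📋 Review summary

PR overview: adds the decode unified attention operator, which accelerates split-KV attention in the decode stage on Hopper (SM90+) devices and covers three cache quantization modes: C16, dynamic C8, and static C8.
Scope of changes: custom_ops/gpu_ops/decode_unified_attention/, fastdeploy/model_executor/layers/attention/, fastdeploy/worker/gpu_model_runner.py, fastdeploy/spec_decode/
Impact tags: [OP] [Feature] [Speculative Decoding] [PD Disaggregation]

Issues

Severity	File	Summary
🟡 Suggestion	`custom_ops/gpu_ops/append_attn/utils.cuh:34`	The `div_up` change is outside the scope of this PR and affects the existing append_attn operators
🟡 Suggestion	`custom_ops/gpu_ops/decode_unified_attention.cu:57`	`cache_k/v_dequant_scales` are optional, but the C8 path calls `.get()` directly with no null check
❓ Question	`custom_ops/gpu_ops/cpp_extensions.cc:2045`	No `PD_BUILD_STATIC_OP` static op registration is visible for `decoder_write_cache_with_rope` and `config_for_attention`

📝 PR convention check

The PR title, [Feature] support decode unified attention, uses an official tag and is correctly formatted. The PR description contains all required sections (Motivation / Modifications / Usage or Command / Accuracy Tests / Checklist), is structurally compliant, and explicitly states N/A under Accuracy Tests. ✓ The PR convention check passed; no changes are needed.

Overall assessment

The implementation is complete overall: the SM90+ gating, the build registration in setup_ops.py, and the Python call sites are all in sync, and test files have been added. The main concerns are the crash risk from the missing null check on optional tensors in the C8 path, and how far the div_up change in append_attn/utils.cuh reaches. The author should confirm these two points before merging.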

PaddlePaddle-bot · 2026-05-21T08:22:34Z


 __forceinline__ __host__ __device__ int div_up(int a, int b) {
-  return (a + b - 1) / b;
+  return a / b + (a % b != 0);


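🟡 Suggestion: the div_up change is outside the expected scope of this PR

This change rewrites (a + b - 1) / b as a / b + (a % b != 0). The two forms are mathematically equivalent for positive integers, but the change modifies a utility function shared by every operator under append_attn/, and this PR's new operator does not live in that directory.

Suggested fixes:

If the goal is to remove a potential integer overflow (a + b - 1 can overflow for very large values), state this in the PR description and add regression tests for the existing append_attn operators

If only decode_unified_attention needs it, define it separately in custom_ops/gpu_ops/decode_unified_attention/utils.cuh to avoid modifying unrelated files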
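
The equivalence and the overflow concern can be checked with a small Python model. This is a sketch under stated assumptions: Python integers do not overflow, so `wrap_int32` emulates two's-complement wraparound of a 32-bit `int`, which is only one possible outcome because signed overflow is undefined behavior in C++. `trunc_div` mimics C++ integer division, which truncates toward zero. All helper names are hypothetical.

```python
INT32_MAX = 2**31 - 1

def wrap_int32(x):
    """Emulate two's-complement wraparound of a 32-bit signed int."""
    return (x + 2**31) % 2**32 - 2**31

def trunc_div(a, b):
    """Integer division that truncates toward zero, as in C/C++."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q

def div_up_add(a, b):
    # Old form: (a + b - 1) / b, with the intermediate sum wrapped to int32.
    return trunc_div(wrap_int32(a + b - 1), b)

def div_up_mod(a, b):
    # New form: a / b + (a % b != 0); no intermediate value exceeds a.
    return trunc_div(a, b) + (a % b != 0)

# Near the int32 limit the old form's intermediate sum no longer fits.
a, b = INT32_MAX, 2
wrapped = div_up_add(a, b)
safe = div_up_mod(a, b)
```

For small positive inputs both functions agree with the ceiling of a / b; at `a = INT32_MAX`, `b = 2` the emulated old form returns a negative value while the new form returns 2**30.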

PaddlePaddle-bot · 2026-05-21T08:22:34Z

+    const paddle::optional<paddle::Tensor>& attn_mask,
+    const paddle::optional<paddle::Tensor>& cache_k_quant_scales,
+    const paddle::optional<paddle::Tensor>& cache_v_quant_scales,
+    const paddle::optional<paddle::Tensor>& cache_k_dequant_scales,


🟡 Suggestion: cache_k/v_dequant_scales are optional tensors, and the C8 path does no null check

In the C8 dispatch path of the function body, the code directly calls:

cache_quant_type == "block_wise_fp8" ? cache_k_quant_scales.get() : cache_k_dequant_scales.get() // ← crashes if not initialized

When cache_quant_type is "cache_fp8" or "cache_int8", dequant_scales.get() is used; if the caller did not pass that optional tensor, this triggers Paddle's optional dereference exception.

Suggested fix: add an assertion before entering the C8 dispatch:

PADDLE_ENFORCE_EQ(
    cache_k_dequant_scales.is_initialized(),
    true,
    phi::errors::InvalidArgument(
        "cache_k_dequant_scales must be provided for cache_quant_type=%s",
        cache_quant_type.c_str()));

Alternatively, ensure on the Python side, in forward_mixed, that dequant_scales is always passed when the type is not block_wise_fp8.
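
A Python-side guard of the kind suggested above might look like the following. This is a minimal sketch, not code from the PR: `select_c8_scales` and its parameter names are hypothetical, and plain lists stand in for tensors. It mirrors the ternary quoted in the review (block_wise_fp8 uses the quant scales, every other type uses the dequant scales) and fails early with a clear message when the required input is missing.

```python
def select_c8_scales(cache_quant_type, quant_scales=None, dequant_scales=None):
    """Hypothetical helper: pick the scale input for the C8 path or fail early."""
    if cache_quant_type == "block_wise_fp8":
        name, scales = "quant_scales", quant_scales
    else:
        # Mirrors the quoted ternary: all other types use the dequant scales.
        name, scales = "dequant_scales", dequant_scales
    if scales is None:
        raise ValueError(
            f"{name} must be provided for cache_quant_type={cache_quant_type!r}"
        )
    return scales
```

Raising in Python before the custom op is invoked turns a dereference failure inside the kernel launcher into an error that names the missing argument.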

PaddlePaddle-bot · 2026-05-21T08:22:34Z

        &PerTokenGroupQuantFp8,
        "per_token_group_quant_fp8");
+
+  /**


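❓ Question: decoder_write_cache_with_rope and config_for_attention are registered only through pybind11; no PD_BUILD_STATIC_OP macro is visible

Within the diff, only decode_unified_attention.cu shows a PD_BUILD_STATIC_OP(decode_unified_attention) macro registration. The diff does not show static op registration for decoder_write_cache_with_rope.cu or decode_unified_attention/config_for_attention.cu.

Please confirm:

Whether the two files also contain the corresponding PD_BUILD_STATIC_OP(decoder_write_cache_with_rope) / PD_BUILD_STATIC_OP(config_for_attention) macros;

If these two operators are called only through pybind11 (dynamic graph), please add a code comment explaining why static registration is not needed.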

yongqiangma

LGTM

#7729) * support c8 decode attention * support c16 attention && backend * opt kernel * fix * opt larger batch * inplace out * fix input_batch && remove fast_math * fix xpu * fix bug * fix ci * opt and fix mtp * fix merge * clean code * fix merge * update * update test * fix test * fix test * opt buffer * fix conflict --------- Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com>

lizhenyun01 temporarily deployed to Metax_ci May 1, 2026 16:20 — with GitHub Actions Inactive

lizhenyun01 temporarily deployed to Metax_ci May 1, 2026 17:50 — with GitHub Actions Inactive

lizhenyun01 temporarily deployed to Metax_ci May 3, 2026 14:56 — with GitHub Actions Inactive

lizhenyun01 temporarily deployed to Metax_ci May 6, 2026 09:55 — with GitHub Actions Inactive

lizhenyun01 had a problem deploying to Metax_ci May 6, 2026 12:38 — with GitHub Actions Error

lizhenyun01 temporarily deployed to Metax_ci May 6, 2026 12:55 — with GitHub Actions Inactive

lizhenyun01 temporarily deployed to Metax_ci May 6, 2026 15:01 — with GitHub Actions Inactive

lizhenyun01 temporarily deployed to Metax_ci May 7, 2026 04:10 — with GitHub Actions Inactive

PaddlePaddle-bot mentioned this pull request May 7, 2026

[Cherry-Pick][Feature] support decode unified attention for mix(#7688) #7729

Merged

5 tasks

lizhenyun01 had a problem deploying to Metax_ci May 11, 2026 03:36 — with GitHub Actions Failure

lizhenyun01 had a problem deploying to Metax_ci May 11, 2026 12:47 — with GitHub Actions Failure

lizhenyun01 force-pushed the h_dec_attn branch from 29a6ed1 to 88cc875 Compare May 14, 2026 09:06

lizhenyun01 had a problem deploying to Metax_ci May 14, 2026 09:06 — with GitHub Actions Failure

lizhenyun01 force-pushed the h_dec_attn branch from 88cc875 to c72705f Compare May 14, 2026 09:15

lizhenyun01 had a problem deploying to Metax_ci May 14, 2026 09:15 — with GitHub Actions Failure

fix conflict

89c917b

lizhenyun01 force-pushed the h_dec_attn branch from b9042ad to 89c917b Compare May 20, 2026 04:41

lizhenyun01 had a problem deploying to Metax_ci May 20, 2026 04:42 — with GitHub Actions Failure

fix

9bea80e

lizhenyun01 had a problem deploying to Metax_ci May 20, 2026 07:24 — with GitHub Actions Error

fix

4596ea9

lizhenyun01 had a problem deploying to Metax_ci May 20, 2026 07:28 — with GitHub Actions Failure

freeliuzc pushed a commit that referenced this pull request May 20, 2026

[Cherry-Pick][Feature] support decode unified attention(#7688) (#7850)

15aed4f

* support decode unified attention * support dyc8 && test * opt memory

fix mix

3f7f0fd

lizhenyun01 had a problem deploying to Metax_ci May 20, 2026 12:45 — with GitHub Actions Failure

fix fa version

faac077

lizhenyun01 had a problem deploying to Metax_ci May 21, 2026 05:53 — with GitHub Actions Failure

lizhenyun01 had a problem deploying to Metax_ci May 21, 2026 06:45 — with GitHub Actions Failure

define base

0b30335

lizhenyun01 had a problem deploying to Metax_ci May 21, 2026 07:57 — with GitHub Actions Failure

PaddlePaddle-bot reviewed May 21, 2026

freeliuzc approved these changes May 21, 2026

lizhenyun01 requested review from jiangjiajun, xyxinyang and yongqiangma May 21, 2026 09:57

xyxinyang approved these changes May 25, 2026

yongqiangma approved these changes May 25, 2026

Jiang-Jia-Jun approved these changes May 25, 2026

Jiang-Jia-Jun merged commit 4762002 into PaddlePaddle:develop May 25, 2026
39 of 44 checks passed
