[Cherry-Pick][Feature] support decode unified attention for mix(#7688) by lizhenyun01 · Pull Request #7729 · PaddlePaddle/FastDeploy

lizhenyun01 · 2026-05-07T05:14:55Z

Motivation

Adds support for C16 / static C8 attention. Usage: with flash_attn enabled, run export USE_DECODE_UNIFIED_ATTENTION=1

Modifications

Usage or Command

Accuracy Tests

Checklist

Add at least a tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot · 2026-05-07T05:15:01Z

Thanks for your contribution!

PaddlePaddle-bot · 2026-05-07T06:58:03Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-26 06:50:03

This CI report was generated from the following code (refreshed every 30 minutes):

PR commit: 9341f60
Merge base: 3ffeb44 (branch: release/2.6)
View full diff
CI details

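1 Task overview

7 of 10 required tasks passed. Three failed required tasks still block the merge: two hit the GitHub Actions artifact storage quota, and one is the manual Approval step. The main test task, Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage, passed.

Total runs (reruns)	Total tasks	✅ Passed	❌ Failed	⏳ Running	⏸️ Pending	Skipped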
64(27)	37	30	7	0	0	0

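2 Task status summary

Note on the log column: failed tasks link directly to the log; for running tasks the Job link is assembled manually.

2.1 Required tasks: 7/10 passed

Required tasks block the merge; fix their failures first.

Status	Task	Duration	Root cause	Suggested fix	Log	Rerun
❌	`xpu_4cards_case_test / run_xpu_4cards_cases`	30m8s	Environment issue: artifact storage quota is full	Clean up Actions artifacts, then rerun	Job	🔄×1
❌	`xpu_8cards_case_test / run_xpu_8cards_cases`	9m22s	Environment issue: artifact storage quota is full	Clean up Actions artifacts, then rerun	Job	🔄×1
❌	`Approval`	20s	Approval required	Obtain manual approval	Job	🔄×1
✅	Remaining 7 required tasks passed	-	-	-	-	-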

2.2 Optional tasks: 23/27 passed

Optional tasks do not block the merge; failures are for reference only.

Status	Task	Duration	Log	Rerun
❌	`Run iluvatar Tests / run_iluvatar_cases`	1m44s	Job	-
❌	`Check PR Template`	20s	Job	-
❌	`Trigger Jenkins for PR`	25s	Job	-
❌	`CI_HPU`	1h4m	Job	-
✅	Remaining 23 optional tasks passed	-	-	-

3 Failure details (required only)

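xpu_4cards_case_test / run_xpu_4cards_cases: infrastructure (confidence: high)

Status: ❌ Failed
Error type: infrastructure
Confidence: high
Root cause summary: artifact storage quota is full
Analyzer: ci_analyze_unittest_fastdeploy

Failed cases:

Test	Error	Root cause
`tests/xpu_ci/4cards_cases`	Upload case logs failed	All 15 cases passed; the failure occurred while uploading the log artifact

Root cause details:
The log shows that the Run CI unittest step succeeded, ending with 15 passed and 4卡cases测试通过 ("4-card cases passed"). The job failed in the Upload case logs step: GitHub Actions reported that the artifact storage quota was full, so xpu-4cards-case-logs could not be created. This is a CI infrastructure/quota problem; no direct link to this PR's GPU decode unified attention changes was found.

Key logs: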

15 passed in 1486.75s (0:24:46)
4卡cases测试通过!
Failed to CreateArtifact: Artifact storage quota has been hit.
Unable to upload any new artifacts.

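Suggested fix:

Clean up GitHub Actions artifacts or free the repository's Actions storage quota, then rerun CI_XPU.
If the quota cleanup takes time, wait for GitHub to recalculate storage usage (the log indicates 6-12 hours) before rerunning.

Fix summary: clean up Actions artifacts, then rerun

Related changes: no direct link found. The PR mainly modifies files related to GPU decode unified attention, while the current failure is in the XPU log upload stage.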
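The reasoning in this failure analysis (tests passed, then the artifact upload failed) can be checked mechanically. The following is a minimal sketch, assuming the job log is available as plain text; `classify_failure` is a hypothetical helper, not part of the CI agent, and the patterns cover only the key log lines quoted above.

```python
import re

# Patterns taken from the key log lines quoted above; nothing else is recognised.
PASSED_RE = re.compile(r"^(\d+) passed in [\d.]+s")
QUOTA_MSG = "Artifact storage quota has been hit"


def classify_failure(log_text: str) -> str:
    """Hypothetical helper: label a failed job 'infrastructure' or 'unknown'."""
    lines = [line.strip() for line in log_text.splitlines()]
    tests_passed = any(PASSED_RE.match(line) for line in lines)
    quota_hit = any(QUOTA_MSG in line for line in lines)
    # Only call it infrastructure when the tests passed AND the upload hit the quota.
    if tests_passed and quota_hit:
        return "infrastructure"
    return "unknown"


sample_log = """15 passed in 1486.75s (0:24:46)
Failed to CreateArtifact: Artifact storage quota has been hit.
Unable to upload any new artifacts.
"""
print(classify_failure(sample_log))  # infrastructure
```

Anything that does not match both patterns falls back to "unknown", so a real test failure would not be mislabelled as a quota problem.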

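xpu_8cards_case_test / run_xpu_8cards_cases: infrastructure (confidence: high)

Status: ❌ Failed
Error type: infrastructure
Confidence: high
Root cause summary: artifact storage quota is full
Analyzer: ci_analyze_unittest_fastdeploy

Failed cases:

Test	Error	Root cause
`tests/xpu_ci/8cards_cases`	Upload case logs failed	All 4 cases passed; the failure occurred while uploading the log artifact

Root cause details:
The log shows that the Run CI unittest step succeeded, ending with 4 passed and 8卡cases测试通过 ("8-card cases passed"). The job failed in the Upload case logs step: GitHub Actions reported that the artifact storage quota was full, so the upload of xpu-8cards-case-logs failed. This is a CI infrastructure/quota problem with no direct link to the test logic or to the PR's code changes.

Key logs: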

4 passed in 366.67s (0:06:06)
8卡cases测试通过!
Failed to CreateArtifact: Artifact storage quota has been hit.
Unable to upload any new artifacts.

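Suggested fix:

Clean up GitHub Actions artifacts or free the repository's Actions storage quota, then rerun CI_XPU.
If the quota cleanup takes time, wait for GitHub to recalculate storage usage (the log indicates 6-12 hours) before rerunning.

Fix summary: clean up Actions artifacts, then rerun

Related changes: no direct link found. The PR mainly modifies files related to GPU decode unified attention, while the current failure is in the XPU log upload stage.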

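Approval: manual approval (confidence: high)

This job requires manual approval; CI continues only after the approval is granted.

Root cause summary: approval required

Fix summary: obtain manual approval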

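4 Code relevance check

The PR's key changes and the related workflows were read and searched:

USE_DECODE_UNIFIED_ATTENTION in fastdeploy/envs.py is off by default and is enabled only when set explicitly;
New and updated tests focus on GPU decode unified attention: tests/e2e/test_ernie_21b_mtp_decode_unified_attention.py, tests/operators/attention/test_decode_unified_attention_c16.py, tests/operators/attention/test_decode_unified_attention_c8.py;
Both the XPU 4-card and 8-card workflows upload case_logs through actions/upload-artifact@v6 after the tests finish, which matches the failure log message Artifact storage quota has been hit.

Conclusion: the required failures in this round do not point to a logic problem in the PR code; handle the Actions storage quota and the Approval first.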
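The default-off behaviour noted for USE_DECODE_UNIFIED_ATTENTION can be illustrated with a small sketch. This is not the actual content of fastdeploy/envs.py, which is not shown in this thread: `decode_unified_attention_enabled` is a hypothetical helper, and treating only the value "1" as enabled is an assumption.

```python
import os


def decode_unified_attention_enabled(environ=None) -> bool:
    """Hypothetical gate: off unless USE_DECODE_UNIFIED_ATTENTION is explicitly "1"."""
    env = os.environ if environ is None else environ
    # Assumption: an unset variable or any value other than "1" leaves the feature off.
    return env.get("USE_DECODE_UNIFIED_ATTENTION", "0").strip() == "1"


print(decode_unified_attention_enabled({}))  # False
print(decode_unified_attention_enabled({"USE_DECODE_UNIFIED_ATTENTION": "1"}))  # True
```

Passing a plain dict instead of reading `os.environ` directly keeps the gate easy to test without touching the process environment.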

codecov-commenter · 2026-05-11T05:22:20Z

Codecov Report

❌ Patch coverage is 94.25287% with 5 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (release/2.6@0a5d4b6). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
...cutor/layers/attention/ops/config_for_attention.py	85.71%	0 Missing and 1 partial ⚠️
...r/layers/attention/ops/decode_unified_attention.py	88.88%	0 Missing and 1 partial ⚠️
...ers/attention/ops/decoder_write_cache_with_rope.py	88.88%	0 Missing and 1 partial ⚠️
fastdeploy/spec_decode/mtp.py	92.85%	0 Missing and 1 partial ⚠️
fastdeploy/worker/gpu_model_runner.py	85.71%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@              Coverage Diff               @@
##             release/2.6    #7729   +/-   ##
==============================================
  Coverage               ?   72.52%           
==============================================
  Files                  ?      385           
  Lines                  ?    54545           
  Branches               ?     8532           
==============================================
  Hits                   ?    39560           
  Misses                 ?    12213           
  Partials               ?     2772

Flag	Coverage Δ
GPU	`72.52% <94.25%> (?)`
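The figures in this report are internally consistent, which a short calculation can confirm. The sketch below assumes that coverage is computed as hits divided by hits plus misses plus partials, and that the 5 uncovered patch lines are the 5 partials listed per file; both assumptions fit the numbers shown but are not stated in the report itself.

```python
# Figures copied from the Codecov report above.
hits, misses, partials = 39560, 12213, 2772
total = hits + misses + partials
print(total)  # 54545, matching the "Lines" row

# Assumption: partial lines count as not covered.
project_pct = 100 * hits / total
print(f"{project_pct:.3f}")  # 72.527

# Patch coverage: 5 uncovered lines at 94.25287% implies the patch size.
patch_missing = 5
patch_lines = round(patch_missing / (1 - 0.9425287))
print(patch_lines)  # 87
print(f"{100 * (patch_lines - patch_missing) / patch_lines:.5f}")  # 94.25287
```

The project figure comes out at about 72.527%, so the 72.52% shown in the report looks truncated rather than rounded.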

Flags with carried forward coverage won't be shown. Click here to find out more.


yongqiangma

LGTM

PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-05-26 16:09:29

📋 Review summary

PR overview: adds the decode unified attention (C16 / static C8) operators and their Python call side, providing a unified attention implementation for the decode stage when flash_attn is used.
Change scope: custom_ops/gpu_ops/ (new CUDA kernels), fastdeploy/model_executor/layers/attention/ (Python call side), fastdeploy/worker/, fastdeploy/spec_decode/, tests/
Impact tags: [OP] [Feature] [Speculative Decoding]

⚠️ This PR is large (28 files / 8235 lines); consider splitting it to reduce review difficulty and merge risk.

Suggested split:

PR 1 (CUDA kernel implementation): custom_ops/gpu_ops/decode_unified_attention.cu, custom_ops/gpu_ops/decoder_write_cache_with_rope.cu, custom_ops/gpu_ops/decode_unified_attention/*.cuh, custom_ops/gpu_ops/cpp_extensions.cc, custom_ops/setup_ops.py
PR 2 (Python call side + Worker integration): fastdeploy/model_executor/layers/attention/ops/, fastdeploy/model_executor/layers/attention/flash_attn_backend.py, fastdeploy/model_executor/layers/attention/append_attn_backend.py, fastdeploy/worker/gpu_model_runner.py, fastdeploy/worker/input_batch.py, fastdeploy/envs.py
PR 3 (Spec Decode + tests): fastdeploy/spec_decode/mtp.py, fastdeploy/worker/metax_model_runner.py, tests/

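Issues

Severity	File	Summary
🟡 Suggestion	`custom_ops/setup_ops.py`	Suggestion to split the large PR (see above)
❓ Question	`custom_ops/setup_ops.py`	The `-DENABLE_DECODE_UNIFIED_ATTENTION` compile macro is missing; other PRs of this kind all define it
🟡 Suggestion	`fastdeploy/worker/metax_model_runner.py`	The file is in the change list, but no matching `USE_DECODE_UNIFIED_ATTENTION` logic appears in it
📝 PR conventions	-	The `Modifications`, `Usage or Command`, and `Accuracy Tests` sections are empty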
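To make the compile-macro question concrete, here is a rough sketch of what adding such a macro to a build script could look like. It does not reproduce custom_ops/setup_ops.py, whose structure is not shown in this thread: `extend_build_config`, its parameters, and the placeholder source `custom_ops/gpu_ops/other.cu` are hypothetical. Only the two `.cu` file names, the SM90+ condition, and the macro name come from this review.

```python
from typing import List, Tuple

# File names come from the change list in this review; the rest is illustrative.
DECODE_UNIFIED_SOURCES = [
    "custom_ops/gpu_ops/decode_unified_attention.cu",
    "custom_ops/gpu_ops/decoder_write_cache_with_rope.cu",
]


def extend_build_config(
    sources: List[str], nvcc_flags: List[str], sm_version: int, define_macro: bool
) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: add the new kernels (and optionally a -D macro) on SM90+."""
    sources, nvcc_flags = list(sources), list(nvcc_flags)  # do not mutate the inputs
    if sm_version >= 90:
        sources += DECODE_UNIFIED_SOURCES
        if define_macro:
            # -D<NAME> defines a preprocessor macro for the compiled sources.
            nvcc_flags.append("-DENABLE_DECODE_UNIFIED_ATTENTION")
    return sources, nvcc_flags


srcs, flags = extend_build_config(
    ["custom_ops/gpu_ops/other.cu"], ["-O3"], sm_version=90, define_macro=True
)
print(flags)  # ['-O3', '-DENABLE_DECODE_UNIFIED_ATTENTION']
```

Whether the macro is actually required depends on whether any C++/CUDA code is guarded by it, which is the point the reviewer asks the author to confirm.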

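📝 PR conventions check

The title follows the Cherry-Pick convention ([Cherry-Pick][Feature]); no change needed.

In the description, the Modifications, Usage or Command, and Accuracy Tests sections contain only the commented placeholders and no actual content, which does not meet the template requirements.

Suggested PR description (can be copied as-is):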

## Motivation

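Adds the decode unified attention operator, providing a unified decode-stage attention implementation for two KV cache quantization modes: C16 (cache_quant_type=none) and static C8 (cache_fp8/cache_int8). Usage: with flash_attn enabled, set the environment variable `export USE_DECODE_UNIFIED_ATTENTION=1` to enable it.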

## Modifications

- `custom_ops/gpu_ops/decode_unified_attention/`: adds the C16/C8 decode unified attention CUDA kernel implementation (`attention_func.cuh`, `decode_unified_attention_c16_impl.cuh`, `decode_unified_attention_c8_impl.cuh`, `mma_tensor_op.cuh`, `utils.cuh`, `mem_util.cuh`, `cu_tensor_map.cuh`)
- `custom_ops/gpu_ops/decode_unified_attention.cu`: adds the `DecodeUnifiedAttention` op entry point and its PD_BUILD_STATIC_OP registration
- `custom_ops/gpu_ops/decoder_write_cache_with_rope.cu`: adds the `DecoderWriteCacheWithRoPE` op
- `custom_ops/gpu_ops/decode_unified_attention/config_for_attention.cu`: adds the `ConfigForAttention` op
- `custom_ops/gpu_ops/cpp_extensions.cc`: registers the pybind11 interfaces for the three new ops
- `custom_ops/setup_ops.py`: adds the new `.cu` files to the SM90+ build branch
- `fastdeploy/model_executor/layers/attention/ops/`: adds the Python call-side wrappers (`decode_unified_attention.py`, `decoder_write_cache_with_rope.py`, `config_for_attention.py`)
- `fastdeploy/model_executor/layers/attention/flash_attn_backend.py`, `append_attn_backend.py`: integrate the new attention path
- `fastdeploy/worker/gpu_model_runner.py`, `input_batch.py`: Worker-side integration
- `fastdeploy/worker/metax_model_runner.py`: Metax hardware adaptation
- `fastdeploy/spec_decode/mtp.py`: MTP speculative decoding adaptation
- `fastdeploy/envs.py`: adds the `USE_DECODE_UNIFIED_ATTENTION` environment variable
- `tests/`: adds e2e tests and operator unit tests

## Usage or Command

```bash
# Enable decode unified attention (requires an SM90+ GPU with flash_attn enabled)
export USE_DECODE_UNIFIED_ATTENTION=1
```

## Accuracy Tests

N/A (this PR is a cherry-pick; accuracy tests were provided in the original PR #7688)

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

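Overall assessment

The implementation is well structured overall and covers the CUDA kernels, the Python call side, Worker integration, and tests. Please confirm whether setup_ops.py needs the -DENABLE_DECODE_UNIFIED_ATTENTION compile macro, and verify that metax_model_runner.py is fully in sync.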

lizhenyun01 added 11 commits May 7, 2026 13:03

support c8 decode attention

2265dc9

support c16 attention && backend

4c922bc

opt kernel

de6450d

fix

111230a

opt larger batch

03263a0

inplace out

cb64cb3

fix input_batch && remove fast_math

b1acb37

fix xpu

a5e394f

fix bug

6a5b3c6

fix ci

307e5a8

opt and fix mtp

3f29b01

lizhenyun01 had a problem deploying to Metax_ci May 7, 2026 05:14 — with GitHub Actions Failure

lizhenyun01 changed the title ~~[Feature] support decode attention for mix(#7688)~~ [Cherry-Pick][Feature] support decode attention for mix(#7688) May 7, 2026