[Cherry-Pick][Feature] support decode unified attention for mix(#7688) #7729

Merged: Jiang-Jia-Jun merged 22 commits into PaddlePaddle:release/2.6 from lizhenyun01:dec_attn_2.6 on May 26, 2026
Conversation

lizhenyun01 (Collaborator) commented May 7, 2026

Motivation

Adds support for C16 and static C8 attention. Usage: with flash_attn enabled, run export USE_DECODE_UNIFIED_ATTENTION=1

Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot (Bot) commented May 7, 2026

Thanks for your contribution!

lizhenyun01 changed the title from "[Feature] support decode attention for mix(#7688)" to "[Cherry-Pick][Feature] support decode attention for mix(#7688)" on May 7, 2026

PaddlePaddle-bot commented May 7, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-26 06:50:03

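This CI report was generated from the following code (updated every 30 minutes):


1 Task overview

Required tasks: 7/10 passed. Three required tasks are still failing and blocking the merge: two hit the GitHub Actions artifact storage quota, and one is the manual Approval job. The main test task, Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage, has passed.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Pending | Skipped |
|---|---|---|---|---|---|---|
| 64 (27) | 37 | 30 | 7 | 0 | 0 | 0 |

2 Task status summary

Log column: failed tasks link directly to the log; for running tasks the Job link is assembled manually.

2.1 Required tasks: 7/10 passed

Required tasks block the merge; failures here should be handled first.

| Status | Task | Duration | Root cause | Suggested fix | Log | Reruns |
|---|---|---|---|---|---|---|
| ❌ | xpu_4cards_case_test / run_xpu_4cards_cases | 30m8s | Environment issue: artifact storage quota is full | Clean up Actions artifacts, then rerun | Job | 🔄×1 |
| ❌ | xpu_8cards_case_test / run_xpu_8cards_cases | 9m22s | Environment issue: artifact storage quota is full | Clean up Actions artifacts, then rerun | Job | 🔄×1 |
| ❌ | Approval | 20s | Approval required | Approve manually | Job | 🔄×1 |
| ✅ | The other 7 required tasks passed | - | - | - | - | - |

2.2 Optional tasks: 23/27 passed

Optional tasks do not block the merge; failures are for reference only.

| Status | Task | Duration | Log | Reruns |
|---|---|---|---|---|
| ❌ | Run iluvatar Tests / run_iluvatar_cases | 1m44s | Job | - |
| ❌ | Check PR Template | 20s | Job | - |
| ❌ | Trigger Jenkins for PR | 25s | Job | - |
| ❌ | CI_HPU | 1h4m | Job | - |
| ✅ | The other 23 optional tasks passed | - | - | - |

3 Failure details (required only)

xpu_4cards_case_test / run_xpu_4cards_cases: infrastructure (confidence: high)

  • Status: ❌ Failed
  • Error type: infrastructure
  • Confidence: high
  • Root cause summary: artifact storage quota is full
  • Analyzer: ci_analyze_unittest_fastdeploy

Failed cases:

| Test | Error | Root cause |
|---|---|---|
| tests/xpu_ci/4cards_cases | Upload case logs failed | All 15 cases passed; the failure occurred while uploading the log artifact |

Root cause details:
The log shows that the Run CI unittest step succeeded, ending with "15 passed" and "4卡cases测试通过!" (4-card cases passed). The job failed in the Upload case logs step: GitHub Actions reported that the artifact storage quota was full, so xpu-4cards-case-logs could not be created. This is a CI infrastructure/quota problem, and nothing ties it directly to this PR's GPU decode unified attention changes.

Key log lines: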

15 passed in 1486.75s (0:24:46)
4卡cases测试通过!
Failed to CreateArtifact: Artifact storage quota has been hit.
Unable to upload any new artifacts.

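Suggested fixes:

  1. Clean up GitHub Actions artifacts or free the repository's Actions storage quota, then rerun CI_XPU.
  2. If the quota cleanup takes time to register, wait for GitHub to recalculate storage usage (the log mentions 6-12 hours) and rerun afterwards.

Suggested fix summary: clean up Actions artifacts, then rerun

Related changes: no direct link found. The PR mainly modifies files related to GPU decode unified attention, while this failure occurs in the XPU log upload stage.

xpu_8cards_case_test / run_xpu_8cards_cases: infrastructure (confidence: high)

  • Status: ❌ Failed
  • Error type: infrastructure
  • Confidence: high
  • Root cause summary: artifact storage quota is full
  • Analyzer: ci_analyze_unittest_fastdeploy

Failed cases:

| Test | Error | Root cause |
|---|---|---|
| tests/xpu_ci/8cards_cases | Upload case logs failed | All 4 cases passed; the failure occurred while uploading the log artifact |

Root cause details:
The log shows that the Run CI unittest step succeeded, ending with "4 passed" and "8卡cases测试通过!" (8-card cases passed). The job failed in the Upload case logs step: GitHub Actions reported that the artifact storage quota was full, so the xpu-8cards-case-logs upload failed. This is a CI infrastructure/quota problem with no direct link to the test logic or to the PR's code changes.

Key log lines: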

4 passed in 366.67s (0:06:06)
8卡cases测试通过!
Failed to CreateArtifact: Artifact storage quota has been hit.
Unable to upload any new artifacts.

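Suggested fixes:

  1. Clean up GitHub Actions artifacts or free the repository's Actions storage quota, then rerun CI_XPU.
  2. If the quota cleanup takes time to register, wait for GitHub to recalculate storage usage (the log mentions 6-12 hours) and rerun afterwards.

Suggested fix summary: clean up Actions artifacts, then rerun

Related changes: no direct link found. The PR mainly modifies files related to GPU decode unified attention, while this failure occurs in the XPU log upload stage.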
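
The reasoning in the two XPU failure reports (the test step passed, then the artifact upload failed) can be sketched as a small log classifier. This is only an illustration and not the actual ci_analyze_unittest_fastdeploy analyzer: the function name `classify_job_log` and the category labels are hypothetical, and only the matched log strings come from the key log lines quoted in the report.

```python
import re

# Log fragments taken from the key log lines quoted in the report.
QUOTA_MARKER = "Artifact storage quota has been hit"
PYTEST_SUMMARY = re.compile(r"(\d+) passed in [\d.]+s")
PYTEST_FAILURES = re.compile(r"\d+ (failed|error)")


def classify_job_log(log_text: str) -> str:
    """Return a coarse failure category for a failed CI job log.

    Hypothetical helper: the category names are illustrative only.
    """
    tests_passed = PYTEST_SUMMARY.search(log_text) is not None
    has_failures = PYTEST_FAILURES.search(log_text) is not None
    quota_hit = QUOTA_MARKER in log_text

    if quota_hit and tests_passed and not has_failures:
        # The test step succeeded; only the artifact upload step failed.
        return "infrastructure"
    if has_failures:
        return "test_failure"
    return "unknown"


sample = """15 passed in 1486.75s (0:24:46)
Failed to CreateArtifact: Artifact storage quota has been hit.
Unable to upload any new artifacts."""
print(classify_job_log(sample))
```

Checking for a passing pytest summary before blaming the quota keeps a genuine test failure from being hidden behind an upload error that happens to appear in the same log.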

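Approval: manual approval (confidence: high)

This job requires manual Approval; CI continues only after the approval is completed.

Root cause summary: Approval required

Suggested fix summary: approve manually


4 Code relevance check

Key PR changes and related workflows that were read or searched:

  • fastdeploy/envs.py: USE_DECODE_UNIFIED_ATTENTION is off by default and is enabled only when explicitly set;
  • New and updated tests focus on GPU decode unified attention: tests/e2e/test_ernie_21b_mtp_decode_unified_attention.py, tests/operators/attention/test_decode_unified_attention_c16.py, tests/operators/attention/test_decode_unified_attention_c8.py;
  • The XPU 4-card and 8-card workflows both upload case_logs via actions/upload-artifact@v6 after the tests finish, which matches the failure log "Artifact storage quota has been hit".

Conclusion: the required failures in this round do not point to a problem in the PR's code logic; the Actions storage quota and the Approval should be handled first.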
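
A default-off environment flag of the kind described for fastdeploy/envs.py can be sketched as follows. The real parsing code is not shown on this page, so this is a minimal sketch under two assumptions: the flag is read as an integer, and an unset or unparsable value means "off". The helper name `use_decode_unified_attention` is hypothetical.

```python
import os


def use_decode_unified_attention(environ=os.environ) -> bool:
    """Return True only when the flag is explicitly set to a non-zero integer.

    Assumption: the flag is read as an integer and defaults to 0 (off).
    """
    raw = environ.get("USE_DECODE_UNIFIED_ATTENTION", "0").strip()
    try:
        return int(raw) != 0
    except ValueError:
        # Treat unparsable values as "off" rather than raising.
        return False


print(use_decode_unified_attention({}))  # flag unset
print(use_decode_unified_attention({"USE_DECODE_UNIFIED_ATTENTION": "1"}))
```

Passing the mapping in as a parameter keeps the check easy to unit-test without mutating the process environment.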

PaddlePaddle-bot

This comment was marked as outdated.

codecov-commenter commented May 11, 2026

Codecov Report

❌ Patch coverage is 94.25287% with 5 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (release/2.6@0a5d4b6). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...cutor/layers/attention/ops/config_for_attention.py | 85.71% | 0 Missing and 1 partial ⚠️ |
| ...r/layers/attention/ops/decode_unified_attention.py | 88.88% | 0 Missing and 1 partial ⚠️ |
| ...ers/attention/ops/decoder_write_cache_with_rope.py | 88.88% | 0 Missing and 1 partial ⚠️ |
| fastdeploy/spec_decode/mtp.py | 92.85% | 0 Missing and 1 partial ⚠️ |
| fastdeploy/worker/gpu_model_runner.py | 85.71% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@              Coverage Diff               @@
##             release/2.6    #7729   +/-   ##
==============================================
  Coverage               ?   72.52%           
==============================================
  Files                  ?      385           
  Lines                  ?    54545           
  Branches               ?     8532           
==============================================
  Hits                   ?    39560           
  Misses                 ?    12213           
  Partials               ?     2772           

| Flag | Coverage Δ |
|---|---|
| GPU | 72.52% <94.25%> (?) |
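
The percentages in this report can be reproduced from the raw counts. The sketch below rests on two assumptions that the report does not state: the patch spans 87 changed lines (inferred from "5 lines missing" at 94.25287%), and the displayed figures are truncated rather than rounded. The helper `percent_floor` is a hypothetical name.

```python
import math

# Project totals from the coverage diff above.
hits, misses, partials, lines = 39560, 12213, 2772, 54545
assert hits + misses + partials == lines  # the three buckets cover every line

# Assumption: 87 changed lines, inferred from "5 lines missing" at 94.25287%.
patch_lines, patch_uncovered = 87, 5


def percent_floor(numerator: int, denominator: int, digits: int = 2) -> float:
    """Percentage truncated (not rounded) to `digits` decimal places."""
    scale = 10 ** digits
    return math.floor(numerator * 100 * scale / denominator) / scale


project = percent_floor(hits, lines)
patch = percent_floor(patch_lines - patch_uncovered, patch_lines, digits=5)
print(project, patch)
```

Under these assumptions the two values match the 72.52% project coverage and the 94.25287% patch coverage shown in the report.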

lizhenyun01 changed the title from "[Cherry-Pick][Feature] support decode attention for mix(#7688)" to "[Cherry-Pick][Feature] support decode unified attention for mix(#7688)" on May 25, 2026
yongqiangma (Collaborator) left a comment

LGTM

PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-05-26 16:09:29

📋 Review summary

PR overview: adds the decode unified attention operators (C16 / static C8) and their Python call side, providing a unified attention implementation for the decode stage when flash_attn is used.
Scope of changes: custom_ops/gpu_ops/ (new CUDA kernels), fastdeploy/model_executor/layers/attention/ (Python call side), fastdeploy/worker/, fastdeploy/spec_decode/, tests/
Impact tags: [OP] [Feature] [Speculative Decoding]

⚠️ This PR is large (28 files / 8235 lines); splitting it is recommended to make review easier and reduce merge risk.

Suggested split

  • PR 1: CUDA kernel implementation: custom_ops/gpu_ops/decode_unified_attention.cu, custom_ops/gpu_ops/decoder_write_cache_with_rope.cu, custom_ops/gpu_ops/decode_unified_attention/*.cuh, custom_ops/gpu_ops/cpp_extensions.cc, custom_ops/setup_ops.py
  • PR 2: Python call side and Worker integration: fastdeploy/model_executor/layers/attention/ops/, fastdeploy/model_executor/layers/attention/flash_attn_backend.py, fastdeploy/model_executor/layers/attention/append_attn_backend.py, fastdeploy/worker/gpu_model_runner.py, fastdeploy/worker/input_batch.py, fastdeploy/envs.py
  • PR 3: Spec Decode and tests: fastdeploy/spec_decode/mtp.py, fastdeploy/worker/metax_model_runner.py, tests/

Issues

| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | custom_ops/setup_ops.py | Split the large PR (see above) |
| ❓ Question | custom_ops/setup_ops.py | The -DENABLE_DECODE_UNIFIED_ATTENTION compile macro is missing; other PRs of this kind all have it |
| 🟡 Suggestion | fastdeploy/worker/metax_model_runner.py | The file is in the change list, but no matching USE_DECODE_UNIFIED_ATTENTION logic was found in it |
| 📝 PR convention | | The Modifications, Usage or Command, and Accuracy Tests sections are empty |
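
The ❓ question about the compile macro can be made concrete with a small sketch of how a build script might add sources and a define for SM90+ targets. This is not the contents of custom_ops/setup_ops.py: the helper `extend_sm90_build` and its parameters are hypothetical, and whether the macro is needed at all is exactly what the review leaves open.

```python
def extend_sm90_build(cc, sources, nvcc_flags, define_macro=False):
    """Return (sources, nvcc_flags) extended for SM90+ targets.

    Hypothetical helper: `cc` is the compute capability as an integer
    (for example 90), and `define_macro` controls whether the macro
    named in the review is passed to the compiler.
    """
    sources = list(sources)        # copy so the caller's lists are untouched
    nvcc_flags = list(nvcc_flags)
    if cc >= 90:
        # File names taken from the change list in this review.
        sources += [
            "custom_ops/gpu_ops/decode_unified_attention.cu",
            "custom_ops/gpu_ops/decoder_write_cache_with_rope.cu",
        ]
        if define_macro:
            nvcc_flags.append("-DENABLE_DECODE_UNIFIED_ATTENTION")
    return sources, nvcc_flags


srcs, flags = extend_sm90_build(90, [], ["-O3"], define_macro=True)
print(len(srcs), flags)
```

If the C++ side guards the op registration with `#ifdef ENABLE_DECODE_UNIFIED_ATTENTION`, omitting the flag would silently compile the new sources without exposing the ops, which is presumably why the reviewer asks.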

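📝 PR convention check

The title format follows the Cherry-Pick convention ([Cherry-Pick][Feature]); no change is needed.

In the description, the Modifications, Usage or Command, and Accuracy Tests sections contain only the comment placeholders and no actual content, which does not meet the template requirements.

Suggested PR description (can be copied directly):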

## Motivation

Adds the decode unified attention operator, providing a unified attention implementation for the decode stage under two KV cache quantization modes: C16 (cache_quant_type=none) and static C8 (cache_fp8/cache_int8). Usage: with flash_attn enabled, set the environment variable `export USE_DECODE_UNIFIED_ATTENTION=1` to enable it.

## Modifications

- `custom_ops/gpu_ops/decode_unified_attention/`: adds the C16/C8 decode unified attention CUDA kernel implementation (`attention_func.cuh`, `decode_unified_attention_c16_impl.cuh`, `decode_unified_attention_c8_impl.cuh`, `mma_tensor_op.cuh`, `utils.cuh`, `mem_util.cuh`, `cu_tensor_map.cuh`)
- `custom_ops/gpu_ops/decode_unified_attention.cu`: adds the `DecodeUnifiedAttention` op entry point and its PD_BUILD_STATIC_OP registration
- `custom_ops/gpu_ops/decoder_write_cache_with_rope.cu`: adds the `DecoderWriteCacheWithRoPE` op
- `custom_ops/gpu_ops/decode_unified_attention/config_for_attention.cu`: adds the `ConfigForAttention` op
- `custom_ops/gpu_ops/cpp_extensions.cc`: registers the pybind11 interfaces for the three new ops
- `custom_ops/setup_ops.py`: adds the new `.cu` files to the SM90+ build branch
- `fastdeploy/model_executor/layers/attention/ops/`: adds the Python call-side wrappers (`decode_unified_attention.py`, `decoder_write_cache_with_rope.py`, `config_for_attention.py`)
- `fastdeploy/model_executor/layers/attention/flash_attn_backend.py`, `append_attn_backend.py`: integrates the new attention path
- `fastdeploy/worker/gpu_model_runner.py`, `input_batch.py`: Worker-side integration
- `fastdeploy/worker/metax_model_runner.py`: Metax hardware adaptation
- `fastdeploy/spec_decode/mtp.py`: MTP speculative decoding adaptation
- `fastdeploy/envs.py`: adds the `USE_DECODE_UNIFIED_ATTENTION` environment variable
- `tests/`: adds e2e tests and operator unit tests

## Usage or Command

```bash
# Enable decode unified attention (requires an SM90+ GPU with flash_attn enabled)
export USE_DECODE_UNIFIED_ATTENTION=1
```

## Accuracy Tests

N/A (this PR is a Cherry-Pick; the accuracy tests were provided in the original PR #7688)

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

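Overall assessment

The implementation is clearly structured, and the CUDA kernels, Python call side, Worker integration, and tests are all covered. Please confirm whether setup_ops.py needs the -DENABLE_DECODE_UNIFIED_ATTENTION compile macro, and check that metax_model_runner.py has been kept fully in sync.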
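
The suggested description maps KV cache quantization modes to the two kernel variants (C16 for cache_quant_type=none, static C8 for cache_fp8/cache_int8). A minimal sketch of that mapping follows. The function name `kernel_variant` and the returned labels are hypothetical; the mode strings are taken from the review text and may not match the values FastDeploy actually uses.

```python
# Mode strings as written in the review text (assumed, not verified
# against the FastDeploy source).
C16_MODES = {"none"}
STATIC_C8_MODES = {"cache_fp8", "cache_int8"}


def kernel_variant(cache_quant_type: str) -> str:
    """Pick the decode unified attention variant for a KV cache mode.

    Hypothetical helper: returns "c16" or "c8", and raises for any mode
    the review does not list as supported.
    """
    if cache_quant_type in C16_MODES:
        return "c16"
    if cache_quant_type in STATIC_C8_MODES:
        return "c8"
    raise ValueError(f"unsupported cache_quant_type: {cache_quant_type!r}")


print(kernel_variant("none"), kernel_variant("cache_fp8"))
```

Raising on unlisted modes, rather than falling back silently, would let a caller route those cases to the existing attention path explicitly.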

@Jiang-Jia-Jun Jiang-Jia-Jun merged commit a095d6f into PaddlePaddle:release/2.6 May 26, 2026
33 of 37 checks passed
8 participants