[Cherry-Pick][Feature] support decode unified attention for mix (#7688) #7729
Thanks for your contribution!
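CI report generated from the following code (refreshed every 30 minutes):
1 Task overview
Required tasks: 7/10 passed. Three failed required tasks still block the merge: two are GitHub Actions artifact storage quota issues and one is the manual Approval step; main test tasks
2 Task status summary
Log column note: failed tasks use the log link directly; for running tasks the Job link is assembled manually.
2.1 Required tasks: 7/10 passed
2.2 Optional tasks: 23/27 passed
3 Failure details (required only)
xpu_4cards_case_test / run_xpu_4cards_cases: infrastructure (confidence: high)
Failed cases:
Root cause details: Key logs: Suggested fix:
Fix summary: clean up the Actions artifacts, then rerun.
Related changes: no direct link found; the PR mainly modifies files related to GPU decode unified attention, while the current failure occurs in the XPU log upload stage.
xpu_8cards_case_test / run_xpu_8cards_cases: infrastructure (confidence: high)
Failed cases:
Root cause details: Key logs: Suggested fix:
Fix summary: clean up the Actions artifacts, then rerun.
Related changes: no direct link found; the PR mainly modifies files related to GPU decode unified attention, while the current failure occurs in the XPU log upload stage.
Approval: manual approval (confidence: high)
This Job requires manual Approval; CI continues only after the approval is completed.
Root cause summary: Approval required
Fix summary: complete the manual approval
4 Code relevance check
Key PR changes and related workflows were read and searched:
Conclusion: the required failures in this round do not point to a logic problem in the PR code; handle the Actions storage quota and the Approval first.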
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## release/2.6 #7729 +/- ##
==============================================
Coverage ? 72.52%
==============================================
Files ? 385
Lines ? 54545
Branches ? 8532
==============================================
Hits ? 39560
Misses ? 12213
Partials ? 2772
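The totals in the Codecov table can be cross-checked with a few lines of arithmetic. The sketch below assumes that line coverage is hits divided by total lines and that the displayed percentage is truncated, not rounded, to two decimals. The truncation rule is inferred from the numbers shown here; the report itself does not state it.

```python
import math

# Totals copied from the Codecov table above.
lines, hits, misses, partials = 54545, 39560, 12213, 2772


def coverage_percent(hit_count: int, line_count: int) -> float:
    """Line coverage in percent, truncated to two decimals (assumed display rule)."""
    return math.floor(hit_count / line_count * 10000) / 100


# Every tracked line is counted exactly once as a hit, a miss, or a partial.
assert hits + misses + partials == lines

print(coverage_percent(hits, lines))  # prints 72.52
```

Plain rounding of the same ratio would give 72.53, so the 72.52% in the table is consistent with truncation.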
PaddlePaddle-bot left a comment
🤖 Paddle-CI-Agent | pr_review | 2026-05-26 16:09:29
📋 Review summary
PR overview: adds the decode unified attention operators (C16 / static C8) and their Python call side, providing a unified attention implementation for the decode stage when flash_attn is used.
Scope of changes: custom_ops/gpu_ops/ (new CUDA kernels), fastdeploy/model_executor/layers/attention/ (Python call side), fastdeploy/worker/, fastdeploy/spec_decode/, tests/
Impact tags: [OP] [Feature] [Speculative Decoding]
Suggested split:
- PR 1: CUDA kernel implementation: custom_ops/gpu_ops/decode_unified_attention.cu, custom_ops/gpu_ops/decoder_write_cache_with_rope.cu, custom_ops/gpu_ops/decode_unified_attention/*.cuh, custom_ops/gpu_ops/cpp_extensions.cc, custom_ops/setup_ops.py
- PR 2: Python call side + Worker integration: fastdeploy/model_executor/layers/attention/ops/, fastdeploy/model_executor/layers/attention/flash_attn_backend.py, fastdeploy/model_executor/layers/attention/append_attn_backend.py, fastdeploy/worker/gpu_model_runner.py, fastdeploy/worker/input_batch.py, fastdeploy/envs.py
- PR 3: Spec Decode + tests: fastdeploy/spec_decode/mtp.py, fastdeploy/worker/metax_model_runner.py, tests/
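Issues
| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | custom_ops/setup_ops.py | Suggestion to split this large PR (see above) |
| ❓ Question | custom_ops/setup_ops.py | The -DENABLE_DECODE_UNIFIED_ATTENTION compile macro is missing; other PRs of the same kind all have it |
| 🟡 Suggestion | fastdeploy/worker/metax_model_runner.py | The file is in the change list, but no matching USE_DECODE_UNIFIED_ATTENTION logic appears to have been synced |
| 📝 PR conventions | - | The Modifications, Usage or Command, and Accuracy Tests sections are empty |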
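📝 PR convention check
The title follows the Cherry-Pick convention ([Cherry-Pick][Feature]); no change is needed.
In the description, the Modifications, Usage or Command, and Accuracy Tests sections contain only the comment placeholders and no actual content, which does not meet the template requirements.
Suggested PR description (can be copied as-is):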
## Motivation
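Adds the decode unified attention operator, providing a unified attention implementation for the decode stage under two KV cache quantization modes: C16 (cache_quant_type=none) and static C8 (cache_fp8/cache_int8). Usage: with flash_attn enabled, set the environment variable `export USE_DECODE_UNIFIED_ATTENTION=1` to turn it on.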
## Modifications
- `custom_ops/gpu_ops/decode_unified_attention/`: adds the C16/C8 decode unified attention CUDA kernel implementation (`attention_func.cuh`, `decode_unified_attention_c16_impl.cuh`, `decode_unified_attention_c8_impl.cuh`, `mma_tensor_op.cuh`, `utils.cuh`, `mem_util.cuh`, `cu_tensor_map.cuh`)
- `custom_ops/gpu_ops/decode_unified_attention.cu`: adds the `DecodeUnifiedAttention` op entry point and its PD_BUILD_STATIC_OP registration
- `custom_ops/gpu_ops/decoder_write_cache_with_rope.cu`: adds the `DecoderWriteCacheWithRoPE` op
- `custom_ops/gpu_ops/decode_unified_attention/config_for_attention.cu`: adds the `ConfigForAttention` op
- `custom_ops/gpu_ops/cpp_extensions.cc`: registers the pybind11 interfaces for the three new ops
- `custom_ops/setup_ops.py`: adds the new `.cu` files to compilation in the SM90+ branch
- `fastdeploy/model_executor/layers/attention/ops/`: adds the Python call-side wrappers (`decode_unified_attention.py`, `decoder_write_cache_with_rope.py`, `config_for_attention.py`)
- `fastdeploy/model_executor/layers/attention/flash_attn_backend.py`, `append_attn_backend.py`: integrate the new attention path
- `fastdeploy/worker/gpu_model_runner.py`, `input_batch.py`: Worker-side integration
- `fastdeploy/worker/metax_model_runner.py`: Metax hardware adaptation
- `fastdeploy/spec_decode/mtp.py`: MTP speculative decoding adaptation
- `fastdeploy/envs.py`: adds the `USE_DECODE_UNIFIED_ATTENTION` environment variable
- `tests/`: adds e2e tests and operator unit tests
## Usage or Command
```bash
# Enable decode unified attention (requires an SM90+ GPU with flash_attn enabled)
export USE_DECODE_UNIFIED_ATTENTION=1
```
## Accuracy Tests
N/A (this PR is a cherry-pick; accuracy tests were provided in the original PR #7688)
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
Overall assessment
The implementation is clearly structured, and the CUDA kernels, Python call side, Worker integration, and tests are all covered. Please confirm whether setup_ops.py needs the -DENABLE_DECODE_UNIFIED_ATTENTION compile macro, and check that metax_model_runner.py has been fully synced.
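For the compile-macro question, the sketch below shows one common way such a flag is wired into a Python build script. It is a hypothetical illustration: `build_nvcc_flags`, its arguments, and the SM90 gating rule are assumptions made for the example and are not taken from custom_ops/setup_ops.py. Only the macro name -DENABLE_DECODE_UNIFIED_ATTENTION comes from the review.

```python
from typing import List

# Macro name quoted from the review; everything else here is hypothetical.
DECODE_UNIFIED_MACRO = "-DENABLE_DECODE_UNIFIED_ATTENTION"


def build_nvcc_flags(compute_capabilities: List[int], base_flags: List[str]) -> List[str]:
    """Return nvcc flags, adding the macro only when an SM90+ target is built.

    Assumption: the new kernels are compiled in the SM90+ branch (as the
    suggested PR description says), so the macro is gated the same way.
    """
    flags = list(base_flags)  # copy so the caller's list is not mutated
    has_sm90 = any(cc >= 90 for cc in compute_capabilities)
    if has_sm90 and DECODE_UNIFIED_MACRO not in flags:
        flags.append(DECODE_UNIFIED_MACRO)
    return flags


print(build_nvcc_flags([80, 90], ["-O3"]))  # macro appended
print(build_nvcc_flags([80], ["-O3"]))      # macro omitted
```

Whether the macro is needed at all depends on whether the C++ sources guard the new ops behind it, which the review leaves open.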
a095d6f into PaddlePaddle:release/2.6
Motivation
Adds C16 / static C8 attention support. Usage: with flash_attn enabled, run export USE_DECODE_UNIFIED_ATTENTION=1
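The enabling rule described here can be sketched as a small selection function. This is a minimal sketch under stated assumptions: the function names and the "c16"/"c8" labels are hypothetical, the integer parsing of the variable is assumed, and the real logic in fastdeploy/envs.py and the attention backends may differ. The cache_quant_type values are the ones listed in the bot's suggested description.

```python
import os
from typing import Mapping, Optional


def decode_unified_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """True when USE_DECODE_UNIFIED_ATTENTION is set to a non-zero integer."""
    raw = env.get("USE_DECODE_UNIFIED_ATTENTION", "0").strip()
    try:
        return int(raw) != 0
    except ValueError:
        return False  # unparsable values are treated as "off" (assumption)


def select_decode_kernel(
    cache_quant_type: str,
    flash_attn: bool,
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return a hypothetical kernel label, or None to keep the existing path."""
    if not (flash_attn and decode_unified_enabled(env)):
        return None
    if cache_quant_type == "none":
        return "c16"
    if cache_quant_type in ("cache_fp8", "cache_int8"):
        return "c8"
    return None  # other quantization modes are not mentioned in this PR


env = {"USE_DECODE_UNIFIED_ATTENTION": "1"}
print(select_decode_kernel("cache_fp8", True, env))  # prints c8
```

Passing the environment as a mapping keeps the check testable without mutating the process environment.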
Modifications
Usage or Command
Accuracy Tests
Checklist
- Tag list: [[FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- Format your code, run pre-commit before commit.
- If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.