[Feature] support decode unified attention #7688
Codecov Report
❌ Patch coverage is …
Additional details and impacted files:
@@ Coverage Diff @@
## develop #7688 +/- ##
==========================================
Coverage ? 63.64%
==========================================
Files ? 466
Lines ? 64726
Branches ? 9918
==========================================
Hits ? 41192
Misses ? 20741
Partials ? 2793
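CI report generated from the following code (refreshed every 30 minutes):
1 Task overview
❌ 1 Required task has failed and must be resolved before the PR can be merged.
2 Task status summary
2.1 Required tasks: 1/2 passed
2.2 Optional tasks: 11/17 passed
3 Failure details (Required only)
Approval: missing approval (confidence: high)
Summary of suggested fix: ask the relevant RDs to approve (dangqingqing/jeff41404/liuzichang01/jiangjiajun, etc.).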
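CI report generated from the following code (refreshed every 30 minutes):
1 Task overview
9/10 Required tasks currently pass; 1 Required task is still failing.
2 Task status summary
Note on the log column: failed tasks link directly to their logs; optional tasks do not block merging and are listed for reference only.
2.1 Required tasks: 9/10 passed
2.2 Optional tasks: 29/32 passed
3 Failure details (Required only)
Approval: approval_required (confidence: high)
This job requires manual approval; CI continues only after the approval is granted.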
PaddlePaddle-bot left a comment
🤖 Paddle-CI-Agent | pr_review | 2026-05-21 16:16:46
📋 Review summary
PR overview: adds a decode unified attention operator that accelerates decode-stage split-KV attention on Hopper (SM90+) devices, covering three cache quantization modes: C16, dynamic C8, and static C8.
Change scope: custom_ops/gpu_ops/decode_unified_attention/, fastdeploy/model_executor/layers/attention/, fastdeploy/worker/gpu_model_runner.py, fastdeploy/spec_decode/
Impact tags: [OP] [Feature] [Speculative Decoding] [PD Disaggregation]
Issues
| Severity | File | Summary |
|---|---|---|
| 🟡 Suggestion | custom_ops/gpu_ops/append_attn/utils.cuh:34 | The div_up change is outside the scope of this PR and affects the existing append_attn operators |
| 🟡 Suggestion | custom_ops/gpu_ops/decode_unified_attention.cu:57 | cache_k/v_dequant_scales are optional, but the C8 path calls .get() directly without a null check |
| ❓ Question | custom_ops/gpu_ops/cpp_extensions.cc:2045 | No PD_BUILD_STATIC_OP static-op registration is visible for decoder_write_cache_with_rope and config_for_attention |
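📝 PR convention check
The PR title [Feature] support decode unified attention uses an official tag and is correctly formatted. The PR description contains all required sections (Motivation / Modifications / Usage or Command / Accuracy Tests / Checklist), its structure is compliant, and Accuracy Tests explicitly states N/A. ✓ The PR convention check passes; no changes are needed.
Overall assessment
The implementation is complete overall: the SM90+ gating, the build registration in setup_ops.py, and the Python call sites are all in sync, and test files have been added. The main concerns are the crash risk from the missing null check on optional tensors in the C8 path, and the reach of the div_up change in append_attn/utils.cuh. The author should confirm both points before merging.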
    __forceinline__ __host__ __device__ int div_up(int a, int b) {
    -  return (a + b - 1) / b;
    +  return a / b + (a % b != 0);
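🟡 Suggestion: the div_up change is outside the expected scope of this PR
This change replaces (a + b - 1) / b with a / b + (a % b != 0). The two forms are mathematically equivalent for positive integers, but the edit touches a utility function shared by every operator under append_attn/, and this PR's new operator does not live in that directory.
Suggested fix:
- If the goal is to remove a potential integer overflow (a + b - 1 can overflow for very large values), say so in the PR description and add regression tests for the existing append_attn operators.
- If only decode_unified_attention needs it, define it separately in custom_ops/gpu_ops/decode_unified_attention/utils.cuh to avoid modifying unrelated files.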
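To make the overflow concern concrete, here is a minimal Python sketch of the two div_up formulas. It is an illustration, not code from the PR: `wrap_int32`, `c_div`, `div_up_old`, and `div_up_new` are hypothetical helper names, and the sketch assumes two's-complement wraparound for a 32-bit `int`. In C++ signed overflow is undefined behavior, so wraparound is only one possible outcome.

```python
INT32_MAX = 2**31 - 1


def wrap_int32(x: int) -> int:
    """Wrap a Python int into the signed 32-bit range (assumed wraparound)."""
    return (x + 2**31) % 2**32 - 2**31


def c_div(a: int, b: int) -> int:
    """C-style integer division, which truncates toward zero."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q


def div_up_old(a: int, b: int) -> int:
    """(a + b - 1) / b, with the intermediate sum held in a 32-bit int."""
    return c_div(wrap_int32(a + b - 1), b)


def div_up_new(a: int, b: int) -> int:
    """a / b + (a % b != 0); no intermediate sum, valid for a >= 0, b > 0."""
    return a // b + (a % b != 0)


# The two forms agree on small non-negative inputs.
agree_on_small = all(
    div_up_old(a, b) == div_up_new(a, b)
    for a in range(0, 200)
    for b in range(1, 20)
)

# Near INT32_MAX the old form's intermediate sum no longer fits in 32 bits.
old_large = div_up_old(INT32_MAX, 64)
new_large = div_up_new(INT32_MAX, 64)
```

Under these assumptions the new form stays correct at the top of the range while the old form goes negative, which is the kind of justification the reviewer asks to see in the PR description.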
    const paddle::optional<paddle::Tensor>& attn_mask,
    const paddle::optional<paddle::Tensor>& cache_k_quant_scales,
    const paddle::optional<paddle::Tensor>& cache_v_quant_scales,
    const paddle::optional<paddle::Tensor>& cache_k_dequant_scales,
🟡 Suggestion: cache_k/v_dequant_scales are optional tensors, but the C8 path has no null check
In the C8 dispatch path of the function body, the code calls directly:
    cache_quant_type == "block_wise_fp8"
        ? cache_k_quant_scales.get()
        : cache_k_dequant_scales.get()  // ← crashes if uninitialized
When cache_quant_type is "cache_fp8" or "cache_int8", dequant_scales.get() is used. If the caller did not pass that optional tensor, this triggers Paddle's optional dereference exception.
Suggested fix: add an assertion before entering the C8 dispatch:
    PADDLE_ENFORCE_EQ(
        cache_k_dequant_scales.is_initialized(), true,
        phi::errors::InvalidArgument(
            "cache_k_dequant_scales must be provided for cache_quant_type=%s",
            cache_quant_type.c_str()));
Alternatively, ensure on the Python side, in forward_mixed, that dequant_scales is always passed whenever the type is not block_wise_fp8.
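The Python-side alternative could look roughly like the sketch below. It is a hypothetical guard, not the PR's forward_mixed code: the function name `select_k_scales` and its plain-list arguments are assumptions made for illustration, and only the quantization-type strings come from the review comment.

```python
from typing import Optional, Sequence


def select_k_scales(
    cache_quant_type: str,
    cache_k_quant_scales: Optional[Sequence[float]],
    cache_k_dequant_scales: Optional[Sequence[float]],
) -> Sequence[float]:
    """Pick the scale tensor for the C8 path and fail early if it is missing."""
    if cache_quant_type == "block_wise_fp8":
        name, scales = "cache_k_quant_scales", cache_k_quant_scales
    else:
        # "cache_fp8" / "cache_int8" read the dequant scales instead.
        name, scales = "cache_k_dequant_scales", cache_k_dequant_scales
    if scales is None:
        raise ValueError(
            f"{name} must be provided for cache_quant_type={cache_quant_type}"
        )
    return scales


# Usage: a missing dequant scale is reported before any kernel is launched.
try:
    select_k_scales("cache_int8", [1.0], None)
    missing_scale_detected = False
except ValueError:
    missing_scale_detected = True
```

Checking in Python gives a readable error at the call site, while the C++ assertion protects callers that bypass the backend; the two checks are complementary rather than alternatives.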
    &PerTokenGroupQuantFp8,
    "per_token_group_quant_fp8");

    /**
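❓ Question: decoder_write_cache_with_rope and config_for_attention are registered only through pybind11; no PD_BUILD_STATIC_OP macro is visible
Within the diff, only decode_unified_attention.cu shows a PD_BUILD_STATIC_OP(decode_unified_attention) registration. decoder_write_cache_with_rope.cu and decode_unified_attention/config_for_attention.cu do not show a static-op registration in the diff.
Please confirm:
- whether both files also contain the corresponding PD_BUILD_STATIC_OP(decoder_write_cache_with_rope) / PD_BUILD_STATIC_OP(config_for_attention) macros;
- if these two operators are only called through pybind11 (dynamic graph), add a code comment explaining why static registration is not needed.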
#7729)
* support c8 decode attention
* support c16 attention && backend
* opt kernel
* fix
* opt larger batch
* inplace out
* fix input_batch && remove fast_math
* fix xpu
* fix bug
* fix ci
* opt and fix mtp
* fix merge
* clean code
* fix merge
* update
* update test
* fix test
* fix test
* opt buffer
* fix conflict
---------
Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com>
Motivation
Adds decode unified attention support for C16, dynamic C8, and static C8 caches, to accelerate split-KV attention computation in the decode stage (SM90+ Hopper devices).
Usage:
    export USE_DECODE_UNIFIED_ATTENTION=1
    export FD_ATTENTION_BACKEND=DECODE_UNIFIED_ATTN
Modifications
- custom_ops/gpu_ops/decode_unified_attention.cu: registers three operators: DecodeUnifiedAttention, DecoderWriteCacheWithRoPE, and ConfigForAttention
- custom_ops/gpu_ops/decode_unified_attention/: core kernel implementation (attention_func.cuh, decode_unified_attention_c16_impl.cuh, decode_unified_attention_c8_impl.cuh, etc.)
- DecodeUnifiedAttentionBackend (fastdeploy/model_executor/layers/attention/decode_unified_attention_backend.py): implements the forward_mixed interface. Flow: config_for_attention (chunk configuration, at layer 0) → decoder_write_cache_with_rope (write KV cache + RoPE) → decode_unified_attention (split-KV attention computation)
- fastdeploy/envs.py: USE_DECODE_UNIFIED_ATTENTION (switch for centralized deployment, default 0) and FD_ATTENTION_BACKEND=DECODE_UNIFIED_ATTN (switch for PD disaggregation)
- fastdeploy/platforms/cuda.py: adds dispatch for the DECODE_UNIFIED_ATTN backend
- fastdeploy/worker/gpu_model_runner.py: adds the buffer allocation path for decode unified attention
- custom_ops/setup_ops.py: adds the new operator source files to the SM90+/NVCC≥12.0 branch
- tests/operators/attention/test_decode_unified_attention_c16.py, test_decode_unified_attention_c8.py
Usage or Command
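Centralized deployment (when flash_attn is enabled):
    export USE_DECODE_UNIFIED_ATTENTION=1
PD-disaggregated D node:
    export FD_ATTENTION_BACKEND=DECODE_UNIFIED_ATTN
Accuracy Tests
N/A (this PR adds a new operator; no accuracy comparison against a reference implementation has been provided yet)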
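For orientation, an accuracy comparison of this kind could follow the pattern below: compute single-query attention once over the whole KV cache and once chunk by chunk, then compare. This is a pure-Python sketch of the general split-KV idea, not the PR's kernel or its tests. The function names, the chunk size, and the max/sum merge step are assumptions for illustration, and the real operator's chunking, quantization, and tolerances may differ.

```python
import math
import random


def full_attention(q, keys, values):
    """Reference: softmax attention for one query over the whole KV cache."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e * v[d] for e, v in zip(exps, values)) / total for d in range(dim)]


def split_kv_attention(q, keys, values, chunk_size):
    """Process the KV cache in chunks, then merge the partial results."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    partials = []  # per chunk: (max score, sum of exps, unnormalized output)
    for start in range(0, len(keys), chunk_size):
        ks = keys[start:start + chunk_size]
        vs = values[start:start + chunk_size]
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in ks]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        out = [sum(e * v[d] for e, v in zip(exps, vs)) for d in range(dim)]
        partials.append((m, sum(exps), out))
    # Rescale every chunk to the global maximum before summing.
    m_all = max(m for m, _, _ in partials)
    denom = sum(math.exp(m - m_all) * s for m, s, _ in partials)
    return [
        sum(math.exp(m - m_all) * out[d] for m, _, out in partials) / denom
        for d in range(dim)
    ]


random.seed(0)
head_dim, seq_len = 8, 100
q = [random.gauss(0, 1) for _ in range(head_dim)]
keys = [[random.gauss(0, 1) for _ in range(head_dim)] for _ in range(seq_len)]
values = [[random.gauss(0, 1) for _ in range(head_dim)] for _ in range(seq_len)]

ref = full_attention(q, keys, values)
out = split_kv_attention(q, keys, values, chunk_size=16)
max_err = max(abs(a - b) for a, b in zip(ref, out))
```

In double precision the two results should differ only by rounding error; a check against the actual C16 or C8 kernels would need looser tolerances suited to their data types.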
Checklist
- Tag list: [[FDConfig],[APIServer],[Engine],[Scheduler],[PD Disaggregation],[Executor],[Graph Optimization],[Speculative Decoding],[RL],[Models],[Quantization],[Loader],[OP],[KVCache],[DataProcessor],[BugFix],[Docs],[CI],[Optimization],[Feature],[Benchmark],[Others],[XPU],[HPU],[GCU],[DCU],[Iluvatar],[Metax]]
- Run pre-commit before commit.
- When submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.