Skip to content

[Feature] support decode unified attention#7688

Merged
Jiang-Jia-Jun merged 11 commits into
PaddlePaddle:developfrom
lizhenyun01:h_dec_attn
May 25, 2026
Merged

[Feature] support decode unified attention#7688
Jiang-Jia-Jun merged 11 commits into
PaddlePaddle:developfrom
lizhenyun01:h_dec_attn

Conversation

@lizhenyun01

@lizhenyun01 lizhenyun01 commented May 1, 2026

Copy link
Copy Markdown
Collaborator

Motivation

C16/动态/静态C8 decode unified attention 支持,用于 decode 阶段 split-KV 注意力计算加速(SM90+ Hopper 设备)。

使用方式:

  • 集中式(flash_attn 开启时):export USE_DECODE_UNIFIED_ATTENTION=1
  • PD 分离 D 节点:export FD_ATTENTION_BACKEND=DECODE_UNIFIED_ATTN

Modifications

  • 新增 CUDA 算子实现(仅在 SM90+ / NVCC≥12.0 下编译):
    • custom_ops/gpu_ops/decode_unified_attention.cu:注册 DecodeUnifiedAttentionDecoderWriteCacheWithRoPEConfigForAttention 三个算子
    • custom_ops/gpu_ops/decode_unified_attention/:核心 kernel 实现(attention_func.cuhdecode_unified_attention_c16_impl.cuhdecode_unified_attention_c8_impl.cuh 等)
  • 新增 Python 后端 DecodeUnifiedAttentionBackendfastdeploy/model_executor/layers/attention/decode_unified_attention_backend.py),实现 forward_mixed 接口,流程:config_for_attention(layer 0 chunk 配置)→ decoder_write_cache_with_rope(写 KV cache + RoPE)→ decode_unified_attention(split-KV attention 计算)
  • 新增环境变量 USE_DECODE_UNIFIED_ATTENTION(集中式开关,默认 0)和 FD_ATTENTION_BACKEND=DECODE_UNIFIED_ATTN(PD 分离开关)至 fastdeploy/envs.py
  • 更新 fastdeploy/platforms/cuda.py 添加 DECODE_UNIFIED_ATTN 后端分发
  • 更新 fastdeploy/worker/gpu_model_runner.py 添加 decode unified attention buffer 分配路径
  • 更新 custom_ops/setup_ops.py 在 SM90+/NVCC≥12.0 分支加入新算子源文件
  • 新增单测:tests/operators/attention/test_decode_unified_attention_c16.pytest_decode_unified_attention_c8.py

Usage or Command

集中式(flash_attn 开启时):

export USE_DECODE_UNIFIED_ATTENTION=1

PD 分离 D 节点:

export FD_ATTENTION_BACKEND=DECODE_UNIFIED_ATTN

Accuracy Tests

N/A(本次 PR 为新增算子,尚未提供与参考实现的精度对比数据)

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot Bot commented May 1, 2026

Copy link
Copy Markdown

Thanks for your contribution!

@PaddlePaddle-bot

This comment was marked as outdated.

PaddlePaddle-bot

This comment was marked as outdated.

@codecov-commenter

codecov-commenter commented May 1, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 77.45902% with 55 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@dad5a43). Learn more about missing BASE report.

Files with missing lines Patch % Lines
...yers/attention/decode_unified_attention_backend.py 71.09% 26 Missing and 11 partials ⚠️
fastdeploy/spec_decode/mtp.py 57.57% 12 Missing and 2 partials ⚠️
...cutor/layers/attention/ops/config_for_attention.py 85.71% 0 Missing and 1 partial ⚠️
...r/layers/attention/ops/decode_unified_attention.py 88.88% 0 Missing and 1 partial ⚠️
...ers/attention/ops/decoder_write_cache_with_rope.py 88.88% 0 Missing and 1 partial ⚠️
fastdeploy/worker/gpu_model_runner.py 88.88% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7688   +/-   ##
==========================================
  Coverage           ?   63.64%           
==========================================
  Files              ?      466           
  Lines              ?    64726           
  Branches           ?     9918           
==========================================
  Hits               ?    41192           
  Misses             ?    20741           
  Partials           ?     2793           
Flag Coverage Δ
GPU 72.73% <77.45%> (?)
XPU 7.09% <0.40%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

PaddlePaddle-bot

This comment was marked as outdated.

PaddlePaddle-bot

This comment was marked as outdated.

PaddlePaddle-bot

This comment was marked as outdated.

PaddlePaddle-bot

This comment was marked as outdated.

@PaddlePaddle-bot

Copy link
Copy Markdown

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-06 20:58:21

CI报告基于以下代码生成(30分钟更新一次):


1 任务总览

❌ 存在 1 个 Required 任务失败,需优先处理后方可合并。

总执行(rerun次数) 总任务 ✅ 通过 ❌ 失败 ⏳ 运行中 ⏸️ 等待中 跳过
19(0) 19 12 2 2 3 0

2 任务状态汇总

2.1 Required任务 : 1/2 通过

必选任务阻塞合并,失败需优先处理。

状态 任务 耗时 根因 修复建议 日志 重跑
Approval 8s PR问题:PR缺少4个必要审批(自定义算子及敏感文件) 联系相关RD审批(dangqingqing/jeff41404等) Job -
其余 1 个必选任务通过 - - - - -

2.2 可选任务 — 11/17 通过

可选任务不阻塞合并,失败仅供参考。

状态 任务 耗时 日志 重跑
Check PR Template 12s Job -
xpu_build_test / xpu-build-test - Job -
Trigger Jenkins for PR - Job -
⏸️ CI_HPU - - -
⏸️ Run iluvatar Tests / run_iluvatar_cases - - -
⏸️ FD-Build-Linux / fd-build - - -
其余 11 个可选任务通过 - - -

3 失败详情(仅 required)

Approval — 审批缺失(置信度: 高)

Approval

  • 状态: ❌ 失败
  • 错误类型: 审批缺失
  • 置信度: 高
  • 根因摘要: PR缺少4个必要审批,涉及自定义算子及敏感文件修改
  • 分析器: 通用分析(fallback)

根因详情:
check_approval.sh 脚本检测到本 PR 缺少 4 项必要审批。本次 PR 修改了自定义算子相关目录(fastdeploy/spec_decodecustom_ops/gpu_ops/speculate_decoding)以及敏感配置文件(fastdeploy/envs.py),这些改动需要对应模块负责人的显式 Approve 方可合并。

关键日志:

0. You must have one FastDeploy RD (qingqing01, Jiang-Jia-Jun, heavengate) approval for adding custom op.
1. You must have one PaddlePaddle RD (jeff41404, yongqiangma) approval for adding custom op.
2. You must have one FastDeploy RD (freeliuzc, Deleter-D) approval for modifying [fastdeploy/spec_decode, custom_ops/gpu_ops/speculate_decoding].
3. You must have one FastDeploy RD (Jiang-Jia-Jun, yuanlehome, rainyfly, Wanglongzhi2001) approval for modifying [fastdeploy/envs.py].
There are 4 approved errors.
##[error]Process completed with exit code 6.

修复建议:

  1. @dangqingqing / @jiangjiajun / @DENGKAIPENG 中任意一位 FastDeploy RD Approve(自定义算子)
  2. @gaoxiang / @mayongqiang 中任意一位 PaddlePaddle RD Approve(自定义算子)
  3. 请 @liuzichang01 / @wangyanpeng04 中任意一位 FastDeploy RD Approve(spec_decode/custom_ops 目录)
  4. @jiangjiajun / @liuyuanle / @chenjian26 / @wanglongzhi 中任意一位 FastDeploy RD Approve(envs.py)

修复建议摘要: 请相关RD审批(dangqingqing/jeff41404/liuzichang01/jiangjiajun等)

链接: 查看日志

PaddlePaddle-bot

This comment was marked as outdated.

PaddlePaddle-bot

This comment was marked as outdated.

PaddlePaddle-bot

This comment was marked as outdated.

PaddlePaddle-bot

This comment was marked as outdated.

PaddlePaddle-bot

This comment was marked as outdated.

PaddlePaddle-bot

This comment was marked as outdated.

PaddlePaddle-bot

This comment was marked as outdated.

@PaddlePaddle-bot

PaddlePaddle-bot commented May 20, 2026

Copy link
Copy Markdown

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-24 17:30:54

CI报告基于以下代码生成(30分钟更新一次):


1 任务总览

当前 Required 任务 9/10 通过,仍有 1 个 Required 任务失败:Approval。主测试任务 Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage 已通过;当前阻塞项为人工审批检查,需完成 Approval 后再观察后续状态。

总执行(rerun次数) 总任务 ✅ 通过 ❌ 失败 ⏳ 运行中 ⏸️ 等待中 跳过
43(1) 42 38 4 0 0 0

2 任务状态汇总

日志列说明:失败任务直接使用日志链接;可选任务不阻塞合并,仅供参考。

2.1 Required任务 : 9/10 通过

必选任务阻塞合并,失败需优先处理。

状态 任务 耗时 根因 修复建议 日志 重跑
Approval 8s 需要 Approval:审批检查未通过 请通过人工审批 Job 🔄×1
其余 9 个必选任务通过 - - - - -

2.2 可选任务 — 29/32 通过

可选任务不阻塞合并,失败仅供参考。

状态 任务 耗时 日志 重跑
Run iluvatar Tests / run_iluvatar_cases 3m15s Job -
CI_HPU 1h25m Job -
Trigger Jenkins for PR 7m51s Job -
其余 29 个可选任务通过 - - -

3 失败详情(仅 required)

Approval — approval_required(置信度: 高)

该 Job 需要人工 Approval,完成审批后 CI 才会继续执行。

补充信息:Approval workflow 在 .github/workflows/approve.yml 中执行 scripts/check_approval.sh 进行审批检查;本次主测试任务已通过,当前 Required 阻塞项不是测试失败。

freeliuzc pushed a commit that referenced this pull request May 20, 2026
* support decode unified attention

* support dyc8 && test

* opt memory
PaddlePaddle-bot

This comment was marked as outdated.

PaddlePaddle-bot

This comment was marked as outdated.

PaddlePaddle-bot

This comment was marked as outdated.

@PaddlePaddle-bot PaddlePaddle-bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Paddle-CI-Agent | pr_review | 2026-05-21 16:16:46

📋 Review 摘要

PR 概述:新增 decode unified attention 算子,支持 Hopper(SM90+)设备 decode 阶段 split-KV 注意力加速,覆盖 C16/动态C8/静态C8 三种 cache 量化模式。
变更范围custom_ops/gpu_ops/decode_unified_attention/fastdeploy/model_executor/layers/attention/fastdeploy/worker/gpu_model_runner.pyfastdeploy/spec_decode/
影响面 Tag[OP] [Feature] [Speculative Decoding] [PD Disaggregation]

问题

级别 文件 概述
🟡 建议 custom_ops/gpu_ops/append_attn/utils.cuh:34 div_up 修改超出本 PR 范围,会影响现有 append_attn 系列算子
🟡 建议 custom_ops/gpu_ops/decode_unified_attention.cu:57 cache_k/v_dequant_scales 为 optional,C8 路径直接调用 .get() 无 null 检查
❓ 疑问 custom_ops/gpu_ops/cpp_extensions.cc:2045 decoder_write_cache_with_ropeconfig_for_attention 未见 PD_BUILD_STATIC_OP 静态 Op 注册

📝 PR 规范检查

PR 标题 [Feature] support decode unified attention 使用官方 Tag,格式规范。PR 描述包含全部必填段落(Motivation / Modifications / Usage or Command / Accuracy Tests / Checklist),结构合规,Accuracy Tests 明确说明 N/A。✓ PR 规范检查通过,无需修改。

总体评价

PR 整体实现完整,SM90+ 门控、setup_ops.py 编译注册、Python 调用侧均已同步,测试文件已补充。主要关注点是 C8 路径下 optional tensor 无 null 检查的崩溃风险,以及 append_attn/utils.cuhdiv_up 修改的波及范围。建议作者确认上述两点后合入。


__forceinline__ __host__ __device__ int div_up(int a, int b) {
return (a + b - 1) / b;
return a / b + (a % b != 0);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 建议 div_up 修改超出本 PR 预期范围

此改动将 (a + b - 1) / b 改为 a / b + (a % b != 0),虽对正整数数学等价,但修改了 append_attn/ 目录下所有算子共用的工具函数,而本 PR 的新算子并不在该目录下。

建议修复方式:

  • 若目的是修复潜在的整数溢出风险(a + b - 1 对超大值可能溢出),请在 PR 描述中补充说明,并增加对现有 append_attn 算子的回归测试
  • 若只是 decode_unified_attention 内部需要,建议只在 custom_ops/gpu_ops/decode_unified_attention/utils.cuh 中单独定义,避免修改无关文件

const paddle::optional<paddle::Tensor>& attn_mask,
const paddle::optional<paddle::Tensor>& cache_k_quant_scales,
const paddle::optional<paddle::Tensor>& cache_v_quant_scales,
const paddle::optional<paddle::Tensor>& cache_k_dequant_scales,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 建议 cache_k/v_dequant_scales 为可选 tensor,C8 路径未做 null 检查

在函数体的 C8 dispatch 路径中,代码直接调用:

cache_quant_type == "block_wise_fp8"
    ? cache_k_quant_scales.get()
    : cache_k_dequant_scales.get()  // ← 若未初始化则崩溃

cache_quant_type"cache_fp8""cache_int8" 时使用 dequant_scales.get(),若调用方未传入该 optional tensor,将触发 Paddle 的 optional dereference 异常。

建议修复方式:在进入 C8 dispatch 前添加断言:

PADDLE_ENFORCE_EQ(
    cache_k_dequant_scales.is_initialized(), true,
    phi::errors::InvalidArgument(
        "cache_k_dequant_scales must be provided for cache_quant_type=%s",
        cache_quant_type.c_str()));

或在 Python 侧 forward_mixed 中确保非 block_wise_fp8 时必传 dequant_scales。

&PerTokenGroupQuantFp8,
"per_token_group_quant_fp8");

/**

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❓ 疑问 decoder_write_cache_with_ropeconfig_for_attention 仅通过 pybind11 注册,未见 PD_BUILD_STATIC_OP

在 diff 范围内,只有 decode_unified_attention.cu 中可见 PD_BUILD_STATIC_OP(decode_unified_attention) 宏注册。decoder_write_cache_with_rope.cudecode_unified_attention/config_for_attention.cu 未在 diff 中展示其静态 Op 注册。

请确认:

  1. 两个文件中是否也有对应的 PD_BUILD_STATIC_OP(decoder_write_cache_with_rope) / PD_BUILD_STATIC_OP(config_for_attention) 宏;
  2. 若这两个算子只通过 pybind11 调用(动态图),请在代码中加注释说明不需要静态注册的原因。

@yongqiangma yongqiangma left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Jiang-Jia-Jun Jiang-Jia-Jun merged commit 4762002 into PaddlePaddle:develop May 25, 2026
39 of 44 checks passed
Jiang-Jia-Jun pushed a commit that referenced this pull request May 26, 2026
#7729)

* support c8 decode attention

* support c16 attention && backend

* opt kernel

* fix

* opt larger batch

* inplace out

* fix input_batch && remove fast_math

* fix xpu

* fix bug

* fix ci

* opt and fix mtp

* fix merge

* clean code

* fix merge

* update

* update test

* fix test

* fix test

* opt buffer

* fix conflict

---------

Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants