[OP][Optimization] Remove ENABLE_PREFILL template parameter in multi_query_append_attention_warp1_4_kernel #7201
Conversation
Thanks for your contribution!
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff            @@
##             develop    #7201   +/- ##
=========================================
  Coverage           ?   73.84%
=========================================
  Files              ?      376
  Lines              ?    52947
  Branches           ?     8263
=========================================
  Hits               ?    39101
  Misses             ?    11111
  Partials           ?     2735
```

Flags with carried forward coverage won't be shown.
fastdeploy-bot left a comment

🤖 AI Code Review · 2026-04-07 10:33 CST
📋 Review Summary
PR overview: remove the `ENABLE_PREFILL` template parameter and unify the attention kernel's memory layout
Scope of changes: custom_ops/gpu_ops/append_attn/, fastdeploy/model_executor/layers/attention/
Impact tags: OP, Speculative Decoding
📝 PR Convention Check
The PR title and description do not follow the conventions:
- The title lacks a valid Tag label
- The Motivation and Modifications sections of the description are not filled in
Suggested title (copy-ready):
[OP][Optimization] Remove ENABLE_PREFILL template parameter in attention kernel
Suggested description:
## Motivation
Simplify the attention kernel code by removing the no-longer-needed `ENABLE_PREFILL` template-parameter branches and unifying on the `speculate_max_draft_token_num`-based memory layout.
## Modifications
1. Remove the `ENABLE_PREFILL` template parameter from `multi_query_append_attention_warp1_4_kernel` in `multiquery_attention_c16_impl.cuh`
2. Simplify the memory-allocation logic to uniformly use the `speculate_max_draft_token_num * bsz` computation
3. On the Python side (append_attn_backend.py, flash_attn_backend.py, flash_mask_attn_backend.py), set `speculate_max_draft_token_num` to 0 in non-speculative mode

Issues
| Level | File | Summary |
|---|---|---|
| ❓ Question | multiquery_attention_c16_impl.cuh:433 | The c4/c8 variants did not have ENABLE_PREFILL removed as well; is this intentional? |
| 🟡 Suggestion | append_attn_backend.py:150 | Consider adding a comment explaining the link to the CUDA side |
Overall Evaluation
The code logic is correct: the Python side sets speculate_max_draft_token_num=0 (so the CUDA side receives +1 = 1), which is compatible with the unified memory layout after removing ENABLE_PREFILL. It is recommended to fill in the required PR convention information and to confirm whether the c4/c8 variants need the same change.
```diff
           uint32_t num_frags_y,
-          typename OutT = T,
-          bool ENABLE_PREFILL = true>
+          typename OutT = T>
```
❓ Question: only multi_query_append_attention_warp1_4_kernel had the ENABLE_PREFILL template parameter removed
Note that multiquery_attention_c4_impl.cuh and multiquery_attention_c8_impl.cuh still keep the ENABLE_PREFILL parameter. Is this intentional (a staged refactor), or an oversight?
If it is intentional, it would be worth stating the reason in the PR description.
```diff
         self.speculative_method = fd_config.speculative_config.method
         self.speculate_max_draft_token_num: int = fd_config.speculative_config.num_speculative_tokens
+        if self.speculative_method is None:
+            self.speculate_max_draft_token_num = 0
```
🟡 Suggestion: add a comment explaining the purpose of this setting
This change ensures speculate_max_draft_token_num=0 in non-speculative mode, matching the unified memory layout on the CUDA side after removing ENABLE_PREFILL. Adding a comment that documents this link would help future maintenance.
```python
# When not using speculative decoding, set to 0. The CUDA kernel will
# receive (speculate_max_draft_token_num + 1) = 1, which matches the
# simplified memory layout after removing ENABLE_PREFILL branches.
if self.speculative_method is None:
    self.speculate_max_draft_token_num = 0
```
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
- PR title tag (choose one): [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run `pre-commit` before commit.
- For a PR targeting the `release` branch, make sure the PR has been submitted to the `develop` branch first, then cherry-pick it to the `release` branch with the [Cherry-Pick] PR tag.