[CI]【Hackathon 10th Spring No.53】headwise SWA unit test [cfx]#7693
[CI]【Hackathon 10th Spring No.53】headwise SWA unit test [cfx]#7693jonny-cloudforge wants to merge 1 commit into
Conversation
CI报告基于以下代码生成(30分钟更新一次): 1 任务总览当前未发现 required 任务失败、运行中或等待中;但有 2 个 Workflow 处于
2 任务状态汇总日志列说明:失败任务直接使用 2.1 Required任务 : 0/0 通过
2.2 可选任务 — 1/1 通过
3 失败详情(仅 required)无 required 失败任务,无需深度日志分析。 4 本轮检查说明
|
PaddlePaddle-bot
left a comment
There was a problem hiding this comment.
🤖 Paddle-CI-Agent | pr_review |
2026-05-22 18:19:45
📋 Review 摘要
PR 概述:为 ResourceManagerV1 新增 per-head SWA KV 块回收机制(Hackathon 10th Spring No.53 PR1)
变更范围:cache_manager/、engine/sched/resource_manager_v1.py、config.py、envs.py、worker/
影响面 Tag:[KVCache] [Scheduler] [FDConfig]
问题
| 级别 | 文件 | 概述 |
|---|---|---|
| 🔴 Bug | fastdeploy/cache_manager/prefix_cache_manager.py:597 |
assert 用于运行时分配容量校验,Python -O 下失效 |
| 🟡 建议 | fastdeploy/worker/worker_process.py:484 |
删除 _tp_barrier_wait() 可能导致多卡 TP 进程间不同步 |
| 🟡 建议 | fastdeploy/worker/gpu_model_runner.py:2017 |
CUDAGraph _dummy_run 逻辑变更需确认是否同步到其他硬件 runner |
📝 PR 规范检查
PR 描述结构完整(Motivation / Modifications / Usage or Command / Accuracy Tests / Checklist 均存在)。标题 Tag [CI] 与实际变更不符:PR 主体为 KVCache 管理器和调度器的功能增强,[CI] Tag 仅用于 .github/workflows/、scripts/ 等 CI 配置变更。
标题建议(可直接复制):
[Feature][KVCache] Add per-head SWA block recycle to ResourceManagerV1 (Hackathon 10th Spring No.53 PR1)
总体评价
PR 实现了 per-head SWA KV cache 块回收机制,逻辑结构清晰,默认关闭保证向后兼容,测试覆盖了主要场景。需修复运行时 assert 校验问题(P0),确认 TP barrier 删除的影响,并将标题 Tag 由 [CI] 改为 [Feature][KVCache]。
| kv_num_heads = max(1, self.kv_num_heads) | ||
| needed = num_blocks * kv_num_heads | ||
| free_list = self.gpu_free_head_wise_block_list | ||
| assert needed <= len(free_list), f"head-wise gpu free block num: {len(free_list)} < needed number {needed}" |
There was a problem hiding this comment.
🔴 Bug assert 被用于运行时分配容量校验,Python -O 下此断言被静默跳过,heappop 将在空堆上抛出无上下文的 IndexError(checklist §C 表层信号)。
建议修复:
if needed > len(free_list):
raise ValueError(
f"head-wise gpu free block num: {len(free_list)} < needed number {needed}"
)| req_dicts = None | ||
| self.worker_healthy_live_signal.value[tp_rank % self.max_chips_per_node] = int(time.time()) | ||
|
|
||
| self._tp_barrier_wait() if tp_size > 1 else None |
There was a problem hiding this comment.
🟡 建议 此处删除了 self._tp_barrier_wait() if tp_size > 1 else None。
该调用是多卡 TP 场景下 event_loop_normal 的同步屏障,确保所有 TP rank 在 task-queue 检测前保持对齐。删除后,当 tp_rank > 0 的进程处理速度与 rank-0 存在差异时,可能导致 rank 间任务不同步。
建议补充说明该删除的原因(如已有其他同步机制替代),或恢复该调用。
| @@ -2016,10 +2017,8 @@ def _dummy_run( | |||
| while True: | |||
There was a problem hiding this comment.
🟡 建议 _dummy_run 的 CUDAGraph 逻辑由 if not (in_capturing or step_use_cudagraph): set False 改为 in_capturing and forward_meta.step_use_cudagraph,语义有差异。同时 worker_process.py 中的 RUN_DUMMY_FOR_PROFILE 及 _dummy_run(step_use_cudagraph=True) 调用也一并删除。
请确认此变更是否需要同步到其他硬件 ModelRunner(xpu_model_runner.py、iluvatar_model_runner.py 等),以避免其他硬件 CI 失败(checklist §A6 多硬件同步)。
|
|
PR1 Body —
[Feature][KVCache] Add per-head SWA block recycle to ResourceManagerV1 (Hackathon 10th Spring No.53)Motivation
Hackathon 10th Spring Task No.53 — 离散 KV Cache 管理和 AppendAttention 算子的性能优化 (PR1 of 2). Spec: https://github.com/PaddlePaddle/community/blob/master/hackathon/hackathon_10th/【Hackathon_10th】开源贡献个人挑战赛春节特别季—任务合集.md#no53.
For models that mix Sliding-Window Attention (SWA) heads with full-attention heads inside the same layer, today's V1 KV-cache scheduling path (
ResourceManagerV1+PrefixCacheManager, gated by the default-onENABLE_V1_KVCACHE_SCHEDULER=1) allocates one sharedblock_idxper layer for all heads. SWA heads finish their window long before full-attn heads, but their cache stays pinned until the whole layer evicts. Throughput suffers.This PR teaches the V1 scheduler +
PrefixCacheManagerto manageblock_idxper head (head-wise SWA layout) and recycle a SWA head's cache as soon as it crosses its window — the per-head equivalent of what PR #6702 did for V0.Authorship: this PR is independently designed and implemented by the submitter for Hackathon 10th Spring No.53. The earlier community PR #6702 (V0, not merged) is referenced as prior art only; no code is lifted unattributed. Any future contributor work will be acknowledged via per-commit
Co-authored-bytrailers.RFC: PaddlePaddle/community#1362.
Modifications
fastdeploy/cache_manager/prefix_cache_manager.pygpu_free_block_list_head_wise[head]);allocate_gpu_blocks_head_wise/recycle_gpu_blocks_head_wise; TP-aware sizing (num_key_value_heads // tp_size)fastdeploy/engine/sched/resource_manager_v1.pyrecycle_request_swa_head_cache(per-head cursor advance ≥ window+sink);_should_skip_swa_recycle_for_overlap(per-requestcache_swap_metadata/cache_evict_metadatainspection); P4 cleanup in_free_blocksfastdeploy/model_executor/models/paddleformers/base.pyFD_T53_HEAD_WISE_SWA_FIXTURE=1fastdeploy/config.pypaddleformers/base.pyhead-wise SWA attribute injection soResourceManagerV1._should_use_head_wise_swa(engine-main) sees the samemodel_config.head_wise_swa_ratioas the worker. Gated onFD_T53_HEAD_WISE_SWA_FIXTURE.enable_prefix_caching=True + FD_HEAD_WISE_KV_CACHE=1raises atPrefixCacheManager.__init__FD_HEAD_WISE_KV_CACHE=0default — bit-identical when disabledTests use real lightweight objects +
object.__new__/AST or shape oracles (noMagicMock-only). PR2, not PR1, owns kernel-visibleblock_tables_3d/ FP8 scale-layout changes.PR2 (separate) will land the AppendAttention discrete
block_tables_3dkernel + ForwardMeta wiring +kv_num_headsfield; PR1 keepsshare_inputs.block_tables2D and reaches the +30% recycle gate via cache-manager-side changes only.Usage or Command
Accuracy Tests
Spec PR1 acceptance — throughput up ≥30% with timely SWA recycle vs without, same VRAM, fixed-IO dataset, V1 KV-cache scheduler on (
ENABLE_V1_KVCACHE_SCHEDULER=1, default):Additional: p99 TTFT 570 s → 285 s (−50%). Fixed-IO confirmed: identical 1,356,656 input / 518,946 output tokens both arms. 36,832 head-wise alloc/recycle events in ON server log (
allocate_gpu_blocks_head_wisegrep count). Oracle:[T53/bench] head-wise oracle PASS: issued=1 recycled=1.(Quick validation run: num_prompts=128. Full 1,024-prompt run in progress; table will be updated once complete.)
Benchmark:
FastDeploy/benchmarks/benchmark_serving.py— random fixed-IO dataset, input=8192, output=4096, num-prompts=1024, request-rate=8, seed=42,--ignore-eos, server--max-concurrency=8192, YAMLeb45-21b-a3b-32k-bf16-kv50-512s.yaml. The harness rejects partial JSONs (completed != 1024or non-emptyerrors).Correctness:
tests/cache_manager/test_head_wise_*.py,tests/cache_manager/test_swa_recycle*.py, andtests/layers/test_append_attention_head_wise_shapes.py— real_FakeCacheManager+object.__new__(ResourceManagerV1)+ AST/shape oracles. NoMagicMock-only tests.issued=1 recycled=1), 36,832 head-wise alloc/recycle events in full bench server log.CI run:
Checklist
pre-commit run --all-filesclean (black reformatted 2 files; amend committed)