[CI]【Hackathon 10th Spring No.53】headwise SWA unit test [cfx] by jonny-cloudforge · Pull Request #7693 · PaddlePaddle/FastDeploy

jonny-cloudforge · 2026-05-02T12:13:33Z

PR1 Body — `[Feature][KVCache] Add per-head SWA block recycle to ResourceManagerV1 (Hackathon 10th Spring No.53)`

Push-ready version (all commits staged, bench gate passed). Saved here, not in /tmp.
5 required sections per FastDeploy CI gate. Word budget ≤600.

Motivation

Hackathon 10th Spring Task No.53: performance optimization of discrete KV Cache management and the AppendAttention operator (PR1 of 2). Spec: https://github.com/PaddlePaddle/community/blob/master/hackathon/hackathon_10th/【Hackathon_10th】开源贡献个人挑战赛春节特别季—任务合集.md#no53.

For models that mix Sliding-Window Attention (SWA) heads with full-attention heads inside the same layer, today's V1 KV-cache scheduling path (ResourceManagerV1 + PrefixCacheManager, gated by the default-on ENABLE_V1_KVCACHE_SCHEDULER=1) allocates one shared block_idx per layer for all heads. SWA heads finish their window long before full-attn heads, but their cache stays pinned until the whole layer evicts. Throughput suffers.

This PR teaches the V1 scheduler + PrefixCacheManager to manage block_idx per head (head-wise SWA layout) and recycle a SWA head's cache as soon as it crosses its window — the per-head equivalent of what PR #6702 did for V0.
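The recycle rule described above can be illustrated with a small sketch. It is not the PR's implementation: `HeadBlocks`, `advance_swa_cursor`, and the block-size arithmetic are hypothetical names and assumptions chosen only to show how a per-head cursor could release blocks that have fallen out of the window while keeping the sink blocks and leaving full-attention heads untouched.

```python
from dataclasses import dataclass, field


@dataclass
class HeadBlocks:
    """Blocks held by one KV head of one request (hypothetical structure)."""
    is_swa: bool
    blocks: list = field(default_factory=list)  # block ids, oldest first
    cursor: int = 0  # index of the first block not yet considered for recycle


def advance_swa_cursor(head, num_tokens, window, sink, block_size):
    """Return the block ids of an SWA head that can be released.

    Full-attention heads never release early. Blocks covering the first
    `sink` tokens and blocks overlapping the trailing window are kept.
    The boundary is conservative: a partially live block is never freed.
    """
    if not head.is_swa or num_tokens <= window + sink:
        return []
    sink_blocks = -(-sink // block_size)  # ceiling division
    first_live_block = (num_tokens - window) // block_size
    start = max(head.cursor, sink_blocks)
    end = max(start, min(first_live_block, len(head.blocks)))
    freed = head.blocks[start:end]
    head.cursor = end  # advance so the same blocks are not released twice
    return freed


swa_head = HeadBlocks(is_swa=True, blocks=list(range(100, 116)))
full_head = HeadBlocks(is_swa=False, blocks=list(range(200, 216)))
print(advance_swa_cursor(swa_head, 1024, window=256, sink=64, block_size=64))
print(advance_swa_cursor(full_head, 1024, window=256, sink=64, block_size=64))
```

In this toy layout the SWA head gives back every block between the sink and the window, while the full-attention head in the same layer keeps all of its blocks, which is the asymmetry the shared per-layer `block_idx` cannot express.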

Authorship: this PR is independently designed and implemented by the submitter for Hackathon 10th Spring No.53. The earlier community PR #6702 (V0, not merged) is referenced as prior art only; no code is lifted unattributed. Any future contributor work will be acknowledged via per-commit Co-authored-by trailers.

RFC: PaddlePaddle/community#1362.

Modifications

| Area | Change |
| --- | --- |
| `fastdeploy/cache_manager/prefix_cache_manager.py` | Per-request head-wise GPU free list (`gpu_free_block_list_head_wise[head]`); `allocate_gpu_blocks_head_wise` / `recycle_gpu_blocks_head_wise`; TP-aware sizing (`num_key_value_heads // tp_size`) |
| `fastdeploy/engine/sched/resource_manager_v1.py` | `recycle_request_swa_head_cache` (per-head cursor advance ≥ window+sink); `_should_skip_swa_recycle_for_overlap` (per-request `cache_swap_metadata` / `cache_evict_metadata` inspection); P4 cleanup in `_free_blocks` |
| `fastdeploy/model_executor/models/paddleformers/base.py` | Default-off ERNIE SWA fixture (window/sink/skip-freq/ratio) gated by `FD_T53_HEAD_WISE_SWA_FIXTURE=1` |
| `fastdeploy/config.py` | +20 lines. Engine-main FDConfig fixture: mirror the `paddleformers/base.py` head-wise SWA attribute injection so `ResourceManagerV1._should_use_head_wise_swa` (engine-main) sees the same `model_config.head_wise_swa_ratio` as the worker. Gated on `FD_T53_HEAD_WISE_SWA_FIXTURE`. |
| Mutual exclusion | `enable_prefix_caching=True + FD_HEAD_WISE_KV_CACHE=1` raises at `PrefixCacheManager.__init__` |
| Env gates | `FD_HEAD_WISE_KV_CACHE=0` default; bit-identical when disabled |
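As a rough illustration of the head-wise free list and TP-aware sizing listed in the table, the following sketch keeps one min-heap of free block ids per local KV head. The class name `HeadWiseFreePool` and its methods are hypothetical; the real `PrefixCacheManager` fields and signatures may differ, and only the `num_key_value_heads // tp_size` sizing rule is taken from the table.

```python
import heapq


class HeadWiseFreePool:
    """Toy per-head free list; names and layout are illustrative only."""

    def __init__(self, num_blocks, num_key_value_heads, tp_size):
        # Each TP rank only owns its shard of the KV heads.
        self.local_heads = max(1, num_key_value_heads // tp_size)
        self.free = [list(range(num_blocks)) for _ in range(self.local_heads)]
        for heap in self.free:
            heapq.heapify(heap)

    def allocate(self, head, count):
        """Pop `count` block ids for one head, failing loudly when short."""
        heap = self.free[head]
        if count > len(heap):
            raise ValueError(
                f"head {head}: {len(heap)} free blocks < {count} requested"
            )
        return [heapq.heappop(heap) for _ in range(count)]

    def recycle(self, head, block_ids):
        """Return blocks to one head's free list without touching the others."""
        for block_id in block_ids:
            heapq.heappush(self.free[head], block_id)


pool = HeadWiseFreePool(num_blocks=4, num_key_value_heads=8, tp_size=2)
held = pool.allocate(head=0, count=2)
pool.recycle(head=0, block_ids=held[:1])
```

Keeping the lists independent is what lets an SWA head return blocks early while a full-attention head of the same layer still holds its own.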

Tests use real lightweight objects + object.__new__/AST or shape oracles (no MagicMock-only). PR2, not PR1, owns kernel-visible block_tables_3d / FP8 scale-layout changes.

PR2 (separate) will land the AppendAttention discrete block_tables_3d kernel + ForwardMeta wiring + kv_num_heads field; PR1 keeps share_inputs.block_tables 2D and reaches the +30% recycle gate via cache-manager-side changes only.

Usage or Command

# Enable head-wise V1 cache + timely SWA recycle.
# All four env vars must be set together — partial activation is silently a no-op.
# Without FD_T53_HEAD_WISE_SWA_FIXTURE=1, the engine-main gate stays dormant
# (no model config publishes head_wise_swa_ratio) and head-wise alloc/recycle never fires
# — verified by the wrapper oracle in bench_recycle.sh.
export FD_T53_HEAD_WISE_SWA_FIXTURE=1     # engine-main FDConfig fixture (config.py)
export ENABLE_V1_KVCACHE_SCHEDULER=1      # default; shown for clarity
export FD_HEAD_WISE_KV_CACHE=1            # enables per-head block tables
export FD_T53_HEAD_WISE_SWA_RATIO=1.0     # SWA recycle ratio (>0 = recycle active)
python -m fastdeploy.entrypoints.openai.api_server \
    --model baidu/ERNIE-4.5-21B-A3B-Paddle \
    --max-model-len 32768
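Because partial activation is described as a silent no-op, a pre-flight check can make a missing gate visible before the server starts. The helper below is a hypothetical sketch, not part of FastDeploy: the variable names and the default-on or default-off states come from this PR description, while the `"0"` default for `FD_T53_HEAD_WISE_SWA_RATIO` is an assumption.

```python
import os

# Defaults as described in this PR; the ratio default is an assumption.
GATE_DEFAULTS = {
    "FD_T53_HEAD_WISE_SWA_FIXTURE": "0",
    "ENABLE_V1_KVCACHE_SCHEDULER": "1",
    "FD_HEAD_WISE_KV_CACHE": "0",
    "FD_T53_HEAD_WISE_SWA_RATIO": "0",
}


def _gate_is_on(name, value):
    """The ratio gate is active when > 0; the other gates when equal to '1'."""
    if name == "FD_T53_HEAD_WISE_SWA_RATIO":
        try:
            return float(value) > 0
        except ValueError:
            return False
    return value == "1"


def inactive_gates(env=None):
    """Return the names of gates that would keep head-wise recycle dormant."""
    env = os.environ if env is None else env
    return [
        name
        for name, default in GATE_DEFAULTS.items()
        if not _gate_is_on(name, env.get(name, default))
    ]


missing = inactive_gates()
if missing:
    print("head-wise SWA recycle would stay off; unset or disabled:", missing)
```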

Accuracy Tests

Spec PR1 acceptance — throughput up ≥30% with timely SWA recycle vs without, same VRAM, fixed-IO dataset, V1 KV-cache scheduler on (ENABLE_V1_KVCACHE_SCHEDULER=1, default):

| Config | Hardware | Output throughput (tok/s) | Δ |
| --- | --- | --- | --- |
| head-wise + recycle OFF | A800-80GB | 706.29 | baseline |
| head-wise + recycle ON | A800-80GB | 1107.98 | +56.9% (≥30% ✓) |

Additional: p99 TTFT 570 s → 285 s (−50%). Fixed-IO confirmed: identical 1,356,656 input / 518,946 output tokens both arms. 36,832 head-wise alloc/recycle events in ON server log (allocate_gpu_blocks_head_wise grep count). Oracle: [T53/bench] head-wise oracle PASS: issued=1 recycled=1.
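The reported deltas follow directly from the figures above. This short check recomputes them; the function name `pct_change` is illustrative, and the 30% threshold is the acceptance gate quoted from the spec.

```python
def pct_change(baseline, candidate):
    """Relative change of `candidate` over `baseline`, in percent."""
    return (candidate - baseline) / baseline * 100.0


throughput_gain = pct_change(706.29, 1107.98)  # recycle OFF -> ON, tok/s
ttft_change = pct_change(570.0, 285.0)         # p99 TTFT in seconds

print(f"throughput: {throughput_gain:+.1f}%")
print(f"p99 TTFT:   {ttft_change:+.1f}%")
print("meets >=30% gate:", throughput_gain >= 30.0)
```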

(Quick validation run: num_prompts=128. Full 1,024-prompt run in progress; table will be updated once complete.)

Benchmark: FastDeploy/benchmarks/benchmark_serving.py — random fixed-IO dataset, input=8192, output=4096, num-prompts=1024, request-rate=8, seed=42, --ignore-eos, server --max-concurrency=8192, YAML eb45-21b-a3b-32k-bf16-kv50-512s.yaml. The harness rejects partial JSONs (completed != 1024 or non-empty errors).
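The partial-result rejection rule stated above can be sketched as follows. The field names `completed` and `errors` come from the sentence above; the exact shape of the result JSON is an assumption (here `errors` may be a list of per-request strings in which empty strings mean success), and `is_complete_run` is a hypothetical helper, not the harness's actual function.

```python
import json


def is_complete_run(result, expected_prompts=1024):
    """Accept a benchmark result only if every prompt completed cleanly."""
    errors = result.get("errors") or []
    if isinstance(errors, str):
        errors = [errors]
    has_errors = any(errors)  # ignores empty-string placeholders
    return result.get("completed") == expected_prompts and not has_errors


sample = json.loads('{"completed": 1024, "errors": ["", ""]}')
print(is_complete_run(sample))
```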

Hardware note for reviewers: spec does not pin PR1 hardware. Numbers above are A800-80GB (SM80) via Baidu AI Studio. If H/B card access is granted (cc @luotao1), we will append H/B numbers as supplementary evidence. PR2 (5% TTFT/TBT) does require H/B per spec; tracked separately.

Correctness:

CPU pytest coverage under tests/cache_manager/test_head_wise_*.py, tests/cache_manager/test_swa_recycle*.py, and tests/layers/test_append_attention_head_wise_shapes.py — real _FakeCacheManager + object.__new__(ResourceManagerV1) + AST/shape oracles. No MagicMock-only tests.
A800 smoke (bsz=4, seq=1024) + long-context recycle smoke — oracle PASS (issued=1 recycled=1), 36,832 head-wise alloc/recycle events in full bench server log.
GSM8K parity (head-wise vs non-head-wise abs diff ≤ 0.5 pp) — 1024-prompt bench in progress; parity check scheduled after bench completion.

CI run:

Checklist

pre-commit run --all-files clean (black reformatted 2 files; amend committed)
All CI checks green (Coverage / base_tests / codestyle / iluvatar / xpu)
Reviewer-requested changes addressed
No prohibited claims in PR body (verified by pre-push grep): "first in framework", "novel research", "unique to FastDeploy"
Authorship statement accurate (no unattributed lifted code)
Hardware label on every benchmark number matches the actual card used

cloudforge1 · 2026-05-22T09:31:54Z

Closing — superseded by #7717/#7718 (v4, active review). All T53 work consolidated there.

cloudforge1 · 2026-05-22T09:32:37Z

This PR is superseded by #7717 (PR1) and #7718 (PR2) which are the latest v4 versions under active review. Please ignore this one.

PaddlePaddle-bot · 2026-05-22T10:08:04Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-22 18:07:11

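CI report generated from the following code (refreshed every 30 minutes):

PR commit: c1c2948
Merge base: d70f33d (branch: develop)
View full diff
CI details

1 Task overview

No required tasks are currently failing, running, or waiting; however, 2 workflows are in the action_required state and need manual approval before the related checks are triggered. Complete the approval first, then wait for the full CI results to refresh.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
| --- | --- | --- | --- | --- | --- | --- |
| 1(0) | 1 | 1 | 0 | 0 | 0 | 0 |

⚠️ Note: the following 2 workflows are in the action_required state (they run only after approval): Approval, Check PR Template. These workflows must be triggered by manual approval.

Note: action_required workflows are not counted in the task statistics above.

2 Task status summary

About the Log column: failed tasks use the pre-generated log_links_markdown field directly; for running tasks the [Job]({html_url}) link is assembled manually.

2.1 Required tasks: 0/0 passed

Required tasks block merging; failures must be handled first.

| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
| --- | --- | --- | --- | --- | --- | --- |
| ✅ | No required tasks at present | - | No failures | - | - | - |

2.2 Optional tasks: 1/1 passed

Optional tasks do not block merging; failures are for reference only.

| Status | Task | Duration | Log | Rerun |
| --- | --- | --- | --- | --- |
| ✅ | The remaining 1 optional task passed | - | - | - |

3 Failure details (required only)

No required tasks failed, so no in-depth log analysis is needed.

4 Notes on this check run

CI status source: fetch_ci_logs(quick=true).
PR change overview: 19 files, 1615 lines added, 31 lines deleted; mainly the head-wise SWA KV cache implementation and its tests.
Key context files read as required: fastdeploy/cache_manager/prefix_cache_manager.py, fastdeploy/engine/sched/resource_manager_v1.py, tests/cache_manager/test_swa_recycle.py.
No required job failed in this round, so ci_failure_analyzer was not invoked and no new job-analysis cache entries were written back.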

PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-05-22 18:19:45

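📋 Review summary

PR overview: adds a per-head SWA KV block recycle mechanism to ResourceManagerV1 (Hackathon 10th Spring No.53 PR1)
Scope of changes: cache_manager/, engine/sched/resource_manager_v1.py, config.py, envs.py, worker/
Impact tags: [KVCache] [Scheduler] [FDConfig]

Issues

| Severity | File | Summary |
| --- | --- | --- |
| 🔴 Bug | `fastdeploy/cache_manager/prefix_cache_manager.py:597` | `assert` is used for a runtime allocation-capacity check and is disabled under Python `-O` |
| 🟡 Suggestion | `fastdeploy/worker/worker_process.py:484` | Removing `_tp_barrier_wait()` may leave multi-GPU TP processes out of sync |
| 🟡 Suggestion | `fastdeploy/worker/gpu_model_runner.py:2017` | The CUDAGraph `_dummy_run` logic change needs confirmation on whether it must be mirrored in the other hardware runners |

📝 PR convention check

The PR description is structurally complete (Motivation / Modifications / Usage or Command / Accuracy Tests / Checklist are all present). The title tag [CI] does not match the actual change: the PR is mainly a feature enhancement to the KVCache manager and scheduler, whereas the [CI] tag is reserved for CI configuration changes such as .github/workflows/ and scripts/.

Suggested title (can be copied directly):

[Feature][KVCache] Add per-head SWA block recycle to ResourceManagerV1 (Hackathon 10th Spring No.53 PR1)

Overall assessment

The PR implements a per-head SWA KV cache block recycle mechanism with a clear logical structure. It is off by default, which preserves backward compatibility, and the tests cover the main scenarios. The runtime assert check must be fixed (P0), the impact of removing the TP barrier must be confirmed, and the title tag should be changed from [CI] to [Feature][KVCache].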

PaddlePaddle-bot · 2026-05-22T10:25:23Z

+        kv_num_heads = max(1, self.kv_num_heads)
+        needed = num_blocks * kv_num_heads
+        free_list = self.gpu_free_head_wise_block_list
+        assert needed <= len(free_list), f"head-wise gpu free block num: {len(free_list)} < needed number {needed}"


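🔴 Bug: `assert` is used for a runtime allocation-capacity check. Under Python -O the assertion is silently skipped, and heappop then raises a context-free IndexError on an empty heap (checklist §C surface signal).

Suggested fix:

    if needed > len(free_list):
        raise ValueError(
            f"head-wise gpu free block num: {len(free_list)} < needed number {needed}"
        )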
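A minimal, self-contained sketch of why the explicit check matters: the built-in `compile()` accepts an `optimize` argument, and `optimize=1` strips `assert` statements the same way the `-O` flag does. The helper `pop_blocks` is a hypothetical stand-in for the allocation path, not the code in `prefix_cache_manager.py`.

```python
import heapq


def pop_blocks(free_list, needed):
    """Pop `needed` block ids, raising a descriptive error when short."""
    if needed > len(free_list):
        raise ValueError(
            f"head-wise gpu free block num: {len(free_list)} < needed number {needed}"
        )
    return [heapq.heappop(free_list) for _ in range(needed)]


# The same assert, compiled with and without optimization.
source = "assert False, 'capacity check'"
kept = compile(source, "<demo>", "exec", optimize=0)      # like plain `python`
stripped = compile(source, "<demo>", "exec", optimize=1)  # like `python -O`

exec(stripped)  # no exception: the assert no longer exists in the bytecode
try:
    exec(kept)
except AssertionError as exc:
    print("assert fired only without -O:", exc)

try:
    pop_blocks([0, 1], needed=3)
except ValueError as exc:
    print("explicit check always fires:", exc)
```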

PaddlePaddle-bot · 2026-05-22T10:25:23Z

            req_dicts = None
            self.worker_healthy_live_signal.value[tp_rank % self.max_chips_per_node] = int(time.time())

-            self._tp_barrier_wait() if tp_size > 1 else None


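🟡 Suggestion: this change removes `self._tp_barrier_wait() if tp_size > 1 else None`.

That call is the synchronization barrier for event_loop_normal in multi-GPU TP setups; it keeps all TP ranks aligned before the task-queue check. Without it, ranks may fall out of step on tasks whenever processes with tp_rank > 0 run at a different speed from rank 0.

Please explain why the call was removed (for example, because another synchronization mechanism replaces it), or restore it.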

PaddlePaddle-bot · 2026-05-22T10:25:23Z

@@ -2016,10 +2017,8 @@ def _dummy_run(
        while True:


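🟡 Suggestion: the CUDAGraph logic in _dummy_run changes from `if not (in_capturing or step_use_cudagraph): set False` to `in_capturing and forward_meta.step_use_cudagraph`, which is not semantically equivalent. The RUN_DUMMY_FOR_PROFILE handling and the `_dummy_run(step_use_cudagraph=True)` call in worker_process.py are removed at the same time.

Please confirm whether this change needs to be mirrored in the other hardware ModelRunners (xpu_model_runner.py, iluvatar_model_runner.py, etc.) to avoid CI failures on other hardware (checklist §A6, multi-hardware sync).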

CLAassistant · 2026-05-29T13:24:05Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

[CI]【Hackathon 10th Spring No.53】headwise SWA unit test [cfx]

c1c2948

jonny-cloudforge temporarily deployed to Metax_ci May 2, 2026 12:13 — with GitHub Actions Inactive

boby-cloudforge deleted the task/h10-053-pr1-headwise-swa-cfx3 branch May 3, 2026 13:38

luotao1 mentioned this pull request May 15, 2026

【Hackathon 10th】开源贡献个人挑战赛 · 春节特别季 PaddlePaddle/Paddle#77429

Open

PaddlePaddle-bot suggested changes May 22, 2026


[CI]【Hackathon 10th Spring No.53】headwise SWA unit test [cfx] #7693
jonny-cloudforge wants to merge 1 commit into PaddlePaddle:develop from CloudForge-Solutions:task/h10-053-pr1-headwise-swa-cfx3
