[CI]【Hackathon 10th Spring No.53】headwise SWA unit test [cfx]#7693

Open
jonny-cloudforge wants to merge 1 commit into PaddlePaddle:develop from CloudForge-Solutions:task/h10-053-pr1-headwise-swa-cfx3

@jonny-cloudforge

PR1 Body — [Feature][KVCache] Add per-head SWA block recycle to ResourceManagerV1 (Hackathon 10th Spring No.53)

Push-ready version (all commits staged, bench gate passed). Saved here, not in /tmp.
5 required sections per FastDeploy CI gate. Word budget ≤600.


Motivation

Hackathon 10th Spring Task No.53: Performance optimization of discrete KV Cache management and the AppendAttention operator (PR1 of 2). Spec: https://github.com/PaddlePaddle/community/blob/master/hackathon/hackathon_10th/【Hackathon_10th】开源贡献个人挑战赛春节特别季—任务合集.md#no53.

For models that mix Sliding-Window Attention (SWA) heads with full-attention heads inside the same layer, today's V1 KV-cache scheduling path (ResourceManagerV1 + PrefixCacheManager, gated by the default-on ENABLE_V1_KVCACHE_SCHEDULER=1) allocates one shared block_idx per layer for all heads. SWA heads finish their window long before full-attn heads, but their cache stays pinned until the whole layer evicts. Throughput suffers.

This PR teaches the V1 scheduler + PrefixCacheManager to manage block_idx per head (head-wise SWA layout) and recycle a SWA head's cache as soon as it crosses its window — the per-head equivalent of what PR #6702 did for V0.

Authorship: this PR is independently designed and implemented by the submitter for Hackathon 10th Spring No.53. The earlier community PR #6702 (V0, not merged) is referenced as prior art only; no code is lifted unattributed. Any future contributor work will be acknowledged via per-commit Co-authored-by trailers.

RFC: PaddlePaddle/community#1362.

Modifications

| Area | Change |
| --- | --- |
| fastdeploy/cache_manager/prefix_cache_manager.py | Per-request head-wise GPU free list (gpu_free_block_list_head_wise[head]); allocate_gpu_blocks_head_wise / recycle_gpu_blocks_head_wise; TP-aware sizing (num_key_value_heads // tp_size) |
| fastdeploy/engine/sched/resource_manager_v1.py | recycle_request_swa_head_cache (per-head cursor advance ≥ window+sink); _should_skip_swa_recycle_for_overlap (per-request cache_swap_metadata / cache_evict_metadata inspection); P4 cleanup in _free_blocks |
| fastdeploy/model_executor/models/paddleformers/base.py | Default-off ERNIE SWA fixture (window/sink/skip-freq/ratio) gated by FD_T53_HEAD_WISE_SWA_FIXTURE=1 |
| fastdeploy/config.py | +20 lines. Engine-main FDConfig fixture: mirrors the paddleformers/base.py head-wise SWA attribute injection so that ResourceManagerV1._should_use_head_wise_swa (engine-main) sees the same model_config.head_wise_swa_ratio as the worker. Gated on FD_T53_HEAD_WISE_SWA_FIXTURE. |
| Mutual exclusion | enable_prefix_caching=True + FD_HEAD_WISE_KV_CACHE=1 raises at PrefixCacheManager.__init__ |
| Env gates | FD_HEAD_WISE_KV_CACHE=0 by default; bit-identical when disabled |
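
For reviewers who want a mental model of the "per-head cursor advance ≥ window+sink" rule and the TP-aware head count, here is a minimal sketch. It is not the code in this PR: HeadBlocks, recycle_swa_head, kv_heads_per_rank and the max(1, ...) floor are hypothetical names and assumptions made for illustration only.

```python
import heapq
from dataclasses import dataclass, field


def kv_heads_per_rank(num_key_value_heads: int, tp_size: int) -> int:
    """TP-aware sizing: each tensor-parallel rank owns a slice of the KV heads.

    The floor of 1 is an assumption for the case tp_size > num_key_value_heads.
    """
    return max(1, num_key_value_heads // tp_size)


@dataclass
class HeadBlocks:
    """Blocks held by one KV head of one request (hypothetical structure)."""
    is_swa: bool
    blocks: list = field(default_factory=list)  # block ids, oldest first
    cursor: int = 0  # blocks before this index were already examined


def recycle_swa_head(head, free_heap, num_tokens, block_size, window, sink):
    """Push blocks that fell out of a SWA head's window back onto the free heap.

    Blocks covering the first `sink` tokens and the last `window` tokens stay.
    Full-attention heads are never recycled here. Returns the number freed.
    """
    if not head.is_swa or num_tokens < window + sink:
        return 0
    sink_blocks = -(-sink // block_size)  # ceil division
    first_live = (num_tokens - window) // block_size  # first block still in window
    start = max(head.cursor, sink_blocks)
    freed = 0
    for idx in range(start, first_live):
        heapq.heappush(free_heap, head.blocks[idx])
        freed += 1
    head.cursor = max(head.cursor, first_live)
    return freed
```

With block_size=64, window=256, sink=64 and 1024 tokens, a SWA head holding 16 blocks keeps block 0 (sink) and blocks 12 to 15 (window); the 11 blocks in between become reusable, while a full-attention head keeps all 16.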

Tests use real lightweight objects + object.__new__/AST or shape oracles (no MagicMock-only). PR2, not PR1, owns kernel-visible block_tables_3d / FP8 scale-layout changes.
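
The object.__new__ pattern mentioned above can be sketched as follows. This is only an illustration of the technique: ResourceManagerStub, should_recycle and make_manager are hypothetical stand-ins, not the classes or helpers used in the PR's tests.

```python
class ResourceManagerStub:
    """Stand-in for a class whose __init__ is too heavy for a unit test (hypothetical)."""

    def __init__(self, config):
        raise RuntimeError("needs a full engine config; too heavy for a unit test")

    def should_recycle(self, num_tokens: int) -> bool:
        # Logic under test: only reads the attributes set by make_manager below.
        return num_tokens >= self.window + self.sink


def make_manager(window: int, sink: int) -> ResourceManagerStub:
    # object.__new__ allocates the instance without running __init__,
    # so the test sets only the attributes the method under test reads.
    mgr = object.__new__(ResourceManagerStub)
    mgr.window = window
    mgr.sink = sink
    return mgr
```

The real method runs against a real instance, so the test exercises production logic rather than a mock's canned return value.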

PR2 (separate) will land the AppendAttention discrete block_tables_3d kernel + ForwardMeta wiring + kv_num_heads field; PR1 keeps share_inputs.block_tables 2D and reaches the +30% recycle gate via cache-manager-side changes only.

Usage or Command

# Enable head-wise V1 cache + timely SWA recycle.
# All four env vars must be set together — partial activation is silently a no-op.
# Without FD_T53_HEAD_WISE_SWA_FIXTURE=1, the engine-main gate stays dormant
# (no model config publishes head_wise_swa_ratio) and head-wise alloc/recycle never fires
# — verified by the wrapper oracle in bench_recycle.sh.
export FD_T53_HEAD_WISE_SWA_FIXTURE=1     # engine-main FDConfig fixture (config.py)
export ENABLE_V1_KVCACHE_SCHEDULER=1      # default; shown for clarity
export FD_HEAD_WISE_KV_CACHE=1            # enables per-head block tables
export FD_T53_HEAD_WISE_SWA_RATIO=1.0     # SWA recycle ratio (>0 = recycle active)
python -m fastdeploy.entrypoints.openai.api_server \
    --model baidu/ERNIE-4.5-21B-A3B-Paddle \
    --max-model-len 32768
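
Because partial activation is silently a no-op, a pre-flight check can catch a missing gate before the server starts. The sketch below is a hypothetical helper, not part of FastDeploy; the variable names come from the commands above, the default for ENABLE_V1_KVCACHE_SCHEDULER follows this description, and the "unset means off" defaults for the other three are assumptions.

```python
import os

# name -> (assumed default when unset, predicate that means "gate is active")
GATES = {
    "FD_T53_HEAD_WISE_SWA_FIXTURE": ("0", lambda v: v == "1"),
    "ENABLE_V1_KVCACHE_SCHEDULER": ("1", lambda v: v == "1"),
    "FD_HEAD_WISE_KV_CACHE": ("0", lambda v: v == "1"),
    "FD_T53_HEAD_WISE_SWA_RATIO": ("0", lambda v: float(v) > 0),
}


def inactive_gates(env=None):
    """Return the names of gates that would leave head-wise SWA recycle dormant."""
    env = os.environ if env is None else env
    inactive = []
    for name, (default, is_active) in GATES.items():
        value = env.get(name, default)
        try:
            ok = is_active(value)
        except ValueError:  # e.g. a non-numeric ratio
            ok = False
        if not ok:
            inactive.append(name)
    return inactive


if __name__ == "__main__":
    missing = inactive_gates()
    if missing:
        print("head-wise SWA recycle will NOT fire; inactive gates:", ", ".join(missing))
```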

Accuracy Tests

Spec PR1 acceptance: throughput up ≥30% with timely SWA recycle vs. without, at the same VRAM, on a fixed-IO dataset, with the V1 KV-cache scheduler on (ENABLE_V1_KVCACHE_SCHEDULER=1, default):

| Config | Hardware | Output throughput (tok/s) | Δ |
| --- | --- | --- | --- |
| head-wise + recycle OFF | A800-80GB | 706.29 | baseline |
| head-wise + recycle ON | A800-80GB | 1107.98 | +56.9% (≥30% ✓) |

Additional: p99 TTFT 570 s → 285 s (−50%). Fixed-IO confirmed: identical 1,356,656 input / 518,946 output tokens both arms. 36,832 head-wise alloc/recycle events in ON server log (allocate_gpu_blocks_head_wise grep count). Oracle: [T53/bench] head-wise oracle PASS: issued=1 recycled=1.
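
The deltas quoted above can be re-derived from the raw numbers. This is just the arithmetic, using the throughput and p99 TTFT values reported in this section; pct_change is a helper name introduced here.

```python
def pct_change(baseline: float, value: float) -> float:
    """Relative change of `value` against `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


throughput_delta = pct_change(706.29, 1107.98)  # tok/s, recycle OFF -> ON
ttft_delta = pct_change(570.0, 285.0)           # p99 TTFT in seconds

print(f"throughput: {throughput_delta:+.1f}%")  # throughput: +56.9%
print(f"p99 TTFT:   {ttft_delta:+.1f}%")        # p99 TTFT:   -50.0%
print("gate (>= 30%):", throughput_delta >= 30.0)  # gate (>= 30%): True
```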

(Quick validation run: num_prompts=128. Full 1,024-prompt run in progress; table will be updated once complete.)

Benchmark: FastDeploy/benchmarks/benchmark_serving.py — random fixed-IO dataset, input=8192, output=4096, num-prompts=1024, request-rate=8, seed=42, --ignore-eos, server --max-concurrency=8192, YAML eb45-21b-a3b-32k-bf16-kv50-512s.yaml. The harness rejects partial JSONs (completed != 1024 or non-empty errors).
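
The "reject partial JSONs" rule can be sketched as below. This is not the harness code: validate_result is a hypothetical function, the completed and errors field names are taken from the sentence above, and treating errors as a list whose empty-string entries mean "no error" is an assumption.

```python
import json


def validate_result(raw_json: str, expected_completed: int = 1024) -> list:
    """Return a list of problems; an empty list means the result is accepted."""
    result = json.loads(raw_json)
    problems = []
    completed = result.get("completed")
    if completed != expected_completed:
        problems.append(f"completed={completed}, expected {expected_completed}")
    # Assumption: `errors` holds one entry per request; "" means no error.
    if any(result.get("errors") or []):
        problems.append("non-empty errors")
    return problems
```

A run that finished only 128 of 1,024 prompts would be rejected by the first check even if every completed request succeeded.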

Hardware note for reviewers: spec does not pin PR1 hardware. Numbers above are A800-80GB (SM80) via Baidu AI Studio. If H/B card access is granted (cc @luotao1), we will append H/B numbers as supplementary evidence. PR2 (5% TTFT/TBT) does require H/B per spec; tracked separately.

Correctness:

  • CPU pytest coverage under tests/cache_manager/test_head_wise_*.py, tests/cache_manager/test_swa_recycle*.py, and tests/layers/test_append_attention_head_wise_shapes.py — real _FakeCacheManager + object.__new__(ResourceManagerV1) + AST/shape oracles. No MagicMock-only tests.
  • A800 smoke (bsz=4, seq=1024) + long-context recycle smoke — oracle PASS (issued=1 recycled=1), 36,832 head-wise alloc/recycle events in full bench server log.
  • GSM8K parity (head-wise vs non-head-wise abs diff ≤ 0.5 pp) — 1024-prompt bench in progress; parity check scheduled after bench completion.

CI run:

Checklist

  • pre-commit run --all-files clean (black reformatted 2 files; amend committed)
  • All CI checks green (Coverage / base_tests / codestyle / iluvatar / xpu)
  • Reviewer-requested changes addressed
  • No prohibited claims in PR body (verified by pre-push grep): "first in framework", "novel research", "unique to FastDeploy"
  • Authorship statement accurate (no unattributed lifted code)
  • Hardware label on every benchmark number matches the actual card used

@cloudforge1
Contributor

Closing — superseded by #7717/#7718 (v4, active review). All T53 work consolidated there.

@cloudforge1
Contributor

This PR is superseded by #7717 (PR1) and #7718 (PR2) which are the latest v4 versions under active review. Please ignore this one.

@PaddlePaddle-bot

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-22 18:07:11

This CI report is generated from the following code (refreshed every 30 minutes):


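1 Job overview

No required jobs are currently failing, running, or queued. However, 2 workflows are in the action_required state and need manual approval before the related checks are triggered. Approve them first, then wait for the full CI results to refresh.

| Total runs (reruns) | Total jobs | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Queued | Skipped |
| --- | --- | --- | --- | --- | --- | --- |
| 1 (0) | 1 | 1 | 0 | 0 | 0 | 0 |

⚠️ Note: the following 2 workflows are in the action_required state (they run only after approval): Approval, Check PR Template. These workflows must be triggered by manual approval.

Note: action_required workflows are not counted in the table above.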

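2 Job status summary

Log column note: failed jobs use the pre-generated log_links_markdown field directly; for running jobs the link is assembled manually as [Job]({html_url}).

2.1 Required jobs: 0/0 passed

Required jobs block merging; failures should be handled first.

| Status | Job | Duration | Root cause | Suggested fix | Log | Rerun |
| --- | --- | --- | --- | --- | --- | --- |
|  | No required jobs at present | - | No failures | - | - | - |

2.2 Optional jobs: 1/1 passed

Optional jobs do not block merging; failures are for reference only.

| Status | Job | Duration | Log | Rerun |
| --- | --- | --- | --- | --- |
|  | The remaining 1 optional job passed | - | - | - |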

3 Failure details (required only)

No required jobs failed, so no in-depth log analysis is needed.


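4 Notes on this check round

  • CI status source: fetch_ci_logs(quick=true)
  • PR change overview: 19 files, 1615 lines added, 31 lines deleted; mainly head-wise SWA KV cache implementation and tests.
  • Key context files read as required: fastdeploy/cache_manager/prefix_cache_manager.py, fastdeploy/engine/sched/resource_manager_v1.py, tests/cache_manager/test_swa_recycle.py
  • No required jobs failed this round, so ci_failure_analyzer was not invoked and no new job-analysis cache was written back.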

@PaddlePaddle-bot PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-05-22 18:19:45

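📋 Review summary

PR overview: adds a per-head SWA KV block recycle mechanism to ResourceManagerV1 (Hackathon 10th Spring No.53 PR1)
Scope of changes: cache_manager/, engine/sched/resource_manager_v1.py, config.py, envs.py, worker/
Impact tags: [KVCache] [Scheduler] [FDConfig]

Issues

| Level | File | Summary |
| --- | --- | --- |
| 🔴 Bug | fastdeploy/cache_manager/prefix_cache_manager.py:597 | assert is used for a runtime allocation-capacity check and is disabled under Python -O |
| 🟡 Suggestion | fastdeploy/worker/worker_process.py:484 | Removing _tp_barrier_wait() may leave multi-GPU TP processes out of sync |
| 🟡 Suggestion | fastdeploy/worker/gpu_model_runner.py:2017 | The CUDAGraph _dummy_run logic change may need to be mirrored in the other hardware runners; please confirm |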

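📝 PR convention check

The PR description is structurally complete (Motivation / Modifications / Usage or Command / Accuracy Tests / Checklist are all present). The title tag [CI] does not match the actual change: the PR is mainly a feature enhancement to the KVCache manager and scheduler, whereas the [CI] tag is reserved for CI configuration changes such as .github/workflows/scripts/.

Suggested title (can be copied as-is):

  • [Feature][KVCache] Add per-head SWA block recycle to ResourceManagerV1 (Hackathon 10th Spring No.53 PR1)

Overall assessment

The PR implements a per-head SWA KV cache block recycle mechanism with a clear logical structure. It is off by default, which preserves backward compatibility, and the tests cover the main scenarios. Required follow-ups: fix the runtime assert check (P0), confirm the impact of removing the TP barrier, and change the title tag from [CI] to [Feature][KVCache].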

kv_num_heads = max(1, self.kv_num_heads)
needed = num_blocks * kv_num_heads
free_list = self.gpu_free_head_wise_block_list
assert needed <= len(free_list), f"head-wise gpu free block num: {len(free_list)} < needed number {needed}"

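🔴 Bug: assert is used for a runtime allocation-capacity check. Under Python -O the assertion is silently skipped, and heappop will then raise a context-free IndexError on an empty heap (checklist §C surface signal).

Suggested fix: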

if needed > len(free_list):
    raise ValueError(
        f"head-wise gpu free block num: {len(free_list)} < needed number {needed}"
    )
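
To show the suggested fix in a self-contained form, here is a minimal sketch of a head-wise allocation that checks capacity with an explicit raise before popping from the heap. allocate_head_wise is a hypothetical free function written for illustration, not the PrefixCacheManager method; it only assumes, as the review comment implies, that the free list is a heapq-managed list of block ids.

```python
import heapq


def allocate_head_wise(free_heap: list, num_blocks: int, kv_num_heads: int) -> list:
    """Pop num_blocks * kv_num_heads block ids from the free heap.

    The capacity check is an explicit raise, so it still runs under
    `python -O`, where assert statements are compiled out.
    """
    kv_num_heads = max(1, kv_num_heads)
    needed = num_blocks * kv_num_heads
    if needed > len(free_heap):
        raise ValueError(
            f"head-wise gpu free block num: {len(free_heap)} < needed number {needed}"
        )
    return [heapq.heappop(free_heap) for _ in range(needed)]
```

Because the check happens before any heappop, a failed request leaves the free heap untouched instead of partially drained.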

req_dicts = None
self.worker_healthy_live_signal.value[tp_rank % self.max_chips_per_node] = int(time.time())

self._tp_barrier_wait() if tp_size > 1 else None

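🟡 Suggestion: self._tp_barrier_wait() if tp_size > 1 else None was removed here.

This call is the synchronization barrier for event_loop_normal in multi-GPU TP runs; it keeps all TP ranks aligned before the task-queue check. Without it, ranks can fall out of sync on tasks whenever processes with tp_rank > 0 run at a different speed from rank 0.

Please explain why it was removed (for example, because another synchronization mechanism replaces it), or restore the call.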
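
As a rough analogy for what such a barrier guarantees, the sketch below uses threads and threading.Barrier in place of TP worker processes. It is not FastDeploy code; run_ranks and rank_loop are hypothetical names, and the real workers are separate processes with their own barrier implementation.

```python
import threading


def run_ranks(tp_size: int, steps: int) -> list:
    """Run `tp_size` simulated ranks for `steps` iterations; return the event log."""
    barrier = threading.Barrier(tp_size)
    lock = threading.Lock()
    log = []  # (step, rank) in the order ranks reached the task-queue check

    def rank_loop(rank: int) -> None:
        for step in range(steps):
            if tp_size > 1:
                # No rank proceeds to this step until every rank has arrived.
                barrier.wait()
            with lock:
                log.append((step, rank))  # stand-in for the task-queue check

    threads = [threading.Thread(target=rank_loop, args=(r,)) for r in range(tp_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

With the barrier in place, every rank finishes step k before any rank starts step k+1, so the log is ordered by step. Without the barrier.wait() call, a fast rank could run several steps ahead of a slow one.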

@@ -2016,10 +2017,8 @@ def _dummy_run(
while True:

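🟡 Suggestion: the CUDAGraph logic in _dummy_run changed from if not (in_capturing or step_use_cudagraph): set False to in_capturing and forward_meta.step_use_cudagraph, which is semantically different. The RUN_DUMMY_FOR_PROFILE _dummy_run(step_use_cudagraph=True) call in worker_process.py was removed as well.

Please confirm whether this change needs to be mirrored in the other hardware ModelRunners (xpu_model_runner.py, iluvatar_model_runner.py, etc.) to avoid CI failures on other hardware (checklist §A6, multi-hardware sync).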

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.
