
[KVCache][BugFix] Buffer early layer0 cache signals#7872

Open
kevincheng2 wants to merge 1 commit into PaddlePaddle:develop from kevincheng2:fix/cache-messager-layer0-signal-20260521111440

Conversation

kevincheng2 (Collaborator) commented May 21, 2026

Motivation

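Fixes an issue in PD-disaggregated (prefill/decode separation) deployments where the layer0 prefill signal can arrive before the decode cache info. Previously, consume_signals would skip the signal when cache_info was not ready, or deliver it directly before the cache task had been registered. Either path could cause the layerwise cache send entry point to be lost or an assertion to fail.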

Modifications

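  • Add a pending layer0 signal buffer to CacheMessagerV1.
  • When a cache task is registered, recover only the pending signal that matches the current current_id, so that signals belonging to other engines are not recovered by mistake.
  • Keep only the latest pending signal per engine, which avoids unbounded growth of the pending buffer and full scans.
  • Fix a type hazard where tasks_count was used without calling .item(), so a Paddle Tensor was passed directly to range and comparisons.
  • Clean up the corresponding pending signal once a task completes.
  • Add unit tests covering early layer0 buffering, pending recovery, dropping of invalid pending signals, no accidental recovery of other engines' pending signals, and cleanup on task completion.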
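The buffering behaviour listed above can be illustrated with a minimal sketch. This is not the CacheMessagerV1 code: the class name `PendingLayer0Buffer`, its method names, and the use of plain integers for signals are all hypothetical, chosen only to show the "keep the latest value per engine, recover by current_id, clear on completion" idea.

```python
import threading
from typing import Dict, Optional


class PendingLayer0Buffer:
    """Hypothetical stand-in for a pending layer0 signal buffer.

    Stores at most one (the latest) early signal per current_id.
    """

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._pending: Dict[int, int] = {}  # current_id -> latest layer0 signal

    def buffer(self, current_id: int, signal: int) -> None:
        # Overwrite instead of appending, so the buffer is bounded by the
        # number of distinct ids rather than the number of signals seen.
        with self._lock:
            self._pending[current_id] = signal

    def recover(self, current_id: int) -> Optional[int]:
        # Called when a cache task is registered: pop only this id's entry,
        # leaving entries that belong to other engines untouched.
        with self._lock:
            return self._pending.pop(current_id, None)

    def clear(self, current_id: int) -> None:
        # Called when a task completes: drop any stale entry for this id.
        with self._lock:
            self._pending.pop(current_id, None)

    def __len__(self) -> int:
        with self._lock:
            return len(self._pending)


buf = PendingLayer0Buffer()
buf.buffer(current_id=0, signal=3)
buf.buffer(current_id=0, signal=5)  # replaces the earlier value for id 0
buf.buffer(current_id=1, signal=7)  # belongs to a different engine

recovered = buf.recover(current_id=0)  # only id 0 is recovered
remaining = len(buf)                   # id 1 is still pending
```

Keying the buffer by id makes recovery a single dictionary lookup, which is why no full scan is needed when a task is registered.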

Usage or Command

# Static / syntax checks
python -m py_compile fastdeploy/cache_manager/cache_messager.py tests/cache_manager/test_cache_messager.py
git diff --check

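# Related unit tests (in the local environment the pytest process was occasionally suspended by shell job control, so the relevant cases were verified individually)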
python -m pytest tests/cache_manager/test_cache_messager.py::test_cache_messager_v1_recovers_pending_layer0_signal -q
python -m pytest tests/cache_manager/test_cache_messager.py::test_cache_messager_v1_drops_invalid_pending_layer0_signal -q
python -m pytest tests/cache_manager/test_cache_messager.py::test_cache_messager_v1_consume_signals -q
python -m pytest tests/cache_manager/test_cache_messager.py::test_cache_messager_v1_consume_signals_buffers_early_layer0 -q
python -m pytest tests/cache_manager/test_cache_messager.py::test_cache_messager_v1_prefill_layerwise_send_cache_thread -q

Accuracy Tests

No changes to model forward passes, operators, or output accuracy are involved.

Checklist

  • Add at least a tag in the PR title.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.


paddle-bot Bot commented May 21, 2026

Thanks for your contribution!

PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-05-21 11:20:41

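📋 Review Summary

PR overview: Fixes a race in the PD-disaggregated V1 KVCache path where the layer0 prefill signal can arrive before the decode cache info and be lost. A pending buffer is introduced so that such signals are not dropped.

Change scope: fastdeploy/cache_manager/cache_messager.py, tests/cache_manager/test_cache_messager.py

Impact tags: [KVCache] [PD Disaggregation]

Issues

| Level | File | Summary |
| --- | --- | --- |
| 📝 PR convention | - | Title contains two official tags, [KVCache][BugFix]; §D1 requires exactly one |

📝 PR Convention Check

The title "[KVCache][BugFix] Buffer early layer0 cache signals" contains two official tags, [KVCache] and [BugFix], which violates checklist rule §D1: "the title must contain exactly one official tag". The PR description is structurally complete (Motivation / Modifications / Usage or Command / Accuracy Tests / Checklist are all present) and needs no changes.

Suggested title (can be copied directly):

  • [BugFix] Fix early layer0 signal loss in KVCache PD disaggregation

Overall Assessment

The implementation is correct and the logic is clear. Locks are always acquired in the same order, engine_cache_task_thread_lock first and pending_layer0_signal_lock second, so there is no deadlock risk. Both the del on idx_cache_task_dict and the pop on pending_layer0_signals complete atomically under engine_cache_task_thread_lock, so there is no race between them. tasks_count.item() fixes the type hazard of passing a Paddle Tensor directly to range. The unit tests cover five key scenarios (early buffering, pending recovery, dropping invalid signals, cross-engine isolation, and cleanup on task completion) and are of good quality. The only issue is the double-tag title convention, which does not block merging.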
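The lock-ordering argument in the review can be sketched in a few lines. The code below is an illustrative assumption, not the code under review: `TaskRegistry`, its methods, and the two plain dictionaries are hypothetical, with `task_lock` and `pending_lock` standing in for engine_cache_task_thread_lock and pending_layer0_signal_lock. Every path takes the outer lock before the inner one, which is what rules out a lock-order deadlock.

```python
import threading


class TaskRegistry:
    """Hypothetical model of the two-lock scheme described in the review."""

    def __init__(self):
        self.task_lock = threading.Lock()     # outer lock, always taken first
        self.pending_lock = threading.Lock()  # inner lock, always taken second
        self.tasks = {}
        self.pending = {}

    def on_layer0_signal(self, current_id, signal):
        with self.task_lock:
            if current_id in self.tasks:
                return signal  # task already registered: deliver directly
            with self.pending_lock:
                self.pending[current_id] = signal  # too early: buffer it
            return None

    def register_task(self, current_id, task):
        with self.task_lock:
            self.tasks[current_id] = task
            with self.pending_lock:
                # Recover only this id's buffered signal, if any.
                return self.pending.pop(current_id, None)

    def finish_task(self, current_id):
        # Task removal and pending cleanup happen under the same outer lock,
        # so no other thread can observe one without the other.
        with self.task_lock:
            self.tasks.pop(current_id, None)
            with self.pending_lock:
                self.pending.pop(current_id, None)


registry = TaskRegistry()
recovered = {}


def worker(current_id):
    registry.on_layer0_signal(current_id, current_id * 10)  # arrives early
    recovered[current_id] = registry.register_task(current_id, "task")
    registry.finish_task(current_id)


threads = [threading.Thread(target=worker, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join(timeout=5)
```

If any method took the inner lock first and then waited on the outer one, two threads could each hold one lock and block on the other; keeping a single global order avoids that.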


PaddlePaddle-bot commented May 21, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-21 14:54:48

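The CI report is generated from the following code (refreshed every 30 minutes):


1 Task Overview

9/10 required tasks have passed. 1 required task has failed (Approval), and 0 required tasks are pending. The failure is a manual approval gate; Approval must be completed before CI continues.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
| --- | --- | --- | --- | --- | --- | --- |
| 42 (0) | 42 | 37 | 4 | 0 | 0 | 1 |

2 Task Status Summary

Log column note: failed tasks use the log_links_markdown field directly (pre-generated); for running tasks, [Job]({html_url}) is assembled manually.

2.1 Required tasks: 9/10 passed

Required tasks block merging; failures should be handled first.

| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
| --- | --- | --- | --- | --- | --- | --- |
| ❌ | Approval | 7s | Approval required | Complete the manual approval | Job | - |
| ✅ | Remaining 9 required tasks passed | - | - | - | - | - |

2.2 Optional tasks: 28/32 passed

Optional tasks do not block merging; failures are for reference only.

| Status | Task | Duration | Log | Rerun |
| --- | --- | --- | --- | --- |
| ❌ | Run iluvatar Tests / run_iluvatar_cases | 1m3s | Job | - |
| ❌ | Trigger Jenkins for PR | 7m26s | Job | - |
| ❌ | CI_HPU | 1h4m | Job | - |
| ⏭️ | 1 optional task skipped | - | - | - |
| ✅ | Remaining 28 optional tasks passed | - | - | - |

3 Failure Details (required only)

Approval: manual approval required (confidence: high)

Root cause summary

This job requires manual Approval; CI continues only after the approval is completed.

Suggested fix summary

Complete the manual approval.

Key information

  • Job: Approval
  • Conclusion: failure
  • Duration: 7s
  • Error snippet: Process completed with exit code 6.

Code context check: the key context in fastdeploy/cache_manager/cache_messager.py and tests/cache_manager/test_cache_messager.py has been read. The current required blocker is unrelated to unit tests or coverage; it is the Approval gate that has not been passed.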


codecov-commenter commented May 21, 2026

Codecov Report

❌ Patch coverage is 97.50000% with 1 line in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@ab9a7f3). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| fastdeploy/cache_manager/cache_messager.py | 97.50% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7872   +/-   ##
==========================================
  Coverage           ?   63.61%           
==========================================
  Files              ?      462           
  Lines              ?    64513           
  Branches           ?     9894           
==========================================
  Hits               ?    41038           
  Misses             ?    20697           
  Partials           ?     2778           
| Flag | Coverage Δ |
| --- | --- |
| GPU | 72.74% <97.50%> (?) |
| XPU | 7.11% <0.00%> (?) |
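As a sanity check on the figures above, the percentages can be reproduced from the raw counts, assuming coverage is computed as hits divided by hits plus misses plus partials. The patch size of 40 lines is not stated in the report; it is inferred here from "97.50% with 1 line missing" and should be read as an assumption.

```python
def coverage_percent(hits: int, misses: int, partials: int) -> float:
    """Coverage as a percentage, counting partial lines as not covered."""
    total = hits + misses + partials
    return 100.0 * hits / total


# Project totals taken from the coverage diff above.
project = coverage_percent(hits=41038, misses=20697, partials=2778)

# Patch totals: 0 missing and 1 partial are from the report; 39 hits is an
# assumption derived from the reported 97.50% patch coverage.
patch = coverage_percent(hits=39, misses=0, partials=1)

print(f"project: {project:.2f}%")  # project: 63.61%
print(f"patch:   {patch:.2f}%")    # patch:   97.50%
```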

3 participants