
[Bugfix] AS block leaks #7890

Merged
kevincheng2 merged 2 commits into PaddlePaddle:develop from zccjjj:ASBlockLeaks on May 25, 2026

zccjjj (Contributor) commented on May 22, 2026

Motivation

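Fixes an AS block leak on the enable_prefix_caching + hierarchical KV cache path, where the block count passed to can_allocate_gpu_blocks was not first run through _get_can_schedule_prefill_threshold_block to compute the reserved-block threshold.
Scope of change: fastdeploy/engine/sched/resource_manager_v1.py (scheduling resource management)
Impact tags: [Scheduler] [KVCache]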
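To illustrate why the threshold matters, here is a minimal toy sketch. It does not reproduce FastDeploy's code: `ToyCacheManager`, `threshold_with_reserve`, and the additive reserve formula are all assumptions made for illustration. Only the method name `can_allocate_gpu_blocks` is taken from the PR, and its real signature may differ.

```python
from dataclasses import dataclass


@dataclass
class ToyCacheManager:
    """Hypothetical stand-in for the cache manager; tracks only a free-block count."""

    free_gpu_blocks: int

    def can_allocate_gpu_blocks(self, num_blocks: int) -> bool:
        # True when at least `num_blocks` GPU blocks are currently free.
        return self.free_gpu_blocks >= num_blocks


def threshold_with_reserve(needed_blocks: int, reserve_blocks: int) -> int:
    """Assumed shape of the threshold: the raw request plus a reserve."""
    return needed_blocks + reserve_blocks


cache = ToyCacheManager(free_gpu_blocks=10)
needed, reserve = 8, 4

# Pre-fix style check: only the raw request size is compared.
raw_check = cache.can_allocate_gpu_blocks(needed)
# Post-fix style check: the reserve is folded into the threshold first.
guarded_check = cache.can_allocate_gpu_blocks(threshold_with_reserve(needed, reserve))

print(raw_check, guarded_check)  # True False
```

In this toy setup the raw check admits a request that would eat into the reserve, while the guarded check rejects it. That is the kind of mismatch between the pre-check and the main scheduling path that the PR describes.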

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot (Bot) commented on May 22, 2026

Thanks for your contribution!

PaddlePaddle-bot

This comment was marked as outdated.

PaddlePaddle-bot commented on May 22, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-25 18:51:38

The CI report is generated from the following code (refreshed every 30 minutes):


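1 Task overview

No Required task has failed so far, but 1 Required task is still running; wait for it to finish before deciding whether the PR can be merged. Among the Optional tasks, 3 failed and 1 is running; these do not block merging and are listed for reference only.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Pending | Skipped |
|---|---|---|---|---|---|---|
| 41 (0) | 41 | 36 | 3 | 2 | 0 | 0 |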

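2 Task status summary

Note on the Log column: failed tasks link to the tool-generated Job; running tasks link to the corresponding Workflow.

2.1 Required tasks: 9/10 passed

Required tasks block merging; failures should be handled first.

| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
|---|---|---|---|---|---|---|
| ⏳ | Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | - | Running | Wait for the task to finish | Workflow | - |
| ✅ | The other 9 required tasks passed | - | - | - | - | - |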

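2.2 Optional tasks: 27/31 passed

Optional tasks do not block merging; failures are listed for reference only.

| Status | Task | Duration | Log | Rerun |
|---|---|---|---|---|
| ❌ | Run iluvatar Tests / run_iluvatar_cases | 1m59s | Job | - |
| ❌ | Check PR Template | 29s | Job | - |
| ❌ | Trigger Jenkins for PR | 17s | Job | - |
| ⏳ | CI_HPU | - | Workflow | - |
| ✅ | The other 27 optional tasks passed | - | - | - |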

3 Failure details (required only)

No required tasks failed. ci_failure_analyzer was not invoked in this round; Optional failures are not analyzed in depth.

codecov-commenter commented on May 22, 2026

Codecov Report

❌ Patch coverage is 50.00000% with 2 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@13eaea0). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| fastdeploy/engine/sched/resource_manager_v1.py | 50.00% | 0 Missing and 2 partials ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7890   +/-   ##
==========================================
  Coverage           ?   64.03%           
==========================================
  Files              ?      467           
  Lines              ?    64965           
  Branches           ?     9962           
==========================================
  Hits               ?    41601           
  Misses             ?    20540           
  Partials           ?     2824           

| Flag | Coverage Δ |
|---|---|
| GPU | 73.15% <50.00%> (?) |
| XPU | 7.07% <0.00%> (?) |

Flags with carried forward coverage won't be shown.
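As a sanity check on the figures in the coverage diff above, the short sketch below recomputes the project percentage from the raw counts. The observation that the displayed 64.03% matches truncation rather than rounding is derived from these numbers only and is not a claim about Codecov's documented behaviour. The patch-size inference assumes partially covered lines count as not covered.

```python
import math

# Totals copied from the coverage diff above.
hits, misses, partials, lines = 41601, 20540, 2824, 64965

# Every tracked line falls into exactly one of the three buckets.
assert hits + misses + partials == lines

ratio = hits / lines
truncated = math.floor(ratio * 10000) / 100  # cut off after two decimals
rounded = round(ratio * 100, 2)              # round to two decimals
print(truncated, rounded)  # 64.03 64.04

# Patch coverage: 50% with 0 missing and 2 partial lines.
# Assumption: partial lines are treated as not covered.
patch_uncovered = 0 + 2
patch_total = round(patch_uncovered / (1 - 50.0 / 100))
print(patch_total)  # 4
```

Under that assumption, the patch touches 4 tracked lines, of which 2 are fully covered.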

zccjjj changed the title from "bugfix AS block leaks" to "[Bugfix] AS block leaks" on May 25, 2026
zccjjj changed the title from "[Bugfix] AS block leaks" to "[Bugfix] [Cherry-pick] AS block leaks" on May 25, 2026
PaddlePaddle-bot

This comment was marked as outdated.

zccjjj (Contributor, Author) commented on May 25, 2026

/re-run all failed

zccjjj (Contributor, Author) commented on May 25, 2026

/re-run all-failed

zccjjj changed the title from "[Bugfix] [Cherry-pick] AS block leaks" to "[Bugfix] [Cherry-pick] AS block leaks(#7895)" on May 25, 2026

zccjjj (Contributor, Author) commented on May 25, 2026

/re-run all-failed

PaddlePaddle-bot

This comment was marked as outdated.

zccjjj changed the title from "[Bugfix] [Cherry-pick] AS block leaks(#7895)" to "[Bugfix] AS block leaks" on May 25, 2026

zccjjj (Contributor, Author) commented on May 25, 2026

/re-run all-failed

PaddlePaddle-bot

This comment was marked as outdated.

PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-05-25 17:10:25

📋 Review summary

PR overview: Fixes an AS block leak on the enable_prefix_caching + hierarchical KV cache path, where can_allocate_gpu_blocks was called without first computing the reserved-block threshold via _get_can_schedule_prefill_threshold_block.
Scope of change: fastdeploy/engine/sched/resource_manager_v1.py
Impact tags: [Scheduler] [KVCache]

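Issues

| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | resource_manager_v1.py:1073 | No unit test covers this fix path |
| 📝 PR conventions | | Title tag capitalization is off; several sections of the description template are empty |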

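📝 PR convention check

The title tag [Bugfix] should be normalized to [BugFix] (the official tag list uses [BugFix]); the Modifications, Usage or Command, and Accuracy Tests sections are empty and need to be filled in.

Suggested title (can be copied as-is):

  • [BugFix][Scheduler][KVCache] Fix AS block leaks in enable_prefix_caching + hierarchical KV cache path

Suggested PR description (can be copied as-is; it must reproduce the full structure of the checklist §D2 template):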

## Motivation
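Fixes an AS block leak on the enable_prefix_caching + hierarchical KV cache path, where the block count passed to `can_allocate_gpu_blocks` was not first run through `_get_can_schedule_prefill_threshold_block` to compute the reserved-block threshold.

## Modifications
- `fastdeploy/engine/sched/resource_manager_v1.py`: in the two hierarchical KV cache pre-check paths of `_allocate_decode_and_extend`, stop passing the raw block count directly; first call `_get_can_schedule_prefill_threshold_block` to compute a threshold that includes reserve_blocks, then pass that threshold to `can_allocate_gpu_blocks`, consistent with the main scheduling path.
- Add a Warning comment before the `_free_blocks` call, noting that calling `_free_blocks` before `update_cache_blocks` may leak storage blocks.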

## Usage or Command
N/A

## Accuracy Tests
N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [x] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

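Overall assessment

The fix logic is correct: it aligns the hierarchical KV cache pre-check paths with the main scheduling path and removes the risk of AS block leaks. Consider adding unit tests under tests/scheduler/ that cover this fix path, and completing the PR description.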

or self.config.cache_config.kvcache_storage_backend
):
if not self.cache_manager.can_allocate_gpu_blocks(
can_schedule_block_num_threshold = self._get_can_schedule_prefill_threshold_block(

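🟡 Suggestion: this fix path has no corresponding unit test.

Consider adding test cases under tests/scheduler/ that cover the enable_prefix_caching=True + num_cpu_blocks > 0 (or non-empty kvcache_storage_backend) scenario, and verify that the threshold computed by _get_can_schedule_prefill_threshold_block correctly prevents the block leak.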
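One possible shape for such a test is sketched below. It is only a sketch: `ToyResourceManager`, its `precheck` method, and the additive reserve are hypothetical stand-ins, not the actual class in resource_manager_v1.py. A real test would construct the project's resource manager with its own config objects. The point shown is the assertion pattern: mock the cache manager and check that it receives the threshold rather than the raw block count.

```python
import unittest
from unittest.mock import MagicMock


class ToyResourceManager:
    """Hypothetical, heavily simplified stand-in for the scheduler's resource manager."""

    def __init__(self, cache_manager, reserve_blocks):
        self.cache_manager = cache_manager
        self.reserve_blocks = reserve_blocks

    def _get_can_schedule_prefill_threshold_block(self, num_blocks):
        # Assumed behaviour: add the reserve on top of the raw request.
        return num_blocks + self.reserve_blocks

    def precheck(self, num_blocks):
        # Compute the threshold first, then ask the cache manager.
        threshold = self._get_can_schedule_prefill_threshold_block(num_blocks)
        return self.cache_manager.can_allocate_gpu_blocks(threshold)


class PrecheckUsesThresholdTest(unittest.TestCase):
    def test_threshold_is_passed_to_cache_manager(self):
        cache_manager = MagicMock()
        cache_manager.can_allocate_gpu_blocks.return_value = False
        manager = ToyResourceManager(cache_manager, reserve_blocks=4)

        self.assertFalse(manager.precheck(8))
        # The cache manager must see the threshold (12), not the raw request (8).
        cache_manager.can_allocate_gpu_blocks.assert_called_once_with(12)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(PrecheckUsesThresholdTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Mocking the cache manager keeps the test independent of GPU state, so it can run in ordinary CI.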

kevincheng2 merged commit f23586e into PaddlePaddle:develop on May 25, 2026
39 of 43 checks passed

5 participants