[Bugfix] AS block leaks by zccjjj · Pull Request #7890 · PaddlePaddle/FastDeploy

zccjjj · 2026-05-22T03:09:25Z

Motivation

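Fixes an AS block leak on the enable_prefix_caching + hierarchical KV Cache path, where can_allocate_gpu_blocks was called without first computing the reserved-block threshold via _get_can_schedule_prefill_threshold_block.
Scope of change: fastdeploy/engine/sched/resource_manager_v1.py (scheduler resource management)
Impact tags: [Scheduler] [KVCache]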
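
The difference between checking the raw block count and checking a reserve-aware threshold can be illustrated with a toy model. This is a minimal sketch, not the FastDeploy implementation: `ToyCacheManager`, `ToyResourceManager`, `precheck_raw`, `precheck_with_threshold`, and the rule "threshold = requested blocks + reserve_blocks" are all assumptions made for illustration. Only the names `can_allocate_gpu_blocks` and `reserve_blocks` come from the PR text.

```python
from dataclasses import dataclass


@dataclass
class ToyCacheManager:
    """Hypothetical stand-in for the cache manager; tracks only a free-block count."""

    free_gpu_blocks: int

    def can_allocate_gpu_blocks(self, num_blocks: int) -> bool:
        # True when at least `num_blocks` GPU blocks are currently free.
        return self.free_gpu_blocks >= num_blocks


@dataclass
class ToyResourceManager:
    """Hypothetical scheduler-side wrapper holding a reserve of blocks."""

    cache_manager: ToyCacheManager
    reserve_blocks: int  # assumed headroom that admission should leave free

    def threshold_for(self, num_blocks: int) -> int:
        # Assumption: the threshold is the request plus the reserve.
        return num_blocks + self.reserve_blocks

    def precheck_raw(self, num_blocks: int) -> bool:
        # Pre-fix shape: the raw count goes straight to the cache manager.
        return self.cache_manager.can_allocate_gpu_blocks(num_blocks)

    def precheck_with_threshold(self, num_blocks: int) -> bool:
        # Post-fix shape: compute the threshold first, then check it.
        threshold = self.threshold_for(num_blocks)
        return self.cache_manager.can_allocate_gpu_blocks(threshold)


manager = ToyResourceManager(ToyCacheManager(free_gpu_blocks=10), reserve_blocks=4)
print(manager.precheck_raw(8))             # raw check admits the request
print(manager.precheck_with_threshold(8))  # threshold check rejects it (8 + 4 > 10)
```

In this toy setup the two prechecks disagree exactly when a request would eat into the reserve, which is the inconsistency with the main scheduling path that the PR describes.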

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

Add at least a tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot · 2026-05-22T03:09:32Z

Thanks for your contribution!

PaddlePaddle-bot · 2026-05-22T03:35:15Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-25 18:51:38

This CI report was generated from the following code (refreshed every 30 minutes):

PR commit: 7b4af55
Merge base: 13eaea0 (branch: develop)
View full diff
CI details

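1 Task overview

No Required task has failed so far, but 1 Required task is still running; wait for it to finish before deciding whether the PR can be merged. Among Optional tasks, 3 failed and 1 is still running; these do not block merging and are for reference only.

Total runs (reruns)	Total tasks	✅ Passed	❌ Failed	⏳ Running	⏸️ Pending	Skipped
41(0)	41	36	3	2	0	0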

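2 Task status summary

Log column: failed tasks link to the tool-generated Job; running tasks link to the corresponding Workflow.

2.1 Required tasks: 9/10 passed

Required tasks block merging; failures should be handled first.

Status	Task	Duration	Root cause	Suggested fix	Log	Rerun
⏳	`Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage`	-	Running	Wait for the task to finish	Workflow	-
✅	Remaining 9 required tasks passed	-	-	-	-	-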

2.2 Optional tasks: 27/31 passed

Optional tasks do not block merging; failures are for reference only.

Status	Task	Duration	Log	Rerun
❌	`Run iluvatar Tests / run_iluvatar_cases`	1m59s	Job	-
❌	`Check PR Template`	29s	Job	-
❌	`Trigger Jenkins for PR`	17s	Job	-
⏳	`CI_HPU`	-	Workflow	-
✅	Remaining 27 optional tasks passed	-	-	-

3 Failure details (required only)

No required tasks failed. ci_failure_analyzer was not invoked in this round; Optional failures are not analyzed in depth.

codecov-commenter · 2026-05-22T04:00:44Z

Codecov Report

❌ Patch coverage is 50.00000% with 2 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@13eaea0). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
fastdeploy/engine/sched/resource_manager_v1.py	50.00%	0 Missing and 2 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #7890   +/-   ##
==========================================
  Coverage           ?   64.03%           
==========================================
  Files              ?      467           
  Lines              ?    64965           
  Branches           ?     9962           
==========================================
  Hits               ?    41601           
  Misses             ?    20540           
  Partials           ?     2824
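
As a quick consistency check on the figures above, the snippet below recomputes the totals from the reported counts. It is a sketch that uses only the numbers in this report. That the percentage is cut to two decimals by rounding down is an observation that fits these numbers, stated here as an assumption rather than as a description of Codecov's configuration for this repository.

```python
# Counts copied from the coverage summary in this report.
hits, misses, partials = 41601, 20540, 2824
lines = 64965

# Every tracked line is a hit, a miss, or a partial.
total = hits + misses + partials

# Coverage in hundredths of a percent, rounded down with integer arithmetic.
coverage_hundredths = (hits * 10_000) // lines

print(total == lines)
print(f"{coverage_hundredths // 100}.{coverage_hundredths % 100:02d}%")
```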

Flag	Coverage Δ
GPU	`73.15% <50.00%> (?)`
XPU	`7.07% <0.00%> (?)`

Flags with carried forward coverage won't be shown. Click here to find out more.

zccjjj · 2026-05-25T03:23:43Z

/re-run all failed

zccjjj · 2026-05-25T03:28:17Z

/re-run all-failed

zccjjj · 2026-05-25T04:02:04Z

/re-run all-failed

zccjjj · 2026-05-25T06:33:45Z

/re-run all-failed

PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-05-25 17:10:25

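📋 Review summary

PR overview: Fixes an AS block leak on the enable_prefix_caching + hierarchical KV Cache path, caused by can_allocate_gpu_blocks being called without the reserved-block threshold computed by _get_can_schedule_prefill_threshold_block.
Scope of change: fastdeploy/engine/sched/resource_manager_v1.py
Impact tags: [Scheduler] [KVCache]

Issues

Severity	File	Summary
🟡 Suggestion	`resource_manager_v1.py:1073`	No unit test covers this fix path
📝 PR conventions	-	Title tag casing differs from the official list; several sections of the description template are empty

📝 PR convention check

The title tag [Bugfix] should be written [BugFix], as it appears in the official tag list. The Modifications, Usage or Command, and Accuracy Tests sections are empty and need to be filled in.

Suggested title (can be copied directly):

[BugFix][Scheduler][KVCache] Fix AS block leaks in enable_prefix_caching + hierarchical KV cache path

Suggested PR description (can be copied directly; it must replicate the full structure of the checklist §D2 template):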

## Motivation
Fixes an AS block leak on the enable_prefix_caching + hierarchical KV Cache path, where `can_allocate_gpu_blocks` was called without first computing the reserved-block threshold via `_get_can_schedule_prefill_threshold_block`.

## Modifications
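- `fastdeploy/engine/sched/resource_manager_v1.py`: in the two hierarchical KV Cache precheck paths of `_allocate_decode_and_extend`, stop passing the raw block count directly; instead call `_get_can_schedule_prefill_threshold_block` first to compute a threshold that includes reserve_blocks, then pass that threshold to `can_allocate_gpu_blocks`, consistent with the main scheduling path.
- Add a Warning comment before the `_free_blocks` call, noting that calling `_free_blocks` before `update_cache_blocks` may leak storage blocks.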

## Usage or Command
N/A

## Accuracy Tests
N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [x] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

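Overall assessment

The fix logic is correct: it aligns the hierarchical KV Cache precheck path with the main scheduling path and removes the risk of AS block leaks. Consider adding unit tests under tests/scheduler/ that cover this fix path, and completing the PR description.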

PaddlePaddle-bot · 2026-05-25T09:12:37Z

                                    or self.config.cache_config.kvcache_storage_backend
                                ):
-                                    if not self.cache_manager.can_allocate_gpu_blocks(
+                                    can_schedule_block_num_threshold = self._get_can_schedule_prefill_threshold_block(


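🟡 Suggestion: this fix path has no corresponding unit test.

Consider adding test cases under tests/scheduler/ that cover the scenario enable_prefix_caching=True + num_cpu_blocks > 0 (or a non-empty kvcache_storage_backend), and verify that the threshold computed by _get_can_schedule_prefill_threshold_block correctly prevents the block leak.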
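
One possible shape for such a test is sketched below with `unittest.mock`. It is an illustration under stated assumptions, not a test against the real `resource_manager_v1.py`: `hierarchical_precheck` is a hypothetical stand-in for the patched branch, and the single-argument call to `_get_can_schedule_prefill_threshold_block` is assumed because the diff excerpt does not show its arguments. A real test would exercise the actual method with the relevant configuration enabled.

```python
from unittest.mock import MagicMock


def hierarchical_precheck(manager, num_new_blocks):
    """Hypothetical stand-in for the patched precheck branch.

    Assumed behaviour: compute the threshold first, then ask the cache
    manager whether that many GPU blocks can be allocated.
    """
    threshold = manager._get_can_schedule_prefill_threshold_block(num_new_blocks)
    return manager.cache_manager.can_allocate_gpu_blocks(threshold)


def test_precheck_uses_threshold_not_raw_count():
    manager = MagicMock()
    # Pretend the reserve raises a request for 8 blocks to a threshold of 12.
    manager._get_can_schedule_prefill_threshold_block.return_value = 12
    manager.cache_manager.can_allocate_gpu_blocks.return_value = False

    assert hierarchical_precheck(manager, 8) is False

    # The threshold helper sees the raw count ...
    manager._get_can_schedule_prefill_threshold_block.assert_called_once_with(8)
    # ... and the cache manager sees the threshold, never the raw count.
    manager.cache_manager.can_allocate_gpu_blocks.assert_called_once_with(12)


test_precheck_uses_threshold_not_raw_count()
```

Asserting on the argument passed to `can_allocate_gpu_blocks` is what distinguishes the fixed path from the old one, since both would otherwise return a boolean of the same type.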

zccjjj had a problem deploying to Metax_ci May 22, 2026 03:09 — with GitHub Actions Failure

zccjjj force-pushed the ASBlockLeaks branch from 84fd8c2 to 3735546 Compare May 25, 2026 02:42

zccjjj had a problem deploying to Metax_ci May 25, 2026 02:43 — with GitHub Actions Failure

zccjjj changed the title ~~bugfix AS block leaks~~ [Bugfix] AS block leaks May 25, 2026

zccjjj changed the title ~~[Bugfix] AS block leaks~~ [Bugfix] [Cherry-pick] AS block leaks May 25, 2026

zccjjj changed the title ~~[Bugfix] [Cherry-pick] AS block leaks~~ [Bugfix] [Cherry-pick] AS block leaks(#7895) May 25, 2026

zccjjj had a problem deploying to Metax_ci May 25, 2026 04:02 — with GitHub Actions Failure

zccjjj force-pushed the ASBlockLeaks branch from 3735546 to d906923 Compare May 25, 2026 04:04

zccjjj had a problem deploying to Metax_ci May 25, 2026 04:04 — with GitHub Actions Failure

zccjjj changed the title ~~[Bugfix] [Cherry-pick] AS block leaks(#7895)~~ [Bugfix] AS block leaks May 25, 2026

zccjjj had a problem deploying to Metax_ci May 25, 2026 06:34 — with GitHub Actions Failure

[bugfix] AS block leaks

8d760e5

zccjjj force-pushed the ASBlockLeaks branch from d906923 to 8d760e5 Compare May 25, 2026 07:04

zccjjj had a problem deploying to Metax_ci May 25, 2026 07:04 — with GitHub Actions Failure

Merge branch 'develop' into ASBlockLeaks

7b4af55

plusNew001 had a problem deploying to Metax_ci May 25, 2026 08:32 — with GitHub Actions Failure

PaddlePaddle-bot reviewed May 25, 2026

View reviewed changes

kevincheng2 approved these changes May 25, 2026

View reviewed changes

kevincheng2 merged commit f23586e into PaddlePaddle:develop May 25, 2026
39 of 43 checks passed

5 participants
