
reduce sleep time in loops and cancel schedule threshold for prefill instance #7871

Draft
liyonghua0910 wants to merge 1 commit into PaddlePaddle:release/2.6 from liyonghua0910:release/2.6+20260521_pd_test


@liyonghua0910
Collaborator

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot Bot commented May 21, 2026

Thanks for your contribution!

@PaddlePaddle-bot

PaddlePaddle-bot commented May 21, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-21 20:21:01

The CI report is generated from the following code (updated every 30 minutes):


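1 Task overview

2 Required jobs are currently failing (1 coverage-threshold failure, 1 awaiting Approval); no Required jobs are running or pending. Coverage must be addressed and manual approval completed before merging.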

| Total runs (reruns) | Total jobs | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Pending | Skipped |
|---|---|---|---|---|---|---|
| 36(0) | 36 | 30 | 5 | 1 | 0 | 0 |

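2 Job status summary

Note on the Log column: failed jobs use the pre-generated Job link directly; running jobs show the job name when no URL is available.

2.1 Required jobs: 8/10 passed

Required jobs block merging; failures must be handled first.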

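| Status | Job | Duration | Root cause | Suggested fix | Log | Rerun |
|---|---|---|---|---|---|---|
| | Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | 1h13m | PR issue: diff coverage is 62%; the newly added sleep lines are not covered | Add tests for the uncovered lines or request a coverage exemption | Job | - |
| | Approval | 10s | Approval required: awaiting manual approval | Complete the manual approval | Job | - |
| | The other 8 required jobs passed | - | - | - | - | - |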

2.2 Optional jobs: 22/26 passed

Optional jobs do not block merging; failures are for reference only.

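| Status | Job | Duration | Log | Rerun |
|---|---|---|---|---|
| | Run iluvatar Tests / run_iluvatar_cases | 14m12s | Job | - |
| | Check PR Template | 11s | Job | - |
| | Trigger Jenkins for PR | 1m4s | Job | - |
| | CI_HPU | - | - | - |
| | The other 22 optional jobs passed | - | - | - |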

3 Failure details (required only)

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: insufficient coverage (confidence: high)

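  • Status: ❌ Failed
  • Error type: insufficient coverage
  • Confidence: high
  • Root cause summary: diff coverage is 62%; the newly added sleep lines are not covered
  • Analyzer: ci_analyze_unittest_fastdeploy

Failed cases: none. The unit tests passed (TEST_EXIT_CODE=0); the failure occurred during the coverage-threshold check.

Root cause details:
CI failed at the Verify Code Coverage Threshold (80%) step with COVERAGE_EXIT_CODE=9. The diff_coverage.json in the log shows that total coverage of the lines changed by this PR is 62%, below the 80% threshold. The uncovered lines are fastdeploy/engine/common_engine_prepare_mixin.py:251,254 and fastdeploy/output/token_processor.py:671, all of which are lines this PR added or modified to adjust sleep times.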

Key log:

Coverage generation failed (exit code 9)
GPU Patch Coverage Details:
fastdeploy/output/token_processor.py: percent_covered=0.0, violation_lines=[671]
fastdeploy/engine/common_engine_prepare_mixin.py: percent_covered=0.0, violation_lines=[251, 254]
total_num_lines=8, total_num_violations=3, total_percent_covered=62
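
The 62% figure follows from the counts in the log: 8 changed lines with 3 violations leaves 5 covered lines, and 5/8 = 62.5%. A minimal sketch of that arithmetic, assuming the tool truncates to an integer percentage; the function names are illustrative and not part of the CI scripts:

```python
def diff_coverage_percent(total_num_lines: int, total_num_violations: int) -> int:
    """Integer percentage of changed lines that are covered.

    Truncation (rather than rounding) is an assumption made to match
    the 62 reported in the log.
    """
    if total_num_lines == 0:
        return 100  # nothing changed, so nothing is uncovered
    covered = total_num_lines - total_num_violations
    return int(covered / total_num_lines * 100)


def meets_threshold(percent: int, threshold: int = 80) -> bool:
    """True when the diff coverage reaches the required threshold."""
    return percent >= threshold


percent = diff_coverage_percent(total_num_lines=8, total_num_violations=3)
print(percent, meets_threshold(percent))
```

Under the same assumption, covering any two of the three violation lines would give 7/8, which truncates to 87% and clears the 80% threshold.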

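Suggested fixes:

  1. In the relevant unit tests, cover both the normal path and the exception branch of _fetch_loop in fastdeploy/engine/common_engine_prepare_mixin.py, so that the sleep calls on lines 251/254 are exercised.
  2. Cover the branch in fastdeploy/output/token_processor.py where prefill waits for an unfinished cache, or, if this is only a constant sleep adjustment with no test value, request a diff coverage exemption through the project process.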
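
One way to cover sleep lines without slowing the test suite is to patch `time.sleep` and assert on the calls. The sketch below uses a hypothetical `fetch_loop` stand-in with a sleep on both the normal and the exception path; it is not FastDeploy's `_fetch_loop`, whose real signature and stop condition are not shown in this PR.

```python
import threading
import time
from unittest import mock


def fetch_loop(fetch_fn, stop_event, interval=0.02):
    """Hypothetical polling loop; NOT FastDeploy's `_fetch_loop`."""
    while not stop_event.is_set():
        try:
            fetch_fn()
            time.sleep(interval)  # normal path
        except Exception:
            time.sleep(interval)  # exception path


def run_once(fetch_fn):
    """Run one loop iteration with time.sleep patched; return the recorded calls."""
    stop_event = threading.Event()

    def fake_sleep(_seconds):
        stop_event.set()  # end the loop right after the first sleep

    with mock.patch("time.sleep", side_effect=fake_sleep) as sleep_mock:
        fetch_loop(fetch_fn, stop_event)
    return sleep_mock.call_args_list


def failing_fetch():
    raise RuntimeError("simulated fetch failure")


normal_calls = run_once(lambda: None)   # exercises the normal-path sleep
error_calls = run_once(failing_fetch)   # exercises the exception-path sleep
print(normal_calls, error_calls)
```

Because the patched sleep also stops the loop, each run executes exactly one iteration and returns immediately.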

Suggested fix summary: add tests for the uncovered lines or request a coverage exemption

Related changes: fastdeploy/engine/common_engine_prepare_mixin.py:251,254; fastdeploy/output/token_processor.py:671
Link: view log

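Approval: Approval required (confidence: high)

This job requires manual Approval; CI continues only after the approval is completed.

Suggested fix summary: complete the manual approval
Link: view log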


@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-05-21 11:22:00

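📋 Review summary

PR overview: for the PD disaggregation scenario, this PR adjusts the sleep duration of several loops, cancels the schedule block threshold for Prefill instances, fixes a negative block_num issue, and changes the default number of prefill preparation threads.
Scope of changes: engine/, engine/sched/, envs.py, output/, splitwise/
Affected tags: [PD Disaggregation] [Scheduler] [Engine]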
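
The thread-count change mentioned in the overview is a change to an environment-variable default. A minimal sketch of that pattern, assuming the value is read with `os.getenv` and parsed as an integer; the helper name is hypothetical, and only the variable name `FD_PREFILL_PREPARE_REQ_THREAD_NUM` and the values 5 and 3 come from this PR:

```python
import os

ENV_NAME = "FD_PREFILL_PREPARE_REQ_THREAD_NUM"


def prefill_prepare_thread_num(default: int = 3) -> int:
    """Read the thread count from the environment, falling back to the default.

    Hypothetical helper for illustration; FastDeploy's envs.py may structure
    this lookup differently.
    """
    return int(os.getenv(ENV_NAME, str(default)))


print(prefill_prepare_thread_num())
```

A deployment that relied on the previous default can keep it by exporting `FD_PREFILL_PREPARE_REQ_THREAD_NUM=5` before starting the service.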

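Issues

| Level | File | Summary |
|---|---|---|
| ❓ Question | fastdeploy/engine/common_engine_prepare_mixin.py:251 | The PR title says "reduce sleep time", but every sleep is actually increased (0.002→0.02 / 0.002→0.005 / 0.001→0.005) |
| 📝 PR conventions | | The target branch is release, but the title lacks the [Cherry-Pick] prefix and the original PR number; the title has no official Tag; all description sections are empty |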

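📝 PR convention check

The PR has the following convention issues:

  1. The target branch is release/2.6, so by convention the title must use the format [Cherry-Pick][Tag] title(#original PR number), but the current title has neither [Cherry-Pick] nor an official Tag.
  2. In the PR body, Motivation / Modifications / Usage or Command / Accuracy Tests are all empty (only the template comments remain).

Suggested title (can be copied directly):

  • [Cherry-Pick][PD Disaggregation] Adjust sleep intervals and cancel schedule threshold for prefill instance(#<original PR number>)

Suggested PR description (can be copied directly):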

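## Motivation
Performance and stability tuning for the PD disaggregation (decoupled Prefill/Decode) scenario:
1. Moderately increase the sleep time of the busy-wait loops to reduce wasted CPU polling;
2. Remove the decode-phase schedule block threshold for Prefill instances so that running requests are not blocked by mistake;
3. Fix a potential bug where `block_num` in `get_new_block_nums` could be negative;
4. Change the default thread count `FD_PREFILL_PREPARE_REQ_THREAD_NUM` from 5 to 3 to reduce unnecessary thread overhead.

## Modifications
- `fastdeploy/engine/common_engine_prepare_mixin.py`: in `_fetch_loop`, the sleep on the normal and exception paths changes from `0.002s` to `0.02s`
- `fastdeploy/engine/sched/resource_manager_v1.py`:
  - `get_new_block_nums` adds `block_num = max(block_num, 0)` as a guard against negative values
  - In `_allocate_decode_and_extend`, Prefill instances (`splitwise_role == "prefill"`) set `can_schedule_block_num_threshold = 0` directly and skip the threshold calculation
- `fastdeploy/envs.py`: the default of `FD_PREFILL_PREPARE_REQ_THREAD_NUM` changes from `5` to `3`
- `fastdeploy/output/token_processor.py`: in `_recycle_resources`, the sleep while waiting for the KV cache to be sent changes from `0.002s` to `0.005s`
- `fastdeploy/splitwise/splitwise_connector.py`: the polling sleep in `check_decode_allocated` changes from `0.001s` to `0.005s`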

## Usage or Command
N/A

## Accuracy Tests
N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [x] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

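Overall assessment

The logic changes are reasonable: the guard against a negative block_num is an effective defensive fix, and having Prefill instances skip the schedule block threshold matches PD disaggregation semantics. The author needs to confirm whether the direction of the sleep adjustments matches the PR title (the title says "reduce" but every change is an increase) and to complete the PR description for later traceability.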
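
The two logic changes assessed above (the negative `block_num` guard and the zero threshold for Prefill instances) can be sketched as follows. Both helpers are hypothetical simplifications: the real `get_new_block_nums` and `_allocate_decode_and_extend` take different arguments that are not shown in this PR, and the ceiling-division formula is an assumption.

```python
def new_blocks_needed(num_tokens: int, block_size: int, allocated_blocks: int) -> int:
    """Blocks still to allocate for a request (hypothetical helper)."""
    required = (num_tokens + block_size - 1) // block_size  # ceiling division (assumed)
    block_num = required - allocated_blocks
    # The guard from the PR: never report a negative block count when more
    # blocks are already allocated than are currently required.
    return max(block_num, 0)


def schedule_block_threshold(splitwise_role: str, default_threshold: int) -> int:
    """Prefill instances skip the threshold; other roles keep the computed value."""
    if splitwise_role == "prefill":
        return 0
    return default_threshold


print(new_blocks_needed(num_tokens=100, block_size=64, allocated_blocks=3))
print(schedule_block_threshold("prefill", default_threshold=8))
```

Without the `max(..., 0)` clamp, the first call would yield a negative count, which downstream allocation code would have to handle explicitly.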

  self._pause_cond.wait_for(lambda: not self.is_paused)
  fetch_fn()
- time.sleep(0.002)
+ time.sleep(0.02)

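❓ Question: The PR title is "reduce sleep time in loops", but this line actually increases the sleep from 0.002s to 0.02s (10×). token_processor.py (0.002→0.005) and splitwise_connector.py (0.001→0.005) are increases as well.

Please confirm: is the title a slip (should it read "adjust/increase sleep time")? Or is there a reduction that this PR does not show?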
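
To make the direction of each change concrete, the sketch below compares the old and new intervals quoted in the review. The polls-per-second figure is only an upper bound that assumes the loop body takes no time, and the helper name is illustrative.

```python
# (old, new) sleep intervals in seconds, as quoted in the review comment
CHANGES = {
    "common_engine_prepare_mixin.py": (0.002, 0.02),
    "token_processor.py": (0.002, 0.005),
    "splitwise_connector.py": (0.001, 0.005),
}


def polls_per_second(interval: float) -> int:
    """Upper bound on wake-ups per second for a loop that only sleeps."""
    return round(1 / interval)


for name, (old, new) in CHANGES.items():
    direction = "increase" if new > old else "decrease"
    print(
        f"{name}: {old}s -> {new}s ({direction}), "
        f"at most {polls_per_second(old)} -> {polls_per_second(new)} polls/s"
    )
```

A longer interval means fewer wake-ups and less idle CPU use, at the cost of a larger worst-case delay before the loop notices new work.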

@codecov-commenter

Codecov Report

❌ Patch coverage is 50.00000% with 4 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (release/2.6@31b12ee). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| fastdeploy/engine/common_engine_prepare_mixin.py | 0.00% | 2 Missing ⚠️ |
| fastdeploy/engine/sched/resource_manager_v1.py | 75.00% | 1 Missing ⚠️ |
| fastdeploy/output/token_processor.py | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@              Coverage Diff               @@
##             release/2.6    #7871   +/-   ##
==============================================
  Coverage               ?   72.20%           
==============================================
  Files                  ?      381           
  Lines                  ?    54225           
  Branches               ?     8474           
==============================================
  Hits                   ?    39153           
  Misses                 ?    12302           
  Partials               ?     2770           
Flag Coverage Δ
GPU 72.20% <50.00%> (?)

