[Cherry-Pick][BugFix] Fix get_tasks returns empty list and incorrect nnode computation(#7677 #7685) #7863
Thanks for your contribution!
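CI report generated from the following code (refreshed every 30 minutes):
1 Task overview
All Required tasks have passed (6/6). Every merge-blocking check for this PR is green, so approval is recommended. One Optional task is still failing; it is listed for reference only and does not block the merge.
2 Task status summary
Log column note: failed tasks directly use
2.1 Required tasks: 6/6 passed
2.2 Optional tasks: 16/17 passed
3 Failure details (required only)
No required tasks failed. The only failure in this run is the Optional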
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## release/online/20260415 #7863 +/- ##
==========================================================
Coverage ? 72.37%
==========================================================
Files ? 387
Lines ? 54021
Branches ? 8467
==========================================================
Hits ? 39095
Misses ? 12235
Partials ? 2691
…0260520_fix_get_tasks
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-05-20 17:36:48
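📋 Review summary
PR overview: Cherry-picks fixes for two bugs in worker_process.py: an incorrect ceiling computation for nnode, and get_tasks returning an empty list when there are no tasks
Change scope: fastdeploy/worker/worker_process.py
Impact tags: [BugFix] [Executor]
Issues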
| Level | File | Summary |
|---|---|---|
| 📝 PR conventions | - | The Usage or Command and Accuracy Tests sections are empty (the template requires N/A); the unit-test checklist item is unchecked and no reason is given |
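📝 PR convention check
The Usage or Command and Accuracy Tests sections are left blank (the template requires "N/A if none"); the Add unit tests checklist item is unchecked, but the PR does not explain why no tests were written (the checklist itself says: "Please write the reason in this PR if no unit tests").
Suggested title (can be copied directly):
- [Cherry-Pick][BugFix] Fix get_tasks returns empty list and incorrect nnode computation(#7677 #7685)
The title already follows the Cherry-Pick convention; no change is needed.
Suggested PR description (can be copied directly; empty sections are filled with N/A and the checklist is corrected):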
## Motivation
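Cherry-pick #7677 and #7685 to the `release/online/20260415` branch to fix two bugs:
1. `nnode` was computed with plain division instead of ceiling division, so when `tp_size` is an exact multiple of `max_chips_per_node` one extra node is counted, and single-node multi-card setups wrongly take the multi-node code path
2. When the task queue is empty, `exist_task_flag` is not reset, so the next loop iteration still enters `get_tasks` and gets back an empty list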
## Modifications
- `fastdeploy/worker/worker_process.py`
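- Fix the `nnode` ceiling computation: `(tp_size + max_chips_per_node) // max_chips_per_node` → `(tp_size + max_chips_per_node - 1) // max_chips_per_node`
- Add `_get_exist_task_flag()` / `_update_exist_task_flag()` to wrap single-node/multi-node signal reads and writes, removing duplicated code
- Actively reset the flag when `exist_tasks() == False`, preventing `get_tasks` from returning an empty list
- Add a TP barrier before task detection so that every TP worker finishes the previous iteration before `tp_rank==0` updates the flag
- Lower the "Detected new requests" log level from INFO to DEBUG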
## Usage or Command
N/A
## Accuracy Tests
N/A
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [x] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
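- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
Overall evaluation
Both bug fixes are logically correct: the nnode ceiling fix resolves the case where a fully populated single node (e.g. 8-card TP) wrongly took the multi-node path; the flag reset fixes the empty-list problem in get_tasks; and the new TP barrier is placed correctly, with the right synchronization order. Code quality is good; ready to merge.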
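For readers who want to check the two fixes by hand, here is a minimal Python sketch. The two `nnode` expressions are the ones quoted in this review. Everything else (`StubTaskQueue`, `poll_once`, and the plain integer flag) is a hypothetical stand-in written for illustration and is not taken from `worker_process.py`. The sketch also simply starts from a flag that is already stale, since the review does not describe how it gets into that state.

```python
# Illustrative sketch only. The two nnode formulas are quoted from the review;
# StubTaskQueue and poll_once are hypothetical and not FastDeploy code.

def nnode_before(tp_size: int, max_chips_per_node: int) -> int:
    """Pre-fix expression: overcounts by one when tp_size is an exact multiple."""
    return (tp_size + max_chips_per_node) // max_chips_per_node


def nnode_after(tp_size: int, max_chips_per_node: int) -> int:
    """Post-fix expression: standard integer ceiling division."""
    return (tp_size + max_chips_per_node - 1) // max_chips_per_node


class StubTaskQueue:
    """Hypothetical stand-in for the task queue; holds a list of pending tasks."""

    def __init__(self, pending=None):
        self.pending = list(pending or [])

    def exist_tasks(self) -> bool:
        return bool(self.pending)

    def get_tasks(self) -> list:
        # Hand back everything that is pending and leave the queue empty.
        tasks, self.pending = self.pending, []
        return tasks


def poll_once(queue: StubTaskQueue, flag: int, reset_when_empty: bool):
    """One polling step. Returns (new_flag, tasks); tasks is None if get_tasks was skipped."""
    if queue.exist_tasks():
        flag = 1
    elif reset_when_empty:
        flag = 0  # the behavior described by the fix: clear a stale flag
    tasks = queue.get_tasks() if flag == 1 else None
    return flag, tasks


if __name__ == "__main__":
    for tp_size in (4, 8, 9, 16):
        print(tp_size, nnode_before(tp_size, 8), nnode_after(tp_size, 8))
    # Stale flag with an empty queue, without and with the reset.
    print(poll_once(StubTaskQueue(), flag=1, reset_when_empty=False))
    print(poll_once(StubTaskQueue(), flag=1, reset_when_empty=True))
```

With 8 chips per node, the pre-fix expression gives 2 nodes for `tp_size=8` while the fixed one gives 1, which matches the single-node 8-card case called out above. In the polling stub, a stale flag with an empty queue yields an empty list when the reset is disabled and skips `get_tasks` entirely when it is enabled.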
40d3f3e
into
PaddlePaddle:release/online/20260415
Motivation
Cherry-pick #7677 and #7685 to release/online/20260415 branch.
Modifications
Usage or Command
Accuracy Tests
Checklist
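- Add at least a tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine],[Scheduler],[PD Disaggregation],[Executor],[Graph Optimization],[Speculative Decoding],[RL],[Models],[Quantization],[Loader],[OP],[KVCache],[DataProcessor],[BugFix],[Docs],[CI],[Optimization],[Feature],[Benchmark],[Others],[XPU],[HPU],[GCU],[DCU],[Iluvatar],[Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
- Format your code, run pre-commit before commit.
- Add unit tests. Please write the reason in this PR if no unit tests.
- Provide accuracy results.
- If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.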