[Models] fix fleet model fallback ep init #8039
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## develop #8039 +/- ##
==========================================
Coverage ? 67.50%
==========================================
Files ? 475
Lines ? 66661
Branches ? 10284
==========================================
Hits ? 45002
Misses ? 18788
Partials ? 2871
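CI report, generated from the following code (refreshed every 30 minutes):
1. Required tasks: 9/10 passed
2. Failure details
🔴 Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: PR issue (confidence: high)
Error type: PR issue | Confidence: high
Key logs:
Suggested fix:
Related changes: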
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-06-16 17:38:00
📋 Review Summary
PR overview: Fixes generative-model validation in the --model-impl paddlefleet fallback path and the reuse of stale PaddleFleet EP/TP initialization state.
Scope of changes: fastdeploy/config.py, fastdeploy/model_executor/models/paddleformers/base_fleet.py
Impact tags: [Models] [FDConfig]
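The config-side change named in the overview can be pictured with a small sketch. This is only an illustration of the rule as described in this PR, not FastDeploy's actual validation code: `NATIVE_GENERATIVE_ARCHS`, `RunnerCheck`, `resolve_generate_path`, and the architecture names are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical stand-in for FastDeploy's native registry of generative models.
NATIVE_GENERATIVE_ARCHS = {"NativeCausalLM"}


@dataclass
class RunnerCheck:
    # Hypothetical container for the three inputs the rule depends on.
    model_impl: str
    runner_type: str
    architecture: str


def resolve_generate_path(cfg: RunnerCheck) -> str:
    """Return which path a config would take, or raise if it is rejected."""
    if cfg.runner_type != "generate":
        return "not-generate"
    if cfg.architecture in NATIVE_GENERATIVE_ARCHS:
        return "native"
    # The behavior described in this PR: paddlefleet may continue as a
    # fallback even though the native registry does not list the model.
    if cfg.model_impl == "paddlefleet":
        return "fallback"
    raise ValueError(
        f"{cfg.architecture} is not registered as a generative model"
    )
```

Under these assumptions, an unregistered architecture passes only when `model_impl` is `paddlefleet`; any other implementation still fails the generate check.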
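Issues
| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | fastdeploy/model_executor/models/paddleformers/base_fleet.py:431 | When the PaddleFleet TP group is rebuilt manually for TP=1, the global ranks are not reset alongside it, so stale topology state may still be reused |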
Status of previous findings
| Finding | Issue | Status |
|---|---|---|
| F1 | The old TP group may still be kept when its topology is stale | |
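📝 PR convention check
The title and description structure have been completed as suggested in the previous review; no convention issue that should block this fix was found. In the Checklist, - [x ] Add unit tests is not a valid checkbox; change it to a valid [ ] or [x] according to the actual test status.
Overall assessment
Letting the paddlefleet fallback through in the config is a reasonable direction, but the reset of PaddleFleet's parallel_state is still incomplete. Please fix the global ranks synchronization in the TP=1 branch, and also check whether a stale TP group of the same size should be forced to reinitialize from the new HCG.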
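The same-size concern can be made concrete with a short sketch. It assumes the cached TP topology can be read as a list of global ranks; both helper names are hypothetical and neither is a PaddleFleet API.

```python
def size_only_is_stale(cached_ranks, expected_ranks):
    """Staleness check that looks only at group size (the weaker check)."""
    return cached_ranks is None or len(cached_ranks) != len(expected_ranks)


def membership_is_stale(cached_ranks, expected_ranks):
    """Staleness check that compares the actual rank membership."""
    return cached_ranks is None or list(cached_ranks) != list(expected_ranks)


# A cached group of the right size but with the wrong members.
cached = [0, 1]
expected = [2, 3]
print(size_only_is_stale(cached, expected))   # size matches, so not flagged
print(membership_is_stale(cached, expected))  # members differ, so flagged
```

A check based on size alone would keep the cached group in this case, which is why comparing membership (or always reinitializing from the new HCG) is the safer choice.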
parallel_state._TENSOR_MODEL_PARALLEL_GROUP = default_pg
parallel_state._TENSOR_MODEL_PARALLEL_GLOBAL_RANKS = tp_ranks
logger.info(f"Reinitialized TP group with size=1, rank={current_rank}, ranks={tp_ranks}")
parallel_state._TENSOR_MODEL_PARALLEL_GROUP = dist.new_group(ranks=[dist.get_rank()])
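🔴 Bug: The manual rebuild branch for TP=1 replaces only _TENSOR_MODEL_PARALLEL_GROUP and does not reset the matching global ranks state.
Just before this point, the function clears the HCG/parallel context so that PaddleFleet does not reuse the old topology. When expected_tp_size == 1, however, this branch only creates a new group; the logic that previously also wrote _TENSOR_MODEL_PARALLEL_GLOBAL_RANKS = [current_rank] has been removed. If the process still holds global ranks from an earlier initialization, the group and the global ranks in PaddleFleet's parallel_state become inconsistent, and later sharded-state or random-seed initialization that depends on the tensor-parallel rank or global ranks may still run against the old topology.
Suggested fix: in the TP=1 branch, also set global ranks that match the new group. For example, save current_rank = dist.get_rank() first, then write both:
parallel_state._TENSOR_MODEL_PARALLEL_GROUP = dist.new_group(ranks=[current_rank])
parallel_state._TENSOR_MODEL_PARALLEL_GLOBAL_RANKS = [current_rank]
If PaddleFleet provides a destroy/reset API, a more robust approach is to clear the TP-related parallel_state first and then initialize from the new HCG.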
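The inconsistency described in this comment can be reproduced without Paddle. In the sketch below, a `SimpleNamespace` stands in for the `parallel_state` module and `new_group` is a hypothetical replacement for `dist.new_group` that only records membership; the two attribute names are taken from the snippet under review, and everything else is assumed for illustration.

```python
from types import SimpleNamespace


def make_stale_state():
    # Stand-in for parallel_state left over from an earlier TP=2 init.
    return SimpleNamespace(
        _TENSOR_MODEL_PARALLEL_GROUP=("group", (0, 1)),
        _TENSOR_MODEL_PARALLEL_GLOBAL_RANKS=[0, 1],
    )


def new_group(ranks):
    # Hypothetical replacement for dist.new_group: records membership only.
    return ("group", tuple(ranks))


def reset_group_only(state, current_rank):
    # Mirrors the reviewed branch: the group is rebuilt, the ranks are not.
    state._TENSOR_MODEL_PARALLEL_GROUP = new_group([current_rank])
    return state


def reset_group_and_ranks(state, current_rank):
    # Mirrors the suggested fix: both fields are written together.
    state._TENSOR_MODEL_PARALLEL_GROUP = new_group([current_rank])
    state._TENSOR_MODEL_PARALLEL_GLOBAL_RANKS = [current_rank]
    return state


def is_consistent(state):
    # The group's members must equal the recorded global ranks.
    _, members = state._TENSOR_MODEL_PARALLEL_GROUP
    return list(members) == list(state._TENSOR_MODEL_PARALLEL_GLOBAL_RANKS)
```

Starting from the stale state, `reset_group_only` leaves a single-member group next to the old two-rank list, while `reset_group_and_ranks` keeps the two fields in agreement.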
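Motivation
Fixes generative-model validation and the reuse of Expert Parallel initialization state in the --model-impl paddlefleet fallback scenario, so that PaddleFleet is not initialized with a mismatched TP/EP parallel state.
Modifications
- fastdeploy/config.py: allows model_impl=paddlefleet to continue down the fallback path when runner_type=generate and the model is not marked as a generative model by FastDeploy's native registry.
- fastdeploy/model_executor/models/paddleformers/base_fleet.py: configures CPU initialization and skips parameter initialization for the PaddleFleet fallback, and resets the Paddle Fleet hybrid topology before fleet.init().
Usage or Command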
N/A
Accuracy Tests
N/A
Checklist
- Tag list: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run pre-commit before commit.
- For a PR to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.