[Models] fix fleet model fallback ep init #8039
Merged: gongshaotian merged 2 commits into PaddlePaddle:develop from xiaoguoguo626807:fleet_graph on Jun 18, 2026 (+22 −29)
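🔴 Bug: The hand-written rebuild branch for TP=1 only replaces `_TENSOR_MODEL_PARALLEL_GROUP`; it does not reset the corresponding global ranks state. Earlier in this function the HCG/parallel context is cleared so that PaddleFleet does not reuse the old topology. But when `expected_tp_size == 1`, this code only creates a new group, and the logic that previously also wrote `_TENSOR_MODEL_PARALLEL_GLOBAL_RANKS = [current_rank]` has been removed. If the process still holds global ranks left over from a previous initialization, PaddleFleet's parallel_state ends up with a group and global ranks that disagree, and any later sharded-state or random-seed initialization that depends on the tensor-parallel rank or global ranks may still run against the old topology.

Suggested fix: in the TP=1 branch, also set global ranks that match the new group. For example, first save `current_rank = dist.get_rank()`, then write the new group and the matching global ranks together. If PaddleFleet provides a destroy/reset API, it is safer to clear the TP-related parallel_state first and then initialize with the new HCG.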
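The snippet that originally accompanied the suggested fix did not survive on this page. The following is only a minimal sketch of the idea, not the code from the PR: `reset_tp1_state` is a hypothetical helper, and a plain `SimpleNamespace` stands in for PaddleFleet's parallel_state module so the example runs without a distributed setup. In real code, `current_rank` would presumably come from `dist.get_rank()` and `new_group` from a freshly created single-rank process group.

```python
from types import SimpleNamespace


def reset_tp1_state(parallel_state, new_group, current_rank):
    """Hypothetical helper: install a single-rank TP group and the
    matching global ranks in one step, so the two never disagree."""
    parallel_state._TENSOR_MODEL_PARALLEL_GROUP = new_group
    parallel_state._TENSOR_MODEL_PARALLEL_GLOBAL_RANKS = [current_rank]
    return parallel_state


# Stand-in for parallel_state holding values left over from an earlier
# initialization with a 4-way tensor-parallel topology (assumed values).
stale = SimpleNamespace(
    _TENSOR_MODEL_PARALLEL_GROUP="old-tp4-group",
    _TENSOR_MODEL_PARALLEL_GLOBAL_RANKS=[0, 1, 2, 3],
)

# Assumption: in the real branch these would be dist.get_rank() and a
# new process group containing only the current rank.
fresh = reset_tp1_state(stale, new_group="tp1-group-rank2", current_rank=2)
```

Writing both attributes in the same helper keeps the group and the global ranks from drifting apart if the TP=1 branch is edited again later.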