[Models] fix fleet model fallback ep init #8039

Merged
gongshaotian merged 2 commits into
PaddlePaddle:develop from
xiaoguoguo626807:fleet_graph
Jun 18, 2026
@xiaoguoguo626807 xiaoguoguo626807 commented Jun 11, 2026

Contributor

Motivation

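Fixes the generation-model validation and the reuse of Expert Parallel initialization state in the --model-impl paddlefleet fallback scenario, so that PaddleFleet initialization no longer uses a mismatched TP/EP parallel state.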

Modifications

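  • fastdeploy/config.py: allow the fallback path to continue when model_impl=paddlefleet and runner_type=generate, even if FastDeploy's native registry does not mark the model as a generation model.
  • fastdeploy/model_executor/models/paddleformers/base_fleet.py: configure CPU initialization and skip parameter initialization for the PaddleFleet fallback, and reset the Paddle Fleet hybrid topology before fleet.init().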
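
The relaxed check in fastdeploy/config.py can be pictured with a minimal sketch. The function name and its parameters are hypothetical and are not FastDeploy's actual code; only the values paddlefleet and generate come from the description above.

```python
def allow_generate_runner(model_impl: str, runner_type: str, natively_generative: bool) -> bool:
    """Hypothetical stand-in for the generation-model check described in this PR."""
    if runner_type != "generate":
        # The check only concerns the generate runner.
        return True
    if natively_generative:
        # The native registry already marks the model as a generation model.
        return True
    # Assumption: the paddlefleet fallback is allowed through even though the
    # native registry does not mark the model as generative.
    return model_impl == "paddlefleet"


fallback_ok = allow_generate_runner("paddlefleet", "generate", natively_generative=False)
other_ok = allow_generate_runner("some-other-impl", "generate", natively_generative=False)
```

In this sketch only the paddlefleet implementation bypasses the registry check; any other implementation is still rejected for a model that the registry does not mark as generative.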

Usage or Command

N/A

Accuracy Tests

N/A

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • [x ] Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

PaddlePaddle-bot

This comment was marked as outdated.

codecov-commenter commented Jun 11, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@f161fea). Learn more about missing BASE report.

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #8039   +/-   ##
==========================================
  Coverage           ?   67.50%           
==========================================
  Files              ?      475           
  Lines              ?    66661           
  Branches           ?    10284           
==========================================
  Hits               ?    45002           
  Misses             ?    18788           
  Partials           ?     2871           
Flag Coverage Δ
GPU 77.48% <100.00%> (?)
XPU 6.98% <0.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.


PaddlePaddle-bot commented Jun 11, 2026


🤖 Paddle-CI-Agent | ci_status_monitor | 2026-06-12 21:32:45

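This CI report is generated from the following code (updated every 30 minutes):
PR commit: f5a72c4 | Merge base: f161fea (branch: develop)


1 Required jobs: 9/10 passed

Total runs (reruns) | Total jobs | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped
42(0) | 42 | 38 | 4 | 0 | 0 | 0

Job | Error type | Confidence | Log
Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | PR issue | | Job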

2 Failure details

🔴 Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: PR issue (confidence: high)

Error type: PR issue | Confidence: high
Analyzer: generic analysis (fallback)
Failed cases:

Case | Error summary
tests/model_executor/fallback/test_fallback_fleet_model_coverge.py::TestInitPaddlefleetParallelState::test_seed_assertion_error_is_silenced | initialize_model_parallel trips PaddleFleet's assertion that the global memory buffer is already initialized

Key log:

tests/model_executor/fallback/test_fallback_fleet_model_coverge.py:1070: model._init_paddlefleet_parallel_state(fd_config)
fastdeploy/model_executor/models/paddleformers/base_fleet.py:422: parallel_state.initialize_model_parallel(hcg)
/usr/local/lib/python3.10/dist-packages/paddlefleet/parallel_state.py:426: assert _GLOBAL_MEMORY_BUFFER is None
AssertionError: global memory buffer is already initialized
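  • Root cause summary: the PR's new PaddleFleet initialization path does not reset the global buffer
    In fastdeploy/model_executor/models/paddleformers/base_fleet.py:420-422, the PR removed the previous TP=1 manual group creation and TP size mismatch branches, and now calls parallel_state.initialize_model_parallel(hcg) whenever _TENSOR_MODEL_PARALLEL_GROUP is None. The failing test sets _TENSOR_MODEL_PARALLEL_GROUP to None and is meant to verify that the AssertionError from model_parallel_cuda_manual_seed is swallowed at base_fleet.py:428-431. With the new logic, execution enters initialize_model_parallel first and is stopped by the assertion on PaddleFleet's already-existing _GLOBAL_MEMORY_BUFFER, so the seed branch never runs. The unit test failure is therefore caused directly by this PR's change to the initialization logic in base_fleet.py.

Suggested fixes:

  1. Before base_fleet.py calls parallel_state.initialize_model_parallel(hcg), use the official PaddleFleet/Paddle reset/destroy API to clear the existing parallel_state global state, covering at least _GLOBAL_MEMORY_BUFFER. If TP=1 does not need model parallel to be reinitialized, keep or restore the original TP=1 manual group creation path.
  2. If the new initialization semantics are intended, update the test case near tests/model_executor/fallback/test_fallback_fleet_model_coverge.py:1049 accordingly: when only the swallowing of the seed assertion is being verified, mock or reset the global state related to ps.initialize_model_parallel, and clean up assertions that still describe the old TP=1 branch.

Related change: fastdeploy/model_executor/models/paddleformers/base_fleet.py:399-422; the PR removes the old TP=1/mismatch branches and adds a fleet state reset followed by unified initialization.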

@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-06-16 17:38:00

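📋 Review summary

PR overview: fixes the generation-model validation and the reuse of PaddleFleet EP/TP initialization state in the --model-impl paddlefleet fallback.
Change scope: fastdeploy/config.py, fastdeploy/model_executor/models/paddleformers/base_fleet.py
Affected tags: [Models] [FDConfig]

Issues

Severity | File | Summary
🔴 Bug | fastdeploy/model_executor/models/paddleformers/base_fleet.py:431 | When the PaddleFleet TP group is rebuilt manually for TP=1, the global ranks are not reset with it, so stale topology state may still be reused

Status of previous findings

Finding | Issue | Status
F1 | A TP group with stale topology may still be kept | ⚠️ Still present

📝 PR convention check

The title and description structure have been completed as suggested last time; no convention issues that should block this fix were found. In the Checklist, - [x ] Add unit tests is not a valid checkbox; change it to a valid [ ] or [x] according to the actual test status.

Overall assessment

Letting the paddlefleet fallback pass config validation is a reasonable direction, but the reset of PaddleFleet's parallel_state is still incomplete. Please fix the global ranks synchronization in the TP=1 branch, and also check whether a stale TP group of the same size should be forcibly reinitialized with the new HCG.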

parallel_state._TENSOR_MODEL_PARALLEL_GROUP = default_pg
parallel_state._TENSOR_MODEL_PARALLEL_GLOBAL_RANKS = tp_ranks
logger.info(f"Reinitialized TP group with size=1, rank={current_rank}, ranks={tp_ranks}")
parallel_state._TENSOR_MODEL_PARALLEL_GROUP = dist.new_group(ranks=[dist.get_rank()])


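🔴 Bug: the manual rebuild branch for TP=1 only replaces _TENSOR_MODEL_PARALLEL_GROUP and does not reset the matching global ranks state.

Earlier in this function the HCG/parallel context is set to None, with the goal of preventing PaddleFleet from reusing an old topology. When expected_tp_size == 1, however, this code only creates a new group; the previous logic that also wrote _TENSOR_MODEL_PARALLEL_GLOBAL_RANKS = [current_rank] has been deleted. If the process still holds global ranks from an earlier initialization, the group and the global ranks in PaddleFleet's parallel_state become inconsistent, and later sharded state or random seed initialization that depends on the tensor-parallel rank or global ranks may still run against the old topology.

Suggested fix: in the TP=1 branch, also set global ranks that match the new group. For example, first save current_rank = dist.get_rank(), then write both: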

parallel_state._TENSOR_MODEL_PARALLEL_GROUP = dist.new_group(ranks=[current_rank])
parallel_state._TENSOR_MODEL_PARALLEL_GLOBAL_RANKS = [current_rank]

If PaddleFleet provides a destroy/reset API, it is safer to clear the TP-related parallel_state first and then initialize with the new HCG.
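
The inconsistency described in this comment can be reproduced with plain stand-in objects. In the sketch below the group objects are fakes carrying a ranks attribute, and both helper functions are hypothetical; only the two parallel_state attribute names are taken from the review.

```python
from types import SimpleNamespace


def reset_tp1_state(ps, current_rank, new_group):
    """Write the TP group and its global ranks together so they cannot drift apart."""
    ps._TENSOR_MODEL_PARALLEL_GROUP = new_group
    ps._TENSOR_MODEL_PARALLEL_GLOBAL_RANKS = [current_rank]


def tp_state_is_consistent(ps):
    """True when the group's ranks match the recorded global ranks."""
    group = ps._TENSOR_MODEL_PARALLEL_GROUP
    return group is not None and list(group.ranks) == list(ps._TENSOR_MODEL_PARALLEL_GLOBAL_RANKS)


# Stale state left by an earlier 2-way tensor-parallel initialization (illustrative values).
state = SimpleNamespace(
    _TENSOR_MODEL_PARALLEL_GROUP=SimpleNamespace(ranks=[0, 1]),
    _TENSOR_MODEL_PARALLEL_GLOBAL_RANKS=[0, 1],
)

# Replacing only the group, as the reviewed branch does, leaves the old global ranks behind.
state._TENSOR_MODEL_PARALLEL_GROUP = SimpleNamespace(ranks=[1])
consistent_after_group_only = tp_state_is_consistent(state)

# Writing both fields together restores the invariant.
reset_tp1_state(state, current_rank=1, new_group=SimpleNamespace(ranks=[1]))
consistent_after_full_reset = tp_state_is_consistent(state)
```

A check like tp_state_is_consistent could also serve as a unit test assertion for the TP=1 branch, assuming the real group object exposes its member ranks.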

@gongshaotian gongshaotian left a comment

Collaborator


LGTM

@gongshaotian gongshaotian merged commit fbf3f4e into PaddlePaddle:develop Jun 18, 2026
40 of 43 checks passed
@xiaoguoguo626807 xiaoguoguo626807 deleted the fleet_graph branch June 18, 2026 06:32
4 participants