[Cherry-Pick][RL] Support cpu tensor broadcast(#7833) by Sunny-bot1 · Pull Request #7840 · PaddlePaddle/FastDeploy

Sunny-bot1 · 2026-05-18T02:51:23Z

Motivation

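When Paddle creates a communication group, the default backend is NCCL. With that backend, paddle.distributed.broadcast cannot broadcast a CPU tensor, and paddle.distributed.broadcast_object_list still launches GPU kernels and introduces a synchronous DtoH copy.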

Modifications

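When only a CPU tensor needs to be broadcast, the group can be created with the gloo backend:

import paddle
import paddle.distributed as dist

group = dist.new_group(list(range(world_size)), backend="gloo")
paddle.distributed.broadcast(signal_tensor, src=0, group=group)

gloo is a pure CPU, socket-based implementation that bypasses NCCL and the GPU entirely, so nsys shows no CUDA kernels and no DtoH/HtoD copies.
Cost: gloo has lower bandwidth and higher latency than NCCL, but a single signal value is tiny, so the practical impact is negligible.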
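The pattern above could be wrapped in two small helpers. This is a minimal sketch, not the code merged in this PR: the names `make_signal_group` and `broadcast_signal` are hypothetical, the int32 dtype follows the review summary further down, and it assumes the process was started by a distributed launcher with the parallel environment already initialized.

```python
def make_signal_group(world_size):
    """Create a gloo-backed group over all ranks for CPU-only collectives."""
    # Imported lazily so this module can be loaded on machines without Paddle.
    import paddle.distributed as dist

    return dist.new_group(list(range(world_size)), backend="gloo")


def broadcast_signal(value, group, src=0):
    """Broadcast one integer from rank `src` to every rank in `group`.

    Every rank calls this function; non-source ranks may pass any
    placeholder value, which the broadcast overwrites in place.
    """
    import paddle
    import paddle.distributed as dist

    # Pin the tensor to CPU so no GPU kernel or DtoH copy is involved.
    signal = paddle.to_tensor([value], dtype="int32", place=paddle.CPUPlace())
    dist.broadcast(signal, src=src, group=group)
    return int(signal.item())
```

Creating the group once and reusing it for every signal keeps the socket setup cost out of the hot loop.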

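When called without arguments, paddle.distributed.shutdown_process_group() iterates over all process groups and calls .shutdown() on each, but ProcessGroupGloo does not implement this method, which raises an AttributeError.

Solution: before calling the global shutdown_process_group(), iterate over Paddle's global group registry and delete every entry whose group has no shutdown method (that is, gloo and other CPU backends). The later sweep then skips those groups and no AttributeError is raised.

The gloo group itself needs no explicit shutdown; it is cleaned up automatically when the process exits.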
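The pruning step can be illustrated on a plain dictionary. The sketch below is an assumption-laden stand-in: `drop_groups_without_shutdown` and the `Fake*` classes are hypothetical and are not Paddle types. It only assumes, as the review comment in this thread does, that each registry entry exposes its backend object through a `process_group` attribute.

```python
def drop_groups_without_shutdown(group_map):
    """Remove entries whose backend object has no shutdown() method.

    Returns the list of names that were removed.
    """
    removed = []
    # Iterate over a snapshot so the dict can be mutated during the loop.
    for name, group in list(group_map.items()):
        backend = getattr(group, "process_group", None)
        if backend is not None and not hasattr(backend, "shutdown"):
            group_map.pop(name, None)
            removed.append(name)
    return removed


# Illustrative stand-ins only; they mimic the shape of the registry entries.
class FakeNcclBackend:
    def shutdown(self):
        pass


class FakeGlooBackend:
    pass  # deliberately has no shutdown()


class FakeGroup:
    def __init__(self, backend):
        self.process_group = backend


registry = {
    "nccl_0": FakeGroup(FakeNcclBackend()),
    "gloo_0": FakeGroup(FakeGlooBackend()),
}
removed = drop_groups_without_shutdown(registry)
print(removed, sorted(registry))  # ['gloo_0'] ['nccl_0']
```

Checking for the method rather than the backend name means any other backend that lacks `shutdown()` is skipped as well.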

Usage or Command

Accuracy Tests

Checklist

Add at least a tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot · 2026-05-18T02:51:29Z

Thanks for your contribution!

PaddlePaddle-bot · 2026-05-18T03:10:50Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-18 14:16:51

This CI report was generated from the following code (refreshed every 30 minutes):

PR commit: c075965
Merge base: d71bdda (branch: release/2.6)
View full diff
CI details

1 Task overview

1 required task failed and must be fixed before the PR can be merged.

Total runs (reruns)	Total tasks	✅ Passed	❌ Failed	⏳ Running	⏸️ Waiting	Skipped
37(0)	37	32	4	0	1	0

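2 Task status summary

2.1 Required tasks: 9/10 passed

Required tasks block merging; failures must be handled first.

Status	Task	Duration	Root cause	Suggested fix	Log	Rerun
❌	`Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage`	1h12m	PR issue: coverage of newly added code is 25%, below the 80% threshold	Add unit tests for the new gloo broadcast code or request an exemption	Job	-
✅	The other 9 required tasks passed	-	-	-	-	-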

2.2 Optional tasks: 23/27 passed

Optional tasks do not block merging; failures are for reference only.

Status	Task	Duration	Log	Rerun
❌	`Run iluvatar Tests / run_iluvatar_cases`	11m34s	Job	-
❌	`Check PR Template`	11s	Job	-
❌	`Trigger Jenkins for PR`	1m1s	Job	-
⏸️	`CI_HPU`	-	-	-
✅	The other 23 optional tasks passed	-	-	-

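3 Failure details (required only)

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: coverage below threshold (confidence: high)

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage

Status: ❌ Failed
Error type: coverage below threshold
Confidence: high
Root cause summary: coverage of the code added by this PR is 25%, below the 80% threshold
Analyzer: ci_analyze_unittest_fastdeploy

Coverage details:

File	Coverage	Uncovered lines
`fastdeploy/worker/worker_process.py`	37.5%	L323, L326, L327, L328, L508
`fastdeploy/rl/dynamic_weight_manager.py`	0.0%	L353, L355, L356, L357

Root cause details:
This PR adds a CPU tensor broadcast over a gloo communication group (the _broadcast_model_weights_signal method in worker_process.py, L323-328 and L508) and gloo group cleanup logic (dynamic_weight_manager.py, L353-357), but none of this new code is exercised by the existing unit tests. Diff coverage is only 25% (9 of the 12 added lines are uncovered), below the 80% threshold, so CI exited with code 9.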
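The 25% figure follows directly from the line counts in the report. The sketch below recomputes it from the summary line in the CI log; the helper `patch_coverage` is hypothetical, and the formula (covered added lines divided by total added lines) is an assumption that happens to match the reported numbers.

```python
import json
import math

# Summary line copied from the CI log for this PR.
summary = json.loads(
    '{"total_num_lines": 12, "total_num_violations": 9, "total_percent_covered": 25}'
)

THRESHOLD = 80.0  # required patch coverage, in percent


def patch_coverage(total_lines, uncovered_lines):
    """Percentage of added lines that the tests executed."""
    if total_lines == 0:
        return 100.0  # assumption: an empty patch counts as fully covered
    return 100.0 * (total_lines - uncovered_lines) / total_lines


pct = patch_coverage(summary["total_num_lines"], summary["total_num_violations"])
# Smallest number of covered lines that reaches the threshold.
lines_needed = math.ceil(THRESHOLD * summary["total_num_lines"] / 100.0)

print(pct, pct >= THRESHOLD, lines_needed)  # 25.0 False 10
```

With 3 of 12 lines covered today, at least 7 of the 9 uncovered lines would need tests for the check to pass without an exemption.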

Key logs:

COVERAGE_EXIT_CODE: 9
GPU Patch Coverage Details:
{"total_num_lines": 12, "total_num_violations": 9, "total_percent_covered": 25}
  "fastdeploy/worker/worker_process.py": {"percent_covered": 37.5, "violation_lines": [323, 326, 327, 328, 508]}
  "fastdeploy/rl/dynamic_weight_manager.py": {"percent_covered": 0.0, "violation_lines": [353, 355, 356, 357]}

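Suggested fixes:

Add unit tests for the _broadcast_model_weights_signal method in fastdeploy/worker/worker_process.py (L323-328) that cover the CPU tensor broadcast logic
Add unit tests for the gloo group cleanup code in fastdeploy/rl/dynamic_weight_manager.py (L353-357)
If this logic depends on a multi-process distributed environment and cannot be covered by unit tests, explain why in the PR description and request a coverage exemption

Suggested fix summary: add unit tests for the new gloo broadcast code or request an exemption

Related changes: fastdeploy/worker/worker_process.py L177-180, L323-328, L508; fastdeploy/rl/dynamic_weight_manager.py L353-357
Link: view logs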

PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-05-18 12:15:39

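📋 Review summary

PR overview: replaces the broadcast_object_list call used for the RL dynamic weight broadcast signal (which triggers GPU kernels and a DtoH copy) with a pure CPU tensor broadcast over the gloo backend, removing unnecessary CUDA synchronization overhead.

Scope of changes: fastdeploy/rl/dynamic_weight_manager.py, fastdeploy/worker/worker_process.py

Impact tag: [RL]

Issues

Severity	File	Summary
🟡 Suggestion	`fastdeploy/rl/dynamic_weight_manager.py:355`	Uses Paddle's private internal API `_get_group_map_by_name`, which carries a version compatibility risk; `pg.process_group` is accessed without a `hasattr` guard

CI note: the Check PR Template workflow failed (exit code 7) because sections such as Modifications are empty; it should pass automatically once the PR description is fixed.

📝 PR convention check

The Modifications, Usage or Command, and Accuracy Tests sections are all empty (only the template comments remain) and no Checklist item is ticked, which causes the Check PR Template CI job to fail. The title format [Cherry-Pick][RL] Support cpu tensor broadcast(#7833) follows the cherry-pick convention ✓

Suggested PR description (can be copied directly):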

## Motivation
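When Paddle creates a communication group, the default backend is NCCL. With that backend, `paddle.distributed.broadcast` cannot broadcast a CPU tensor, and `paddle.distributed.broadcast_object_list` still launches GPU kernels and introduces a synchronous DtoH copy. When only a CPU signal needs to be broadcast, a separate group can be created with the gloo backend, which bypasses NCCL and the GPU entirely, so nsys shows no CUDA kernels and no DtoH/HtoD copies. gloo has lower bandwidth and higher latency than NCCL, but a single signal value is tiny, so the practical impact is negligible.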

## Modifications
- `fastdeploy/worker/worker_process.py`:
  - In `__init__`, create a gloo backend process group `self.gloo_group` when ranks > 1
  - `_broadcast_model_weights_signal` now uses a CPU int32 tensor with `paddle.distributed.broadcast` over the gloo group, replacing `broadcast_object_list`, which triggered GPU kernels and a DtoH copy
  - The two call sites in `event_loop_normal` now pass `self.gloo_group`
- `fastdeploy/rl/dynamic_weight_manager.py`:
  - In `clear_parameters`, before the global `shutdown_process_group()`, remove Gloo ProcessGroups that lack a `shutdown()` method from Paddle's internal registry to avoid an AttributeError

## Usage or Command
N/A

## Accuracy Tests
N/A (this change only replaces the signal broadcast implementation and does not affect model weights or accuracy)

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

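Overall assessment

The approach is clear: replacing the NCCL broadcast_object_list with a CPU broadcast over the gloo backend is a reasonable way to avoid GPU synchronization overhead. The main concern is the direct manipulation of Paddle's private internal API in clear_parameters; adding defensive guards is recommended to improve stability across versions. The PR description also needs to be completed so that the CI template check passes.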

PaddlePaddle-bot · 2026-05-18T04:21:18Z

+            # before the global sweep to avoid AttributeError.
+            from paddle.distributed.collective import _get_group_map_by_name
+
+            for name, pg in list(_get_group_map_by_name().items()):


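🟡 Suggestion: uses Paddle's private internal API _get_group_map_by_name, which carries a version compatibility risk

_get_group_map_by_name starts with an underscore, so it is an internal implementation detail of Paddle and falls outside the public API guarantees. A Paddle upgrade may rename it, remove it, or change its return format, leading to an ImportError or to the logic failing silently.

The next line accesses pg.process_group directly without first checking hasattr(pg, 'process_group'). If the structure of Paddle's internal Group object changes, this raises an AttributeError (so the code described by the comment "to avoid AttributeError" carries its own AttributeError risk).

Adding defensive guards is recommended:

try:
    from paddle.distributed.collective import _get_group_map_by_name
    for name, pg in list(_get_group_map_by_name().items()):
        proc_group = getattr(pg, 'process_group', None)
        if proc_group is not None and not hasattr(proc_group, "shutdown"):
            _get_group_map_by_name().pop(name, None)
except (ImportError, AttributeError):
    pass  # paddle version without gloo registry; safe to skip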

codecov-commenter · 2026-05-18T05:27:13Z

Codecov Report

❌ Patch coverage is 25.00000% with 9 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (release/2.6@d71bdda). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
fastdeploy/worker/worker_process.py	37.50%	5 Missing ⚠️
fastdeploy/rl/dynamic_weight_manager.py	0.00%	4 Missing ⚠️

Additional details and impacted files

@@              Coverage Diff               @@
##             release/2.6    #7840   +/-   ##
==============================================
  Coverage               ?   72.45%           
==============================================
  Files                  ?      381           
  Lines                  ?    54155           
  Branches               ?     8460           
==============================================
  Hits                   ?    39236           
  Misses                 ?    12161           
  Partials               ?     2758

Flag	Coverage Δ
GPU	`72.45% <25.00%> (?)`


Sunny-bot1 added 3 commits May 18, 2026 10:49

support cpu tensor broadcast

890cb3d

fix place

37ac100

fix group

40a5a7c

Sunny-bot1 had a problem deploying to Metax_ci May 18, 2026 02:51 — with GitHub Actions Failure

fix init

d6f1d34

Sunny-bot1 had a problem deploying to Metax_ci May 18, 2026 03:03 — with GitHub Actions Failure

fix shutdown process group

c075965

Sunny-bot1 had a problem deploying to Metax_ci May 18, 2026 04:05 — with GitHub Actions Failure

PaddlePaddle-bot reviewed May 18, 2026

Deleter-D approved these changes May 18, 2026

Jiang-Jia-Jun merged commit 9894b32 into PaddlePaddle:release/2.6 May 18, 2026
33 of 38 checks passed

Jiang-Jia-Jun merged 5 commits into PaddlePaddle:release/2.6 from Sunny-bot1:broadcast_cpu_26
