[RL] enforce cuda graph clearing and rebuilding in rsync weight update by rdma #8085
liyonghua0910 wants to merge 2 commits into
PaddlePaddle-bot left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-06-29 15:19:09
📋 Review Summary
PR overview: Adjusts how CUDA Graph, the KV cache, and the input buffers are rebuilt before and after an rsync/RDMA weight update.
Scope of changes: fastdeploy/worker/gpu_model_runner.py
Impact tags: [RL] [Graph Optimization] [KVCache]
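The ordering described in the PR overview can be pictured with a small toy model. This is only a sketch: `ToyRunner` and all of its methods are hypothetical stand-ins, not the FastDeploy `gpu_model_runner.py` API, and the "graph" is reduced to a version number so that a stale capture is easy to spot.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyRunner:
    """Hypothetical stand-in for a model runner; not the FastDeploy class."""

    use_cudagraph: bool = True
    weights_version: int = 0
    # Version of the weights the captured graph was built against (None = no graph).
    graph_version: Optional[int] = 0
    steps: List[str] = field(default_factory=list)

    def clear_graph(self) -> None:
        self.graph_version = None
        self.steps.append("clear_graph")

    def load_weights(self, version: int) -> None:
        self.weights_version = version
        self.steps.append("load_weights")

    def reset_inputs(self) -> None:
        self.steps.append("reset_inputs")

    def capture_graph(self) -> None:
        # Recapture against the weights that are loaded right now.
        self.graph_version = self.weights_version
        self.steps.append("capture_graph")

    def update_weights(self, version: int) -> None:
        # Clear before the update, rebuild after it, so no old graph survives.
        if self.use_cudagraph:
            self.clear_graph()
        self.load_weights(version)
        self.reset_inputs()
        if self.use_cudagraph:
            self.capture_graph()


runner = ToyRunner()
runner.update_weights(version=1)
print(runner.steps)
```

After the call, `graph_version` matches `weights_version`, which is the invariant the PR is trying to enforce.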
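Issues
| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | fastdeploy/worker/gpu_model_runner.py:3313 | In non-MTP scenarios, the GDR release cache rebuild accesses a `self.proposer` that is None |
📝 PR Convention Check
The title follows the convention. The PR description contains the template sections, but Motivation, Modifications, Usage or Command, and Accuracy Tests are still empty or placeholder content; please replace them with a complete, readable description.
Suggested PR description (can be copied as-is)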
## Motivation
Fix CUDA Graph not being cleared and rebuilt when the weights change during an rsync/RDMA weight update, so that stale graph state is not reused after the update.
## Modifications
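Extract the memory-handling logic that runs before and after a weight update in `fastdeploy/worker/gpu_model_runner.py`:
- The GDR release cache path clears and rebuilds the KV cache and CUDA Graph around the weight update.
- The RDMA weight update path clears CUDA Graph before the update and rebuilds it afterwards.
- After the weight update, `share_inputs`, the MTP model inputs, and the cached model/sampler output state are reset.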
## Usage or Command
N/A
## Accuracy Tests
N/A
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
Overall assessment
The current implementation contains a None-object access that blocks weight updates on the GDR release cache path. The MTP condition guard needs to be restored before merging.
    # Rebuild cache on model runner
    if not self.enable_cache_manager_v1:
        self.proposer.initialize_kv_cache(main_model_num_blocks=self.num_gpu_blocks)
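🔴 Bug: In non-MTP scenarios, this code unconditionally accesses `self.proposer`, which is None.
`_init_speculative_proposer()` sets `self.proposer` to None when `spec_method is None`, but this rebuild logic is gated only by `rebuild_kv_cache` and `enable_cache_manager_v1`. Whenever GDR weight update is used with `gdr_release_cache` enabled, an ordinary non-speculative-decoding model with cache manager v1 disabled raises AttributeError during the rebuild phase. The weight update then fails and the KV cache cannot be restored to a state that can serve requests.
Suggested fix: keep the MTP condition from the old logic and initialize the proposer cache only when `self.spec_method == SpecMethod.MTP`; the main model cache is still initialized independently.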
    if self.spec_method == SpecMethod.MTP:
        if not self.enable_cache_manager_v1:
            self.proposer.initialize_kv_cache(main_model_num_blocks=self.num_gpu_blocks)
    self.initialize_kv_cache()
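To make the failure mode concrete, here is a minimal, self-contained reproduction. It is a sketch under stated assumptions: `ToyRunner`, `ToyProposer`, and this `SpecMethod` enum are hypothetical stand-ins that only mirror the attribute names quoted in the review, and the main-model cache call is assumed to sit outside both conditions, as the suggested fix implies.

```python
from enum import Enum
from typing import Optional


class SpecMethod(Enum):
    """Hypothetical mirror of the enum named in the review."""
    MTP = "mtp"


class ToyProposer:
    """Hypothetical draft-model proposer that owns its own KV cache."""

    def __init__(self) -> None:
        self.kv_blocks: Optional[int] = None

    def initialize_kv_cache(self, main_model_num_blocks: int) -> None:
        self.kv_blocks = main_model_num_blocks


class ToyRunner:
    """Hypothetical runner; only the fields relevant to the bug are modeled."""

    def __init__(self, spec_method: Optional[SpecMethod],
                 enable_cache_manager_v1: bool = False,
                 num_gpu_blocks: int = 8) -> None:
        self.spec_method = spec_method
        self.enable_cache_manager_v1 = enable_cache_manager_v1
        self.num_gpu_blocks = num_gpu_blocks
        # As described in the review: no speculative method means no proposer.
        self.proposer = ToyProposer() if spec_method is SpecMethod.MTP else None
        self.main_kv_blocks: Optional[int] = None

    def initialize_kv_cache(self) -> None:
        self.main_kv_blocks = self.num_gpu_blocks

    def rebuild_unguarded(self) -> None:
        # Shape of the code under review: proposer touched without an MTP check.
        if not self.enable_cache_manager_v1:
            self.proposer.initialize_kv_cache(main_model_num_blocks=self.num_gpu_blocks)
        self.initialize_kv_cache()

    def rebuild_guarded(self) -> None:
        # Shape of the suggested fix: proposer cache only for MTP.
        if self.spec_method == SpecMethod.MTP:
            if not self.enable_cache_manager_v1:
                self.proposer.initialize_kv_cache(main_model_num_blocks=self.num_gpu_blocks)
        self.initialize_kv_cache()


plain = ToyRunner(spec_method=None)
try:
    plain.rebuild_unguarded()
except AttributeError as exc:
    print("unguarded rebuild failed:", exc)

plain.rebuild_guarded()
print("guarded rebuild ok, main cache blocks:", plain.main_kv_blocks)
```

In the unguarded variant the exception is raised before the main cache is rebuilt, which matches the review's point that the KV cache is left unrestored after the failed update.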
Codecov Report
❌ Patch coverage is
Additional details and impacted files

    @@            Coverage Diff            @@
    ##             develop    #8085   +/-   ##
    ==========================================
      Coverage           ?   67.55%
    ==========================================
      Files              ?      475
      Lines              ?    66913
      Branches           ?    10323
    ==========================================
      Hits               ?    45202
      Misses             ?    18836
      Partials           ?     2875