[RL] Reuse GDR checkpoint transfer handle #8078

Open
jackyYang6 wants to merge 2 commits into PaddlePaddle:develop from jackyYang6:jacky/optimize-checkpoint-transfer-handle-init-develop

Conversation

jackyYang6 (Contributor) commented Jun 25, 2026

Motivation

Avoid repeated CheckpointTransfer initialization during GDR dynamic weight updates. Reusing the initialized handle reduces repeated setup overhead across multiple update steps.

Modifications

  • Cache the GDR CheckpointTransfer handle in DynamicWeightManager.
  • Lazily initialize the handle on the first GDR weight update.
  • Reuse the cached handle for later update_weights_by_gdr calls.
  • Destroy and reset the cached handle when an update fails.
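
The lifecycle in the list above can be sketched as follows. This is a minimal illustration, not the code in this PR: `FakeCheckpointTransfer`, its `receive` method, the `handle_factory` parameter, and `_get_gdr_handle` are hypothetical names introduced here; only `update_weights_by_gdr`, `_destroy_gdr_handle`, and `cleanup()` appear in the PR discussion.

```python
class FakeCheckpointTransfer:
    """Hypothetical stand-in for the real transfer handle; counts how often it is set up."""

    instances = 0

    def __init__(self):
        FakeCheckpointTransfer.instances += 1
        self.cleaned_up = False

    def receive(self, fail=False):
        # Hypothetical transfer call; the real API may look different.
        if fail:
            raise RuntimeError("transfer failed")
        return "ok"

    def cleanup(self):
        self.cleaned_up = True


class DynamicWeightManagerSketch:
    """Caches one handle, creates it lazily, and drops it when an update fails."""

    def __init__(self, handle_factory):
        self._handle_factory = handle_factory
        self._gdr_handle = None  # nothing is created until the first update

    def _get_gdr_handle(self):
        # Lazy initialization: pay the setup cost only on first use.
        if self._gdr_handle is None:
            self._gdr_handle = self._handle_factory()
        return self._gdr_handle

    def _destroy_gdr_handle(self):
        # Reset the cache first so a failing cleanup cannot leave a stale handle behind.
        handle, self._gdr_handle = self._gdr_handle, None
        if handle is not None:
            handle.cleanup()

    def update_weights_by_gdr(self, fail=False):
        handle = self._get_gdr_handle()
        try:
            return handle.receive(fail=fail)
        except Exception:
            # A failed update may leave the handle in an unknown state,
            # so discard it; the next call builds a fresh one.
            self._destroy_gdr_handle()
            raise
```

With this shape, two consecutive successful updates construct the handle once, and an update that raises forces the next call to construct a new handle.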

Usage or Command

No new user-facing command. Existing GDR weight update flow is unchanged.

Accuracy Tests

Not applicable. This PR only changes checkpoint-transfer handle initialization behavior and does not affect model outputs.

Checklist

  • Add at least one tag in the PR title.
  • Format your code, run pre-commit before commit.
  • Add unit tests. No unit tests added because this is a handle lifecycle optimization for GDR runtime behavior.
  • Provide accuracy results. Not applicable; no model output changes.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

PaddlePaddle-bot

This comment was marked as outdated.

@jackyYang6 jackyYang6 force-pushed the jacky/optimize-checkpoint-transfer-handle-init-develop branch from ee3f166 to b69ad2a Compare June 25, 2026 11:47

codecov-commenter commented Jun 25, 2026

Codecov Report

❌ Patch coverage is 86.95652% with 3 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@6d9a8f4). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| fastdeploy/rl/dynamic_weight_manager.py | 86.95% | 2 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #8078   +/-   ##
==========================================
  Coverage           ?   67.52%           
==========================================
  Files              ?      475           
  Lines              ?    66907           
  Branches           ?    10317           
==========================================
  Hits               ?    45182           
  Misses             ?    18857           
  Partials           ?     2868           

| Flag | Coverage Δ |
| --- | --- |
| GPU | 77.55% <86.95%> (?) |
| XPU | 6.95% <0.00%> (?) |
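
The figures in the report above can be cross-checked with a little arithmetic. The sketch below assumes that coverage is hits divided by hits + misses + partials and that the percentage is truncated (not rounded) to two decimals, which is consistent with the 67.52% shown. The patch size of 23 lines is an inference from "86.95652% with 3 lines missing", not a number stated in the report.

```python
from decimal import Decimal, ROUND_DOWN


def coverage_percent(hits, misses, partials):
    """Return coverage as a percentage truncated to two decimal places."""
    total = hits + misses + partials
    ratio = Decimal(hits) * 100 / Decimal(total)
    return ratio.quantize(Decimal("0.01"), rounding=ROUND_DOWN)


# Project totals taken from the coverage diff: the three buckets add up to Lines.
assert 45182 + 18857 + 2868 == 66907
project = coverage_percent(45182, 18857, 2868)

# Patch: 2 missing + 1 partial; 20 hits is an assumption that yields 23 lines in total.
patch = coverage_percent(20, 2, 1)
```

Under these assumptions `project` reproduces the 67.52% project figure and `patch` reproduces the 86.95% patch figure.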


PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-06-26 11:04:01

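📋 Review Summary

PR overview: Cache the GDR CheckpointTransfer handle so that dynamic weight updates do not re-initialize the transfer handle each time.
Scope of changes: fastdeploy/rl/dynamic_weight_manager.py, tests/rl/test_dynamic_weight_gdr.py
Impact tag: [RL]

Issues

No new blocking issues found. PR convention issues are reported in the section below and are not repeated here.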

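Status of Previous Findings

| Finding | Issue | Status |
| --- | --- | --- |
| F1 | _destroy_gdr_handle() swallows exceptions raised by cleanup() without logging anything. | ⚠️ Still present |
| F2 | The cached GDR CheckpointTransfer is not released on the sleep/clear weights paths. | ⚠️ Still present |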
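
One possible way to close F1 and F2 is sketched below, purely as an illustration of the two findings and not as the fix the author will adopt. `HandleOwnerSketch`, `BrokenHandle`, the logger name, and the `clear_parameters` method are hypothetical; the real clear/sleep entry points in the runner may be named and wired differently.

```python
import logging

logger = logging.getLogger("gdr_handle_sketch")  # hypothetical logger name


class BrokenHandle:
    """Hypothetical handle whose cleanup always fails, to exercise the error path."""

    def cleanup(self):
        raise RuntimeError("cleanup failed")


class HandleOwnerSketch:
    def __init__(self):
        self._gdr_handle = None

    def _destroy_gdr_handle(self):
        # Drop the cached reference first so a failing cleanup cannot be retried
        # on the same broken handle later.
        handle, self._gdr_handle = self._gdr_handle, None
        if handle is None:
            return
        try:
            handle.cleanup()
        except Exception:
            # F1: keep swallowing the error, but leave a traceback in the log.
            logger.exception("GDR handle cleanup failed; cached handle was dropped")

    def clear_parameters(self):
        # F2: hypothetical clear/sleep path that also releases the cached handle,
        # so transfer resources are not held while weights are unloaded.
        self._destroy_gdr_handle()
```

Logging with `logger.exception` keeps the current non-raising behavior while making a failed cleanup visible, and calling the destroy helper from the clear/sleep path ties the handle's lifetime to the lifetime of the loaded weights.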

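📝 PR Convention Check

Compliant. The title uses the official [RL] tag, and the PR description contains the Motivation, Modifications, Usage or Command, Accuracy Tests, and Checklist sections required by checklist §D2.

Overall Assessment

This round traced, in order of risk, GDR handle creation, reuse, and cleanup on failure, the runner's update/clear/sleep call chain, and the newly added unit tests. Apart from the unresolved previous findings, no new issues that need inline comments were found.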

PaddlePaddle-bot commented Jun 26, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-06-29 03:21:57 UTC+08:00

This CI report is generated from the following code (refreshed every 30 minutes):
PR commit: 9238c11 | Merge base: 6d9a8f4 (branch: develop)


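1 Required jobs: 8/10 passed

| Total runs (reruns) | Total jobs | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Pending | Skipped |
| --- | --- | --- | --- | --- | --- | --- |
| 42 (0) | 42 | 39 | 3 | 0 | 0 | 0 |

| Job | Error type | Confidence | Log |
| --- | --- | --- | --- |
| xpu_8cards_case_test / run_xpu_8cards_cases | Flaky failure | Medium | Job |
| Approval | Approval required | High | Job |

2 Failure details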

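🔴 xpu_8cards_case_test / run_xpu_8cards_cases: flaky failure (confidence: medium)

Error type: flaky failure | Confidence: medium
Analyzer: ci_analyze_unittest_fastdeploy
Failed cases: (merged by root-cause cluster)

| Case | Error summary |
| --- | --- |
| tests/xpu_ci/8cards_cases/test_pd_21b_ep4tp1.py::test_pd_separation | After the PD service health check passed, the model reply to the first EP4TP1 request was empty, which triggered the keyword assertion failure |

Key logs: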

服务健康检查中... P节点状态码:200,D节点状态码:200
PD分离服务启动成功!耗时 10 秒
模型回复:
PD分离测试失败: 响应内容不符合预期:
assert False
FAILED tests/xpu_ci/8cards_cases/test_pd_21b_ep4tp1.py::test_pd_separation
=================== 1 failed, 3 passed in 404.94s (0:06:44) ====================
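  • Root cause summary: intermittent empty reply in XPU PD-separation inference
    The failure occurs in the response content check of test_pd_21b_ep4tp1.py::test_pd_separation. The health checks for both the P and D services returned 200, but response.choices[0].message.content was empty, so the keyword assertion failed. The 3 later PD-separation cases in the same job all produced normal replies and passed. The logs contain no occurrence of FD_USE_GDR_CHECKPOINT_TRANSFER, CheckpointTransfer, or update_weights_by_gdr, and a code search found no XPU runner that references the DynamicWeightManager path modified by this PR. The current evidence does not support the PR change directly causing this failure; it looks more like an intermittent empty reply or runtime instability in XPU PD serving.

Suggested fixes:

  1. First rerun xpu_8cards_case_test / run_xpu_8cards_cases to check whether it recovers.
  2. If the failure reproduces, also retain log_router, log_prefill, and log_decode, and focus on why the first EP4TP1 request returned empty text.

Related changes: no direct link found. The PR only modifies fastdeploy/rl/dynamic_weight_manager.py and tests/rl/test_dynamic_weight_gdr.py, and the failing case is not part of this diff.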

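🔴 Approval: approval required (confidence: high)

Error type: approval required | Confidence: high
Analyzer: builtin approval_required
Failed cases: none

| Case | Error summary |
| --- | --- |
| Approval | Workflow is waiting for manual approval |

Key logs: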

[FAILURE]: Process completed with exit code 6.
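  • Root cause summary: workflow is waiting for manual approval
    This job requires manual approval; CI continues only after the approval is granted.

Suggested fixes:

  1. Have the workflow manually approved so that it can continue running.

Related changes: unrelated to the code changes.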
