Skip to content

[RL] Allow configuring update weights control timeout#8073

Merged
liyonghua0910 merged 1 commit into
PaddlePaddle:developfrom
jackyYang6:jacky/update-weights-timeout
Jun 24, 2026
Merged

[RL] Allow configuring update weights control timeout#8073
liyonghua0910 merged 1 commit into
PaddlePaddle:developfrom
jackyYang6:jacky/update-weights-timeout

Conversation

@jackyYang6

Copy link
Copy Markdown
Contributor

Motivation

Support configuring the timeout for update_weights requests. This helps large weight updates avoid timing out with the fixed default timeout.

Modifications

  • Add timeout parameter validation in the OpenAI API update_weights endpoint.
  • Use the configured timeout when updating worker weights and cache-transfer metadata.
  • Keep timeout as a control-only parameter and do not forward it to workers.
  • Add unit tests for default timeout, custom timeout, and invalid timeout values.

Usage or Command

Example request body:

{
  "timeout": 120
}

If timeout is not provided, the default value is 180 seconds.

Accuracy Tests

Not applicable. This PR only changes control request timeout handling and does not affect model outputs.

Checklist

  • Add at least a tag in the PR title.
    • Suggested title: [Engine][APIServer] Allow configuring update weights timeout
  • Format your code, run pre-commit before commit.
  • Add unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@jackyYang6 jackyYang6 changed the title Allow configuring update weights control timeout [RL]Allow configuring update weights control timeout Jun 24, 2026
@jackyYang6 jackyYang6 changed the title [RL]Allow configuring update weights control timeout [RL] Allow configuring update weights control timeout Jun 24, 2026

@PaddlePaddle-bot PaddlePaddle-bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Paddle-CI-Agent | pr_review | 2026-06-24 13:37:14

📋 Review 摘要

PR 概述:为 OpenAI update_weights 控制请求新增可配置超时,并在 engine 内部用该超时等待 worker/cache-transfer。
变更范围fastdeploy/entrypoints/openai/api_server.pyfastdeploy/engine/common_engine.py 及相关单测
影响面 Tag[APIServer] [Engine]

问题

级别 文件 概述
🔴 Bug fastdeploy/entrypoints/openai/api_server.py:515 timeout > 600 时 API 层仍会被 EngineClient 固定 600s 外层等待提前判超时

📝 PR 规范检查

标题缺少官方 Tag。描述结构符合模板。

标题建议(可直接复制):

  • [APIServer] Allow configuring update weights control timeout

总体评价

核心参数剥离和 worker/cache-transfer 参数隔离方向是对的,但当前 timeout 没有贯穿到 API 到 engine 的控制响应等待,超过 600 秒的配置无法生效,需要先修正后再合入。

status_code=400,
content={"error": "Invalid parameter value", "message": "timeout must be positive"},
)
args["timeout"] = request_data["timeout"]

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug timeout 目前只传到了 engine 内部等待,API 到 engine 的外层等待仍然是固定 600 秒。

EngineClient.run_control_method() 仍在 asyncio.wait_for(response_queue.get(), timeout=600) 等控制响应,所以这里允许用户传入任意大于 600 的 timeout 时,HTTP 请求会先在 600 秒返回 Timeout waiting for control method response,而 engine 侧可能还在按用户配置继续执行;这和“可配置 update_weights 超时”的语义不一致。

建议修复方式:把 update_weights 的控制超时同步传递给 EngineClient.run_control_method() 的外层等待,且需要覆盖 cache-transfer 路径的总等待预算;或者在 API 层把 timeout 上限显式限制为当前外层等待能支持的范围,并返回清晰错误。

@codecov-commenter

codecov-commenter commented Jun 24, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@1b70017). Learn more about missing BASE report.

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #8073   +/-   ##
==========================================
  Coverage           ?   67.51%           
==========================================
  Files              ?      475           
  Lines              ?    66890           
  Branches           ?    10315           
==========================================
  Hits               ?    45164           
  Misses             ?    18860           
  Partials           ?     2866           
Flag Coverage Δ
GPU 77.54% <100.00%> (?)
XPU 6.95% <0.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@mitu626 mitu626 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@liyonghua0910 liyonghua0910 merged commit 2ee3c7f into PaddlePaddle:develop Jun 24, 2026
40 of 42 checks passed
@jackyYang6 jackyYang6 deleted the jacky/update-weights-timeout branch June 25, 2026 09:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants