[RL] Allow configuring update weights control timeout #8073
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-06-24 13:37:14
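📋 Review Summary
PR overview: Adds a configurable timeout for OpenAI `update_weights` control requests and uses that timeout inside the engine when waiting for worker/cache-transfer.
Scope of changes: fastdeploy/entrypoints/openai/api_server.py, fastdeploy/engine/common_engine.py, and related unit tests
Impact tags: [APIServer] [Engine]
Issues
| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | fastdeploy/entrypoints/openai/api_server.py:515 | When timeout > 600, the API layer still times out early because of the fixed 600 s outer wait in EngineClient |
📝 PR Convention Check
The title is missing an official tag. The description structure follows the template.
Suggested title (ready to copy):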
[APIServer] Allow configuring update weights control timeout
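Overall Assessment
Stripping the core parameter and isolating the worker/cache-transfer parameters is the right direction, but the timeout is not yet carried through the control-response wait between the API and the engine, so a configured value above 600 seconds has no effect. This needs to be fixed before merging.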
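To make the parameter stripping concrete, here is a minimal sketch of how a control-only `timeout` might be validated and separated from the arguments forwarded to workers. It is not the PR's actual code: `split_control_args`, `CONTROL_ONLY_KEYS`, and the `version` key in the usage lines are hypothetical names, and the 180-second default is taken from the PR description.

```python
DEFAULT_UPDATE_WEIGHTS_TIMEOUT_S = 180  # default stated in the PR description
CONTROL_ONLY_KEYS = frozenset({"timeout"})  # hypothetical: keys never sent to workers


def split_control_args(request_data):
    """Return (timeout, worker_args) for an update_weights request body.

    Hypothetical helper: validates the timeout and removes control-only
    keys so that workers only see the arguments meant for them.
    """
    timeout = request_data.get("timeout", DEFAULT_UPDATE_WEIGHTS_TIMEOUT_S)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(timeout, bool) or not isinstance(timeout, (int, float)):
        raise ValueError("timeout must be a number")
    if timeout <= 0:
        raise ValueError("timeout must be positive")
    worker_args = {
        key: value
        for key, value in request_data.items()
        if key not in CONTROL_ONLY_KEYS
    }
    return timeout, worker_args


# Usage: the timeout is kept for the control path, the rest goes to workers.
timeout, worker_args = split_control_args({"timeout": 120, "version": "v2"})
```

Keeping the control-only keys in one set means a later control parameter would only need to be added in one place, though whether that fits the real code layout is an open question.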
        status_code=400,
        content={"error": "Invalid parameter value", "message": "timeout must be positive"},
    )
args["timeout"] = request_data["timeout"]
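🔴 Bug: timeout currently reaches only the engine's internal wait; the outer wait between the API and the engine is still fixed at 600 seconds.
EngineClient.run_control_method() still waits for the control response with asyncio.wait_for(response_queue.get(), timeout=600). If a user passes any timeout greater than 600, the HTTP request returns "Timeout waiting for control method response" at 600 seconds, while the engine may keep running under the user-configured timeout. This contradicts the intended semantics of a configurable update_weights timeout.
Suggested fix: pass the update_weights control timeout through to the outer wait in EngineClient.run_control_method(), making sure it covers the total wait budget of the cache-transfer path; or explicitly cap timeout in the API layer at what the current outer wait can support, and return a clear error.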
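The first suggested fix can be sketched as follows. This is an illustration only, not FastDeploy's implementation: `wait_for_control_response` and `margin_s` are hypothetical, and the margin is an assumed allowance so that the outer wait outlasts the engine-side wait (including any cache-transfer time) rather than racing it.

```python
import asyncio

DEFAULT_CONTROL_TIMEOUT_S = 600.0  # the fixed outer wait described in the review


async def wait_for_control_response(response_queue, timeout=None, margin_s=5.0):
    """Wait for a control response, honoring a caller-supplied timeout.

    Hypothetical stand-in for the outer wait: when the caller configures a
    timeout, the outer wait is derived from it instead of the fixed default.
    """
    if timeout is None:
        outer_timeout = DEFAULT_CONTROL_TIMEOUT_S
    else:
        # Assumed margin so the API layer does not give up before the engine.
        outer_timeout = float(timeout) + margin_s
    try:
        return await asyncio.wait_for(response_queue.get(), timeout=outer_timeout)
    except asyncio.TimeoutError:
        raise TimeoutError(
            f"Timeout waiting for control method response after {outer_timeout}s"
        ) from None


async def _demo():
    # A response that is already queued is returned regardless of the budget.
    queue = asyncio.Queue()
    await queue.put({"status": "ok"})
    return await wait_for_control_response(queue, timeout=700)


result = asyncio.run(_demo())
```

With this shape, a request configured with `timeout=700` waits up to 705 seconds at the outer layer instead of failing at 600. The alternative in the review, capping the accepted value at the API layer, avoids changing the wait but limits what users can configure.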
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## develop #8073 +/- ##
==========================================
Coverage ? 67.51%
==========================================
Files ? 475
Lines ? 66890
Branches ? 10315
==========================================
Hits ? 45164
Misses ? 18860
Partials ? 2866
Motivation
Support configuring the timeout for `update_weights` requests. This helps large weight updates avoid timing out with the fixed default timeout.
Modifications
- Add `timeout` parameter validation in the OpenAI API `update_weights` endpoint.
- Treat `timeout` as a control-only parameter and do not forward it to workers.
Usage or Command
Example request body:
{ "timeout": 120 }
If `timeout` is not provided, the default value is `180` seconds.
Accuracy Tests
Not applicable. This PR only changes control request timeout handling and does not affect model outputs.
Checklist
- `[Engine][APIServer] Allow configuring update weights timeout`
- Run `pre-commit` before commit.
- If the PR targets the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.