[CI] Set --workers=1 to avoid intermittent timeout failures #7846
/skip-ci ci_iluvatar
Thanks for your contribution!
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-05-18 22:12:34
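📋 Review Summary
PR overview: Explicitly sets --workers=1 in the CI test configuration to fix intermittent timeout assertion failures in test_max_waiting_time.py.
Scope of change: .github/workflows/_base_test.yml
Impact tag: [CI]
Issues
| Level | File | Summary |
|---|---|---|
| ❓ Question | .github/workflows/_base_test.yml:275 | Only the test layer is fixed; the multi-worker semaphore allocation problem in production is not tracked |
📝 PR Convention Check
✓ The title format is compliant ([CI] is an official tag), all five sections of the description template are present with specific and substantive content, and the checklist's checked state matches the actual changes.
Overall Assessment
The change is concise and the root-cause analysis is thorough. Restricting the CI test to a single worker eliminates the intermittent assertion failures caused by uneven semaphore allocation, and the approach is reasonable.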
  curl -X POST http://0.0.0.0:${FLASK_PORT}/switch \
  -H "Content-Type: application/json" \
- -d "{ \"--model\": \"/MODELDATA/ERNIE-4.5-0.3B-Paddle\", \"--max-concurrency\": 5000, \"--max-waiting-time\": 1 }"
+ -d "{ \"--model\": \"/MODELDATA/ERNIE-4.5-0.3B-Paddle\", \"--workers\": 1, \"--max-concurrency\": 5000, \"--max-waiting-time\": 1 }"
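❓ Question: Setting --workers=1 here fixes the assertion failure at the test layer, but the underlying problem still exists in production: in multi-worker deployments, StatefulSemaphore splits the semaphore evenly by worker count, so when requests are unevenly distributed, some workers exhaust their semaphore early.
Suggested direction: Is there a plan to open a separate issue to track the multi-worker semaphore allocation problem on the production side (for example, switching to a global semaphore or an adaptive allocation strategy)?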
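CI report generated from the following code (updated every 30 minutes):
1 Task overview
✅ All required tasks have passed; the PR can be merged.
2 Task status summary
2.1 Required tasks: 2/9 passed (7 skipped)
2.2 Optional tasks: 27/31 passed (4 skipped)
3 Failure details (required only)
No required tasks failed.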
✅ Cherry-pick successful! Created PR: #7848
Motivation
Under the configuration:
--max-concurrency 5000
--max-waiting-time 1
the test intermittently fails due to insufficient successful responses:
Expected: >= 1024
Observed: 1011, 322
The issue is related to worker-level semaphore allocation:
Since FD_SUPPORT_MAX_CONNECTIONS defaults to 1024:
workers = 1 → semaphore size = 1024
workers = 4 → semaphore size = 256 per worker
Although the total theoretical capacity remains unchanged, requests are not evenly distributed across Gunicorn workers in practice.
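The arithmetic above can be illustrated with a minimal model. This is a sketch under stated assumptions, not FastDeploy's StatefulSemaphore implementation: it assumes the limit is divided evenly by worker count, that no slot is released within the waiting window, and that any request beyond a worker's share times out. The function name `admitted` and the skewed request distribution are hypothetical, chosen only for illustration.

```python
def admitted(requests_per_worker, max_connections=1024):
    """Count requests that get a slot when the limit is split evenly per worker.

    Simplified model: no slot is released during the waiting window, so each
    worker admits at most its own share and the rest time out.
    """
    workers = len(requests_per_worker)
    per_worker = max_connections // workers  # e.g. 1024 // 4 = 256
    # A busy worker cannot borrow idle slots from a lightly loaded one.
    return sum(min(load, per_worker) for load in requests_per_worker)


# 5000 concurrent requests in each scenario.
single = admitted([5000])                   # one worker owns all 1024 slots
balanced = admitted([1250, 1250, 1250, 1250])  # perfectly even split
skewed = admitted([4400, 300, 200, 100])    # hypothetical uneven split

print(single, balanced, skewed)  # 1024 1024 812
```

In this model the single-worker and perfectly balanced cases both admit 1024 requests, while the skewed case admits fewer because two workers leave part of their 256-slot share unused. That is the shape of failure an assertion of `>= 1024` would catch.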
With workers = 4, some workers may receive significantly more requests than others, causing local semaphore exhaustion and triggering timeout failures before requests can enter inference execution.
This issue is intermittent because it depends on:
To preserve existing test behavior and avoid introducing unintended variability, it is necessary to explicitly set --workers=1 in test configurations.
Modifications
Set --workers=1 in the related test configuration.
Usage or Command
N/A
Accuracy Tests
N/A
Checklist
Tag list: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
Run pre-commit before commit.
For a PR targeting the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.