[CI] Set --workers=1 to avoid intermittent timeout failures #7846

Merged
DDDivano merged 1 commit into PaddlePaddle:develop from EmmonsCurse:fix_test_max_waiting_time_error
May 19, 2026

Conversation

@EmmonsCurse
Collaborator

@EmmonsCurse EmmonsCurse commented May 18, 2026

Motivation

Under the configuration:

  • --max-concurrency 5000
  • --max-waiting-time 1

the test intermittently fails due to insufficient successful responses:

  • concurrent requests: 1333
  • expected successful responses: >= 1024
  • actual successful responses: around 1011
  • timeout failures (500): around 322

The issue is related to worker-level semaphore allocation:

self.semaphore = StatefulSemaphore((FD_SUPPORT_MAX_CONNECTIONS + workers - 1) // workers)

Since FD_SUPPORT_MAX_CONNECTIONS defaults to 1024:

  • workers = 1 → semaphore size = 1024
  • workers = 4 → semaphore size = 256 per worker
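
The two sizes above follow from the ceiling division in the semaphore expression. A minimal sketch of that arithmetic, assuming only the formula quoted in this description (the helper name `per_worker_semaphore_size` is hypothetical and not part of FastDeploy):

```python
def per_worker_semaphore_size(max_connections: int, workers: int) -> int:
    """Ceiling division, mirroring (FD_SUPPORT_MAX_CONNECTIONS + workers - 1) // workers."""
    return (max_connections + workers - 1) // workers


# 1024 is the default FD_SUPPORT_MAX_CONNECTIONS stated in this description.
for workers in (1, 4):
    size = per_worker_semaphore_size(1024, workers)
    print(f"workers={workers}: semaphore size per worker = {size}")
```

Each worker only ever sees its own share, so the 1024 total is reachable only if every worker receives at least as many requests as its share.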

Although the total theoretical capacity remains unchanged, requests are not evenly distributed across Gunicorn workers in practice.

With workers = 4, some workers may receive significantly more requests than others, causing local semaphore exhaustion and triggering timeout failures before requests can enter inference execution.
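
The effect of such an imbalance can be sketched with a deliberately simplified static model: each worker admits at most its own semaphore size, and everything beyond that times out. The model ignores slots freed while a request waits, and the per-worker split below is hypothetical, chosen only because it happens to reproduce the reported counts. The real distribution is not recorded in this PR.

```python
def admitted_requests(per_worker_arrivals, max_connections=1024):
    """Count requests that obtain a semaphore slot in a static model.

    Assumption: no slot is released during the waiting window, so a worker
    admits min(arrivals, per-worker limit) and rejects the rest.
    """
    workers = len(per_worker_arrivals)
    limit = (max_connections + workers - 1) // workers
    return sum(min(arrivals, limit) for arrivals in per_worker_arrivals)


total = 1333  # concurrent requests in the failing test

# One worker: a single pool of 1024 slots.
single = admitted_requests([total])

# Four workers with a hypothetical skewed split (sums to 1333).
skewed_split = [360, 350, 380, 243]
skewed = admitted_requests(skewed_split)

print(single, total - single)  # 1024 309
print(skewed, total - skewed)  # 1011 322
```

In this model the shortfall comes entirely from the worker that receives fewer requests than its 256 slots: its unused slots cannot be lent to the overloaded workers.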

This issue is intermittent because it depends on:

  • OS-level socket accept scheduling
  • runtime request distribution across workers
  • inference latency fluctuations under GPU load
  • boundary-state concurrency conditions

To preserve existing test behavior and avoid introducing unintended variability, it is necessary to explicitly set --workers=1 in test configurations.

Modifications

  • Explicitly set --workers=1 in the related test configuration.
  • Avoid worker-level semaphore fragmentation caused by multi-worker request imbalance.
  • Improve concurrency stability and reduce intermittent timeout-related assertion failures.

Usage or Command

N/A

Accuracy Tests

N/A

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@EmmonsCurse
Collaborator Author

/skip-ci ci_iluvatar
/skip-ci ci_hpu
/skip-ci build_xpu
/skip-ci coverage
/skip-ci stable_test
/skip-ci pre_ce_test
/skip-ci logprob_test
/skip-ci gpu_4cards_test

@paddle-bot

paddle-bot Bot commented May 18, 2026

Thanks for your contribution!

@EmmonsCurse EmmonsCurse requested a review from DDDivano May 18, 2026 14:07

@PaddlePaddle-bot PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-05-18 22:12:34

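📋 Review Summary

PR overview: Explicitly sets --workers=1 in the CI test configuration to fix the intermittent timeout assertion failures in test_max_waiting_time.py.
Scope of change: .github/workflows/_base_test.yml
Affected tag: [CI]

Issues

Level | File | Summary
❓ Question | .github/workflows/_base_test.yml:275 | Only the test layer is fixed; the multi-worker semaphore allocation problem in production is not tracked

📝 PR Convention Check

✓ The title format is compliant ([CI] is an official tag), all five sections of the description template are present, the content is specific and substantive, and the checked state of the Checklist matches the actual change.

Overall Assessment

The change is concise and the root-cause analysis is thorough. Restricting the CI test to a single worker eliminates the intermittent assertion failures caused by uneven semaphore allocation; the approach is reasonable.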

  curl -X POST http://0.0.0.0:${FLASK_PORT}/switch \
  -H "Content-Type: application/json" \
- -d "{ \"--model\": \"/MODELDATA/ERNIE-4.5-0.3B-Paddle\", \"--max-concurrency\": 5000, \"--max-waiting-time\": 1 }"
+ -d "{ \"--model\": \"/MODELDATA/ERNIE-4.5-0.3B-Paddle\", \"--workers\": 1, \"--max-concurrency\": 5000, \"--max-waiting-time\": 1 }"

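❓ Question: Setting --workers=1 here fixes the assertion failure at the test layer, but the underlying problem remains in production: with multiple workers, StatefulSemaphore splits the semaphore evenly by worker count, so when requests are distributed unevenly, some workers exhaust their semaphore early.

Suggested direction: Is there a plan to open a separate issue to track the multi-worker semaphore allocation problem on the production side (for example, switching to a global semaphore or an adaptive allocation strategy)?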

Collaborator

Why would requests be distributed unevenly?

@PaddlePaddle-bot

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-18 23:12:58

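The CI report is generated from the following code (updated every 30 minutes):


1 Job Overview

All Required jobs have passed; the PR can be merged.

Total runs (reruns) | Total jobs | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Pending | Skipped
40 (0) | 40 | 29 | 0 | 0 | 0 | 11

2 Job Status Summary

2.1 Required jobs: 2/9 passed (7 skipped)

Required jobs block merging; failures should be handled first.

Status | Job | Duration | Root cause | Suggested fix | Log | Rerun
 | Run Base Tests / base_tests | 12m26s | - | - | - | -
 | Approval | 8s | - | - | - | -
⏭️ | The remaining 7 required jobs were skipped (bypass) | - | - | - | - | -

Note: Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage (the main test job) was skipped via the check_cov_skip mechanism, which is part of the normal flow.

2.2 Optional jobs: 27/31 passed (4 skipped)

Optional jobs do not block merging; failures are for reference only.

Status | Job | Duration | Log | Rerun
 | The remaining 27 optional jobs passed | - | - | -
⏭️ | 4 optional jobs were skipped (bypass) | - | - | -

3 Failure Details (required only)

No required jobs failed.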

@DDDivano DDDivano merged commit 14de908 into PaddlePaddle:develop May 19, 2026
42 checks passed
@EmmonsCurse EmmonsCurse deleted the fix_test_max_waiting_time_error branch May 19, 2026 03:05
@EmmonsCurse
Collaborator Author

✅ Cherry-pick successful! Created PR: #7848

3 participants