Skip to content

fix: Parallel unit test runner with file-level subprocess isolation#3285

Merged
WMC001 merged 2 commits into
developfrom
fix/parallel-unit-tests
Jun 23, 2026
Merged

fix: Parallel unit test runner with file-level subprocess isolation#3285
WMC001 merged 2 commits into
developfrom
fix/parallel-unit-tests

Conversation

@JasonW404

@JasonW404 JasonW404 commented Jun 23, 2026

Copy link
Copy Markdown
Member

Summary

Rewrites the unit test runner to support file-level parallel execution via subprocess isolation, achieving ~7x speedup while maintaining 100% pass rate.

Results (229 test files, ~6,200+ tests)

Mode Wall Clock Pass Rate Workers
Serial 7m7s 229/229 (100%) 1
Parallel 1m0.76s 229/229 (100%) 20 (auto)

Changes

test/run_all_test.py — Parallel runner

  • File-level parallel execution using ThreadPoolExecutor
  • Each test file runs in its own subprocess (preserves isolation)
  • Configurable via environment variables:
    • NEXENT_PYTEST_WORKERS: auto (default, uses CPU count), 0 (serial), or integer
    • NEXENT_PYTEST_FILE_TIMEOUT: per-file timeout in seconds (default: 600)
    • NEXENT_PYTEST_TARGETS: override test directories/files
  • Coverage data per file (.coverage.N), combined via coverage combine

test/backend/services/test_mcp_service.py — Bug fix

  • Clear proxy env vars (socks://) in autouse fixture to prevent httpx.AsyncClient ValueError (8 test failures)

test/backend/services/test_remote_mcp_service.py — Bug fix

  • Mock check_runtime_host_port_available to prevent port conflict in container enable test (1 test failure)

test/sdk/core/models/test_openai_llm.py — Memory fix

  • Reduce memory leak from repeated module imports (~20MB RSS reduction)

.github/workflows/auto-unit-test.yml — CI optimization

  • Default to parallel mode (NEXENT_PYTEST_WORKERS=auto)
  • Per-file timeout: 300s (configurable via workflow_dispatch)
  • Dispatch inputs for worker count and timeout override

Why file-level parallel instead of pytest-xdist?

Direct pytest -n auto across the full test suite fails because many tests use global sys.modules mocking and import patching that conflict during a single pytest collection phase. File-level subprocess isolation preserves the existing per-file isolation guarantee while parallelizing across files.

Note: This PR does NOT add pytest-xdist as a dependency. The parallel runner uses Python's built-in concurrent.futures.ThreadPoolExecutor with per-file subprocess.run() calls — no xdist needed.

- Rewrite test/run_all_test.py as file-level parallel runner using
  ThreadPoolExecutor with configurable workers (NEXENT_PYTEST_WORKERS)
  and per-file timeout (NEXENT_PYTEST_FILE_TIMEOUT)
- Add pytest-xdist to backend test extras
- Fix test_mcp_service.py: clear proxy env vars (socks://) in fixture
  to prevent httpx.AsyncClient ValueError
- Fix test_remote_mcp_service.py: mock check_runtime_host_port_available
  to prevent port conflict in container enable test
- Fix test_openai_llm.py: reduce memory leak from repeated module imports
- Update CI workflow: default to parallel mode, add dispatch inputs for
  worker count and per-file timeout

Serial: 229/229 pass (7m7s). Parallel: 229/229 pass (1m1s, ~7x speedup).
Copilot AI review requested due to automatic review settings June 23, 2026 01:36

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

The parallel runner uses ThreadPoolExecutor with per-file subprocess
isolation, not pytest-xdist. The xdist package was added but never
used due to sys.modules mock conflicts during pytest collection.
@WMC001 WMC001 merged commit b6e59cf into develop Jun 23, 2026
18 checks passed
Comment thread test/run_all_test.py
if xml.returncode != 0:
logger.error("Coverage XML generation failed:\n%s\n%s", xml.stdout, xml.stderr)
return False
logger.info("Coverage XML file generated: %s", coverage_xml_file)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

coverage combine 使用 str(current_dir) 作为数据源目录,但每个 worker 的 .coverage.{index} 文件是通过 COVERAGE_FILE 环境变量写入 current_dir 的。coverage combine 默认在源目录中查找 .coverage.* 文件,但 --data-file 参数指定的是合并后的输出文件路径,不是源目录。如果 coverage combine 的 positional argument 被解释为数据文件列表而非目录,可能导致合并失败。建议验证 CI 中此路径是否正确工作。

@YehongPan

Copy link
Copy Markdown
Contributor

🔍 Code Review Comments

1. [逻辑漏洞] Windows 路径分隔符与 coverage combine
_run_test_fileenv["PYTHONPATH"] 使用 path_separator 拼接,但 coverage_dir 下的 .coverage.{index} 文件名在 Windows 上可能因路径分隔符问题导致 coverage combine 失败。

2. [代码规范] 容器环境下 CPU 计数不准确
_worker_countos.cpu_count() 在容器环境中可能返回宿主机的 CPU 数量而非容器限制值,应使用 len(os.sched_getaffinity(0)) 或读取 cgroup 限制。

3. [逻辑漏洞] coverage combine 参数问题
_combine_coveragecoverage combine 传入 str(current_dir) 作为参数,但 coverage combine 期望的是 coverage data 文件路径而非目录路径,可能导致 combine 找不到 .coverage.* 文件。

@YehongPan YehongPan left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

  • [逻辑漏洞] _run_test_fileenv["PYTHONPATH"] 使用 path_separator 拼接,但 coverage_dir 下的 .coverage.{index} 文件名在 Windows 上可能因路径分隔符问题导致 coverage combine 失败。
  • [代码规范] _worker_countos.cpu_count() 在容器环境中可能返回宿主机的 CPU 数量而非容器限制值,应使用 len(os.sched_getaffinity(0)) 或读取 cgroup 限制。
  • [逻辑漏洞] _combine_coveragecoverage combine 传入 str(current_dir) 作为参数,但 coverage combine 期望的是 coverage data 文件路径而非目录路径,可能导致 combine 找不到 .coverage.* 文件。

Comment thread test/run_all_test.py
if raw_workers in {"", "0", "false", "none", "off"}:
return 1
if raw_workers == "auto":
return max(1, min(os.cpu_count() or 1, total_files))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

os.cpu_count() 在容器环境中可能返回宿主机的 CPU 数量而非容器实际分配的 CPU 配额。建议使用 len(os.sched_getaffinity(0)) 获取实际可用 CPU 数(Linux),或读取 cgroup 的 cpu.cfs_quota_us 限制,避免在 CI 容器中过度并行导致 OOM。

Comment thread test/run_all_test.py
try:
result = subprocess.run(
cmd,
capture_output=True,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

subprocess.run 使用 capture_output=True,当测试输出非常大时会全部加载到内存中。建议对 stdout/stderr 设置最大缓冲区大小(如 max_output_size),超出部分截断并写入临时文件,防止单个测试文件的巨量输出耗尽内存。

Comment thread test/run_all_test.py
if raw_workers in {"", "0", "false", "none", "off"}:
return 1
if raw_workers == "auto":
return max(1, min(os.cpu_count() or 1, total_files))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P2] auto 模式直接用 os.cpu_count() 作为并发上限,在 16/32 核 runner 上会同时起大量 pytest+coverage 子进程,内存和文件句柄都容易被打满。建议默认 cap 到较小值,只有显式 env 才放大。

Comment thread test/run_all_test.py
if cov_config.exists():
cmd.append("--cov-config=test/.coveragerc")

env = os.environ.copy()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P2] 每个测试子进程继承完整 os.environ,代理变量、服务端点或凭据会泄漏进所有测试。这个 PR 只在单个 mcp 测试里清代理,runner 层仍会导致其他 httpx/requests 测试不稳定;建议统一清理/白名单 env。

Comment thread test/run_all_test.py
logger.error("No test files found")
return False

for coverage_artifact in [current_dir / ".coverage", *current_dir.glob(".coverage.*")]:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P2] 这里删除 test/.coverage* 和 coverage.xml,是 repo 内共享路径。两个 CI job 或本地同时跑这个脚本时会互相删覆盖率文件;建议为每次运行创建独立 coverage 目录或带 run id 的 data_file。

Comment thread test/run_all_test.py
for index, test_file in enumerate(test_files)
]
for future in as_completed(futures):
result = future.result()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P2] future.result() 没有兜底,_run_test_file 里任何未捕获异常都会中断 as_completed 循环,剩余测试结果和 coverage combine 都不会执行。建议把 executor 异常转成失败结果继续汇总。

Comment thread test/run_all_test.py
if failed:
logger.error("\nFailed test files: %s", len(failed))
for result in failed[:10]:
logger.error("\n%s\n%s\n%s", result["file"], result["stdout"][-4000:], result["stderr"][-2000:])

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P2] 失败时直接把 stdout/stderr 尾部写入 CI 日志,测试输出里如果包含 token、连接串或 presigned URL 会被持久化。建议增加敏感字段脱敏或只输出 pytest short summary。

Comment thread test/run_all_test.py
if raw_workers in {"", "0", "false", "none", "off"}:
return 1
if raw_workers == "auto":
return max(1, min(os.cpu_count() or 1, total_files))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P2] auto 模式直接用 os.cpu_count() 作为并发上限,在 16/32 核 runner 上会同时起大量 pytest+coverage 子进程,内存和文件句柄都容易被打满。建议默认 cap 到较小值,只有显式 env 才放大。

Comment thread test/run_all_test.py
if cov_config.exists():
cmd.append("--cov-config=test/.coveragerc")

env = os.environ.copy()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P2] 每个测试子进程继承完整 os.environ,代理变量、服务端点或凭据会泄漏进所有测试。这个 PR 只在单个 mcp 测试里清代理,runner 层仍会导致其他 httpx/requests 测试不稳定;建议统一清理/白名单 env。

Comment thread test/run_all_test.py
logger.error("No test files found")
return False

for coverage_artifact in [current_dir / ".coverage", *current_dir.glob(".coverage.*")]:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P2] 这里删除 test/.coverage* 和 coverage.xml,是 repo 内共享路径。两个 CI job 或本地同时跑这个脚本时会互相删覆盖率文件;建议为每次运行创建独立 coverage 目录或带 run id 的 data_file。

Comment thread test/run_all_test.py
for index, test_file in enumerate(test_files)
]
for future in as_completed(futures):
result = future.result()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P2] future.result() 没有兜底,_run_test_file 里任何未捕获异常都会中断 as_completed 循环,剩余测试结果和 coverage combine 都不会执行。建议把 executor 异常转成失败结果继续汇总。

Comment thread test/run_all_test.py
if failed:
logger.error("\nFailed test files: %s", len(failed))
for result in failed[:10]:
logger.error("\n%s\n%s\n%s", result["file"], result["stdout"][-4000:], result["stderr"][-2000:])

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P2] 失败时直接把 stdout/stderr 尾部写入 CI 日志,测试输出里如果包含 token、连接串或 presigned URL 会被持久化。建议增加敏感字段脱敏或只输出 pytest short summary。

Dallas98 pushed a commit that referenced this pull request Jun 29, 2026
* Revert "fix: Parallel unit test runner with file-level subprocess isolation (#3285)"

This reverts commit b6e59cf.

* fix: restore parallel UT runner with coverage reporting

* fix: pin codecov action for sonar

* ci: add repo controlled codecov gates
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants