fix: Parallel unit test runner with file-level subprocess isolation#3285
Conversation
- Rewrite test/run_all_test.py as file-level parallel runner using ThreadPoolExecutor with configurable workers (NEXENT_PYTEST_WORKERS) and per-file timeout (NEXENT_PYTEST_FILE_TIMEOUT) - Add pytest-xdist to backend test extras - Fix test_mcp_service.py: clear proxy env vars (socks://) in fixture to prevent httpx.AsyncClient ValueError - Fix test_remote_mcp_service.py: mock check_runtime_host_port_available to prevent port conflict in container enable test - Fix test_openai_llm.py: reduce memory leak from repeated module imports - Update CI workflow: default to parallel mode, add dispatch inputs for worker count and per-file timeout Serial: 229/229 pass (7m7s). Parallel: 229/229 pass (1m1s, ~7x speedup).
The parallel runner uses ThreadPoolExecutor with per-file subprocess isolation, not pytest-xdist. The xdist package was added but never used due to sys.modules mock conflicts during pytest collection.
| if xml.returncode != 0: | ||
| logger.error("Coverage XML generation failed:\n%s\n%s", xml.stdout, xml.stderr) | ||
| return False | ||
| logger.info("Coverage XML file generated: %s", coverage_xml_file) |
There was a problem hiding this comment.
coverage combine 使用 str(current_dir) 作为数据源目录,但每个 worker 的 .coverage.{index} 文件是通过 COVERAGE_FILE 环境变量写入 current_dir 的。coverage combine 默认在源目录中查找 .coverage.* 文件,但 --data-file 参数指定的是合并后的输出文件路径,不是源目录。如果 coverage combine 的 positional argument 被解释为数据文件列表而非目录,可能导致合并失败。建议验证 CI 中此路径是否正确工作。
🔍 Code Review Comments1. [逻辑漏洞] Windows 路径分隔符与 coverage combine 2. [代码规范] 容器环境下 CPU 计数不准确 3. [逻辑漏洞] coverage combine 参数问题 |
YehongPan
left a comment
There was a problem hiding this comment.
Code Review
- [逻辑漏洞]
_run_test_file中env["PYTHONPATH"]使用path_separator拼接,但coverage_dir下的.coverage.{index}文件名在 Windows 上可能因路径分隔符问题导致 coverage combine 失败。 - [代码规范]
_worker_count中os.cpu_count()在容器环境中可能返回宿主机的 CPU 数量而非容器限制值,应使用len(os.sched_getaffinity(0))或读取 cgroup 限制。 - [逻辑漏洞]
_combine_coverage中coverage combine传入str(current_dir)作为参数,但 coverage combine 期望的是 coverage data 文件路径而非目录路径,可能导致 combine 找不到.coverage.*文件。
| if raw_workers in {"", "0", "false", "none", "off"}: | ||
| return 1 | ||
| if raw_workers == "auto": | ||
| return max(1, min(os.cpu_count() or 1, total_files)) |
There was a problem hiding this comment.
os.cpu_count() 在容器环境中可能返回宿主机的 CPU 数量而非容器实际分配的 CPU 配额。建议使用 len(os.sched_getaffinity(0)) 获取实际可用 CPU 数(Linux),或读取 cgroup 的 cpu.cfs_quota_us 限制,避免在 CI 容器中过度并行导致 OOM。
| try: | ||
| result = subprocess.run( | ||
| cmd, | ||
| capture_output=True, |
There was a problem hiding this comment.
subprocess.run 使用 capture_output=True,当测试输出非常大时会全部加载到内存中。建议对 stdout/stderr 设置最大缓冲区大小(如 max_output_size),超出部分截断并写入临时文件,防止单个测试文件的巨量输出耗尽内存。
| if raw_workers in {"", "0", "false", "none", "off"}: | ||
| return 1 | ||
| if raw_workers == "auto": | ||
| return max(1, min(os.cpu_count() or 1, total_files)) |
There was a problem hiding this comment.
[P2] auto 模式直接用 os.cpu_count() 作为并发上限,在 16/32 核 runner 上会同时起大量 pytest+coverage 子进程,内存和文件句柄都容易被打满。建议默认 cap 到较小值,只有显式 env 才放大。
| if cov_config.exists(): | ||
| cmd.append("--cov-config=test/.coveragerc") | ||
|
|
||
| env = os.environ.copy() |
There was a problem hiding this comment.
[P2] 每个测试子进程继承完整 os.environ,代理变量、服务端点或凭据会泄漏进所有测试。这个 PR 只在单个 mcp 测试里清代理,runner 层仍会导致其他 httpx/requests 测试不稳定;建议统一清理/白名单 env。
| logger.error("No test files found") | ||
| return False | ||
|
|
||
| for coverage_artifact in [current_dir / ".coverage", *current_dir.glob(".coverage.*")]: |
There was a problem hiding this comment.
[P2] 这里删除 test/.coverage* 和 coverage.xml,是 repo 内共享路径。两个 CI job 或本地同时跑这个脚本时会互相删覆盖率文件;建议为每次运行创建独立 coverage 目录或带 run id 的 data_file。
| for index, test_file in enumerate(test_files) | ||
| ] | ||
| for future in as_completed(futures): | ||
| result = future.result() |
There was a problem hiding this comment.
[P2] future.result() 没有兜底,_run_test_file 里任何未捕获异常都会中断 as_completed 循环,剩余测试结果和 coverage combine 都不会执行。建议把 executor 异常转成失败结果继续汇总。
| if failed: | ||
| logger.error("\nFailed test files: %s", len(failed)) | ||
| for result in failed[:10]: | ||
| logger.error("\n%s\n%s\n%s", result["file"], result["stdout"][-4000:], result["stderr"][-2000:]) |
There was a problem hiding this comment.
[P2] 失败时直接把 stdout/stderr 尾部写入 CI 日志,测试输出里如果包含 token、连接串或 presigned URL 会被持久化。建议增加敏感字段脱敏或只输出 pytest short summary。
| if raw_workers in {"", "0", "false", "none", "off"}: | ||
| return 1 | ||
| if raw_workers == "auto": | ||
| return max(1, min(os.cpu_count() or 1, total_files)) |
There was a problem hiding this comment.
[P2] auto 模式直接用 os.cpu_count() 作为并发上限,在 16/32 核 runner 上会同时起大量 pytest+coverage 子进程,内存和文件句柄都容易被打满。建议默认 cap 到较小值,只有显式 env 才放大。
| if cov_config.exists(): | ||
| cmd.append("--cov-config=test/.coveragerc") | ||
|
|
||
| env = os.environ.copy() |
There was a problem hiding this comment.
[P2] 每个测试子进程继承完整 os.environ,代理变量、服务端点或凭据会泄漏进所有测试。这个 PR 只在单个 mcp 测试里清代理,runner 层仍会导致其他 httpx/requests 测试不稳定;建议统一清理/白名单 env。
| logger.error("No test files found") | ||
| return False | ||
|
|
||
| for coverage_artifact in [current_dir / ".coverage", *current_dir.glob(".coverage.*")]: |
There was a problem hiding this comment.
[P2] 这里删除 test/.coverage* 和 coverage.xml,是 repo 内共享路径。两个 CI job 或本地同时跑这个脚本时会互相删覆盖率文件;建议为每次运行创建独立 coverage 目录或带 run id 的 data_file。
| for index, test_file in enumerate(test_files) | ||
| ] | ||
| for future in as_completed(futures): | ||
| result = future.result() |
There was a problem hiding this comment.
[P2] future.result() 没有兜底,_run_test_file 里任何未捕获异常都会中断 as_completed 循环,剩余测试结果和 coverage combine 都不会执行。建议把 executor 异常转成失败结果继续汇总。
| if failed: | ||
| logger.error("\nFailed test files: %s", len(failed)) | ||
| for result in failed[:10]: | ||
| logger.error("\n%s\n%s\n%s", result["file"], result["stdout"][-4000:], result["stderr"][-2000:]) |
There was a problem hiding this comment.
[P2] 失败时直接把 stdout/stderr 尾部写入 CI 日志,测试输出里如果包含 token、连接串或 presigned URL 会被持久化。建议增加敏感字段脱敏或只输出 pytest short summary。
Summary
Rewrites the unit test runner to support file-level parallel execution via subprocess isolation, achieving ~7x speedup while maintaining 100% pass rate.
Results (229 test files, ~6,200+ tests)
Changes
test/run_all_test.py— Parallel runnerThreadPoolExecutorNEXENT_PYTEST_WORKERS:auto(default, uses CPU count),0(serial), or integerNEXENT_PYTEST_FILE_TIMEOUT: per-file timeout in seconds (default: 600)NEXENT_PYTEST_TARGETS: override test directories/files.coverage.N), combined viacoverage combinetest/backend/services/test_mcp_service.py— Bug fixsocks://) in autouse fixture to preventhttpx.AsyncClientValueError (8 test failures)test/backend/services/test_remote_mcp_service.py— Bug fixcheck_runtime_host_port_availableto prevent port conflict in container enable test (1 test failure)test/sdk/core/models/test_openai_llm.py— Memory fix.github/workflows/auto-unit-test.yml— CI optimizationNEXENT_PYTEST_WORKERS=auto)Why file-level parallel instead of pytest-xdist?
Direct
pytest -n autoacross the full test suite fails because many tests use globalsys.modulesmocking and import patching that conflict during a single pytest collection phase. File-level subprocess isolation preserves the existing per-file isolation guarantee while parallelizing across files.Note: This PR does NOT add
pytest-xdistas a dependency. The parallel runner uses Python's built-inconcurrent.futures.ThreadPoolExecutorwith per-filesubprocess.run()calls — no xdist needed.