Skip to content

Commit 9deb14c

Browse files
committed
[None][test] CBTS sitecustomize: fix nested-pytest hang + argv detection
Two fixes in jenkins/scripts/cbts/coverage_utils/sitecustomize.py. 1. Use sys.orig_argv instead of sys.argv for pytest detection. When invoked as `python -m pytest`, sys.argv at sitecustomize-run time is ['-m', ...] -- the pytest path is injected later by the runpy machinery. The previous check `any("pytest" in a for a in sys.argv[:2])` therefore returned False for every `python -m pytest` process, including pytest main itself, causing daemon threads to spawn where they shouldn't. 2. Detect the nested-pytest layer used by tests/integration/defs/test_unittests.py::test_unittests_v2 (an outer pytest test that subprocess-spawns `python -m pytest <unittest_file>`), and in that layer skip the periodic-save / marker-poll / mpi-watcher daemon threads while still synchronously applying install_mpi_pool_patch. The 100ms-loop daemons in the middle layer were contending with the zmq IPC threads talking to MPI workers, pushing LLM() fixture setup past the 120s pytest-timeout on A30-PyTorch-1 (128 timeouts in unittest/_torch/sampler/test_logits_logprobs.py). MPI workers still get the widened env via the synchronous patch, so worker coverage (the bulk of py_executor.py data) is preserved. Validated locally on H100 with a minimal repro: nested pytest spawns LLM(TinyLlama), inner pytest has zero cbts-* daemon threads, mpi patch is applied, MPI worker starts full coverage, LLM init + generate complete normally. Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
1 parent 697959a commit 9deb14c

1 file changed

Lines changed: 69 additions & 17 deletions

File tree

jenkins/scripts/cbts/coverage_utils/sitecustomize.py

Lines changed: 69 additions & 17 deletions
Original file line numberDiff line numberDiff line change
@@ -21,16 +21,15 @@
2121
* Starts a coverage.Coverage instance with the given config in every
2222
Python process under our PYTHONPATH (pytest main + MPI workers +
2323
trtllm-serve subprocesses).
24-
* In subprocesses (pytest is NOT on argv[:2]), runs a daemon thread
25-
that calls cov.save() every 5s so processes killed via SIGKILL
26-
(e.g. trtllm-serve teardown in disaggregated tests) still leave
27-
most of their data on disk. Pytest main exits cleanly via atexit,
28-
so it doesn't need periodic save.
29-
* In the same subprocesses, runs a second daemon thread that polls a
30-
marker file and calls cov.switch_context(<test_nodeid>) whenever
31-
the marker changes; the marker file is written by cbts_plugin in
32-
the pytest main process. This is how worker code gets attributed
33-
to the test that triggered it.
24+
* In non-pytest subprocesses (MPI workers, trtllm-bench, trtllm-serve)
25+
runs a daemon thread that calls cov.save() every 5s so processes
26+
killed via SIGKILL still leave most of their data on disk. Pytest
27+
layers (main or nested) skip this — they exit cleanly via atexit.
28+
* In the same non-pytest subprocesses, runs a second daemon thread
29+
that polls a marker file and calls cov.switch_context(<test_nodeid>)
30+
whenever the marker changes; the marker file is written by
31+
cbts_plugin in the pytest main process. This is how worker code
32+
gets attributed to the test that triggered it.
3433
* Registers a single atexit handler that signals the periodic thread
3534
to stop, then stops coverage and saves once. This avoids the
3635
periodic thread racing with atexit's own cov.save().
@@ -41,6 +40,27 @@
4140
import os
4241
import sys
4342

43+
44+
def _parent_is_pytest():
45+
"""Return True if our parent process is also running pytest.
46+
47+
Detects the nested-pytest pattern used by
48+
``tests/integration/defs/test_unittests.py::test_unittests_v2``: the
49+
outer pytest spawns ``python -m pytest <unittest_file>`` as a
50+
subprocess for each unittest module. In that nested case we keep
51+
coverage running but skip the daemon threads in this layer (they
52+
contend with the zmq IPC threads that talk to MPI workers spawned
53+
by LLM() fixtures, and the inner pytest exits cleanly via atexit so
54+
periodic save is unnecessary).
55+
"""
56+
try:
57+
with open(f"/proc/{os.getppid()}/cmdline", "rb") as f:
58+
parent_cmdline = f.read().split(b"\x00")
59+
except OSError:
60+
return False
61+
return any(b"pytest" in part for part in parent_cmdline)
62+
63+
4464
if os.getenv("CBTS_COVERAGE_CONFIG"):
4565
import atexit
4666

@@ -70,7 +90,22 @@
7090
)
7191
cov.start()
7292

73-
_is_pytest_main = any("pytest" in a for a in sys.argv[:2])
93+
# Use sys.orig_argv (Python 3.10+) instead of sys.argv: at the moment
94+
# sitecustomize runs, Python has not yet rewritten argv[0] from "-m"
95+
# to the pytest __main__.py path for `python -m pytest` invocations,
96+
# so sys.argv looks like ['-m', '--co', ...] -- pytest is not in
97+
# there yet. sys.orig_argv preserves the full launching cmdline,
98+
# e.g. ['python', '-m', 'pytest', '--co', ...].
99+
_orig_argv = getattr(sys, "orig_argv", sys.argv)
100+
_is_pytest_main = any("pytest" in a for a in _orig_argv[:4])
101+
_is_nested_pytest = _parent_is_pytest() and _is_pytest_main
102+
# Skip daemon threads in pytest main AND in nested-pytest layers
103+
# (see _parent_is_pytest docstring). Both exit cleanly via atexit
104+
# so periodic save is unnecessary, and in the nested case the
105+
# 100ms-loop daemon threads contend for the GIL with the zmq IPC
106+
# threads talking to MPI workers, which on slower hosts (A30) can
107+
# push LLM() fixture setup past the per-test pytest-timeout.
108+
_skip_daemons = _is_pytest_main or _is_nested_pytest
74109

75110
# Periodic save protects against SIGKILL'd subprocesses (e.g.
76111
# trtllm-serve cleanup in disaggregated tests). Pytest main exits
@@ -112,7 +147,7 @@ def _final_save():
112147

113148
atexit.register(_final_save)
114149

115-
if not _is_pytest_main:
150+
if not _skip_daemons:
116151
threading.Thread(
117152
target=_periodic_save,
118153
daemon=True,
@@ -139,12 +174,29 @@ def _final_save():
139174
# actually run py_executor.py -- which is why those workers ended
140175
# up with no coverage even though sitecustomize ran here. Without
141176
# this, the 21 SUBPROCESS_LOST tests stay SUBPROCESS_LOST.
142-
if not _is_pytest_main:
143-
import time as _time
177+
_initial_nodeid = os.environ.get("CBTS_TEST_ID", "").strip()
178+
if _initial_nodeid and not _is_pytest_main:
179+
cov.switch_context(_initial_nodeid)
180+
181+
if _is_nested_pytest:
182+
# Inner pytest: apply the mpi_session env-whitelist patch
183+
# synchronously so MPI workers spawned by LLM() inside the
184+
# inner test fixtures still get the widened env. The watcher
185+
# thread used in non-pytest subprocesses would contend with
186+
# zmq IPC threads here, so we just do it inline.
187+
try:
188+
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
189+
from cbts_plugin import install_mpi_pool_patch
144190

145-
_initial_nodeid = os.environ.get("CBTS_TEST_ID", "").strip()
146-
if _initial_nodeid:
147-
cov.switch_context(_initial_nodeid)
191+
install_mpi_pool_patch(raise_on_refactor=False)
192+
except Exception as _exc:
193+
print(
194+
f"[cbts] nested-pytest mpi patch skipped in pid {os.getpid()}: {_exc!r}",
195+
file=sys.stderr,
196+
)
197+
198+
if not _skip_daemons:
199+
import time as _time
148200

149201
def _watch_mpi_session():
150202
try:

0 commit comments

Comments
 (0)