
Commit f90838d

[CI] Fix CI subprocess test hangs (#557)
## Summary

Subprocess-spawning tests hang indefinitely on CI.

## Causes & Fixes

### Problems From Lab

1. Lab reports: "AppLauncher doesn't quit properly after app.close(), app.quit() doesn't help either."
2. Cold startup times for tests using IS can be upwards of 10 min on Lab CI machines.

These issues apply to us because our tests hang during the subprocess test section, between the end of one test and the beginning of the next. See detailed logs and analysis from reproducing locally [here](#568).

### Fixes

1. `SimulationApp` force exit: skips `app.close()` (which can hang indefinitely in Kit's shutdown path) when the env var `ISAACLAB_ARENA_FORCE_EXIT_ON_COMPLETE=1` is set. Calls a new `_kill_child_processes()` helper that walks `/proc` to `SIGKILL` all direct children before calling `os._exit(0)`, preventing orphaned Kit processes from holding GPU resources.
2. `run_subprocess` now has a configurable wall-clock timeout and process isolation, so that it can trigger the force-exit path above when needed.
3. Add wall-clock timing and logging inside the `SimulationApp` start method to track how long startup takes on CI.

## Minor fixes

1. Add timing stats to the pytest commands so that each section reports its slowest test functions at the end.
2. Parametrize multi-config tests: convert the nested for-loops in `test_zero_action_policy_kitchen_pick_and_place` (6 configs) and `test_zero_action_policy_gr1_open_microwave` (3 configs) into `@pytest.mark.parametrize`. Each config gets its own timeout, pass/fail result, and timing.
3. Reduce `num_envs` in the gr00t eval_runner test to speed it up.

### Local validation

With the repro script (#568), I see no stalling locally. See the log for more details:
[repro_20260410_041313.log](https://github.com/user-attachments/files/26620524/repro_20260410_041313.log)

### CI Before -- timeout

<img width="1219" height="170" alt="image" src="https://github.com/user-attachments/assets/2f9eabb2-403d-4257-bd84-4da508de7d00" />

### CI After

<img width="1219" height="170" alt="image" src="https://github.com/user-attachments/assets/dbaf2a7d-e3a4-4ad2-85a4-389eae962c1d" />
<img width="1198" height="472" alt="image" src="https://github.com/user-attachments/assets/8a24f1aa-4bcb-4030-b075-09f3885673c2" />

## TODOs

- `test_camera_observations` takes 10 min to start the app due to Kit cold start. Experimenting with a warm start before the tests in #565.
- Kit itself intermittently deadlocks during startup, not because of orphans, but because Kit's internal thread synchronization fails on low-CPU runners. Experimenting with retry in #570.
1 parent e3f1283 commit f90838d

5 files changed

Lines changed: 136 additions & 51 deletions

File tree

.github/workflows/ci.yml

Lines changed: 9 additions & 6 deletions
@@ -142,22 +142,23 @@ jobs:
       # To restore sanity here is a goal for Isaac Lab Arena v0.3.
       - name: Run in-process Newton tests
         run: |
-          /isaac-sim/python.sh -m pytest -sv -m with_newton \
+          /isaac-sim/python.sh -m pytest -sv --durations=0 -m with_newton \
             isaaclab_arena/tests/

       - name: Run in-process PhysX tests without cameras
         run: |
-          /isaac-sim/python.sh -m pytest -sv -m "not with_cameras and not with_subprocess and not with_newton" \
+          /isaac-sim/python.sh -m pytest -sv --durations=0 -m "not with_cameras and not with_subprocess and not with_newton" \
             isaaclab_arena/tests/

       - name: Run in-process PhysX tests with cameras
         run: |
-          /isaac-sim/python.sh -m pytest -sv -m "with_cameras and not with_subprocess and not with_newton" \
+          /isaac-sim/python.sh -m pytest -sv --durations=0 -m "with_cameras and not with_subprocess and not with_newton" \
             isaaclab_arena/tests/

-      - name: Run subprocess-spawning PhysX tests
+      - name: Run subprocess-spawning PhysX tests
         run: |
-          /isaac-sim/python.sh -m pytest -sv -m with_subprocess \
+          ISAACLAB_ARENA_SUBPROCESS_TIMEOUT=900 \
+          /isaac-sim/python.sh -m pytest -sv --durations=0 -m with_subprocess \
             isaaclab_arena/tests/


@@ -188,7 +189,9 @@ jobs:

       # Run the policy (GR00T) related tests.
       - name: Run policy-related pytest
-        run: /isaac-sim/python.sh -m pytest -sv isaaclab_arena_gr00t/tests/
+        run: |
+          ISAACLAB_ARENA_SUBPROCESS_TIMEOUT=900 \
+          /isaac-sim/python.sh -m pytest -sv --durations=0 isaaclab_arena_gr00t/tests/


   build_docs_pre_merge:

isaaclab_arena/tests/test_policy_runner.py

Lines changed: 19 additions & 25 deletions
@@ -65,21 +65,18 @@ def test_zero_action_policy_press_button():


 @pytest.mark.with_subprocess
-def test_zero_action_policy_kitchen_pick_and_place():
+@pytest.mark.parametrize("embodiment", ["franka_ik", "gr1_pink", "gr1_joint"])
+@pytest.mark.parametrize("object_name", ["cracker_box", "tomato_soup_can"])
+def test_zero_action_policy_kitchen_pick_and_place(embodiment, object_name):
     # TODO(alexmillane, 2025.07.29): Get an exhaustive list of all scenes and embodiments
     # from a registry when we have one.
-    example_environment = "kitchen_pick_and_place"
-    embodiments = ["franka_ik", "gr1_pink", "gr1_joint"]
-    object_names = ["cracker_box", "tomato_soup_can"]
-    for embodiment in embodiments:
-        for object_name in object_names:
-            run_policy_runner(
-                policy_type="zero_action",
-                example_environment=example_environment,
-                embodiment=embodiment,
-                object_name=object_name,
-                num_steps=NUM_STEPS,
-            )
+    run_policy_runner(
+        policy_type="zero_action",
+        example_environment="kitchen_pick_and_place",
+        embodiment=embodiment,
+        object_name=object_name,
+        num_steps=NUM_STEPS,
+    )


 @pytest.mark.with_subprocess
@@ -98,20 +95,17 @@ def test_zero_action_policy_galileo_pick_and_place():


 @pytest.mark.with_subprocess
-def test_zero_action_policy_gr1_open_microwave():
+@pytest.mark.parametrize("object_name", ["cracker_box", "tomato_soup_can", "mustard_bottle"])
+def test_zero_action_policy_gr1_open_microwave(object_name):
     # TODO(alexmillane, 2025.07.29): Get an exhaustive list of all scenes and embodiments
     # from a registry when we have one.
-    example_environment = "gr1_open_microwave"
-    object_name = ["cracker_box", "tomato_soup_can", "mustard_bottle"]
-    for object_name in object_name:
-        run_policy_runner(
-            policy_type="zero_action",
-            example_environment=example_environment,
-            embodiment="gr1_pink",
-            background=None,
-            object_name=object_name,
-            num_steps=NUM_STEPS,
-        )
+    run_policy_runner(
+        policy_type="zero_action",
+        example_environment="gr1_open_microwave",
+        embodiment="gr1_pink",
+        object_name=object_name,
+        num_steps=NUM_STEPS,
+    )


 @pytest.mark.with_subprocess
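
Stacking two `@pytest.mark.parametrize` decorators, as in the kitchen test above, makes pytest generate one test case per combination of the two argument lists. The following is a minimal standard-library sketch of which combinations result; the helper name `expand_cases` is hypothetical and not part of the commit.

```python
import itertools

EMBODIMENTS = ["franka_ik", "gr1_pink", "gr1_joint"]
OBJECT_NAMES = ["cracker_box", "tomato_soup_can"]


def expand_cases(embodiments, object_names):
    """Return every (embodiment, object_name) pair.

    Stacked parametrize decorators produce one test case per pair,
    i.e. the cross product of the two lists.
    """
    return list(itertools.product(embodiments, object_names))


cases = expand_cases(EMBODIMENTS, OBJECT_NAMES)
print(f"{len(cases)} cases")
for embodiment, object_name in cases:
    print(f"  embodiment={embodiment}, object_name={object_name}")
```

Because each pair becomes its own test item, a hang or failure in one configuration no longer hides the results of the others, which is the motivation stated in the commit message.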

isaaclab_arena/tests/utils/subprocess.py

Lines changed: 58 additions & 14 deletions
@@ -31,26 +31,70 @@
 _AT_LEAST_ONE_TEST_FAILED = False


-def run_subprocess(cmd, env=None):
-    print(f"Running command: {cmd}")
+_SUBPROCESS_TIMEOUT_SEC = int(os.environ.get("ISAACLAB_ARENA_SUBPROCESS_TIMEOUT", "600"))
+
+
+def run_subprocess(
+    cmd,
+    env=None,
+    timeout_sec: int | None = None,
+    capture_output: bool = False,
+) -> subprocess.CompletedProcess | None:
+    """Run a command in a subprocess with timeout.
+
+    The child is launched with ``start_new_session=True`` so it lives in its
+    own process group. The child-side ``SimulationAppContext`` uses this to
+    SIGTERM its entire group before ``os._exit()``, preventing orphaned Kit
+    children (shader compiler, GPU workers, …) from holding GPU resources and
+    blocking the next subprocess.
+
+    Args:
+        cmd: Command to run (list of strings).
+        env: Optional environment dict. Defaults to inheriting the parent env.
+        timeout_sec: Per-subprocess wall-clock timeout in seconds.
+            Defaults to ``_SUBPROCESS_TIMEOUT_SEC`` (env ``ISAACLAB_ARENA_SUBPROCESS_TIMEOUT``, fallback 600).
+        capture_output: If True, capture stdout/stderr and return a
+            ``CompletedProcess``. When False (default) output streams to
+            the parent process and the function returns None on success.
+
+    Returns:
+        ``CompletedProcess`` when *capture_output* is True, else None.
+    """
+    if timeout_sec is None:
+        timeout_sec = _SUBPROCESS_TIMEOUT_SEC
+
+    print(f"Running command (timeout={timeout_sec}s): {cmd}")
     global _AT_LEAST_ONE_TEST_FAILED
+
+    if env is None:
+        env = os.environ.copy()
+    env["ISAACLAB_ARENA_FORCE_EXIT_ON_COMPLETE"] = "1"
+
     try:
         result = subprocess.run(
             cmd,
-            check=True,
             env=env,
-            # Don't capture output, let it flow through in real-time
-            capture_output=False,
-            text=True,
-            # Explicitly set stdout and stderr to None to use parent process's pipes
-            stdout=None,
-            stderr=None,
+            timeout=timeout_sec,
+            capture_output=capture_output,
+            text=capture_output,
+            start_new_session=True,
         )
-        print(f"Command completed with return code: {result.returncode}")
-    except subprocess.CalledProcessError as e:
-        sys.stderr.write(f"Command failed with return code {e.returncode}: {e}\n")
+    except subprocess.TimeoutExpired:
+        sys.stderr.write(f"\n[isaaclab-arena] Subprocess timed out after {timeout_sec}s\n")
         _AT_LEAST_ONE_TEST_FAILED = True
-        raise e
+        raise subprocess.SubprocessError(f"Subprocess timed out after {timeout_sec}s: {cmd}")
+
+    print(f"Command completed with return code: {result.returncode}")
+    if result.returncode != 0:
+        sys.stderr.write(f"Command failed with return code {result.returncode}\n")
+        if capture_output and result.stderr:
+            sys.stderr.write(result.stderr)
+        _AT_LEAST_ONE_TEST_FAILED = True
+        raise subprocess.CalledProcessError(result.returncode, cmd, result.stdout, result.stderr)
+
+    if capture_output:
+        return result
+    return None


 class _IsolatedArgv:
@@ -108,7 +152,7 @@ def get_persistent_simulation_app(headless: bool, enable_cameras: bool = False)
         first_headless, first_enable_cameras = _PERSISTENT_INIT_ARGS
         if (headless != first_headless) or (enable_cameras != first_enable_cameras):
             print(
-                "[isaac-arena] Warning: persistent SimulationApp already initialized with "
+                "[isaaclab-arena] Warning: persistent SimulationApp already initialized with "
                 f"headless={first_headless}, enable_cameras={first_enable_cameras}. "
                 "Ignoring new values."
             )
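
The timeout handling in `run_subprocess` relies on standard `subprocess.run` behavior: when `timeout` elapses, the direct child is killed and `subprocess.TimeoutExpired` is raised. The sketch below isolates that pattern; `run_with_timeout` and its string return values are hypothetical names chosen for illustration, not part of the commit.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_sec):
    """Run cmd in its own session and classify the outcome.

    Returns "ok", "failed" (non-zero exit code), or "timeout".
    """
    try:
        result = subprocess.run(
            cmd,
            timeout=timeout_sec,
            # POSIX only: the child calls setsid() and leads a new session.
            start_new_session=True,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() has already killed the direct child here.
        return "timeout"
    return "ok" if result.returncode == 0 else "failed"


fast = [sys.executable, "-c", "pass"]
slow = [sys.executable, "-c", "import time; time.sleep(30)"]

print(run_with_timeout(fast, timeout_sec=10))
print(run_with_timeout(slow, timeout_sec=1))
```

Note that only the direct child is killed on timeout. Grandchildren started by that child can survive, which is presumably why the commit pairs the timeout with child-side cleanup.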

isaaclab_arena/utils/isaaclab_utils/simulation_app.py

Lines changed: 49 additions & 5 deletions
@@ -18,9 +18,18 @@ def get_isaac_sim_version() -> str:
     return omni.kit.app.get_app().get_app_version()


+STARTUP_COMPLETE_MARKER = "[isaaclab-arena] AppLauncher initialization complete"
+
+
 def get_app_launcher(args: argparse.Namespace) -> AppLauncher:
     """Get an app launcher."""
+    import time
+
+    t0 = time.monotonic()
     app_launcher = AppLauncher(args)
+    elapsed = time.monotonic() - t0
+    sys.__stderr__.write(f"{STARTUP_COMPLETE_MARKER} ({elapsed:.1f}s)\n")
+    sys.__stderr__.flush()
     return app_launcher


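
The startup timing above uses `time.monotonic()`, which is not affected by system clock adjustments, and writes to `sys.__stderr__`, the interpreter's original stderr object, presumably so the marker is still emitted when `sys.stderr` has been replaced. A minimal sketch of the same pattern as a reusable context manager follows; `log_elapsed` is a hypothetical helper, not part of the commit.

```python
import sys
import time
from contextlib import contextmanager


@contextmanager
def log_elapsed(label, stream=None):
    """Write '<label> (<seconds>s)' to stream once the block finishes."""
    if stream is None:
        # sys.__stderr__ can be None in some embedded setups; fall back.
        stream = sys.__stderr__ or sys.stderr
    t0 = time.monotonic()
    try:
        yield
    finally:
        elapsed = time.monotonic() - t0
        stream.write(f"{label} ({elapsed:.1f}s)\n")
        stream.flush()


with log_elapsed("[example] startup complete"):
    time.sleep(0.2)  # stand-in for a slow startup call
```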
@@ -87,6 +96,26 @@ def reapply_viewer_cfg(env) -> None:
     vcc.update_view_location()


+def _kill_child_processes() -> None:
+    """SIGKILL all direct child processes of the current process via /proc."""
+    import signal
+
+    my_pid = os.getpid()
+    with suppress(FileNotFoundError, PermissionError):
+        for entry in os.scandir("/proc"):
+            if not entry.name.isdigit():
+                continue
+            try:
+                with open(f"/proc/{entry.name}/status") as f:
+                    for line in f:
+                        if line.startswith("PPid:"):
+                            if int(line.split()[1]) == my_pid:
+                                os.kill(int(entry.name), signal.SIGKILL)
+                            break
+            except (FileNotFoundError, PermissionError, ProcessLookupError, ValueError):
+                continue
+
+
 class SimulationAppContext:
     """Context manager for launching and closing a simulation app."""

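
`_kill_child_processes()` finds children by reading the `PPid:` field of each `/proc/<pid>/status` file, which is Linux-specific. The sketch below shows the same lookup without sending any signal, so it can be run safely to inspect what would be targeted; `find_direct_children` is a hypothetical name, and on systems without procfs it simply returns an empty list.

```python
import os
import subprocess
import sys


def find_direct_children(parent_pid):
    """Return PIDs whose PPid in /proc/<pid>/status equals parent_pid."""
    children = []
    try:
        entries = list(os.scandir("/proc"))
    except (FileNotFoundError, PermissionError):
        return children  # no readable procfs on this system
    for entry in entries:
        if not entry.name.isdigit():
            continue  # not a process directory
        try:
            with open(f"/proc/{entry.name}/status") as f:
                for line in f:
                    if line.startswith("PPid:"):
                        if int(line.split()[1]) == parent_pid:
                            children.append(int(entry.name))
                        break  # PPid found, skip the rest of the file
        except (FileNotFoundError, PermissionError, ProcessLookupError, ValueError):
            continue  # process exited or is unreadable; ignore it
    return children


child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
try:
    print(find_direct_children(os.getpid()))
finally:
    child.kill()
    child.wait()
```

Only direct children are matched. Processes further down the tree have a different `PPid` and would need a recursive walk, which the commit does not do.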
@@ -110,15 +139,30 @@ def __enter__(self):

     def __exit__(self, exc_type, exc_val, exc_tb):
         print("Closing simulation app")
-        # app_launcher.close() will terminate the whole process with exit code 0, i.e. preventing errors from being seen by the caller. There are seemingly no ways around this.
-        # As a workaround, we call os._exit(1) that terminates immediately. The downside is that any cleanup would be omitted
-        if exc_type is None:
-            self.app_launcher.app.close()
-        else:
+        if exc_type is not None:
             print(f"Exception caught in SimulationAppContext: {exc_type.__name__}: {exc_val}")
             print("Traceback:")
             traceback.print_exception(exc_type, exc_val, exc_tb)
             print("Killing the process without cleaning up")
             sys.stdout.flush()
             sys.stderr.flush()
             os._exit(1)
+
+        # When launched as a test subprocess, skip app.close() which can hang
+        # indefinitely in Kit's shutdown path.
+        if os.environ.get("ISAACLAB_ARENA_FORCE_EXIT_ON_COMPLETE") == "1":
+            print("Force-exiting subprocess (ISAACLAB_ARENA_FORCE_EXIT_ON_COMPLETE=1)")
+            sys.stdout.flush()
+            sys.stderr.flush()
+            # SIGKILL orphaned Kit children (shader compiler, GPU workers, …)
+            # so they don't hold GPU resources and block the next test subprocess.
+            # We target each child individually via /proc to avoid signalling
+            # ourselves (Kit installs a C-level SIGTERM handler that overrides
+            # Python's SIG_IGN, so os.killpg is not safe here).
+            _kill_child_processes()
+            os._exit(0)
+
+        # Normal interactive / non-test path: attempt a clean Kit shutdown.
+        # app.close() may terminate the process with exit code 0 regardless of
+        # errors — see the error branch above for the workaround.
+        self.app_launcher.app.close()
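
Both exit branches above flush `sys.stdout` and `sys.stderr` before calling `os._exit()`. That matters because `os._exit()` ends the process immediately, without running `atexit` handlers or flushing stdio buffers, so unflushed output can be lost when stdout is a pipe. The following sketch demonstrates the flush-then-exit sequence from the parent's point of view; the child script is illustrative only and does not involve Kit or `SimulationAppContext`.

```python
import subprocess
import sys

# Child: print, flush explicitly, then exit without interpreter teardown.
CHILD_SCRIPT = """
import os, sys
print("work finished")
sys.stdout.flush()  # os._exit() would otherwise drop buffered output
sys.stderr.flush()
os._exit(0)         # skips atexit handlers and normal cleanup
"""

result = subprocess.run(
    [sys.executable, "-c", CHILD_SCRIPT],
    capture_output=True,
    text=True,
    timeout=30,
)
print(result.returncode, result.stdout.strip())
```

The trade-off is the one the code comments already name: anything that depends on normal shutdown is skipped, so this pattern only fits processes whose results have already been written out.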

isaaclab_arena_gr00t/tests/test_gr00t_closedloop_policy.py

Lines changed: 1 addition & 1 deletion
@@ -236,7 +236,7 @@ def test_g1_locomanip_gr00t_closedloop_policy_runner_eval_runner(gr00t_finetuned
                 "name": "gr1_open_microwave_cracker_box",
                 "arena_env_args": {
                     "environment": "gr1_open_microwave",
-                    "num_envs": 10,
+                    "num_envs": 3,
                     "object": "cracker_box",
                     "embodiment": "gr1_joint",
                 },
