Skip to content

Commit f71c9bd

Browse files
Move ovphysx lifecycle workaround into OvPhysxManager
The dual-Carbonite destructor race that crashed test_rigid_object.py between tests is not test-specific: any production caller that closes a SimulationContext and constructs a new one in the same process hits the same path. The previous PR worked around it with session-scoped class-level monkey patches in the test file (_PHYSX_BY_DEVICE cache, _patched_release_physx, _patched_warmup_and_load); production code remained latently buggy. Move the workaround into the manager itself: * _release_physx is now a soft reset -- physx.reset() + wait_op while keeping the cached ovphysx.PhysX reference alive on cls._physx, so the C++ destructor never fires mid-process. * _warmup_and_load reuses the cached instance on subsequent calls. First call constructs + locks the device + registers atexit; later calls re-export the USD, run physx.reset() to clear the prior stage, add_usd, replay clones, and (on GPU) re-run warmup_gpu so the new stage's bodies are resident. * New _locked_device class attribute mirrors the wheel's process-global device-mode lock; the manager raises RuntimeError with a clear message instead of letting the wheel's PhysXDeviceError fire. * _construct_physx splits out the first-time bootstrap so the orchestration in _warmup_and_load stays linear. * HACK comments on _release_physx are keyed to the same wheel-fix milestone (namespace-isolated Carbonite) tracked by gap G5. The dict keyed by device (_PHYSX_BY_DEVICE) is gone -- since the wheel locks the process to one device, the dict could never hold more than one entry and was misleading. In test_rigid_object.py, drop the entire patch block (~150 lines): _PHYSX_BY_DEVICE, _patched_release_physx, _patched_warmup_and_load, _orig_*, _ovphysx_session_patches. Keep _LOCKED_DEVICE plus _ovphysx_skip_other_device autouse fixture -- the only remaining test-side concern is pre-empting the device-lock RuntimeError with a clean pytest.skip on parametrized device mismatch, so two-pass CI finishes cleanly when only one device is exercised. Update test_warmup_attach_stage_not_called_for_cpu to spy on a real PhysX after construction (the old approach silently relied on the patch overwriting cls._physx). Tests: 42 CPU + 41 GPU passing (two pytest invocations per Marco's two-pass requirement). Bumps isaaclab_ovphysx 0.2.16 -> 0.2.17.
1 parent b340a55 commit f71c9bd

4 files changed

Lines changed: 177 additions & 208 deletions

File tree

source/isaaclab_ovphysx/config/extension.toml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,7 @@
11
[package]
22

33
# Note: Semantic Versioning is used: https://semver.org/
4-
version = "0.2.16"
4+
version = "0.2.17"
55

66
# Description
77
title = "OvPhysX simulation interfaces for IsaacLab core package"

source/isaaclab_ovphysx/docs/CHANGELOG.rst

Lines changed: 26 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,32 @@
11
Changelog
22
---------
33

4+
0.2.17 (2026-05-05)
5+
~~~~~~~~~~~~~~~~~~~~
6+
7+
Changed
8+
^^^^^^^
9+
10+
* Made :meth:`~isaaclab_ovphysx.physics.OvPhysxManager._release_physx` a
11+
soft reset that calls ``physx.reset()`` and keeps the cached
12+
:class:`ovphysx.PhysX` reference alive, instead of dropping it to ``None``
13+
(which triggered a dual-Carbonite destructor race on refcount drop).
14+
:meth:`~isaaclab_ovphysx.physics.OvPhysxManager._warmup_and_load` now
15+
reuses the cached instance on subsequent calls, re-running ``add_usd``,
16+
pending clones, and (on GPU) ``warmup_gpu`` per stage swap. This makes
17+
back-to-back :class:`SimulationContext` lifetimes work natively without
18+
the test-side monkey patches the previous iteration of the rigid-object
19+
tests required.
20+
21+
Added
22+
^^^^^
23+
24+
* Added :attr:`~isaaclab_ovphysx.physics.OvPhysxManager._locked_device` so
25+
the manager raises a clear :exc:`RuntimeError` when a later
26+
:class:`SimulationContext` requests a different device, surfacing the
27+
wheel's process-global device-mode lock as a Python error before
28+
:exc:`ovphysx.types.PhysXDeviceError` would fire.
29+
430
0.2.16 (2026-04-30)
531
~~~~~~~~~~~~~~~~~~~~
632

source/isaaclab_ovphysx/isaaclab_ovphysx/physics/ovphysx_manager.py

Lines changed: 119 additions & 48 deletions
Original file line numberDiff line numberDiff line change
@@ -46,6 +46,12 @@ class OvPhysxManager(PhysicsManager):
4646
_stage_path: ClassVar[str | None] = None
4747
_warmup_done: ClassVar[bool] = False
4848
_tmp_dir: ClassVar[tempfile.TemporaryDirectory | None] = None
49+
# Device the process is locked to once :meth:`_warmup_and_load` constructs the
50+
# ``ovphysx.PhysX`` instance for the first time. ``ovphysx<=0.3.7`` enforces
51+
# a process-global device-mode lock at the C++ layer (see HACK note on
52+
# :meth:`_release_physx`); we mirror it here so a clear Python error is raised
53+
# if a later :class:`~isaaclab.sim.SimulationContext` requests a different device.
54+
_locked_device: ClassVar[str | None] = None
4955
# Pending (source, targets, parent_positions) triples registered by
5056
# ovphysx_replicate() before the PhysX instance exists. Replayed via
5157
# physx.clone() in _warmup_and_load().
@@ -84,13 +90,20 @@ def register_clone(
8490
def initialize(cls, sim_context: SimulationContext) -> None:
8591
"""Initialize the physics manager with simulation context.
8692
87-
This stores the config and device but does not create the ovphysx
88-
instance yet -- the USD stage may not be fully populated at this point.
89-
The actual creation happens lazily in :meth:`reset`.
93+
This stores the config and device but does not load the USD stage yet --
94+
the stage may not be fully populated at this point. The actual load
95+
happens lazily in :meth:`reset`.
96+
97+
``cls._physx`` is intentionally not cleared here: the ovphysx C++ instance
98+
is process-global (see HACK on :meth:`_release_physx`). When a previous
99+
:class:`SimulationContext` has already constructed it, we reuse it rather
100+
than dropping the only Python reference (which would trigger the
101+
destructor race) or re-constructing (which would hit the wheel's
102+
device-mode lock). ``cls._locked_device`` carries the device the cached
103+
instance is bound to.
90104
"""
91105
super().initialize(sim_context)
92106
cls._warmup_done = False
93-
cls._physx = None
94107
cls._usd_handle = None
95108
cls._stage_path = None
96109
cls._pending_clones = []
@@ -143,15 +156,27 @@ def close(cls) -> None:
143156

144157
@classmethod
145158
def _release_physx(cls) -> None:
146-
"""Release the ovphysx instance if it exists. Safe to call multiple times.
147-
148-
With ovphysx<=0.3.7 and Kit's pxr in the same process, physx.release()
149-
deadlocks due to dual-Carbonite static destructor races. Skip the
150-
native release and let os._exit() (registered via atexit) terminate the
151-
process; GPU resources are reclaimed by the driver.
159+
"""Soft-reset the ovphysx runtime stage; keep the C++ instance alive.
160+
161+
Calls ``physx.reset()`` to clear the loaded scene, but does **not** drop
162+
the Python reference. The cached :class:`ovphysx.PhysX` is reused by the
163+
next :class:`~isaaclab.sim.SimulationContext` via the reuse path in
164+
:meth:`_warmup_and_load`. Safe to call multiple times.
165+
166+
HACK(ovphysx<=0.3.7): the wheel's bundled libcarb.so and Kit's libcarb.so
167+
coexist in the same process whenever ``import pxr`` runs (Kit USD plugins
168+
on ``LD_LIBRARY_PATH`` pull in Kit's Carbonite). Both register C++ static
169+
destructors that race at process exit -- and crucially, also race when
170+
``ovphysx.PhysX``'s Python destructor fires mid-process via refcount drop.
171+
So we must never let the only Python reference go to zero while the
172+
process is alive. ``os._exit(0)`` (registered via ``atexit`` in
173+
:meth:`_warmup_and_load`) sidesteps the static-destructor phase entirely
174+
at process exit. Remove this workaround once the wheel ships a
175+
namespace-isolated Carbonite (different soname / hidden visibility).
152176
"""
153177
if cls._physx is not None:
154-
cls._physx = None
178+
op = cls._physx.reset()
179+
cls._physx.wait_op(op)
155180

156181
@classmethod
157182
def get_physx_instance(cls) -> Any:
@@ -164,7 +189,22 @@ def get_physx_instance(cls) -> Any:
164189

165190
@classmethod
166191
def _warmup_and_load(cls) -> None:
167-
"""Export the USD stage, create the ovphysx instance, and load the scene."""
192+
"""Export the USD stage and load it into the ovphysx runtime.
193+
194+
On the first call per process, constructs the :class:`ovphysx.PhysX`
195+
instance, registers the ``atexit`` handler, and locks the process to
196+
the resolved device. On subsequent calls, reuses the cached instance
197+
(see HACK on :meth:`_release_physx`) -- exporting the new USD,
198+
re-attaching it via ``add_usd``, replaying pending clones, and (on GPU)
199+
re-running ``warmup_gpu`` so the new stage's bodies are resident.
200+
201+
Raises:
202+
RuntimeError: if ``SimulationContext`` is not set, or if a device
203+
different from the process-locked one is requested. The wheel
204+
enforces a process-global device-mode lock at the C++ layer;
205+
we surface it here as a clear Python error before the wheel
206+
would raise :exc:`ovphysx.types.PhysXDeviceError`.
207+
"""
168208
sim = PhysicsManager._sim
169209
if sim is None:
170210
raise RuntimeError("OvPhysxManager: SimulationContext is not set.")
@@ -178,6 +218,13 @@ def _warmup_and_load(cls) -> None:
178218
gpu_index = 0
179219
ovphysx_device = "cpu"
180220

221+
if cls._locked_device is not None and ovphysx_device != cls._locked_device:
222+
raise RuntimeError(
223+
f"OvPhysxManager is locked to device {cls._locked_device!r} for the lifetime of this process; "
224+
f"cannot switch to {ovphysx_device!r}. ovphysx<=0.3.7 binds device mode at the C++ layer on the "
225+
"first ovphysx.PhysX(...) construction and it cannot be changed without restarting the process."
226+
)
227+
181228
scene_prim = sim.stage.GetPrimAtPath(sim.cfg.physics_prim_path)
182229
if scene_prim.IsValid():
183230
cls._configure_physx_scene_prim(scene_prim, PhysicsManager._cfg, ovphysx_device)
@@ -189,6 +236,66 @@ def _warmup_and_load(cls) -> None:
189236
cls._stage_path = stage_file
190237
logger.info("OvPhysxManager: exported USD stage to %s", stage_file)
191238

239+
if cls._physx is None:
240+
cls._construct_physx(ovphysx_device, gpu_index)
241+
cls._locked_device = ovphysx_device
242+
else:
243+
# Reuse path: the cached PhysX may still hold the prior stage (the
244+
# wheel allows only one loaded USD at a time). ``physx.reset()`` is
245+
# idempotent on an already-cleared stage and required when this is
246+
# a second :meth:`_warmup_and_load` within the same SimulationContext
247+
# (e.g. when a caller manually clears ``_warmup_done`` to force a
248+
# re-warmup).
249+
op = cls._physx.reset()
250+
cls._physx.wait_op(op)
251+
252+
usd_handle, op_idx = cls._physx.add_usd(stage_file)
253+
cls._physx.wait_op(op_idx)
254+
cls._usd_handle = usd_handle
255+
logger.info("OvPhysxManager: loaded USD into ovphysx (device=%s)", ovphysx_device)
256+
257+
# Replay pending physics clones registered by ovphysx_replicate().
258+
# The USD stage contains only env_0's physics; env_1..N are empty
259+
# Xform containers. physx.clone() creates the remaining environments
260+
# in the physics runtime without modifying the USD file.
261+
if cls._pending_clones:
262+
# ovphysx_replicate() only registers pending clones when clone_usd=False,
263+
# meaning the USD contains only env_0 physics and physx.clone() is required
264+
# to populate env_1..N in the physics runtime. Execute unconditionally —
265+
# no USD content heuristic is needed.
266+
for source, targets, parent_positions in cls._pending_clones:
267+
logger.info(
268+
"OvPhysxManager: cloning %s -> %d targets (%s ... %s)",
269+
source,
270+
len(targets),
271+
targets[0],
272+
targets[-1],
273+
)
274+
if parent_positions:
275+
transforms = [(x, y, z, 0.0, 0.0, 0.0, 1.0) for x, y, z in parent_positions]
276+
else:
277+
transforms = None
278+
op_idx = cls._physx.clone(source, targets, transforms)
279+
cls._physx.wait_op(op_idx)
280+
cls._pending_clones = []
281+
282+
# GPU bodies must be re-warmed after every add_usd: the cached PhysX
283+
# instance carries its old buffer layout from the previous stage.
284+
if ovphysx_device == "gpu":
285+
cls._physx.warmup_gpu()
286+
287+
cls.dispatch_event(PhysicsEvent.MODEL_INIT, payload={})
288+
cls._warmup_done = True
289+
290+
@classmethod
291+
def _construct_physx(cls, ovphysx_device: str, gpu_index: int) -> None:
292+
"""Bootstrap the ``ovphysx`` wheel and create the :class:`ovphysx.PhysX` instance.
293+
294+
Runs once per process. Configures worker threads, registers the
295+
process-exit ``os._exit(0)`` handler, and stores the result on
296+
``cls._physx``. See HACK on :meth:`_release_physx` for why the
297+
instance must outlive every individual :class:`SimulationContext`.
298+
"""
192299
# HACK (temporary): hide pxr from sys.modules during ovphysx bootstrap.
193300
# IsaacSim's pxr reports version 0.25.5 (pip convention) while ovphysx
194301
# expects 25.11 (OpenUSD release convention). Hiding pxr causes
@@ -258,42 +365,6 @@ def _atexit_release_and_exit():
258365
atexit.register(_atexit_release_and_exit)
259366
cls._atexit_registered = True
260367

261-
usd_handle, op_idx = cls._physx.add_usd(stage_file)
262-
cls._physx.wait_op(op_idx)
263-
cls._usd_handle = usd_handle
264-
logger.info("OvPhysxManager: loaded USD into ovphysx (device=%s)", ovphysx_device)
265-
266-
# Replay pending physics clones registered by ovphysx_replicate().
267-
# The USD stage contains only env_0's physics; env_1..N are empty
268-
# Xform containers. physx.clone() creates the remaining environments
269-
# in the physics runtime without modifying the USD file.
270-
if cls._pending_clones:
271-
# ovphysx_replicate() only registers pending clones when clone_usd=False,
272-
# meaning the USD contains only env_0 physics and physx.clone() is required
273-
# to populate env_1..N in the physics runtime. Execute unconditionally —
274-
# no USD content heuristic is needed.
275-
for source, targets, parent_positions in cls._pending_clones:
276-
logger.info(
277-
"OvPhysxManager: cloning %s -> %d targets (%s ... %s)",
278-
source,
279-
len(targets),
280-
targets[0],
281-
targets[-1],
282-
)
283-
if parent_positions:
284-
transforms = [(x, y, z, 0.0, 0.0, 0.0, 1.0) for x, y, z in parent_positions]
285-
else:
286-
transforms = None
287-
op_idx = cls._physx.clone(source, targets, transforms)
288-
cls._physx.wait_op(op_idx)
289-
cls._pending_clones = []
290-
291-
if ovphysx_device == "gpu":
292-
cls._physx.warmup_gpu()
293-
294-
cls.dispatch_event(PhysicsEvent.MODEL_INIT, payload={})
295-
cls._warmup_done = True
296-
297368
@staticmethod
298369
def _configure_physx_scene_prim(scene_prim, cfg, device: str) -> None:
299370
"""Apply PhysxSceneAPI schema and device-specific scene attributes to the

0 commit comments

Comments
 (0)