Move ovphysx lifecycle workaround into OvPhysxManager

AntoineRichard · AntoineRichard · commit f71c9bd90349 · 2026-05-05T16:27:04.000+02:00
The dual-Carbonite destructor race that crashed test_rigid_object.py
between tests is not test-specific: any production caller that closes
a SimulationContext and constructs a new one in the same process hits
the same path. The previous PR worked around it with session-scoped
class-level monkey patches in the test file (_PHYSX_BY_DEVICE cache,
_patched_release_physx, _patched_warmup_and_load); production code
remained latently buggy.

Move the workaround into the manager itself:

* _release_physx is now a soft reset: it calls physx.reset() and waits
  on the returned op while keeping the cached ovphysx.PhysX reference
  alive on cls._physx, so the C++ destructor never fires mid-process.
* _warmup_and_load reuses the cached instance on subsequent calls.
  The first call constructs the instance, locks the device, and
  registers the atexit handler; later calls re-export the USD, run
  physx.reset() to clear the prior stage, call add_usd, replay clones,
  and (on GPU) re-run warmup_gpu so the new stage's bodies are resident.
* New _locked_device class attribute mirrors the wheel's process-global
  device-mode lock; the manager raises RuntimeError with a clear
  message instead of letting the wheel's PhysXDeviceError fire.
* _construct_physx splits out the first-time bootstrap so the orchestration
  in _warmup_and_load stays linear.
* HACK comments on _release_physx are keyed to the same wheel-fix
  milestone (namespace-isolated Carbonite) tracked by gap G5.
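
A minimal sketch of the lifecycle described in the bullets above, for
readers who want the shape without reading the diff. FakePhysX, Manager,
and the file names are hypothetical stand-ins, not the real ovphysx API;
only the reset / wait_op / add_usd call order and the device check are
meant to mirror the change.

```python
from typing import ClassVar, Optional


class FakePhysX:
    """Hypothetical stand-in for ovphysx.PhysX; it only records calls."""

    constructed = 0

    def __init__(self, device: str) -> None:
        FakePhysX.constructed += 1
        self.device = device
        self.stage: Optional[str] = None
        self.resets = 0

    def reset(self) -> int:
        self.stage = None
        self.resets += 1
        return self.resets  # fake operation index

    def wait_op(self, op: int) -> None:
        pass  # the fake completes synchronously

    def add_usd(self, path: str):
        self.stage = path
        return ("handle:" + path, 0)


class Manager:
    """Toy manager: one cached instance and one device lock per process."""

    _physx: ClassVar[Optional[FakePhysX]] = None
    _locked_device: ClassVar[Optional[str]] = None

    @classmethod
    def warmup_and_load(cls, device: str, stage_file: str) -> str:
        # Fail early with a clear error instead of re-constructing.
        if cls._locked_device is not None and device != cls._locked_device:
            raise RuntimeError(
                f"locked to {cls._locked_device!r}; cannot switch to {device!r}"
            )
        if cls._physx is None:
            # First call in the process: construct and lock the device.
            cls._physx = FakePhysX(device)
            cls._locked_device = device
        else:
            # Reuse path: clear whatever stage the cached instance holds.
            cls._physx.wait_op(cls._physx.reset())
        handle, op = cls._physx.add_usd(stage_file)
        cls._physx.wait_op(op)
        return handle

    @classmethod
    def release(cls) -> None:
        # Soft reset: the reference on cls._physx is never dropped.
        if cls._physx is not None:
            cls._physx.wait_op(cls._physx.reset())


# Two back-to-back "contexts" in one process share a single instance.
Manager.warmup_and_load("cpu", "a.usda")
first = Manager._physx
Manager.release()
Manager.warmup_and_load("cpu", "b.usda")
```

In this sketch the second load goes through the reuse path, so the
constructor runs once and the object held on Manager._physx is the same
one throughout; requesting "gpu" afterwards raises RuntimeError.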

The dict keyed by device (_PHYSX_BY_DEVICE) is gone -- since the wheel
locks the process to one device, the dict could never hold more than
one entry and was misleading.

In test_rigid_object.py, drop the entire patch block (~150 lines):
_PHYSX_BY_DEVICE, _patched_release_physx, _patched_warmup_and_load,
_orig_*, _ovphysx_session_patches. Keep _LOCKED_DEVICE plus
_ovphysx_skip_other_device autouse fixture -- the only remaining
test-side concern is pre-empting the device-lock RuntimeError with a
clean pytest.skip on parametrized device mismatch, so two-pass CI
finishes cleanly when only one device is exercised. Update
test_warmup_attach_stage_not_called_for_cpu to spy on a real PhysX
after construction (the old approach silently relied on the patch
overwriting cls._physx).
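
For illustration, spying on a real object after construction can be
done with the standard library's unittest.mock. This is a sketch under
assumptions: Runtime and its methods are hypothetical, and the method
being spied on here is chosen for the example, not taken from the
actual test.

```python
from unittest import mock


class Runtime:
    """Hypothetical stand-in for the constructed runtime object."""

    def warmup_gpu(self) -> str:
        return "warmed"

    def load(self, device: str) -> None:
        # Only the GPU path touches warmup_gpu.
        if device == "gpu":
            self.warmup_gpu()


runtime = Runtime()  # construct the real object first, then wrap it

# wraps= forwards every call to the real bound method while recording it.
with mock.patch.object(runtime, "warmup_gpu", wraps=runtime.warmup_gpu) as spy:
    runtime.load("cpu")
    cpu_calls = spy.call_count
    runtime.load("gpu")
    gpu_calls = spy.call_count
```

Because the spy wraps the live instance, the assertion no longer depends
on any patch having replaced the object beforehand.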

Tests: 42 CPU + 41 GPU passing (two pytest invocations per Marco's
two-pass requirement).

Bumps isaaclab_ovphysx 0.2.16 -> 0.2.17.
diff --git a/source/isaaclab_ovphysx/config/extension.toml b/source/isaaclab_ovphysx/config/extension.toml
@@ -1,7 +1,7 @@
 [package]
 
 # Note: Semantic Versioning is used: https://semver.org/
-version = "0.2.16"
+version = "0.2.17"
 
 # Description
 title = "OvPhysX simulation interfaces for IsaacLab core package"
diff --git a/source/isaaclab_ovphysx/docs/CHANGELOG.rst b/source/isaaclab_ovphysx/docs/CHANGELOG.rst
@@ -1,6 +1,32 @@
 Changelog
 ---------
 
+0.2.17 (2026-05-05)
+~~~~~~~~~~~~~~~~~~~~
+
+Changed
+^^^^^^^
+
+* Made :meth:`~isaaclab_ovphysx.physics.OvPhysxManager._release_physx` a
+  soft reset that calls ``physx.reset()`` and keeps the cached
+  :class:`ovphysx.PhysX` reference alive, instead of dropping it to ``None``
+  (which triggered a dual-Carbonite destructor race on refcount drop).
+  :meth:`~isaaclab_ovphysx.physics.OvPhysxManager._warmup_and_load` now
+  reuses the cached instance on subsequent calls, re-running ``add_usd``,
+  pending clones, and (on GPU) ``warmup_gpu`` per stage swap.  This makes
+  back-to-back :class:`SimulationContext` lifetimes work natively without
+  the test-side monkey patches the previous iteration of the rigid-object
+  tests required.
+
+Added
+^^^^^
+
+* Added :attr:`~isaaclab_ovphysx.physics.OvPhysxManager._locked_device` so
+  the manager raises a clear :exc:`RuntimeError` when a later
+  :class:`SimulationContext` requests a different device, surfacing the
+  wheel's process-global device-mode lock as a Python error before
+  :exc:`ovphysx.types.PhysXDeviceError` would fire.
+
 0.2.16 (2026-04-30)
 ~~~~~~~~~~~~~~~~~~~~
 
diff --git a/source/isaaclab_ovphysx/isaaclab_ovphysx/physics/ovphysx_manager.py b/source/isaaclab_ovphysx/isaaclab_ovphysx/physics/ovphysx_manager.py
@@ -46,6 +46,12 @@ class OvPhysxManager(PhysicsManager):
     _stage_path: ClassVar[str | None] = None
     _warmup_done: ClassVar[bool] = False
     _tmp_dir: ClassVar[tempfile.TemporaryDirectory | None] = None
+    # Device the process is locked to once :meth:`_warmup_and_load` constructs the
+    # ``ovphysx.PhysX`` instance for the first time.  ``ovphysx<=0.3.7`` enforces
+    # a process-global device-mode lock at the C++ layer (see HACK note on
+    # :meth:`_release_physx`); we mirror it here so a clear Python error is raised
+    # if a later :class:`~isaaclab.sim.SimulationContext` requests a different device.
+    _locked_device: ClassVar[str | None] = None
     # Pending (source, targets, parent_positions) triples registered by
     # ovphysx_replicate() before the PhysX instance exists.  Replayed via
     # physx.clone() in _warmup_and_load().
@@ -84,13 +90,20 @@ def register_clone(
     def initialize(cls, sim_context: SimulationContext) -> None:
         """Initialize the physics manager with simulation context.
 
-        This stores the config and device but does not create the ovphysx
-        instance yet -- the USD stage may not be fully populated at this point.
-        The actual creation happens lazily in :meth:`reset`.
+        This stores the config and device but does not load the USD stage yet --
+        the stage may not be fully populated at this point.  The actual load
+        happens lazily in :meth:`reset`.
+
+        ``cls._physx`` is intentionally not cleared here: the ovphysx C++ instance
+        is process-global (see HACK on :meth:`_release_physx`).  When a previous
+        :class:`SimulationContext` has already constructed it, we reuse it rather
+        than dropping the only Python reference (which would trigger the
+        destructor race) or re-constructing (which would hit the wheel's
+        device-mode lock).  ``cls._locked_device`` carries the device the cached
+        instance is bound to.
         """
         super().initialize(sim_context)
         cls._warmup_done = False
-        cls._physx = None
         cls._usd_handle = None
         cls._stage_path = None
         cls._pending_clones = []
@@ -143,15 +156,27 @@ def close(cls) -> None:
 
     @classmethod
     def _release_physx(cls) -> None:
-        """Release the ovphysx instance if it exists.  Safe to call multiple times.
-
-        With ovphysx<=0.3.7 and Kit's pxr in the same process, physx.release()
-        deadlocks due to dual-Carbonite static destructor races.  Skip the
-        native release and let os._exit() (registered via atexit) terminate the
-        process; GPU resources are reclaimed by the driver.
+        """Soft-reset the ovphysx runtime stage; keep the C++ instance alive.
+
+        Calls ``physx.reset()`` to clear the loaded scene, but does **not** drop
+        the Python reference.  The cached :class:`ovphysx.PhysX` is reused by the
+        next :class:`~isaaclab.sim.SimulationContext` via the reuse path in
+        :meth:`_warmup_and_load`.  Safe to call multiple times.
+
+        HACK(ovphysx<=0.3.7): the wheel's bundled libcarb.so and Kit's libcarb.so
+        coexist in the same process whenever ``import pxr`` runs (Kit USD plugins
+        on ``LD_LIBRARY_PATH`` pull in Kit's Carbonite).  Both register C++ static
+        destructors that race at process exit -- and crucially, also race when
+        ``ovphysx.PhysX``'s Python destructor fires mid-process via refcount drop.
+        So we must never let the only Python reference go to zero while the
+        process is alive.  ``os._exit(0)`` (registered via ``atexit`` in
+        :meth:`_warmup_and_load`) sidesteps the static-destructor phase entirely
+        at process exit.  Remove this workaround once the wheel ships a
+        namespace-isolated Carbonite (different soname / hidden visibility).
         """
         if cls._physx is not None:
-            cls._physx = None
+            op = cls._physx.reset()
+            cls._physx.wait_op(op)
 
     @classmethod
     def get_physx_instance(cls) -> Any:
@@ -164,7 +189,22 @@ def get_physx_instance(cls) -> Any:
 
     @classmethod
     def _warmup_and_load(cls) -> None:
-        """Export the USD stage, create the ovphysx instance, and load the scene."""
+        """Export the USD stage and load it into the ovphysx runtime.
+
+        On the first call per process, constructs the :class:`ovphysx.PhysX`
+        instance, registers the ``atexit`` handler, and locks the process to
+        the resolved device.  On subsequent calls, reuses the cached instance
+        (see HACK on :meth:`_release_physx`) -- exporting the new USD,
+        re-attaching it via ``add_usd``, replaying pending clones, and (on GPU)
+        re-running ``warmup_gpu`` so the new stage's bodies are resident.
+
+        Raises:
+            RuntimeError: if ``SimulationContext`` is not set, or if a device
+                different from the process-locked one is requested.  The wheel
+                enforces a process-global device-mode lock at the C++ layer;
+                we surface it here as a clear Python error before the wheel
+                would raise :exc:`ovphysx.types.PhysXDeviceError`.
+        """
         sim = PhysicsManager._sim
         if sim is None:
             raise RuntimeError("OvPhysxManager: SimulationContext is not set.")
@@ -178,6 +218,13 @@ def _warmup_and_load(cls) -> None:
             gpu_index = 0
             ovphysx_device = "cpu"
 
+        if cls._locked_device is not None and ovphysx_device != cls._locked_device:
+            raise RuntimeError(
+                f"OvPhysxManager is locked to device {cls._locked_device!r} for the lifetime of this process; "
+                f"cannot switch to {ovphysx_device!r}.  ovphysx<=0.3.7 binds device mode at the C++ layer on the "
+                "first ovphysx.PhysX(...) construction and it cannot be changed without restarting the process."
+            )
+
         scene_prim = sim.stage.GetPrimAtPath(sim.cfg.physics_prim_path)
         if scene_prim.IsValid():
             cls._configure_physx_scene_prim(scene_prim, PhysicsManager._cfg, ovphysx_device)
@@ -189,6 +236,66 @@ def _warmup_and_load(cls) -> None:
         cls._stage_path = stage_file
         logger.info("OvPhysxManager: exported USD stage to %s", stage_file)
 
+        if cls._physx is None:
+            cls._construct_physx(ovphysx_device, gpu_index)
+            cls._locked_device = ovphysx_device
+        else:
+            # Reuse path: the cached PhysX may still hold the prior stage (the
+            # wheel allows only one loaded USD at a time).  ``physx.reset()`` is
+            # idempotent on an already-cleared stage and required when this is
+            # a second :meth:`_warmup_and_load` within the same SimulationContext
+            # (e.g. when a caller manually clears ``_warmup_done`` to force a
+            # re-warmup).
+            op = cls._physx.reset()
+            cls._physx.wait_op(op)
+
+        usd_handle, op_idx = cls._physx.add_usd(stage_file)
+        cls._physx.wait_op(op_idx)
+        cls._usd_handle = usd_handle
+        logger.info("OvPhysxManager: loaded USD into ovphysx (device=%s)", ovphysx_device)
+
+        # Replay pending physics clones registered by ovphysx_replicate().
+        # The USD stage contains only env_0's physics; env_1..N are empty
+        # Xform containers.  physx.clone() creates the remaining environments
+        # in the physics runtime without modifying the USD file.
+        if cls._pending_clones:
+            # ovphysx_replicate() only registers pending clones when clone_usd=False,
+            # meaning the USD contains only env_0 physics and physx.clone() is required
+            # to populate env_1..N in the physics runtime.  Execute unconditionally —
+            # no USD content heuristic is needed.
+            for source, targets, parent_positions in cls._pending_clones:
+                logger.info(
+                    "OvPhysxManager: cloning %s -> %d targets (%s ... %s)",
+                    source,
+                    len(targets),
+                    targets[0],
+                    targets[-1],
+                )
+                if parent_positions:
+                    transforms = [(x, y, z, 0.0, 0.0, 0.0, 1.0) for x, y, z in parent_positions]
+                else:
+                    transforms = None
+                op_idx = cls._physx.clone(source, targets, transforms)
+                cls._physx.wait_op(op_idx)
+            cls._pending_clones = []
+
+        # GPU bodies must be re-warmed after every add_usd: the cached PhysX
+        # instance carries its old buffer layout from the previous stage.
+        if ovphysx_device == "gpu":
+            cls._physx.warmup_gpu()
+
+        cls.dispatch_event(PhysicsEvent.MODEL_INIT, payload={})
+        cls._warmup_done = True
+
+    @classmethod
+    def _construct_physx(cls, ovphysx_device: str, gpu_index: int) -> None:
+        """Bootstrap the ``ovphysx`` wheel and create the :class:`ovphysx.PhysX` instance.
+
+        Runs once per process.  Configures worker threads, registers the
+        process-exit ``os._exit(0)`` handler, and stores the result on
+        ``cls._physx``.  See HACK on :meth:`_release_physx` for why the
+        instance must outlive every individual :class:`SimulationContext`.
+        """
         # HACK (temporary): hide pxr from sys.modules during ovphysx bootstrap.
         # IsaacSim's pxr reports version 0.25.5 (pip convention) while ovphysx
         # expects 25.11 (OpenUSD release convention).  Hiding pxr causes
@@ -258,42 +365,6 @@ def _atexit_release_and_exit():
             atexit.register(_atexit_release_and_exit)
             cls._atexit_registered = True
 
-        usd_handle, op_idx = cls._physx.add_usd(stage_file)
-        cls._physx.wait_op(op_idx)
-        cls._usd_handle = usd_handle
-        logger.info("OvPhysxManager: loaded USD into ovphysx (device=%s)", ovphysx_device)
-
-        # Replay pending physics clones registered by ovphysx_replicate().
-        # The USD stage contains only env_0's physics; env_1..N are empty
-        # Xform containers.  physx.clone() creates the remaining environments
-        # in the physics runtime without modifying the USD file.
-        if cls._pending_clones:
-            # ovphysx_replicate() only registers pending clones when clone_usd=False,
-            # meaning the USD contains only env_0 physics and physx.clone() is required
-            # to populate env_1..N in the physics runtime.  Execute unconditionally —
-            # no USD content heuristic is needed.
-            for source, targets, parent_positions in cls._pending_clones:
-                logger.info(
-                    "OvPhysxManager: cloning %s -> %d targets (%s ... %s)",
-                    source,
-                    len(targets),
-                    targets[0],
-                    targets[-1],
-                )
-                if parent_positions:
-                    transforms = [(x, y, z, 0.0, 0.0, 0.0, 1.0) for x, y, z in parent_positions]
-                else:
-                    transforms = None
-                op_idx = cls._physx.clone(source, targets, transforms)
-                cls._physx.wait_op(op_idx)
-            cls._pending_clones = []
-
-        if ovphysx_device == "gpu":
-            cls._physx.warmup_gpu()
-
-        cls.dispatch_event(PhysicsEvent.MODEL_INIT, payload={})
-        cls._warmup_done = True
-
     @staticmethod
     def _configure_physx_scene_prim(scene_prim, cfg, device: str) -> None:
         """Apply PhysxSceneAPI schema and device-specific scene attributes to the
diff --git a/source/isaaclab_ovphysx/test/assets/test_rigid_object.py b/source/isaaclab_ovphysx/test/assets/test_rigid_object.py