Adds Newton + Isaac RTX Rendering Performance Optimizations (#5017)

ncournia-nv · web-flow · commit e7f34ebcec7c · 2026-04-01T13:06:12.000-07:00
# Newton + Isaac RTX Rendering Performance Optimizations

This document describes three performance optimizations applied to the Newton physics simulator when used with the Isaac Sim RTX renderer inside Isaac Lab. Together they reduce per-frame time from **~323 ms to ~60 ms** (a **5.4× speedup**), making Newton's rendering path slightly faster than PhysX's equivalent (~65 ms).

All live primarily in two files:

- `source/isaaclab_newton/isaaclab_newton/physics/newton_manager.py`
- `source/isaaclab_newton/isaaclab_newton/physics/_cubric.py` (new)

with small additions to `PhysicsManager` and `SimulationContext` in the core `isaaclab` package.

---

## Baseline: ~323 ms per frame

The starting point is the unoptimized Newton + RTX rendering loop. An Nsight Systems trace reveals the structure:

- **Two physics steps** execute per frame (typical when running 2 physics substeps per render frame).
- **After each physics step**, Newton writes updated body transforms to Fabric (Omniverse's GPU scene-graph cache) and then triggers a full CPU hierarchy update via `update_world_xforms()`. This hierarchy walk recomputes every world-space transform in the scene from parent-child relationships, even though Newton already computed the correct world transforms and wrote them directly.
- The Kit renderer also runs its own, lighter, internal hierarchy update.

The per-step Fabric sync and hierarchy update dominate the frame. Because they run after *every* physics step (not just before rendering), the cost is multiplied by the number of substeps.

<img width="2169" height="750" alt="newton-rtx-baseline" src="https://github.com/user-attachments/assets/f7fc0079-9cca-43d2-9ade-9069e29718d4" />

---

## Optimization 1: Dirty-Flag Deferred Sync (~244 ms per frame)

### Problem

Every physics substep was calling `sync_transforms_to_usd()`, which writes Newton body poses to Fabric and then invokes `update_world_xforms()`. The hierarchy update is expensive and only needs to happen once before the renderer reads the scene, not after every substep.

### Solution

A **dirty-flag pattern** decouples physics stepping from Fabric synchronization:

1. **`_mark_transforms_dirty()`**: called at the end of each `_simulate()` call, sets `_transforms_dirty = True`. This is cheap (a boolean assignment).
2. **`sync_transforms_to_usd()`**: now checks `_transforms_dirty` at the top and returns immediately if transforms haven't changed. When dirty, it writes transforms and calls the hierarchy update, then clears the flag.
3. **`pre_render()`**: a new method added to `PhysicsManager` (base class) and overridden by `NewtonManager`. It calls `sync_transforms_to_usd()`. The `SimulationContext.render()` method calls `physics_manager.pre_render()` before updating visualizers and cameras, ensuring transforms are flushed exactly once per render frame.

The key insight is that the renderer only reads scene transforms during `render()`, not during `step()`. By deferring the Fabric write and hierarchy update to render time, we eliminate redundant work when multiple physics substeps run per render frame. For 2 substeps per frame, this cuts the hierarchy update count in half.

### Key code paths

- `_simulate()` → `_mark_transforms_dirty()` (just sets a flag)
- `SimulationContext.render()` → `PhysicsManager.pre_render()` → `NewtonManager.sync_transforms_to_usd()` (runs once, clears the flag)

<img width="2174" height="765" alt="newton-rtx-dirty" src="https://github.com/user-attachments/assets/eae6dbd9-7936-492e-a922-4fff5c0d7861" />

---

## Optimization 2: CUDA Graph Capture, Relaxed Mode (~144 ms per frame)

### Problem

Looking at the physics steps in the trace, the GPU is underutilized. Each Warp kernel launch (collision detection, constraint solve, integration, FK evaluation) incurs a round-trip to the CPU via Python: launch overhead, GIL acquisition, and driver calls. For a simulation with many small kernels per substep, this CPU-side overhead becomes the bottleneck while the GPU sits idle between dispatches.

Newton already supported CUDA graphs (pre-recording a sequence of kernel launches and replaying them with a single driver call), but CUDA graph capture was **disabled when RTX rendering was active**. The original code had:

```python
use_cuda_graph = cfg.use_cuda_graph and (cls._usdrt_stage is None)
```

This was necessary because RTX's background threads use CUDA's legacy stream (stream 0) for async operations like `cudaImportExternalMemory`. Warp's standard `ScopedCapture()` uses `cudaStreamCaptureModeThreadLocal` on a blocking stream, which implicitly synchronizes with legacy stream 0. If RTX ops happen during capture, the CUDA runtime raises error 906 (`cudaErrorStreamCaptureImplicit`).

### Solution

A **deferred, relaxed-mode CUDA graph capture** strategy that is compatible with RTX:

**Deferral:** Graph capture is postponed from `initialize_solver()` to the first `step()` call. By that time, RTX has finished its initialization (all `cudaImportExternalMemory` calls are done) and is idle between render frames, providing a clean capture window.

```python
# In initialize_solver():
cls._graph = None
cls._graph_capture_pending = True

# In step():
if cls._graph_capture_pending:
    cls._graph = cls._capture_relaxed_graph(device)
```

**Relaxed-mode capture** (`_capture_relaxed_graph`): This method works around two conflicting requirements:

1. **RTX compatibility**: RTX threads use legacy stream 0. A blocking stream (Warp's default) implicitly syncs with it, causing capture failures. Solution: create a **non-blocking stream** (`cudaStreamNonBlocking = 0x01`) that has no implicit synchronization with stream 0.
2. **Warp compatibility**: `mujoco_warp` internally calls `wp.capture_while`, which checks Warp's `device.captures` registry to decide whether to insert a conditional graph node or synchronize eagerly. Without a registered capture, it calls `wp.synchronize_stream` on the capturing stream, which is illegal inside graph capture. Solution: call `wp.capture_begin(external=True, stream=fresh_stream)` to register the capture in Warp's tracking without calling `cudaStreamBeginCapture` again (already done externally).

The capture sequence:

1. **Warmup run**: execute `_simulate_physics_only()` eagerly to pre-allocate all MuJoCo-Warp scratch buffers (allocations are forbidden inside graph capture).
2. **Create a non-blocking CUDA stream** via `cudaStreamCreateWithFlags(..., NonBlocking)`.
3. **Begin capture** with `cudaStreamBeginCapture(..., cudaStreamCaptureModeRelaxed)`; relaxed mode allows other streams to operate freely during capture.
4. **Register with Warp** via `wp.capture_begin(external=True, stream=...)`.
5. **Record physics kernels**: `_simulate_physics_only()` inside `wp.ScopedStream(fresh_stream)`.
6. **Finalize**: `wp.capture_end()` then `cudaStreamEndCapture()` to obtain the graph.

**Physics-only capture:** `_simulate_physics_only()` was factored out of `_simulate()` to exclude Fabric sync operations (`wp.synchronize_device`, `wp.fabricarray`) that are incompatible with graph capture. After graph replay, `step()` marks transforms dirty, and `pre_render()` handles the Fabric sync eagerly.

The ctypes binding to `libcudart.so` is used directly because Warp's `ScopedCapture` doesn't expose control over capture mode or stream type.

<img width="2168" height="745" alt="newton-rtx-cuda-graph" src="https://github.com/user-attachments/assets/eecfde04-41f4-488e-97e4-4d87cf617830" />

---

## Optimization 3: GPU Transform Hierarchy via cubric (~60 ms per frame)

### Problem

Even with the dirty-flag pattern reducing hierarchy updates to once per render frame, the `update_world_xforms()` call is still a **CPU-side tree walk** over the entire Fabric scene graph. For scenes with thousands of prims (typical in multi-environment RL), this CPU hierarchy propagation is a significant bottleneck.

The PhysX backend avoids this problem by using **cubric**, a GPU-accelerated transform hierarchy library. cubric runs the parent-child transform propagation entirely on the GPU via `IAdapter::compute()`, which is dramatically faster than the CPU walk. However, cubric has no Python bindings.

### Solution

**Pure-Python ctypes bindings to cubric's Carbonite interface** (`_cubric.py`), allowing Newton to use the same GPU hierarchy propagation that PhysX uses.

cubric is implemented as a Carbonite plugin and exposes its API through the `omni::cubric::IAdapter` interface. The bindings work by:

1. **Acquiring the Carbonite Framework**: `libcarb.so`'s `acquireFramework()` returns the singleton `Framework*`.
2. **Acquiring the IAdapter interface**: calling `tryAcquireInterfaceWithClient()` with the interface descriptor `omni::cubric::IAdapter` version `0.1`.
3. **Wrapping function pointers**: the `IAdapter` struct is a C++ vtable-like struct with function pointers at known offsets. Each function pointer is read from the struct at its byte offset and wrapped with `ctypes.CFUNCTYPE` to make it callable from Python.

**Integration in `sync_transforms_to_usd()`:** The sync method now mirrors PhysX's `ScopedUSDRT` pattern:

1. **Pause Fabric change tracking**: `track_world_xform_changes(False)` and `track_local_xform_changes(False)`. This is critical: `SelectPrims` with `ReadWrite` access internally calls `getAttributeArrayGpu`, which marks Fabric buffers dirty. If tracking is still active, the hierarchy records the change and Kit's `updateWorldXforms` will do an expensive connectivity rebuild every frame.
2. **Write transforms**: the existing Warp kernel writes Newton body poses to Fabric's `omni:fabric:worldMatrix`.
3. **Resume tracking**: re-enable change tracking (in a `finally` block for safety).
4. **Run cubric compute**: `IAdapter::compute()` with `eRigidBody | eForceUpdate` options and `eAll` dirty mode. The `eRigidBody` flag tells cubric to use **inverse propagation** on prims tagged with `PhysicsRigidBodyAPI` (preserve the world matrix that Newton wrote, derive the local transform) and **forward propagation** on everything else (propagate parent transforms to children). `eForceUpdate` bypasses cubric's change-listener dirty check since we know transforms have changed.

The adapter is lazily created on the first `sync_transforms_to_usd()` call rather than during `initialize_solver()`, to avoid startup-ordering issues with the cubric plugin. When cubric is unavailable (e.g., plugin not loaded, CPU-only), the code falls back gracefully to the CPU `update_world_xforms()` path.

```
sync_transforms_to_usd():
  ┌───────────────────────────────────┐
  │ Pause Fabric change tracking      │
  ├───────────────────────────────────┤
  │ SelectPrims (ReadWrite)           │
  │ wp.launch(_set_fabric_transforms) │  ← GPU: write Newton poses to Fabric
  │ wp.synchronize_device()           │
  ├───────────────────────────────────┤
  │ cubric IAdapter::compute()        │  ← GPU: propagate hierarchy
  ├───────────────────────────────────┤
  │ Resume Fabric change tracking     │
  └───────────────────────────────────┘
```

A future Kit release is expected to ship official Python bindings for cubric, at which point the ctypes approach can be replaced.

The result is a frame time of **~60 ms**, slightly faster than PhysX's **~65 ms** on the same scene.

<img width="2169" height="896" alt="newton-rtx-cubric" src="https://github.com/user-attachments/assets/1474b806-fe82-44be-add3-324971ec37a0" />

---

## Summary

| Optimization | Frame Time | Speedup vs. Baseline | Key Technique |
|---|---|---|---|
| Baseline | ~323 ms | 1.0× | Sync + hierarchy after every substep |
| Dirty-flag deferred sync | ~244 ms | 1.3× | Sync once per render frame, not per substep |
| CUDA graph (relaxed mode) | ~144 ms | 2.2× | Eliminate per-kernel CPU launch overhead |
| cubric GPU hierarchy | ~60 ms | 5.4× | GPU hierarchy propagation via ctypes bindings |

All three optimizations are complementary and stack on top of each other. The final result matches or slightly beats the PhysX rendering path (~65 ms) while using Newton as the physics backend.

*Co-developed with Toby Jones (NVIDIA).*
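
For readers who want to see the dirty-flag deferral from Optimization 1 in isolation, here is a minimal standalone sketch. It is not the `NewtonManager` code: the class `DeferredSyncManager`, the helper `run_frames`, and the `sync_count` counter are hypothetical names introduced only for illustration, and the expensive Fabric write plus hierarchy update is replaced by a counter increment.

```python
class DeferredSyncManager:
    """Hypothetical stand-in for a physics manager that defers its sync."""

    def __init__(self) -> None:
        self._transforms_dirty = False
        self.sync_count = 0  # number of times the expensive sync actually ran

    def step(self) -> None:
        # Physics work would run here; afterwards only a flag is set.
        self._transforms_dirty = True

    def sync_transforms(self) -> None:
        # Early-out when nothing changed since the last flush.
        if not self._transforms_dirty:
            return
        # Placeholder for the expensive transform write + hierarchy update.
        self.sync_count += 1
        self._transforms_dirty = False

    def pre_render(self) -> None:
        # Called once per render frame, before the renderer reads the scene.
        self.sync_transforms()


def run_frames(manager: DeferredSyncManager, num_frames: int, substeps: int) -> int:
    """Run a render loop with several physics substeps per frame."""
    for _ in range(num_frames):
        for _ in range(substeps):
            manager.step()
        manager.pre_render()
    return manager.sync_count
```

With this structure the number of expensive syncs tracks the number of render frames rather than the number of physics substeps, and a `pre_render()` call with no intervening `step()` does no work.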
diff --git a/source/isaaclab/isaaclab/physics/physics_manager.py b/source/isaaclab/isaaclab/physics/physics_manager.py
@@ -265,6 +265,17 @@ def step(cls) -> None:
         """Step physics simulation by one timestep (physics only, no rendering)."""
         pass
 
+    @classmethod
+    def pre_render(cls) -> None:
+        """Sync deferred physics state to the rendering backend.
+
+        Called by :meth:`~isaaclab.sim.SimulationContext.render` before cameras
+        and visualizers read scene data. The default implementation is a no-op.
+        Backends that defer transform writes (e.g. Newton's dirty-flag pattern)
+        should override this to flush pending updates.
+        """
+        pass
+
     @classmethod
     def close(cls) -> None:
         """Clean up physics resources.
diff --git a/source/isaaclab/isaaclab/sim/simulation_context.py b/source/isaaclab/isaaclab/sim/simulation_context.py
@@ -669,6 +669,7 @@ def render(self, mode: int | None = None) -> None:
         every physics step). Camera sensors drive their configured renderer when
         fetching data, so this method remains backend-agnostic.
         """
+        self.physics_manager.pre_render()
         self.update_visualizers(self.get_rendering_dt())
 
         # Call render callbacks
diff --git a/source/isaaclab_newton/isaaclab_newton/physics/_cubric.py b/source/isaaclab_newton/isaaclab_newton/physics/_cubric.py
@@ -0,0 +1,273 @@
+# Copyright (c) 2022-2026, The Isaac Lab Project Developers (https://github.com/isaac-sim/IsaacLab/blob/main/CONTRIBUTORS.md).
+# All rights reserved.
+#
+# SPDX-License-Identifier: BSD-3-Clause
+
+"""Pure-Python ctypes bindings for the cubric GPU transform-hierarchy API.
+
+Acquires the ``omni::cubric::IAdapter`` carb interface directly from the
+Carbonite framework and wraps its function-pointer methods so that Newton
+can call cubric's GPU transform propagation without C++ pybind11 changes.
+
+The flow mirrors PhysX's ``DirectGpuHelper::updateXForms_GPU()``:
+
+1. ``IAdapter::create``     → allocate a cubric adapter ID
+2. ``IAdapter::bindToStage`` → bind to the current Fabric stage
+3. ``IAdapter::compute``     → GPU kernel: propagate world transforms
+4. ``IAdapter::release``     → free the adapter
+
+When cubric is unavailable (e.g. CPU-only machine, plugin not loaded), the
+caller falls back to the CPU ``update_world_xforms()`` path.
+"""
+
+from __future__ import annotations
+
+import ctypes
+import logging
+
+logger = logging.getLogger(__name__)
+
+# ---------------------------------------------------------------------------
+#  Carb Framework struct layout (CARB_ABI function-pointer offsets, x86_64)
+# ---------------------------------------------------------------------------
+# Counting only CARB_ABI fields from the top of ``struct Framework``:
+#   0: loadPluginsEx
+#   8: unloadAllPlugins
+#  16: acquireInterfaceWithClient
+#  24: tryAcquireInterfaceWithClient  ← we use this one
+_FW_OFF_TRY_ACQUIRE = 24
+
+# ---------------------------------------------------------------------------
+#  IAdapter struct layout  (from omni/cubric/IAdapter.h)
+# ---------------------------------------------------------------------------
+#   0: getAttribute
+#   8: create(AdapterId*)
+#  16: refcount
+#  24: retain
+#  32: release(AdapterId)
+#  40: bindToStage(AdapterId, const FabricId&)
+#  48: unbind
+#  56: compute(AdapterId, options, dirtyMode, outFlags*)
+_IA_OFF_CREATE = 8
+_IA_OFF_RELEASE = 32
+_IA_OFF_BIND = 40
+_IA_OFF_COMPUTE = 56
+
+# AdapterId sentinel
+_INVALID_ADAPTER_ID = ctypes.c_uint64(~0).value
+
+# AdapterComputeOptions flags  (from IAdapter.h)
+_OPT_FORCE_UPDATE = 1 << 0  # Force update, ignoring invalidation status
+_OPT_FORCE_STATE_RECONSTRUCTION = 1 << 1  # Force full rebuild of internal accel structures
+_OPT_SKIP_ISOLATED = 1 << 2  # Skip prims with connectivity degree 0
+_OPT_RIGID_BODY = 1 << 3  # Use PhysicsRigidBodyAPI tag for inverse propagation
+
+# Newton prims get tagged with PhysicsRigidBodyAPI at init time so
+# cubric's eRigidBody mode can distinguish rigid-body buckets
+# (Inverse: preserve world matrix written by Newton, derive local)
+# from non-rigid-body buckets (Forward: propagate to children).
+# eForceUpdate is ORed in to bypass the change-listener check.
+_OPT_DEFAULT = _OPT_RIGID_BODY | _OPT_FORCE_UPDATE
+
+# AdapterDirtyMode
+_DIRTY_ALL = 0  # eAll — dirty all prims in the stage
+_DIRTY_COARSE = 1  # eCoarse — dirty all prims in visited buckets
+
+
+# ---------------------------------------------------------------------------
+#  ctypes struct mirrors
+# ---------------------------------------------------------------------------
+class _Version(ctypes.Structure):
+    _fields_ = [("major", ctypes.c_uint32), ("minor", ctypes.c_uint32)]
+
+
+class _InterfaceDesc(ctypes.Structure):
+    """``carb::InterfaceDesc`` — {const char* name, Version version}."""
+
+    _fields_ = [
+        ("name", ctypes.c_char_p),
+        ("version", _Version),
+    ]
+
+
+def _read_u64(addr: int) -> int:
+    return ctypes.c_uint64.from_address(addr).value
+
+
+# ---------------------------------------------------------------------------
+#  Public API
+# ---------------------------------------------------------------------------
+class CubricBindings:
+    """Typed wrappers around the cubric ``IAdapter`` API.
+
+    Call :meth:`initialize` once; if it returns ``True``, the four adapter
+    methods are available.
+    """
+
+    def __init__(self) -> None:
+        self._ia_ptr: int = 0
+        self._create_fn = None
+        self._release_fn = None
+        self._bind_fn = None
+        self._compute_fn = None
+
+    # -- lifecycle -----------------------------------------------------------
+
+    def initialize(self) -> bool:
+        """Acquire the cubric ``IAdapter`` from the carb framework."""
+        # Ensure the omni.cubric extension (native carb plugin) is loaded.
+        try:
+            import omni.kit.app
+
+            ext_mgr = omni.kit.app.get_app().get_extension_manager()
+            if not ext_mgr.is_extension_enabled("omni.cubric"):
+                ext_mgr.set_extension_enabled_immediate("omni.cubric", True)
+            if not ext_mgr.is_extension_enabled("omni.cubric"):
+                logger.warning("Failed to enable omni.cubric extension")
+                return False
+        except Exception as exc:
+            logger.warning("Cannot enable omni.cubric: %s", exc)
+            return False
+
+        # Get Framework* via libcarb.so acquireFramework (singleton).
+        try:
+            libcarb = ctypes.CDLL("libcarb.so")
+        except OSError:
+            logger.warning("Could not load libcarb.so")
+            return False
+
+        libcarb.acquireFramework.restype = ctypes.c_void_p
+        libcarb.acquireFramework.argtypes = [ctypes.c_char_p, _Version]
+        fw_ptr = libcarb.acquireFramework(b"isaaclab.cubric", _Version(0, 0))
+        if not fw_ptr:
+            logger.warning("acquireFramework returned null")
+            return False
+
+        # Read tryAcquireInterfaceWithClient fn-ptr from Framework vtable.
+        try_acquire_addr = _read_u64(fw_ptr + _FW_OFF_TRY_ACQUIRE)
+        if try_acquire_addr == 0:
+            logger.warning("tryAcquireInterfaceWithClient is null in Framework")
+            return False
+
+        try_acquire_fn = ctypes.CFUNCTYPE(
+            ctypes.c_void_p,  # return: void* (IAdapter*)
+            ctypes.c_char_p,  # clientName
+            _InterfaceDesc,  # desc (by value)
+            ctypes.c_char_p,  # pluginName
+        )(try_acquire_addr)
+
+        desc = _InterfaceDesc(
+            name=b"omni::cubric::IAdapter",
+            version=_Version(0, 1),
+        )
+
+        # Try several acquisition strategies — the required client name
+        # varies across Kit configurations.
+        ia_ptr = try_acquire_fn(b"carb.scripting-python.plugin", desc, None)
+        if not ia_ptr:
+            ia_ptr = try_acquire_fn(None, desc, None)
+        if not ia_ptr:
+            acquire_addr = _read_u64(fw_ptr + 16)  # acquireInterfaceWithClient
+            if acquire_addr:
+                acquire_fn = ctypes.CFUNCTYPE(
+                    ctypes.c_void_p,
+                    ctypes.c_char_p,
+                    _InterfaceDesc,
+                    ctypes.c_char_p,
+                )(acquire_addr)
+                ia_ptr = acquire_fn(b"isaaclab.cubric", desc, None)
+        if not ia_ptr:
+            logger.warning(
+                "Could not acquire omni::cubric::IAdapter — "
+                "cubric plugin may not be registered or interface version mismatch"
+            )
+            return False
+        self._ia_ptr = ia_ptr
+
+        # Wrap the four IAdapter function pointers we need.
+        create_addr = _read_u64(ia_ptr + _IA_OFF_CREATE)
+        release_addr = _read_u64(ia_ptr + _IA_OFF_RELEASE)
+        bind_addr = _read_u64(ia_ptr + _IA_OFF_BIND)
+        compute_addr = _read_u64(ia_ptr + _IA_OFF_COMPUTE)
+
+        if not all([create_addr, release_addr, bind_addr, compute_addr]):
+            logger.warning("One or more IAdapter function pointers are null")
+            return False
+
+        self._create_fn = ctypes.CFUNCTYPE(
+            ctypes.c_bool,
+            ctypes.POINTER(ctypes.c_uint64),
+        )(create_addr)
+
+        self._release_fn = ctypes.CFUNCTYPE(
+            ctypes.c_bool,
+            ctypes.c_uint64,
+        )(release_addr)
+
+        # FabricId is uint64, passed by const-ref -> pointer on x86_64
+        self._bind_fn = ctypes.CFUNCTYPE(
+            ctypes.c_bool,
+            ctypes.c_uint64,
+            ctypes.POINTER(ctypes.c_uint64),
+        )(bind_addr)
+
+        self._compute_fn = ctypes.CFUNCTYPE(
+            ctypes.c_bool,
+            ctypes.c_uint64,  # adapterId
+            ctypes.c_uint32,  # options  (AdapterComputeOptions)
+            ctypes.c_int32,  # dirtyMode (AdapterDirtyMode)
+            ctypes.c_void_p,  # outAccountFlags* (nullable)
+        )(compute_addr)
+
+        logger.info("cubric IAdapter bindings ready")
+        return True
+
+    @property
+    def available(self) -> bool:
+        return self._compute_fn is not None  # set last, so True only after all pointers are wrapped
+
+    # -- cubric adapter methods ----------------------------------------------
+
+    def create_adapter(self) -> int | None:
+        """Create a cubric adapter. Returns an adapter ID or ``None``."""
+        if not self._create_fn:
+            return None
+        adapter_id = ctypes.c_uint64(_INVALID_ADAPTER_ID)
+        ok = self._create_fn(ctypes.byref(adapter_id))
+        if not ok or adapter_id.value == _INVALID_ADAPTER_ID:
+            logger.warning("IAdapter::create failed")
+            return None
+        return adapter_id.value
+
+    def bind_to_stage(self, adapter_id: int, fabric_id: int) -> bool:
+        """Bind the adapter to a Fabric stage."""
+        if not self._bind_fn:
+            return False
+        fid = ctypes.c_uint64(fabric_id)
+        ok = self._bind_fn(adapter_id, ctypes.byref(fid))
+        if not ok:
+            logger.warning("IAdapter::bindToStage failed (adapter=%d, fabricId=%d)", adapter_id, fabric_id)
+        return ok
+
+    def compute(self, adapter_id: int) -> bool:
+        """Run the GPU transform-hierarchy compute pass.
+
+        Uses ``eRigidBody | eForceUpdate`` with ``eAll`` dirty mode.
+        ``eRigidBody`` makes cubric apply Inverse propagation on buckets
+        tagged with ``PhysicsRigidBodyAPI`` (keeps Newton's world transforms,
+        derives local) and Forward on everything else (propagates to children).
+        ``eForceUpdate`` bypasses the change-listener dirty check.
+        """
+        if not self._compute_fn:
+            return False
+        flags = ctypes.c_uint32(0)
+        ok = self._compute_fn(adapter_id, _OPT_DEFAULT, _DIRTY_ALL, ctypes.byref(flags))
+        if not ok:
+            logger.warning("IAdapter::compute returned false (flags=0x%x)", flags.value)
+        return ok
+
+    def release_adapter(self, adapter_id: int) -> None:
+        """Release an adapter."""
+        if not adapter_id or not self._release_fn:
+            return
+        self._release_fn(adapter_id)
diff --git a/source/isaaclab_newton/isaaclab_newton/physics/newton_manager.py b/source/isaaclab_newton/isaaclab_newton/physics/newton_manager.py