Commit e7f34eb
Adds Newton + Isaac RTX Rendering Performance Optimizations (#5017)
# Newton + Isaac RTX Rendering Performance Optimizations
This document describes three performance optimizations applied to the
Newton physics simulator when used with the Isaac Sim RTX renderer
inside Isaac Lab. Together they reduce per-frame time from **~323 ms to
~60 ms** (a **5.4× speedup**), making Newton's rendering path slightly
faster than PhysX's equivalent (~65 ms).
The changes live primarily in two files:
- `source/isaaclab_newton/isaaclab_newton/physics/newton_manager.py`
- `source/isaaclab_newton/isaaclab_newton/physics/_cubric.py` (new)
with small additions to `PhysicsManager` and `SimulationContext` in the
core `isaaclab` package.
---
## Baseline: ~323 ms per frame
The starting point is the unoptimized Newton + RTX rendering loop. An
Nsight Systems trace reveals the structure:
- **Two physics steps** execute per frame (typical for 2× physics
substeps per render frame).
- **After each physics step**, Newton writes updated body transforms to
Fabric (Omniverse's GPU scene-graph cache) and then triggers a full CPU
hierarchy update via `update_world_xforms()`. This hierarchy walk
recomputes every world-space transform in the scene from parent-child
relationships — even though Newton already computed the correct world
transforms and wrote them directly.
- The Kit renderer also runs its own, lighter, internal hierarchy
update.
The per-step Fabric sync and hierarchy update dominate the frame.
Because they run after *every* physics step (not just before rendering),
their cost is multiplied by the number of substeps.
<img width="2169" height="750" alt="newton-rtx-baseline"
src="https://github.com/user-attachments/assets/f7fc0079-9cca-43d2-9ade-9069e29718d4"
/>
---
## Optimization 1 — Dirty-Flag Deferred Sync: ~244 ms per frame
### Problem
Every physics substep was calling `sync_transforms_to_usd()`, which
writes Newton body poses to Fabric and then invokes
`update_world_xforms()`. The hierarchy update is expensive and only
needs to happen once before the renderer reads the scene — not after
every substep.
### Solution
A **dirty-flag pattern** decouples physics stepping from Fabric
synchronization:
1. **`_mark_transforms_dirty()`** — called at the end of each
`_simulate()` call, sets `_transforms_dirty = True`. This is cheap (a
boolean assignment).
2. **`sync_transforms_to_usd()`** — now checks `_transforms_dirty` at
the top and returns immediately if transforms haven't changed. When
dirty, it writes transforms and calls the hierarchy update, then clears
the flag.
3. **`pre_render()`** — a new method added to `PhysicsManager` (base
class) and overridden by `NewtonManager`. It calls
`sync_transforms_to_usd()`. The `SimulationContext.render()` method
calls `physics_manager.pre_render()` before updating visualizers and
cameras, ensuring transforms are flushed exactly once per render frame.
The key insight is that the renderer only reads scene transforms during
`render()`, not during `step()`. By deferring the Fabric write and
hierarchy update to render time, we eliminate redundant work when
multiple physics substeps run per render frame. For 2 substeps per
frame, this cuts the hierarchy update count in half.
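The pattern can be illustrated with a minimal, self-contained sketch. The class names, the instance-level state, and the `sync_count` counter are illustrative assumptions, not the actual Isaac Lab implementation; only the method names mirror the ones described above.

```python
class SketchPhysicsManager:
    """Stand-in base class: backends with no pre-render work inherit the no-op."""

    def pre_render(self) -> None:
        pass


class SketchNewtonManager(SketchPhysicsManager):
    """Illustrative only: the real manager writes to Fabric instead of counting."""

    def __init__(self) -> None:
        self._transforms_dirty = False
        self.sync_count = 0  # hypothetical counter used to observe the behavior

    def _mark_transforms_dirty(self) -> None:
        self._transforms_dirty = True  # cheap: a single boolean assignment

    def _simulate(self) -> None:
        # ... physics kernels would run here ...
        self._mark_transforms_dirty()

    def sync_transforms_to_usd(self) -> None:
        if not self._transforms_dirty:
            return  # nothing changed since the last sync
        # ... write poses to Fabric and update the hierarchy here ...
        self.sync_count += 1
        self._transforms_dirty = False

    def pre_render(self) -> None:
        self.sync_transforms_to_usd()


def run_frame(manager: SketchPhysicsManager, substeps: int) -> None:
    """One render frame: several physics substeps, then a single flush."""
    for _ in range(substeps):
        manager._simulate()
    manager.pre_render()
```

Calling `run_frame(manager, substeps=2)` performs one sync regardless of the substep count, and a second `pre_render()` without an intervening step does no work.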
### Key code paths
- `_simulate()` → `_mark_transforms_dirty()` (just sets a flag)
- `SimulationContext.render()` → `PhysicsManager.pre_render()` →
`NewtonManager.sync_transforms_to_usd()` (runs once, clears the flag)
<img width="2174" height="765" alt="newton-rtx-dirty"
src="https://github.com/user-attachments/assets/eae6dbd9-7936-492e-a922-4fff5c0d7861"
/>
---
## Optimization 2 — CUDA Graph Capture (Relaxed Mode): ~144 ms per frame
### Problem
The trace shows that the GPU is underutilized during the physics steps.
Each Warp kernel launch (collision detection, constraint solve,
integration, FK evaluation) incurs a round-trip to the CPU via Python —
launch overhead, GIL acquisition, and driver calls. For a simulation
with many small kernels per substep, this CPU-side overhead becomes the
bottleneck while the GPU sits idle between dispatches.
Newton already supported CUDA graphs (pre-recording a sequence of kernel
launches and replaying them with a single driver call), but CUDA graph
capture was **disabled when RTX rendering was active**. The original
code had:
```python
use_cuda_graph = cfg.use_cuda_graph and (cls._usdrt_stage is None)
```
This was necessary because RTX's background threads use CUDA's legacy
stream (stream 0) for async operations like `cudaImportExternalMemory`.
Warp's standard `ScopedCapture()` uses
`cudaStreamCaptureModeThreadLocal` on a blocking stream, which
implicitly synchronizes with legacy stream 0. If RTX ops happen during
capture, the CUDA runtime raises error 906
(`cudaErrorStreamCaptureImplicit`).
### Solution
A **deferred, relaxed-mode CUDA graph capture** strategy that is
compatible with RTX:
**Deferral:** Graph capture is postponed from `initialize_solver()` to
the first `step()` call. By that time, RTX has finished its
initialization (all `cudaImportExternalMemory` calls are done) and is
idle between render frames, providing a clean capture window.
```python
# In initialize_solver():
cls._graph = None
cls._graph_capture_pending = True

# In step():
if cls._graph_capture_pending:
    cls._graph = cls._capture_relaxed_graph(device)
```
**Relaxed-mode capture** (`_capture_relaxed_graph`): This method works
around two conflicting requirements:
1. **RTX compatibility**: RTX threads use legacy stream 0. A blocking
stream (Warp's default) implicitly syncs with it, causing capture
failures. Solution: create a **non-blocking stream**
(`cudaStreamNonBlocking = 0x01`) that has no implicit synchronization
with stream 0.
2. **Warp compatibility**: `mujoco_warp` internally calls
`wp.capture_while`, which checks Warp's `device.captures` registry to
decide whether to insert a conditional graph node or synchronize
eagerly. Without a registered capture, it calls `wp.synchronize_stream`
on the capturing stream — which is illegal inside graph capture.
Solution: call `wp.capture_begin(external=True, stream=fresh_stream)` to
register the capture in Warp's tracking without calling
`cudaStreamBeginCapture` again (already done externally).
The capture sequence:
1. **Warmup run** — execute `_simulate_physics_only()` eagerly to
pre-allocate all MuJoCo-Warp scratch buffers (allocations are forbidden
inside graph capture).
2. **Create a non-blocking CUDA stream** via
`cudaStreamCreateWithFlags(..., NonBlocking)`.
3. **Begin capture** with `cudaStreamBeginCapture(...,
cudaStreamCaptureModeRelaxed)` — relaxed mode allows other streams to
operate freely during capture.
4. **Register with Warp** via `wp.capture_begin(external=True,
stream=...)`.
5. **Record physics kernels** — `_simulate_physics_only()` inside
`wp.ScopedStream(fresh_stream)`.
6. **Finalize** — `wp.capture_end()` then `cudaStreamEndCapture()` to
obtain the graph.
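As a rough sketch of the CUDA runtime side of this sequence, the helper below takes the runtime handle as a parameter so the call order can be exercised without a GPU. The function name `capture_relaxed` and the `warmup`/`record` callbacks are hypothetical; in the scheme described above, `record` would register the capture with Warp, run `_simulate_physics_only()` on the new stream, and call `wp.capture_end()`. The three `cudaStream*` functions and the two constants are the standard CUDA runtime ones.

```python
import ctypes

CUDA_STREAM_NON_BLOCKING = 0x01       # cudaStreamNonBlocking
CUDA_STREAM_CAPTURE_MODE_RELAXED = 2  # cudaStreamCaptureModeRelaxed


def _check(status: int, call: str) -> None:
    if status != 0:  # cudaSuccess == 0
        raise RuntimeError(f"{call} failed with CUDA error {status}")


def capture_relaxed(cudart, warmup, record) -> ctypes.c_void_p:
    """Capture `record(stream)` into a graph; returns the raw cudaGraph_t handle.

    `cudart` is assumed to be something like ctypes.CDLL("libcudart.so").
    Real code should also set argtypes/restype on the three functions.
    """
    warmup()  # step 1: allocate scratch buffers before capture starts

    stream = ctypes.c_void_p()
    _check(
        cudart.cudaStreamCreateWithFlags(ctypes.byref(stream), CUDA_STREAM_NON_BLOCKING),
        "cudaStreamCreateWithFlags",
    )  # step 2
    _check(
        cudart.cudaStreamBeginCapture(stream, CUDA_STREAM_CAPTURE_MODE_RELAXED),
        "cudaStreamBeginCapture",
    )  # step 3

    graph = ctypes.c_void_p()
    try:
        record(stream)  # steps 4 and 5, plus the Warp half of step 6
    finally:
        # Always end the capture so the stream is not left in capture mode.
        status = cudart.cudaStreamEndCapture(stream, ctypes.byref(graph))
    _check(status, "cudaStreamEndCapture")
    return graph
```

Ending the capture in a `finally` block is a design choice of this sketch: if recording raises, the stream is still returned to a usable state before the exception propagates.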
**Physics-only capture:** `_simulate_physics_only()` was factored out of
`_simulate()` to exclude Fabric sync operations
(`wp.synchronize_device`, `wp.fabricarray`) that are incompatible with
graph capture. After graph replay, `step()` marks transforms dirty, and
`pre_render()` handles the Fabric sync eagerly.
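The interaction between deferred capture, replay, and the dirty flag might look roughly like the following sketch. The class and its three callbacks are hypothetical, and the eager fallback when capture returns no graph is an assumption made for illustration rather than documented behavior.

```python
from typing import Callable, Optional


class DeferredGraphStepper:
    """Capture on the first step, replay afterwards (illustrative names)."""

    def __init__(
        self,
        capture: Callable[[], Optional[object]],
        replay: Callable[[object], None],
        simulate_eagerly: Callable[[], None],
    ) -> None:
        self._capture = capture
        self._replay = replay
        self._simulate_eagerly = simulate_eagerly
        self._graph: Optional[object] = None
        self._graph_capture_pending = True
        self.transforms_dirty = False

    def step(self) -> None:
        if self._graph_capture_pending:
            # Clear the flag first so a failed capture is not retried every step.
            self._graph_capture_pending = False
            self._graph = self._capture()
        if self._graph is not None:
            self._replay(self._graph)
        else:
            # Assumed fallback: launch the kernels without a graph.
            self._simulate_eagerly()
        # Fabric sync is left to pre_render(); only record that poses changed.
        self.transforms_dirty = True
```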
The ctypes binding to `libcudart.so` is used directly because Warp's
`ScopedCapture` doesn't expose control over capture mode or stream type.
<img width="2168" height="745" alt="newton-rtx-cuda-graph"
src="https://github.com/user-attachments/assets/eecfde04-41f4-488e-97e4-4d87cf617830"
/>
---
## Optimization 3 — GPU Transform Hierarchy via cubric: ~60 ms per frame
### Problem
Even with the dirty-flag pattern reducing hierarchy updates to once per
render frame, the `update_world_xforms()` call is still a **CPU-side
tree walk** over the entire Fabric scene graph. For scenes with
thousands of prims (typical in multi-environment RL), this CPU hierarchy
propagation is a significant bottleneck.
The PhysX backend avoids this problem by using **cubric** — a
GPU-accelerated transform hierarchy library. cubric runs the
parent-child transform propagation entirely on the GPU via
`IAdapter::compute()`, which is dramatically faster than the CPU walk.
However, cubric has no Python bindings.
### Solution
**Pure-Python ctypes bindings to cubric's Carbonite interface**
(`_cubric.py`), allowing Newton to use the same GPU hierarchy
propagation that PhysX uses.
cubric is implemented as a Carbonite plugin and exposes its API through
the `omni::cubric::IAdapter` interface. The bindings work by:
1. **Acquiring the Carbonite Framework** — `libcarb.so`'s
`acquireFramework()` returns the singleton `Framework*`.
2. **Acquiring the IAdapter interface** — calling
`tryAcquireInterfaceWithClient()` with the interface descriptor
`omni::cubric::IAdapter` version `0.1`.
3. **Wrapping function pointers** — the `IAdapter` struct is a C++
vtable-like struct with function pointers at known offsets.
Each function pointer is read from the struct at its byte offset and
wrapped with `ctypes.CFUNCTYPE` to make it callable from Python.
**Integration in `sync_transforms_to_usd()`:**
The sync method now mirrors PhysX's `ScopedUSDRT` pattern:
1. **Pause Fabric change tracking** — `track_world_xform_changes(False)`
and `track_local_xform_changes(False)`. This is critical: `SelectPrims`
with `ReadWrite` access internally calls `getAttributeArrayGpu`, which
marks Fabric buffers dirty. If tracking is still active, the hierarchy
records the change and Kit's `updateWorldXforms` will do an expensive
connectivity rebuild every frame.
2. **Write transforms** — the existing Warp kernel writes Newton body
poses to Fabric's `omni:fabric:worldMatrix`.
3. **Resume tracking** — re-enable change tracking (in a `finally` block
for safety).
4. **Run cubric compute** — `IAdapter::compute()` with `eRigidBody |
eForceUpdate` options and `eAll` dirty mode. The `eRigidBody` flag tells
cubric to use **inverse propagation** on prims tagged with
`PhysicsRigidBodyAPI` (preserve the world matrix that Newton wrote,
derive the local transform) and **forward propagation** on everything
else (propagate parent transforms to children). `eForceUpdate` bypasses
cubric's change-listener dirty check since we know transforms have
changed.
The adapter is lazily created on the first `sync_transforms_to_usd()`
call rather than during `initialize_solver()`, to avoid startup-ordering
issues with the cubric plugin.
When cubric is unavailable (e.g., plugin not loaded, CPU-only), the code
falls back gracefully to the CPU `update_world_xforms()` path.
```
sync_transforms_to_usd():
┌───────────────────────────────────┐
│ Pause Fabric change tracking      │
├───────────────────────────────────┤
│ SelectPrims (ReadWrite)           │
│ wp.launch(_set_fabric_transforms) │ ← GPU: write Newton poses to Fabric
│ wp.synchronize_device()           │
├───────────────────────────────────┤
│ cubric IAdapter::compute()        │ ← GPU: propagate hierarchy
├───────────────────────────────────┤
│ Resume Fabric change tracking     │
└───────────────────────────────────┘
```
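The control flow of the numbered steps above, including the `finally` block and the CPU fallback, can be sketched as follows. The `hierarchy` object, the `write_transforms` and `cubric_compute` callables, and the returned `"gpu"`/`"cpu"` labels are hypothetical stand-ins; only the three method names come from the description above.

```python
class RecordingHierarchy:
    """Test double that records calls instead of touching Fabric."""

    def __init__(self) -> None:
        self.calls = []

    def track_world_xform_changes(self, enabled: bool) -> None:
        self.calls.append(("world", enabled))

    def track_local_xform_changes(self, enabled: bool) -> None:
        self.calls.append(("local", enabled))

    def update_world_xforms(self) -> None:
        self.calls.append(("cpu_update",))


def sync_transforms(hierarchy, write_transforms, cubric_compute=None) -> str:
    """Write poses with tracking paused, then propagate on GPU or fall back to CPU."""
    hierarchy.track_world_xform_changes(False)
    hierarchy.track_local_xform_changes(False)
    try:
        write_transforms()
    finally:
        # Re-enable tracking even if the write fails.
        hierarchy.track_world_xform_changes(True)
        hierarchy.track_local_xform_changes(True)

    # Assumed convention: cubric_compute is None when the plugin is unavailable
    # and returns a truthy value on success.
    if cubric_compute is not None and cubric_compute():
        return "gpu"
    hierarchy.update_world_xforms()
    return "cpu"
```

Passing `cubric_compute=None` exercises the fallback path, which ends with a call to `update_world_xforms()`.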
A future Kit release is expected to ship official Python bindings for
cubric, at which point the ctypes approach can be replaced.
The result is a frame time of **~60 ms** — slightly faster than PhysX's
**~65 ms** on the same scene.
<img width="2169" height="896" alt="newton-rtx-cubric"
src="https://github.com/user-attachments/assets/1474b806-fe82-44be-add3-324971ec37a0"
/>
---
## Summary
| Optimization | Frame Time | Speedup vs. Baseline | Key Technique |
|---|---|---|---|
| Baseline | ~323 ms | — | Sync + hierarchy after every substep |
| Dirty-flag deferred sync | ~244 ms | 1.3× | Sync once per render frame, not per substep |
| CUDA graph (relaxed mode) | ~144 ms | 2.2× | Eliminate per-kernel CPU launch overhead |
| cubric GPU hierarchy | ~60 ms | 5.4× | GPU hierarchy propagation via ctypes bindings |
All three optimizations are complementary and stack on top of each other.
The final result matches or slightly beats the PhysX rendering path (~65
ms) while using Newton as the physics backend.
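The speedup column is simply the baseline frame time divided by each stage's frame time, rounded to one decimal place. A quick check using the approximate figures from the table (the dictionary keys are shorthand labels chosen here):

```python
BASELINE_MS = 323.0

# Approximate cumulative frame times from the summary table, in milliseconds.
frame_times_ms = {
    "dirty_flag": 244.0,
    "cuda_graph": 144.0,
    "cubric": 60.0,
}

speedups = {
    name: round(BASELINE_MS / ms, 1) for name, ms in frame_times_ms.items()
}
```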
*Co-developed with Toby Jones (NVIDIA).*