Fixes NaN after env reset by syncing body_q before collision detection#5176
Fixes NaN after env reset by syncing body_q before collision detection#5176AntoineRichard merged 5 commits intoisaac-sim:developfrom
Conversation
Fixes pre-commit ruff check failure.
Greptile SummaryFixes NaN explosions during Newton training on rough terrain by syncing Confidence Score: 5/5Safe to merge — the fix is logically sound and the regression test validates the core path. The dirty-flag approach correctly guards eval_fk to only run when needed and only when the collision pipeline is active. All position-write methods properly call invalidate_fk(). The regression test confirms body_q is fresh at collide() time. The only gap is CUDA graph coverage (test uses use_cuda_graph=False), but the code path is theoretically correct since both eval_fk and the captured graph operate on the same CUDA stream and buffer. All remaining findings are P2. test_articulation.py — CUDA graph path not covered by the regression test Important Files Changed
Sequence DiagramsequenceDiagram
participant A as Articulation
participant NM as NewtonManager
participant FK as eval_fk
participant CP as CollisionPipeline
A->>NM: write_root_pose_to_sim_index()
NM->>NM: _fk_dirty = True
Note over NM: joint_q updated, body_q stale
NM->>NM: step()
alt _fk_dirty AND _needs_collision_pipeline
NM->>FK: eval_fk(model, joint_q, joint_qd, state_0)
FK-->>NM: body_q updated
NM->>NM: _fk_dirty = False
end
NM->>CP: collide(state_0, contacts)
Note over CP: reads fresh body_q
CP-->>NM: contacts computed
NM->>NM: solver.step(...)
Reviews (1): Last reviewed commit: "style: remove unused variable `state` (r..." | Re-trigger Greptile |
| use_mujoco_contacts=False, | ||
| ), | ||
| num_substeps=1, | ||
| use_cuda_graph=False, |
There was a problem hiding this comment.
CUDA graph path not covered by the regression test
The test hardcodes use_cuda_graph=False, so the fix is only validated in eager-execution mode. When CUDA graphs are enabled the eager eval_fk call in step() runs before wp.capture_launch, which should work correctly because both operate on the same CUDA stream and the same state_0.body_q buffer — but this behaviour is untested. Consider adding a second parameterised case with use_cuda_graph=True, or at least a comment explaining why the CUDA-graph path can be skipped.
There was a problem hiding this comment.
🤖 Isaac Lab Review Bot
Summary
This PR fixes a critical NaN bug where body_q (body world transforms) became stale after env resets, causing the collision pipeline to generate contacts at wrong positions. The fix adds an invalidate_fk() flag that triggers forward kinematics (eval_fk) before collision detection in step(). The implementation is correct and well-tested.
Design Assessment
Design is sound. The dirty flag pattern is the right approach:
-
Why not call eval_fk directly in each write method? Multiple writes per reset (root pose + joint positions) would trigger redundant FK calls. The dirty flag correctly defers FK to a single call in
step(). -
Why not put eval_fk inside the CUDA graph? The FK is only needed conditionally (after resets), but CUDA graphs capture unconditionally. Running eval_fk eagerly before
wp.capture_launchis correct — stream ordering guarantees the FK completes before the graph'scollide()readsbody_q. -
The guard
if cls._fk_dirty and cls._needs_collision_pipelineis correct: MuJoCo's internal contacts mode (use_mujoco_contacts=True) handles FK internally, so the Newton collision pipeline path is the only one that needs this fix.
The fix follows existing patterns: ArticulationData already uses timestamp-based lazy FK invalidation for data access; this PR adds the complementary mechanism for physics stepping.
Architecture Impact
No cross-module impact. Changes are self-contained within isaaclab_newton:
NewtonManager.invalidate_fk()is a new classmethod, not breaking any existing API- All 6 articulation write methods that modify
joint_q/root pose now callinvalidate_fk() - Deprecated wrapper methods (
write_root_state_to_sim, etc.) delegate to the updated methods, so they're transitively covered
Implementation Verdict
Ship it — Clean implementation, no issues found after thorough review.
Test Coverage
✅ Excellent regression test. The test:
- Uses
use_mujoco_contacts=Falseto activate the Newton collision pipeline - Teleports env 0 by 10m (large enough to detect stale body_q)
- Patches
_simulate_physics_onlyto capturebody_qat the moment beforecollide()runs - Asserts
body_qmatchesjoint_qwithin 1cm tolerance - Would fail with ~10m mismatch without the fix
The test correctly captures body_q at the start of _simulate_physics_only, which is AFTER the new eval_fk in step() has run, validating the fix works.
CI Status
✅ Pre-commit passing (after linter fix pushed by review bot)
⏳ Other checks pending/in-progress
Findings
No blocking issues. The unused variable state (ruff F841) has been fixed in commit 36b7659.
Review based on commit 36b7659
There was a problem hiding this comment.
🤖 Isaac Lab Review Bot
Summary
This PR fixes a real, critical NaN bug where body_q was stale after env resets, causing the Newton collision pipeline to generate contacts for the wrong body positions. The fix adds an invalidate_fk() flag pattern that triggers eval_fk in step() before collide() runs. The approach is sound, the implementation is clean, and the regression test is well-designed.
Design Assessment
Problem: After env resets, articulation write methods (write_root_pose_to_sim_*, write_joint_position_to_sim_*) update joint_q but body_q remains stale. The collision pipeline reads body_q for broadphase/narrowphase, generating contacts for the pre-reset positions. The MuJoCo Warp solver then computes constraint Jacobians from the new qpos but old contacts — a geometry mismatch that overflows the solver gradient, producing NaN.
This PR's approach: A dirty flag (_fk_dirty) set by articulation write methods, checked in step() before the simulation (which starts with collide()). This is the correct fix.
Alternatives considered:
- Run
eval_fkinside each write method — Simpler per-call, but wasteful: multiple writes during a single reset would trigger multiple redundant FK evaluations. The dirty-flag approach correctly batches them. - Always run
eval_fkat the start ofstep()— Simpler (no flag needed), but adds unnecessary GPU work on steps where no writes happened (the common case during training). The conditional check is better. - Run
eval_fkinside_simulate_physics_onlybeforecollide()— Would work for the eager path but would bloat the CUDA graph capture (FK would be captured inside the graph, running unconditionally every step).
The chosen approach (dirty flag checked before graph launch / eager simulate) correctly separates the conditional eager FK from the captured physics pipeline. Design is sound.
Architecture Impact
The invalidate_fk() method is a new public classmethod on NewtonManager. It's called from the Newton Articulation class only. No cross-module breakage — the change is self-contained within isaaclab_newton.
However: RigidObject and RigidObjectCollection have the same write methods (write_root_link_pose_to_sim_*, write_body_link_pose_to_sim_*) that also modify joint_q via the shared kernels but do NOT call invalidate_fk(). If rigid objects are used with the Newton collision pipeline (use_mujoco_contacts=False), they would exhibit the same stale body_q bug. See 🟡 Warning below.
Implementation Verdict
Minor fixes needed — Correct approach, 2 issues to address.
Test Coverage
The regression test is well-designed:
- ✅ Sets up the Newton collision pipeline (
use_mujoco_contacts=False) - ✅ Teleports env 0 by 10m (simulating a reset), which creates a large gap between stale
body_qand newjoint_q - ✅ Patches
_simulate_physics_onlyto capturebody_qat the correct interception point (aftereval_fkinstep(), beforecollide()) - ✅ Asserts the position difference is < 1cm (would be ~10m without the fix)
- ✅ Deterministic — no random seeds or timing dependencies
One gap: the test uses use_cuda_graph=False. The CUDA graph code path (where eval_fk runs eagerly before wp.capture_launch) is untested. This is acceptable since CUDA graph testing requires a GPU capture environment, but worth noting.
CI Status
pre-commit ✅ passed. Build/test checks still pending (Docker image build in progress).
Findings
🟡 Warning: RigidObject/RigidObjectCollection missing invalidate_fk() — same bug pattern
RigidObject.write_root_link_pose_to_sim_index/mask and RigidObjectCollection.write_body_link_pose_to_sim_index/mask write to joint_q via the same shared kernels but don't call SimulationManager.invalidate_fk(). If a scene uses rigid objects with the Newton collision pipeline (use_mujoco_contacts=False), resetting rigid object poses would leave body_q stale — the same bug this PR fixes for articulations. This doesn't block this PR (the articulation fix is correct and urgent), but should be addressed in a follow-up.
🔵 Improvement: CHANGELOG RST heading level inconsistency
Minor: the existing 0.5.9 entries use ~~~~~~~~~~~~~~~~~~ (18 tildes for 19-char title). The new 0.5.10 entry uses ~~~~~~~~~~~~~~~~~~~ (19 tildes for 19-char title). Both are valid RST, but the length differs from the existing convention. Not blocking.
| if cls._fk_dirty and cls._needs_collision_pipeline: | ||
| eval_fk(cls._model, cls._state_0.joint_q, cls._state_0.joint_qd, cls._state_0, None) | ||
| cls._fk_dirty = False | ||
|
|
There was a problem hiding this comment.
The unconditional cls._fk_dirty = False after the conditional eval_fk is correct — it clears the flag regardless of _needs_collision_pipeline. If collision pipeline is disabled, the stale body_q doesn't matter (no collide() will read it), so clearing the flag without running FK is the right behavior.
Good.
| # Ensure body_q is up-to-date before collision detection. | ||
| # After env resets, joint_q is written but body_q (used by | ||
| # broadphase/narrowphase) is stale until FK runs. | ||
| if cls._fk_dirty and cls._needs_collision_pipeline: |
There was a problem hiding this comment.
This eval_fk runs eagerly outside the CUDA graph. When graphs are enabled, the execution order is:
- Eager
eval_fk→ updatesbody_qin_state_0 wp.capture_launch(graph)→ replays graphedcollide()which reads the now-currentbody_q
This is correct because CUDA graphs replay on the same memory pointers, so the eager write is visible to the graphed kernel. The FK call is a single Warp kernel launch, so the overhead when _fk_dirty is True is minimal.
There was a problem hiding this comment.
🤖 Isaac Lab Review Bot
Summary
This PR fixes a NaN explosion during Newton training caused by stale body_q after env resets. Articulation write methods update joint_q but never recomputed body_q via forward kinematics, causing the collision pipeline to use wrong body positions. The fix adds a _fk_dirty flag on NewtonManager that write methods set, and step() conditionally runs eval_fk() before collision detection.
Design Assessment
Design is sound. The two-level invalidation approach is well-designed:
- Per-instance
_fk_timestamp = -1.0(already existed) ensures lazy FK recomputation when body poses are queried through the data API. - Global
_fk_dirtyflag (new) ensures eager FK recomputation before collision detection instep().
The fix correctly places the eval_fk call in step() rather than inside _simulate_physics_only(), which avoids including it in CUDA graph captures (the conditional if cls._fk_dirty is a Python-level branch that can't be captured). The flag is unconditionally cleared after the check, which is correct — when no collision pipeline is active, stale body_q doesn't cause collision issues, and the per-instance _fk_timestamp handles the lazy path.
Alternatives considered:
- Running
eval_fkunconditionally instep()before every simulation: wasteful — FK is O(bodies) and unnecessary when no writes happened. - Running
eval_fkinside the write methods themselves: would cause redundant FK calls when multiple articulations reset in the same step (common in RL). - The chosen approach (dirty flag + single FK call) is the right design.
Architecture Impact
No cross-module impact beyond the changes. The invalidate_fk() method is called from Articulation (which already imports NewtonManager as SimulationManager) and consumed by NewtonManager.step(). All 6 write methods that modify joint_q or root transforms are correctly instrumented. Rigid objects are unaffected — they write directly to body_q and don't need FK.
Implementation Verdict
Minor fixes needed — Correct approach, 1 issue to address.
Test Coverage
The regression test is well-designed: it teleports env 0 by 10m, patches _simulate_physics_only to capture body_q at the moment before collide() runs, and asserts that body_q matches joint_q within 1cm tolerance. The test uses use_mujoco_contacts=False to activate the Newton collision pipeline and use_cuda_graph=False for the eager path. The test would fail without the fix (10m mismatch) and passes with it. One issue: missing CI marker (see inline comment).
CI Status
pre-commit ✅ | Some checks still in progress (Build Base Docker Image, license-check, Build Latest Docs)
Findings
🔵 Improvement: test_articulation.py:2409 — Missing @pytest.mark.isaacsim_ci marker. Every other test in this file has this marker. Without it, this test may not run in CI and could silently break without anyone noticing.
AntoineRichard
left a comment
There was a problem hiding this comment.
I think this is the right way of doing this. We could make the eval fk specific to a given robot but it adds overhead, and I'm not sure we'd be gaining much perf in doing so.
LGTM! Thanks Octi!
Description
This is an important fix for locomotion rough terrain in newton to work
Fixes # (issue)
Type of change
Screenshots
Please attach before and after screenshots of the change if applicable.
Checklist
pre-commitchecks with./isaaclab.sh --formatconfig/extension.tomlfileCONTRIBUTORS.mdor my name already exists there