Skip to content

Fixes NaN after env reset by syncing body_q before collision detection#5176

Merged
AntoineRichard merged 5 commits intoisaac-sim:developfrom
ooctipus:fix/stale_fk
Apr 7, 2026
Merged

Fixes NaN after env reset by syncing body_q before collision detection#5176
AntoineRichard merged 5 commits intoisaac-sim:developfrom
ooctipus:fix/stale_fk

Conversation

@ooctipus
Copy link
Copy Markdown
Collaborator

@ooctipus ooctipus commented Apr 5, 2026

Description

This is an important fix for locomotion rough terrain in newton to work

  • Fixed NaN explosions during Newton training caused by stale body_q after env resets. Articulation write methods (write_root_pose_to_sim_, write_joint_position_to_sim_) updated joint_q but never recomputed body_q via forward kinematics. The collision pipeline read the old body_q, generating contacts for the wrong position. The MuJoCo Warp solver then computed constraint Jacobians from new qpos but old contacts — a geometry mismatch that overflowed the solver gradient, producing NaN.
  • Added NewtonManager.invalidate_fk() flag set by articulation write methods, checked in step() before collision detection to conditionally run eval_fk.
  • Added regression test that patches _simulate_physics_only to capture body_q at collide() time and asserts consistency with joint_q. Fails with 10m mismatch without the fix.

Fixes # (issue)

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (existing functionality will not work without user modification)
  • Documentation update

Screenshots

Please attach before and after screenshots of the change if applicable.

Checklist

  • I have read and understood the contribution guidelines
  • I have run the pre-commit checks with ./isaaclab.sh --format
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the changelog and the corresponding version in the extension's config/extension.toml file
  • I have added my name to the CONTRIBUTORS.md or my name already exists there

@ooctipus ooctipus requested a review from AntoineRichard April 5, 2026 05:15
@github-actions github-actions bot added bug Something isn't working isaac-lab Related to Isaac Lab team labels Apr 5, 2026
ooctipus and others added 2 commits April 4, 2026 22:18
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Apr 5, 2026

Greptile Summary

Fixes NaN explosions during Newton training on rough terrain by syncing body_q before collision detection after env resets. The root cause was that articulation write methods updated joint_q but never recomputed body_q via forward kinematics, causing the collision pipeline to read stale positions and the MuJoCo Warp solver to overflow on the geometry mismatch. The fix adds a _fk_dirty dirty flag set by all position-write methods and cleared in step() after a conditional eval_fk call.

Confidence Score: 5/5

Safe to merge — the fix is logically sound and the regression test validates the core path.

The dirty-flag approach correctly guards eval_fk to only run when needed and only when the collision pipeline is active. All position-write methods properly call invalidate_fk(). The regression test confirms body_q is fresh at collide() time. The only gap is CUDA graph coverage (test uses use_cuda_graph=False), but the code path is theoretically correct since both eval_fk and the captured graph operate on the same CUDA stream and buffer. All remaining findings are P2.

test_articulation.py — CUDA graph path not covered by the regression test

Important Files Changed

Filename Overview
source/isaaclab_newton/config/extension.toml Bumps version to 0.5.10 to match the new changelog entry
source/isaaclab_newton/docs/CHANGELOG.rst Adds 0.5.10 Fixed entry documenting the stale body_q NaN bug and the invalidate_fk() solution
source/isaaclab_newton/isaaclab_newton/physics/newton_manager.py Adds _fk_dirty class flag, invalidate_fk() classmethod, and a guard in step() that runs eval_fk before collide() when the flag is set
source/isaaclab_newton/isaaclab_newton/assets/articulation/articulation.py All four root-pose write variants and both joint-position write variants now call SimulationManager.invalidate_fk() to mark body_q stale
source/isaaclab_newton/test/assets/test_articulation.py Adds regression test verifying body_q freshness at collide() time via monkey-patching _simulate_physics_only; only covers the non-CUDA-graph path

Sequence Diagram

sequenceDiagram
    participant A as Articulation
    participant NM as NewtonManager
    participant FK as eval_fk
    participant CP as CollisionPipeline

    A->>NM: write_root_pose_to_sim_index()
    NM->>NM: _fk_dirty = True
    Note over NM: joint_q updated, body_q stale

    NM->>NM: step()
    alt _fk_dirty AND _needs_collision_pipeline
        NM->>FK: eval_fk(model, joint_q, joint_qd, state_0)
        FK-->>NM: body_q updated
        NM->>NM: _fk_dirty = False
    end
    NM->>CP: collide(state_0, contacts)
    Note over CP: reads fresh body_q
    CP-->>NM: contacts computed
    NM->>NM: solver.step(...)
Loading

Reviews (1): Last reviewed commit: "style: remove unused variable `state` (r..." | Re-trigger Greptile

use_mujoco_contacts=False,
),
num_substeps=1,
use_cuda_graph=False,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 CUDA graph path not covered by the regression test

The test hardcodes use_cuda_graph=False, so the fix is only validated in eager-execution mode. When CUDA graphs are enabled the eager eval_fk call in step() runs before wp.capture_launch, which should work correctly because both operate on the same CUDA stream and the same state_0.body_q buffer — but this behaviour is untested. Consider adding a second parameterised case with use_cuda_graph=True, or at least a comment explaining why the CUDA-graph path can be skipped.

Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Isaac Lab Review Bot

Summary

This PR fixes a critical NaN bug where body_q (body world transforms) became stale after env resets, causing the collision pipeline to generate contacts at wrong positions. The fix adds an invalidate_fk() flag that triggers forward kinematics (eval_fk) before collision detection in step(). The implementation is correct and well-tested.

Design Assessment

Design is sound. The dirty flag pattern is the right approach:

  1. Why not call eval_fk directly in each write method? Multiple writes per reset (root pose + joint positions) would trigger redundant FK calls. The dirty flag correctly defers FK to a single call in step().

  2. Why not put eval_fk inside the CUDA graph? The FK is only needed conditionally (after resets), but CUDA graphs capture unconditionally. Running eval_fk eagerly before wp.capture_launch is correct — stream ordering guarantees the FK completes before the graph's collide() reads body_q.

  3. The guard if cls._fk_dirty and cls._needs_collision_pipeline is correct: MuJoCo's internal contacts mode (use_mujoco_contacts=True) handles FK internally, so the Newton collision pipeline path is the only one that needs this fix.

The fix follows existing patterns: ArticulationData already uses timestamp-based lazy FK invalidation for data access; this PR adds the complementary mechanism for physics stepping.

Architecture Impact

No cross-module impact. Changes are self-contained within isaaclab_newton:

  • NewtonManager.invalidate_fk() is a new classmethod, not breaking any existing API
  • All 6 articulation write methods that modify joint_q/root pose now call invalidate_fk()
  • Deprecated wrapper methods (write_root_state_to_sim, etc.) delegate to the updated methods, so they're transitively covered

Implementation Verdict

Ship it — Clean implementation, no issues found after thorough review.

Test Coverage

Excellent regression test. The test:

  • Uses use_mujoco_contacts=False to activate the Newton collision pipeline
  • Teleports env 0 by 10m (large enough to detect stale body_q)
  • Patches _simulate_physics_only to capture body_q at the moment before collide() runs
  • Asserts body_q matches joint_q within 1cm tolerance
  • Would fail with ~10m mismatch without the fix

The test correctly captures body_q at the start of _simulate_physics_only, which is AFTER the new eval_fk in step() has run, validating the fix works.

CI Status

✅ Pre-commit passing (after linter fix pushed by review bot)
⏳ Other checks pending/in-progress

Findings

No blocking issues. The unused variable state (ruff F841) has been fixed in commit 36b7659.


Review based on commit 36b7659

Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Isaac Lab Review Bot

Summary

This PR fixes a real, critical NaN bug where body_q was stale after env resets, causing the Newton collision pipeline to generate contacts for the wrong body positions. The fix adds an invalidate_fk() flag pattern that triggers eval_fk in step() before collide() runs. The approach is sound, the implementation is clean, and the regression test is well-designed.

Design Assessment

Problem: After env resets, articulation write methods (write_root_pose_to_sim_*, write_joint_position_to_sim_*) update joint_q but body_q remains stale. The collision pipeline reads body_q for broadphase/narrowphase, generating contacts for the pre-reset positions. The MuJoCo Warp solver then computes constraint Jacobians from the new qpos but old contacts — a geometry mismatch that overflows the solver gradient, producing NaN.

This PR's approach: A dirty flag (_fk_dirty) set by articulation write methods, checked in step() before the simulation (which starts with collide()). This is the correct fix.

Alternatives considered:

  1. Run eval_fk inside each write method — Simpler per-call, but wasteful: multiple writes during a single reset would trigger multiple redundant FK evaluations. The dirty-flag approach correctly batches them.
  2. Always run eval_fk at the start of step() — Simpler (no flag needed), but adds unnecessary GPU work on steps where no writes happened (the common case during training). The conditional check is better.
  3. Run eval_fk inside _simulate_physics_only before collide() — Would work for the eager path but would bloat the CUDA graph capture (FK would be captured inside the graph, running unconditionally every step).

The chosen approach (dirty flag checked before graph launch / eager simulate) correctly separates the conditional eager FK from the captured physics pipeline. Design is sound.

Architecture Impact

The invalidate_fk() method is a new public classmethod on NewtonManager. It's called from the Newton Articulation class only. No cross-module breakage — the change is self-contained within isaaclab_newton.

However: RigidObject and RigidObjectCollection have the same write methods (write_root_link_pose_to_sim_*, write_body_link_pose_to_sim_*) that also modify joint_q via the shared kernels but do NOT call invalidate_fk(). If rigid objects are used with the Newton collision pipeline (use_mujoco_contacts=False), they would exhibit the same stale body_q bug. See 🟡 Warning below.

Implementation Verdict

Minor fixes needed — Correct approach, 2 issues to address.

Test Coverage

The regression test is well-designed:

  • ✅ Sets up the Newton collision pipeline (use_mujoco_contacts=False)
  • ✅ Teleports env 0 by 10m (simulating a reset), which creates a large gap between stale body_q and new joint_q
  • ✅ Patches _simulate_physics_only to capture body_q at the correct interception point (after eval_fk in step(), before collide())
  • ✅ Asserts the position difference is < 1cm (would be ~10m without the fix)
  • ✅ Deterministic — no random seeds or timing dependencies

One gap: the test uses use_cuda_graph=False. The CUDA graph code path (where eval_fk runs eagerly before wp.capture_launch) is untested. This is acceptable since CUDA graph testing requires a GPU capture environment, but worth noting.

CI Status

pre-commit ✅ passed. Build/test checks still pending (Docker image build in progress).

Findings

🟡 Warning: RigidObject/RigidObjectCollection missing invalidate_fk() — same bug pattern

RigidObject.write_root_link_pose_to_sim_index/mask and RigidObjectCollection.write_body_link_pose_to_sim_index/mask write to joint_q via the same shared kernels but don't call SimulationManager.invalidate_fk(). If a scene uses rigid objects with the Newton collision pipeline (use_mujoco_contacts=False), resetting rigid object poses would leave body_q stale — the same bug this PR fixes for articulations. This doesn't block this PR (the articulation fix is correct and urgent), but should be addressed in a follow-up.

🔵 Improvement: CHANGELOG RST heading level inconsistency

Minor: the existing 0.5.9 entries use ~~~~~~~~~~~~~~~~~~ (18 tildes for 19-char title). The new 0.5.10 entry uses ~~~~~~~~~~~~~~~~~~~ (19 tildes for 19-char title). Both are valid RST, but the length differs from the existing convention. Not blocking.

if cls._fk_dirty and cls._needs_collision_pipeline:
eval_fk(cls._model, cls._state_0.joint_q, cls._state_0.joint_qd, cls._state_0, None)
cls._fk_dirty = False

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The unconditional cls._fk_dirty = False after the conditional eval_fk is correct — it clears the flag regardless of _needs_collision_pipeline. If collision pipeline is disabled, the stale body_q doesn't matter (no collide() will read it), so clearing the flag without running FK is the right behavior.

Good.

# Ensure body_q is up-to-date before collision detection.
# After env resets, joint_q is written but body_q (used by
# broadphase/narrowphase) is stale until FK runs.
if cls._fk_dirty and cls._needs_collision_pipeline:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This eval_fk runs eagerly outside the CUDA graph. When graphs are enabled, the execution order is:

  1. Eager eval_fk → updates body_q in _state_0
  2. wp.capture_launch(graph) → replays graphed collide() which reads the now-current body_q

This is correct because CUDA graphs replay on the same memory pointers, so the eager write is visible to the graphed kernel. The FK call is a single Warp kernel launch, so the overhead when _fk_dirty is True is minimal.

Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Isaac Lab Review Bot

Summary

This PR fixes a NaN explosion during Newton training caused by stale body_q after env resets. Articulation write methods update joint_q but never recomputed body_q via forward kinematics, causing the collision pipeline to use wrong body positions. The fix adds a _fk_dirty flag on NewtonManager that write methods set, and step() conditionally runs eval_fk() before collision detection.

Design Assessment

Design is sound. The two-level invalidation approach is well-designed:

  1. Per-instance _fk_timestamp = -1.0 (already existed) ensures lazy FK recomputation when body poses are queried through the data API.
  2. Global _fk_dirty flag (new) ensures eager FK recomputation before collision detection in step().

The fix correctly places the eval_fk call in step() rather than inside _simulate_physics_only(), which avoids including it in CUDA graph captures (the conditional if cls._fk_dirty is a Python-level branch that can't be captured). The flag is unconditionally cleared after the check, which is correct — when no collision pipeline is active, stale body_q doesn't cause collision issues, and the per-instance _fk_timestamp handles the lazy path.

Alternatives considered:

  • Running eval_fk unconditionally in step() before every simulation: wasteful — FK is O(bodies) and unnecessary when no writes happened.
  • Running eval_fk inside the write methods themselves: would cause redundant FK calls when multiple articulations reset in the same step (common in RL).
  • The chosen approach (dirty flag + single FK call) is the right design.

Architecture Impact

No cross-module impact beyond the changes. The invalidate_fk() method is called from Articulation (which already imports NewtonManager as SimulationManager) and consumed by NewtonManager.step(). All 6 write methods that modify joint_q or root transforms are correctly instrumented. Rigid objects are unaffected — they write directly to body_q and don't need FK.

Implementation Verdict

Minor fixes needed — Correct approach, 1 issue to address.

Test Coverage

The regression test is well-designed: it teleports env 0 by 10m, patches _simulate_physics_only to capture body_q at the moment before collide() runs, and asserts that body_q matches joint_q within 1cm tolerance. The test uses use_mujoco_contacts=False to activate the Newton collision pipeline and use_cuda_graph=False for the eager path. The test would fail without the fix (10m mismatch) and passes with it. One issue: missing CI marker (see inline comment).

CI Status

pre-commit ✅ | Some checks still in progress (Build Base Docker Image, license-check, Build Latest Docs)

Findings

🔵 Improvement: test_articulation.py:2409 — Missing @pytest.mark.isaacsim_ci marker. Every other test in this file has this marker. Without it, this test may not run in CI and could silently break without anyone noticing.

Copy link
Copy Markdown
Collaborator

@AntoineRichard AntoineRichard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the right way of doing this. We could make the eval fk specific to a given robot but it adds overhead, and I'm not sure we'd be gaining much perf in doing so.
LGTM! Thanks Octi!

@AntoineRichard AntoineRichard merged commit 151b054 into isaac-sim:develop Apr 7, 2026
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working isaac-lab Related to Isaac Lab team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants