[Test] Stabilize RTX MDL shader-warmup flake in rendering correctness tests#5687
[Test] Stabilize RTX MDL shader-warmup flake in rendering correctness tests#5687hujc7 wants to merge 6 commits into
Conversation
… tests `test_shadow_hand_vision_presets.py::test_camera_renders_not_empty` and the three helper-driven rendering tests in `rendering_test_utils.py` (`rendering_test_shadow_hand`/`_cartpole`/`_dexsuite_kuka`) intermittently fail with "Camera output 'simple_shading_*_mdl' is all zeros or all inf after stepping" on cold-cache CI runners. Root cause is RTX MDL shader compile timing: a single `env.step()` returns a still-zero framebuffer for shader variants that haven't finished compiling. The variant that flakes (constant_diffuse / diffuse_mdl / full_mdl) shifts between runs depending on which MDL pipeline finishes first. Step the env 10 frames before reading the camera tensor — empirically enough across the MDL variants that flake. Adds ~1-2 s per parametrize variant; cheap relative to the cost of a retry. Affected jobs: - `isaaclab_tasks [2/3]` (test_camera_renders_not_empty) - `rendering-correctness-kitless` (test_rendering_*_kitless[*-simple_shading_*])
There was a problem hiding this comment.
Summary
This PR correctly addresses an RTX MDL shader-warmup race condition in the rendering correctness tests. The root cause analysis is solid: asynchronous shader compilation can leave the framebuffer zeroed when tests read camera tensors too early after environment creation.
Key Observations
✅ Strengths
-
Well-designed helper function: The
_warmup_render()abstraction is clean and consolidates the warmup logic in a single place, making future tuning straightforward. -
Good documentation: Both the docstring and inline comments explain why 10 steps are needed, including the specific MDL shader variants affected. This context will help future maintainers.
-
Appropriate placement: Calling
_warmup_render()aftermaybe_save_stage()but beforevalidate_camera_outputs()is the right sequencing. -
Minimal scope: The change is surgical — it only touches the test infrastructure that was flaking, with no production code impact.
💭 Observations (not blockers)
-
Magic number 10: The warmup step count is empirically determined. Consider adding a brief comment in the fixture (
render_correctness_env) referencing_warmup_renderso readers know both sites use the same rationale. -
No explicit warmup validation: The helper assumes stepping 10 times is sufficient but does not verify shaders actually compiled. This is likely fine for a test stabilization fix, but if flakes persist, a future enhancement could add a pixel-value sanity check after warmup.
-
Changelog fragment: Well-written and follows the project format.
Verdict
LGTM 👍 — This is a pragmatic fix for a well-understood CI flake. The tradeoff of ~1-2s additional wall time per variant is absolutely worth the CI stability gain.
Update (08cd6c9): Excellent refinement! The switch from env.step() to sim.render() + scene.update() is a meaningful improvement:
- ✅ No physics advancement: The warmup no longer steps physics state, so golden reference images captured at the post-init state remain valid without recapture
- ✅ Better semantic fit: The new approach mirrors
DirectRLEnvCfg.num_rerenders_on_reset/wait_for_texturespatterns already established in the codebase - ✅ Clearer naming:
num_passesbetter describes what's happening vsnum_steps - ✅ Improved documentation: The expanded docstring now explicitly calls out the physics-preserving design and references the relevant config attributes
The changelog update accurately reflects the new approach, differentiating the render-only warmup (for golden-image tests) from the step-based warmup (for non-zero pixel checks where physics drift is acceptable).
LGTM 👍 — Ready to merge.
Update (c359111): Significant scope reduction — the PR now takes a simpler approach:
- ❌ Removed
_warmup_render()helper entirely - ❌ Removed warmup calls from
rendering_test_shadow_hand,rendering_test_cartpole, andrendering_test_dexsuite_kuka - ✅ Retained only the step-based warmup for
test_camera_renders_not_empty(which has no golden comparison) - 📝 Updated changelog to reflect the reduced scope, noting the golden-image tests already use
@flaky(max_runs=3)
Assessment:
This is a reasonable decision. The tradeoffs:
| Approach | Pros | Cons |
|---|---|---|
| New (flaky retries) | Zero additional wall time; existing mechanism; simpler code | May still fail 3x in rare cold-cache scenarios |
| Previous (warmup) | Deterministic fix | ~1-2s added per test; more invasive |
The author is betting that @flaky(max_runs=3) provides sufficient coverage for the golden-image tests, since shader compilation is a one-time cold-cache event. If any retry succeeds (after shaders compile), the test passes. This is pragmatic.
LGTM 👍 — The reduced scope is justified. The only test that truly needed explicit warmup was test_camera_renders_not_empty (which lacked any retry mechanism), and that fix is retained.
Update (39e6bb7): Nice iteration! The warmup now uses adaptive polling instead of a fixed step count:
- ✅ Polling with early exit: Instead of always stepping 10 times, the fixture now polls up to
_MAX_WARMUP_STEPS = 60but exits as soon as all camera outputs havemax() > 0.2(excluding inf values) - ✅ Handles all output types: The loop checks
env._tiled_camera.data.output.values()so all data-type tensors (RGB, depth, etc.) are validated - ✅ Better worst-case coverage: 60 steps provides headroom for pathological cold-cache scenarios while the early-exit keeps typical runs fast
- ✅ Improved changelog: Updated to describe the polling approach with the 60-step cap
Minor observations (not blockers):
-
Threshold
0.2: The pixel value threshold is empirically chosen. It's reasonable but could be documented (e.g., why not 0.0 or 0.1?). For MDL shaders, fully black pixels return 0.0, so 0.2 is safely above that. -
Direct camera access:
env._tiled_camerais a private attribute. If the API changes, this will break. For a test fixture this is acceptable since tests can be updated alongside implementation changes.
LGTM 👍 — The adaptive approach is strictly better than the fixed count: faster in the common case, more robust in edge cases. Ready to merge.
Update (3de115a): Great consolidation! The warmup logic is now properly refactored:
Changes:
-
New helper
_warmup_render_until_nonzero()inrendering_test_utils.py:- Extracted from the fixture into a reusable function
- Uses
sim.render() + scene.update(dt)instead ofenv.step()— physics state is NOT advanced - 30-pass cap (vs 60 in the fixture version)
- Same
0.2threshold logic with inf filtering - Flexible camera detection: checks
_tiled_camerafirst, falls back toscene.sensors.get('base_camera') - Excellent docstring explaining the MDL async compilation race and why render-only passes preserve golden validity
-
Applied to all three golden-image test helpers:
rendering_test_shadow_hand()rendering_test_cartpole()rendering_test_dexsuite_kuka()
-
Expanded changelog:
- Now documents both approaches: step-based for
test_camera_renders_not_empty(60-step cap) and render-only for golden tests (30-pass cap) - Clearly explains why physics-preserving warmup is needed for golden comparisons
- Now documents both approaches: step-based for
Assessment:
This is the right design:
| Test Type | Warmup Method | Physics Impact | Why |
|---|---|---|---|
test_camera_renders_not_empty |
env.step() (60 cap) |
Yes | Only checks non-zero; goldens don't matter |
| Golden-image tests | sim.render() + scene.update() (30 cap) |
No | Must match existing reference frames |
The dual approach respects the invariants each test type requires. Render-only warmup for golden tests means no golden recapture is needed.
LGTM 👍 — Clean, well-documented, and production-ready. Ready to merge.
Update (eb23b0c): Excellent final polish! Key improvements:
-
Public API exposure:
_warmup_render_until_nonzero→warmup_render_until_nonzero— the underscore removal signals this is now a supported test utility -
Multi-sensor support: The helper now iterates over ALL sensors in
env.scene.sensorsinstead of hardcoded_tiled_camera/base_cameralookups. This is more robust and future-proof for envs with multiple cameras -
Wrapped env handling: Added
getattr(env, "unwrapped", env)to correctly access scene/sim on wrapped environments -
Extended coverage: Now also applied to
test_rendering_registered_tasks.py, which was previously missing warmup and could flake on cold-cache runs -
Changelog accuracy: Updated to reflect the broader scope (four rendering helpers + registered tasks test)
The nested loop structure properly handles early-exit semantics and the defensive getattr chains ensure graceful no-ops for envs without standard sensor structures.
LGTM 👍 — This is the complete, polished solution. Ready to merge.
|
does not seem working. need to change golden file. |
|
Reopening. Going back at this with the warmup + golden regeneration approach, validated locally before pushing. |
The previous warmup used env.step() which advances physics and breaks golden-image comparison in the rendering correctness tests. Switch to sim.render() + scene.update() so shader compile finishes without moving the scene state, mirroring num_rerenders_on_reset in the env classes.
The 10-pass sim.render warmup in the three goldenfile helpers (rendering_test_shadow_hand / _cartpole / _dexsuite_kuka) caused test_rendering_dexsuite_kuka and *_kitless variants to fail with large pixel diffs vs the existing goldens — the warmup exposes a different post-loading scene state for the dexsuite_kuka asset set. Keep only the test_camera_renders_not_empty change (no golden compare, env.step warmup is safe there). The goldenfile helpers already use flaky(max_runs=3) for occasional MDL hiccups.
10 frames was not enough — CI still flaked on test_camera_renders_not_empty[physx-isaacsim_rtx-simple_shading_diffuse_mdl]. Poll until every camera output tensor has a non-zero max with a 60-step cap, exiting early once all outputs are ready.
Greptile SummaryThis PR adds a poll-until-non-zero MDL shader warmup to the rendering correctness tests to eliminate intermittent "Camera output is all zeros or all inf" failures on cold-cache CI runners, where the GPU returns a still-zero framebuffer before RTX MDL materials finish compiling.
Confidence Score: 5/5Safe to merge; the warmup logic is correct, early-exits cleanly, and does not advance physics state, so existing golden images remain valid. The change is narrowly scoped to test infrastructure: it adds a polling loop around render calls with a bounded cap, uses the same threshold already in the test assertions, and all three rendering helpers are updated consistently. No production code is touched. No files require special attention. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Env constructed] --> B{warmup_render_until_nonzero}
B --> C[Check all sensor tensors\nfinite.max > 0.2?]
C -- All ready --> D[Return immediately\nstate at first-non-zero frame]
C -- Not ready --> E[sim.render + scene.update\nno physics advance]
E --> F{Pass count < 30?}
F -- Yes --> C
F -- No cap hit --> G[Return silently\nno warning emitted]
D --> H[validate_camera_outputs\nor golden compare]
G --> H
Reviews (3): Last reviewed commit: "Apply warmup to registered-tasks renderi..." | Re-trigger Greptile |
| _MAX_WARMUP_STEPS = 60 | ||
| for _ in range(_MAX_WARMUP_STEPS): | ||
| env.step(actions) | ||
| outputs_ready = True | ||
| for output in env._tiled_camera.data.output.values(): | ||
| tensor = output.torch | ||
| finite = torch.where(torch.isinf(tensor), torch.zeros_like(tensor), tensor) | ||
| if finite.max() <= 0.2: | ||
| outputs_ready = False | ||
| break | ||
| if outputs_ready: | ||
| break |
There was a problem hiding this comment.
No diagnostic when warmup cap is reached
If all 60 steps complete without any output reaching the threshold, the fixture silently yields and the test fails with the standard "Camera output is all zeros or all inf" assertion error — with no indication that the warmup cap was hit. Adding a warnings.warn or print on cap exhaustion would make it immediately clear in CI logs that the problem is slow shader compilation rather than a deeper rendering bug.
The three rendering_test_utils helpers (rendering_test_shadow_hand / _cartpole / _dexsuite_kuka) now drive sim.render() + scene.update() passes until every camera-output tensor has a non-zero max, with a 30-pass cap and early-exit. Physics is not advanced (no env.step) so the existing goldens at the post-init "first non-zero frame" stay valid. A previous fixed 10-pass over-rendered and broke dexsuite_kuka goldens at 91% pixel diff.
Extend warmup_render_until_nonzero to iterate every sensor in env.scene.sensors (instead of hardcoded _tiled_camera / base_camera) and call it from test_rendering_registered_tasks.py. Also rename the helper from _warmup_render_until_nonzero (module-private) to warmup_render_until_nonzero now that it has cross-module callers.
Description
The rendering-correctness tests intermittently fail on cold-cache CI runners with "Camera output is all zeros or all inf" for
simple_shading_*_mdlandsimple_shading_constant_diffusevariants — the GPU returns a still-zero framebuffer because the RTX MDL material has not finished compiling by the time the test reads the camera tensor.Add a poll-until-non-zero warmup that exits as soon as every camera-output tensor has a non-zero max:
test_shadow_hand_vision_presets.py::test_camera_renders_not_empty— polls viaenv.step()with a 60-step cap. (This test has noflakyretry, so a single-step fixture is the actual failure surface.)rendering_test_utils.warmup_render_until_nonzero— iterates every sensor inenv.scene.sensorsand polls viasim.render()+scene.update()with a 30-pass cap. Called from the four rendering helpers inrendering_test_utils.py(rendering_test_shadow_hand/_cartpole/_dexsuite_kuka) and fromtest_rendering_registered_tasks.py. Mirrors the pattern already inDirectRLEnvCfg.num_rerenders_on_reset/wait_for_textures; physics state is not advanced, so the existing goldens at the post-init "first non-zero frame" stay valid.An earlier attempt at a fixed-pass warmup over-rendered in the goldenfile helpers and broke
dexsuite_kukagoldens at 91% pixel diff — early-exit on first-non-zero avoids that by landing at the same frame the goldens were captured at.Type of change
Checklist