Skip to content

Improves hanging flakiness in CI tests#5479

Merged
kellyguo11 merged 12 commits intoisaac-sim:developfrom
kellyguo11:test-articulation-timeout
May 5, 2026
Merged

Improves hanging flakiness in CI tests#5479
kellyguo11 merged 12 commits intoisaac-sim:developfrom
kellyguo11:test-articulation-timeout

Conversation

@kellyguo11
Copy link
Copy Markdown
Contributor

@kellyguo11 kellyguo11 commented May 3, 2026

Description

  • Increase the CI startup-hang grace period from 45s to 120s so slow but valid Kit startup is not killed prematurely.
  • Make SurfaceGripper fail fast on non-CPU simulation backends before loading the surface gripper extension.
  • Skip the CI-only SurfaceGripperView CPU initialization path that can deadlock, while keeping CUDA fail-fast coverage.

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Checklist

  • I have read and understood the contribution guidelines
  • I have run the pre-commit checks with ./isaaclab.sh --format
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the changelog and the corresponding version in the extension's config/extension.toml file
  • I have added my name to the CONTRIBUTORS.md or my name already exists there

@kellyguo11 kellyguo11 requested a review from hhansen-bdai as a code owner May 3, 2026 17:15
@kellyguo11 kellyguo11 marked this pull request as draft May 3, 2026 17:15
@github-actions github-actions Bot added isaac-lab Related to Isaac Lab team infrastructure labels May 3, 2026
Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Isaac Lab Review Bot

Summary

This PR introduces temporary debugging instrumentation to investigate intermittent CI timeouts in PhysX articulation tests. It adds verbose print statements to test_external_force_on_single_body_at_position, narrows CI execution to only 3 specific test files, disables timeout retries, and significantly reduces some test timeouts. This is explicitly marked "DO NOT MERGE" — it's a debug-focused investigation branch, not a production change.

Architecture Impact

High-impact CI behavior change: When FOCUS_ARTICULATION_HANG_DEBUG = True (hardcoded), the entire CI test suite is reduced to only 3 test files. This will cause all other tests to be skipped in any CI run using this branch. The conftest.py changes affect the global test collection logic, meaning this would break normal CI if merged.

Implementation Verdict

Needs rework — While appropriate as a temporary debug branch, several issues need attention even for investigation purposes, and the PR absolutely cannot be merged in its current state.

Test Coverage

This is a debug instrumentation PR, not a feature or bug fix, so new test coverage is not applicable. However, the changes reduce test coverage by limiting CI to only 3 files.

CI Status

No CI checks available yet. When CI runs, it will only execute test_articulation.py, test_surface_gripper.py, and test_non_headless_launch.py due to the hardcoded debug focus mode.

Findings

🔴 Critical: tools/conftest.py:51-63 — Debug mode hardcoded to True will break CI for all other tests

FOCUS_ARTICULATION_HANG_DEBUG = True

This hardcoded True value combined with the logic at line 687-692 forces filter_pattern to only include 3 test files, skipping the entire rest of the test suite. If this PR were merged (despite the title), CI would effectively stop testing 99%+ of the codebase. The constant should at minimum be controlled by an environment variable for safety:

FOCUS_ARTICULATION_HANG_DEBUG = os.environ.get("FOCUS_ARTICULATION_HANG_DEBUG", "false") == "true"

🔴 Critical: tools/test_settings.py:64 — Timeout reduced from 3000s to 300s (10x reduction) is likely too aggressive

"test_surface_gripper.py": 300,

This is one of the tests being debugged for timeout issues. Reducing its timeout by 10x while investigating hangs seems counterproductive — it will cause the test to timeout faster and potentially mask the actual hang location. If the test legitimately takes longer, this will cause false failures.

🟡 Warning: source/isaaclab_physx/test/assets/test_articulation.py:1021-1027 — Debug prints inside tight loop will significantly slow test execution

for step_i in range(100):
    # ...
    print(f"{debug_prefix}: outer loop {outer_i} step {step_i} before write_data_to_sim", flush=True)
    # ...
    print(f"{debug_prefix}: outer loop {outer_i} step {step_i} before sim.step", flush=True)
    # ...
    print(f"{debug_prefix}: outer loop {outer_i} step {step_i} after sim.step", flush=True)
    # ...
    print(f"{debug_prefix}: outer loop {outer_i} step {step_i} after articulation.update", flush=True)

With num_articulations in [1, 2] and device in ["cuda:0", "cpu"], this creates 4 test instances × 5 outer loops × 100 inner steps × 4 prints = 8000 print statements with flush=True. This I/O overhead could materially affect timing characteristics and potentially mask or move the hang location. Consider printing only every N steps (e.g., if step_i % 25 == 0).

🟡 Warning: tools/conftest.py:51 — TIMEOUT_RETRIES set to 0 may cause flaky test failures

TIMEOUT_RETRIES = 0

Disabling retries is intentional for debugging, but combined with the reduced timeouts, this increases the chance of false-positive failures. The comment and variable should document this is temporary.

🔵 Improvement: source/isaaclab_physx/test/assets/test_articulation.py:969 — Loop variable shadowing

for outer_i in range(5):
    # ...
    for i in range(num_articulations):
        assert articulation.data.root_pos_w.torch[i, 2].item() < 0.2

The inner loop uses i which previously was the outer loop variable (changed to outer_i). This is fine, but the inner loop variable i should perhaps be renamed to env_idx or similar for clarity in the debug output context.

🔵 Improvement: tools/conftest.py:368-372 — Debug branch adds -k filter only for articulation test but runs all tests in other focused files

if "test_articulation.py" in normalized_test_file:
    cmd.extend(["-k", FOCUS_ARTICULATION_HANG_TEST_EXPR])

This means test_surface_gripper.py and test_non_headless_launch.py will run all their tests, while test_articulation.py only runs test_external_force_on_single_body_at_position. If the hang is in one of the other files, all tests in those files will run. Consider whether this is the intended behavior or if similar filtering should apply.

🔵 Improvement: tools/test_settings.py:20 — Timeout reduced from 1500s to 1000s without justification

"test_articulation.py": 1000,

The timeout was reduced by 500s. While 1000s is still substantial, the change isn't explained. If this is part of the debug investigation (expecting faster failure), it should be documented.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 3, 2026

Greptile Summary

This PR addresses CI flakiness caused by two distinct root causes: Kit startup being killed before it finishes on slow workers, and SurfaceGripperView initialization deadlocking on CPU in the CI environment.

  • STARTUP_DEADLINE in conftest.py is raised from 45 s to 120 s to give legitimately slow cold-worker Kit startups enough time before being declared a hang.
  • surface_gripper.py moves the CPU device guard before enable_extension(...), so a non-CPU caller raises immediately without touching the extension loader.
  • test_initialization in test_surface_gripper.py is unconditionally skipped in CI via a new _RUNNING_CI flag, leaving only the CUDA fail-fast test as live CI coverage.

Confidence Score: 5/5

Safe to merge — all three changes are narrowly scoped fixes with no regressions on the happy path.

The surface-gripper reordering is a one-line-swap with a clear before/after; the startup deadline bump is a constant change; and the CI skip is guarded behind an env-var flag rather than touching production logic. No existing passing tests are broken, and the only coverage removed is a test that was actively causing deadlocks in CI.

The _RUNNING_CI expression in test_surface_gripper.py has a minor inconsistency worth tidying, but no file requires a blocking second look.

Important Files Changed

Filename Overview
source/isaaclab_physx/isaaclab_physx/assets/surface_gripper/surface_gripper.py Moves enable_extension and the extension import to after the CPU device guard, so non-CPU callers raise immediately without triggering a potentially blocking extension load.
source/isaaclab_physx/test/assets/test_surface_gripper.py Adds a _RUNNING_CI flag to skip the CPU test_initialization in CI (due to a known deadlock), while keeping the CUDA fail-fast test active. The GITLAB_CI branch of the flag is not normalised to a bool.
tools/conftest.py Increases STARTUP_DEADLINE from 45 s to 120 s to accommodate legitimately slow Kit initialisation on cold CI workers; no logic changes.
source/isaaclab_physx/changelog.d/test-articulation-timeout.rst New changelog entry accurately describing the surface gripper fail-fast fix.

Sequence Diagram

sequenceDiagram
    participant CI as CI Runner
    participant CF as conftest.py
    participant SG as SurfaceGripper._initialize_impl()
    participant EXT as isaacsim.robot.surface_gripper

    Note over CI,CF: STARTUP_DEADLINE: 45s to 120s
    CI->>CF: launch test subprocess
    CF->>CF: wait for AppLauncher init complete
    alt startup within 120s
        CF->>CF: continue test run
    else no signal after 120s
        CF->>CI: kill + report startup_hang
    end

    Note over SG,EXT: surface_gripper.py reordering
    SG->>SG: check self._device != cpu
    alt device is CUDA
        SG-->>CI: raise Exception fast fail no extension load
    else device is CPU
        SG->>EXT: enable_extension
        EXT-->>SG: GripperView loaded
        SG->>SG: continue initialisation
    end

    Note over CI,SG: test_surface_gripper.py in CI
    CI->>CI: _RUNNING_CI = True
    CI->>SG: test_initialization SKIPPED deadlock risk
    CI->>SG: test_raise_error_if_not_cpu cuda runs expects Exception
Loading

Reviews (2): Last reviewed commit: "add changelog fragment" | Re-trigger Greptile

Comment thread tools/test_settings.py Outdated
"test_shadow_hand_vision_presets.py": 5000,
"test_environments_newton.py": 5000,
"test_surface_gripper.py": 3000,
"test_surface_gripper.py": 300,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Aggressive timeout reduction for test_surface_gripper.py

test_surface_gripper.py has been cut from 3000 s to 300 s (10×). Because test_surface_gripper.py is also in FOCUS_ARTICULATION_HANG_TEST_PATHS, it will be scheduled and run in the debug CI pass — with TIMEOUT_RETRIES = 0. If the surface-gripper suite legitimately takes more than 300 s (the original 3000 s allocation strongly implies it can), every run will produce a spurious hard-timeout failure that is indistinguishable from the hang under investigation, making diagnosis harder rather than easier.

Comment thread tools/conftest.py Outdated
TIMEOUT_RETRIES = 0
"""Number of times to retry a test that reaches its hard timeout before giving up."""

FOCUS_ARTICULATION_HANG_DEBUG = True
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 FOCUS_ARTICULATION_HANG_DEBUG has no env-variable escape hatch

FOCUS_ARTICULATION_HANG_DEBUG = True is hardcoded with no way to override it at runtime (e.g., via os.environ.get("FOCUS_ARTICULATION_HANG_DEBUG", "false") == "true"). When this flag is True it unconditionally overrides filter_pattern in pytest_sessionstart, silently ignoring any TEST_FILTER_PATTERN env var or --include-pattern CLI option that a CI job already supplies. Any CI workflow that is not the debug job will inadvertently run only the 3 focus files until this constant is reverted manually.

Comment thread tools/conftest.py Outdated
Comment on lines 51 to 52
TIMEOUT_RETRIES = 0
"""Number of times to retry a test that reaches its hard timeout before giving up."""
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 TIMEOUT_RETRIES = 0 affects all tests, not just focus tests

Setting retries to 0 removes the safety net for every test in the session, not only the three focus files. Tests that are intermittently slow (e.g., shader-cache cold starts, resource-contended CI runners) and previously relied on the two retries will now fail permanently. Consider scoping the no-retry behavior to FOCUS_ARTICULATION_HANG_TEST_PATHS entries only, so other tests retain their retry budget.

Comment on lines 1016 to 1022
body_ids=body_ids,
is_global=is_global,
)
print(f"{debug_prefix}: outer loop {outer_i} after set/add wrench", flush=True)
# perform simulation
for _ in range(100):
for step_i in range(100):
# apply action to the articulation
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Per-step prints produce ~1 000+ lines per test parameterization

Two print calls are placed inside the 100-step inner loop (before write_data_to_sim and after articulation.update). With -s capturing disabled and 5 outer iterations, that is at minimum 1 000 lines per (device, num_articulations) combination. Across all parameterized combinations this becomes very verbose. Moving these traces outside the inner loop, or gating them with a step_i % 10 == 0 condition, would preserve diagnostic value without flooding logs.

Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Isaac Lab Review Bot

Summary

This is a follow-up review. The new commits add debug instrumentation to test_non_headless_launch.py and the Newton articulation test file, and switch from print() to sys.__stdout__.write() for bypass-buffering debug output. The core debug scaffolding remains unchanged.

Previous Concerns Status

  • TIMEOUT_RETRIES = 0: Still set to 0, but this is intentional for debugging — acknowledged.
  • Hardcoded FOCUS_ARTICULATION_HANG_DEBUG = True: Still hardcoded, but given "[DO NOT MERGE]" title, this is acceptable for the debug branch.
  • Aggressive timeout reductions: Still present (300s for surface_gripper, 1000s for articulation).
  • Debug prints in tight loop: Now using sys.__stdout__.write() with explicit flush(), which is marginally better than print(flush=True) for bypassing Python buffering, but the I/O overhead concern in the 100-step inner loop remains.

New Findings

🔵 Improvement: Consistent debug instrumentation pattern across files
The new commits add sys.__stdout__.write() debug markers to test_non_headless_launch.py and test_surface_gripper.py, which is consistent with the debug investigation approach. The Newton articulation test now mirrors the PhysX version's detailed step-by-step logging. This is appropriate for the debugging purpose.

No new critical or warning-level issues introduced. The PR remains a debug-only branch as indicated by the title.

Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Isaac Lab Review Bot

Summary

This is a follow-up review. The new commit (7296571) makes no substantive changes — the diff between the previous review and this commit shows identical content. The debug instrumentation, FOCUS_ARTICULATION_HANG_DEBUG = True, TIMEOUT_RETRIES = 0, and reduced timeouts all remain as previously noted.

Status

All previous observations remain valid and unchanged. No new issues introduced. The PR continues to be appropriate for its stated "[DO NOT MERGE]" debug investigation purpose.

Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Isaac Lab Review Bot

Follow-up Review

No substantive changes between the previous review commit and c9a364b — the debug instrumentation remains identical. The PR continues to serve its stated "[DO NOT MERGE]" debug investigation purpose with FOCUS_ARTICULATION_HANG_DEBUG = True, disabled timeout retries, and verbose per-step logging to diagnose the intermittent articulation hang.

CI Status Note

The widespread CI failures (isaaclab_physx, isaaclab_newton, etc.) are expected since FOCUS_ARTICULATION_HANG_DEBUG = True restricts test execution to only the articulation-related debug files, causing normal test discovery to be skipped.

Verdict: Appropriate for debug investigation branch. Do not merge to develop.

Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Isaac Lab Review Bot

Follow-up Review

The new commit (0f09c3b) adds one minor change: moving the enable_extension() and from isaacsim.robot.surface_gripper import GripperView import after the CPU device check in surface_gripper.py. This is a reasonable improvement — it avoids loading the extension unnecessarily when the device check will fail anyway.

🔵 Improvement: source/isaaclab_physx/isaaclab_physx/assets/surface_gripper/surface_gripper.py:446-451 — The reordering is sensible, but the exception type should be RuntimeError (not bare Exception) to follow Isaac Lab conventions for configuration/initialization errors.

Verdict: No new concerns. Debug investigation branch remains appropriate for its stated "[DO NOT MERGE]" purpose.

Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Isaac Lab Review Bot

Follow-up Review

The new commit (afb26fb) adds minor changes: a sys import move in test_non_headless_launch.py and debug logging statements. The previous concern about using bare Exception instead of RuntimeError in surface_gripper.py (lines 448-451) has NOT been addressed — the code still raises Exception rather than RuntimeError.

No new issues introduced. This debug branch remains appropriate for its "[DO NOT MERGE]" investigation purpose.

Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Isaac Lab Review Bot

Follow-up Review

The new commit (1488ff4) adds more debug logging to test_surface_gripper.py and includes a @pytest.mark.skipif to skip test_initialization in CI environments due to a known deadlock issue.

Previous concern NOT addressed: The surface_gripper.py still raises bare Exception (lines 448-451) instead of RuntimeError.

New observation: The skipif logic on line 175 is reasonable for CI stability during debugging, but note that os.environ.get("GITLAB_CI") returns a string (or None), not a boolean — it should be compared to a truthy value like the others. However, since any non-empty string is truthy in Python's boolean context, this works correctly in practice.

No blocking new issues. This remains an appropriate debug branch for its stated investigation purpose.

Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Isaac Lab Review Bot

Follow-up Review

The new commit (4a15941) increases STARTUP_DEADLINE from 45s to 120s in tools/conftest.py and updates the docstring accordingly.

Previous concern NOT addressed: The surface_gripper.py still raises bare Exception (lines 448-451) instead of RuntimeError. This remains a code quality issue.

New changes look reasonable: The increased startup deadline (120s) is a sensible adjustment for cold CI workers where Kit startup can legitimately exceed 60s. The docstring update accurately reflects the new behavior.

No new blocking issues introduced. This debug branch continues to be appropriate for its investigation purpose.

Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Isaac Lab Review Bot

Follow-up Review

The new commit (f741802) reorders code in surface_gripper.py to check self._device != "cpu" before calling enable_extension() and importing GripperView. This is a good fix — it ensures the CPU-only check fails fast without triggering the potentially deadlocking extension initialization.

Previous concern still NOT addressed: The code still raises bare Exception (line 448) instead of RuntimeError. Minor code quality issue.

New skipif decorator looks correct: The test now skips on CI environments to avoid the deadlock, while keeping the CUDA fail-fast coverage for catching device misuse. This is a reasonable workaround for debugging.

No new blocking issues. Ship it for the debug/investigation purpose of this branch.

@kellyguo11 kellyguo11 changed the title [DO NOT MERGE] debug timeout tests Improves hanging flakiness in CI tests May 5, 2026
@kellyguo11 kellyguo11 marked this pull request as ready for review May 5, 2026 17:51
Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Isaac Lab Review Bot

Follow-up Review

No new changes since the previous review (b46f04f is the same as f741802 referenced earlier). The reordering fix in surface_gripper.py correctly moves the CPU check before enable_extension() to fail fast.

Previous minor concern still open: The code raises bare Exception instead of RuntimeError at line 448. This is non-blocking.

CI Status: All substantive jobs are still pending. The fix is logically sound — ship it once CI passes.

@kellyguo11 kellyguo11 merged commit fa5959f into isaac-sim:develop May 5, 2026
55 of 56 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

infrastructure isaac-lab Related to Isaac Lab team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant