Feat/frame view enable mgpu #5514
Conversation
Greptile Summary

This PR lifts the `cuda:0`-only device restriction in `FabricFrameView`, enabling Fabric acceleration on any CUDA device.
Confidence Score: 5/5

Safe to merge — the core change is a targeted removal of an outdated device allowlist with a well-scoped refactor and adequate test coverage. The change is small and focused: removing the `_fabric_supported_devices` allowlist and the related device guards.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Caller
    participant FFV as FabricFrameView
    participant CFT as _compose_fabric_transform
    participant Warp as Warp Kernel
    participant Fabric as USDRT Fabric
    Caller->>FFV: set_world_poses(positions, orientations)
    FFV->>CFT: _compose_fabric_transform(positions, orientations)
    CFT->>FFV: _initialize_fabric() [if first call]
    FFV-->>Fabric: SelectPrims(device=self._device)
    CFT->>FFV: _prepare_for_reuse()
    FFV-->>Fabric: PrepareForReuse() [once]
    CFT->>Warp: launch compose_fabric_transformation_matrix
    Warp-->>Fabric: write worldMatrix
    CFT->>Fabric: update_world_xforms()
    Caller->>FFV: set_scales(scales)
    FFV->>CFT: _compose_fabric_transform(scales)
    CFT->>FFV: _prepare_for_reuse()
    FFV-->>Fabric: PrepareForReuse() [once]
    CFT->>Warp: launch compose_fabric_transformation_matrix
    Warp-->>Fabric: write worldMatrix
    Caller->>FFV: get_world_poses()
    FFV->>FFV: _sync_fabric_from_usd_once() [if not synced]
    FFV->>CFT: _compose_fabric_transform(pos+ori+scales)
    Note over CFT,Fabric: Single PrepareForReuse for full sync
    FFV->>Warp: launch decompose_fabric_transformation_matrix
    Warp-->>FFV: positions, orientations
```
Reviews (2): Last reviewed commit: "Enable FabricFrameView on non-primary GP..."
Force-push: ce0aaa0 to a6cd73e
- Allow FabricFrameView to run on cuda:N for any N; USDRT SelectPrims no longer needs cuda:0.
- Refactor the Fabric write path into a single _compose_fabric_transform helper shared by set_world_poses, set_scales, and the initial USD->Fabric sync, collapsing the sync to one kernel launch with one PrepareForReuse.
- Replace the topology-invariant assert with RuntimeError so it survives python -O.
- Add multi_gpu pytest marker plus cuda:1 unit-test coverage for both Fabric write paths, and run them in the existing test-multi-gpu CI job (one extra step, no new job).
Force-push: a6cd73e to 2c619fe
The standard pytest invocation in CI runs the fabric test file without filtering on the ``multi_gpu`` marker, so the ``cuda:1`` tests get scheduled on every runner including the single-GPU ones. Previously ``_skip_if_unavailable`` hard-failed via ``pytest.fail`` whenever ``GITHUB_ACTIONS=true`` and the requested device was missing, on the theory that this would catch a misconfigured multi-GPU runner. In practice it just broke the standard CI: the dedicated ``test-fabric-multi-gpu`` workflow already pre-flights ``torch.cuda.device_count() >= 2`` before invoking pytest, so a genuinely misconfigured multi-GPU runner is already caught there. Always skip rather than fail when the requested ``cuda:N`` index isn't available. Drop the now-unused ``import os``.
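The always-skip policy described in this commit can be sketched as a small pure helper. A minimal sketch, assuming a helper of this shape exists; the function name, signature, and message text are illustrative, not the PR's actual code:

```python
import re

def skip_reason(device: str, cuda_available: bool, device_count: int):
    """Return a pytest-style skip reason, or None if the test may run.

    Sketch of the always-skip policy: a missing ``cuda:N`` index always
    skips, never fails, regardless of whether we are running in CI.
    """
    if not device.startswith("cuda"):
        return None  # CPU-parameterized tests always run
    if not cuda_available:
        return "CUDA not available"
    match = re.match(r"cuda:(\d+)$", device)
    index = int(match.group(1)) if match else 0
    if index >= device_count:
        return f"{device} not available ({device_count} CUDA device(s) present)"
    return None
```

In the real test file the returned reason would be passed to `pytest.skip`; a genuinely misconfigured multi-GPU runner is still caught by the dedicated workflow's own pre-flight, as the commit message notes.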
Kit's CLI parser reads sys.argv directly at startup and segfaults on
pytest flags that collide with its own short options. Running
pytest -m multi_gpu source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py
crashes during collection because Kit sees ``-m multi_gpu`` and exits
with ``Ill formed parameter: -m`` followed by SIGSEGV (exit code 245)
inside ``simulation_app._start_app``.
Strip sys.argv to argv[0] before instantiating AppLauncher. The test
file takes no CLI arguments of its own, mirroring the broader pattern
used by ``test_tiled_camera_env.py`` which assigns
``sys.argv[1:] = args_cli.unittest_args`` after argparse.
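The stripping itself is plain Python. A minimal sketch of the pattern (the helper name is hypothetical; only the `sys.argv` handling is shown, since `AppLauncher` is an Isaac Lab class):

```python
import sys

def strip_cli_args(argv):
    """Split argv into (kept, stripped): keep only argv[0] so a parser
    that reads sys.argv at startup never sees pytest flags such as
    ``-m multi_gpu``. Helper name is illustrative, not from the PR."""
    return argv[:1], argv[1:]

# In the test file this must run before AppLauncher is instantiated:
sys.argv, _stripped = strip_cli_args(sys.argv)
```

If the test file later needed its own arguments, the stripped list could be re-parsed separately, mirroring the `sys.argv[1:] = args_cli.unittest_args` pattern mentioned above.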
wp.to_torch on a ProxyArray is deprecated in favor of the .torch accessor. Switch the three call sites that consume the ProxyArray returned by get_world_poses; leave get_scales call sites alone since that method still returns a raw wp.array (no .torch accessor).
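The shape of that migration (a free-function conversion replaced by an accessor property) can be mocked without Warp or torch. `ProxyArray`'s real `.torch` accessor is taken from the note above; everything else here is a stand-in:

```python
class FakeProxyArray:
    """Hypothetical stand-in for the ProxyArray described above."""

    def __init__(self, data):
        self._data = list(data)

    @property
    def torch(self):
        # The real accessor returns a torch tensor view of the array;
        # this stand-in returns the raw data to keep the sketch runnable.
        return self._data

poses = FakeProxyArray([1.0, 2.0, 3.0])
# before (deprecated): wp.to_torch(poses)
# after:
tensor = poses.torch
```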
Review Summary
Clean, well-motivated PR that removes an artificial device restriction. The refactoring of set_world_poses and set_scales into a single _compose_fabric_transform helper is a nice improvement — it ensures PrepareForReuse is only called once per logical update regardless of which components are being set.
Strengths
- Single kernel launch for combined updates — `_sync_fabric_from_usd_once` now does pos+orient+scale in one pass instead of two separate calls
- assert → RuntimeError — topology guard won't be optimized away with `python -O`
- Thorough multi-GPU test coverage with proper skip/fail semantics depending on CI vs local
- CI workflow integration — dedicated `test-fabric-multi-gpu` job triggered by relevant file changes
- sys.argv stripping — prevents Kit segfault from pytest flags
Minor Notes
See inline comments.
```python
dummy3 = wp.zeros((0, 3), dtype=wp.float32, device=self._device)
dummy4 = wp.zeros((0, 4), dtype=wp.float32, device=self._device)
scales_wp = _to_float32_2d(scales)
empty3 = wp.zeros((0, 3), dtype=wp.float32, device=self._device)
```
💡 Good consolidation: Merging the old set_world_poses Fabric path and set_scales Fabric path into _compose_fabric_transform is a clean refactor. The "empty array means skip" convention with the kernel is elegant.
One subtle point: _sync_fabric_from_usd_once now also calls _compose_fabric_transform (which calls _prepare_for_reuse). Previously it called set_world_poses + set_scales which would invoke _prepare_for_reuse twice. The consolidation to one call is strictly better — worth noting in the changelog (which you did ✓).
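The "empty array means skip" convention can be illustrated with a dependency-free sketch. In the PR, `_compose_fabric_transform` applies this inside a single Warp kernel launch; the function and field names below are hypothetical:

```python
def compose_components(current, positions, orientations, scales):
    """Update only the components passed with nonzero length; an empty
    array means 'leave this component unchanged'. Sketch only -- the
    real code composes worldMatrix entries in a Warp kernel with one
    PrepareForReuse per logical update."""
    out = dict(current)
    if positions:
        out["pos"] = positions
    if orientations:
        out["quat"] = orientations
    if scales:
        out["scale"] = scales
    return out

state = {
    "pos": [[0.0, 0.0, 0.0]],
    "quat": [[1.0, 0.0, 0.0, 0.0]],
    "scale": [[1.0, 1.0, 1.0]],
}
# A scales-only update: empty position/orientation arrays are skipped.
state = compose_components(state, [], [], [[2.0, 2.0, 2.0]])
```

A full sync simply passes all three components, which is why the consolidation collapses it to one kernel launch.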
```python
if not device.startswith("cuda"):
    return
if not torch.cuda.is_available():
    pytest.skip("CUDA not available")
```
🟡 Skip-vs-fail comment conflicts with implementation: The code comment says "Always skip rather than fail" and explains why. But the PR description says:
> In CI (`GITHUB_ACTIONS=true`), a missing `cuda:1` becomes `pytest.fail`
The actual code here always skips. Was the CI-fail behavior removed during development? The PR description should be updated to match the final implementation if it changed.
@@ -17,6 +17,12 @@
💡 sys.argv stripping: This is a pragmatic workaround for Kit's argv parsing. Consider adding a note that this must remain before AppLauncher — if someone reorders imports or adds a CLI parser above, the protection breaks.
Alternatively, AppLauncher itself could strip unknown flags before passing to Kit (upstream fix), but that's outside this PR's scope.
Description
Removes the `cuda:0`-only restriction in `FabricFrameView`. USDRT `SelectPrims` now accepts any CUDA device index, so Fabric acceleration runs on the simulation device (e.g., `cuda:1`) instead of silently falling back to the slower USD path. This unblocks distributed training where each process is pinned to a specific GPU.

The change itself is small (drops the device guard in `__init__`, the assertion in `_initialize_fabric`, and the `_fabric_supported_devices` allowlist). The bulk of the diff is multi-GPU test coverage — three `cuda:1`-parameterized tests guarded by a new `multi_gpu` pytest marker, plus a dedicated CI job on the multi-GPU runner so regressions show up on PRs that touch `FabricFrameView`.

The skip-vs-fail logic in `_skip_if_unavailable` is intentional: locally, `cuda:1` tests `pytest.skip` with a warning so local runs stay green; in CI (`GITHUB_ACTIONS=true`), a missing `cuda:1` becomes `pytest.fail` so a misconfigured runner is caught immediately rather than silently green-lighting every PR.

Stacked on: #5380. Merge that one first; this PR contains only the multi-GPU enablement on top of it.
Type of change
`cuda:0` continues to work exactly as before; `cuda:1`+ now also works instead of silently falling back to USD. No public API surface changed.

Checklist
- I have run the `pre-commit` checks with `./isaaclab.sh --format`
- I have updated the changelog and the corresponding version in the extension's `config/extension.toml` file
- I have added my name to `CONTRIBUTORS.md` or my name already exists there
Three new tests, all marked `@pytest.mark.multi_gpu` and parameterized with `["cuda:1"]`:

- `test_fabric_cuda1_world_pose_roundtrip` — `set_world_poses` → `get_world_poses` returns the same values on a non-primary CUDA device.
- `test_fabric_cuda1_no_usd_writeback` — Fabric writes on `cuda:1` do not write back to USD (`atol=0.0` — equality, not approximate).
- `test_fabric_cuda1_scales_roundtrip` — covers the `set_scales` write path on `cuda:1`, since both Fabric write paths now run on `self._device`.

A new CI job, `test-fabric-multi-gpu`, runs in `.github/workflows/test-multi-gpu.yaml` on the existing `[self-hosted, linux, x64, gpu, multi-gpu]` runner. The job pre-flights with `./isaaclab.sh -p -c "import torch; print(torch.cuda.device_count())"` and fails loudly with `::error::` if the runner regresses to a single GPU. Triggered automatically on PRs that touch `source/isaaclab_physx/isaaclab_physx/sim/views/fabric_frame_view.py` or its test file.

To verify locally on a multi-GPU machine:

```shell
./isaaclab.sh -p -m pytest -m multi_gpu \
  source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py -v
```

To verify the cuda:0 path is unchanged:

```shell
./isaaclab.sh -p -m pytest -m "not multi_gpu" \
  source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py -v
```
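The pre-flight step in the CI job can be sketched in shell. In the real workflow the count comes from `torch.cuda.device_count()`; here it is passed in as a parameter so the logic is exercisable without a GPU, and the function name is hypothetical:

```shell
#!/bin/sh
# Emit a GitHub Actions ::error:: annotation and fail if the runner
# reports fewer than 2 GPUs; the count is a parameter in this sketch.
check_gpu_count() {
  count="$1"
  if [ "$count" -lt 2 ]; then
    echo "::error::multi-gpu runner reports only $count GPU(s)"
    return 1
  fi
  echo "pre-flight ok: $count GPUs"
}

check_gpu_count "${GPU_COUNT:-2}"
```

A failing pre-flight stops the job before pytest runs, which is why the test file itself can always skip rather than fail on a missing device.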