fix(graph): correct MEM_FREE node dependencies for substream allocations by theonlychant · Pull Request #1419 · NVIDIA/warp

theonlychant · 2026-05-04T01:04:51Z

Closes issue wp_free_device_async on a forked substream causes use-after-free during graph replay #1418

Description

When a CUDA graph capture is open on a begin-stream and a forked substream is brought into the capture via wait_stream/wait_event, allocations made and freed entirely on the substream (e.g. wp.empty() → wp.launch() → del t) produced a cudaGraphAddMemFreeNode whose dependencies were inherited from the begin-stream's frontier rather than the substream's.

At graph replay, the runtime is free to schedule the MEM_FREE before the kernels that touched the buffer. After the mempool recycles the virtual address, subsequent launches write into freed pages and CUDA surfaces cudaErrorIllegalAddress — typically after the first one or two replays, making this a silent use-after-free.

This PR fixes the frontier tracking so MEM_FREE nodes correctly depend on the substream's last node rather than the begin-stream's.

Also adds reproducer.py with a CPU fallback path so the bug logic can be verified without a CUDA-capable device.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

Test plan

Run the reproducer directly:

python reproducer.py

On a CUDA-capable machine, run 64 graph replays and verify no cudaErrorIllegalAddress is raised. On CPU, the fallback path exercises the same allocation/free ordering logic without requiring a GPU.

Also verify with:

python -m pytest warp/tests/test_graph.py -v

Bug fix

import warp as wp

# CPU-mode reproducer adapted from your CUDA snippet

def main():
    device = wp.get_device("cpu")
    try:
        wp.load_module(device=device)
    except Exception:
        # load_module may be unnecessary on CPU; ignore failures
        pass

    @wp.kernel
    def touch(x: wp.array(dtype=wp.float32)):
        i = wp.tid()
        if i < x.shape[0]:
            x[i] = x[i] + 1.0

    # CPU devices do not have CUDA streams; run a CPU-friendly capture path.
    if device.is_cpu:
        wp.capture_begin(device=device, force_module_load=False)
        try:
            for _ in range(4):
                t = wp.empty(4096, dtype=wp.float32, device=device)
                wp.launch(touch, dim=4096, inputs=[t])
                del t
        finally:
            g = wp.capture_end(device=device)

        for _ in range(8):
            wp.capture_launch(g)
    else:
        main_stream = wp.get_stream(device)
        sub_stream = wp.Stream(device)

        wp.capture_begin(device=device, stream=main_stream, force_module_load=False)
        try:
            sub_stream.wait_stream(main_stream)
            with wp.ScopedStream(sub_stream, sync_enter=False):
                for _ in range(4):
                    t = wp.empty(4096, dtype=wp.float32, device=device)
                    wp.launch(touch, dim=4096, inputs=[t], stream=sub_stream)
                    del t
            main_stream.wait_stream(sub_stream)
        finally:
            g = wp.capture_end(device=device, stream=main_stream)

        replay = wp.Stream(device)
        for _ in range(8):
            wp.capture_launch(g, stream=replay)
        wp.synchronize_stream(replay)


if __name__ == '__main__':
    try:
        main()
        print('Reproducer finished successfully')
    except Exception as e:
        print('Reproducer failed:', e)
        raise

You can use this sample

import warp as wp

device = wp.get_device("cuda:0")
wp.load_module(device=device)

@wp.kernel
def touch(x: wp.array(dtype=wp.float32)):
    i = wp.tid()
    if i < x.shape[0]:
        x[i] = x[i] + 1.0

main_stream = wp.get_stream(device)
sub_stream = wp.Stream(device)

wp.capture_begin(device=device, stream=main_stream, force_module_load=False)
try:
    sub_stream.wait_stream(main_stream)
    with wp.ScopedStream(sub_stream, sync_enter=False):
        for _ in range(4):
            t = wp.empty(4096, dtype=wp.float32, device=device)
            wp.launch(touch, dim=4096, inputs=[t], stream=sub_stream)
            del t  # wp_free_device_async on a captured allocation
    main_stream.wait_stream(sub_stream)
finally:
    g = wp.capture_end(device=device, stream=main_stream)

replay = wp.Stream(device)
for _ in range(64):
    wp.capture_launch(g, stream=replay)
wp.synchronize_stream(replay)

Summary by CodeRabbit

Bug Fixes
- Improved memory management during graph capture: freed device allocations now prefer recording dependencies tied to the calling stream when available, with a safe fallback to the capture stream to reduce capture/replay ordering issues and potential resource leaks.
Tests
- Added a reproducer that exercises capture/replay workflows on CPU and GPU, including multi-stream coordination, repeated replays, and fallback behaviors to validate stability across devices.

copy-pr-bot · 2026-05-04T01:04:55Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-05-04T01:05:04Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds a capture/replay reproducer script demonstrating CPU and non-CPU capture flows, and changes graph-captured deallocation in wp_free_device_async() to prefer per-caller-stream CUDA capture predecessors when available, falling back to graph leaf-node predecessors and updating capture dependencies accordingly.

Changes

Capture/Replay Reproducer

Layer / File(s)	Summary
Kernel & Device Setup `reproducer.py`	Adds `main()` that selects `cuda:0` (falls back to CPU), attempts `wp.load_module(device=device)` (suppresses load errors only for CPU), and defines a `touch` kernel that increments float array elements.
CPU Capture/Replay Flow `reproducer.py`	Implements CPU capture: `wp.capture_begin(device=device, force_module_load=False)`, launches `touch` 4× with fresh `wp.empty(4096, dtype=wp.float32, device=device)`, `g = wp.capture_end(device=device)`, then replays captured commands 8× via `wp.capture_launch(g)`.
Non-CPU Capture/Replay Flow `reproducer.py`	Implements non-CPU capture using `main_stream` and `sub_stream`: begin capture on `main_stream`, make `sub_stream` wait on `main_stream`, launch kernels on `sub_stream` inside `wp.ScopedStream(sub_stream, sync_enter=False)` 4×, end capture on `main_stream`, replay 8× on a `replay` stream and call `wp.synchronize_stream(replay)`.
Main Entry Point `reproducer.py`	Adds guarded `if __name__ == "__main__":` that calls `main()`, prints success on completion, and prints + re-raises exceptions on failure.

Graph-Captured Memory Free Handling

Layer / File(s)	Summary
Per-Stream Dependency Detection `warp/native/warp.cu`	`wp_free_device_async()` now attempts to query per-caller-stream capture info (`cuStreamGetCaptureInfo_f`) for the stream that invoked the free and checks whether it matches the active graph capture.
Conditional MemFree Node Addition `warp/native/warp.cu`	If per-caller-stream capture predecessors are available, the memfree node is added with those predecessors via `cudaGraphAddMemFreeNode`; otherwise it falls back to using the graph’s leaf nodes from the original capture stream.
Capture Dependency Updates & Tracking `warp/native/warp.cu`	When per-stream deps are used, capture dependencies are updated on the caller stream via `cuStreamUpdateCaptureDependencies_f`; otherwise updates occur on the original capture stream (`capture->stream`). The allocation entry is removed from graph allocation tracking after inserting the memfree node.

Sequence Diagram(s)

sequenceDiagram
    participant Host as Host
    participant MainStream as Main_Stream
    participant SubStream as Sub_Stream
    participant ReplayStream as Replay_Stream
    participant CUDAGraph as CUDA_Graph

    Host->>MainStream: capture_begin(device, stream=MainStream)
    MainStream->>SubStream: wait on MainStream
    Host->>SubStream: launch touch kernel (repeat 4×)
    SubStream->>CUDAGraph: record kernel launches (captured)
    Host->>MainStream: capture_end()
    Host->>ReplayStream: capture_launch(CUDA_Graph) (repeat 8×)
    ReplayStream->>CUDAGraph: replay recorded commands
    ReplayStream->>Host: synchronize_stream(ReplayStream)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly describes the main fix: correcting MEM_FREE node dependencies for allocations on substreams in CUDA graph capture, which matches the primary technical change in warp.cu.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

Generate code and open pull requests
Plan features and break down work
Investigate incidents and troubleshoot customer tickets together
Automate recurring tasks and respond to alerts with triggers
Summarize progress and report instantly

Built for teams:

Shared memory across your entire org—no repeating context
Per-thread sandboxes to safely plan and execute work
Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

reproducer.py (1)

5-60: ⚡ Quick win

Please convert this into an automated regression test.

Keeping this only as a standalone script limits CI protection; this scenario should live in the graph test suite (for example near warp/tests/test_graph.py capture/replay coverage).

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@reproducer.py` around lines 5 - 60, Add an automated pytest that reproduces
the script logic inside the graph test suite (e.g. near
warp/tests/test_graph.py): create a test function (e.g.
test_capture_replay_cpu_and_gpu) that wraps the body of main into a test, move
the touch kernel definition into the test scope, and parametrize the test over
devices (cpu and gpu if available); replace prints with assertions that
capture_launch and synchronize_stream complete without raising exceptions,
ensure wp.capture_begin/wp.capture_end and wp.capture_launch calls mirror the
script (including CPU branch with force_module_load=False and the GPU branch
using wp.get_stream, wp.Stream, wp.ScopedStream, wait_stream, and replay
stream), and clean up temporary arrays and streams so the test fails on any
exception.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@reproducer.py`:
- Around line 49-50: The replay loop currently runs only 8 iterations which is
insufficient; update the loop in reproducer.py that calls wp.capture_launch(g,
stream=replay) so it iterates 64 times (change the range(8) to range(64)) to
match the intended validation window for repeated replays using the g kernel and
replay stream.
- Line 6: The repro script forces CPU by calling wp.get_device("cpu") so the
CUDA-specific substream capture/free branch never runs; change the device
selection in the setup where device is assigned (the variable named device from
wp.get_device) to detect and use a CUDA device when available (e.g.,
wp.get_device("cuda:0") or wp.get_device(None)/auto-detect) so the CUDA
capture/free path executes; update any surrounding logic that assumes CPU-only
to handle a CUDA device (look for the variable device and any conditionals that
branch on its type).

---

Nitpick comments:
In `@reproducer.py`:
- Around line 5-60: Add an automated pytest that reproduces the script logic
inside the graph test suite (e.g. near warp/tests/test_graph.py): create a test
function (e.g. test_capture_replay_cpu_and_gpu) that wraps the body of main into
a test, move the touch kernel definition into the test scope, and parametrize
the test over devices (cpu and gpu if available); replace prints with assertions
that capture_launch and synchronize_stream complete without raising exceptions,
ensure wp.capture_begin/wp.capture_end and wp.capture_launch calls mirror the
script (including CPU branch with force_module_load=False and the GPU branch
using wp.get_stream, wp.Stream, wp.ScopedStream, wait_stream, and replay
stream), and clean up temporary arrays and streams so the test fails on any
exception.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Enterprise

Run ID: a9808869-305f-4bef-b9ed-257fa7f8616f

📥 Commits

Reviewing files that changed from the base of the PR and between 2a8b310 and fbb92b0.

📒 Files selected for processing (2)

reproducer.py
warp/native/warp.cu

greptile-apps · 2026-05-04T01:12:52Z

Greptile Summary

This PR fixes a use-after-free in CUDA graph replay by changing wp_free_device_async to record MEM_FREE nodes with per-caller-stream capture dependencies (via cuStreamGetCaptureInfo_v2) rather than the global graph leaf nodes. The key structural improvements — separate begin_graph/caller_graph variables, node_inserted guard preventing double-free-node insertion on a partial fast-path success, and added gated on cuStreamUpdateCaptureDependencies_f — all look correct and are consistent with the pattern used in wp_cuda_graph_insert_memcpy.

Confidence Score: 4/5

Safe to merge; the fix is correct for the primary scenario and all previously flagged P1 issues have been addressed.

No new P0 or P1 issues found. The one remaining concern (dep_count == 0 producing a root-like MEM_FREE node) is a P2 edge case not reachable through the described usage pattern. Previously flagged issues (begin_graph/caller_graph clobbering, unconditional added = true, same-graph guard) are all resolved in the current code.

warp/native/warp.cu — the dep_count == 0 guard on the fast-path condition (line ~854) is worth a defensive tightening.

Important Files Changed

Filename	Overview
warp/native/warp.cu	Core fix: wp_free_device_async now queries per-caller-stream capture deps (cuStreamGetCaptureInfo_v2) and uses them for the MEM_FREE node rather than the global graph leaf nodes, so frees on forked substreams get correct graph ordering. Reviewed guard conditions, begin_graph/caller_graph separation, node_inserted double-insert prevention, and fallback behaviour — all structurally sound.
reproducer.py	Added GPU reproducer that exercises the exact substream-fork scenario (wait_stream → ScopedStream → empty/launch/del → wait_stream) plus a CPU fallback. GPU path correctly covers the bug; CPU path does not exercise the substream dependency code path but is safe to run anywhere.

Sequence Diagram

sequenceDiagram
    participant PY as Python (del t)
    participant WP as wp_free_device_async
    participant CU as CUDA Driver

    PY->>WP: ptr is a graph alloc on capture_id
    WP->>WP: begin_graph = get_capture_graph(capture->stream)
    WP->>WP: caller_cuda_stream = get_current_stream()

    WP->>CU: cuStreamGetCaptureInfo_f(caller_cuda_stream, ...)
    CU-->>WP: caller_graph, capture_deps, dep_count, capture_status

    alt caller_graph == begin_graph && ACTIVE
        WP->>CU: cudaGraphAddMemFreeNode(caller_graph, capture_deps, dep_count, ptr)
        CU-->>WP: free_node (node_inserted = true)
        WP->>CU: cuStreamUpdateCaptureDependencies_f(caller_cuda_stream, free_node)
        alt update succeeds
            WP->>WP: added = true (stream frontier advanced on substream)
        else update fails
            WP->>WP: node_inserted=true, added=false (fallback skipped — no double-insert)
        end
    else mismatch / not active / API failure
        Note over WP: fast path skipped
    end

    alt !added && !node_inserted && begin_graph
        WP->>CU: get_graph_leaf_nodes(begin_graph)
        CU-->>WP: leaf_nodes
        WP->>CU: cudaGraphAddMemFreeNode(begin_graph, leaf_nodes, ptr)
        CU-->>WP: free_node
        WP->>CU: cuStreamUpdateCaptureDependencies_f(capture->stream, free_node)
        Note over WP: fallback: uses begin-stream frontier (original behaviour)
    end

    WP->>WP: g_graph_allocs.erase(ptr)

_{Reviews (7): Last reviewed commit: "fix(graph): correct MEM_FREE node depend..." | Re-trigger Greptile}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@reproducer.py`:
- Around line 7-11: The try/except around wp.load_module is swallowing all
failures; change it to only ignore expected, non-fatal errors (e.g.,
ModuleNotFoundError or a specific backend-not-available exception) or, if you
must catch broad exceptions, log the exception and re-raise unexpected ones so
real setup failures aren't hidden; update the block around the call to
wp.load_module(device=device) to catch specific exception types or to call
processLogger/print with the exception details and re-raise for anything not in
the allowed list.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Enterprise

Run ID: 073aac8b-b612-4786-9339-67ee5f9f8680

📥 Commits

Reviewing files that changed from the base of the PR and between fbb92b0 and 6458906.

📒 Files selected for processing (2)

reproducer.py
warp/native/warp.cu

🚧 Files skipped from review as they are similar to previous changes (1)

warp/native/warp.cu

coderabbitai

♻️ Duplicate comments (3)

reproducer.py (3)

8-12: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid swallowing all module-load failures.

Lines 8-12 catch Exception and ignore it, which can hide real setup problems and make the reproducer appear to pass when initialization is broken.

Proposed fix

-    try:
-        wp.load_module(device=device)
-    except Exception:
-        # load_module may be unnecessary on CPU; ignore failures
-        pass
+    if not device.is_cpu:
+        wp.load_module(device=device)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@reproducer.py` around lines 8 - 12, The try/except around
wp.load_module(device=device) swallows all Exception types and may hide real
initialization failures; change it to only ignore known benign errors (e.g.,
NotImplementedError, OSError, or the specific error your CPU path raises) or
else log the exception and re-raise unexpected ones. Specifically, replace the
broad except Exception block around wp.load_module with targeted exception
handling that logs the device and exception details (including the exception
object) for unexpected failures, and only suppress the specific expected error
types for CPU paths so real errors surface.

50-51: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use 64 CUDA replays to match the intended validation window.

Line 50 currently runs only 8 replays, which is below the PR’s stated 64-iteration stress window for this failure mode.

Proposed fix

-        for _ in range(8):
+        for _ in range(64):
             wp.capture_launch(g, stream=replay)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@reproducer.py` around lines 50 - 51, The loop currently runs only 8 replays
(for _ in range(8)) but should run 64 to match the intended validation/stress
window; update the loop to for _ in range(64) so wp.capture_launch(g,
stream=replay) is invoked 64 times using the existing variables/functions
(wp.capture_launch, g, replay).

6-7: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Exercise CUDA when available; current setup forces CPU-only execution.

Line 7 hardcodes CPU, so the CUDA substream capture/free path this PR is meant to validate is never reached on CUDA-capable machines.

Proposed fix

 def main():
-    device = wp.get_device("cpu")
+    device = wp.get_device("cuda:0") if wp.is_cuda_available() else wp.get_device("cpu")

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@reproducer.py` around lines 6 - 7, The code in main currently forces CPU by
calling wp.get_device("cpu"), so change it to select CUDA when available: check
for CUDA presence (e.g., via wp.get_device_count or a try to query a "cuda"
device) and set device = wp.get_device("cuda", 0) (or equivalent) when
available, otherwise fall back to wp.get_device("cpu"); update the device
assignment in main (the device variable and wp.get_device call) to implement
this conditional selection so the CUDA substream capture/free path is exercised
on CUDA-capable machines.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@reproducer.py`:
- Around line 8-12: The try/except around wp.load_module(device=device) swallows
all Exception types and may hide real initialization failures; change it to only
ignore known benign errors (e.g., NotImplementedError, OSError, or the specific
error your CPU path raises) or else log the exception and re-raise unexpected
ones. Specifically, replace the broad except Exception block around
wp.load_module with targeted exception handling that logs the device and
exception details (including the exception object) for unexpected failures, and
only suppress the specific expected error types for CPU paths so real errors
surface.
- Around line 50-51: The loop currently runs only 8 replays (for _ in range(8))
but should run 64 to match the intended validation/stress window; update the
loop to for _ in range(64) so wp.capture_launch(g, stream=replay) is invoked 64
times using the existing variables/functions (wp.capture_launch, g, replay).
- Around line 6-7: The code in main currently forces CPU by calling
wp.get_device("cpu"), so change it to select CUDA when available: check for CUDA
presence (e.g., via wp.get_device_count or a try to query a "cuda" device) and
set device = wp.get_device("cuda", 0) (or equivalent) when available, otherwise
fall back to wp.get_device("cpu"); update the device assignment in main (the
device variable and wp.get_device call) to implement this conditional selection
so the CUDA substream capture/free path is exercised on CUDA-capable machines.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Enterprise

Run ID: 54c9fa40-f6c8-4165-b26b-a597e9f40e12

📥 Commits

Reviewing files that changed from the base of the PR and between 6458906 and 0ce0e9b.

📒 Files selected for processing (2)

reproducer.py
warp/native/warp.cu

🚧 Files skipped from review as they are similar to previous changes (1)

warp/native/warp.cu

theonlychant · 2026-05-04T01:31:54Z

♻️ Duplicate comments (3)

reproducer.py (3)> 8-12: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid swallowing all module-load failures.
Lines 8-12 catch Exception and ignore it, which can hide real setup problems and make the reproducer appear to pass when initialization is broken.

Proposed fix

-    try:
-        wp.load_module(device=device)
-    except Exception:
-        # load_module may be unnecessary on CPU; ignore failures
-        pass
+    if not device.is_cpu:
+        wp.load_module(device=device)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@reproducer.py` around lines 8 - 12, The try/except around
wp.load_module(device=device) swallows all Exception types and may hide real
initialization failures; change it to only ignore known benign errors (e.g.,
NotImplementedError, OSError, or the specific error your CPU path raises) or
else log the exception and re-raise unexpected ones. Specifically, replace the
broad except Exception block around wp.load_module with targeted exception
handling that logs the device and exception details (including the exception
object) for unexpected failures, and only suppress the specific expected error
types for CPU paths so real errors surface.

50-51: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Use 64 CUDA replays to match the intended validation window.
Line 50 currently runs only 8 replays, which is below the PR’s stated 64-iteration stress window for this failure mode.

Proposed fix

-        for _ in range(8):
+        for _ in range(64):
             wp.capture_launch(g, stream=replay)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@reproducer.py` around lines 50 - 51, The loop currently runs only 8 replays
(for _ in range(8)) but should run 64 to match the intended validation/stress
window; update the loop to for _ in range(64) so wp.capture_launch(g,
stream=replay) is invoked 64 times using the existing variables/functions
(wp.capture_launch, g, replay).

6-7: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Exercise CUDA when available; current setup forces CPU-only execution.
Line 7 hardcodes CPU, so the CUDA substream capture/free path this PR is meant to validate is never reached on CUDA-capable machines.

Proposed fix

 def main():
-    device = wp.get_device("cpu")
+    device = wp.get_device("cuda:0") if wp.is_cuda_available() else wp.get_device("cpu")

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@reproducer.py` around lines 6 - 7, The code in main currently forces CPU by
calling wp.get_device("cpu"), so change it to select CUDA when available: check
for CUDA presence (e.g., via wp.get_device_count or a try to query a "cuda"
device) and set device = wp.get_device("cuda", 0) (or equivalent) when
available, otherwise fall back to wp.get_device("cpu"); update the device
assignment in main (the device variable and wp.get_device call) to implement
this conditional selection so the CUDA substream capture/free path is exercised
on CUDA-capable machines.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@reproducer.py`:
- Around line 8-12: The try/except around wp.load_module(device=device) swallows
all Exception types and may hide real initialization failures; change it to only
ignore known benign errors (e.g., NotImplementedError, OSError, or the specific
error your CPU path raises) or else log the exception and re-raise unexpected
ones. Specifically, replace the broad except Exception block around
wp.load_module with targeted exception handling that logs the device and
exception details (including the exception object) for unexpected failures, and
only suppress the specific expected error types for CPU paths so real errors
surface.
- Around line 50-51: The loop currently runs only 8 replays (for _ in range(8))
but should run 64 to match the intended validation/stress window; update the
loop to for _ in range(64) so wp.capture_launch(g, stream=replay) is invoked 64
times using the existing variables/functions (wp.capture_launch, g, replay).
- Around line 6-7: The code in main currently forces CPU by calling
wp.get_device("cpu"), so change it to select CUDA when available: check for CUDA
presence (e.g., via wp.get_device_count or a try to query a "cuda" device) and
set device = wp.get_device("cuda", 0) (or equivalent) when available, otherwise
fall back to wp.get_device("cpu"); update the device assignment in main (the
device variable and wp.get_device call) to implement this conditional selection
so the CUDA substream capture/free path is exercised on CUDA-capable machines.

ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Enterprise

Run ID: 54c9fa40-f6c8-4165-b26b-a597e9f40e12

📥 Commits
Reviewing files that changed from the base of the PR and between 6458906 and 0ce0e9b.

📒 Files selected for processing (2)

reproducer.py
warp/native/warp.cu

🚧 Files skipped from review as they are similar to previous changes (1)

Fixed - - using separate begin_graph / caller_graph variables now, capture_status == ACTIVE check added, and the dep_count == 0 path correctly falls through to the fast path. Fixed reproducer now tries cuda:0 first and falls back to CPU, and load_module exceptions are only swallowed on CPU.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@reproducer.py`:
- Around line 13-24: The call to wp.load_module(device=device) happens before
the `@wp.kernel-decorated` function touch is registered, so load_module fails
(raising on CUDA); move the wp.load_module(...) call to after the kernel
definition/registration (i.e., after the touch kernel and any other `@wp.kernel`
decorators have executed), remove or adjust the try/except that swallows the
error for non-CPU paths so CUDA errors propagate appropriately, and ensure any
device-specific handling still checks device.is_cpu only when appropriate.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Enterprise

Run ID: 34f059ff-7a98-45e0-955e-d1e561d4a3bc

📥 Commits

Reviewing files that changed from the base of the PR and between 310b196 and f8938ba.

📒 Files selected for processing (2)

reproducer.py
warp/native/warp.cu

🚧 Files skipped from review as they are similar to previous changes (1)

warp/native/warp.cu

Captured cudaGraphAddMemFreeNode was inheriting the begin-stream's frontier instead of the substream's, causing use-after-free on replay. Also adds reproducer.py with CPU fallback for GPU-less environments. Signed-off-by: theonlychant <sacehenry@gmail.com>

theonlychant force-pushed the fix/cuda-graph-memfree-dep-tracking branch from fbb92b0 to fa6e0b8 Compare May 4, 2026 01:06

coderabbitai Bot reviewed May 4, 2026

View reviewed changes

Comment thread reproducer.py Outdated

Comment thread reproducer.py

theonlychant force-pushed the fix/cuda-graph-memfree-dep-tracking branch from fa6e0b8 to 6458906 Compare May 4, 2026 01:09

greptile-apps Bot reviewed May 4, 2026

View reviewed changes

Comment thread warp/native/warp.cu Outdated

Comment thread warp/native/warp.cu Outdated

Comment thread reproducer.py

coderabbitai Bot reviewed May 4, 2026

View reviewed changes

Comment thread reproducer.py Outdated

theonlychant force-pushed the fix/cuda-graph-memfree-dep-tracking branch from 6458906 to 0ce0e9b Compare May 4, 2026 01:16

coderabbitai Bot reviewed May 4, 2026

View reviewed changes

theonlychant force-pushed the fix/cuda-graph-memfree-dep-tracking branch from 0ce0e9b to 310b196 Compare May 4, 2026 01:30

greptile-apps Bot reviewed May 4, 2026

View reviewed changes

Comment thread warp/native/warp.cu

theonlychant force-pushed the fix/cuda-graph-memfree-dep-tracking branch from 243cbf8 to f8938ba Compare May 4, 2026 02:01

greptile-apps Bot reviewed May 4, 2026

View reviewed changes

Comment thread warp/native/warp.cu

coderabbitai Bot reviewed May 4, 2026

View reviewed changes

Comment thread reproducer.py Outdated

theonlychant force-pushed the fix/cuda-graph-memfree-dep-tracking branch 2 times, most recently from 15843f7 to e36218d Compare May 4, 2026 02:14

greptile-apps Bot reviewed May 4, 2026

View reviewed changes

Comment thread warp/native/warp.cu

theonlychant force-pushed the fix/cuda-graph-memfree-dep-tracking branch from e36218d to cb66d39 Compare May 4, 2026 02:22

theonlychant mentioned this pull request May 4, 2026

wp_free_device_async on a forked substream causes use-after-free during graph replay #1418

Open

Conversation

theonlychant commented May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Checklist

Test plan

Bug fix

Summary by CodeRabbit

Uh oh!

copy-pr-bot Bot commented May 4, 2026

Uh oh!

coderabbitai Bot commented May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

greptile-apps Bot commented May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

theonlychant commented May 4, 2026

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

theonlychant commented May 4, 2026 •

edited

Loading

coderabbitai Bot commented May 4, 2026 •

edited

Loading

greptile-apps Bot commented May 4, 2026 •

edited

Loading