fix(graph): correct MEM_FREE node dependencies for substream allocations#1419
fix(graph): correct MEM_FREE node dependencies for substream allocations#1419theonlychant wants to merge 1 commit into
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a capture/replay reproducer script demonstrating CPU and non-CPU capture flows, and changes graph-captured deallocation in ChangesCapture/Replay Reproducer
Graph-Captured Memory Free Handling
Sequence Diagram(s)sequenceDiagram
participant Host as Host
participant MainStream as Main_Stream
participant SubStream as Sub_Stream
participant ReplayStream as Replay_Stream
participant CUDAGraph as CUDA_Graph
Host->>MainStream: capture_begin(device, stream=MainStream)
MainStream->>SubStream: wait on MainStream
Host->>SubStream: launch touch kernel (repeat 4×)
SubStream->>CUDAGraph: record kernel launches (captured)
Host->>MainStream: capture_end()
Host->>ReplayStream: capture_launch(CUDA_Graph) (repeat 8×)
ReplayStream->>CUDAGraph: replay recorded commands
ReplayStream->>Host: synchronize_stream(ReplayStream)
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
fbb92b0 to
fa6e0b8
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
reproducer.py (1)
5-60: ⚡ Quick winPlease convert this into an automated regression test.
Keeping this only as a standalone script limits CI protection; this scenario should live in the graph test suite (for example near
warp/tests/test_graph.pycapture/replay coverage).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@reproducer.py` around lines 5 - 60, Add an automated pytest that reproduces the script logic inside the graph test suite (e.g. near warp/tests/test_graph.py): create a test function (e.g. test_capture_replay_cpu_and_gpu) that wraps the body of main into a test, move the touch kernel definition into the test scope, and parametrize the test over devices (cpu and gpu if available); replace prints with assertions that capture_launch and synchronize_stream complete without raising exceptions, ensure wp.capture_begin/wp.capture_end and wp.capture_launch calls mirror the script (including CPU branch with force_module_load=False and the GPU branch using wp.get_stream, wp.Stream, wp.ScopedStream, wait_stream, and replay stream), and clean up temporary arrays and streams so the test fails on any exception.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@reproducer.py`:
- Around line 49-50: The replay loop currently runs only 8 iterations which is
insufficient; update the loop in reproducer.py that calls wp.capture_launch(g,
stream=replay) so it iterates 64 times (change the range(8) to range(64)) to
match the intended validation window for repeated replays using the g kernel and
replay stream.
- Line 6: The repro script forces CPU by calling wp.get_device("cpu") so the
CUDA-specific substream capture/free branch never runs; change the device
selection in the setup where device is assigned (the variable named device from
wp.get_device) to detect and use a CUDA device when available (e.g.,
wp.get_device("cuda:0") or wp.get_device(None)/auto-detect) so the CUDA
capture/free path executes; update any surrounding logic that assumes CPU-only
to handle a CUDA device (look for the variable device and any conditionals that
branch on its type).
---
Nitpick comments:
In `@reproducer.py`:
- Around line 5-60: Add an automated pytest that reproduces the script logic
inside the graph test suite (e.g. near warp/tests/test_graph.py): create a test
function (e.g. test_capture_replay_cpu_and_gpu) that wraps the body of main into
a test, move the touch kernel definition into the test scope, and parametrize
the test over devices (cpu and gpu if available); replace prints with assertions
that capture_launch and synchronize_stream complete without raising exceptions,
ensure wp.capture_begin/wp.capture_end and wp.capture_launch calls mirror the
script (including CPU branch with force_module_load=False and the GPU branch
using wp.get_stream, wp.Stream, wp.ScopedStream, wait_stream, and replay
stream), and clean up temporary arrays and streams so the test fails on any
exception.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Enterprise
Run ID: a9808869-305f-4bef-b9ed-257fa7f8616f
📒 Files selected for processing (2)
reproducer.pywarp/native/warp.cu
fa6e0b8 to
6458906
Compare
Greptile SummaryThis PR fixes a use-after-free in CUDA graph replay by changing Confidence Score: 4/5Safe to merge; the fix is correct for the primary scenario and all previously flagged P1 issues have been addressed. No new P0 or P1 issues found. The one remaining concern (dep_count == 0 producing a root-like MEM_FREE node) is a P2 edge case not reachable through the described usage pattern. Previously flagged issues (begin_graph/caller_graph clobbering, unconditional added = true, same-graph guard) are all resolved in the current code. warp/native/warp.cu — the dep_count == 0 guard on the fast-path condition (line ~854) is worth a defensive tightening. Important Files Changed
Sequence DiagramsequenceDiagram
participant PY as Python (del t)
participant WP as wp_free_device_async
participant CU as CUDA Driver
PY->>WP: ptr is a graph alloc on capture_id
WP->>WP: begin_graph = get_capture_graph(capture->stream)
WP->>WP: caller_cuda_stream = get_current_stream()
WP->>CU: cuStreamGetCaptureInfo_f(caller_cuda_stream, ...)
CU-->>WP: caller_graph, capture_deps, dep_count, capture_status
alt caller_graph == begin_graph && ACTIVE
WP->>CU: cudaGraphAddMemFreeNode(caller_graph, capture_deps, dep_count, ptr)
CU-->>WP: free_node (node_inserted = true)
WP->>CU: cuStreamUpdateCaptureDependencies_f(caller_cuda_stream, free_node)
alt update succeeds
WP->>WP: added = true (stream frontier advanced on substream)
else update fails
WP->>WP: node_inserted=true, added=false (fallback skipped — no double-insert)
end
else mismatch / not active / API failure
Note over WP: fast path skipped
end
alt !added && !node_inserted && begin_graph
WP->>CU: get_graph_leaf_nodes(begin_graph)
CU-->>WP: leaf_nodes
WP->>CU: cudaGraphAddMemFreeNode(begin_graph, leaf_nodes, ptr)
CU-->>WP: free_node
WP->>CU: cuStreamUpdateCaptureDependencies_f(capture->stream, free_node)
Note over WP: fallback: uses begin-stream frontier (original behaviour)
end
WP->>WP: g_graph_allocs.erase(ptr)
Reviews (7): Last reviewed commit: "fix(graph): correct MEM_FREE node depend..." | Re-trigger Greptile |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@reproducer.py`:
- Around line 7-11: The try/except around wp.load_module is swallowing all
failures; change it to only ignore expected, non-fatal errors (e.g.,
ModuleNotFoundError or a specific backend-not-available exception) or, if you
must catch broad exceptions, log the exception and re-raise unexpected ones so
real setup failures aren't hidden; update the block around the call to
wp.load_module(device=device) to catch specific exception types or to call
processLogger/print with the exception details and re-raise for anything not in
the allowed list.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Enterprise
Run ID: 073aac8b-b612-4786-9339-67ee5f9f8680
📒 Files selected for processing (2)
reproducer.pywarp/native/warp.cu
🚧 Files skipped from review as they are similar to previous changes (1)
- warp/native/warp.cu
6458906 to
0ce0e9b
Compare
There was a problem hiding this comment.
♻️ Duplicate comments (3)
reproducer.py (3)
8-12:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAvoid swallowing all module-load failures.
Lines 8-12 catch
Exceptionand ignore it, which can hide real setup problems and make the reproducer appear to pass when initialization is broken.Proposed fix
- try: - wp.load_module(device=device) - except Exception: - # load_module may be unnecessary on CPU; ignore failures - pass + if not device.is_cpu: + wp.load_module(device=device)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@reproducer.py` around lines 8 - 12, The try/except around wp.load_module(device=device) swallows all Exception types and may hide real initialization failures; change it to only ignore known benign errors (e.g., NotImplementedError, OSError, or the specific error your CPU path raises) or else log the exception and re-raise unexpected ones. Specifically, replace the broad except Exception block around wp.load_module with targeted exception handling that logs the device and exception details (including the exception object) for unexpected failures, and only suppress the specific expected error types for CPU paths so real errors surface.
50-51:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUse 64 CUDA replays to match the intended validation window.
Line 50 currently runs only 8 replays, which is below the PR’s stated 64-iteration stress window for this failure mode.
Proposed fix
- for _ in range(8): + for _ in range(64): wp.capture_launch(g, stream=replay)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@reproducer.py` around lines 50 - 51, The loop currently runs only 8 replays (for _ in range(8)) but should run 64 to match the intended validation/stress window; update the loop to for _ in range(64) so wp.capture_launch(g, stream=replay) is invoked 64 times using the existing variables/functions (wp.capture_launch, g, replay).
6-7:⚠️ Potential issue | 🟠 Major | ⚡ Quick winExercise CUDA when available; current setup forces CPU-only execution.
Line 7 hardcodes CPU, so the CUDA substream capture/free path this PR is meant to validate is never reached on CUDA-capable machines.
Proposed fix
def main(): - device = wp.get_device("cpu") + device = wp.get_device("cuda:0") if wp.is_cuda_available() else wp.get_device("cpu")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@reproducer.py` around lines 6 - 7, The code in main currently forces CPU by calling wp.get_device("cpu"), so change it to select CUDA when available: check for CUDA presence (e.g., via wp.get_device_count or a try to query a "cuda" device) and set device = wp.get_device("cuda", 0) (or equivalent) when available, otherwise fall back to wp.get_device("cpu"); update the device assignment in main (the device variable and wp.get_device call) to implement this conditional selection so the CUDA substream capture/free path is exercised on CUDA-capable machines.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@reproducer.py`:
- Around line 8-12: The try/except around wp.load_module(device=device) swallows
all Exception types and may hide real initialization failures; change it to only
ignore known benign errors (e.g., NotImplementedError, OSError, or the specific
error your CPU path raises) or else log the exception and re-raise unexpected
ones. Specifically, replace the broad except Exception block around
wp.load_module with targeted exception handling that logs the device and
exception details (including the exception object) for unexpected failures, and
only suppress the specific expected error types for CPU paths so real errors
surface.
- Around line 50-51: The loop currently runs only 8 replays (for _ in range(8))
but should run 64 to match the intended validation/stress window; update the
loop to for _ in range(64) so wp.capture_launch(g, stream=replay) is invoked 64
times using the existing variables/functions (wp.capture_launch, g, replay).
- Around line 6-7: The code in main currently forces CPU by calling
wp.get_device("cpu"), so change it to select CUDA when available: check for CUDA
presence (e.g., via wp.get_device_count or a try to query a "cuda" device) and
set device = wp.get_device("cuda", 0) (or equivalent) when available, otherwise
fall back to wp.get_device("cpu"); update the device assignment in main (the
device variable and wp.get_device call) to implement this conditional selection
so the CUDA substream capture/free path is exercised on CUDA-capable machines.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Enterprise
Run ID: 54c9fa40-f6c8-4165-b26b-a597e9f40e12
📒 Files selected for processing (2)
reproducer.pywarp/native/warp.cu
🚧 Files skipped from review as they are similar to previous changes (1)
- warp/native/warp.cu
0ce0e9b to
310b196
Compare
Fixed - - using separate begin_graph / caller_graph variables now, capture_status == ACTIVE check added, and the dep_count == 0 path correctly falls through to the fast path. Fixed reproducer now tries cuda:0 first and falls back to CPU, and load_module exceptions are only swallowed on CPU. |
243cbf8 to
f8938ba
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@reproducer.py`:
- Around line 13-24: The call to wp.load_module(device=device) happens before
the `@wp.kernel-decorated` function touch is registered, so load_module fails
(raising on CUDA); move the wp.load_module(...) call to after the kernel
definition/registration (i.e., after the touch kernel and any other `@wp.kernel`
decorators have executed), remove or adjust the try/except that swallows the
error for non-CPU paths so CUDA errors propagate appropriately, and ensure any
device-specific handling still checks device.is_cpu only when appropriate.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Enterprise
Run ID: 34f059ff-7a98-45e0-955e-d1e561d4a3bc
📒 Files selected for processing (2)
reproducer.pywarp/native/warp.cu
🚧 Files skipped from review as they are similar to previous changes (1)
- warp/native/warp.cu
15843f7 to
e36218d
Compare
Captured cudaGraphAddMemFreeNode was inheriting the begin-stream's frontier instead of the substream's, causing use-after-free on replay. Also adds reproducer.py with CPU fallback for GPU-less environments. Signed-off-by: theonlychant <sacehenry@gmail.com>
e36218d to
cb66d39
Compare
Closes issue wp_free_device_async on a forked substream causes use-after-free during graph replay #1418
Description
When a CUDA graph capture is open on a begin-stream and a forked substream is brought into the capture via
wait_stream/wait_event, allocations made and freed entirely on the substream (e.g.wp.empty()→wp.launch()→del t) produced acudaGraphAddMemFreeNodewhose dependencies were inherited from the begin-stream's frontier rather than the substream's.At graph replay, the runtime is free to schedule the
MEM_FREEbefore the kernels that touched the buffer. After the mempool recycles the virtual address, subsequent launches write into freed pages and CUDA surfacescudaErrorIllegalAddress— typically after the first one or two replays, making this a silent use-after-free.This PR fixes the frontier tracking so
MEM_FREEnodes correctly depend on the substream's last node rather than the begin-stream's.Also adds
reproducer.pywith a CPU fallback path so the bug logic can be verified without a CUDA-capable device.Checklist
Test plan
Run the reproducer directly:
On a CUDA-capable machine, run 64 graph replays and verify no
cudaErrorIllegalAddressis raised. On CPU, the fallback path exercises the same allocation/free ordering logic without requiring a GPU.Also verify with:
Bug fix
You can use this sample
Summary by CodeRabbit
Bug Fixes
Tests