
[hipGraph] Pre-allocate graph signal pool at instantiate; live-dispatch reset kernel #6108

Draft

anugodavar wants to merge 5 commits into develop from graph-signal-pool-reset

Conversation

@anugodavar (Contributor)

Summary

Moves GraphSignalPool creation, segment completion-signal allocation, and reset-wait barrier baking out of the per-launch hot path into hipGraphExec instantiation. The reset kernel is live-dispatched on the launch stream at each hipGraphLaunch, and an ordering Marker per internal stream waits on its completion signal so no internal-stream barrier can fire on stale signal values from a prior launch.

Commits (on top of develop)

  1. Revert "Revert "[hipGraph] Add graph signal pool and remove pre/post markers for non-AQL-captured nodes" (#5333)" (#5951) - restores the graph signal pool infrastructure that was reverted in #5951 (originally added in #5333).
  2. [clr] Add GPU reset kernel for graph signal pool reuse - introduces the __amd_rocclr_resetContSignalBuffer kernel and ResetGraphSignalPool to reset GPU-only signal values from the device.
  3. [clr] Add persistent contiguous signal buffer for GPU-side graph signal reset - adds the device-accessible cont_buffer_ holding &amd_signal_t.value addresses, plus PatchContBufferEntry / AllocateOneSignal and CreateBarrierPacket / ApplyHwEventPatches helpers backing the new instantiate-time path.
  4. [clr][hipGraph] Pre-allocate signal pool at instantiate; live-dispatch reset kernel - the main change:
    • CaptureAndFormPacketsForGraph creates the GraphSignalPool, acquires reset_signal_, populates sync_plan_.segment_hw_events, pre-bakes reset-wait barriers on the first batch of every non-stream-0 segment, and patches HW event signals into all AQL packets via ApplyHwEventPatches.
    • EnqueueSegmentedGraph re-arms reset_signal_, live-dispatches the reset kernel on the launch stream via ResetGraphSignalPool, captures the completion signal via Barriers().GetLastSignal(), and enqueues an ordering Marker on every internal stream so dep-barriers don't fire on stale 0 values from the previous launch (see the sketch below).
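
To make the instantiate/launch split concrete, here is a toy model in plain C++ (no HIP/HSA). Every type and member name below is an illustrative stand-in, not the actual clr code:

```cpp
// Toy model of the instantiate/launch split: allocate once, reset per launch
// through a pre-baked pointer list. All names here are illustrative.
#include <atomic>
#include <cstdint>
#include <vector>

struct SignalModel {                 // stands in for amd_signal_t
  std::atomic<int64_t> value{0};
};

struct GraphExecModel {
  std::vector<SignalModel> pool;               // per-segment completion signals
  std::vector<std::atomic<int64_t>*> contBuf;  // stable &signal.value list (cont_buffer_)

  // Instantiate-time work (runs once, as in CaptureAndFormPacketsForGraph):
  // allocate the pool and bake the stable value-pointer list.
  void Instantiate(size_t numSegments) {
    pool = std::vector<SignalModel>(numSegments);
    contBuf.clear();
    for (auto& s : pool) contBuf.push_back(&s.value);
  }

  // Launch-time work (every hipGraphLaunch, as in EnqueueSegmentedGraph):
  // re-arm all signals through the pre-baked list -- no allocation, no
  // pointer collection -- then dispatch the recorded segment packets.
  void Launch(int64_t resetValue) {
    for (auto* v : contBuf) v->store(resetValue, std::memory_order_release);
    // ... segment packets that wait on / decrement the pool signals go here ...
  }
};

int main() {
  GraphExecModel exec;
  exec.Instantiate(/*numSegments=*/8);                        // once
  for (int i = 0; i < 3; ++i) exec.Launch(/*resetValue=*/1);  // hot path
}
```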

Why live-dispatch instead of pre-baked reset packet?

A pre-baked reset packet captured at instantiate time picks up hidden kernarg fields (HiddenQueuePtr / HiddenPrivateBase / HiddenSharedBase) from the transfer queue's VirtualGPU context. Dispatching that packet on the launch stream's queue later produced GPU hangs due to mismatched queue context. Live-dispatch on the launch stream avoids this entirely.
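
For reference, a simplified sketch of the hidden-argument layout involved, modeled on the AMDGPU code-object ABI (the field set, types, and order here are assumptions, not the real layout):

```cpp
// A kernarg segment baked at instantiate time against the transfer queue
// embeds that queue's pointers, so replaying the packet on the launch
// stream's queue dereferences the wrong queue context.
#include <cstdint>

struct HiddenKernargsSketch {
  uint64_t hidden_queue_ptr;     // HSA queue the dispatch was assembled against
  uint64_t hidden_private_base;  // private-segment (scratch) aperture base
  uint64_t hidden_shared_base;   // group-segment (LDS) aperture base
};
```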

Test plan

  • graph_bench --topology straight - smoke test.
  • graph_bench --topology full2 / full4 - independent chains.
  • graph_bench --topology paths2 - cross-stream join (the previous deadlock case).
  • graph_bench --sweep - all topologies, including warmup loop (10 launches w/o sync).
  • hip-tests hipGraph* suite.
  • Performance comparison vs. previous per-launch pool allocation path.

Notes

Made with Cursor

anugodavar and others added 4 commits May 14, 2026 11:51
Revert "Revert "[hipGraph] Add graph signal pool and remove pre/post markers for non-AQL-captured nodes" (#5333)" (#5951)

Re-applies the graph signal pool and persistent-pool infrastructure that was reverted in a23a24f.

Co-authored-by: Cursor <cursoragent@cursor.com>
[clr] Add GPU reset kernel for graph signal pool reuse

Introduce __amd_rocclr_resetGraphSignals kernel to write resetValue
into amd_signal_t.value fields of a graph signal pool in a single
GPU dispatch, avoiding per-launch CPU memset overhead.

- Add ResetGraphSignalPool / CreateGraphSignalPool to Device/roc::Device
- Dispatch reset kernel via KernelBlitManager::resetGraphSignals with
  system-scope AQL fence so stores are visible to host signal-wait hardware
- EnqueueSegmentedGraph: GPU-reset path enqueues ordering markers on all
  internal streams upfront; CPU-reset path enqueues markers only for
  internal streams that receive level-0 (root) segments, avoiding
  unnecessary markers when all roots land on launch_stream

Co-authored-by: Cursor <cursoragent@cursor.com>
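
The real kernel is OpenCL C in blitcl.cpp; a HIP C++ analogue of what it does might look like this (the signature and grid mapping are assumptions for illustration):

```cpp
#include <hip/hip_runtime.h>
#include <cstdint>

__global__ void resetGraphSignalsSketch(int64_t* const* valuePtrs,  // one &amd_signal_t.value each
                                        uint32_t count,
                                        int64_t resetValue) {
  uint32_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < count) {
    // Plain per-signal store; visibility to host/hardware signal waiters is
    // provided by the system-scope release fence on the AQL dispatch packet
    // (per the commit message), not by per-store atomics here.
    *valuePtrs[i] = resetValue;
  }
}
```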
[clr] Add persistent contiguous signal buffer for GPU-side graph signal reset

Introduce GraphSignalPool::cont_buffer_: a device-accessible contiguous
array allocated from GPU coarse-grained VRAM (gpuvm_segment_) that holds
the device VA of amd_signal_t.value for every GPU-only signal in the pool.
The buffer is populated once at pool creation / growth time and remains
stable across graph launches, so the per-launch CPU work of
CollectValuePtrs() + memcpy into a transient kernarg slot is eliminated.

New components:
- GraphSignalPool::GrowContBuffer / PatchContBufferEntry
  Allocate (or grow) the cont_buffer_ from gpuvm_segment_, call
  hsa_amd_agents_allow_access(CPU+GPU) so the CPU can initialise entries
  and the GPU reset kernel can read them, then write each signal's value VA.
- __amd_rocclr_resetContSignalBuffer kernel (blitcl.cpp)
  Identical body to __amd_rocclr_resetGraphSignals; perf gain comes from
  the stable device pointer passed by the caller rather than a per-launch
  assembled pointer list.
- KernelBlitManager::resetContSignalBuffer (rocblit.{hpp,cpp})
  Dispatches the new kernel by passing deviceVaBuf directly as a void*
  (kDirectVa=true), avoiding the stack-address bug that caused error 700.
- Device::ResetGraphSignalPool updated to prefer the contiguous path when
  cont_buffer_ != nullptr; falls back to resetGraphSignals on failure.
- GPU_GRAPH_CONT_SIGNAL_RESET flag (flags.hpp, default true) allows
  forcing the legacy CollectValuePtrs + resetGraphSignals path at runtime.

Co-authored-by: Cursor <cursoragent@cursor.com>
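
A hedged host-side sketch of the population step; hsa_amd_agents_allow_access and the amd_signal_t value field are real API surface, while the function name, parameters, and header paths are illustrative:

```cpp
// Write the device VA of each pool signal's value field into a persistent
// coarse-grained VRAM buffer once, at pool creation/growth time.
#include <hsa/hsa.h>
#include <hsa/hsa_ext_amd.h>
#include <hsa/amd_hsa_signal.h>
#include <vector>

bool PopulateContBufferSketch(uint64_t* contBuffer,  // carved from gpuvm_segment_
                              const std::vector<amd_signal_t*>& poolSignals,
                              hsa_agent_t cpuAgent, hsa_agent_t gpuAgent) {
  // Grant access to both agents: the CPU initialises the entries, the GPU
  // reset kernel reads them at every launch.
  hsa_agent_t agents[2] = {cpuAgent, gpuAgent};
  if (hsa_amd_agents_allow_access(2, agents, /*flags=*/nullptr, contBuffer) !=
      HSA_STATUS_SUCCESS) {
    return false;
  }
  // One stable entry per GPU-only signal: the device VA of amd_signal_t.value.
  for (size_t i = 0; i < poolSignals.size(); ++i) {
    contBuffer[i] = reinterpret_cast<uint64_t>(&poolSignals[i]->value);
  }
  return true;
}
```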
[clr][hipGraph] Pre-allocate signal pool at instantiate; live-dispatch reset kernel

Move GraphSignalPool creation and segment completion-signal allocation
out of the per-launch path into hipGraphExec instantiation. Reset-wait
barriers for non-stream-0 segments are pre-built into each segment's
flat buffer so internal streams cannot run before the per-launch reset
kernel retires.

At launch (EnqueueSegmentedGraph), the reset kernel is live-dispatched
on the launch stream via Device::ResetGraphSignalPool, and its completion
signal is captured via Barriers().GetLastSignal() and attached as an
ordering Marker on every internal stream. This prevents premature
firing of internal-stream dep-barriers on stale signal values from a
prior launch (fixes paths2 deadlock under graph_bench warmup).

Also: add Device::CreateBarrierPacket / ApplyHwEventPatches helpers and
GraphSignalPool::PatchContBufferEntry / AllocateOneSignal to back the
new instantiate-time path.

Co-authored-by: Cursor <cursoragent@cursor.com>
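
A public-HIP-API analogue of this launch-time ordering, with hipEventRecord / hipStreamWaitEvent standing in for the internal completion-signal capture and the per-stream Markers (all names and structure illustrative):

```cpp
#include <hip/hip_runtime.h>
#include <vector>

void LaunchWithResetSketch(hipStream_t launchStream,
                           const std::vector<hipStream_t>& internalStreams,
                           hipEvent_t resetDone) {  // hipEventCreate'd by caller
  // 1. Live-dispatch the reset work on the launch stream
  //    (stands in for Device::ResetGraphSignalPool).
  // hipLaunchKernelGGL(resetGraphSignalsSketch, grid, block, 0, launchStream, ...);

  // 2. Capture its completion on the same stream
  //    (stands in for Barriers().GetLastSignal()).
  (void)hipEventRecord(resetDone, launchStream);

  // 3. Block every internal stream on that completion before any of its
  //    dep-barriers can sample signal values (stands in for the per-stream
  //    ordering Marker), so stale values from a prior launch are never seen.
  for (hipStream_t s : internalStreams) {
    (void)hipStreamWaitEvent(s, resetDone, 0);
  }
}
```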
… patching

Replace the shared reset_signal_ barrier approach (which races between
CPU re-arm and GPU decrements across concurrent launches) with per-launch
dep-signal patching:

- At instantiate time, prepend a barrier-and packet (dep_signal[0]=0) to
  the first batch of every non-stream-0 segment and record a pointer to
  its dep_signal[0].handle in nonstream0_dep_signal_ptrs (SyncPlan).
- At each hipGraphLaunch, dispatch the reset kernel live on launch_stream
  via ResetGraphSignalPool using the launch device (not the instantiate
  device), capture the per-launch completion signal from
  Barriers().GetLastSignal(), and patch that handle into every recorded
  dep_signal[0].handle pointer before dispatching segment packets.

This eliminates the race because each launch gets a fresh ProfilingSignal
from the runtime Barriers pool that fires exactly once when the reset
kernel retires, making the fence between the reset and non-stream-0
segment execution both race-free and launch-safe.

Cleanup:
- Remove EnqueueRawBarrierAndPacket (barrier-and approach abandoned)
- Remove live_reset_dispatch_, reset_flat_data_, reset_flat_hdrs_ (unused)
- Remove all [DIAG] fprintf instrumentation from rocblit.cpp, rocdevice.cpp,
  device.cpp; downgrade per-launch ClPrint traces to LOG_DEBUG
- Fix || true pool-creation guard; use launch device in ResetGraphSignalPool
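
A hedged sketch of the patching mechanics using standard HSA packet types; nonstream0_dep_signal_ptrs is named in the commit above, the two helpers and their surrounding plumbing are illustrative:

```cpp
#include <hsa/hsa.h>
#include <vector>

// Instantiate time: a barrier-AND packet prepended to the first batch of a
// non-stream-0 segment carries dep_signal[0] = 0; record where that slot
// lives inside the segment's flat packet buffer.
void RecordDepSlotSketch(hsa_barrier_and_packet_t* barrier,
                         std::vector<uint64_t*>& nonstream0_dep_signal_ptrs) {
  barrier->dep_signal[0].handle = 0;
  nonstream0_dep_signal_ptrs.push_back(&barrier->dep_signal[0].handle);
}

// Each hipGraphLaunch: the reset dispatch yields a fresh completion signal
// (a ProfilingSignal from the Barriers pool); patch its handle into every
// recorded slot before the segment packets are made visible to the queues.
void PatchDepSlotsSketch(hsa_signal_t resetDone,
                         const std::vector<uint64_t*>& slots) {
  for (uint64_t* slot : slots) {
    *slot = resetDone.handle;  // barrier-AND now waits on this launch's reset
  }
}
```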