[hipGraph] Pre-allocate graph signal pool at instantiate; live-dispatch reset kernel#6108
anugodavar wants to merge 5 commits into
Introduce an __amd_rocclr_resetGraphSignals kernel that writes resetValue into the amd_signal_t.value fields of a graph signal pool in a single GPU dispatch, avoiding per-launch CPU memset overhead.

- Add ResetGraphSignalPool / CreateGraphSignalPool to Device / roc::Device
- Dispatch the reset kernel via KernelBlitManager::resetGraphSignals with a system-scope AQL fence so stores are visible to host signal-wait hardware
- EnqueueSegmentedGraph: the GPU-reset path enqueues ordering markers on all internal streams upfront; the CPU-reset path enqueues markers only for internal streams that receive level-0 (root) segments, avoiding unnecessary markers when all roots land on launch_stream

Co-authored-by: Cursor <cursoragent@cursor.com>
…al reset
Introduce GraphSignalPool::cont_buffer_: a device-accessible contiguous
array allocated from GPU coarse-grained VRAM (gpuvm_segment_) that holds
the device VA of amd_signal_t.value for every GPU-only signal in the pool.
The buffer is populated once at pool creation / growth time and remains
stable across graph launches, so the per-launch CPU work of
CollectValuePtrs() + memcpy into a transient kernarg slot is eliminated.
New components:
- GraphSignalPool::GrowContBuffer / PatchContBufferEntry
Allocate (or grow) the cont_buffer_ from gpuvm_segment_, call
hsa_amd_agents_allow_access(CPU+GPU) so the CPU can initialise entries
and the GPU reset kernel can read them, then write each signal's value VA.
- __amd_rocclr_resetContSignalBuffer kernel (blitcl.cpp)
Identical body to __amd_rocclr_resetGraphSignals; perf gain comes from
the stable device pointer passed by the caller rather than a per-launch
assembled pointer list.
- KernelBlitManager::resetContSignalBuffer (rocblit.{hpp,cpp})
Dispatches the new kernel by passing deviceVaBuf directly as a void*
(kDirectVa=true), avoiding the stack-address bug that caused error 700.
- Device::ResetGraphSignalPool updated to prefer the contiguous path when
cont_buffer_ != nullptr; falls back to resetGraphSignals on failure.
- GPU_GRAPH_CONT_SIGNAL_RESET flag (flags.hpp, default true) allows
forcing the legacy CollectValuePtrs + resetGraphSignals path at runtime.
Co-authored-by: Cursor <cursoragent@cursor.com>
…h reset kernel

Move GraphSignalPool creation and segment completion-signal allocation out of the per-launch path into hipGraphExec instantiation. Reset-wait barriers for non-stream-0 segments are pre-built into each segment's flat buffer so internal streams cannot run before the per-launch reset kernel retires.

At launch (EnqueueSegmentedGraph), the reset kernel is live-dispatched on the launch stream via Device::ResetGraphSignalPool, and its completion signal is captured via Barriers().GetLastSignal() and attached as an ordering Marker on every internal stream. This prevents premature firing of internal-stream dep-barriers on stale signal values from a prior launch (fixes the paths2 deadlock under graph_bench warmup).

Also: add Device::CreateBarrierPacket / ApplyHwEventPatches helpers and GraphSignalPool::PatchContBufferEntry / AllocateOneSignal to back the new instantiate-time path.

Co-authored-by: Cursor <cursoragent@cursor.com>
… patching

Replace the shared reset_signal_ barrier approach (which races between CPU re-arm and GPU decrements across concurrent launches) with per-launch dep-signal patching:

- At instantiate time, prepend a barrier-and packet (dep_signal[0]=0) to the first batch of every non-stream-0 segment and record a pointer to its dep_signal[0].handle in nonstream0_dep_signal_ptrs (SyncPlan).
- At each hipGraphLaunch, dispatch the reset kernel live on launch_stream via ResetGraphSignalPool using the launch device (not the instantiate device), capture the per-launch completion signal from Barriers().GetLastSignal(), and patch that handle into every recorded dep_signal[0].handle pointer before dispatching segment packets.

This eliminates the race because each launch gets a fresh ProfilingSignal from the runtime Barriers pool that fires exactly once when the reset kernel retires, making the fence between the reset and non-stream-0 segment execution both race-free and launch-safe.

Cleanup:
- Remove EnqueueRawBarrierAndPacket (barrier-and approach abandoned)
- Remove live_reset_dispatch_, reset_flat_data_, reset_flat_hdrs_ (unused)
- Remove all [DIAG] fprintf instrumentation from rocblit.cpp, rocdevice.cpp, device.cpp; downgrade per-launch ClPrint traces to LOG_DEBUG
- Fix the || true pool-creation guard; use the launch device in ResetGraphSignalPool
Summary
Moves GraphSignalPool creation, segment completion-signal allocation, and reset-wait barrier baking out of the per-launch hot path into hipGraphExec instantiation. The reset kernel is live-dispatched on the launch stream at each hipGraphLaunch, and an ordering Marker per internal stream waits on its completion signal, so no internal-stream barrier can fire on stale signal values from a prior launch.

Commits (on top of develop)
- __amd_rocclr_resetContSignalBuffer kernel and ResetGraphSignalPool to reset GPU-only signal values from the device.
- cont_buffer_ holding &amd_signal_t.value addresses, plus PatchContBufferEntry / AllocateOneSignal and CreateBarrierPacket / ApplyHwEventPatches helpers backing the new instantiate-time path.
- CaptureAndFormPacketsForGraph creates the GraphSignalPool, acquires reset_signal_, populates sync_plan_.segment_hw_events, pre-bakes reset-wait barriers on the first batch of every non-stream-0 segment, and patches HW event signals into all AQL packets via ApplyHwEventPatches.
- EnqueueSegmentedGraph re-arms reset_signal_, live-dispatches the reset kernel on the launch stream via ResetGraphSignalPool, captures the completion signal via Barriers().GetLastSignal(), and enqueues an ordering Marker on every internal stream so dep-barriers don't fire on stale 0 values from the previous launch.

Why live-dispatch instead of pre-baked reset packet?
A pre-baked reset packet captured at instantiate time picks up hidden kernarg fields (HiddenQueuePtr / HiddenPrivateBase / HiddenSharedBase) from the transfer queue's VirtualGPU context. Dispatching that packet later on the launch stream's queue produced GPU hangs due to the mismatched queue context. Live-dispatch on the launch stream avoids this entirely.

Test plan
- graph_bench --topology straight - smoke test.
- graph_bench --topology full2/full4 - independent chains.
- graph_bench --topology paths2 - cross-stream join (the previous deadlock case).
- graph_bench --sweep - all topologies, including the warmup loop (10 launches w/o sync).
- hipGraph* suite.

Notes
Rebased onto develop (which now contains PR #4778, "clr: move stream assignment from graph launch to instantiate", i.e. "stream assignment at instantiate"). The new 4-commit history sits cleanly on top with no merge artifacts.

Made with Cursor