
Revert "[hipGraph] Add graph signal pool and remove pre/post markers for non-… (#5333)"#5951

Merged
litvaOo merged 2 commits into develop from revert-5333-graph-signal-pool
May 12, 2026

Conversation

@anugodavar
Contributor

Summary

Reverts #5333 (f17b05703a) due to two issues observed in GraphBench:

  1. Resource leak. Per-launch GraphSignalPool was deleted via a forward-declared pointer in ~AccumulateCommand, so ~GraphSignalPool never ran and every per-launch HSA signal was leaked (Resource leak detected by SharedSignalPool, N Signals leaked).
  2. Per-launch host overhead. Shifting from the runtime's rotating signal_list_ (which reuses signals after warmup with WaitCurrent + WaitNext) to a per-launch graph signal pool reintroduces N× signal_create + signal_destroy on every hipGraphLaunch, causing measurable regression in graph_bench across straight, paths2, paths4, full2, full4 topologies.

A safer redesign that preserves PR #5333's per-launch isolation while recycling signals (mirroring the runtime pool's WaitCurrent/WaitNext discipline) is needed before re-landing.

Test plan

- [ ] graph_bench runs to completion across all five topologies with no Signals leaked warning.
- [ ] graph_bench per-launch numbers match the pre-PR-#5333 baseline.
- [ ] Existing hipGraph unit tests still pass.

Made with Cursor

…for non-… (#5333)"

This reverts commit f17b057.

Reverting due to two issues observed in GraphBench:

1. Resource leak: per-launch GraphSignalPool was deleted via a forward-
   declared pointer in ~AccumulateCommand, so ~GraphSignalPool never ran
   and every per-launch HSA signal was leaked
   ("Resource leak detected by SharedSignalPool, N Signals leaked").

2. Per-launch host overhead: shifting from the runtime's rotating
   signal_list_ (which reuses signals after warmup with WaitCurrent +
   WaitNext) to a per-launch graph signal pool reintroduces N×
   signal_create + signal_destroy on every hipGraphLaunch, causing
   measurable regression in graph_bench across straight/paths2/paths4/
   full2/full4 topologies.

A safer redesign that preserves PR #5333's per-launch isolation while
recycling signals (mirroring the runtime pool's WaitCurrent/WaitNext
discipline) is needed before re-landing.

Co-authored-by: Cursor <cursoragent@cursor.com>
Copilot AI review requested due to automatic review settings May 9, 2026 10:28
@anugodavar anugodavar requested a review from a team as a code owner May 9, 2026 10:28
Contributor

Copilot AI left a comment


Pull request overview

Reverts the changes from #5333 to restore the prior hipGraph signal/HW-event lifecycle and queue-tracking behavior, addressing the reported resource leak and per-launch overhead regressions in GraphBench.

Changes:

  • Removes the per-launch GraphSignalPool plumbing (command ownership, device factory methods, and graph propagation).
  • Restores HW-event lifetime management via AccumulateCommand releasing per-device HW events after graph completion.
  • Reverts HwQueueTracker graph-pool special-casing and reintroduces marker-based synchronization around non-captured nodes in segmented graph launches.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| projects/clr/rocclr/platform/command.hpp | Drops graph signal pool fields/APIs from Command/AccumulateCommand. |
| projects/clr/rocclr/platform/command.cpp | Restores AccumulateCommand destructor to always release per-device HW events. |
| projects/clr/rocclr/device/rocm/rocvirtual.hpp | Reverts GetLastSignal/addSystemScope interface changes tied to graph pool plumbing. |
| projects/clr/rocclr/device/rocm/rocvirtual.cpp | Removes graph-pool fast path and restores runtime signal rotation behavior in tracker paths. |
| projects/clr/rocclr/device/rocm/rocdevice.hpp | Removes GraphSignalPool type and related device virtual APIs. |
| projects/clr/rocclr/device/rocm/rocdevice.cpp | Removes GraphSignalPool implementation and graph pool factory helpers. |
| projects/clr/rocclr/device/device.hpp | Removes graph pool-related forward decls and VirtualDevice/Device graph pool virtuals. |
| projects/clr/hipamd/src/hip_graph_internal.hpp | Removes graph pool propagation helpers and cached signal-count members. |
| projects/clr/hipamd/src/hip_graph_internal.cpp | Reverts segmented graph launch back to CreateHwEvents + marker-based sync around non-captured nodes. |


@litvaOo
Contributor

litvaOo commented May 12, 2026

Overriding as there are no new errors compared to develop; all failing checks reproduce in the develop branch CI.

@litvaOo litvaOo merged commit a23a24f into develop May 12, 2026
51 of 67 checks passed
@litvaOo litvaOo deleted the revert-5333-graph-signal-pool branch May 12, 2026 07:24
systems-assistant Bot pushed a commit to ROCm/clr that referenced this pull request May 12, 2026
Revert "[hipGraph] Add graph signal pool and remove pre/post markers for non-… (#5333)" (#5951)

## Summary

Reverts #5333 (`f17b05703a`) due to two issues observed in `GraphBench`:

1. **Resource leak.** Per-launch `GraphSignalPool` was deleted via a
forward-declared pointer in `~AccumulateCommand`, so `~GraphSignalPool`
never ran and every per-launch HSA signal was leaked (`Resource leak
detected by SharedSignalPool, N Signals leaked`).
2. **Per-launch host overhead.** Shifting from the runtime's rotating
`signal_list_` (which reuses signals after warmup with `WaitCurrent` +
`WaitNext`) to a per-launch graph signal pool reintroduces N×
`signal_create` + `signal_destroy` on every `hipGraphLaunch`, causing
measurable regression in `graph_bench` across `straight`, `paths2`,
`paths4`, `full2`, `full4` topologies.

A safer redesign that preserves PR #5333's per-launch isolation while
recycling signals (mirroring the runtime pool's `WaitCurrent`/`WaitNext`
discipline) is needed before re-landing.

## Test plan

- [ ] `graph_bench` runs to completion across all five topologies with
no `Signals leaked` warning.
- [ ] `graph_bench` per-launch numbers match pre-PR-#5333 baseline.
- [ ] Existing `hipGraph` unit tests still pass.

Made with [Cursor](https://cursor.com)

Co-authored-by: Cursor <cursoragent@cursor.com>
[rocm-systems] ROCm/rocm-systems#5951 (commit a23a24f)
cmcknigh pushed a commit that referenced this pull request May 12, 2026
…for non-… (#5333)" (#5951)

anugodavar added a commit that referenced this pull request May 14, 2026
… markers for non-AQL-captured nodes\" (#5333)" (#5951)

Re-applies the graph signal pool and persistent-pool infrastructure that was reverted in a23a24f.

Co-authored-by: Cursor <cursoragent@cursor.com>

4 participants