[metal] fix poll(wait_indefinitely()) deadlock on long-running command buffers#9532

Open

ruihe774 wants to merge 1 commit into gfx-rs:trunk from ruihe774:fix-metal-wait

Conversation

@ruihe774 (Contributor) commented May 8, 2026

Connections

Fixes #9531. Same root cause as #8119.

Description

device.poll(PollType::wait_indefinitely()) deadlocked on Metal for command buffers that ran for more than a few hundred milliseconds. Device::wait in wgpu-hal spin-polled MTLCommandBuffer.status() every 1 ms, and for long-running command buffers this loop permanently missed the Completed state, hanging forever when timeout was None.

The fix replaces the spin-poll with MTLSharedEvent::waitUntilSignaledValue:timeoutMS:, which is a proper OS-level blocking wait. The shared_event is already wired up in Queue::submit — it's signaled via encodeSignalEvent_value when the command buffer completes — so Device::wait just needs to call one method on it instead of polling.

In sandboxed environments where MTLSharedEvent is unavailable (shared_event is None), the no-timeout path falls back to MTLCommandBuffer::waitUntilCompleted, and the finite-timeout path keeps the existing spin-poll (which the original issue confirmed works correctly).
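The resulting dispatch between the three wait paths can be sketched as follows (plain Rust with stand-in names; the actual implementation lives in wgpu-hal's Metal backend, and `choose_wait_path` and the enum below are illustrative, not from the PR):

```rust
// Illustrative stand-in for the wait-path selection described above;
// none of these names come from the actual wgpu-hal source.
#[derive(Clone, Copy, PartialEq, Debug)]
enum WaitPath {
    /// MTLSharedEvent::waitUntilSignaledValue:timeoutMS: (OS-level blocking wait)
    SharedEventBlockingWait,
    /// MTLCommandBuffer::waitUntilCompleted (no-timeout fallback)
    WaitUntilCompleted,
    /// The pre-existing 1 ms status() polling loop (finite-timeout fallback)
    SpinPoll,
}

/// Prefer the shared-event blocking wait whenever a shared event exists;
/// otherwise fall back based on whether a finite timeout was requested.
fn choose_wait_path(has_shared_event: bool, timeout_ms: Option<u64>) -> WaitPath {
    match (has_shared_event, timeout_ms) {
        (true, _) => WaitPath::SharedEventBlockingWait,
        (false, None) => WaitPath::WaitUntilCompleted,
        (false, Some(_)) => WaitPath::SpinPoll,
    }
}

fn main() {
    assert_eq!(choose_wait_path(true, None), WaitPath::SharedEventBlockingWait);
    assert_eq!(choose_wait_path(false, None), WaitPath::WaitUntilCompleted);
    assert_eq!(choose_wait_path(false, Some(100)), WaitPath::SpinPoll);
}
```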

Testing

Added WAIT_INDEFINITELY_LONG_RUNNING to tests/tests/wgpu-gpu/poll.rs. It dispatches a compute shader (xorshift over 65536 threads × 1M iterations; several hundred ms on Apple Silicon) and calls poll(wait_indefinitely()). Without the fix this hangs; with it, the call returns normally. The test requires COMPUTE_SHADERS downlevel support and runs on all real backends, including Metal.
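For reference, the workload is just iterated xorshift; a CPU-side sketch of one step, assuming the common xorshift32 shift triple (13, 17, 5). The shader's actual constants and seed are in the PR diff, so treat this as illustrative:

```rust
// One step of xorshift32, assuming the standard (13, 17, 5) shift triple;
// the test shader's actual constants may differ.
fn xorshift32(mut x: u32) -> u32 {
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    x
}

fn main() {
    // Scaled-down stand-in for one GPU thread's work (the real test runs
    // ~1M iterations per thread across 65536 threads).
    let mut x = 0xDEAD_BEEF_u32; // hypothetical seed
    for _ in 0..1_000 {
        x = xorshift32(x);
        // Zero is the only fixed point, so a nonzero seed stays nonzero.
        assert_ne!(x, 0);
    }
}
```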

Squash or Rebase?

Single commit

Checklist

  • I self-reviewed and fully understand this PR.
  • WebGPU implementations built with wgpu may be affected behaviorally. NO behavioral change
  • Validation and feature gates are in place to confine behavioral changes. NO behavioral change
  • Tests demonstrate the validation and altered logic works.
  • CHANGELOG.md entries for the user-facing effects of this change are present.
  • The PR is minimal, and doesn't make sense to land as multiple PRs.
  • Commits are logically scoped and individually reviewable.
  • The PR description has enough context to understand the motivation and solution implemented.

…d buffers

Use MTLSharedEvent::encodeSignalEvent:value: + waitUntilSignaledValue:timeoutMS:
instead of spin-polling MTLCommandBuffer.status(), which can fail to observe
Completed for long-running command buffers (see gfx-rs#8119).

Also fix get_fence_value() / Fence::get_latest() to read shared_event.signaledValue()
rather than cmd_buf.status(), so that wgpu-core's post-wait fence check agrees with
the GPU-level signal and doesn't spuriously return WaitIdleError::Timeout.
@andyleiserson (Contributor)

Can you explain precisely how the deadlock occurs?

We should fix the spin poll (in fact there's already a PR for that -- #9328), but deadlocks are tricky to pin down and I want to be sure it's really fixed in a way that won't come back later.

@ruihe774 (Contributor, Author) commented May 11, 2026

I just looked at PR #9328; sorry that I did not spot it earlier. It uses a condvar for waiting rather than the API provided by the OS (waitUntilSignaledValue_timeoutMS); I don't quite agree with the condvar approach.

MTLCommandBuffer.status() is an Obj-C property that Metal updates from a task scheduled onto a dispatch queue / runloop. So if the spin wait blocks the runloop, or the runloop is not running, the property never gets updated. That's my guess; I can't confirm that this is how it actually works.

@andyleiserson
Copy link
Copy Markdown
Contributor

Can you explain your concerns about the condvar approach? (Personally, I'm not thrilled about having to go through a Mutex to access fence values that could be read with an Acquire atomic previously, but, the way it is now is simple and clearly correct, unlike the Mutex<()> version or duplicating the completed value.)
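As a generic pattern, a condvar-based fence wait looks roughly like this (a sketch with illustrative names only, not #9328's actual code):

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

// Generic sketch of a condvar-based fence wait; field and function
// names here are illustrative, not taken from wgpu or #9328.
struct Fence {
    completed: Mutex<u64>,
    condvar: Condvar,
}

impl Fence {
    /// Raise the completed value and wake all waiters.
    fn signal(&self, value: u64) {
        let mut completed = self.completed.lock().unwrap();
        *completed = (*completed).max(value);
        self.condvar.notify_all();
    }

    /// Block until `completed >= value` or the timeout elapses.
    fn wait(&self, value: u64, timeout: Duration) -> bool {
        let guard = self.completed.lock().unwrap();
        let (guard, result) = self
            .condvar
            .wait_timeout_while(guard, timeout, |c| *c < value)
            .unwrap();
        !result.timed_out() && *guard >= value
    }
}

fn main() {
    let fence = Arc::new(Fence { completed: Mutex::new(0), condvar: Condvar::new() });
    let signaler = Arc::clone(&fence);
    let handle = thread::spawn(move || {
        thread::sleep(Duration::from_millis(50));
        signaler.signal(1); // stands in for the command-buffer completion handler
    });
    assert!(fence.wait(1, Duration::from_secs(5)));
    handle.join().unwrap();
}
```

Unlike a spin poll, the waiter here sleeps inside the OS until notified or timed out, which is the property both the condvar approach and MTLSharedEvent's blocking wait share.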

I don't know that it's officially stated anywhere, but my understanding is that we have been trying to maintain support back to macOS 10.12, which pre-dates MTLSharedEvent. We can use MTLSharedEvent with a fallback, but it would be nice to have a consistent strategy across all the versions so we're less likely to miss things in testing.

I don't think the issue here is really a deadlock per se. The command buffers are failing with kIOGPUCommandBufferCallbackErrorImpactingInteractivity and have MTLCommandBufferStatusError. In the spin poll wgpu only looks for MTLCommandBufferStatusCompleted so the wait will never complete.
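In that light, the hang comes down to the spin poll's exit predicate; a plain-Rust illustration (the enum mirrors MTLCommandBufferStatus by name but is a stand-in, not the real binding):

```rust
// Mirrors MTLCommandBufferStatus values by name; a plain-Rust stand-in,
// not the actual Metal binding.
#[allow(dead_code)]
#[derive(Clone, Copy, PartialEq, Debug)]
enum CmdBufStatus {
    Enqueued,
    Committed,
    Scheduled,
    Completed,
    Error,
}

// Buggy predicate: a buffer that ends in Error is never "done", so a
// 1 ms polling loop built on this condition spins forever.
fn is_done_buggy(status: CmdBufStatus) -> bool {
    status == CmdBufStatus::Completed
}

// Fixed predicate: any terminal state ends the wait, letting the caller
// surface the error instead of hanging.
fn is_done_fixed(status: CmdBufStatus) -> bool {
    matches!(status, CmdBufStatus::Completed | CmdBufStatus::Error)
}

fn main() {
    assert!(!is_done_buggy(CmdBufStatus::Error));
    assert!(is_done_fixed(CmdBufStatus::Error));
    assert!(is_done_fixed(CmdBufStatus::Completed));
}
```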

@ruihe774 (Contributor, Author)

I did not consider kIOGPUCommandBufferCallbackErrorImpactingInteractivity. Let me investigate it next week.



Development

Successfully merging this pull request may close these issues.

Metal: device.poll(PollType::wait_indefinitely()) deadlocks on CBs that take >~hundreds of ms
