[metal] fix poll(wait_indefinitely()) deadlock on long-running command buffers#9532

Open

ruihe774 wants to merge 1 commit into gfx-rs:trunk from ruihe774:fix-metal-wait

Conversation

@ruihe774 (Contributor) commented May 8, 2026

Connections

Fixes #9531. Same root cause as #8119.

Description

device.poll(PollType::wait_indefinitely()) deadlocked on Metal for command buffers that ran for more than a few hundred milliseconds. Device::wait in wgpu-hal spin-polled MTLCommandBuffer.status() every 1 ms, and for long-running command buffers this loop permanently missed the Completed state, hanging forever when timeout was None.

The fix replaces the spin-poll with MTLSharedEvent::waitUntilSignaledValue:timeoutMS:, which is a proper OS-level blocking wait. The shared_event is already wired up in Queue::submit — it's signaled via encodeSignalEvent_value when the command buffer completes — so Device::wait just needs to call one method on it instead of polling.

In sandboxed environments where MTLSharedEvent is unavailable (shared_event is None), the no-timeout path falls back to MTLCommandBuffer::waitUntilCompleted, and the finite-timeout path keeps the existing spin-poll (which the original issue confirmed works correctly).
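The resulting dispatch between the three wait paths can be sketched as follows (plain Rust with stand-in names; the actual implementation lives in wgpu-hal's Metal backend, and `choose_wait_path` and the enum below are illustrative, not from the PR):

```rust
// Illustrative stand-in for the wait-path selection described above;
// none of these names come from the actual wgpu-hal source.
#[derive(Clone, Copy, PartialEq, Debug)]
enum WaitPath {
    /// MTLSharedEvent::waitUntilSignaledValue:timeoutMS: (OS-level blocking wait)
    SharedEventBlockingWait,
    /// MTLCommandBuffer::waitUntilCompleted (no-timeout fallback)
    WaitUntilCompleted,
    /// The pre-existing 1 ms status() polling loop (finite-timeout fallback)
    SpinPoll,
}

/// Prefer the shared-event blocking wait whenever a shared event exists;
/// otherwise fall back based on whether a finite timeout was requested.
fn choose_wait_path(has_shared_event: bool, timeout_ms: Option<u64>) -> WaitPath {
    match (has_shared_event, timeout_ms) {
        (true, _) => WaitPath::SharedEventBlockingWait,
        (false, None) => WaitPath::WaitUntilCompleted,
        (false, Some(_)) => WaitPath::SpinPoll,
    }
}

fn main() {
    assert_eq!(choose_wait_path(true, None), WaitPath::SharedEventBlockingWait);
    assert_eq!(choose_wait_path(false, None), WaitPath::WaitUntilCompleted);
    assert_eq!(choose_wait_path(false, Some(100)), WaitPath::SpinPoll);
}
```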

Testing

Added WAIT_INDEFINITELY_LONG_RUNNING to tests/tests/wgpu-gpu/poll.rs. It dispatches a compute shader (xorshift over 65536 threads × 1M iterations; several hundred ms on Apple Silicon) and calls poll(wait_indefinitely()). Without the fix this hangs; with it, the call returns normally. The test requires COMPUTE_SHADERS downlevel support and runs on all real backends, including Metal.
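For reference, the workload is just iterated xorshift; a CPU-side sketch of one step, assuming the common xorshift32 shift triple (13, 17, 5). The shader's actual constants and seed are in the PR diff, so treat this as illustrative:

```rust
// One step of xorshift32, assuming the standard (13, 17, 5) shift triple;
// the test shader's actual constants may differ.
fn xorshift32(mut x: u32) -> u32 {
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    x
}

fn main() {
    // Scaled-down stand-in for one GPU thread's work (the real test runs
    // ~1M iterations per thread across 65536 threads).
    let mut x = 0xDEAD_BEEF_u32; // hypothetical seed
    for _ in 0..1_000 {
        x = xorshift32(x);
        // Zero is the only fixed point, so a nonzero seed stays nonzero.
        assert_ne!(x, 0);
    }
}
```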

Squash or Rebase?

Single commit

Checklist

  • I self-reviewed and fully understand this PR.
  • WebGPU implementations built with wgpu may be affected behaviorally. NO behavioral change
  • Validation and feature gates are in place to confine behavioral changes. NO behavioral change
  • Tests demonstrate the validation and altered logic works.
  • CHANGELOG.md entries for the user-facing effects of this change are present.
  • The PR is minimal, and doesn't make sense to land as multiple PRs.
  • Commits are logically scoped and individually reviewable.
  • The PR description has enough context to understand the motivation and solution implemented.

…d buffers

Use MTLSharedEvent::encodeSignalEvent:value: + waitUntilSignaledValue:timeoutMS:
instead of spin-polling MTLCommandBuffer.status(), which can fail to observe
Completed for long-running command buffers (see gfx-rs#8119).

Also fix get_fence_value() / Fence::get_latest() to read shared_event.signaledValue()
rather than cmd_buf.status(), so that wgpu-core's post-wait fence check agrees with
the GPU-level signal and doesn't spuriously return WaitIdleError::Timeout.
@andyleiserson (Contributor)

Can you explain precisely how the deadlock occurs?

We should fix the spin poll (in fact there's already a PR for that -- #9328), but deadlocks are tricky to pin down and I want to be sure it's really fixed in a way that won't come back later.

@ruihe774 (Contributor, Author) commented May 11, 2026

I just looked at PR #9328; sorry that I did not spot it earlier. It uses a condvar for waiting rather than the API provided by the OS (waitUntilSignaledValue_timeoutMS); I don't quite agree with the condvar approach.

MTLCommandBuffer.status() is an Obj-C property that Metal updates from a task scheduled onto a dispatch queue / runloop. So if the spin wait blocks the runloop, or the runloop is not running, the property never gets updated. That's my guess; I can't confirm that this is how it actually works.

@andyleiserson
Copy link
Copy Markdown
Contributor

Can you explain your concerns about the condvar approach? (Personally, I'm not thrilled about having to go through a Mutex to access fence values that could be read with an Acquire atomic previously, but, the way it is now is simple and clearly correct, unlike the Mutex<()> version or duplicating the completed value.)
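As a generic pattern, a condvar-based fence wait looks roughly like this (a sketch with illustrative names only, not #9328's actual code):

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

// Generic sketch of a condvar-based fence wait; field and function
// names here are illustrative, not taken from wgpu or #9328.
struct Fence {
    completed: Mutex<u64>,
    condvar: Condvar,
}

impl Fence {
    /// Raise the completed value and wake all waiters.
    fn signal(&self, value: u64) {
        let mut completed = self.completed.lock().unwrap();
        *completed = (*completed).max(value);
        self.condvar.notify_all();
    }

    /// Block until `completed >= value` or the timeout elapses.
    fn wait(&self, value: u64, timeout: Duration) -> bool {
        let guard = self.completed.lock().unwrap();
        let (guard, result) = self
            .condvar
            .wait_timeout_while(guard, timeout, |c| *c < value)
            .unwrap();
        !result.timed_out() && *guard >= value
    }
}

fn main() {
    let fence = Arc::new(Fence { completed: Mutex::new(0), condvar: Condvar::new() });
    let signaler = Arc::clone(&fence);
    let handle = thread::spawn(move || {
        thread::sleep(Duration::from_millis(50));
        signaler.signal(1); // stands in for the command-buffer completion handler
    });
    assert!(fence.wait(1, Duration::from_secs(5)));
    handle.join().unwrap();
}
```

Unlike a spin poll, the waiter here sleeps inside the OS until notified or timed out, which is the property both the condvar approach and MTLSharedEvent's blocking wait share.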

I don't know that it's officially stated anywhere, but my understanding is that we have been trying to maintain support back to macOS 10.12, which pre-dates MTLSharedEvent. We can use MTLSharedEvent with a fallback, but it would be nice to have a consistent strategy across all the versions so we're less likely to miss things in testing.

I don't think the issue here is really a deadlock per se. The command buffers are failing with kIOGPUCommandBufferCallbackErrorImpactingInteractivity and have MTLCommandBufferStatusError. In the spin poll wgpu only looks for MTLCommandBufferStatusCompleted so the wait will never complete.
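In that light, the hang comes down to the spin poll's exit predicate; a plain-Rust illustration (the enum mirrors MTLCommandBufferStatus by name but is a stand-in, not the real binding):

```rust
// Mirrors MTLCommandBufferStatus values by name; a plain-Rust stand-in,
// not the actual Metal binding.
#[allow(dead_code)]
#[derive(Clone, Copy, PartialEq, Debug)]
enum CmdBufStatus {
    Enqueued,
    Committed,
    Scheduled,
    Completed,
    Error,
}

// Buggy predicate: a buffer that ends in Error is never "done", so a
// 1 ms polling loop built on this condition spins forever.
fn is_done_buggy(status: CmdBufStatus) -> bool {
    status == CmdBufStatus::Completed
}

// Fixed predicate: any terminal state ends the wait, letting the caller
// surface the error instead of hanging.
fn is_done_fixed(status: CmdBufStatus) -> bool {
    matches!(status, CmdBufStatus::Completed | CmdBufStatus::Error)
}

fn main() {
    assert!(!is_done_buggy(CmdBufStatus::Error));
    assert!(is_done_fixed(CmdBufStatus::Error));
    assert!(is_done_fixed(CmdBufStatus::Completed));
}
```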

@ruihe774 (Contributor, Author)

I did not consider kIOGPUCommandBufferCallbackErrorImpactingInteractivity. Let me investigate it next week.



Development

Successfully merging this pull request may close these issues.

Metal: device.poll(PollType::wait_indefinitely()) deadlocks on CBs that take >~hundreds of ms
