[metal] fix poll(wait_indefinitely()) deadlock on long-running command buffers #9532
ruihe774 wants to merge 1 commit into
…d buffers
Use MTLSharedEvent::encodeSignalEvent:value: + waitUntilSignaledValue:timeoutMS: instead of spin-polling MTLCommandBuffer.status(), which can fail to observe Completed for long-running command buffers (see gfx-rs#8119).
Also fix get_fence_value() / Fence::get_latest() to read shared_event.signaledValue() rather than cmd_buf.status(), so that wgpu-core's post-wait fence check agrees with the GPU-level signal and doesn't spuriously return WaitIdleError::Timeout.
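To make the difference between spin-polling and a blocking wait concrete, here is a minimal Rust sketch. It does not call Metal and is not the code in this PR: `SignaledValue`, `signal`, `latest`, `wait_until`, and `demo` are hypothetical names, and a standard-library `Mutex` plus `Condvar` stands in for the shared event purely to model "block until a monotonically increasing value reaches a target, with an optional timeout".

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for a GPU-signaled, monotonically increasing value.
/// This is NOT the Metal API; it only models the waiting semantics.
pub struct SignaledValue {
    value: Mutex<u64>,
    changed: Condvar,
}

impl SignaledValue {
    pub fn new() -> Self {
        Self { value: Mutex::new(0), changed: Condvar::new() }
    }

    /// Called by the simulated "GPU" side when a submission finishes.
    /// The stored value never decreases.
    pub fn signal(&self, new_value: u64) {
        let mut v = self.value.lock().unwrap();
        if new_value > *v {
            *v = new_value;
        }
        self.changed.notify_all();
    }

    /// Latest signaled value (the role a fence-value query plays).
    pub fn latest(&self) -> u64 {
        *self.value.lock().unwrap()
    }

    /// Block until the value reaches `target`. `None` waits indefinitely.
    /// Returns true if the target was reached, false on timeout.
    pub fn wait_until(&self, target: u64, timeout: Option<Duration>) -> bool {
        let guard = self.value.lock().unwrap();
        match timeout {
            None => {
                let _g = self.changed.wait_while(guard, |v| *v < target).unwrap();
                true
            }
            Some(t) => {
                let (_g, res) = self
                    .changed
                    .wait_timeout_while(guard, t, |v| *v < target)
                    .unwrap();
                !res.timed_out()
            }
        }
    }
}

/// Simulates a long-running submission on a worker thread.
/// Returns (short wait timed out, indefinite wait completed, latest value).
pub fn demo() -> (bool, bool, u64) {
    let event = Arc::new(SignaledValue::new());

    // Nothing has been signaled yet, so a finite wait must time out.
    let timed_out = !event.wait_until(1, Some(Duration::from_millis(1)));

    let gpu = Arc::clone(&event);
    let worker = thread::spawn(move || {
        thread::sleep(Duration::from_millis(50)); // "long-running" work
        gpu.signal(1);
    });

    // The indefinite wait sleeps until the signal arrives; no status polling.
    let completed = event.wait_until(1, None);
    worker.join().unwrap();
    (timed_out, completed, event.latest())
}
```

In this model `latest` and `wait_until` read the same value, so a check made right after a successful wait cannot disagree with the wait itself. That is the consistency property the commit message describes for get_fence_value().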
|
Can you explain precisely how the deadlock occurs? We should fix the spin poll (in fact there's already a PR for that -- #9328), but deadlocks are tricky to pin down and I want to be sure it's really fixed in a way that won't come back later.
|
I just viewed PR #9328; sorry that I did not spot it. It's using condvar for waiting rather than using the API provided by the OS (
|
Can you explain your concerns about the condvar approach? (Personally, I'm not thrilled about having to go through a …
I don't know that it's officially stated anywhere, but my understanding is we have been trying to maintain support back to macOS 10.12, which pre-dates …
I don't think the issue here is really a deadlock per se. The command buffers are failing with …
|
I did not consider |
Connections
Fixes #9531. Same root cause as #8119.
Description
device.poll(PollType::wait_indefinitely()) deadlocked on Metal for command buffers that ran for more than a few hundred milliseconds. Device::wait in wgpu-hal spin-polled MTLCommandBuffer.status() every 1 ms, and for long-running CBs this loop permanently missed the Completed state, hanging forever when timeout was None.
The fix replaces the spin-poll with MTLSharedEvent::waitUntilSignaledValue:timeoutMS:, which is a proper OS-level blocking wait. The shared_event is already wired up in Queue::submit (it is signaled via encodeSignalEvent_value when the command buffer completes), so Device::wait just needs to call one method on it instead of polling.
For sandboxed environments where MTLSharedEvent is unavailable (shared_event is None): the no-timeout path falls back to MTLCommandBuffer::waitUntilCompleted, and the finite-timeout path keeps the existing spin-poll (which the original issue confirmed works correctly).
Testing
Added WAIT_INDEFINITELY_LONG_RUNNING to tests/tests/wgpu-gpu/poll.rs. It dispatches a compute shader (xorshift over 65536 threads × 1M iterations, ~several hundred ms on Apple Silicon) and calls poll(wait_indefinitely()). Without the fix this hangs; with it, it returns normally. The test requires COMPUTE_SHADERS downlevel support and runs on all real backends including Metal.
Squash or Rebase?
Single commit
Checklist
- WebGPU implementations built with wgpu may be affected behaviorally. (NO behavioral change)
- Validation and feature gates are in place to confine behavioral changes. (NO behavioral change)
- CHANGELOG.md entries for the user-facing effects of this change are present.