Description
On the Metal backend, device.poll(PollType::wait_indefinitely()) deadlocks (never returns) for command buffers that take more than a few hundred milliseconds of GPU time. An identical poll with a finite timeout — Wait { submission_index: None, timeout: Some(_) } — completes promptly on the same workload, so the underlying GPU work is finishing.
This is the same root mechanism flagged in #8119 (wait spin-polls MTLCommandBuffer.status() instead of using addCompletedHandler + condvar), but with a stronger symptom than that issue documents: under #8119 the wait is just slow; here it's an indefinite hang. Filing separately so the deadlock symptom is searchable.
Repro steps
Standalone Cargo project, ~150 LOC. Run on macOS / Apple Silicon: cargo run --release.
Cargo.toml:
[package]
name = "wgpu-metal-poll-repro"
version = "0.0.0"
edition = "2024"
[dependencies]
wgpu = "29.0"
pollster = "0.4"
bytemuck = "1"
src/main.rs:
use std::time::{Duration, Instant};
use wgpu::util::DeviceExt;
const SHADER: &str = r#"
@group(0) @binding(0) var<storage, read_write> buf: array<u32>;
@group(0) @binding(1) var<uniform> p: Params;
struct Params { iters: u32 };
@compute @workgroup_size(64)
fn cs_main(@builtin(global_invocation_id) gid: vec3<u32>) {
var x: u32 = gid.x ^ 0xDEADBEEFu;
for (var i: u32 = 0u; i < p.iters; i = i + 1u) {
x = x ^ (x << 13u);
x = x ^ (x >> 17u);
x = x ^ (x << 5u);
}
buf[gid.x] = x;
}
"#;
fn main() { pollster::block_on(run()); }
async fn run() {
let instance = wgpu::Instance::new(wgpu::InstanceDescriptor {
backends: wgpu::Backends::METAL,
backend_options: Default::default(),
flags: wgpu::InstanceFlags::default(),
memory_budget_thresholds: Default::default(),
display: None,
});
let adapter = instance.request_adapter(&wgpu::RequestAdapterOptions {
power_preference: wgpu::PowerPreference::HighPerformance,
force_fallback_adapter: false,
compatible_surface: None,
}).await.unwrap();
let (device, queue) = adapter.request_device(&wgpu::DeviceDescriptor {
label: None,
required_features: wgpu::Features::default(),
required_limits: wgpu::Limits::default(),
memory_hints: Default::default(),
trace: wgpu::Trace::Off,
experimental_features: Default::default(),
}).await.unwrap();
let module = device.create_shader_module(wgpu::ShaderModuleDescriptor {
label: None,
source: wgpu::ShaderSource::Wgsl(SHADER.into()),
});
let bgl = device.create_bind_group_layout(&wgpu::BindGroupLayoutDescriptor {
label: None,
entries: &[
wgpu::BindGroupLayoutEntry {
binding: 0, visibility: wgpu::ShaderStages::COMPUTE,
ty: wgpu::BindingType::Buffer {
ty: wgpu::BufferBindingType::Storage { read_only: false },
has_dynamic_offset: false, min_binding_size: None,
},
count: None,
},
wgpu::BindGroupLayoutEntry {
binding: 1, visibility: wgpu::ShaderStages::COMPUTE,
ty: wgpu::BindingType::Buffer {
ty: wgpu::BufferBindingType::Uniform,
has_dynamic_offset: false, min_binding_size: None,
},
count: None,
},
],
});
let pl = device.create_pipeline_layout(&wgpu::PipelineLayoutDescriptor {
label: None,
bind_group_layouts: &[Some(&bgl)],
immediate_size: 0,
});
let pipe = device.create_compute_pipeline(&wgpu::ComputePipelineDescriptor {
label: None, layout: Some(&pl), module: &module,
entry_point: Some("cs_main"),
compilation_options: Default::default(), cache: None,
});
const N_THREADS: u32 = 1024 * 64;
const ITERS: u32 = 5_000_000;
let buf = device.create_buffer(&wgpu::BufferDescriptor {
label: None, size: (N_THREADS as u64) * 4,
usage: wgpu::BufferUsages::STORAGE, mapped_at_creation: false,
});
let ubuf = device.create_buffer_init(&wgpu::util::BufferInitDescriptor {
label: None, contents: bytemuck::bytes_of(&ITERS),
usage: wgpu::BufferUsages::UNIFORM,
});
let bg = device.create_bind_group(&wgpu::BindGroupDescriptor {
label: None, layout: &bgl, entries: &[
wgpu::BindGroupEntry { binding: 0, resource: buf.as_entire_binding() },
wgpu::BindGroupEntry { binding: 1, resource: ubuf.as_entire_binding() },
],
});
let encode_and_submit = || {
let mut enc = device.create_command_encoder(&Default::default());
{
let mut pass = enc.begin_compute_pass(&Default::default());
pass.set_pipeline(&pipe);
pass.set_bind_group(0, &bg, &[]);
pass.dispatch_workgroups(N_THREADS / 64, 1, 1);
}
queue.submit(Some(enc.finish()));
};
// [1] Reference: bounded-timeout poll loop completes promptly.
eprintln!("[1] bounded-timeout poll loop:");
encode_and_submit();
let t0 = Instant::now();
loop {
let r = device.poll(wgpu::PollType::Wait {
submission_index: None, timeout: Some(Duration::from_millis(200)),
});
eprintln!(" poll @ {:>5.2}s -> {r:?}", t0.elapsed().as_secs_f32());
match r {
Ok(wgpu::PollStatus::QueueEmpty | wgpu::PollStatus::WaitSucceeded) => break,
Err(wgpu::PollError::Timeout) => continue,
other => panic!("unexpected: {other:?}"),
}
}
eprintln!("[1] elapsed: {:.2}s", t0.elapsed().as_secs_f32());
// [2] Bug: wait_indefinitely on the same workload deadlocks.
eprintln!("\n[2] wait_indefinitely on the same workload:");
encode_and_submit();
let t0 = Instant::now();
let _wd = std::thread::spawn(|| {
std::thread::sleep(Duration::from_secs(60));
eprintln!("*** watchdog: poll(wait_indefinitely) has not returned after 60s — DEADLOCK ***");
std::process::exit(2);
});
let r = device.poll(wgpu::PollType::wait_indefinitely());
eprintln!("[2] returned at {:.2}s -> {r:?}", t0.elapsed().as_secs_f32());
}
Expected vs observed behavior
Expected: device.poll(PollType::wait_indefinitely()) returns once the submitted command buffer completes — same as a finite-timeout poll, just without the deadline.
Observed (Apple M2 Pro, macOS 26.3.2, wgpu 29.0.3):
adapter: AdapterInfo { name: "Apple M2 Pro", ..., backend: Metal, ... }
[1] bounded-timeout poll loop (Wait { timeout: Some(200ms) }):
poll @ 0.20s -> Err(Timeout)
poll @ 0.40s -> Err(Timeout)
poll @ 0.60s -> Ok(QueueEmpty)
[1] elapsed: 0.60s
[2] wait_indefinitely on the same workload:
*** watchdog: poll(wait_indefinitely) has not returned after 60s — DEADLOCK confirmed ***
GPU work finishes in ~0.6s under the bounded poll (case 1). With wait_indefinitely on an identical CB (case 2), the call never returns; the 60-second watchdog fires and aborts the process.
Cause (suspected)
<metal::Device as Device>::wait is a spin-poll on MTLCommandBuffer.status():
loop {
if let MTLCommandBufferStatus::Completed = cmd_buf.status() {
return Ok(true);
}
if let Some(timeout) = timeout {
if start.elapsed() >= timeout { return Ok(false); }
}
thread::sleep(Duration::from_millis(1));
}
For long-running CBs, the polling thread can fail to ever observe Completed from status(), so with timeout: None the loop never exits. With timeout: Some(_) the timeout path returns Ok(false), falls through to wgpu_core::Device::maintain → Queue::maintain → get_fence_value() (which reads the fence value updated from the addCompletedHandler block), and the queue is correctly observed as empty. That's why case 1 works and case 2 doesn't — the fence-value path is reliable; the status() spin is not.
#8119 already proposes the fix: replace the status() spin with a Condvar whose value is advanced from an addCompletedHandler block. Filing this issue separately to surface the deadlock symptom (rather than just inefficiency).
Workaround
Loop Wait { submission_index: None, timeout: Some(short_duration) } until you observe Ok(QueueEmpty | WaitSucceeded) — each iteration's timeout-triggered queue.maintain() reaches the fence-value path and correctly observes completion.
Platform
- macOS 26.3.2 (Build 25D2140), Apple M2 Pro
wgpu = "29.0" (resolved to 29.0.3)
rustc 1.95.0 (59807616e 2026-04-14)
- Backend: Metal
Related
Description
On the Metal backend,
device.poll(PollType::wait_indefinitely())deadlocks (never returns) for command buffers that take more than a few hundred milliseconds of GPU time. An identical poll with a finite timeout —Wait { submission_index: None, timeout: Some(_) }— completes promptly on the same workload, so the underlying GPU work is finishing.This is the same root mechanism flagged in #8119 (
waitspin-pollsMTLCommandBuffer.status()instead of usingaddCompletedHandler+ condvar), but with a stronger symptom than that issue documents: under #8119 the wait is just slow; here it's an indefinite hang. Filing separately so the deadlock symptom is searchable.Repro steps
Standalone Cargo project, ~150 LOC. Run on macOS / Apple Silicon:
cargo run --release.Cargo.toml:src/main.rs:Expected vs observed behavior
Expected:
device.poll(PollType::wait_indefinitely())returns once the submitted command buffer completes — same as a finite-timeout poll, just without the deadline.Observed (Apple M2 Pro, macOS 26.3.2, wgpu 29.0.3):
GPU work finishes in ~0.6s under the bounded poll (case 1). With
wait_indefinitelyon an identical CB (case 2), the call never returns; the 60-second watchdog fires and aborts the process.Cause (suspected)
<metal::Device as Device>::waitis a spin-poll onMTLCommandBuffer.status():For long-running CBs, the polling thread can fail to ever observe
Completedfromstatus(), so withtimeout: Nonethe loop never exits. Withtimeout: Some(_)the timeout path returnsOk(false), falls through towgpu_core::Device::maintain→Queue::maintain→get_fence_value()(which reads the fence value updated from theaddCompletedHandlerblock), and the queue is correctly observed as empty. That's why case 1 works and case 2 doesn't — the fence-value path is reliable; thestatus()spin is not.#8119 already proposes the fix: replace the
status()spin with aCondvarwhose value is advanced from anaddCompletedHandlerblock. Filing this issue separately to surface the deadlock symptom (rather than just inefficiency).Workaround
Loop
Wait { submission_index: None, timeout: Some(short_duration) }until you observeOk(QueueEmpty | WaitSucceeded)— each iteration's timeout-triggeredqueue.maintain()reaches the fence-value path and correctly observes completion.Platform
wgpu = "29.0"(resolved to 29.0.3)rustc 1.95.0 (59807616e 2026-04-14)Related
wgpu_hal::Device::waitimplementation polls instead of waiting #8119 — same root cause (spin-poll onMTLCommandBuffer.status()), framed as a perf issue.