Metal: device.poll(PollType::wait_indefinitely()) deadlocks on CBs that take >~hundreds of ms

**Description**

On the Metal backend, `device.poll(PollType::wait_indefinitely())` deadlocks (never returns) for command buffers that take more than a few hundred milliseconds of GPU time. An identical poll with a finite timeout — `Wait { submission_index: None, timeout: Some(_) }` — completes promptly on the same workload, so the underlying GPU work *is* finishing.

This is the same root mechanism flagged in #8119 (`wait` spin-polls `MTLCommandBuffer.status()` instead of using `addCompletedHandler` + condvar), but with a stronger symptom than that issue documents: under #8119 the wait is just *slow*; here it's an indefinite hang. Filing separately so the deadlock symptom is searchable.

**Repro steps**

Standalone Cargo project, ~150 LOC. Run on macOS / Apple Silicon: `cargo run --release`.

`Cargo.toml`:
```toml
[package]
name = "wgpu-metal-poll-repro"
version = "0.0.0"
edition = "2024"

[dependencies]
wgpu = "29.0"
pollster = "0.4"
bytemuck = "1"
```

`src/main.rs`:
```rust
use std::time::{Duration, Instant};
use wgpu::util::DeviceExt;

const SHADER: &str = r#"
@group(0) @binding(0) var<storage, read_write> buf: array<u32>;
@group(0) @binding(1) var<uniform> p: Params;
struct Params { iters: u32 };

@compute @workgroup_size(64)
fn cs_main(@builtin(global_invocation_id) gid: vec3<u32>) {
    var x: u32 = gid.x ^ 0xDEADBEEFu;
    for (var i: u32 = 0u; i < p.iters; i = i + 1u) {
        x = x ^ (x << 13u);
        x = x ^ (x >> 17u);
        x = x ^ (x << 5u);
    }
    buf[gid.x] = x;
}
"#;

fn main() { pollster::block_on(run()); }

async fn run() {
    let instance = wgpu::Instance::new(wgpu::InstanceDescriptor {
        backends: wgpu::Backends::METAL,
        backend_options: Default::default(),
        flags: wgpu::InstanceFlags::default(),
        memory_budget_thresholds: Default::default(),
        display: None,
    });
    let adapter = instance.request_adapter(&wgpu::RequestAdapterOptions {
        power_preference: wgpu::PowerPreference::HighPerformance,
        force_fallback_adapter: false,
        compatible_surface: None,
    }).await.unwrap();

    let (device, queue) = adapter.request_device(&wgpu::DeviceDescriptor {
        label: None,
        required_features: wgpu::Features::default(),
        required_limits: wgpu::Limits::default(),
        memory_hints: Default::default(),
        trace: wgpu::Trace::Off,
        experimental_features: Default::default(),
    }).await.unwrap();

    let module = device.create_shader_module(wgpu::ShaderModuleDescriptor {
        label: None,
        source: wgpu::ShaderSource::Wgsl(SHADER.into()),
    });
    let bgl = device.create_bind_group_layout(&wgpu::BindGroupLayoutDescriptor {
        label: None,
        entries: &[
            wgpu::BindGroupLayoutEntry {
                binding: 0, visibility: wgpu::ShaderStages::COMPUTE,
                ty: wgpu::BindingType::Buffer {
                    ty: wgpu::BufferBindingType::Storage { read_only: false },
                    has_dynamic_offset: false, min_binding_size: None,
                },
                count: None,
            },
            wgpu::BindGroupLayoutEntry {
                binding: 1, visibility: wgpu::ShaderStages::COMPUTE,
                ty: wgpu::BindingType::Buffer {
                    ty: wgpu::BufferBindingType::Uniform,
                    has_dynamic_offset: false, min_binding_size: None,
                },
                count: None,
            },
        ],
    });
    let pl = device.create_pipeline_layout(&wgpu::PipelineLayoutDescriptor {
        label: None,
        bind_group_layouts: &[Some(&bgl)],
        immediate_size: 0,
    });
    let pipe = device.create_compute_pipeline(&wgpu::ComputePipelineDescriptor {
        label: None, layout: Some(&pl), module: &module,
        entry_point: Some("cs_main"),
        compilation_options: Default::default(), cache: None,
    });

    const N_THREADS: u32 = 1024 * 64;
    const ITERS: u32 = 5_000_000;
    let buf = device.create_buffer(&wgpu::BufferDescriptor {
        label: None, size: (N_THREADS as u64) * 4,
        usage: wgpu::BufferUsages::STORAGE, mapped_at_creation: false,
    });
    let ubuf = device.create_buffer_init(&wgpu::util::BufferInitDescriptor {
        label: None, contents: bytemuck::bytes_of(&ITERS),
        usage: wgpu::BufferUsages::UNIFORM,
    });
    let bg = device.create_bind_group(&wgpu::BindGroupDescriptor {
        label: None, layout: &bgl, entries: &[
            wgpu::BindGroupEntry { binding: 0, resource: buf.as_entire_binding() },
            wgpu::BindGroupEntry { binding: 1, resource: ubuf.as_entire_binding() },
        ],
    });
    let encode_and_submit = || {
        let mut enc = device.create_command_encoder(&Default::default());
        {
            let mut pass = enc.begin_compute_pass(&Default::default());
            pass.set_pipeline(&pipe);
            pass.set_bind_group(0, &bg, &[]);
            pass.dispatch_workgroups(N_THREADS / 64, 1, 1);
        }
        queue.submit(Some(enc.finish()));
    };

    // [1] Reference: bounded-timeout poll loop completes promptly.
    eprintln!("[1] bounded-timeout poll loop:");
    encode_and_submit();
    let t0 = Instant::now();
    loop {
        let r = device.poll(wgpu::PollType::Wait {
            submission_index: None, timeout: Some(Duration::from_millis(200)),
        });
        eprintln!("  poll @ {:>5.2}s -> {r:?}", t0.elapsed().as_secs_f32());
        match r {
            Ok(wgpu::PollStatus::QueueEmpty | wgpu::PollStatus::WaitSucceeded) => break,
            Err(wgpu::PollError::Timeout) => continue,
            other => panic!("unexpected: {other:?}"),
        }
    }
    eprintln!("[1] elapsed: {:.2}s", t0.elapsed().as_secs_f32());

    // [2] Bug: wait_indefinitely on the same workload deadlocks.
    eprintln!("\n[2] wait_indefinitely on the same workload:");
    encode_and_submit();
    let t0 = Instant::now();
    let _wd = std::thread::spawn(|| {
        std::thread::sleep(Duration::from_secs(60));
        eprintln!("*** watchdog: poll(wait_indefinitely) has not returned after 60s — DEADLOCK ***");
        std::process::exit(2);
    });
    let r = device.poll(wgpu::PollType::wait_indefinitely());
    eprintln!("[2] returned at {:.2}s -> {r:?}", t0.elapsed().as_secs_f32());
}
```

**Expected vs observed behavior**

Expected: `device.poll(PollType::wait_indefinitely())` returns once the submitted command buffer completes — same as a finite-timeout poll, just without the deadline.

Observed (Apple M2 Pro, macOS 26.3.2, wgpu 29.0.3):
```
adapter: AdapterInfo { name: "Apple M2 Pro", ..., backend: Metal, ... }

[1] bounded-timeout poll loop (Wait { timeout: Some(200ms) }):
  poll @  0.20s -> Err(Timeout)
  poll @  0.40s -> Err(Timeout)
  poll @  0.60s -> Ok(QueueEmpty)
[1] elapsed: 0.60s

[2] wait_indefinitely on the same workload:

*** watchdog: poll(wait_indefinitely) has not returned after 60s — DEADLOCK confirmed ***
```

GPU work finishes in ~0.6s under the bounded poll (case 1). With `wait_indefinitely` on an identical CB (case 2), the call never returns; the 60-second watchdog fires and aborts the process.

**Cause (suspected)**

[`<metal::Device as Device>::wait`](https://github.com/gfx-rs/wgpu/blob/v29.0.3/wgpu-hal/src/metal/device.rs#L1875-L1909) is a spin-poll on `MTLCommandBuffer.status()`:

```rust
loop {
    if let MTLCommandBufferStatus::Completed = cmd_buf.status() {
        return Ok(true);
    }
    if let Some(timeout) = timeout {
        if start.elapsed() >= timeout { return Ok(false); }
    }
    thread::sleep(Duration::from_millis(1));
}
```

For long-running CBs, the polling thread can fail to ever observe `Completed` from `status()`, so with `timeout: None` the loop never exits. With `timeout: Some(_)` the timeout path returns `Ok(false)`, falls through to `wgpu_core::Device::maintain` → `Queue::maintain` → `get_fence_value()` (which reads the fence value updated from the `addCompletedHandler` block), and the queue is correctly observed as empty. That's why case 1 works and case 2 doesn't — the fence-value path is reliable; the `status()` spin is not.

#8119 already proposes the fix: replace the `status()` spin with a `Condvar` whose value is advanced from an `addCompletedHandler` block. Filing this issue separately to surface the deadlock symptom (rather than just inefficiency).

**Workaround**

Loop `Wait { submission_index: None, timeout: Some(short_duration) }` until you observe `Ok(QueueEmpty | WaitSucceeded)` — each iteration's timeout-triggered `queue.maintain()` reaches the fence-value path and correctly observes completion.

**Platform**

- macOS 26.3.2 (Build 25D2140), Apple M2 Pro
- `wgpu = "29.0"` (resolved to 29.0.3)
- `rustc 1.95.0 (59807616e 2026-04-14)`
- Backend: Metal

**Related**

- #8119 — same root cause (spin-poll on `MTLCommandBuffer.status()`), framed as a perf issue.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Metal: device.poll(PollType::wait_indefinitely()) deadlocks on CBs that take >~hundreds of ms #9531

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Metal: device.poll(PollType::wait_indefinitely()) deadlocks on CBs that take >~hundreds of ms #9531

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions