Skip to content

Metal: device.poll(PollType::wait_indefinitely()) deadlocks on CBs that take >~hundreds of ms #9531

@ruihe774

Description

@ruihe774

Description

On the Metal backend, device.poll(PollType::wait_indefinitely()) deadlocks (never returns) for command buffers that take more than a few hundred milliseconds of GPU time. An identical poll with a finite timeout — Wait { submission_index: None, timeout: Some(_) } — completes promptly on the same workload, so the underlying GPU work is finishing.

This is the same root mechanism flagged in #8119 (wait spin-polls MTLCommandBuffer.status() instead of using addCompletedHandler + condvar), but with a stronger symptom than that issue documents: under #8119 the wait is just slow; here it's an indefinite hang. Filing separately so the deadlock symptom is searchable.

Repro steps

Standalone Cargo project, ~150 LOC. Run on macOS / Apple Silicon: cargo run --release.

Cargo.toml:

[package]
name = "wgpu-metal-poll-repro"
version = "0.0.0"
edition = "2024"

[dependencies]
wgpu = "29.0"
pollster = "0.4"
bytemuck = "1"

src/main.rs:

use std::time::{Duration, Instant};
use wgpu::util::DeviceExt;

const SHADER: &str = r#"
@group(0) @binding(0) var<storage, read_write> buf: array<u32>;
@group(0) @binding(1) var<uniform> p: Params;
struct Params { iters: u32 };

@compute @workgroup_size(64)
fn cs_main(@builtin(global_invocation_id) gid: vec3<u32>) {
    var x: u32 = gid.x ^ 0xDEADBEEFu;
    for (var i: u32 = 0u; i < p.iters; i = i + 1u) {
        x = x ^ (x << 13u);
        x = x ^ (x >> 17u);
        x = x ^ (x << 5u);
    }
    buf[gid.x] = x;
}
"#;

fn main() { pollster::block_on(run()); }

async fn run() {
    let instance = wgpu::Instance::new(wgpu::InstanceDescriptor {
        backends: wgpu::Backends::METAL,
        backend_options: Default::default(),
        flags: wgpu::InstanceFlags::default(),
        memory_budget_thresholds: Default::default(),
        display: None,
    });
    let adapter = instance.request_adapter(&wgpu::RequestAdapterOptions {
        power_preference: wgpu::PowerPreference::HighPerformance,
        force_fallback_adapter: false,
        compatible_surface: None,
    }).await.unwrap();

    let (device, queue) = adapter.request_device(&wgpu::DeviceDescriptor {
        label: None,
        required_features: wgpu::Features::default(),
        required_limits: wgpu::Limits::default(),
        memory_hints: Default::default(),
        trace: wgpu::Trace::Off,
        experimental_features: Default::default(),
    }).await.unwrap();

    let module = device.create_shader_module(wgpu::ShaderModuleDescriptor {
        label: None,
        source: wgpu::ShaderSource::Wgsl(SHADER.into()),
    });
    let bgl = device.create_bind_group_layout(&wgpu::BindGroupLayoutDescriptor {
        label: None,
        entries: &[
            wgpu::BindGroupLayoutEntry {
                binding: 0, visibility: wgpu::ShaderStages::COMPUTE,
                ty: wgpu::BindingType::Buffer {
                    ty: wgpu::BufferBindingType::Storage { read_only: false },
                    has_dynamic_offset: false, min_binding_size: None,
                },
                count: None,
            },
            wgpu::BindGroupLayoutEntry {
                binding: 1, visibility: wgpu::ShaderStages::COMPUTE,
                ty: wgpu::BindingType::Buffer {
                    ty: wgpu::BufferBindingType::Uniform,
                    has_dynamic_offset: false, min_binding_size: None,
                },
                count: None,
            },
        ],
    });
    let pl = device.create_pipeline_layout(&wgpu::PipelineLayoutDescriptor {
        label: None,
        bind_group_layouts: &[Some(&bgl)],
        immediate_size: 0,
    });
    let pipe = device.create_compute_pipeline(&wgpu::ComputePipelineDescriptor {
        label: None, layout: Some(&pl), module: &module,
        entry_point: Some("cs_main"),
        compilation_options: Default::default(), cache: None,
    });

    const N_THREADS: u32 = 1024 * 64;
    const ITERS: u32 = 5_000_000;
    let buf = device.create_buffer(&wgpu::BufferDescriptor {
        label: None, size: (N_THREADS as u64) * 4,
        usage: wgpu::BufferUsages::STORAGE, mapped_at_creation: false,
    });
    let ubuf = device.create_buffer_init(&wgpu::util::BufferInitDescriptor {
        label: None, contents: bytemuck::bytes_of(&ITERS),
        usage: wgpu::BufferUsages::UNIFORM,
    });
    let bg = device.create_bind_group(&wgpu::BindGroupDescriptor {
        label: None, layout: &bgl, entries: &[
            wgpu::BindGroupEntry { binding: 0, resource: buf.as_entire_binding() },
            wgpu::BindGroupEntry { binding: 1, resource: ubuf.as_entire_binding() },
        ],
    });
    let encode_and_submit = || {
        let mut enc = device.create_command_encoder(&Default::default());
        {
            let mut pass = enc.begin_compute_pass(&Default::default());
            pass.set_pipeline(&pipe);
            pass.set_bind_group(0, &bg, &[]);
            pass.dispatch_workgroups(N_THREADS / 64, 1, 1);
        }
        queue.submit(Some(enc.finish()));
    };

    // [1] Reference: bounded-timeout poll loop completes promptly.
    eprintln!("[1] bounded-timeout poll loop:");
    encode_and_submit();
    let t0 = Instant::now();
    loop {
        let r = device.poll(wgpu::PollType::Wait {
            submission_index: None, timeout: Some(Duration::from_millis(200)),
        });
        eprintln!("  poll @ {:>5.2}s -> {r:?}", t0.elapsed().as_secs_f32());
        match r {
            Ok(wgpu::PollStatus::QueueEmpty | wgpu::PollStatus::WaitSucceeded) => break,
            Err(wgpu::PollError::Timeout) => continue,
            other => panic!("unexpected: {other:?}"),
        }
    }
    eprintln!("[1] elapsed: {:.2}s", t0.elapsed().as_secs_f32());

    // [2] Bug: wait_indefinitely on the same workload deadlocks.
    eprintln!("\n[2] wait_indefinitely on the same workload:");
    encode_and_submit();
    let t0 = Instant::now();
    let _wd = std::thread::spawn(|| {
        std::thread::sleep(Duration::from_secs(60));
        eprintln!("*** watchdog: poll(wait_indefinitely) has not returned after 60s — DEADLOCK ***");
        std::process::exit(2);
    });
    let r = device.poll(wgpu::PollType::wait_indefinitely());
    eprintln!("[2] returned at {:.2}s -> {r:?}", t0.elapsed().as_secs_f32());
}

Expected vs observed behavior

Expected: device.poll(PollType::wait_indefinitely()) returns once the submitted command buffer completes — same as a finite-timeout poll, just without the deadline.

Observed (Apple M2 Pro, macOS 26.3.2, wgpu 29.0.3):

adapter: AdapterInfo { name: "Apple M2 Pro", ..., backend: Metal, ... }

[1] bounded-timeout poll loop (Wait { timeout: Some(200ms) }):
  poll @  0.20s -> Err(Timeout)
  poll @  0.40s -> Err(Timeout)
  poll @  0.60s -> Ok(QueueEmpty)
[1] elapsed: 0.60s

[2] wait_indefinitely on the same workload:

*** watchdog: poll(wait_indefinitely) has not returned after 60s — DEADLOCK confirmed ***

GPU work finishes in ~0.6s under the bounded poll (case 1). With wait_indefinitely on an identical CB (case 2), the call never returns; the 60-second watchdog fires and aborts the process.

Cause (suspected)

<metal::Device as Device>::wait is a spin-poll on MTLCommandBuffer.status():

loop {
    if let MTLCommandBufferStatus::Completed = cmd_buf.status() {
        return Ok(true);
    }
    if let Some(timeout) = timeout {
        if start.elapsed() >= timeout { return Ok(false); }
    }
    thread::sleep(Duration::from_millis(1));
}

For long-running CBs, the polling thread can fail to ever observe Completed from status(), so with timeout: None the loop never exits. With timeout: Some(_) the timeout path returns Ok(false), falls through to wgpu_core::Device::maintainQueue::maintainget_fence_value() (which reads the fence value updated from the addCompletedHandler block), and the queue is correctly observed as empty. That's why case 1 works and case 2 doesn't — the fence-value path is reliable; the status() spin is not.

#8119 already proposes the fix: replace the status() spin with a Condvar whose value is advanced from an addCompletedHandler block. Filing this issue separately to surface the deadlock symptom (rather than just inefficiency).

Workaround

Loop Wait { submission_index: None, timeout: Some(short_duration) } until you observe Ok(QueueEmpty | WaitSucceeded) — each iteration's timeout-triggered queue.maintain() reaches the fence-value path and correctly observes completion.

Platform

  • macOS 26.3.2 (Build 25D2140), Apple M2 Pro
  • wgpu = "29.0" (resolved to 29.0.3)
  • rustc 1.95.0 (59807616e 2026-04-14)
  • Backend: Metal

Related

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type
No fields configured for issues without a type.

Projects

Status

In Progress

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions