`drain_and_flush` calls `submit_and_wait_all` (a single `io_uring_enter` with no `EINTR` retry) followed by `file.sync_all`. Errors are logged and dropped, and `process_async_completion_queue` runs regardless. If the drain did not actually finish, in-flight ops have no CQEs → `PendingRequest` stays in the slab → its `desc_idx` has already been consumed from the avail ring → no used-ring entry is produced. The snapshot is then written claiming `next_avail = N`; on restore that descriptor is permanently lost → guest blk_mq tag is never returned → in-kernel hang. The external symptom is the same as the (now-fixed) ordering bugs, but the cause is different. Also missing: an invariant check that `engine.num_ops() == 0` after a successful drain (PR [#6][pr6] added `pending_ops()` for exactly this).
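A minimal sketch of what a stricter drain could look like: retry on `EINTR`, propagate other errors instead of dropping them, and refuse to continue while ops are still in flight. The `Engine` trait, `DrainError`, `drain`, and `FlakyEngine` are hypothetical stand-ins written for illustration. Only the names `submit_and_wait_all` and `pending_ops` come from the description above, and their signatures here are assumed. The `file.sync_all` step is left out for brevity.

```rust
use std::io;

/// Hypothetical stand-in for the async I/O engine; only the two calls
/// the drain path needs are modelled, with assumed signatures.
trait Engine {
    /// One blocking wait for all in-flight ops (may fail with EINTR).
    fn submit_and_wait_all(&mut self) -> io::Result<()>;
    /// Number of ops submitted but not yet completed.
    fn pending_ops(&self) -> u32;
}

#[derive(Debug)]
enum DrainError {
    /// The wait failed with something other than EINTR.
    Io(io::Error),
    /// The wait returned Ok but ops are still in flight.
    StillPending(u32),
}

/// Retry on EINTR, propagate every other error, and check that the
/// engine is empty before the caller processes completions or snapshots.
fn drain<E: Engine>(engine: &mut E) -> Result<(), DrainError> {
    loop {
        match engine.submit_and_wait_all() {
            Ok(()) => break,
            // EINTR surfaces as ErrorKind::Interrupted; just wait again.
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(DrainError::Io(e)),
        }
    }
    match engine.pending_ops() {
        0 => Ok(()),
        n => Err(DrainError::StillPending(n)),
    }
}

/// Test double: fails with EINTR `eintr_left` times, then reports
/// `stuck` ops as still in flight after the wait returns.
struct FlakyEngine {
    eintr_left: u32,
    in_flight: u32,
    stuck: u32,
}

impl Engine for FlakyEngine {
    fn submit_and_wait_all(&mut self) -> io::Result<()> {
        if self.eintr_left > 0 {
            self.eintr_left -= 1;
            return Err(io::Error::from(io::ErrorKind::Interrupted));
        }
        self.in_flight = self.stuck;
        Ok(())
    }

    fn pending_ops(&self) -> u32 {
        self.in_flight
    }
}
```

With this shape the caller can abort the snapshot on `Err` rather than serialising a `next_avail` that covers descriptors with no used-ring entry. A real implementation would probably also bound the retry loop.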