fix(satellite): wait for backing block device before mkfs/attach (zfs/udev race)#170
Conversation
…/udev race) `zfs create -V` returns before udev has created the /dev/zvol/<pool>/<vol> symlink, so VolumeStatus can report a device path that does not exist on disk yet. The storage-only (local SC) path is the most exposed: runStorageOnlyMkfs -> runAutoMkfs runs mkfs immediately, with no DRBD step in between to let udev settle, so it raced the symlink and failed with "mkfs.ext4 ...: The file ... does not exist and no size was specified", leaving the volume unformatted and the app stuck. The DRBD path can lose the same race when drbdadm attaches the backing device. applyStorage now blocks on waitForBlockDevice (best-effort `udevadm settle` + authoritative `blockdev --getsize64` probe, bounded by the new DeviceWaitTimeout/DeviceWaitPoll config knobs) before handing each device to the mkfs / DRBD-attach / LUKS consumers, and re-queues the resource if the device never appears. Regression test drives the late udev symlink via a sequenced fake exec; proven to FAIL without the applyStorage call (0 blockdev probes / no timeout error) and PASS with it. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (5)
📝 WalkthroughWalkthroughTwo independent fixes: (1) the satellite reconciler's ChangesZvol block-device wait in satellite reconciler
ResourceDefinitions cache-miss fallback via uncached API reader
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request addresses a race condition between ZFS volume creation and udev symlink generation by introducing a polling mechanism (waitForBlockDevice) that waits for the block device to become usable before proceeding. It also adds configuration parameters for the timeout and poll intervals, along with comprehensive unit tests. The review feedback recommends removing the udevadm settle call from the polling loop because it acts as a global barrier, which can block unnecessarily, drastically reduce polling frequency, and waste system resources.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| // waitForBlockDevice blocks until `device` is a usable block device or the | ||
| // bounded deadline elapses. It exists to close the zfs-create/udev race | ||
| // described at the applyStorage call site: after `zfs create -V` the | ||
| // /dev/zvol/<pool>/<vol> symlink is created asynchronously by udev, and | ||
| // callers that consume the path immediately (notably the storage-only mkfs | ||
| // path) would otherwise act on a node that does not exist yet. | ||
| // | ||
| // `blockdev --getsize64` is the authoritative readiness probe — it opens | ||
| // the device and so errors on a missing or dangling node. Best-effort | ||
| // `udevadm settle` between attempts nudges udev to drain its event queue; | ||
| // it is intentionally ignored on error (it may be absent in unit tests or | ||
| // already idle). An empty device path (diskless replica) or a nil Exec is | ||
| // a no-op. On timeout it returns an error so the caller re-queues the | ||
| // resource instead of mkfs-ing / attaching a path that is not there. | ||
| func (r *Reconciler) waitForBlockDevice(ctx context.Context, device string) error { | ||
| if device == "" || r.cfg.Exec == nil { | ||
| return nil | ||
| } | ||
|
|
||
| timeout := r.cfg.DeviceWaitTimeout | ||
| if timeout <= 0 { | ||
| timeout = deviceWaitTimeoutDefault | ||
| } | ||
|
|
||
| poll := r.cfg.DeviceWaitPoll | ||
| if poll <= 0 { | ||
| poll = deviceWaitPollDefault | ||
| } | ||
|
|
||
| deadline := time.Now().Add(timeout) | ||
|
|
||
| for { | ||
| _, err := r.cfg.Exec.Run(ctx, "blockdev", "--getsize64", device) | ||
| if err == nil { | ||
| return nil | ||
| } | ||
|
|
||
| if time.Now().After(deadline) { | ||
| return errors.Errorf("block device %s did not appear within %s "+ | ||
| "(zfs create -> udev /dev/zvol symlink race)", device, timeout) | ||
| } | ||
|
|
||
| _, _ = r.cfg.Exec.Run(ctx, "udevadm", "settle", "--timeout=5") | ||
|
|
||
| select { | ||
| case <-ctx.Done(): | ||
| return errors.Wrap(ctx.Err(), "wait for block device") | ||
| case <-time.After(poll): | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Why udevadm settle in a polling loop is problematic:
- Global Blocking Behavior:
udevadm settleis a global barrier that blocks until the entire system's udev event queue is empty. On a busy Kubernetes node with concurrent volume attachments, network interface changes, or other hardware events, the queue may frequently be active. This causesudevadm settleto block for its full timeout (5 seconds) even if the target zvol's symlink is already created or unrelated to those events. - Reduced Polling Frequency: Repeatedly blocking for up to 5 seconds per iteration drastically reduces the polling frequency. Instead of polling every 500ms, the loop might only poll once every 5.5 seconds, which can easily cause the 60-second
DeviceWaitTimeoutto expire prematurely under system load. - Resource Waste: In environments where
udevadmis missing or permissions are restricted (e.g., certain containerized setups), spawning a non-existent or failing process every 500ms wastes CPU cycles and floods system logs. - Redundancy:
udevadm settledoes not speed up or trigger udev processing; it is purely passive. The udev daemon will process the events asynchronously in the background anyway.
Recommendation:
Remove udevadm settle from the polling loop entirely. Polling blockdev --getsize64 with a 500ms interval is extremely lightweight, safe, and sufficient.
// waitForBlockDevice blocks until `device` is a usable block device or the
// bounded deadline elapses. It exists to close the zfs-create/udev race
// described at the applyStorage call site: after `zfs create -V` the
// /dev/zvol/<pool>/<vol> symlink is created asynchronously by udev, and
// callers that consume the path immediately (notably the storage-only mkfs
// path) would otherwise act on a node that does not exist yet.
//
// `blockdev --getsize64` is the authoritative readiness probe — it opens
// the device and so errors on a missing or dangling node. An empty device
// path (diskless replica) or a nil Exec is a no-op. On timeout it returns
// an error so the caller re-queues the resource instead of mkfs-ing /
// attaching a path that is not there.
func (r *Reconciler) waitForBlockDevice(ctx context.Context, device string) error {
if device == "" || r.cfg.Exec == nil {
return nil
}
timeout := r.cfg.DeviceWaitTimeout
if timeout <= 0 {
timeout = deviceWaitTimeoutDefault
}
poll := r.cfg.DeviceWaitPoll
if poll <= 0 {
poll = deviceWaitPollDefault
}
deadline := time.Now().Add(timeout)
for {
_, err := r.cfg.Exec.Run(ctx, "blockdev", "--getsize64", device)
if err == nil {
return nil
}
if time.Now().After(deadline) {
return errors.Errorf("block device %s did not appear within %s "+
"(zfs create -> udev /dev/zvol symlink race)", device, timeout)
}
select {
case <-ctx.Done():
return errors.Wrap(ctx.Err(), "wait for block device")
case <-time.After(poll):
}
}
}…04 attach race)
The apiserver runs multi-replica with no leader election, so a
GET /v1/resource-definitions/{rd} can load-balance to a replica whose
informer cache has not yet observed a just-committed RD create. Get read
only the cached client and returned a spurious NotFound, surfaced as a
404 to linstor-csi — ControllerPublishVolume then failed and backed off,
slowing bulk attach of freshly-provisioned PVCs.
Wire the direct (uncached) API reader into resourceDefinitions and fall
back to a live read on a cache NotFound, mirroring the volumeDefinitions
cache-lag handling. nil apiReader (in-memory/unit stores) keeps the
cached NotFound.
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
Problem
Surfaced while bringing up Blockstor as the storage backend in the Cozystack e2e (talos+qemu, ZFS). Storage-backed apps (mariadb/postgres/kafka/harbor/…) intermittently failed; the satellite logged:
zfs create -Vreturns before udev has finished creating the/dev/zvol/<pool>/<vol>symlink, so the device path reported byVolumeStatusmay not exist on disk yet. The storage-only (local SC) path is the most exposed:runStorageOnlyMkfs → runAutoMkfsrunsmkfsimmediately, with no DRBD setup in between to give udev time to settle, so it races the symlink and fails — leaving the volume unformatted and the app stuck. The DRBD path can lose the same race whendrbdadmattaches the backing device.Fix
applyStoragenow blocks onwaitForBlockDevicebefore handing each backing device to its downstream consumers (storage-only mkfs, DRBD attach, LUKS):blockdev --getsize64(errors on a missing/dangling node),udevadm settlebetween attempts to drain the udev queue,ReconcilerConfig.DeviceWaitTimeout/DeviceWaitPollknobs (defaults 60s/500ms),Because
applyStorageruns beforeapplyDRBD, this also closes the same race on the DRBD-attach side.Test
reconciler_zvol_device_wait_test.godrives the late udev symlink via a sequenced fake exec (blockdev --getsize64fails twice, then succeeds). Proven to FAIL without the fix (0 device probes / no timeout error) and PASS with it; the timeout path is covered too.go build ./...and the fullpkg/satellitesuite are green.Summary by CodeRabbit
Release Notes
Bug Fixes
Tests