Skip to content

fix(satellite): wait for backing block device before mkfs/attach (zfs/udev race)#170

Merged
Andrei Kvapil (kvaps) merged 2 commits into
mainfrom
fix/satellite-zvol-device-wait
Jun 22, 2026
Merged

fix(satellite): wait for backing block device before mkfs/attach (zfs/udev race)#170
Andrei Kvapil (kvaps) merged 2 commits into
mainfrom
fix/satellite-zvol-device-wait

Conversation

@kvaps

@kvaps Andrei Kvapil (kvaps) commented Jun 19, 2026

Copy link
Copy Markdown
Member

Problem

Surfaced while bringing up Blockstor as the storage backend in the Cozystack e2e (talos+qemu, ZFS). Storage-backed apps (mariadb/postgres/kafka/harbor/…) intermittently failed; the satellite logged:

Apply per-resource failure ... mkfs.ext4 /dev/zvol/data/pvc-XXX_00000:
  The file /dev/zvol/data/pvc-XXX_00000 does not exist and no size was specified.: exit status 1

zfs create -V returns before udev has finished creating the /dev/zvol/<pool>/<vol> symlink, so the device path reported by VolumeStatus may not exist on disk yet. The storage-only (local SC) path is the most exposed: runStorageOnlyMkfs → runAutoMkfs runs mkfs immediately, with no DRBD setup in between to give udev time to settle, so it races the symlink and fails — leaving the volume unformatted and the app stuck. The DRBD path can lose the same race when drbdadm attaches the backing device.

Fix

applyStorage now blocks on waitForBlockDevice before handing each backing device to its downstream consumers (storage-only mkfs, DRBD attach, LUKS):

  • authoritative readiness probe via blockdev --getsize64 (errors on a missing/dangling node),
  • best-effort udevadm settle between attempts to drain the udev queue,
  • bounded by new ReconcilerConfig.DeviceWaitTimeout/DeviceWaitPoll knobs (defaults 60s/500ms),
  • re-queues the resource if the device never appears rather than acting on a path that is not there.

Because applyStorage runs before applyDRBD, this also closes the same race on the DRBD-attach side.

Test

reconciler_zvol_device_wait_test.go drives the late udev symlink via a sequenced fake exec (blockdev --getsize64 fails twice, then succeeds). Proven to FAIL without the fix (0 device probes / no timeout error) and PASS with it; the timeout path is covered too. go build ./... and the full pkg/satellite suite are green.

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Improved reliability of ZFS volume creation with device readiness checks and configurable timeout settings.
    • Enhanced resource read operations with automatic fallback to uncached API access when cache data is unavailable.
  • Tests

    • Added comprehensive test coverage for ZFS volume device readiness and resource cache miss scenarios.

…/udev race)

`zfs create -V` returns before udev has created the
/dev/zvol/<pool>/<vol> symlink, so VolumeStatus can report a device path
that does not exist on disk yet. The storage-only (local SC) path is the
most exposed: runStorageOnlyMkfs -> runAutoMkfs runs mkfs immediately,
with no DRBD step in between to let udev settle, so it raced the symlink
and failed with "mkfs.ext4 ...: The file ... does not exist and no size
was specified", leaving the volume unformatted and the app stuck. The
DRBD path can lose the same race when drbdadm attaches the backing
device.

applyStorage now blocks on waitForBlockDevice (best-effort `udevadm
settle` + authoritative `blockdev --getsize64` probe, bounded by the new
DeviceWaitTimeout/DeviceWaitPoll config knobs) before handing each device
to the mkfs / DRBD-attach / LUKS consumers, and re-queues the resource if
the device never appears.

Regression test drives the late udev symlink via a sequenced fake exec;
proven to FAIL without the applyStorage call (0 blockdev probes / no
timeout error) and PASS with it.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@coderabbitai

coderabbitai Bot commented Jun 19, 2026

Copy link
Copy Markdown

Review Change Stack

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ef4df715-316b-4bd1-aa18-82fa777a1d71

📥 Commits

Reviewing files that changed from the base of the PR and between 45e6d77 and a087222.

📒 Files selected for processing (5)
  • pkg/satellite/reconciler.go
  • pkg/satellite/reconciler_zvol_device_wait_test.go
  • pkg/store/k8s/k8s.go
  • pkg/store/k8s/resource_definitions.go
  • pkg/store/k8s/resource_definitions_rd404_internal_test.go

📝 Walkthrough

Walkthrough

Two independent fixes: (1) the satellite reconciler's applyStorage now polls for a zvol block device via blockdev --getsize64 and udevadm settle before proceeding with mkfs/DRBD/LUKS, with configurable timeout/poll knobs on ReconcilerConfig; (2) the k8s store's resourceDefinitions.Get falls back to an uncached apiReader on NotFound to avoid treating informer cache lag as a definitive absence.

Changes

Zvol block-device wait in satellite reconciler

Layer / File(s) Summary
ReconcilerConfig fields and waitForBlockDevice helper
pkg/satellite/reconciler.go
Adds DeviceWaitTimeout and DeviceWaitPoll to ReconcilerConfig and implements waitForBlockDevice, which polls blockdev --getsize64 and optionally runs udevadm settle until the device is ready or the deadline elapses.
applyStorage integration and zvol wait tests
pkg/satellite/reconciler.go, pkg/satellite/reconciler_zvol_device_wait_test.go
applyStorage calls waitForBlockDevice after VolumeStatus yields a device path, returning a requeue error on timeout. Tests use a sequenced fake exec to assert ≥3 blockdev calls on polling and a "did not appear" error on timeout.

ResourceDefinitions cache-miss fallback via uncached API reader

Layer / File(s) Summary
resourceDefinitions struct, Get fallback, and getUncached helper
pkg/store/k8s/resource_definitions.go, pkg/store/k8s/k8s.go
Adds apiReader field to resourceDefinitions; Get now calls getUncached on NotFound instead of immediately returning store.ErrNotFound; getUncached performs a live read or returns store.ErrNotFound when apiReader is nil. NewWithAPIReader wires the reader into initialization.
ResourceDefinitions cache-miss unit tests
pkg/store/k8s/resource_definitions_rd404_internal_test.go
Two internal tests: one asserts Get succeeds via apiReader fallback when the cached client simulates informer lag; the other asserts nil apiReader preserves the cached store.ErrNotFound.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 A zvol appears, but udev's slow to wake,
So blockdev polls until the path's no fake.
The cache says "not found!" — don't trust it blind,
A live apiReader reads what's left behind.
Two races patched, two tests to prove they're done,
This bunny hops forward — the symlink race is won! 🎉

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/satellite-zvol-device-wait

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses a race condition between ZFS volume creation and udev symlink generation by introducing a polling mechanism (waitForBlockDevice) that waits for the block device to become usable before proceeding. It also adds configuration parameters for the timeout and poll intervals, along with comprehensive unit tests. The review feedback recommends removing the udevadm settle call from the polling loop because it acts as a global barrier, which can block unnecessarily, drastically reduce polling frequency, and waste system resources.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +1554 to +1604
// waitForBlockDevice blocks until `device` is a usable block device or the
// bounded deadline elapses. It exists to close the zfs-create/udev race
// described at the applyStorage call site: after `zfs create -V` the
// /dev/zvol/<pool>/<vol> symlink is created asynchronously by udev, and
// callers that consume the path immediately (notably the storage-only mkfs
// path) would otherwise act on a node that does not exist yet.
//
// `blockdev --getsize64` is the authoritative readiness probe — it opens
// the device and so errors on a missing or dangling node. Best-effort
// `udevadm settle` between attempts nudges udev to drain its event queue;
// it is intentionally ignored on error (it may be absent in unit tests or
// already idle). An empty device path (diskless replica) or a nil Exec is
// a no-op. On timeout it returns an error so the caller re-queues the
// resource instead of mkfs-ing / attaching a path that is not there.
func (r *Reconciler) waitForBlockDevice(ctx context.Context, device string) error {
if device == "" || r.cfg.Exec == nil {
return nil
}

timeout := r.cfg.DeviceWaitTimeout
if timeout <= 0 {
timeout = deviceWaitTimeoutDefault
}

poll := r.cfg.DeviceWaitPoll
if poll <= 0 {
poll = deviceWaitPollDefault
}

deadline := time.Now().Add(timeout)

for {
_, err := r.cfg.Exec.Run(ctx, "blockdev", "--getsize64", device)
if err == nil {
return nil
}

if time.Now().After(deadline) {
return errors.Errorf("block device %s did not appear within %s "+
"(zfs create -> udev /dev/zvol symlink race)", device, timeout)
}

_, _ = r.cfg.Exec.Run(ctx, "udevadm", "settle", "--timeout=5")

select {
case <-ctx.Done():
return errors.Wrap(ctx.Err(), "wait for block device")
case <-time.After(poll):
}
}
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Why udevadm settle in a polling loop is problematic:

  1. Global Blocking Behavior: udevadm settle is a global barrier that blocks until the entire system's udev event queue is empty. On a busy Kubernetes node with concurrent volume attachments, network interface changes, or other hardware events, the queue may frequently be active. This causes udevadm settle to block for its full timeout (5 seconds) even if the target zvol's symlink is already created or unrelated to those events.
  2. Reduced Polling Frequency: Repeatedly blocking for up to 5 seconds per iteration drastically reduces the polling frequency. Instead of polling every 500ms, the loop might only poll once every 5.5 seconds, which can easily cause the 60-second DeviceWaitTimeout to expire prematurely under system load.
  3. Resource Waste: In environments where udevadm is missing or permissions are restricted (e.g., certain containerized setups), spawning a non-existent or failing process every 500ms wastes CPU cycles and floods system logs.
  4. Redundancy: udevadm settle does not speed up or trigger udev processing; it is purely passive. The udev daemon will process the events asynchronously in the background anyway.

Recommendation:

Remove udevadm settle from the polling loop entirely. Polling blockdev --getsize64 with a 500ms interval is extremely lightweight, safe, and sufficient.

// waitForBlockDevice blocks until `device` is a usable block device or the
// bounded deadline elapses. It exists to close the zfs-create/udev race
// described at the applyStorage call site: after `zfs create -V` the
// /dev/zvol/<pool>/<vol> symlink is created asynchronously by udev, and
// callers that consume the path immediately (notably the storage-only mkfs
// path) would otherwise act on a node that does not exist yet.
//
// `blockdev --getsize64` is the authoritative readiness probe — it opens
// the device and so errors on a missing or dangling node. An empty device
// path (diskless replica) or a nil Exec is a no-op. On timeout it returns
// an error so the caller re-queues the resource instead of mkfs-ing /
// attaching a path that is not there.
func (r *Reconciler) waitForBlockDevice(ctx context.Context, device string) error {
	if device == "" || r.cfg.Exec == nil {
		return nil
	}

	timeout := r.cfg.DeviceWaitTimeout
	if timeout <= 0 {
		timeout = deviceWaitTimeoutDefault
	}

	poll := r.cfg.DeviceWaitPoll
	if poll <= 0 {
		poll = deviceWaitPollDefault
	}

	deadline := time.Now().Add(timeout)

	for {
		_, err := r.cfg.Exec.Run(ctx, "blockdev", "--getsize64", device)
		if err == nil {
			return nil
		}

		if time.Now().After(deadline) {
			return errors.Errorf("block device %s did not appear within %s "+
				"(zfs create -> udev /dev/zvol symlink race)", device, timeout)
		}

		select {
		case <-ctx.Done():
			return errors.Wrap(ctx.Err(), "wait for block device")
		case <-time.After(poll):
		}
	}
}

…04 attach race)

The apiserver runs multi-replica with no leader election, so a
GET /v1/resource-definitions/{rd} can load-balance to a replica whose
informer cache has not yet observed a just-committed RD create. Get read
only the cached client and returned a spurious NotFound, surfaced as a
404 to linstor-csi — ControllerPublishVolume then failed and backed off,
slowing bulk attach of freshly-provisioned PVCs.

Wire the direct (uncached) API reader into resourceDefinitions and fall
back to a live read on a cache NotFound, mirroring the volumeDefinitions
cache-lag handling. nil apiReader (in-memory/unit stores) keeps the
cached NotFound.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@kvaps Andrei Kvapil (kvaps) marked this pull request as ready for review June 22, 2026 11:50
@kvaps Andrei Kvapil (kvaps) merged commit 0aaa840 into main Jun 22, 2026
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant