Skip to content

fix(satellite): fire late-add metadata/self-heal only for unattached desired volumes (BUG-048 deadlock)#168

Merged
Andrei Kvapil (kvaps) merged 3 commits into
mainfrom
fix/bug-048-resize-deadlock
Jun 16, 2026
Merged

fix(satellite): fire late-add metadata/self-heal only for unattached desired volumes (BUG-048 deadlock)#168
Andrei Kvapil (kvaps) merged 3 commits into
mainfrom
fix/bug-048-resize-deadlock

Conversation

@kvaps

@kvaps Andrei Kvapil (kvaps) commented Jun 16, 2026

Copy link
Copy Markdown
Member

Summary

PR #164 (9866305, BUG-048 concurrent late vd-add fix) widened the ensurePerVolumeMetadata trigger gate in pkg/satellite/reconciler.go:

dr.GetMetadataCreated() is true for the lifetime of every post-activation RD, so the per-volume metadata pass — and, alongside it on every diskful reconcile, the cluster-wide late-add self-heals — now fire on every reconcile of a converged/resizing diskful resource.

ensurePerVolumeMetadata runs a per-volume drbdadm dump-md probe, which is an md_buffer consumer. During a vd s resize DRBD holds md_buffer for the whole cluster-wide size change (change_cluster_wide_device_size / drbd_determine_dev_size). The per-reconcile dump-md then perpetually loses the cluster-wide state-change arbitration: the resize never completes, md_buffer is never released, and the resource deadlocks cluster-wide, reboot-proof (forensic repro: size-change retry count 10,790 vs the normal 2-20, on a zfs-thin 2-diskful + 1-diskless-client RD; clean at f3515045a). The cluster-wide self-heals (MintLateAddSource = disconnect + new-current-uuid --clear-bitmap + adjust; InvalidateVolume) compound the contention when the resized volume transiently dips below UpToDate.

CI was fully green on #164 (unit + integration + every E2E lane incl. the existing vd-resize cell) yet shipped this. Green CI proved nothing — so this PR's fix is inseparable from a regression test that fails on #164 and passes on the fix.

The fix

Narrow the gate to the race-free, deadlock-safe truth: fire the per-volume metadata pass only when some desired volume is NOT present-and-attached in the kernel (absent from the device list, or present but Diskless / no-metadata).

Rationale (DRBD invariant): a volume that is attached already has a valid DRBD-9 superblock — the kernel cannot attach a lower disk without one — so dump-md on it is pointless and is the md_buffer consumer that contends with the resize. The new predicate:

  • fires on a genuine late-add (the new volume is unattached / comes up Diskless) → BUG-048 fix(rest,store,satellite): converge concurrent late vd-adds (BUG-048) #164 preserved, and the existing TestApplyLateAddedVolumePerVolumeMetadataRunsDespitePreRenderedRes still passes;
  • skips a converged steady-state (all desired volumes attached) → no dump-md;
  • skips during a resize (all volumes attached, only the grown region resyncing) → no dump-md, no md_buffer contention → no deadlock.

Implementation:

  • pkg/drbd/drbdsetup_show.go — new Adm.AttachedVolumes: parses drbdsetup status <res> --json into the set of local volume numbers that are attached with metadata (disk-state UpToDate / Inconsistent / Consistent / Outdated). Conservative toward firing: empty set on any probe/parse failure or unloaded slot, so a cold start still runs the (idempotent) metadata pass — the pre-fix(rest,store,satellite): converge concurrent late vd-adds (BUG-048) #164 behaviour.
  • pkg/satellite/reconciler.gohasUnattachedDesiredVolume predicate backing the gate (shouldEnsurePerVolumeMetadata).
  • pkg/satellite/reconciler.godefense-in-depth: the BUG-048 late-add self-heals (maybeLateAddPromote / maybeKickLateAddResync) are gated out of any reconcile pass that just issued drbdadm resize. Their kernel-truth probes already veto a normal converged resize, but a cluster-wide size change can transiently dip a volume below UpToDate, and their recovery actions are themselves cluster-wide state-change / md_buffer operations that would reintroduce the deadlock.

The REST/idempotency layer (pkg/rest/volume_definitions.go) from #164 is not touched — that part is correct.

Regression test (fails on #164, passes on fix)

pkg/satellite/reconciler_bug048_resize_deadlock_test.go drives a full rec.Apply() and observes the mock-recorded command stream:

  • converged steady-state (all desired volumes attached UpToDate) → no drbdadm dump-md, no cluster-wide self-heal;
  • resize-in-flight (drbdadm resize issued, grown volume transiently Inconsistent) → same: no dump-md, no self-heal;
  • BUG-048 preserved: a desired volume NOT attached (genuine late-add) → the per-volume metadata pass DOES fire (dump-md + create-md for the missing volume).

Proven against the pre-fix dr.GetMetadataCreated() gate (temporarily restored locally):

  • pre-fix FAIL — drbdadm dump-md <rd>/0 and /1 fired on the converged reconcile; on the resize the self-heal disconnect <rd> + drbdsetup new-current-uuid --clear-bitmap 1001 fired mid-resize (MintLateAddSource minting a source on the transiently-Inconsistent volume during a resize — the catastrophic contention);
  • with the fix PASS — no dump-md, no self-heal in steady-state/resize; the late-add bookend fires the metadata pass for the unattached volume.

pkg/drbd/attached_volumes_test.go unit-pins the AttachedVolumes parse (only metadata-bearing disk-states count; Diskless / transient / absent excluded; empty on no kernel slot).

CLI-bug-fix protocol cells

The fix touches DRBD resize / satellite, so per the project protocol:

  • L6 tests/e2e/cli-matrix/bug-048-resize-deadlock-zfs-thin.sh — builds the forensic shape (zfs-thin, 2 diskful + 1 diskless-client), drives 3 sequential grows under reconcile pressure, asserts each grow completes (size lands + diskful replicas re-reach UpToDate within a bounded window) with no deadlock surface (StandAlone / suspended).
  • L7 tests/operator-harness/replay/bug048-resize.yaml — codifies the operator sequence (rd c → vd c 1G → 2 diskful + 1 diskless-client → grow 1G→2G→3G) with vd_size_kib + all_uptodate + replica_diskless convergence asserts.

Honest scope: the full kernel md_buffer deadlock needs scale/contention to escalate (forensic repro ~10k size-change retries), so the deterministic CI guard for this bug is the L1 unit regression; the L6/L7 cells are best-effort stand reproduction the orchestrator runs on a stand under load (a hang/timeout there is the deadlock).

Validation

  • go build ./... — clean.
  • golangci-lint run — 0 issues.
  • go test ./... — green (29 packages).
  • go test -race ./pkg/satellite/... ./pkg/drbd/... ./internal/controller/... ./pkg/rest/... ./pkg/store/... — no DATA RACE. (A pre-existing pkg/store/k8s optimistic-lock contention flake in patch_bug_201_test.go surfaces only under -race; that package is byte-identical to main and untouched by this PR.)
  • Integration (envtest 1.34.1) — go test -tags=integration ./tests/integration/... 127 PASS / 0 FAIL, incl. TestGroupKWFLateVD / TestGroupEVDCreateListDelete / TestGroupKWFHappyPath (BUG-048 stays fixed). The existing TestApplyLateAddedVolumePerVolumeMetadataRunsDespitePreRenderedRes and the maybe_late_add_promote / needs_late_add_promote tests stay green.

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Fixed a DRBD resize deadlock that could occur during volume expansion operations, improving system stability during cluster resizing.
    • Enhanced metadata handling and volume attachment detection for more reliable DRBD operations.
  • Tests

    • Added comprehensive test coverage for DRBD resize scenarios and metadata handling to prevent regression.

Andrei Kvapil (kvaps) and others added 3 commits June 16, 2026 09:46
…desired volumes (BUG-048 deadlock)

PR #164 widened the ensurePerVolumeMetadata trigger gate from the
.res-count hasLateAddedVolume() to the unconditional
dr.GetMetadataCreated(), so every post-activation diskful reconcile ran
a per-volume drbdadm dump-md (an md_buffer consumer) plus the cluster-
wide late-add self-heals. During a vd s resize DRBD holds md_buffer for
the whole cluster-wide size change (change_cluster_wide_device_size /
drbd_determine_dev_size); the per-reconcile md probe + self-heals then
perpetually lose the cluster-wide state-change arbitration, the resize
never completes, md_buffer is never released, and the resource deadlocks
cluster-wide, reboot-proof.

Narrow the gate to the race-free, deadlock-safe truth: fire the per-
volume metadata pass ONLY when some desired volume is NOT present-and-
attached in the kernel (absent, or present but Diskless / no-metadata).
An attached volume already has a valid DRBD-9 superblock (the kernel
cannot attach a lower disk without one), so dump-md on it is pointless
and is the md_buffer consumer that contends with the resize. This still
fires for a genuine late-add (the new volume is unattached) so BUG-048
#164 and the pre-rendered-.res regression are preserved, but skips a
converged steady-state and a resize-in-flight (all volumes attached).

Add Adm.AttachedVolumes (parses drbdsetup status --json into the set of
attached, metadata-bearing local volume numbers) as the kernel-truth
backing for the new hasUnattachedDesiredVolume predicate. Also gate the
late-add promote / resync-kick self-heals OUT of a reconcile that just
issued drbdadm resize (defense-in-depth: their recovery actions are
themselves cluster-wide state-change / md_buffer operations).

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
…dump-md/self-heal on attached volumes)

L1 regression for the BUG-048 resize deadlock. The satellite test drives
a full rec.Apply() and observes the mock-recorded command stream:

  - converged steady-state (all desired volumes attached UpToDate) →
    NO drbdadm dump-md, NO cluster-wide self-heal (disconnect /
    new-current-uuid --clear-bitmap / invalidate);
  - resize-in-flight (drbdadm resize issued, grown volume transiently
    Inconsistent) → same: no dump-md, no self-heal;
  - BUG-048 preserved: a desired volume NOT attached in the kernel
    (genuine late-add) → the per-volume metadata pass DOES fire
    (dump-md + create-md for the missing volume).

Proven to FAIL on the pre-fix dr.GetMetadataCreated() gate (dump-md
fired on the converged reconcile; MintLateAddSource disconnect +
new-current-uuid fired mid-resize) and PASS with the fix.

drbd AttachedVolumes unit pins the kernel-truth parse: only
metadata-bearing disk-states (UpToDate/Inconsistent/Consistent/Outdated)
count as attached; Diskless / transient / absent volumes do not; empty
on no kernel slot (fail-toward-firing).

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
… (zfs-thin 2-diskful+1-diskless)

CLI-bug-fix protocol cells for the BUG-048 resize deadlock (the fix
touches DRBD resize / satellite paths).

L6 cli-matrix bug-048-resize-deadlock-zfs-thin.sh: build the forensic
shape (zfs-thin, 2 diskful + 1 diskless-client), drive 3 sequential
grows under reconcile pressure, and assert each grow COMPLETES (size
lands + diskful replicas re-reach UpToDate within a bounded window) with
NO deadlock surface (StandAlone / suspended / a diskful volume stuck
below UpToDate). With the bug the first grow stalls in a size-change
retry storm and the assert times out.

L7 replay bug048-resize.yaml: codify the operator sequence
(rd c -> vd c 1G -> 2 diskful + 1 diskless-client -> grow 1G->2G->3G)
with vd_size_kib + all_uptodate + replica_diskless convergence asserts;
the grow timeouts are deadlock discriminators, not convergence targets.

Honest scope: the full kernel md_buffer deadlock needs scale/contention
to escalate, so the DETERMINISTIC CI guard for this bug is the L1 unit
regression; these cells are best-effort stand reproduction the
orchestrator runs on a stand under load.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown

Review Change Stack

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 77452d96-4827-498b-ad40-fbeb1c293464

📥 Commits

Reviewing files that changed from the base of the PR and between d364a44 and 77f374d.

📒 Files selected for processing (6)
  • pkg/drbd/attached_volumes_test.go
  • pkg/drbd/drbdsetup_show.go
  • pkg/satellite/reconciler.go
  • pkg/satellite/reconciler_bug048_resize_deadlock_test.go
  • tests/e2e/cli-matrix/bug-048-resize-deadlock-zfs-thin.sh
  • tests/operator-harness/replay/bug048-resize.yaml

📝 Walkthrough

Walkthrough

Fixes a DRBD md_buffer deadlock (BUG-048) triggered during volume resize. Adds Adm.AttachedVolumes to probe kernel disk-state via drbdsetup status --json. The reconciler's per-volume metadata gate now uses kernel-attached state instead of .res volume counts, and late-add self-heals are suppressed during resize passes. Adds unit, regression, e2e, and operator-harness replay tests.

Changes

BUG-048 DRBD Resize Deadlock Fix

Layer / File(s) Summary
AttachedVolumes implementation
pkg/drbd/drbdsetup_show.go
Adds Adm.AttachedVolumes which runs drbdsetup status --json and returns a set of volume numbers whose disk-state carries metadata (UpToDate, Inconsistent, Consistent, Outdated); attachedVolumeSet performs the disk-state filter.
AttachedVolumes unit tests
pkg/drbd/attached_volumes_test.go
Three tests via fake executor: only metadata-bearing disk states included, absent volumes excluded, empty kernel output returns nil.
Reconciler metadata gate and helpers
pkg/satellite/reconciler.go
Adds hasUnattachedDesiredVolume and shouldEnsurePerVolumeMetadata helpers; replaces the prior !diskless && MetadataCreated && !isDisklessToDiskfulFlip condition in applyDRBD with a call to shouldEnsurePerVolumeMetadata, gating metadata creation on kernel-attached state.
Reconciler resizing guard in self-heal pipeline
pkg/satellite/reconciler.go
Threads resizing boolean into finishDRBDApply and maybePromoteSelfHeals; wraps maybeLateAddPromote and maybeKickLateAddResync in if !resizing to suppress late-add self-heals during resize passes.
Reconciler BUG-048 regression tests
pkg/satellite/reconciler_bug048_resize_deadlock_test.go
Three regression tests (steady-state, resize-in-flight, true late-add) with fixture helpers and assertNoDumpMd / assertNoLateAddSelfHeal assertion helpers covering the full behavior matrix.
E2E and operator-harness replay tests
tests/e2e/cli-matrix/bug-048-resize-deadlock-zfs-thin.sh, tests/operator-harness/replay/bug048-resize.yaml
ZFS-thin e2e script performs three sequential grows on a live cluster with deadlock surface checks after each; YAML replay spec provisions replicas and runs two resize steps with all_uptodate and diskless-not-wedged assertions.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • cozystack/blockstor#120: Modifies the same finishDRBDApply/maybePromoteSelfHeals code path in pkg/satellite/reconciler.go to add the maybeSoloPromote logic, which is the same pipeline extended by this PR's resizing guard.

Poem

🐇 Hop hop, the deadlock is gone at last,
The kernel whispers which volumes are fast,
No metadata probes when resize is near,
Late-add self-heals wait till the coast is clear,
Three grows tested, no md_buffer in sight —
BUG-048 banished to the night! ✨

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/bug-048-resize-deadlock

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses a cluster-wide resize deadlock (BUG-048) by optimizing how per-volume metadata is reconciled. It introduces a gate that prevents unnecessary 'drbdadm dump-md' probes on already-attached volumes, which previously contended with in-flight resizes for the md_buffer. The implementation adds 'AttachedVolumes' to probe kernel state and 'shouldEnsurePerVolumeMetadata' to safely gate the metadata pass. The review identified potential nil pointer dereferences when accessing 'r.cfg.Adm' and suggested a minor CLI flag adjustment for consistency with existing code, all of which are addressed in the provided feedback.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +3362 to +3372
func (r *Reconciler) hasUnattachedDesiredVolume(ctx context.Context, dr *intent.DesiredResource) bool {
attached := r.cfg.Adm.AttachedVolumes(ctx, dr.GetName())

for _, vol := range dr.GetVolumes() {
if _, ok := attached[vol.GetVolumeNumber()]; !ok {
return true
}
}

return false
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The hasUnattachedDesiredVolume method accesses r.cfg.Adm directly. Since r.cfg.Adm can be nil in certain configurations (such as storage-only setups or specific unit tests), this will cause a nil pointer dereference panic. Adding a defensive nil guard prevents this.

Suggested change
func (r *Reconciler) hasUnattachedDesiredVolume(ctx context.Context, dr *intent.DesiredResource) bool {
attached := r.cfg.Adm.AttachedVolumes(ctx, dr.GetName())
for _, vol := range dr.GetVolumes() {
if _, ok := attached[vol.GetVolumeNumber()]; !ok {
return true
}
}
return false
}
func (r *Reconciler) hasUnattachedDesiredVolume(ctx context.Context, dr *intent.DesiredResource) bool {
if r.cfg.Adm == nil {
return true
}
attached := r.cfg.Adm.AttachedVolumes(ctx, dr.GetName())
for _, vol := range dr.GetVolumes() {
if _, ok := attached[vol.GetVolumeNumber()]; !ok {
return true
}
}
return false
}

Comment on lines +3382 to +3387
func (r *Reconciler) shouldEnsurePerVolumeMetadata(ctx context.Context, dr *intent.DesiredResource, diskless bool) bool {
return !diskless &&
dr.GetMetadataCreated() &&
r.hasUnattachedDesiredVolume(ctx, dr) &&
!r.isDisklessToDiskfulFlip(ctx, dr, diskless)
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The shouldEnsurePerVolumeMetadata method should guard against a nil r.cfg.Adm before calling hasUnattachedDesiredVolume to avoid unnecessary execution and potential panics.

Suggested change
func (r *Reconciler) shouldEnsurePerVolumeMetadata(ctx context.Context, dr *intent.DesiredResource, diskless bool) bool {
return !diskless &&
dr.GetMetadataCreated() &&
r.hasUnattachedDesiredVolume(ctx, dr) &&
!r.isDisklessToDiskfulFlip(ctx, dr, diskless)
}
func (r *Reconciler) shouldEnsurePerVolumeMetadata(ctx context.Context, dr *intent.DesiredResource, diskless bool) bool {
return !diskless &&
r.cfg.Adm != nil &&
dr.GetMetadataCreated() &&
r.hasUnattachedDesiredVolume(ctx, dr) &&
!r.isDisklessToDiskfulFlip(ctx, dr, diskless)
}

Comment on lines +354 to +358
func (a *Adm) AttachedVolumes(ctx context.Context, resource string) map[int32]struct{} {
out, err := a.exec.Run(ctx, "drbdsetup", "status", resource, "--json")
if err != nil {
return nil
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

For consistency with the Show method in the same file (line 221), consider using "-j" instead of "--json" and placing the option before the positional resource argument. Some older or stricter CLI parsers can be sensitive to option ordering (placing options after positional arguments).

Suggested change
func (a *Adm) AttachedVolumes(ctx context.Context, resource string) map[int32]struct{} {
out, err := a.exec.Run(ctx, "drbdsetup", "status", resource, "--json")
if err != nil {
return nil
}
func (a *Adm) AttachedVolumes(ctx context.Context, resource string) map[int32]struct{} {
out, err := a.exec.Run(ctx, "drbdsetup", "status", "-j", resource)
if err != nil {
return nil
}

@kvaps Andrei Kvapil (kvaps) marked this pull request as ready for review June 16, 2026 08:37
@kvaps Andrei Kvapil (kvaps) merged commit 825b6e5 into main Jun 16, 2026
26 of 27 checks passed
Andrei Kvapil (kvaps) added a commit that referenced this pull request Jun 17, 2026
BUG-048 concurrent late-vd fix (#164 converge) + resize-deadlock guard
(#168), release-gate GO + completed 24h ZFS-thick burn-in. Also documents
the #167 auto-release pipeline that now cuts the GitHub Release on tag.

Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant