fix(satellite): fire late-add metadata/self-heal only for unattached desired volumes (BUG-048 deadlock)#168
Conversation
…desired volumes (BUG-048 deadlock) PR #164 widened the ensurePerVolumeMetadata trigger gate from the .res-count hasLateAddedVolume() to the unconditional dr.GetMetadataCreated(), so every post-activation diskful reconcile ran a per-volume drbdadm dump-md (an md_buffer consumer) plus the cluster- wide late-add self-heals. During a vd s resize DRBD holds md_buffer for the whole cluster-wide size change (change_cluster_wide_device_size / drbd_determine_dev_size); the per-reconcile md probe + self-heals then perpetually lose the cluster-wide state-change arbitration, the resize never completes, md_buffer is never released, and the resource deadlocks cluster-wide, reboot-proof. Narrow the gate to the race-free, deadlock-safe truth: fire the per- volume metadata pass ONLY when some desired volume is NOT present-and- attached in the kernel (absent, or present but Diskless / no-metadata). An attached volume already has a valid DRBD-9 superblock (the kernel cannot attach a lower disk without one), so dump-md on it is pointless and is the md_buffer consumer that contends with the resize. This still fires for a genuine late-add (the new volume is unattached) so BUG-048 #164 and the pre-rendered-.res regression are preserved, but skips a converged steady-state and a resize-in-flight (all volumes attached). Add Adm.AttachedVolumes (parses drbdsetup status --json into the set of attached, metadata-bearing local volume numbers) as the kernel-truth backing for the new hasUnattachedDesiredVolume predicate. Also gate the late-add promote / resync-kick self-heals OUT of a reconcile that just issued drbdadm resize (defense-in-depth: their recovery actions are themselves cluster-wide state-change / md_buffer operations). Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
…dump-md/self-heal on attached volumes)
L1 regression for the BUG-048 resize deadlock. The satellite test drives
a full rec.Apply() and observes the mock-recorded command stream:
- converged steady-state (all desired volumes attached UpToDate) →
NO drbdadm dump-md, NO cluster-wide self-heal (disconnect /
new-current-uuid --clear-bitmap / invalidate);
- resize-in-flight (drbdadm resize issued, grown volume transiently
Inconsistent) → same: no dump-md, no self-heal;
- BUG-048 preserved: a desired volume NOT attached in the kernel
(genuine late-add) → the per-volume metadata pass DOES fire
(dump-md + create-md for the missing volume).
Proven to FAIL on the pre-fix dr.GetMetadataCreated() gate (dump-md
fired on the converged reconcile; MintLateAddSource disconnect +
new-current-uuid fired mid-resize) and PASS with the fix.
drbd AttachedVolumes unit pins the kernel-truth parse: only
metadata-bearing disk-states (UpToDate/Inconsistent/Consistent/Outdated)
count as attached; Diskless / transient / absent volumes do not; empty
on no kernel slot (fail-toward-firing).
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
… (zfs-thin 2-diskful+1-diskless) CLI-bug-fix protocol cells for the BUG-048 resize deadlock (the fix touches DRBD resize / satellite paths). L6 cli-matrix bug-048-resize-deadlock-zfs-thin.sh: build the forensic shape (zfs-thin, 2 diskful + 1 diskless-client), drive 3 sequential grows under reconcile pressure, and assert each grow COMPLETES (size lands + diskful replicas re-reach UpToDate within a bounded window) with NO deadlock surface (StandAlone / suspended / a diskful volume stuck below UpToDate). With the bug the first grow stalls in a size-change retry storm and the assert times out. L7 replay bug048-resize.yaml: codify the operator sequence (rd c -> vd c 1G -> 2 diskful + 1 diskless-client -> grow 1G->2G->3G) with vd_size_kib + all_uptodate + replica_diskless convergence asserts; the grow timeouts are deadlock discriminators, not convergence targets. Honest scope: the full kernel md_buffer deadlock needs scale/contention to escalate, so the DETERMINISTIC CI guard for this bug is the L1 unit regression; these cells are best-effort stand reproduction the orchestrator runs on a stand under load. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (6)
📝 WalkthroughWalkthroughFixes a DRBD md_buffer deadlock (BUG-048) triggered during volume resize. Adds ChangesBUG-048 DRBD Resize Deadlock Fix
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request addresses a cluster-wide resize deadlock (BUG-048) by optimizing how per-volume metadata is reconciled. It introduces a gate that prevents unnecessary 'drbdadm dump-md' probes on already-attached volumes, which previously contended with in-flight resizes for the md_buffer. The implementation adds 'AttachedVolumes' to probe kernel state and 'shouldEnsurePerVolumeMetadata' to safely gate the metadata pass. The review identified potential nil pointer dereferences when accessing 'r.cfg.Adm' and suggested a minor CLI flag adjustment for consistency with existing code, all of which are addressed in the provided feedback.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| func (r *Reconciler) hasUnattachedDesiredVolume(ctx context.Context, dr *intent.DesiredResource) bool { | ||
| attached := r.cfg.Adm.AttachedVolumes(ctx, dr.GetName()) | ||
|
|
||
| for _, vol := range dr.GetVolumes() { | ||
| if _, ok := attached[vol.GetVolumeNumber()]; !ok { | ||
| return true | ||
| } | ||
| } | ||
|
|
||
| return false | ||
| } |
There was a problem hiding this comment.
The hasUnattachedDesiredVolume method accesses r.cfg.Adm directly. Since r.cfg.Adm can be nil in certain configurations (such as storage-only setups or specific unit tests), this will cause a nil pointer dereference panic. Adding a defensive nil guard prevents this.
| func (r *Reconciler) hasUnattachedDesiredVolume(ctx context.Context, dr *intent.DesiredResource) bool { | |
| attached := r.cfg.Adm.AttachedVolumes(ctx, dr.GetName()) | |
| for _, vol := range dr.GetVolumes() { | |
| if _, ok := attached[vol.GetVolumeNumber()]; !ok { | |
| return true | |
| } | |
| } | |
| return false | |
| } | |
| func (r *Reconciler) hasUnattachedDesiredVolume(ctx context.Context, dr *intent.DesiredResource) bool { | |
| if r.cfg.Adm == nil { | |
| return true | |
| } | |
| attached := r.cfg.Adm.AttachedVolumes(ctx, dr.GetName()) | |
| for _, vol := range dr.GetVolumes() { | |
| if _, ok := attached[vol.GetVolumeNumber()]; !ok { | |
| return true | |
| } | |
| } | |
| return false | |
| } |
| func (r *Reconciler) shouldEnsurePerVolumeMetadata(ctx context.Context, dr *intent.DesiredResource, diskless bool) bool { | ||
| return !diskless && | ||
| dr.GetMetadataCreated() && | ||
| r.hasUnattachedDesiredVolume(ctx, dr) && | ||
| !r.isDisklessToDiskfulFlip(ctx, dr, diskless) | ||
| } |
There was a problem hiding this comment.
The shouldEnsurePerVolumeMetadata method should guard against a nil r.cfg.Adm before calling hasUnattachedDesiredVolume to avoid unnecessary execution and potential panics.
| func (r *Reconciler) shouldEnsurePerVolumeMetadata(ctx context.Context, dr *intent.DesiredResource, diskless bool) bool { | |
| return !diskless && | |
| dr.GetMetadataCreated() && | |
| r.hasUnattachedDesiredVolume(ctx, dr) && | |
| !r.isDisklessToDiskfulFlip(ctx, dr, diskless) | |
| } | |
| func (r *Reconciler) shouldEnsurePerVolumeMetadata(ctx context.Context, dr *intent.DesiredResource, diskless bool) bool { | |
| return !diskless && | |
| r.cfg.Adm != nil && | |
| dr.GetMetadataCreated() && | |
| r.hasUnattachedDesiredVolume(ctx, dr) && | |
| !r.isDisklessToDiskfulFlip(ctx, dr, diskless) | |
| } |
| func (a *Adm) AttachedVolumes(ctx context.Context, resource string) map[int32]struct{} { | ||
| out, err := a.exec.Run(ctx, "drbdsetup", "status", resource, "--json") | ||
| if err != nil { | ||
| return nil | ||
| } |
There was a problem hiding this comment.
For consistency with the Show method in the same file (line 221), consider using "-j" instead of "--json" and placing the option before the positional resource argument. Some older or stricter CLI parsers can be sensitive to option ordering (placing options after positional arguments).
| func (a *Adm) AttachedVolumes(ctx context.Context, resource string) map[int32]struct{} { | |
| out, err := a.exec.Run(ctx, "drbdsetup", "status", resource, "--json") | |
| if err != nil { | |
| return nil | |
| } | |
| func (a *Adm) AttachedVolumes(ctx context.Context, resource string) map[int32]struct{} { | |
| out, err := a.exec.Run(ctx, "drbdsetup", "status", "-j", resource) | |
| if err != nil { | |
| return nil | |
| } |
BUG-048 concurrent late-vd fix (#164 converge) + resize-deadlock guard (#168), release-gate GO + completed 24h ZFS-thick burn-in. Also documents the #167 auto-release pipeline that now cuts the GitHub Release on tag. Signed-off-by: Andrei Kvapil <kvapss@gmail.com> Co-authored-by: Claude <noreply@anthropic.com>
Summary
PR #164 (9866305, BUG-048 concurrent late vd-add fix) widened the
ensurePerVolumeMetadatatrigger gate inpkg/satellite/reconciler.go:f3515045a):if !diskless && hasLateAddedVolume(resPath, dr) && !flipif !diskless && dr.GetMetadataCreated() && !flipdr.GetMetadataCreated()is true for the lifetime of every post-activation RD, so the per-volume metadata pass — and, alongside it on every diskful reconcile, the cluster-wide late-add self-heals — now fire on every reconcile of a converged/resizing diskful resource.ensurePerVolumeMetadataruns a per-volumedrbdadm dump-mdprobe, which is an md_buffer consumer. During avd sresize DRBD holds md_buffer for the whole cluster-wide size change (change_cluster_wide_device_size/drbd_determine_dev_size). The per-reconciledump-mdthen perpetually loses the cluster-wide state-change arbitration: the resize never completes, md_buffer is never released, and the resource deadlocks cluster-wide, reboot-proof (forensic repro: size-change retry count 10,790 vs the normal 2-20, on a zfs-thin 2-diskful + 1-diskless-client RD; clean atf3515045a). The cluster-wide self-heals (MintLateAddSource= disconnect +new-current-uuid --clear-bitmap+ adjust;InvalidateVolume) compound the contention when the resized volume transiently dips below UpToDate.CI was fully green on #164 (unit + integration + every E2E lane incl. the existing vd-resize cell) yet shipped this. Green CI proved nothing — so this PR's fix is inseparable from a regression test that fails on #164 and passes on the fix.
The fix
Narrow the gate to the race-free, deadlock-safe truth: fire the per-volume metadata pass only when some desired volume is NOT present-and-attached in the kernel (absent from the device list, or present but Diskless / no-metadata).
Rationale (DRBD invariant): a volume that is attached already has a valid DRBD-9 superblock — the kernel cannot attach a lower disk without one — so
dump-mdon it is pointless and is the md_buffer consumer that contends with the resize. The new predicate:TestApplyLateAddedVolumePerVolumeMetadataRunsDespitePreRenderedResstill passes;dump-md;dump-md, no md_buffer contention → no deadlock.Implementation:
pkg/drbd/drbdsetup_show.go— newAdm.AttachedVolumes: parsesdrbdsetup status <res> --jsoninto the set of local volume numbers that are attached with metadata (disk-state UpToDate / Inconsistent / Consistent / Outdated). Conservative toward firing: empty set on any probe/parse failure or unloaded slot, so a cold start still runs the (idempotent) metadata pass — the pre-fix(rest,store,satellite): converge concurrent late vd-adds (BUG-048) #164 behaviour.pkg/satellite/reconciler.go—hasUnattachedDesiredVolumepredicate backing the gate (shouldEnsurePerVolumeMetadata).pkg/satellite/reconciler.go— defense-in-depth: the BUG-048 late-add self-heals (maybeLateAddPromote/maybeKickLateAddResync) are gated out of any reconcile pass that just issueddrbdadm resize. Their kernel-truth probes already veto a normal converged resize, but a cluster-wide size change can transiently dip a volume below UpToDate, and their recovery actions are themselves cluster-wide state-change / md_buffer operations that would reintroduce the deadlock.The REST/idempotency layer (
pkg/rest/volume_definitions.go) from #164 is not touched — that part is correct.Regression test (fails on #164, passes on fix)
pkg/satellite/reconciler_bug048_resize_deadlock_test.godrives a fullrec.Apply()and observes the mock-recorded command stream:drbdadm dump-md, no cluster-wide self-heal;drbdadm resizeissued, grown volume transiently Inconsistent) → same: nodump-md, no self-heal;dump-md+create-mdfor the missing volume).Proven against the pre-fix
dr.GetMetadataCreated()gate (temporarily restored locally):drbdadm dump-md <rd>/0and/1fired on the converged reconcile; on the resize the self-healdisconnect <rd>+drbdsetup new-current-uuid --clear-bitmap 1001fired mid-resize (MintLateAddSource minting a source on the transiently-Inconsistent volume during a resize — the catastrophic contention);dump-md, no self-heal in steady-state/resize; the late-add bookend fires the metadata pass for the unattached volume.pkg/drbd/attached_volumes_test.gounit-pins theAttachedVolumesparse (only metadata-bearing disk-states count; Diskless / transient / absent excluded; empty on no kernel slot).CLI-bug-fix protocol cells
The fix touches DRBD resize / satellite, so per the project protocol:
tests/e2e/cli-matrix/bug-048-resize-deadlock-zfs-thin.sh— builds the forensic shape (zfs-thin, 2 diskful + 1 diskless-client), drives 3 sequential grows under reconcile pressure, asserts each grow completes (size lands + diskful replicas re-reach UpToDate within a bounded window) with no deadlock surface (StandAlone / suspended).tests/operator-harness/replay/bug048-resize.yaml— codifies the operator sequence (rd c → vd c 1G → 2 diskful + 1 diskless-client → grow 1G→2G→3G) withvd_size_kib+all_uptodate+replica_disklessconvergence asserts.Honest scope: the full kernel md_buffer deadlock needs scale/contention to escalate (forensic repro ~10k size-change retries), so the deterministic CI guard for this bug is the L1 unit regression; the L6/L7 cells are best-effort stand reproduction the orchestrator runs on a stand under load (a hang/timeout there is the deadlock).
Validation
go build ./...— clean.golangci-lint run— 0 issues.go test ./...— green (29 packages).go test -race ./pkg/satellite/... ./pkg/drbd/... ./internal/controller/... ./pkg/rest/... ./pkg/store/...— no DATA RACE. (A pre-existingpkg/store/k8soptimistic-lock contention flake inpatch_bug_201_test.gosurfaces only under-race; that package is byte-identical to main and untouched by this PR.)go test -tags=integration ./tests/integration/...127 PASS / 0 FAIL, incl.TestGroupKWFLateVD/TestGroupEVDCreateListDelete/TestGroupKWFHappyPath(BUG-048 stays fixed). The existingTestApplyLateAddedVolumePerVolumeMetadataRunsDespitePreRenderedResand themaybe_late_add_promote/needs_late_add_promotetests stay green.Summary by CodeRabbit
Release Notes
Bug Fixes
Tests