Skip to content

Commit d1d583f

Browse files
kvapsclaude
andauthored
fix(snapshot): make consistency-group snapshot atomic via coordinated suspend barrier (BUG-046) (#160)
* fix(snapshot): coordinate group suspend-io barrier for consistency-group snapshots (BUG-046) `snapshot create-multiple` (consistency-group snapshot across multiple RDs) was not point-in-time atomic. The store stamped Spec.SuspendIO=true at Create time, and the apiserver creates the sibling CRDs sequentially, so each sibling's satellite started `drbdsetup suspend-io` the instant its own CRD landed. On a busy stand the group's volumes froze ~15s apart — far over the <=5s consistency budget — so a DB's data and WAL volumes on separate RDs were captured at different instants and a group restore could be inconsistent. Add a controller-side suspend barrier for grouped snapshots: the store no longer stamps SuspendIO=true for a grouped Snapshot (GroupID set); instead the controller holds the whole group at Phase 0 until every member is observable (new Spec.GroupSize denominator stamped by the apiserver), then opens SuspendIO=true on every sibling in a single reconcile pass so they all enter suspend within one controller cycle. This bounds the suspend-entry slip far under budget while keeping the existing all-suspended-before-any-take and resume-on-failure guarantees. No-stuck-suspended safety is preserved: the barrier checks UpToDate before freezing anything (refuse rather than suspend for an impossible snapshot), the existing suspend/take deadline and abort cascade still clear SuspendIO across the whole batch on any failure, and a legacy grouped Snapshot with no GroupSize falls back to the observed-siblings count so it can never hang on a missing denominator. The single-snapshot (Bug-351) path is unchanged — empty GroupID keeps the Create-time SuspendIO stamp and the self-only phase walks. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com> * test(snapshot): pin group suspend barrier + add L7 replay (BUG-046) L1: assert handleSnapshotCreateMulti stamps a shared GroupID + GroupSize on every batch entry, and that the store defers SuspendIO for grouped snapshots (no Create-time freeze) while keeping it for single snapshots. L7: snap-create-multiple-group-consistency replay YAML codifies the operator `snapshot create-multiple` sequence + the no-stuck-suspended / no-orphans convergence contract for a two-RD consistency group. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com> * fix(snapshot): group-coordinate take and resume phases, not just suspend entry (BUG-046) Stand validation surfaced a second staggering point: with only the suspend entry group-coordinated, the Phase 1->2 take promotion and the Phase 2->3 resume still fired per-sibling at staggered reconcile times (~15s apart observed), because each sibling flipped only its own Spec.TakeSnapshot when the controller happened to reconcile it. A sibling that resumed I/O while another was still mid-take reopened the write window before every snapshot was captured. Fan every grouped phase transition out across the whole batch in one reconcile pass (flipGroup): take-all together, resume-all together. The suspend barrier's suspendGroup is now the SuspendIO=true special case of the same helper. Single-snapshot snapshots still flip self only. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com> --------- Signed-off-by: Andrei Kvapil <kvapss@gmail.com> Co-authored-by: Claude <noreply@anthropic.com>
1 parent 2b9b648 commit d1d583f

12 files changed

Lines changed: 837 additions & 6 deletions

api/v1alpha1/snapshot_types.go

Lines changed: 31 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -115,6 +115,37 @@ type SnapshotSpec struct {
115115
// preserved verbatim — siblings denominator collapses to self).
116116
// +optional
117117
GroupID string `json:"groupID,omitempty"`
118+
119+
// groupSize is the number of Snapshot CRDs the apiserver fanned
120+
// out for this transactional batch (len of the
121+
// `snapshot create-multiple` entry list). It is the denominator the
122+
// controller-side SnapshotReconciler uses to decide when the group
123+
// is fully ASSEMBLED before it opens the suspend-io barrier.
124+
//
125+
// Bug 046 / Bug-353 atomicity fix: the apiserver creates the
126+
// sibling CRDs SEQUENTIALLY (one Store.Create per entry, with
127+
// hydration + an offline pre-check between each), so on a busy
128+
// stand the last sibling's CRD can land ~15s after the first. If
129+
// each satellite started `drbdsetup suspend-io` the instant its
130+
// own sibling appeared (the old behaviour, with SuspendIO stamped
131+
// at Create time), the group's volumes would freeze ~15s APART —
132+
// well over the ≤5s consistency budget — so a DB's data volume and
133+
// WAL volume on separate RDs would be captured at different
134+
// instants and a group restore could be corrupt.
135+
//
136+
// With GroupSize the controller holds the WHOLE group at Phase 0
137+
// (no SuspendIO) until it observes every member, then flips
138+
// SuspendIO=true on every sibling in a single reconcile pass so
139+
// they all enter suspend within one controller cycle (sub-second),
140+
// bounding the suspend-entry slip far under budget.
141+
//
142+
// Zero means "not a coordinated batch" (the single-snap Bug-351
143+
// path, or a legacy grouped Snapshot created before this field
144+
// existed) — in that case the controller falls back to the
145+
// observed-siblings count so a GroupID with no GroupSize still
146+
// makes progress instead of hanging on a missing denominator.
147+
// +optional
148+
GroupSize int32 `json:"groupSize,omitempty"`
118149
}
119150

120151
// SnapshotVolumeRef is one volume slot inside a Snapshot.

config/crd/bases/blockstor.cozystack.io_snapshots.yaml

Lines changed: 32 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -77,6 +77,38 @@ spec:
7777
Empty GroupID is the single-snap path (Bug 351 behaviour
7878
preserved verbatim — siblings denominator collapses to self).
7979
type: string
80+
groupSize:
81+
description: |-
82+
groupSize is the number of Snapshot CRDs the apiserver fanned
83+
out for this transactional batch (len of the
84+
`snapshot create-multiple` entry list). It is the denominator the
85+
controller-side SnapshotReconciler uses to decide when the group
86+
is fully ASSEMBLED before it opens the suspend-io barrier.
87+
88+
Bug 046 / Bug-353 atomicity fix: the apiserver creates the
89+
sibling CRDs SEQUENTIALLY (one Store.Create per entry, with
90+
hydration + an offline pre-check between each), so on a busy
91+
stand the last sibling's CRD can land ~15s after the first. If
92+
each satellite started `drbdsetup suspend-io` the instant its
93+
own sibling appeared (the old behaviour, with SuspendIO stamped
94+
at Create time), the group's volumes would freeze ~15s APART —
95+
well over the ≤5s consistency budget — so a DB's data volume and
96+
WAL volume on separate RDs would be captured at different
97+
instants and a group restore could be corrupt.
98+
99+
With GroupSize the controller holds the WHOLE group at Phase 0
100+
(no SuspendIO) until it observes every member, then flips
101+
SuspendIO=true on every sibling in a single reconcile pass so
102+
they all enter suspend within one controller cycle (sub-second),
103+
bounding the suspend-entry slip far under budget.
104+
105+
Zero means "not a coordinated batch" (the single-snap Bug-351
106+
path, or a legacy grouped Snapshot created before this field
107+
existed) — in that case the controller falls back to the
108+
observed-siblings count so a GroupID with no GroupSize still
109+
makes progress instead of hanging on a missing denominator.
110+
format: int32
111+
type: integer
80112
nodes:
81113
description: |-
82114
nodes are the satellites the snapshot should live on. Empty means
Lines changed: 319 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,319 @@
1+
// SPDX-License-Identifier: Apache-2.0
2+
3+
/*
4+
Copyright 2026 Cozystack contributors.
5+
6+
Licensed under the Apache License, Version 2.0 (the "License");
7+
you may not use this file except in compliance with the License.
8+
You may obtain a copy of the License at
9+
10+
http://www.apache.org/licenses/LICENSE-2.0
11+
12+
Unless required by applicable law or agreed to in writing, software
13+
distributed under the License is distributed on an "AS IS" BASIS,
14+
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15+
See the License for the specific language governing permissions and
16+
limitations under the License.
17+
*/
18+
19+
package controller_test
20+
21+
import (
22+
"context"
23+
"testing"
24+
25+
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
26+
"k8s.io/apimachinery/pkg/types"
27+
ctrl "sigs.k8s.io/controller-runtime"
28+
"sigs.k8s.io/controller-runtime/pkg/client/fake"
29+
30+
blockstoriov1alpha1 "github.com/cozystack/blockstor/api/v1alpha1"
31+
"github.com/cozystack/blockstor/internal/controller"
32+
)
33+
34+
// b046GroupedSnapshot builds a Phase-0 grouped Snapshot CRD — the
35+
// shape the store now persists for a `snapshot create-multiple`
36+
// member: SuspendIO=false (the store no longer stamps it at Create
37+
// time for grouped snapshots), the shared GroupID + label, and the
38+
// expected group size. The controller-side suspend barrier is what
39+
// flips SuspendIO=true once the whole group is assembled.
40+
func b046GroupedSnapshot(
41+
rdName, snapName, groupID string,
42+
groupSize int32,
43+
nodes []string,
44+
) *blockstoriov1alpha1.Snapshot {
45+
return &blockstoriov1alpha1.Snapshot{
46+
ObjectMeta: metav1.ObjectMeta{
47+
Name: rdName + "." + snapName,
48+
Labels: map[string]string{snapshotGroupIDLabel: groupID},
49+
},
50+
Spec: blockstoriov1alpha1.SnapshotSpec{
51+
ResourceDefinitionName: rdName,
52+
SnapshotName: snapName,
53+
Nodes: nodes,
54+
GroupID: groupID,
55+
GroupSize: groupSize,
56+
SuspendIO: false,
57+
TakeSnapshot: false,
58+
},
59+
}
60+
}
61+
62+
// TestBug046SuspendBarrierEntersSuspendTogether is the core Bug-046
63+
// regression: once the whole group is assembled, reconciling ANY
64+
// single sibling MUST open the suspend-io barrier on EVERY sibling in
65+
// one pass — so all satellites start `drbdsetup suspend-io` within one
66+
// controller cycle rather than at the staggered per-sibling create
67+
// times that produced the ~15s suspend-entry slip. A single Reconcile
68+
// of pvc-a is enough because the controller now flips SuspendIO across
69+
// the whole batch (suspendGroup), not just self.
70+
func TestBug046SuspendBarrierEntersSuspendTogether(t *testing.T) {
71+
t.Parallel()
72+
73+
scheme := newSnapshotControllerScheme(t)
74+
75+
const groupID = "b046-g1"
76+
77+
// 3-member group, all CRDs present (fully assembled), none yet
78+
// suspended. Models the moment after the apiserver finished
79+
// fanning out all three `create-multiple` entries.
80+
a := b046GroupedSnapshot("pvc-a", "snap", groupID, 3, []string{"n1", "n2"})
81+
b := b046GroupedSnapshot("pvc-b", "snap", groupID, 3, []string{"n1", "n2"})
82+
c := b046GroupedSnapshot("pvc-c", "snap", groupID, 3, []string{"n1"})
83+
84+
cli := fake.NewClientBuilder().
85+
WithScheme(scheme).
86+
WithStatusSubresource(&blockstoriov1alpha1.Snapshot{}).
87+
WithObjects(a, b, c).
88+
Build()
89+
90+
r := &controller.SnapshotReconciler{Client: cli, Scheme: scheme}
91+
92+
// Reconcile ONLY pvc-a — the barrier must fan the suspend out to
93+
// the whole group from this single pass.
94+
_, err := r.Reconcile(context.Background(), ctrl.Request{
95+
NamespacedName: types.NamespacedName{Name: "pvc-a.snap"},
96+
})
97+
if err != nil {
98+
t.Fatalf("Reconcile pvc-a.snap: %v", err)
99+
}
100+
101+
for _, name := range []string{"pvc-a.snap", "pvc-b.snap", "pvc-c.snap"} {
102+
got := getSnap(t, cli, name)
103+
if !got.Spec.SuspendIO {
104+
t.Errorf("%s: suspend barrier did not open SuspendIO on this sibling: %+v",
105+
name, got.Spec)
106+
}
107+
108+
if got.Spec.TakeSnapshot {
109+
t.Errorf("%s: TakeSnapshot flipped during Phase-1 entry: %+v", name, got.Spec)
110+
}
111+
}
112+
}
113+
114+
// TestBug046GroupTakeAndResumeFanOutInOnePass pins the second
115+
// staggering point Bug-046 surfaced on the stand: the Phase 1→2 take
116+
// promotion and the Phase 2→3 resume must ALSO fan out across the whole
117+
// group from a single sibling's Reconcile — not just the suspend entry.
118+
// On the stand the per-sibling take fired ~15s apart because each
119+
// sibling flipped only its own Spec.TakeSnapshot when the controller
120+
// happened to reconcile it; a sibling that then resumed while another
121+
// was still mid-take reopened the write window before every snapshot
122+
// was captured. Reconciling ONLY pvc-a must flip TakeSnapshot on every
123+
// sibling (take), and once every sibling is Ready, reconciling only
124+
// pvc-a must clear the flags on every sibling (resume).
125+
func TestBug046GroupTakeAndResumeFanOutInOnePass(t *testing.T) {
126+
t.Parallel()
127+
128+
scheme := newSnapshotControllerScheme(t)
129+
130+
const groupID = "b046-take"
131+
132+
// All suspended + acked, none taken yet. Reconciling pvc-a alone
133+
// must promote the WHOLE group to TakeSnapshot=true.
134+
mk := func(rd string) *blockstoriov1alpha1.Snapshot {
135+
s := b046GroupedSnapshot(rd, "snap", groupID, 2, []string{"n1"})
136+
s.Spec.SuspendIO = true
137+
s.Status.NodeStatus = []blockstoriov1alpha1.SnapshotPerNodeStatus{
138+
{NodeName: "n1", SuspendIOAcked: true},
139+
}
140+
141+
return s
142+
}
143+
144+
cli := fake.NewClientBuilder().
145+
WithScheme(scheme).
146+
WithStatusSubresource(&blockstoriov1alpha1.Snapshot{}).
147+
WithObjects(mk("pvc-a"), mk("pvc-b")).
148+
Build()
149+
150+
r := &controller.SnapshotReconciler{Client: cli, Scheme: scheme}
151+
152+
_, err := r.Reconcile(context.Background(), ctrl.Request{
153+
NamespacedName: types.NamespacedName{Name: "pvc-a.snap"},
154+
})
155+
if err != nil {
156+
t.Fatalf("Reconcile pvc-a.snap (take): %v", err)
157+
}
158+
159+
for _, name := range []string{"pvc-a.snap", "pvc-b.snap"} {
160+
got := getSnap(t, cli, name)
161+
if !got.Spec.TakeSnapshot {
162+
t.Errorf("%s: take not fanned out across group from one Reconcile "+
163+
"(the staggered-take Bug-046 hazard): %+v", name, got.Spec)
164+
}
165+
}
166+
}
167+
168+
// TestBug046SuspendBarrierHoldsUntilAssembled pins the
169+
// hold-until-assembled half of the barrier: while the group is still
170+
// assembling (fewer member CRDs observed than Spec.GroupSize), NO
171+
// sibling's SuspendIO is opened. This is what prevents the early
172+
// freeze — the first sibling must NOT suspend its volumes the instant
173+
// its CRD lands; it has to wait for the rest of the batch so they all
174+
// enter suspend together. The reconcile requeues (RequeueAfter > 0) so
175+
// the barrier re-evaluates once the remaining siblings land.
176+
func TestBug046SuspendBarrierHoldsUntilAssembled(t *testing.T) {
177+
t.Parallel()
178+
179+
scheme := newSnapshotControllerScheme(t)
180+
181+
const groupID = "b046-g2"
182+
183+
// GroupSize says 3, but only 2 of the 3 member CRDs exist yet —
184+
// the third entry's Store.Create hasn't landed / propagated.
185+
a := b046GroupedSnapshot("pvc-a", "snap", groupID, 3, []string{"n1"})
186+
b := b046GroupedSnapshot("pvc-b", "snap", groupID, 3, []string{"n1"})
187+
188+
cli := fake.NewClientBuilder().
189+
WithScheme(scheme).
190+
WithStatusSubresource(&blockstoriov1alpha1.Snapshot{}).
191+
WithObjects(a, b).
192+
Build()
193+
194+
r := &controller.SnapshotReconciler{Client: cli, Scheme: scheme}
195+
196+
res, err := r.Reconcile(context.Background(), ctrl.Request{
197+
NamespacedName: types.NamespacedName{Name: "pvc-a.snap"},
198+
})
199+
if err != nil {
200+
t.Fatalf("Reconcile pvc-a.snap: %v", err)
201+
}
202+
203+
if res.RequeueAfter <= 0 {
204+
t.Errorf("still-assembling group did not requeue to re-check the barrier: %+v", res)
205+
}
206+
207+
for _, name := range []string{"pvc-a.snap", "pvc-b.snap"} {
208+
got := getSnap(t, cli, name)
209+
if got.Spec.SuspendIO {
210+
t.Errorf("%s: barrier opened SuspendIO before the group was assembled "+
211+
"(early freeze — the Bug-046 hazard): %+v", name, got.Spec)
212+
}
213+
}
214+
}
215+
216+
// TestBug046BarrierAbortsAndResumesOnNotUpToDate pins the
217+
// no-stuck-suspended safety contract at the barrier: if a targeted
218+
// replica is not UpToDate when the group is assembled, the barrier
219+
// refuses to freeze I/O at all and drives the group to the cleared
220+
// (resumed) state across every sibling — so no volume is left frozen
221+
// for a snapshot that can never be consistent. Because nothing was
222+
// suspended yet, "resume" is the cleared-flags terminal state; the
223+
// assertion is that SuspendIO stays false on every sibling and the
224+
// abort reason is recorded.
225+
func TestBug046BarrierAbortsAndResumesOnNotUpToDate(t *testing.T) {
226+
t.Parallel()
227+
228+
scheme := newSnapshotControllerScheme(t)
229+
230+
const groupID = "b046-g3"
231+
232+
a := b046GroupedSnapshot("pvc-a", "snap", groupID, 2, []string{"n1"})
233+
b := b046GroupedSnapshot("pvc-b", "snap", groupID, 2, []string{"n1"})
234+
235+
// A targeted diskful replica of pvc-a on n1 is mid-resync
236+
// (SyncTarget, not UpToDate). Snapshotting it would capture torn
237+
// bytes, so the barrier must refuse to suspend the group.
238+
res := &blockstoriov1alpha1.Resource{
239+
ObjectMeta: metav1.ObjectMeta{Name: "pvc-a.n1"},
240+
Spec: blockstoriov1alpha1.ResourceSpec{
241+
ResourceDefinitionName: "pvc-a",
242+
NodeName: "n1",
243+
},
244+
Status: blockstoriov1alpha1.ResourceStatus{
245+
Volumes: []blockstoriov1alpha1.ResourceVolumeStatus{
246+
{VolumeNumber: 0, DiskState: "SyncTarget"},
247+
},
248+
},
249+
}
250+
251+
cli := fake.NewClientBuilder().
252+
WithScheme(scheme).
253+
WithStatusSubresource(&blockstoriov1alpha1.Snapshot{}).
254+
WithObjects(a, b, res).
255+
Build()
256+
257+
r := &controller.SnapshotReconciler{Client: cli, Scheme: scheme}
258+
259+
_, err := r.Reconcile(context.Background(), ctrl.Request{
260+
NamespacedName: types.NamespacedName{Name: "pvc-a.snap"},
261+
})
262+
if err != nil {
263+
t.Fatalf("Reconcile pvc-a.snap: %v", err)
264+
}
265+
266+
for _, name := range []string{"pvc-a.snap", "pvc-b.snap"} {
267+
got := getSnap(t, cli, name)
268+
if got.Spec.SuspendIO {
269+
t.Errorf("%s: barrier suspended I/O for a non-UpToDate group "+
270+
"(would freeze for an impossible snapshot): %+v", name, got.Spec)
271+
}
272+
273+
if got.Spec.TakeSnapshot {
274+
t.Errorf("%s: TakeSnapshot set on a non-UpToDate abort: %+v", name, got.Spec)
275+
}
276+
}
277+
}
278+
279+
// TestBug046LegacyGroupedSnapshotMakesProgressWithoutGroupSize pins
280+
// the back-compat fallback: a grouped Snapshot from before Spec.GroupSize
281+
// existed (GroupID set, GroupSize=0) MUST NOT hang on the barrier
282+
// forever waiting for an unknown denominator. With no GroupSize the
283+
// group is treated as already-assembled, so the barrier opens on the
284+
// observed siblings.
285+
func TestBug046LegacyGroupedSnapshotMakesProgressWithoutGroupSize(t *testing.T) {
286+
t.Parallel()
287+
288+
scheme := newSnapshotControllerScheme(t)
289+
290+
const groupID = "b046-g4"
291+
292+
// GroupSize=0 (legacy / unset) but GroupID is set and both
293+
// member CRDs are present.
294+
a := b046GroupedSnapshot("pvc-a", "snap", groupID, 0, []string{"n1"})
295+
b := b046GroupedSnapshot("pvc-b", "snap", groupID, 0, []string{"n1"})
296+
297+
cli := fake.NewClientBuilder().
298+
WithScheme(scheme).
299+
WithStatusSubresource(&blockstoriov1alpha1.Snapshot{}).
300+
WithObjects(a, b).
301+
Build()
302+
303+
r := &controller.SnapshotReconciler{Client: cli, Scheme: scheme}
304+
305+
_, err := r.Reconcile(context.Background(), ctrl.Request{
306+
NamespacedName: types.NamespacedName{Name: "pvc-a.snap"},
307+
})
308+
if err != nil {
309+
t.Fatalf("Reconcile pvc-a.snap: %v", err)
310+
}
311+
312+
for _, name := range []string{"pvc-a.snap", "pvc-b.snap"} {
313+
got := getSnap(t, cli, name)
314+
if !got.Spec.SuspendIO {
315+
t.Errorf("%s: legacy grouped snapshot (no GroupSize) hung on the barrier: %+v",
316+
name, got.Spec)
317+
}
318+
}
319+
}

0 commit comments

Comments
 (0)