Commit 9c84104

kvaps and claude authored
fix(e2e): heal provocation-race wedges in recovery-family scenarios (#148)
* fix(e2e): clear tamper-window wedge in recovery-node-id-mismatch

  The provocation step (drbdadm down + sed + up) deliberately races the satellite's Bug-287 revive. When the two drbdadm invocations interleave badly, drbdmeta apply-al hits EBUSY on the backing device and worker-2's kernel slot ends half-configured: disk attached Inconsistent with AL suspended, peers registered but never connected (StandAlone). The satellite then classifies the slot as an operator disconnect (StandAlone + peer-device entries, the W12 --skip-net guard) and never reconnects it, so the post-recovery UpToDate wait times out. Seen as a ~30% lane-1 flake; not related to the .res node-id mismatch under test.

  Bounce worker-2 with a bare drbdadm down after the provocation and require kernel-truth UpToDate before applying the SKILL recipe, so the recovery assertions start from a clean slot. Validated 12/12 green on the dev stand (previously ~50% fail).

  Co-Authored-By: Claude <noreply@anthropic.com>
  Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

* fix(e2e): heal tamper-window wedge in recovery-down-reverses

  The scenario's bare drbdadm down provocation can rarely collide with the satellite's immediate revive: drbdmeta apply-al hits EBUSY on the backing device and the revived slot ends half-configured: disk Inconsistent, connections StandAlone with peer-device entries retained. That matches the operator-disconnect signature, so the satellite never reconnects the slot and the final convergence wait times out on a provocation artefact, not on the revive path under test. Seen twice on CI lanes with the identical signature.

  Unlike recovery-node-id-mismatch, the provocation here is already a single writer, so an unconditional bounce would not reduce the collision odds and would dilute the convergence assertion. Instead the heal is conditional: the convergence wait keeps its full untouched budget, and only a timeout that shows the exact wedge signature triggers one clean bounce plus a kernel-truth UpToDate wait before a single re-run of the wait. A regression of the narrowed skip-net gate does not match the signature and still fails loudly.

  Co-Authored-By: Claude <noreply@anthropic.com>
  Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

---------

Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
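
The conditional heal described in the second commit can be sketched as a small control-flow skeleton. This is an illustration only, not the scenario's code: `conditional_heal`, `wait_converged`, `has_wedge_signature` and `bounce_and_wait_uptodate` are hypothetical names standing in for steps the real script inlines, and the demo stubs simulate one timeout followed by a successful heal.

```shell
set -euo pipefail

# Hypothetical skeleton: the three callbacks are stand-ins for the
# convergence wait, the wedge-signature check and the clean bounce.
conditional_heal() {
  local attempt
  bounced=0
  for attempt in 1 2; do
    # The wait always gets its full budget before any healing is considered.
    if wait_converged; then
      return 0
    fi
    # A second timeout, or a timeout without the wedge signature, is a failure.
    if (( attempt == 2 )) || ! has_wedge_signature; then
      return 1
    fi
    bounced=1
    bounce_and_wait_uptodate || return 1
  done
  return 1
}

# Demo stubs (assumptions): the first wait times out, the wedge signature
# is present, the bounce succeeds and the second wait converges.
waits=0
wait_converged() { waits=$(( waits + 1 )); (( waits >= 2 )); }
has_wedge_signature() { return 0; }
bounce_and_wait_uptodate() { return 0; }

if conditional_heal; then
  echo "converged after ${waits} wait(s), bounced=${bounced}"
else
  echo "FAIL"
fi
```

Because the signature check sits between the first timeout and the bounce, a timeout caused by anything other than the wedge never reaches the bounce and fails after a single wait.
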
1 parent c54be3b commit 9c84104

2 files changed

Lines changed: 136 additions & 13 deletions

tests/e2e/recovery-down-reverses.sh

Lines changed: 89 additions & 13 deletions
@@ -60,12 +60,42 @@
 # adjust, which re-issues `connect`. Step 5 below pins that
 # convergence with a 60s budget.
 #
+# Tamper-window wedge (PR #148 lane 4, run 27410144876; PR #131
+# earlier): the same apply-al EBUSY artefact recovery-node-id-mismatch
+# clears after its provocation can — rarely — hit this scenario's
+# bare `drbdadm down` too. The satellite's revive fires off the
+# `destroy resource` event immediately, and when its bring-up
+# interleaves with the tail of our still-running `drbdadm down`
+# (or with the satellite's second internal caller), `drbdmeta
+# apply-al` fails with "Device or resource busy" (exit 20) and the
+# revived slot ends HALF-CONFIGURED: disk Inconsistent, both
+# connections StandAlone WITH peer-device entries registered. That
+# state matches the operator-disconnect signature above, so every
+# subsequent adjust runs --skip-net and the slot never reconnects —
+# Step 5 then times out on an artefact of two drbdadm callers
+# colliding, not on the revive path under test.
+#
+# Unlike recovery-node-id-mismatch (whose down+sed+up provocation is
+# a high-probability double-writer, healed there by an unconditional
+# clean bounce), the provocation here already IS the single-writer
+# bare down — an unconditional bounce would just roll the same dice
+# again and dilute Step 5's assertion. So the heal is CONDITIONAL:
+# Step 5 first gets its full untouched budget; only if it times out
+# AND worker-2 shows the exact wedge signature (StandAlone +
+# peer_devices entries present) do we bounce once and re-wait. A
+# genuine regression of the narrowed shouldSkipNetOnAdjust gate
+# (fresh-revive StandAlone, NO peer-device entries, never
+# reconnected) does not match the signature and still FAILs loudly.
+#
 # Steps
 # 1. Apply 2-replica RD on $N1+$N2, wait UpToDate.
 # 2. Pick Secondary ($N2) — `drbdadm down $RD` from its satellite pod.
 # 3. Confirm kernel is empty for $RD on $N2 (`drbdsetup status`).
 # 4. Poll up to 30s for kernel to reappear on $N2.
 # 5. Assert peer state returns to Connected + UpToDate within 60s.
+#    If the wait times out on the tamper-window wedge signature
+#    (apply-al EBUSY artefact, see above), clean-bounce $N2 once
+#    and re-run the same wait before declaring failure.
 # 6. Cleanup via delete_rd EXIT trap.
 
 set -euo pipefail
@@ -160,22 +190,64 @@ echo " kernel resource reappeared after ${revived_at}s"
 # its comment for why a single-int sentinel collides with the
 # legitimate "converged in zero seconds" case.
 echo ">> wait <=${UPTODATE_DEADLINE_SECS}s for ${RD} to reach Connected+UpToDate on both peers"
-deadline=$(( $(date +%s) + UPTODATE_DEADLINE_SECS ))
 connected=0
 connected_at=0
-while (( $(date +%s) < deadline )); do
-  n1_conn=$(status_connection_state "$RD" "$N1" "$N2")
-  n2_conn=$(status_connection_state "$RD" "$N2" "$N1")
-  n1_local_disk=$(status_disk_state "$RD" "$N1")
-  n2_local_disk=$(status_disk_state "$RD" "$N2")
-  if [[ ( "$n1_conn" == "Connected" || "$n1_conn" == "Established" ) \
-     && ( "$n2_conn" == "Connected" || "$n2_conn" == "Established" ) \
-     && "$n1_local_disk" == "UpToDate" && "$n2_local_disk" == "UpToDate" ]]; then
-    connected=1
-    connected_at=$(( $(date +%s) - t_down ))
+bounced=0
+for attempt in 1 2; do
+  deadline=$(( $(date +%s) + UPTODATE_DEADLINE_SECS ))
+  while (( $(date +%s) < deadline )); do
+    n1_conn=$(status_connection_state "$RD" "$N1" "$N2")
+    n2_conn=$(status_connection_state "$RD" "$N2" "$N1")
+    n1_local_disk=$(status_disk_state "$RD" "$N1")
+    n2_local_disk=$(status_disk_state "$RD" "$N2")
+    if [[ ( "$n1_conn" == "Connected" || "$n1_conn" == "Established" ) \
+       && ( "$n2_conn" == "Connected" || "$n2_conn" == "Established" ) \
+       && "$n1_local_disk" == "UpToDate" && "$n2_local_disk" == "UpToDate" ]]; then
+      connected=1
+      connected_at=$(( $(date +%s) - t_down ))
+      break
+    fi
+    sleep 2
+  done
+  if (( connected == 1 || attempt == 2 )); then
     break
   fi
-  sleep 2
+
+  # First wait timed out. Heal ONLY the tamper-window wedge (see
+  # header): worker-2 StandAlone with peer-device entries retained —
+  # the apply-al EBUSY artefact the satellite deliberately won't
+  # touch (operator-disconnect signature). Anything else falls
+  # through to the FAIL dump below untouched.
+  wedged=$(on_node "$N2" drbdsetup status --json "$RD" 2>/dev/null | jq -r '
+    [.[0].connections[]?
+     | select(."connection-state" == "StandAlone"
+              and ((.peer_devices // []) | length > 0))]
+    | length' 2>/dev/null || true)
+  wedged=${wedged:-0}
+  if [[ ! "$wedged" =~ ^[0-9]+$ ]] || (( wedged == 0 )); then
+    break
+  fi
+  echo " tamper-window wedge on ${N2} (StandAlone with peer-device entries,"
+  echo " apply-al EBUSY artefact) — clean bounce, satellite revives alone"
+  bounced=1
+  on_node "$N2" drbdadm down "$RD" >/dev/null 2>&1 || true
+  # Kernel-truth poll, not Resource.Status: right after the down the
+  # observer hasn't stamped the destroy yet, so Status.diskState can
+  # serve a stale UpToDate. `^[[:space:]]+disk:` matches only the
+  # local disk line (peer lines carry `peer-disk:`).
+  bounce_deadline=$(( $(date +%s) + 120 ))
+  n2_disk=""
+  while (( $(date +%s) < bounce_deadline )); do
+    n2_disk=$(on_node "$N2" drbdsetup status "$RD" 2>/dev/null \
+      | grep -m1 -E '^[[:space:]]+disk:' | cut -d: -f2 | awk '{print $1}' || true)
+    if [[ "$n2_disk" == "UpToDate" ]]; then break; fi
+    sleep 2
+  done
+  if [[ "$n2_disk" != "UpToDate" ]]; then
+    echo " bounce did not bring ${N2} back UpToDate (disk=${n2_disk})"
+    break
+  fi
+  echo " ${N2} back UpToDate after bounce — re-running convergence wait"
 done
 
 if (( connected == 0 )); then
@@ -205,4 +277,8 @@ if (( connected == 0 )); then
   exit 1
 fi
 
-echo ">> PASS 5.32 — drbdadm down auto-reverted in ${revived_at}s; UpToDate restored in ${connected_at}s"
+suffix=""
+if (( bounced == 1 )); then
+  suffix=" (after tamper-window bounce)"
+fi
+echo ">> PASS 5.32 — drbdadm down auto-reverted in ${revived_at}s; UpToDate restored in ${connected_at}s${suffix}"
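
The kernel-truth poll in this file relies on a `grep | cut | awk` pipeline to pull the local disk state out of the `drbdsetup status` text output. The sketch below runs that same pipeline against a hand-written sample so the behaviour can be checked without a DRBD node. The function name `local_disk_state` and the sample text are assumptions made for illustration, not captured output.

```shell
set -euo pipefail

# Hypothetical wrapper around the pipeline used in the diff. Reads
# status text on stdin and prints the local disk state, or nothing
# when no local disk line is present.
local_disk_state() {
  grep -m1 -E '^[[:space:]]+disk:' | cut -d: -f2 | awk '{print $1}' || true
}

# Illustrative sample only: one local disk line and one peer line.
sample='r0 role:Secondary
  disk:Inconsistent
  worker-1 role:Primary
    peer-disk:UpToDate'

printf '%s\n' "$sample" | local_disk_state
```

The anchored pattern requires `disk:` to follow the leading whitespace directly, so the `peer-disk:` line is skipped. The trailing `awk '{print $1}'` drops any further fields that `cut` leaves on the same line.
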

tests/e2e/recovery-node-id-mismatch.sh

Lines changed: 47 additions & 0 deletions
@@ -303,6 +303,53 @@ else
   echo " assertions are the load-bearing ones."
 fi
 
+# Clear the tamper-window wedge before applying the SKILL recipe.
+#
+# The down+sed+up above intentionally races the satellite's Bug-287
+# revive (the satellite sees the `destroy resource` event from our
+# `drbdadm down` and immediately re-ups the slot itself). When the
+# two `drbdadm` invocations interleave badly, `drbdmeta apply-al`
+# hits "Device or resource busy" (exit 20) on the backing device and
+# worker-2's kernel slot ends HALF-CONFIGURED: disk attached
+# Inconsistent with `al-suspended:yes`, both peers registered with
+# peer-device entries but `connect` never issued — connection state
+# StandAlone. The satellite then deliberately leaves it alone: a
+# StandAlone slot that retains peer-device entries matches the
+# operator-disconnect signature (see shouldSkipNetOnAdjust in
+# pkg/satellite/reconciler.go — the W12 split-brain-recipe guard),
+# so every subsequent adjust runs with --skip-net and the slot never
+# reconnects. The recovery wait below would then time out with
+# worker-2 stuck Inconsistent/StandAlone.
+#
+# That half-up wedge is an artefact of two drbdadm callers colliding
+# in the provocation step — NOT the .res node-id mismatch this
+# scenario is about (UG cases 10-11). Clear it deterministically:
+# bounce the slot with a bare `drbdadm down` and let the satellite's
+# revive (the well-tested scenario 5.32 / recovery-down-reverses
+# path, a SINGLE writer this time) bring it back up cleanly, then
+# require worker-2 UpToDate before applying the SKILL recipe.
+echo ">> clear tamper-window wedge: bounce worker-2 and wait for satellite revive"
+on_node "$WORKER_2" drbdadm down "$RD" >/dev/null 2>&1 || true
+# Kernel-truth poll, not Resource.Status: right after the down the
+# observer hasn't stamped the destroy yet, so Status.diskState can
+# serve a stale UpToDate and wave the gate through before the
+# revive actually ran. `^[[:space:]]+disk:` matches only the local
+# disk line (peer lines carry `peer-disk:`).
+deadline=$(( $(date +%s) + 120 ))
+w2_disk=""
+while (( $(date +%s) < deadline )); do
+  w2_disk=$(on_node "$WORKER_2" drbdsetup status "$RD" 2>/dev/null \
+    | grep -m1 -E '^[[:space:]]+disk:' | cut -d: -f2 | awk '{print $1}' || true)
+  if [[ "$w2_disk" == "UpToDate" ]]; then break; fi
+  sleep 2
+done
+if [[ "$w2_disk" != "UpToDate" ]]; then
+  echo "FAIL: worker-2 not UpToDate after tamper-window bounce (disk=$w2_disk)"
+  on_node "$WORKER_2" drbdsetup status "$RD" 2>&1 | head -20 || true
+  exit 1
+fi
+echo " worker-2 back UpToDate after bounce"
+
 # SKILL recipe: drop worker-3's replica, then re-place via
 # autoplace. The CLI delete tears down the kernel resource on
 # worker-3 + removes the Resource CRD; the autoplace re-stamps a
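
Both scripts repeat the same deadline loop: compute an epoch deadline, re-check a condition, sleep, and give up once the budget is spent. A minimal sketch of that pattern as a reusable function follows. `poll_until` and the `third_time_lucky` stub are hypothetical names introduced here. The commit itself keeps the loops inline.

```shell
set -euo pipefail

# poll_until BUDGET_SECS INTERVAL_SECS CMD [ARGS...]
# Hypothetical helper: re-runs CMD until it succeeds or the budget
# (measured in whole epoch seconds) has expired.
poll_until() {
  local budget=$1 interval=$2
  shift 2
  local deadline=$(( $(date +%s) + budget ))
  while (( $(date +%s) < deadline )); do
    if "$@"; then
      return 0
    fi
    sleep "$interval"
  done
  return 1
}

# Demo stub (assumption): succeeds on its third call.
tries=0
third_time_lucky() { tries=$(( tries + 1 )); (( tries >= 3 )); }

if poll_until 10 0 third_time_lucky; then
  echo "ok after ${tries} tries"
fi
```

As with the inline loops, the condition is evaluated only while the deadline has not passed, so a budget of 0 returns failure without running the command at all.
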
