Commit e9e2ba9

kvaps and claude committed
fix(e2e): poll kernel ground truth in state-auto-resync 3-peer wait
state-auto-resync's local wait_uptodate_3 polled only status_disk_state, which reads the CRD .status projection. On a busy CI stand that projection lags tens of seconds behind the kernel, so the 240s wait timed out while the post-fail drbdsetup dump showed all three peers already UpToDate: the resource had converged, only the projection hadn't surfaced.

This is the recurring projection-lag flake class, not the SkipInitialSync gate: the controller stamps SkipInitialSync in the same allocation pass as the node-id/port (when the RD is observable, as it is for a normal multi-replica deploy), and the satellite gate uses the existing bounded 5s requeue, so the gate adds no minute-scale latency.

Add kernel_all_uptodate (an N-peer generalisation of the existing kernel_pair_uptodate) and accept it as an additional pass in wait_uptodate_3, mirroring the kernel fallback that lib.sh's wait_uptodate gained in c635627. It only ADDS an accept path: a genuinely non-converged RD still reads non-UpToDate, and a lone node with no peers can't falsely pass, so real failures are not masked.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
1 parent a535b8f commit e9e2ba9
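
The "only ADDS an accept path" idea from the commit message can be sketched in isolation. The snippet below is a minimal illustration, not the code from this commit: `wait_either`, `probe_projection` and `probe_kernel` are hypothetical names, with the two probes standing in for the CRD projection check and the kernel check respectively.

```shell
#!/usr/bin/env bash
# Sketch only: probe_projection and probe_kernel are hypothetical stand-ins
# for the CRD-projection check and the kernel check; they are not from the
# commit.

# wait_either <timeout_s> <interval_s>: return 0 as soon as EITHER probe
# prints "ok", or 1 once the deadline has passed.
wait_either() {
    local timeout=$1 interval=$2
    local deadline=$(( $(date +%s) + timeout ))
    while (( $(date +%s) < deadline )); do
        # Primary path: the (possibly lagging) projection.
        if [[ "$(probe_projection)" == "ok" ]]; then return 0; fi
        # Additional accept path: kernel ground truth.
        if [[ "$(probe_kernel)" == "ok" ]]; then return 0; fi
        sleep "$interval"
    done
    return 1
}

# Stubs for the demonstration: the projection lags, the kernel has converged.
probe_projection() { echo "no"; }
probe_kernel()     { echo "ok"; }

if wait_either 5 1; then
    echo "converged"
else
    echo "timed out"
fi
```

Because the second probe can only turn a "keep waiting" into a pass, a run in which neither probe reports "ok" still times out exactly as before.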

2 files changed

Lines changed: 45 additions & 0 deletions

tests/e2e/lib.sh

Lines changed: 25 additions & 0 deletions
@@ -190,6 +190,31 @@ kernel_pair_uptodate() {
         2>/dev/null || true
 }
 
+# kernel_all_uptodate <rd> <node> [vol] — kernel ground truth for an
+# N-peer RD: prints "ok" iff `node`'s local disk-state AND the
+# peer-disk-state of EVERY connection are UpToDate, read straight from
+# `drbdsetup status <rd> --json` on `node`. A single node's status frame
+# reports its own disk-state plus the peer-disk-state of all peers, so
+# for a 3-replica RD one query on any peer covers all three replicas.
+# The connection set must be non-empty (a lone node with no peers can't
+# prove the others are UpToDate). Empty/parse failure (node unreachable,
+# slot mid-negotiation) prints nothing → caller keeps waiting.
+# Independent of the controller's CRD .status projection — same purpose
+# as kernel_pair_uptodate, generalised past two peers.
+kernel_all_uptodate() {
+    local rd=$1 node=$2 vol=${3:-0}
+    on_node "$node" drbdsetup status "$rd" --json 2>/dev/null | jq -r \
+        --argjson v "$vol" '
+        ([.[0].devices[]? | select(.volume==$v) | ."disk-state"] | first) as $loc
+        | [.[0].connections[]? | .peer_devices[]?
+            | select(.volume==$v) | ."peer-disk-state"] as $peers
+        | if ($loc=="UpToDate"
+              and ($peers | length) > 0
+              and ($peers | all(. == "UpToDate")))
+          then "ok" else "no" end' \
+        2>/dev/null || true
+}
+
 # status_connection_state <rd> <node> <peer> — full kernel connection
 # state string as observed FROM `node` TOWARD `peer`: Connected /
 # Connecting / StandAlone / BrokenPipe / NetworkFailure / Timeout /

tests/e2e/state-auto-resync.sh

Lines changed: 20 additions & 0 deletions
@@ -117,6 +117,19 @@ trap 'cleanup_partition; delete_rd "$RD"' EXIT
 # 5.15 scenario needs all three diskful rows UpToDate before
 # we start poking the kernel; otherwise the disconnect would
 # race the initial-sync the autoplace just kicked off.
+#
+# Accepts kernel ground truth as well as the CRD projection. The
+# CRD .status projection (status_disk_state) can lag tens of seconds
+# behind the kernel on a busy CI stand — observed here as a 240s
+# timeout whose post-fail drbdsetup dump showed all three peers
+# already UpToDate (the row had converged; only the projection hadn't
+# surfaced). kernel_all_uptodate reads `drbdsetup status --json` on
+# $N1 directly: one frame reports $N1's local disk-state plus the
+# peer-disk-state of $N2 and $N3, so it proves all three replicas
+# from kernel truth. This mirrors the kernel-fallback lib.sh's
+# wait_uptodate gained in c63562707 and only ADDS an accept path —
+# a genuinely non-converged RD still shows non-UpToDate, so real
+# failures are not masked.
 wait_uptodate_3() {
     local rd=$1 deadline=$(( $(date +%s) + 240 ))
     while (( $(date +%s) < deadline )); do
@@ -130,9 +143,16 @@ wait_uptodate_3() {
             fi
         done
         if (( ok == 1 )); then return 0; fi
+
+        # Kernel ground truth: independent of the CRD projection lag.
+        if [[ "$(kernel_all_uptodate "$rd" "$N1")" == "ok" ]]; then
+            return 0
+        fi
+
         sleep 2
     done
     echo "FAIL: $rd never reached UpToDate on all 3 peers" >&2
+    echo " last CRD diskState: $N1=$(status_disk_state "$rd" "$N1") $N2=$(status_disk_state "$rd" "$N2") $N3=$(status_disk_state "$rd" "$N3")" >&2
     on_node "$N1" drbdsetup status "$rd" 2>/dev/null || true
     return 1
 }
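
To see how the jq filter in kernel_all_uptodate treats the cases the commit message calls out (all converged, one peer lagging, a lone node with no peers), it can be run against canned input without a cluster. This is a rough sketch under assumptions: `all_uptodate_from_frame`, `sample_frame` and `lone_frame` are hypothetical helpers, and the JSON is a hand-written, heavily trimmed stand-in that keeps only the keys the filter reads, not real `drbdsetup status --json` output.

```shell
#!/usr/bin/env bash
# Sketch only: the helpers and the JSON frames below are illustrative
# assumptions, not part of the commit and not captured drbdsetup output.

# all_uptodate_from_frame [vol]: the same jq filter as kernel_all_uptodate,
# but reading the JSON frame from stdin instead of querying a node.
all_uptodate_from_frame() {
    local vol=${1:-0}
    jq -r --argjson v "$vol" '
        ([.[0].devices[]? | select(.volume==$v) | ."disk-state"] | first) as $loc
        | [.[0].connections[]? | .peer_devices[]?
            | select(.volume==$v) | ."peer-disk-state"] as $peers
        | if ($loc=="UpToDate"
              and ($peers | length) > 0
              and ($peers | all(. == "UpToDate")))
          then "ok" else "no" end' 2>/dev/null || true
}

# sample_frame <second_peer_state>: local disk UpToDate, first peer UpToDate,
# second peer in the given state.
sample_frame() {
    cat <<EOF
[{"devices":[{"volume":0,"disk-state":"UpToDate"}],
  "connections":[
    {"peer_devices":[{"volume":0,"peer-disk-state":"UpToDate"}]},
    {"peer_devices":[{"volume":0,"peer-disk-state":"$1"}]}]}]
EOF
}

# A lone node: local disk UpToDate but no connections at all.
lone_frame='[{"devices":[{"volume":0,"disk-state":"UpToDate"}],"connections":[]}]'

if command -v jq >/dev/null 2>&1; then
    sample_frame UpToDate     | all_uptodate_from_frame   # all three replicas converged
    sample_frame Inconsistent | all_uptodate_from_frame   # one peer still syncing
    printf '%s' "$lone_frame" | all_uptodate_from_frame   # no peers to vouch for
    printf ''                 | all_uptodate_from_frame   # empty input, e.g. node unreachable
fi
```

With jq installed, the first call prints `ok`, the next two print `no`, and the empty input prints nothing, so a caller comparing against `"ok"` keeps waiting in every case but the first.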
