Commit b352a46

author
SqlRush
committed
fix(cluster): spec-5.15 online-join Hardening v1.1 — half-publish proof, bootstrap epoch-proof, marker identity-grouping
Spec: spec-5.15-online-declared-node-join-membership.md (Hardening v1.1)

Three post-ship correctness findings on the online declared-node join path, all fail-closed (behind cluster.online_join, default off):

- HF-1 / INV-J9 (half-publish window): the v1.0 joiner opened its write gate on the durable COMMITTED join marker alone (note_self_admitted), so a coordinator that crashed/stalled after the marker was majority-durable but before the publish (epoch advance + survivor state=MEMBER) left the joiner writable while every other node still saw it JOINING. New pure fn cluster_reconfig_join_publish_proven(admitted_epoch): qvotec opens the gate only after a majority of MEMBER survivors have advanced their durable observed epoch to >= admitted_epoch (the publish actually propagated). A marker-durable-but-unpublished state keeps the gate CLOSED; the joiner then times out to 53R61 and restarts with a fresh incarnation.

- HF-2 / INV-J14 (bootstrap fail-open): the v1.0 joiner_self_tick used a timing-grace heuristic: if no running peer was observed within a window it declared cold-bootstrap and left the gate open forever, so a slow qvotec could mis-see a rejoiner as a bootstrap and permanently fail-open. Replaced with a positive epoch proof cluster_reconfig_bootstrap_quorum_at_initial() (quorum of declared CSSD-alive AND no declared peer past EPOCH_INITIAL); undecided keeps the gate CLOSED (fail-closed). shmem init defaults the gate CLOSED when online_join is on (open when off; no behavior change off).

- HF-3 / INV-J13 (false majority): self-admit and the startup seed counted "any COMMITTED marker" toward majority, so two minority writes from different commit attempts (different coordinator/epoch) could aggregate. ClusterJoinCommitMarker gains a per-attempt commit_nonce (CLUSTER_JCMK_VERSION 1->2; v1 markers fail-closed rejected); new cluster_join_marker_same_commit() groups by full identity, so only a single commit present on a disk majority is honored.
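The two gate predicates named in HF-1 and HF-2 can be pictured as pure functions over a local view of the declared peers. The sketch below is illustrative only: the `SketchPeer` struct, the `sketch_` function names, and the value of `SKETCH_EPOCH_INITIAL` are assumptions and are not taken from the commit. The real `cluster_reconfig_join_publish_proven()` and `cluster_reconfig_bootstrap_quorum_at_initial()` read cluster state that this page does not show.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical value and types for illustration; not the commit's definitions. */
#define SKETCH_EPOCH_INITIAL 1u

typedef enum { SKETCH_JOINING, SKETCH_MEMBER } SketchState;

typedef struct {
    SketchState state;          /* membership state as seen locally */
    bool        cssd_alive;     /* declared peer is currently heartbeating */
    uint64_t    observed_epoch; /* peer's durable observed epoch */
} SketchPeer;

/*
 * HF-1 idea: the publish counts as proven once a majority of the MEMBER
 * peers have durably observed an epoch >= admitted_epoch.
 */
bool
sketch_join_publish_proven(const SketchPeer *peers, size_t n, uint64_t admitted_epoch)
{
    size_t members = 0;
    size_t caught_up = 0;

    assert(peers != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (peers[i].state != SKETCH_MEMBER)
            continue;
        members++;
        if (peers[i].observed_epoch >= admitted_epoch)
            caught_up++;
    }
    return members > 0 && caught_up >= members / 2 + 1;
}

/*
 * HF-2 idea: a cold bootstrap is proven only if a quorum of the declared
 * nodes is CSSD-alive AND no declared peer is past the initial epoch.
 * Anything else is "undecided" and must keep the gate closed.
 */
bool
sketch_bootstrap_quorum_at_initial(const SketchPeer *declared, size_t n)
{
    size_t alive = 0;

    assert(declared != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (declared[i].observed_epoch > SKETCH_EPOCH_INITIAL)
            return false; /* a peer has advanced: this is a rejoin, not a bootstrap */
        if (declared[i].cssd_alive)
            alive++;
    }
    return n > 0 && alive >= n / 2 + 1;
}
```

Both functions return false when they cannot prove their condition, which matches the fail-closed stance of the commit: the absence of evidence never opens the gate.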
Test-infra fix the L8 e2e surfaced: the three reconfig-join injection fire sites were never in the cluster_inject registry, so cluster.injection_points rejected them as unknown and the half-publish injection never armed. Register cluster-reconfig-join-commit-marker-durable (the one L8 needs); total registry 139 -> 140. Ripple the count baselines in t/015/017/018/020/021/022/023/024/030 and the cluster-reconfig-% count in cluster_regress reconfig_smoke (5 -> 6).

Tests: cluster_unit 131/131 (U16-U19 new: same_commit identity group, version fail-closed, publish-proven member quorum, bootstrap epoch proof); t/315 18/18 incl. new L8 real crash-in-window e2e (coordinator paused post-marker-durable, joiner stays 53R60 then 53R61, never half-published); cluster_regress 11/11; PG 219/219; clang-format v18 clean.

No catversion bump (marker is voting-disk on-disk, self-versioned, not catalog).
1 parent 2056233 commit b352a46

21 files changed

Lines changed: 588 additions & 85 deletions

src/backend/cluster/cluster_inject.c

Lines changed: 7 additions & 0 deletions
@@ -229,6 +229,13 @@ static ClusterInjectPoint cluster_injection_points[] = {
     { .name = "cluster-reconfig-decide-coordinator" },
     { .name = "cluster-reconfig-epoch-bump-pre" },
     { .name = "cluster-reconfig-broadcast-procsig-pre" },
+    /*
+     * spec-5.15 Hardening v1.1 (HF-1): pause the join coordinator inside
+     * commit_member after the COMMITTED marker is majority-durable but before
+     * the publish, so t/315 L8 can prove the half-publish window keeps the
+     * joiner's write gate CLOSED (publish-proof, not the marker alone, opens it).
+     */
+    { .name = "cluster-reconfig-join-commit-marker-durable" },
     { .name = "cluster-cssd-mark-peer-dead" },
     /* Stage 1.15 (spec-1.15 D11 inject) — 4 SCN encoding-layer injects. */
     { .name = "cluster-scn-advance-pre" },
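The commit message explains why the registry entry matters: a name that is not in the table is rejected as unknown, so the injection never arms. A minimal sketch of that lookup behavior follows. The `SketchInjectPoint` type, its `armed` field, and the `sketch_inject_*` functions are hypothetical stand-ins; the real `ClusterInjectPoint` layout and arming API are not shown on this page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical registry entry; the real ClusterInjectPoint has other fields. */
typedef struct {
    const char *name;
    bool        armed;
} SketchInjectPoint;

static SketchInjectPoint sketch_points[] = {
    { .name = "cluster-reconfig-epoch-bump-pre" },
    { .name = "cluster-reconfig-join-commit-marker-durable" },
    { .name = "cluster-cssd-mark-peer-dead" },
};

#define SKETCH_N_POINTS (sizeof(sketch_points) / sizeof(sketch_points[0]))

/* Find a registered point by exact name; NULL when the name is unknown. */
static SketchInjectPoint *
sketch_inject_find(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < SKETCH_N_POINTS; i++) {
        if (strcmp(sketch_points[i].name, name) == 0)
            return &sketch_points[i];
    }
    return NULL;
}

/* Arm a point. Unknown names are rejected instead of being silently ignored. */
bool
sketch_inject_arm(const char *name)
{
    SketchInjectPoint *p = sketch_inject_find(name);

    if (p == NULL)
        return false;
    p->armed = true;
    return true;
}

/* A fire site would call this and pause only when the point is armed. */
bool
sketch_inject_is_armed(const char *name)
{
    SketchInjectPoint *p = sketch_inject_find(name);

    return p != NULL && p->armed;
}
```

Under this model a fire site whose name was never registered can never be armed, which is the failure mode the test-infra fix addresses.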

src/backend/cluster/cluster_membership.c

Lines changed: 18 additions & 1 deletion
@@ -180,7 +180,7 @@ cluster_membership_is_member(int32 node_id)
  * ============================================================
  */

-/* Set m->crc32c = CRC32C over [magic .. supersedes_leave_epoch]. */
+/* Set m->crc32c = CRC32C over [magic .. commit_nonce]. */
 void
 cluster_join_marker_compute_crc(ClusterJoinCommitMarker *m)
 {
@@ -227,6 +227,23 @@ cluster_join_marker_is_committed_basis(const ClusterJoinCommitMarker *m, int32 e
         && m->phase == CLUSTER_JCMK_PHASE_COMMITTED;
 }

+/*
+ * INV-J13 (Hardening v1.1): same commit attempt? Compares the whole identity
+ * range [0 .. offsetof(crc32c)) -- magic/version/node_id/phase/_pad/generation/
+ * admitted_incarnation/admitted_epoch/supersedes_leave_epoch/commit_nonce. The
+ * markers are memset-0 before fill so _pad never differs. Used by the self-
+ * admit and the startup-seed majority judgements so that two minority writes
+ * from DIFFERENT commit attempts (different coordinator / epoch / nonce) cannot
+ * be counted together as a false majority (P1-3).
+ */
+bool
+cluster_join_marker_same_commit(const ClusterJoinCommitMarker *a, const ClusterJoinCommitMarker *b)
+{
+    if (a == NULL || b == NULL)
+        return false;
+    return memcmp(a, b, offsetof(ClusterJoinCommitMarker, crc32c)) == 0;
+}
+
 /*
  * Apply one durable marker to the admitted floor (INV-J7). Only a committed
  * basis raises the floor; record_admitted is monotonic so re-applying / lower
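The identity comparison in `cluster_join_marker_same_commit()` relies on two things: every identity field sits before `crc32c`, and the struct is zeroed before it is filled so that padding bytes compare equal. The sketch below shows that pattern on a stand-in struct. The field order, field widths, magic value, and the `sketch_` names are assumptions made for illustration; the real `ClusterJoinCommitMarker` layout is not shown in this diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: identity fields first, checksum last. */
typedef struct {
    uint32_t magic;
    uint16_t version;
    uint8_t  phase;
    uint8_t  _pad;                   /* explicit padding, zeroed by memset */
    int32_t  node_id;
    uint32_t generation;
    uint64_t admitted_incarnation;
    uint64_t admitted_epoch;
    uint64_t supersedes_leave_epoch;
    uint64_t commit_nonce;           /* per-attempt nonce added in v2 */
    uint32_t crc32c;                 /* excluded from the identity range */
} SketchMarker;

/* Zero the whole struct first so padding never carries stack garbage. */
void
sketch_marker_fill(SketchMarker *m, int32_t node_id, uint64_t incarnation,
                   uint64_t epoch, uint64_t nonce)
{
    assert(m != NULL);
    memset(m, 0, sizeof(*m));
    m->magic = 0x4A434D4Bu;          /* placeholder magic, not the real one */
    m->version = 2;
    m->phase = 1;                    /* stands in for the COMMITTED phase */
    m->node_id = node_id;
    m->admitted_incarnation = incarnation;
    m->admitted_epoch = epoch;
    m->commit_nonce = nonce;
}

/* Same commit attempt iff every byte before crc32c matches. */
bool
sketch_marker_same_commit(const SketchMarker *a, const SketchMarker *b)
{
    if (a == NULL || b == NULL)
        return false;
    return memcmp(a, b, offsetof(SketchMarker, crc32c)) == 0;
}
```

Comparing raw bytes keeps the check independent of the field list: a field added before `crc32c` in a later version joins the identity automatically, provided writers keep zeroing the struct before filling it.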

src/backend/cluster/cluster_qvotec.c

Lines changed: 37 additions & 11 deletions
@@ -821,15 +821,26 @@ qvotec_poll_once(void)
     /*
      * spec-5.15 D5: detect THIS node's own admission — a §2.6 COMMITTED join
      * marker in region-3 slot self, with admitted_incarnation == our incarnation,
-     * on a quorum-majority of disks. note_self_admitted then adopts the admitted
-     * epoch + sets self MEMBER + opens the joiner write gate (one extra slot read
-     * per disk; qvotec already has the fds open).
+     * on a quorum-majority of disks (one extra slot read per disk; qvotec already
+     * has the fds open).
+     *
+     * Hardening v1.1:
+     *   HF-3 (INV-J13): require a majority of the SAME commit (identical nonce),
+     *     not "any COMMITTED marker" — two minority writes from different commit
+     *     attempts (different coordinator / epoch) must not aggregate (P1-3).
+     *   HF-1 (INV-J9): open the gate only after the publish-proof also holds (a
+     *     member quorum reached admitted_epoch). marker-durable-but-coordinator-
+     *     crashed-before-publish keeps the gate CLOSED -> the joiner times out ->
+     *     53R61 -> restarts (P1-1 half-publish window). Until then the lmon
+     *     epoch catch-up keeps transport alive and the next poll re-checks.
      */
     if (cluster_node_id >= 0 && cluster_node_id < CLUSTER_MAX_NODES) {
-        uint32 self_agree = 0;
-        uint64 self_admitted_epoch = 0;
+        ClusterJoinCommitMarker self_markers[CLUSTER_MAX_VOTING_DISKS];
+        int n_self = 0;
         uint32 majority = ((uint32)qvotec_n_disks / 2u) + 1u;
-        int d;
+        int d, a, b;
+        int win = -1;
+        uint32 win_agree = 0;

         for (d = 0; d < qvotec_n_disks; d++) {
             union {
@@ -847,12 +858,27 @@ qvotec_poll_once(void)
                 continue;
             if (m.admitted_incarnation != qvotec_self_incarnation)
                 continue; /* a stale prior-incarnation admission — not us */
-            self_agree++;
-            if (m.admitted_epoch > self_admitted_epoch)
-                self_admitted_epoch = m.admitted_epoch;
+            self_markers[n_self++] = m;
+        }
+
+        /* HF-3: find a single commit (nonce) present on >= majority disks. */
+        for (a = 0; a < n_self; a++) {
+            uint32 same = 0;
+
+            for (b = 0; b < n_self; b++)
+                if (cluster_join_marker_same_commit(&self_markers[a], &self_markers[b]))
+                    same++;
+            if (same >= majority) {
+                win = a;
+                win_agree = same;
+                break;
+            }
         }
-        if (self_agree >= majority)
-            cluster_reconfig_note_self_admitted(self_admitted_epoch);
+
+        /* HF-1: gate-open requires the publish-proof too, not the marker alone. */
+        if (win >= 0 && win_agree >= majority
+            && cluster_reconfig_join_publish_proven(self_markers[win].admitted_epoch))
+            cluster_reconfig_note_self_admitted(self_markers[win].admitted_epoch);
     }

     /*
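The difference between the v1.0 count ("any COMMITTED marker") and the v1.1 grouping can be reproduced with a small standalone function. This is a simplified sketch: it reduces a marker's identity to an epoch and a nonce, and the `SketchCommitId` type and `sketch_find_majority_commit()` name are hypothetical. The quadratic scan mirrors the loop in the hunk above, which is cheap because the number of voting disks is small.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reduced identity: the real code compares the whole marker. */
typedef struct {
    uint64_t admitted_epoch;
    uint64_t commit_nonce;
} SketchCommitId;

static int
sketch_same_commit(const SketchCommitId *a, const SketchCommitId *b)
{
    return a->admitted_epoch == b->admitted_epoch
        && a->commit_nonce == b->commit_nonce;
}

/*
 * Return the index of a marker whose commit appears on at least a majority
 * of n_disks, or -1 when no single commit reaches that threshold.
 * markers[] holds one entry per disk that returned a usable marker.
 */
int
sketch_find_majority_commit(const SketchCommitId *markers, int n_markers, int n_disks)
{
    int majority = n_disks / 2 + 1;

    assert(n_markers >= 0 && n_markers <= n_disks);
    for (int a = 0; a < n_markers; a++) {
        int same = 0;

        for (int b = 0; b < n_markers; b++) {
            if (sketch_same_commit(&markers[a], &markers[b]))
                same++;
        }
        if (same >= majority)
            return a;
    }
    return -1;
}
```

With three disks, two markers from different attempts would have satisfied the old counter (2 >= 2) but yield -1 here; only when the same commit is present on two of the three disks does the function return an index.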
