Commit 758f403

Author: SqlRush
fix(cluster): spec-5.15 Hardening v1.2 — cold-bootstrap proof on durable slot, not live CSSD
A founding survivor could mis-classify itself as a rejoiner and refuse its own writes (53R61) after an UNRELATED node fail-stop, blocking 3-node reconfig (the join-remaster barrier never completes; surfaced by spec-5.16 t/326).

Root cause: the v1.1 cold-bootstrap proof (cluster_reconfig_bootstrap_quorum_at_initial) counted live CSSD-alive peers. Transient IC / heartbeat churn could drop the quorum below threshold, leaving a genuine co-booting member UNDECIDED (its joiner gate never latched BOOTSTRAP). An unrelated peer's later fail-stop then advanced the cluster epoch, and joiner_self_tick reclassified the still-UNDECIDED member as a rejoiner: it closed its own write gate and timed out to 53R61. INV-J14's latch only covered the BOOTSTRAP branch, not the UNDECIDED window.

Fix: anchor the co-boot quorum on the durable voting-disk slot instead of live CSSD. Count a declared peer toward the quorum only on a valid observed slot (cluster_reconfig_get_observed_slot true, generation > 0) at epoch INITIAL; never on a default-0 placeholder, never on live CSSD state. Stable across CSSD churn, so a founding member latches BOOTSTRAP reliably during formation, before any unrelated epoch advance. The quorum threshold (declared/2+1) and the rejoiner predicate (any peer epoch > INITIAL) are unchanged; only the "alive at INITIAL" predicate moves to the durable slot.

Unit: test_cluster_reconfig U20 (valid-slot-vs-CSSD differentiator + default-0 fail-closed, TDD red->green) and U19 updated to the durable-slot semantics.

No on-disk / wire / catalog / GUC change; no catversion bump.

Spec: spec-5.15-online-declared-node-join-membership.md (Hardening v1.2)
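The race in the message can be pictured with a small state model. This is a sketch under stated assumptions, not the joiner_self_tick implementation: the enum, model_gate_tick, model_replay, and in particular the order of the checks are hypothetical, chosen only to reproduce the sequence described above (UNDECIDED during formation, then an unrelated epoch advance, then rejoiner).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for CLUSTER_EPOCH_INITIAL. */
#define MODEL_EPOCH_INITIAL 0u

/* Hypothetical gate states; the real joiner gate may have more. */
typedef enum {
    MODEL_GATE_UNDECIDED,
    MODEL_GATE_BOOTSTRAP,   /* latched: write gate stays open */
    MODEL_GATE_REJOINER     /* write gate closed until admitted */
} model_gate;

/*
 * One tick of the modelled gate. ASSUMPTION: an already decided gate is
 * left alone, then the epoch is checked, then the bootstrap proof.
 */
static model_gate
model_gate_tick(model_gate gate, bool bootstrap_proven, uint64_t cluster_epoch)
{
    if (gate != MODEL_GATE_UNDECIDED)
        return gate;                    /* in this model a decision sticks */
    if (cluster_epoch > MODEL_EPOCH_INITIAL)
        return MODEL_GATE_REJOINER;     /* a running cluster was observed */
    if (bootstrap_proven)
        return MODEL_GATE_BOOTSTRAP;
    return MODEL_GATE_UNDECIDED;        /* fail-closed, retry next tick */
}

/* Replays the formation ticks, then one tick after an epoch advance. */
static model_gate
model_replay(const bool *proof_per_tick, int formation_ticks)
{
    model_gate gate = MODEL_GATE_UNDECIDED;

    assert(formation_ticks >= 0);
    for (int i = 0; i < formation_ticks; i++)
        gate = model_gate_tick(gate, proof_per_tick[i], MODEL_EPOCH_INITIAL);

    /* an unrelated peer fail-stops and the epoch moves past INITIAL */
    return model_gate_tick(gate, false, MODEL_EPOCH_INITIAL + 1u);
}
```

If the proof is false on every formation tick (the v1.1 behaviour under CSSD churn), the replay ends in MODEL_GATE_REJOINER. If the proof holds on any formation tick, the gate latches and the later epoch advance no longer matters, which is the property the durable-slot proof is meant to restore.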
1 parent b150aa8 commit 758f403

2 files changed

Lines changed: 100 additions & 19 deletions

src/backend/cluster/cluster_reconfig.c

Lines changed: 43 additions & 13 deletions
@@ -1200,39 +1200,69 @@ cluster_reconfig_cluster_already_running(void)
 }
 
 /*
- * spec-5.15 Hardening v1.1 (HF-2 / INV-J14) — positive cold-bootstrap proof.
+ * spec-5.15 Hardening v1.1 (HF-2 / INV-J14) — positive cold-bootstrap proof,
+ * REVISED by Hardening v1.2 (INV-J14 self-join-gate race) to rest on the durable
+ * voting-disk slot rather than live CSSD.
+ *
  * A node may keep its write gate open WITHOUT online-join admission only when a
- * majority of declared nodes are CSSD-alive AND no declared peer is observed
- * past CLUSTER_EPOCH_INITIAL (everyone co-booting at epoch 0). This is an EPOCH
- * proof, not a timing grace: a slow qvotec leaves the decision UNDECIDED (gate
- * stays closed, fail-closed) instead of mis-deciding bootstrap and permanently
- * fail-opening (P1-2). A rejoiner can never satisfy it — by the time it sees
- * its peers they are already at epoch > INITIAL.
+ * majority of declared nodes are observed CO-BOOTING at CLUSTER_EPOCH_INITIAL on
+ * a VALID durable slot (qvotec saw a real voting-disk slot, generation > 0, at
+ * epoch INITIAL) AND no declared peer is observed past CLUSTER_EPOCH_INITIAL.
+ * This is an EPOCH proof, not a timing grace: a slow qvotec leaves the decision
+ * UNDECIDED (gate stays closed, fail-closed) instead of mis-deciding bootstrap
+ * and permanently fail-opening (P1-2). A rejoiner can never satisfy it — by the
+ * time it sees its peers they are already at epoch > INITIAL.
+ *
+ * v1.2 RATIONALE: the v1.1 proof counted live CSSD-alive peers, which a
+ * transient IC / heartbeat churn could momentarily drop below quorum — leaving a
+ * GENUINE founding member UNDECIDED (never latched). An UNRELATED peer's later
+ * fail-stop then advanced the epoch, and joiner_self_tick reclassified that
+ * still-UNDECIDED member as a rejoiner: it closed its own write gate and timed
+ * out to 53R61 (refused its own writes), never participating again. Anchoring
+ * the quorum on the DURABLE slot (stable across CSSD churn) lets a founding
+ * member latch reliably during formation, closing the UNDECIDED window before
+ * any unrelated epoch advance. A default-0 placeholder (generation 0) is NOT
+ * proof and must never count (else a node with no real evidence fail-opens) —
+ * the v1.2 user constraint: latch only on a valid co-boot slot, never on 0.
+ * Quorum (not all-declared) is retained so a degraded co-boot (e.g. 2 of 3) can
+ * still form, and because requiring every peer would only WIDEN the UNDECIDED
+ * window the race exploits.
  */
 bool
 cluster_reconfig_bootstrap_quorum_at_initial(void)
 {
     uint32 declared = 0;
-    uint32 alive_at_initial = 0;
+    uint32 proven_at_initial = 0;
     int i;
 
     for (i = 0; i < CLUSTER_MAX_NODES; i++) {
+        uint64 inc = 0;
+        uint64 gen = 0;
+        uint64 ep;
+
         if (cluster_conf_lookup_node(i) == NULL)
             continue;
         declared++;
         if (i == cluster_node_id) {
-            alive_at_initial++; /* self is up, at INITIAL (not yet admitted) */
+            proven_at_initial++; /* self is up, at INITIAL (not yet admitted) */
             continue;
         }
+        ep = cluster_reconfig_get_observed_epoch(i);
         /* any declared peer past INITIAL => a running cluster, NOT a bootstrap */
-        if (cluster_reconfig_get_observed_epoch(i) > CLUSTER_EPOCH_INITIAL)
+        if (ep > CLUSTER_EPOCH_INITIAL)
             return false;
-        if (cluster_cssd_get_peer_state(i) != CLUSTER_CSSD_PEER_DEAD)
-            alive_at_initial++;
+        /*
+         * Count a peer only on a VALID durable co-boot slot: a real observed
+         * voting-disk slot (generation > 0) at epoch INITIAL. Never count a
+         * default-0 placeholder (generation 0) nor rely on live CSSD state.
+         */
+        if (cluster_reconfig_get_observed_slot(i, &inc, &gen) && gen > 0
+            && ep == CLUSTER_EPOCH_INITIAL)
+            proven_at_initial++;
     }
     if (declared == 0)
         return false;
-    return alive_at_initial >= ((declared / 2u) + 1u);
+    return proven_at_initial >= ((declared / 2u) + 1u);
 }
 
 /*
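To see the v1.1 and v1.2 predicates side by side without the rest of the backend, the hunk above can be modelled in isolation. This is a minimal sketch, not the project's code: model_node, model_quorum, and the `durable` switch are hypothetical, and the live-CSSD input is collapsed to a single bool where the real v1.1 code tested `cluster_cssd_get_peer_state(i) != CLUSTER_CSSD_PEER_DEAD`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_MAX_NODES     8      /* hypothetical bound for the sketch */
#define MODEL_EPOCH_INITIAL 0u     /* stand-in for CLUSTER_EPOCH_INITIAL */

/* Hypothetical per-node view; NOT the real cluster_reconfig state. */
typedef struct {
    bool     declared;    /* node appears in the cluster configuration */
    bool     cssd_alive;  /* live heartbeat state (the v1.1 input) */
    bool     slot_seen;   /* a voting-disk slot was observed (v1.2 input) */
    uint64_t generation;  /* slot generation; 0 means placeholder */
    uint64_t epoch;       /* observed epoch for this node */
} model_node;

/*
 * Cold-bootstrap proof over a model cluster.
 * durable == false counts peers by live state (v1.1 style);
 * durable == true counts peers by a valid slot, generation > 0 (v1.2 style).
 */
static bool
model_quorum(const model_node *nodes, int n, int self, bool durable)
{
    uint32_t declared = 0;
    uint32_t proven = 0;

    assert(n >= 0 && n <= MODEL_MAX_NODES);
    for (int i = 0; i < n; i++) {
        if (!nodes[i].declared)
            continue;
        declared++;
        if (i == self) {
            proven++;               /* self always counts */
            continue;
        }
        /* any peer past INITIAL means a running cluster, not a bootstrap */
        if (nodes[i].epoch > MODEL_EPOCH_INITIAL)
            return false;
        /* epoch is unsigned, so from here on it equals INITIAL */
        if (durable ? (nodes[i].slot_seen && nodes[i].generation > 0)
                    : nodes[i].cssd_alive)
            proven++;
    }
    if (declared == 0)
        return false;
    return proven >= declared / 2u + 1u;
}
```

Two inputs separate the variants. With peers that hold valid slots but are momentarily not alive, only the durable variant proves bootstrap. With peers that are alive but have only a generation-0 placeholder, only the live-state variant does, which is the fail-open the commit rules out.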

src/test/cluster_unit/test_cluster_reconfig.c

Lines changed: 57 additions & 6 deletions
@@ -1413,20 +1413,70 @@ UT_TEST(test_reconfig_bootstrap_quorum_epoch_proof)
     ut_join_setup(); /* self = node 0 */
     ut_declared_set[1] = true; /* 3 declared nodes */
     ut_declared_set[2] = true;
-    ut_peer_state[1] = CLUSTER_CSSD_PEER_ALIVE;
-    ut_peer_state[2] = CLUSTER_CSSD_PEER_ALIVE;
 
-    /* all CSSD-alive at epoch 0 -> cold bootstrap proven. */
+    /* quorum of declared on VALID co-boot slots at INITIAL -> bootstrap proven
+     * (v1.2: anchored on the durable voting-disk slot, not live CSSD). */
+    cluster_reconfig_record_observed_slot(1, 1, 1, 0);
+    cluster_reconfig_record_observed_slot(2, 1, 1, 0);
     UT_ASSERT(cluster_reconfig_bootstrap_quorum_at_initial());
 
     /* a peer past INITIAL (running cluster) -> NOT a bootstrap (fail-closed). */
     cluster_reconfig_record_observed_slot(1, 1, 1, 4);
     UT_ASSERT(!cluster_reconfig_bootstrap_quorum_at_initial());
 
-    /* back at epoch 0 but only self alive (peers DEAD) -> below quorum -> false. */
-    cluster_reconfig_record_observed_slot(1, 1, 1, 0);
-    ut_peer_state[1] = CLUSTER_CSSD_PEER_DEAD;
+    /* no valid co-boot slot on either peer (generation 0) -> only self proven ->
+     * below quorum -> false (never latch on a default-0 placeholder). */
+    cluster_reconfig_record_observed_slot(1, 0, 0, 0);
+    cluster_reconfig_record_observed_slot(2, 0, 0, 0);
+    UT_ASSERT(!cluster_reconfig_bootstrap_quorum_at_initial());
+}
+
+
+/* ======================================================================
+ * U20 (spec-5.15 Hardening v1.2 / INV-J14 self-join-gate race) -- the
+ * cold-bootstrap proof must rest on a VALID durable co-boot slot
+ * (cluster_reconfig_get_observed_slot true, generation > 0, observed_epoch
+ * == INITIAL), NOT on live CSSD state and NOT on a default-0 placeholder.
+ *
+ * Root cause it guards: a founding survivor that has durable proof of co-
+ * booting at INITIAL but whose peers' live CSSD is momentarily DOWN (IC /
+ * heartbeat churn) was denied bootstrap by the v1.1 CSSD-quorum proof, so it
+ * stayed UNDECIDED; a later UNRELATED node fail-stop then advanced the epoch
+ * and reclassified this genuine member as a rejoiner -> 53R61 (refused its own
+ * writes). Anchoring the proof on the durable voting-disk slot lets the member
+ * latch reliably during formation (immune to CSSD churn), closing the window.
+ * ====================================================================== */
+UT_TEST(test_reconfig_bootstrap_proof_valid_slot_not_cssd)
+{
+    /* --- A. valid co-boot slots at INITIAL but peers CSSD-DEAD -> still a
+     * proven bootstrap (durable slot, not live CSSD). v1.1 returned
+     * false here (the race window); v1.2 returns true. --- */
+    ut_join_setup(); /* self = node 0 */
+    ut_declared_set[1] = true; /* 3 declared nodes */
+    ut_declared_set[2] = true;
+    cluster_reconfig_record_observed_slot(1, 7, 1, 0); /* valid slot, INITIAL */
+    cluster_reconfig_record_observed_slot(2, 7, 1, 0); /* valid slot, INITIAL */
+    ut_peer_state[1] = CLUSTER_CSSD_PEER_DEAD; /* live CSSD churned down */
     ut_peer_state[2] = CLUSTER_CSSD_PEER_DEAD;
+    UT_ASSERT(cluster_reconfig_bootstrap_quorum_at_initial());
+
+    /* --- B. CSSD-alive but NO valid slot (generation 0 placeholder) must
+     * NOT prove bootstrap — never latch on a default-0 epoch. --- */
+    ut_join_setup();
+    ut_declared_set[1] = true;
+    ut_declared_set[2] = true;
+    ut_peer_state[1] = CLUSTER_CSSD_PEER_ALIVE;
+    ut_peer_state[2] = CLUSTER_CSSD_PEER_ALIVE;
+    /* no record_observed_slot -> generation 0 -> not a valid co-boot proof */
+    UT_ASSERT(!cluster_reconfig_bootstrap_quorum_at_initial());
+
+    /* --- C. a peer observed past INITIAL is a running cluster, never a
+     * bootstrap (rejoiner fail-closed) — unchanged from v1.1. --- */
+    ut_join_setup();
+    ut_declared_set[1] = true;
+    ut_declared_set[2] = true;
+    cluster_reconfig_record_observed_slot(1, 7, 1, 0); /* valid, INITIAL */
+    cluster_reconfig_record_observed_slot(2, 7, 1, 5); /* valid, past INITIAL */
     UT_ASSERT(!cluster_reconfig_bootstrap_quorum_at_initial());
 }
 
@@ -1508,6 +1558,7 @@ main(void)
     UT_RUN(test_reconfig_join_publish_proven_member_quorum);
     UT_RUN(test_reconfig_join_publish_proven_no_member_failclosed);
     UT_RUN(test_reconfig_bootstrap_quorum_epoch_proof);
+    UT_RUN(test_reconfig_bootstrap_proof_valid_slot_not_cssd);
 
     UT_DONE();
     return ut_failed_count == 0 ? 0 : 1;
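The "only self proven -> below quorum" outcomes in U19 and U20 follow directly from the threshold arithmetic, which the commit leaves unchanged. A small sketch of that arithmetic, with hypothetical helper names (model_quorum_threshold and model_peer_slots_needed are not functions in the tree):

```c
#include <assert.h>
#include <stdint.h>

/* Same arithmetic as the diff: a strict majority of declared nodes. */
static uint32_t
model_quorum_threshold(uint32_t declared)
{
    return declared / 2u + 1u;
}

/*
 * Valid peer slots still needed once self has been counted.
 * ASSUMPTION: self is always among the declared nodes, so declared >= 1.
 */
static uint32_t
model_peer_slots_needed(uint32_t declared)
{
    assert(declared >= 1u);
    return model_quorum_threshold(declared) - 1u;
}
```

For the three declared nodes used in the tests the threshold is 2, so a single valid peer slot is enough (the "2 of 3" degraded co-boot from the code comment), while zero valid peer slots leaves only self counted and the proof fails closed.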
