Commit b150aa8

author
SqlRush
committed
fix(cluster): spec-5.18 Hardening v1.1 — survivor-local removal apply + self-demote durability
Post-ship review of permanent node removal found 4 P1 + 1 P2, all in the non-coordinator survivor path and the removed-node self path (the coordinator main path was already covered). Root cause: a runtime removal was applied only on the coordinator and on the removed node itself; non-coordinator survivors never applied it locally.

- HF-1/HF-3: a non-coordinator survivor now APPLIES the removal locally on the NODE_REMOVE_ANNOUNCE: it seeds removed_bitmap + membership_state[N]=REMOVED with the coordinator's incarnation floor, permanently remasters N-mastered shards, clears N's GES/PCM, and proves zero leftover before it ACKs (was: only dropped refs + ACKed, leaving the permanent-removal fact absent from its memory until a restart, and ACKing without a leftover proof). Retried each survivor lmon tick until verify converges. Announce payload gains removed_incarnation (IC wire version 1->2). New INV-LF11 (survivor-local-apply).
- HF-2: a removed node that is still running / restarts keeps self REMOVED: the lmon self-state maintenance and joiner_self_tick no longer flip it to JOINING/MEMBER, and the 53R64 self-demote gate now consults the durable removed_bitmap (lock-free) as the authoritative floor, not just membership_state.
- HF-4: the survivor adopts the announce's identity when its recorded one is absent/terminal/stale and ACKs with the payload identity; the coordinator validates the full (target, epoch, event_id) tuple, so consecutive removals no longer wedge the cleanup barrier on a stale event_id.
- HF-5: pg_cluster_remove_node returns removal_in_progress (not a stale accepted) when a concurrent request reserved first.

Tests: cluster_unit U-H (payload v2 + removed_incarnation in CRC); t/325 +HF-2 (restarted removed node fail-closes xid assignment via 53R64); new 3-node t/326 (non-coordinator survivor records the removal + retains it after the coordinator leaves).

Local: cluster_unit + cluster_regress(12) + PG219 + t/325/t/326/t/313 all green; clang-format v18 clean.

No catversion / shmem region / wait event / errcode change. HF-4 4-node consecutive-removal e2e + HF-5 concurrency forward to spec-5.19.

Spec: spec-5.18-online-node-leave-fence-cleanup.md (## Hardening v1.1)
1 parent 7c0685a commit b150aa8

8 files changed

Lines changed: 403 additions & 31 deletions

.github/workflows/nightly.yml

Lines changed: 1 addition & 1 deletion
@@ -149,7 +149,7 @@ jobs:
           - { name: stage3-recent, ranges: "228-237", unit: false, regress: false }
           - { name: stage4-wal, ranges: "242-248 273 275", unit: false, regress: false }
           - { name: stage4-hardgate, ranges: "274-274", unit: false, regress: false }
-          - { name: stage5-ges-locking, ranges: "276-325", unit: false, regress: false }
+          - { name: stage5-ges-locking, ranges: "276-326", unit: false, regress: false }
     steps:
       - name: Checkout
         uses: actions/checkout@v4

src/backend/cluster/cluster_node_remove.c

Lines changed: 107 additions & 22 deletions
@@ -270,14 +270,21 @@ cluster_node_remove_self_is_removed(void)
     if (!cluster_enabled || cluster_node_id < 0)
         return false;
     /*
-     * Lock-free hot-path check (called at every writable-xid assignment): the
-     * membership_state[self] byte is the naturally-atomic SSOT. REMOVED is published
-     * into this node's own table by the startup durable-marker rebuild
+     * Lock-free hot-path check (called at every writable-xid assignment). REMOVED
+     * is published into this node's own table by the startup durable-marker rebuild
      * (rebuild_from_disks) and, for a still-running removed node, by the
-     * NODE_REMOVE_ANNOUNCE handler (nr_announce_handler self-demote). No reconfig
-     * lock is taken here (mirrors the clean-leave refuse-writes gate).
+     * NODE_REMOVE_ANNOUNCE handler (nr_announce_handler self-demote).
+     *
+     * HF-2: consult the durable removed_bitmap (lock-free, monotonic) IN ADDITION to
+     * the membership_state[self] byte. membership_state[self] is rewritten every
+     * LMON tick by the joiner / self-state maintenance paths; although those now
+     * carry a REMOVED terminal guard, the durable bitmap is the authoritative floor
+     * and closes any residual window where a stale self-state write could transiently
+     * un-REMOVE this node and open the 53R64 write gate. Either signal = removed.
+     * No reconfig lock is taken (mirrors the clean-leave refuse-writes gate).
      */
-    return cluster_membership_get_state(cluster_node_id) == CLUSTER_MEMBER_REMOVED;
+    return cluster_membership_get_state(cluster_node_id) == CLUSTER_MEMBER_REMOVED
+           || cluster_reconfig_is_removed_unlocked(cluster_node_id);
 }

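The HF-2 gate above treats either signal as "removed": a state byte that periodic maintenance may rewrite, and a bitmap whose bits are only ever set. A minimal sketch of that idea follows, assuming a hypothetical `ClusterView` struct and helper names (`bitmap_test`, `bitmap_set`, `self_is_removed`) that are not part of the commit; it has no shared memory or atomics and only illustrates why a stale state write cannot reopen the gate.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_NODES 64                      /* hypothetical cluster size limit */
#define BITMAP_BYTES (MAX_NODES / 8)

typedef enum { NODE_JOINING, NODE_MEMBER, NODE_REMOVED } NodeState;

/* Hypothetical stand-in for the shared membership/reconfig state. */
typedef struct {
    uint8_t membership_state[MAX_NODES];  /* may be rewritten by maintenance ticks */
    uint8_t removed_bitmap[BITMAP_BYTES]; /* bits are set once and never cleared */
} ClusterView;

/* Test one bit; out-of-range ids are reported as "not removed". */
static bool bitmap_test(const uint8_t *bitmap, int node_id)
{
    if (node_id < 0 || node_id >= MAX_NODES)
        return false;
    return (bitmap[node_id / 8] & (uint8_t)(1u << (node_id % 8))) != 0;
}

/* Set one bit; out-of-range ids are ignored. */
static void bitmap_set(uint8_t *bitmap, int node_id)
{
    if (node_id >= 0 && node_id < MAX_NODES)
        bitmap[node_id / 8] |= (uint8_t)(1u << (node_id % 8));
}

/* Write gate: removed if EITHER the state byte or the bitmap says so. */
static bool self_is_removed(const ClusterView *v, int self_id)
{
    if (self_id < 0 || self_id >= MAX_NODES)
        return false;
    return v->membership_state[self_id] == NODE_REMOVED
           || bitmap_test(v->removed_bitmap, self_id);
}
```

In this toy model, overwriting `membership_state[self]` with `NODE_MEMBER` after the bit is set still leaves `self_is_removed` returning true, which is the fail-closed behaviour the comment describes.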

@@ -441,6 +448,16 @@ cluster_node_remove_request(int32 node_id)
         pg_atomic_write_u32(&nr_state->phase, CLUSTER_REMOVE_REQUESTED);
         pg_atomic_fetch_add_u64(&nr_state->removal_request_count, 1);
     }
+    /*
+     * HF-5: ACCEPTED at the lock-free precheck but the phase advanced out of the
+     * reservable set between the precheck and this exclusive section — a concurrent
+     * request reserved first. Do NOT return a stale ACCEPTED without actually
+     * reserving (the caller would believe its target is being removed when it is
+     * not); downgrade to removal_in_progress so it retries.
+     */
+    else if (verdict == CLUSTER_REMOVE_REQ_ACCEPTED) {
+        verdict = CLUSTER_REMOVE_REQ_IN_PROGRESS;
+    }
     /* RESUME: keep the existing target/epoch; just re-arm the drive from SHRUNK by
      * moving CLEANUP_BLOCKED back to CLEANUP (the lmon_tick retries the cleanup). */
     else if (verdict == CLUSTER_REMOVE_REQ_RESUME && cur_phase == CLUSTER_REMOVE_CLEANUP_BLOCKED) {
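The HF-5 hunk is an instance of "check without the lock, re-check under the lock". The sketch below is a single-threaded model of that pattern, assuming hypothetical names (`RemoveState`, `precheck`, `reserve`) and a reduced phase set; real locking is omitted and the two "concurrent" callers are simulated by taking both prechecks before either reserves.

```c
#include <stdbool.h>

typedef enum { PHASE_IDLE, PHASE_REQUESTED, PHASE_CLEANUP } Phase;
typedef enum { REQ_ACCEPTED, REQ_IN_PROGRESS } Verdict;

/* Hypothetical, simplified removal state. */
typedef struct {
    Phase phase;
    int   target;   /* -1 when no removal is reserved */
} RemoveState;

/* Lock-free precheck: only a snapshot, it may be stale by the time we reserve. */
static Verdict precheck(const RemoveState *s)
{
    return s->phase == PHASE_IDLE ? REQ_ACCEPTED : REQ_IN_PROGRESS;
}

/*
 * Exclusive section (the lock itself is omitted in this sketch).
 * The phase is re-read here; an ACCEPTED verdict that can no longer
 * reserve is downgraded instead of being returned as-is.
 */
static Verdict reserve(RemoveState *s, int target, Verdict verdict)
{
    if (verdict == REQ_ACCEPTED && s->phase == PHASE_IDLE) {
        s->target = target;
        s->phase = PHASE_REQUESTED;
    } else if (verdict == REQ_ACCEPTED) {
        verdict = REQ_IN_PROGRESS;   /* someone else reserved first */
    }
    return verdict;
}
```

If two callers both pass `precheck` while the phase is idle, the first `reserve` wins and the second gets `REQ_IN_PROGRESS`, so neither caller is told its target is being removed when it is not.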
@@ -656,16 +673,50 @@ cluster_node_remove_drive(void)
 void
 cluster_node_remove_survivor_ack(int32 target_node_id, uint64 remove_epoch)
 {
-    /* a non-coordinator survivor drops its local refs to the removed node and
-     * records its ACK so the coordinator's barrier can complete. */
+    uint64 removal_event_id;
+    uint64 removed_incarnation;
+    int32 coordinator;
+
+    /*
+     * HF-1/HF-3 (INV-LF11): a non-coordinator survivor APPLIES the removal
+     * locally — not just drops its N-refs. It seeds the durable removed set +
+     * membership_state[N]=REMOVED (so its own removed_bitmap carries N for any
+     * fence baseline it later publishes, INV-LF10), permanently remasters
+     * N-mastered shards onto a MEMBER survivor, clears N's GES/PCM, and PROVES
+     * zero leftover — and only ACKs once verify passes. The coordinator's final
+     * REMOVED marker (the trust source) is built on "local verify + all-survivor
+     * ACK", so an ACK must mean THIS survivor is genuinely clean, not merely
+     * "dropped some refs". Idempotent: the survivor lmon_tick re-runs it every
+     * tick until it converges (the announce is one-shot, so a transient leftover
+     * must be retried locally, not re-announced).
+     */
     if (nr_state == NULL || target_node_id < 0 || target_node_id >= CLUSTER_MAX_NODES)
         return;

-    (void)cluster_grd_cleanup_on_node_dead(target_node_id);
-    (void)cluster_pcm_lock_clear_pending_x_for_node(target_node_id);
+    /* identity for the seed + ACK = THIS attempt (recorded by the announce
+     * handler / lmon_tick adopt path, HF-4). */
+    LWLockAcquire(&nr_state->lock, LW_SHARED);
+    removal_event_id = nr_state->removal_event_id;
+    removed_incarnation = nr_state->target_last_incarnation;
+    coordinator = nr_state->coordinator_node_id;
+    LWLockRelease(&nr_state->lock);
+
+    /* seed the durable removed set + membership REMOVED with the coordinator's
+     * pinned incarnation floor (carried in the announce, HF-1). */
+    cluster_reconfig_seed_removed_membership(target_node_id, remove_epoch, removed_incarnation,
+                                             /*raise_epoch_floor*/ true);
+
+    /* full cluster-wide cleanup on THIS survivor + zero-leftover proof (HF-3).
+     * run_cleanup bumps leftover_detected_count + returns false when not clean. */
+    if (!cluster_node_remove_run_cleanup(target_node_id, remove_epoch)) {
+        pg_atomic_write_u32(&nr_state->survivor_acked, 0);
+        return; /* leftover -> retry next survivor tick, do NOT ACK */
+    }

-    cluster_node_remove_ic_send_ack(nr_state->coordinator_node_id, target_node_id, remove_epoch,
-                                    nr_state->removal_event_id);
+    /* clean: ACK with THIS attempt's identity so the coordinator's barrier keys on
+     * this removal, not a stale prior one. */
+    cluster_node_remove_ic_send_ack(coordinator, target_node_id, remove_epoch, removal_event_id);
+    pg_atomic_write_u32(&nr_state->survivor_acked, 1);
 }

 void
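The survivor path above follows an "apply, verify, then ACK" discipline, retried on each tick until verification passes. The following toy model sketches only that control flow; the `Survivor` struct, the one-reference-per-pass cleanup, and the function names are illustrative assumptions, not the commit's cleanup logic.

```c
#include <stdbool.h>

/* Hypothetical survivor-side bookkeeping for one removal attempt. */
typedef struct {
    int  leftover_refs;  /* references still naming the removed node */
    bool applied;        /* removal seeded locally */
    bool acked;          /* ACK already sent for this attempt */
    int  acks_sent;      /* counts sends so the test can check "exactly once" */
} Survivor;

/* Idempotent local apply: safe to run on every tick. */
static void apply_removal(Survivor *s)
{
    s->applied = true;
}

/* Toy cleanup: each pass clears one reference, then reports whether any remain. */
static bool cleanup_and_verify(Survivor *s)
{
    if (s->leftover_refs > 0)
        s->leftover_refs--;
    return s->leftover_refs == 0;
}

/* One maintenance tick: ACK only after a clean verify, and only once. */
static void survivor_tick(Survivor *s)
{
    if (s->acked)
        return;
    apply_removal(s);
    if (!cleanup_and_verify(s))
        return;          /* leftover: retry on the next tick, do not ACK */
    s->acks_sent++;
    s->acked = true;
}
```

With two leftover references, the first tick applies the removal but sends nothing, the second tick verifies clean and ACKs, and later ticks are no-ops.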
@@ -674,6 +725,7 @@ cluster_node_remove_lmon_tick(void)
     ClusterRemovePhase phase;
     int32 node_id;
     int32 coordinator;
+    uint64 remove_epoch;

     if (nr_state == NULL || !cluster_enabled || !cluster_online_node_removal)
         return;
@@ -684,6 +736,7 @@ cluster_node_remove_lmon_tick(void)
     phase = (ClusterRemovePhase)pg_atomic_read_u32(&nr_state->phase);
     node_id = nr_state->target_node_id;
     coordinator = nr_state->coordinator_node_id;
+    remove_epoch = nr_state->remove_epoch;
     LWLockRelease(&nr_state->lock);

     if (node_id < 0)
@@ -695,10 +748,20 @@ cluster_node_remove_lmon_tick(void)
         if (phase >= CLUSTER_REMOVE_CLEANUP && phase <= CLUSTER_REMOVE_CLEANUP_BLOCKED
             && pg_atomic_read_u32(&nr_state->announce_sent) == 0) {
             cluster_node_remove_ic_broadcast_announce(node_id, nr_state->remove_epoch,
-                                                      nr_state->removal_event_id);
+                                                      nr_state->removal_event_id,
+                                                      nr_state->target_last_incarnation);
             pg_atomic_write_u32(&nr_state->announce_sent, 1);
         }
         cluster_node_remove_drive();
+    } else {
+        /*
+         * HF-1/HF-3 (INV-LF11): survivor side — (re)apply the recorded removal
+         * locally + ACK; retry each tick until verify converges. The announce is
+         * one-shot, so a transient leftover (or an announce that arrived before the
+         * GRD/PCM state was settleable) must be retried here, not re-announced.
+         */
+        if (pg_atomic_read_u32(&nr_state->survivor_acked) == 0)
+            cluster_node_remove_survivor_ack(node_id, remove_epoch);
     }
 }

@@ -707,7 +770,7 @@ cluster_node_remove_lmon_tick(void)
  * IC wire (D10): NODE_REMOVE_ANNOUNCE (broadcast) + REMOVE_CLEANUP_ACK (p2p).
  * ============================================================ */

-/* survivor side: a coordinator announced a removal — drop refs + ACK. */
+/* survivor side: a coordinator announced a removal — apply it locally + ACK. */
 static void
 nr_announce_handler(const ClusterICEnvelope *env, const void *payload)
 {
@@ -721,24 +784,38 @@ nr_announce_handler(const ClusterICEnvelope *env, const void *payload)
          * into our own membership table (lock-free SSOT for the self-demote write
          * gate) + the durable removed set so a still-running removed node fail-closes
          * new writable transactions (53R64) instead of serving as a phantom member.
-         * Do NOT send a cleanup ACK — a removed node is not a survivor.
+         * HF-1: pin the coordinator's incarnation floor from the announce so a future
+         * re-admit must present a strictly newer incarnation. Do NOT send a cleanup
+         * ACK — a removed node is not a survivor.
          */
         cluster_reconfig_seed_removed_membership(cluster_node_id, p->remove_epoch,
-                                                 0 /* incarnation floor set by coordinator */,
+                                                 p->removed_incarnation,
                                                  /*raise_epoch_floor*/ false);
         return;
     }

-    /* record the attempt so our ACK echoes the right identity. */
+    /*
+     * HF-4: adopt THIS removal attempt's identity when our recorded one is absent,
+     * terminal, or a different attempt — a survivor never runs the driver's
+     * abort/commit resets, so a prior attempt's identity would otherwise linger and
+     * get this attempt's ACK rejected by the coordinator (event_id mismatch),
+     * wedging the next removal's cleanup barrier. Same event_id = an idempotent
+     * re-announce: keep progress (do not reset survivor_acked).
+     */
     LWLockAcquire(&nr_state->lock, LW_EXCLUSIVE);
-    if (nr_state->target_node_id < 0) {
+    if (nr_state->removal_event_id != p->removal_event_id
+        || nr_state->target_node_id != p->target_node_id) {
         nr_state->target_node_id = p->target_node_id;
         nr_state->coordinator_node_id = p->coordinator_node_id;
         nr_state->remove_epoch = p->remove_epoch;
         nr_state->removal_event_id = p->removal_event_id;
+        nr_state->target_last_incarnation = p->removed_incarnation;
+        pg_atomic_write_u32(&nr_state->survivor_acked, 0); /* re-apply for the new attempt */
     }
     LWLockRelease(&nr_state->lock);

+    /* INV-LF11: apply the removal locally + ACK when clean (the survivor lmon_tick
+     * retries until verify converges). */
     cluster_node_remove_survivor_ack(p->target_node_id, p->remove_epoch);
 }
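The HF-4 adopt rule in the announce handler can be read as: a different (event_id, target) pair replaces the recorded attempt and resets progress, while a repeat of the same pair changes nothing. A minimal sketch under that reading follows; `Announce`, `Recorded`, and `adopt_if_new` are hypothetical names, and locking is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of the announce payload. */
typedef struct {
    int32_t  target;
    uint64_t epoch;
    uint64_t event_id;
    uint64_t incarnation;
} Announce;

/* Hypothetical survivor-side record of the attempt it is working on. */
typedef struct {
    int32_t  target;       /* -1 when nothing is recorded */
    uint64_t epoch;
    uint64_t event_id;
    uint64_t incarnation;
    bool     acked;
} Recorded;

/*
 * Adopt the announced attempt when it differs from the recorded one.
 * Returns true when a new attempt was adopted (progress is reset),
 * false for an idempotent re-announce (progress is kept).
 */
static bool adopt_if_new(Recorded *r, const Announce *a)
{
    if (r->event_id == a->event_id && r->target == a->target)
        return false;
    r->target = a->target;
    r->epoch = a->epoch;
    r->event_id = a->event_id;
    r->incarnation = a->incarnation;
    r->acked = false;      /* the new attempt must be applied and ACKed again */
    return true;
}
```

The point of resetting `acked` only on adoption is that a duplicate announce cannot trigger a second apply, while a later removal is never answered with the previous attempt's identity.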

@@ -753,11 +830,18 @@ nr_ack_handler(const ClusterICEnvelope *env, const void *payload)
         return;
     if (p->survivor_node_id < 0 || p->survivor_node_id >= CLUSTER_NODE_REMOVE_ACK_BITMAP_BYTES * 8)
         return;
-    if (p->removal_event_id != nr_state->removal_event_id)
-        return; /* stale ACK from a prior attempt */

+    /*
+     * HF-4: an ACK counts toward the barrier only for THIS exact removal attempt —
+     * validate the full identity tuple (target, epoch, event_id) under the lock, not
+     * just event_id, so a stale ACK from a prior attempt can never satisfy the
+     * current barrier (and the snapshot is consistent with the bitmap write).
+     */
     LWLockAcquire(&nr_state->lock, LW_EXCLUSIVE);
-    nr_state->ack_bitmap[p->survivor_node_id / 8] |= (uint8)(1u << (p->survivor_node_id % 8));
+    if (p->removal_event_id == nr_state->removal_event_id
+        && p->target_node_id == nr_state->target_node_id
+        && p->remove_epoch == nr_state->remove_epoch)
+        nr_state->ack_bitmap[p->survivor_node_id / 8] |= (uint8)(1u << (p->survivor_node_id % 8));
     LWLockRelease(&nr_state->lock);
 }

@@ -785,7 +869,7 @@ cluster_node_remove_register_ic_msg_types(void)

 void
 cluster_node_remove_ic_broadcast_announce(int32 target_node_id, uint64 remove_epoch,
-                                          uint64 removal_event_id)
+                                          uint64 removal_event_id, uint64 removed_incarnation)
 {
     ClusterNodeRemoveAnnouncePayload p;
     ClusterICFanoutResult per_peer[CLUSTER_MAX_NODES];
@@ -797,6 +881,7 @@ cluster_node_remove_ic_broadcast_announce(int32 target_node_id, uint64 remove_ep
     p.target_node_id = target_node_id;
     p.remove_epoch = remove_epoch;
     p.removal_event_id = removal_event_id;
+    p.removed_incarnation = removed_incarnation; /* HF-1: incarnation floor for survivor seed */
     cluster_node_remove_announce_compute_crc(&p);

     cluster_ic_send_envelope_fanout(PGRAC_IC_MSG_NODE_REMOVE_ANNOUNCE, &p, (uint32)sizeof(p),

src/backend/cluster/cluster_reconfig.c

Lines changed: 43 additions & 1 deletion
@@ -844,6 +844,25 @@ cluster_reconfig_is_removed(int32 node_id)
     return removed;
 }

+/*
+ * spec-5.18 HF-2: lock-free removed test for the 53R64 self-demote write gate
+ * (cluster_node_remove_self_is_removed), called at every writable-xid assignment.
+ * Reads a single removed_bitmap bit without the reconfig lock — safe because the
+ * removed set is monotonic at runtime (a removal is terminal, INV-LF1; only an
+ * operator un-fence, not implemented, could clear it), so a one-byte read cannot
+ * tear and the bit never spuriously clears. The durable bitmap is the
+ * authoritative floor: unlike membership_state[self] (which the joiner / lmon
+ * self-state paths rewrite each tick), it cannot be flipped REMOVED -> not-removed
+ * by membership churn, so the write gate stays fail-closed for a removed node.
+ */
+bool
+cluster_reconfig_is_removed_unlocked(int32 node_id)
+{
+    if (ReconfigShmem == NULL || node_id < 0 || node_id >= CLUSTER_RECONFIG_DEAD_BITMAP_BYTES * 8)
+        return false;
+    return (ReconfigShmem->removed_bitmap[node_id / 8] & (uint8)(1u << (node_id % 8))) != 0;
+}
+
 uint64
 cluster_reconfig_get_removed_epoch(int32 node_id)
 {
@@ -1246,6 +1265,17 @@ cluster_reconfig_joiner_self_tick(void)
     if (cluster_node_id < 0 || cluster_node_id >= CLUSTER_MAX_NODES)
         return;

+    /*
+     * spec-5.18 INV-LF9 (HF-2): a durably-removed node (this node was permanently
+     * removed; startup rebuild seeded removed_bitmap[self]) must NOT run the joiner
+     * gate — doing so would flip its own membership_state REMOVED -> JOINING/MEMBER
+     * (the REJOINER/bootstrap branches below) and defeat the 53R64 self-demote write
+     * gate. A removed node can only return via operator un-fence + a fresh-
+     * incarnation join (external plane, §1.3) — never by re-running this gate.
+     */
+    if (cluster_reconfig_is_removed_unlocked(cluster_node_id))
+        return;
+
     now_us = (uint64)GetCurrentTimestamp();

     /*
@@ -1444,9 +1474,21 @@ cluster_reconfig_lmon_tick(void)

     LWLockAcquire(&ReconfigShmem->lock, LW_EXCLUSIVE);

+    /*
+     * spec-5.18 INV-LF9 (HF-2): REMOVED is TERMINAL for self too. A removed
+     * node still running / restarted (rebuild seeded removed_bitmap[self] +
+     * membership_state[self]=REMOVED) must keep self REMOVED — the joiner gate
+     * below must NOT flip it back to MEMBER/JOINING, which would defeat the
+     * 53R64 self-demote write gate and let a removed node serve writes. The
+     * durable removed_bitmap is the authoritative floor (mirrors the peer
+     * REMOVED terminal guard further down + the joiner_self_tick guard).
+     */
+    if (clean_departed_test_bit_locked(ReconfigShmem->removed_bitmap, self_id)
+        || cluster_membership_get_state(self_id) == CLUSTER_MEMBER_REMOVED)
+        cluster_membership_set_state(self_id, CLUSTER_MEMBER_REMOVED);
     /* self-state follows the joiner gate (D5): MEMBER when admitted / steady,
      * JOINING while this node is itself a joiner whose gate is closed. */
-    if (ReconfigShmem->self_join_admitted && !ReconfigShmem->self_join_failed)
+    else if (ReconfigShmem->self_join_admitted && !ReconfigShmem->self_join_failed)
         cluster_membership_set_state(self_id, CLUSTER_MEMBER_MEMBER);
     else
         cluster_membership_set_state(self_id, CLUSTER_MEMBER_JOINING);
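The self-state update in `cluster_reconfig_lmon_tick` can be viewed as a small transition function in which REMOVED is an absorbing state. A sketch of that function, with a hypothetical enum and signature chosen for illustration, might look like this:

```c
#include <stdbool.h>

typedef enum { ST_JOINING, ST_MEMBER, ST_REMOVED } SelfState;

/*
 * Compute the next self-state for one maintenance tick.
 * REMOVED wins if either the durable bit or the current state says so;
 * only otherwise does the joiner gate decide between MEMBER and JOINING.
 */
static SelfState next_self_state(SelfState cur, bool removed_bit,
                                 bool join_admitted, bool join_failed)
{
    if (removed_bit || cur == ST_REMOVED)
        return ST_REMOVED;
    if (join_admitted && !join_failed)
        return ST_MEMBER;
    return ST_JOINING;
}
```

Because the removed check comes first, an admitted joiner gate cannot move a removed node back to MEMBER, whichever of the two signals is present.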

src/include/cluster/cluster_node_remove.h

Lines changed: 15 additions & 6 deletions
@@ -251,7 +251,13 @@ StaticAssertDecl(sizeof(ClusterNodeRemoveState) <= 512, "ClusterNodeRemoveState
  * message being acted on as a removal command (8.A defense-in-depth).
  * ============================================================ */
 #define CLUSTER_NODE_REMOVE_IC_MAGIC 0x50524943 /* "PRIC" — pgrac remove IC */
-#define CLUSTER_NODE_REMOVE_IC_VERSION 1
+/*
+ * Wire version. v2 (spec-5.18 Hardening v1.1, HF-1): the announce payload
+ * carries removed_incarnation so a non-target survivor can seed its own
+ * membership_state[N]=REMOVED with the correct incarnation floor when it
+ * applies the removal locally (INV-LF11), not just drop its N-refs.
+ */
+#define CLUSTER_NODE_REMOVE_IC_VERSION 2

 /*
  * NODE_REMOVE_ANNOUNCE payload — coordinator -> all survivors (broadcast).
@@ -263,10 +269,12 @@ typedef struct ClusterNodeRemoveAnnouncePayload {
     uint16 version;
     uint16 _pad0;
     int32 coordinator_node_id;
-    int32 target_node_id;    /* who is removed */
-    uint64 remove_epoch;     /* the removal reconfig new_epoch */
-    uint64 removal_event_id; /* identity bound to this removal attempt */
-    uint32 crc;              /* CRC32C over [magic..removal_event_id] */
+    int32 target_node_id;       /* who is removed */
+    uint64 remove_epoch;        /* the removal reconfig new_epoch */
+    uint64 removal_event_id;    /* identity bound to this removal attempt */
+    uint64 removed_incarnation; /* HF-1: target's pinned incarnation floor (for
+                                 * survivor-local membership REMOVED seed, INV-LF11) */
+    uint32 crc;                 /* CRC32C over [magic..removed_incarnation] */
 } ClusterNodeRemoveAnnouncePayload;

 /*
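The struct comment says the CRC covers `[magic..removed_incarnation]`, that is, every field before `crc`. One way to express "everything before the trailing checksum" is `offsetof`, as in the sketch below. The struct is a stand-in: the leading `magic` field is an assumption inferred from that comment, the bitwise CRC-32C routine is a plain reference implementation rather than the project's, and the helper names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the v2 announce payload (leading magic field is assumed). */
typedef struct {
    uint32_t magic;
    uint16_t version;
    uint16_t pad0;
    int32_t  coordinator_node_id;
    int32_t  target_node_id;
    uint64_t remove_epoch;
    uint64_t removal_event_id;
    uint64_t removed_incarnation;
    uint32_t crc;   /* covers every byte before this field */
} AnnouncePayload;

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t crc32c(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Zero the whole struct first so padding bytes never leak onto the wire. */
static void payload_init(AnnouncePayload *p)
{
    memset(p, 0, sizeof(*p));
    p->magic = 0x50524943u;
    p->version = 2;
}

static void payload_set_crc(AnnouncePayload *p)
{
    p->crc = crc32c(p, offsetof(AnnouncePayload, crc));
}

static bool payload_crc_ok(const AnnouncePayload *p)
{
    return p->crc == crc32c(p, offsetof(AnnouncePayload, crc));
}
```

Using `offsetof(AnnouncePayload, crc)` as the length means a field appended before `crc`, such as `removed_incarnation`, is covered automatically, which matches the updated struct comment.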
@@ -435,7 +443,8 @@ extern bool cluster_node_remove_verify_no_leftover(int32 node_id);
  * msg-type registration block, postmaster phase 1). */
 extern void cluster_node_remove_register_ic_msg_types(void);
 extern void cluster_node_remove_ic_broadcast_announce(int32 target_node_id, uint64 remove_epoch,
-                                                      uint64 removal_event_id);
+                                                      uint64 removal_event_id,
+                                                      uint64 removed_incarnation);
 extern void cluster_node_remove_ic_send_ack(int32 dest_node_id, int32 target_node_id,
                                             uint64 remove_epoch, uint64 removal_event_id);

src/include/cluster/cluster_reconfig.h

Lines changed: 2 additions & 0 deletions
@@ -659,6 +659,8 @@ extern uint64 cluster_reconfig_get_clean_departed_count(void);
 extern void cluster_reconfig_record_removed(int32 node_id, uint64 remove_epoch,
                                             bool raise_epoch_floor);
 extern bool cluster_reconfig_is_removed(int32 node_id);
+/* HF-2: lock-free durable removed test for the 53R64 self-demote write-gate hot path. */
+extern bool cluster_reconfig_is_removed_unlocked(int32 node_id);
 extern uint64 cluster_reconfig_get_removed_epoch(int32 node_id);
 extern uint64 cluster_reconfig_get_removed_count(void);
 extern void

src/test/cluster_tap/t/325_cluster_5_18_node_remove.pl

Lines changed: 28 additions & 0 deletions
@@ -181,6 +181,34 @@ sub poll_until
     'L8 node1 write fails with a cluster fail-closed error (never a silent success, INV-LF8)');


+# ----------
+# L12 (Hardening v1.1, HF-2): a removed node that RESTARTS must STAY removed. On
+# bring-up node1 reads its own durable removal marker (removed_bitmap[self] +
+# membership_state[self]=REMOVED); its lmon self-state maintenance must NOT flip
+# that back to JOINING/MEMBER (before the fix the joiner / self-state path rewrote
+# self each tick, defeating the 53R64 self-demote write gate and letting a removed
+# node serve writes). node0 stays up so node1 is held removed + fenced.
+# ----------
+$pair->node1->restart;
+usleep(6_000_000); # several lmon ticks (1s heartbeat / 500ms qvotec poll)
+# node1's own view of itself stays removed across the tick churn (the HF-2 signal:
+# self membership_state is NOT flipped REMOVED -> JOINING/MEMBER).
+ok(poll_until($pair->node1,
+        q{SELECT state = 'removed' FROM pg_cluster_membership WHERE node_id = 1},
+        't', 30, 'L12 node1 self stays removed after restart'),
+    'L12 node1 self-state stays removed across lmon ticks after restart (HF-2)');
+# Forcing an xid assignment (txid_current) hits the AssignTransactionId
+# self-demote gate directly: a removed node must fail closed there (53R64),
+# proving cluster_node_remove_self_is_removed held across the tick churn (HF-2) and
+# was not defeated by a stale self-state write.
+my ($hf2rc, $hf2out, $hf2err) =
+    $pair->node1->psql('postgres', 'SELECT txid_current()');
+ok($hf2rc != 0, 'L12 restarted removed node fail-closes xid assignment (HF-2)');
+like(($hf2err // ''),
+    qr/53R64|removed/i,
+    'L12 restarted removed node xid assignment blocked by 53R64 self-demote gate (HF-2)');
+
+
 # ----------
 # L10 crash-recovery-from-marker (INV-LF7): restart node0; the durable §2.5
 # SHRUNK/REMOVED marker must rebuild removed_bitmap + membership REMOVED so node1
