Commit b0fd6dc

SqlRush committed
feat(cluster): spec-5.18 online node leave — permanent removal: fence + cluster-wide cleanup
Permanent decommission of an already-left (clean-leave / fail-stop) declared node: durable membership shrink (membership_state REMOVED) + spec-4.12 cooperative write-fence arm (revival defense) + cluster-wide cleanup_on_exit (GRD remaster / GES+PCM clear / verify_no_leftover), driven by a survivor coordinator. Opt-in, off by default.

- cluster_node_remove{,_policy,_views}.c + cluster_node_remove.h: 10-phase survivor-coordinator state machine, three-phase durable removal marker (REMOVING/SHRUNK/REMOVED) riding the voting-slot _reserved1 with R12 carry-forward, IC announce/ack orchestration, crash-recovery from the marker, self-demote predicate.
- cluster_reconfig: append reconfig_kind NODE_REMOVED, removed_bitmap + removed_epoch, effective_dead mask (dead & ~clean_departed & ~removed), removal event-id, coordinator commit (fence-before-shrink ordering).
- cluster_membership: REMOVED terminal state (kept terminal in the lmon maintenance loop), REJECT_REMOVED_FENCED vet, member_count, shrink_to_removed.
- cluster_qvotec: permanent fence baseline (dead|removed, guarded), removal marker carry-forward + clean-shutdown preserve.
- cluster_write_fence: marker_kind NODE_REMOVED.
- xact: self-demote writable-tx gate (53R64) for a removed node.
- 2 GUC (cluster.online_node_removal off, cluster.node_removal_cleanup_timeout_ms), 53R63/53R64, ReconfigNodeRemoveCleanupWait wait event, 7 inject points, pg_cluster_node_removal_state SRF + pg_cluster_remove_node(int) UDF, pg_cluster_membership +2 cols (removed, removed_epoch), catversion bump.

Tests: test_cluster_node_remove (pure policy) + membership U5/U7; cluster_regress cluster_node_remove; cluster_tap t/325 (2-node e2e: fence-before-shrink, membership shrink, no-leftover, zombie-write-safe, crash-recovery-from-marker, terminal-REMOVED); D21 baseline syncs.

Manual: docs/cluster/node-removal.md.

Spec: spec-5.18-online-node-leave-fence-cleanup.md
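The cluster_reconfig item in the commit message gives the effective_dead mask as `dead & ~clean_departed & ~removed`. The sketch below only illustrates that expression. The 64-bit bitmap type, the `node_bit` helper and the function signature are assumptions made for the example and are not taken from the commit's sources.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bitmap type: bit i set means node i is in the set. */
typedef uint64_t node_bitmap;

/* Returns a bitmap with only node_id's bit set (illustrative helper). */
node_bitmap node_bit(unsigned node_id)
{
    assert(node_id < 64);
    return (node_bitmap)1 << node_id;
}

/*
 * Mask expression quoted in the commit message: a node that left cleanly
 * or was permanently removed is dropped from the dead set.
 */
node_bitmap effective_dead(node_bitmap dead,
                           node_bitmap clean_departed,
                           node_bitmap removed)
{
    return dead & ~clean_departed & ~removed;
}
```

For instance, with nodes 1, 2 and 3 in `dead`, node 2 in `clean_departed` and node 3 in `removed`, only node 1 remains in the result.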
1 parent b352a46 commit b0fd6dc

51 files changed

Lines changed: 4022 additions & 53 deletions

docs/cluster/node-removal.md

Lines changed: 126 additions & 0 deletions
@@ -0,0 +1,126 @@
+# Permanent Node Removal — User Manual
+
+> Status: introduced in linkdb stage 5.18.
+
+## Overview
+
+**Permanent node removal** decommissions a declared cluster node for good. It is
+the finishing step after a node has already left — either by a cooperative
+[clean leave](clean-leave.md) or by a crash (fail-stop) — and you have decided it
+will not come back.
+
+Removing a node does three things:
+
+1. **Shrinks the membership.** The node's effective membership state becomes
+   `removed`. Surviving nodes stop counting it as a member, stop expecting it in
+   membership barriers, and never again treat its heartbeats as a returning
+   member.
+2. **Fences it.** A removed node is write-fenced on the shared storage: even if a
+   stale instance of it comes back, it cannot write shared data, cannot fool a
+   surviving node, and cannot passively rejoin.
+3. **Cleans up after it cluster-wide.** Every lock-directory shard the removed
+   node mastered is permanently reassigned to a surviving member, its global lock
+   and page-lock leftovers are cleared, its voting-disk record is tombstoned, and
+   the cluster verifies that zero references to it remain anywhere.
+
+Removal is **opt-in** and **off by default**. It only works on a node that has
+already left — you cannot remove a live, actively-serving node (you must clean-
+leave or stop it first, so its in-memory data is not lost).
+
+## Configuration
+
+| GUC | Default | Range | Context | Description |
+|---|---|---|---|---|
+| `cluster.online_node_removal` | `off` | boolean | `POSTMASTER` | Enable permanent node removal on this node. When `off`, `pg_cluster_remove_node()` returns `rejected:feature_disabled` and no removal, fence, or cleanup runs. |
+| `cluster.node_removal_cleanup_timeout_ms` | `30000` | `[5000, 120000]` | `SIGHUP` | Deadline for the post-shrink cluster-wide cleanup. If cleanup does not finish (all surviving members acknowledge and zero leftovers are proven) within this window, the removal enters a resumable `cleanup_blocked` state — it is never reported complete until cleanup actually succeeds. |
+
+To use node removal, set `cluster.online_node_removal = on` in `pgrac.conf` (or
+`postgresql.conf`) on the surviving nodes and restart, then issue the request
+below on a surviving node.
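As a rough illustration of the cleanup deadline described in the configuration table, the following sketch models the rule "all surviving members acknowledge and zero leftovers are proven within the window, otherwise `cleanup_blocked`". Every name here is hypothetical. In particular, computing the deadline as start time plus the timeout in microseconds, and checking completion before the deadline, are assumptions rather than the commit's actual logic.

```c
#include <stdbool.h>
#include <stdint.h>

/* Documented range of cluster.node_removal_cleanup_timeout_ms. */
#define CLEANUP_TIMEOUT_MIN_MS 5000
#define CLEANUP_TIMEOUT_MAX_MS 120000

/* True when timeout_ms lies inside the documented range [5000, 120000]. */
bool cleanup_timeout_in_range(int timeout_ms)
{
    return timeout_ms >= CLEANUP_TIMEOUT_MIN_MS &&
           timeout_ms <= CLEANUP_TIMEOUT_MAX_MS;
}

/* Assumption: deadline = cleanup start + timeout, both in microseconds. */
int64_t cleanup_deadline_us(int64_t start_us, int timeout_ms)
{
    return start_us + (int64_t)timeout_ms * 1000;
}

typedef enum { CLEANUP_RUNNING, CLEANUP_DONE, CLEANUP_BLOCKED } cleanup_verdict;

/*
 * Done only when every survivor has acknowledged and no leftover reference
 * remains; otherwise blocked once the deadline has passed, else still running.
 */
cleanup_verdict cleanup_check(int64_t now_us, int64_t deadline_us,
                              int ack_count, int survivor_count, int leftovers)
{
    if (ack_count >= survivor_count && leftovers == 0)
        return CLEANUP_DONE;
    if (now_us >= deadline_us)
        return CLEANUP_BLOCKED;
    return CLEANUP_RUNNING;
}
```

With the default of 30000 ms, a cleanup that starts at 1000000 µs would, under this assumption, have a deadline of 31000000 µs.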
+
+## Removing a node
+
+```sql
+SELECT pg_cluster_remove_node(<node_id>);
+```
+
+Superuser only. Run it on a **surviving** node (not the node being removed). The
+function returns a short text status:
+
+| Result | Meaning |
+|---|---|
+| `accepted` | The removal was accepted and is running. Watch `pg_cluster_node_removal_state` for progress until `phase` reaches `committed`. |
+| `noop:already_removed` | The node was already permanently removed. |
+| `resume:cleanup_pending` | The node is already shrunk out and fenced, but its cleanup had not finished; the request resumes and completes the cleanup. |
+| `rejected:feature_disabled` | `cluster.online_node_removal` is `off` on this node. |
+| `rejected:cannot_remove_self` | You cannot remove the node you are connected to. |
+| `rejected:not_declared` | `node_id` is not a declared cluster node. |
+| `rejected:node_not_drained` | The node is still active and has not left. Clean-leave or stop it first — removal never force-evicts a live node. |
+| `rejected:not_in_quorum` | This node is not currently in quorum. |
+| `rejected:removal_in_progress` | A different removal is already in progress. |
+
+After the request returns `accepted`, poll the progress view until `phase`
+reaches `committed`:
+
+```sql
+SELECT phase FROM pg_cluster_node_removal_state; -- wait for 'committed'
+```
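The status strings in the result table fall into a few families by prefix, so a client tool might group them before deciding what to do next. The sketch below shows one hypothetical grouping. The enum, the function name, and the choice to treat `resume:` like `accepted` (on the guess that the operator would still watch the progress view) are illustrative assumptions, not part of the commit.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical client-side grouping of the returned status text. */
typedef enum {
    REMOVE_IN_FLIGHT, /* "accepted" or "resume:*": watch the progress view */
    REMOVE_NOOP,      /* "noop:*": nothing left to do */
    REMOVE_REJECTED,  /* "rejected:*": a precondition was not met */
    REMOVE_UNKNOWN    /* anything this sketch does not recognize */
} remove_outcome;

/* Classify a status string by its documented value or prefix. */
remove_outcome classify_remove_status(const char *status)
{
    if (status == NULL)
        return REMOVE_UNKNOWN;
    if (strcmp(status, "accepted") == 0)
        return REMOVE_IN_FLIGHT;
    if (strncmp(status, "resume:", strlen("resume:")) == 0)
        return REMOVE_IN_FLIGHT;
    if (strncmp(status, "noop:", strlen("noop:")) == 0)
        return REMOVE_NOOP;
    if (strncmp(status, "rejected:", strlen("rejected:")) == 0)
        return REMOVE_REJECTED;
    return REMOVE_UNKNOWN;
}
```

Matching on the prefix rather than the full string means a rejection reason this sketch has never seen would still be classified as a rejection.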
+
+## Progress: `pg_cluster_node_removal_state`
+
+An always-one-row view of the removal in progress (read-only; granted to PUBLIC).
+When idle, `phase` is `idle` and `target_node_id` is `-1`.
+
+| Column | Type | Description |
+|---|---|---|
+| `phase` | text | `idle`, `requested`, `precheck`, `fence_arming`, `shrink_committing`, `cleanup`, `cleanup_blocked`, `committed`, `aborted`, `aborted_escalate`. |
+| `target_node_id` | int4 | The node being removed, or `-1` when idle. |
+| `coordinator_node_id` | int4 | The surviving node driving the removal. |
+| `remove_epoch` | int8 | The membership epoch the removal is bound to. |
+| `fence_armed` | bool | The removed node's write fence is majority-durable. |
+| `membership_shrunk` | bool | The membership has been shrunk (the node is a non-member). |
+| `grd_cleaned` | bool | Lock-directory remaster + cleanup is done. |
+| `pcm_cleaned` | bool | Page-lock cleanup is done. |
+| `ack_count` | int4 | Surviving members that acknowledged the cleanup. |
+| `deadline_us` | int8 | Cleanup deadline, or `NULL` before cleanup starts. |
+| `removal_committed_count` | int8 | Lifetime count of completed removals. |
+| `cleanup_blocked_count` | int8 | Lifetime count of cleanups that hit the deadline and resumed. |
+| `leftover_detected_count` | int8 | Lifetime count of leftover-reference detections (fail-closed). |
+| `zombie_write_rejected_count` | int8 | Lifetime count of write attempts rejected from removed nodes. |
+
+A committed removal also surfaces in `pg_cluster_reconfig_state` with
+`reconfig_kind = 'node_removed'`, and in `pg_cluster_membership` the node's row
+shows `state = 'removed'`, `removed = true`, and a non-zero `removed_epoch`.
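A monitoring script reading this view could sanity-check each row it sees. The sketch below lists the ten documented phase strings and expresses the fence-before-shrink ordering named in the commit message as a predicate over `fence_armed` and `membership_shrunk`. The function names are hypothetical, and treating `committed`, `aborted` and `aborted_escalate` as the phases on which a poller stops is an assumption, since the manual only says to wait for `committed`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Phase names copied from the description of the phase column. */
static const char *const removal_phases[] = {
    "idle", "requested", "precheck", "fence_arming", "shrink_committing",
    "cleanup", "cleanup_blocked", "committed", "aborted", "aborted_escalate"
};

/* True if name is one of the ten documented phase strings. */
bool removal_phase_is_known(const char *name)
{
    size_t n = sizeof(removal_phases) / sizeof(removal_phases[0]);
    for (size_t i = 0; i < n; i++)
        if (strcmp(name, removal_phases[i]) == 0)
            return true;
    return false;
}

/* Assumption: a poller may stop once one of these phases is reported. */
bool removal_phase_is_settled(const char *name)
{
    return strcmp(name, "committed") == 0 ||
           strcmp(name, "aborted") == 0 ||
           strcmp(name, "aborted_escalate") == 0;
}

/*
 * Fence-before-shrink: a row showing membership_shrunk without fence_armed
 * would contradict the ordering the commit message describes.
 */
bool removal_row_respects_fence_order(bool fence_armed, bool membership_shrunk)
{
    return !membership_shrunk || fence_armed;
}
```

A row with `membership_shrunk = true` and `fence_armed = false` fails the last predicate, which is the kind of observation an operator would want to flag.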
+
+## Error codes
+
+| Code | Name | Meaning | Retry |
+|---|---|---|---|
+| `53R63` | `cluster_node_removal_in_progress` | A writable transaction was rolled back while a removal epoch was publishing. | Retry — retry is safe on the new epoch. |
+| `53R64` | `cluster_node_removed_fenced` | A removed node tried to serve or rejoin the cluster. | **No.** The node must be re-admitted by an operator before it can return (see below). |
+| `53R51` | `cluster_write_fenced` | A removed (fenced) node tried to write shared storage. | The write is refused; the node is no longer a member. |
+
+## A removed node does not come back automatically
+
+This is the most important operational point. **In this version, a removed node
+cannot rejoin on its own.** Restarting it, or reconnecting it to the cluster,
+does not bring it back — it stays fenced and is refused (`53R64`). Bringing a
+removed node back is a deliberate operator action (un-fencing it) that is **not
+provided as a command in this version**; it is a future operational procedure.
+
+A single `join` is therefore only half of what a return would require — the node
+must first be explicitly un-fenced. This is intentional: a removed node is treated
+as gone until an operator decides, out of band, that it is safe to return.
+
+## Production note: external fencing
+
+The write fence applied by node removal is a **cooperative** fence: it relies on
+the removed node's own storage and startup paths to honor it. For a node that is
+healthy and correctly configured, that is sufficient — it will not write shared
+data after removal.
+
+For a hard guarantee against a malfunctioning or malicious instance (one that
+does not honor the cooperative fence), a production deployment should pair node
+removal with an **external fencer** at the node level (for example STONITH, IPMI
+power control, or a cloud power/network API). External fencing is outside the
+database and is configured by your cluster operations tooling.

src/backend/access/transam/xact.c

Lines changed: 21 additions & 0 deletions
@@ -155,6 +155,7 @@
 #include "cluster/cluster_undo_record_api.h" /* PGRAC: spec-3.7 D16 PREPARE guard */
 #include "cluster/cluster_touched_peers.h" /* PGRAC: spec-5.14 D1 per-tx reset */
 #include "cluster/cluster_clean_leave.h" /* PGRAC: spec-5.13 §3.1 refuse-writes gate */
+#include "cluster/cluster_node_remove.h" /* PGRAC: spec-5.18 INV-LF9 self-demote gate */
 #include "cluster/cluster_reconfig.h" /* PGRAC: spec-5.15 §2.4 joiner write gate */
 #include "cluster/storage/cluster_undo_xlog.h" /* PGRAC: spec-3.18 D4.1 TT fold redo stamp */
 #endif
@@ -756,6 +757,26 @@ AssignTransactionId(TransactionState s)
                      errhint("reconnect to a surviving node and retry;"
                              " the retry is safe")));
 
+    /*
+     * PGRAC MODIFICATIONS (spec-5.18 INV-LF9, self-demote gate)
+     *
+     * What changed: if THIS node observes that it has itself been permanently
+     * removed from the cluster (durable removed set / membership_state==REMOVED),
+     * it must stop serving — a removed node is a non-member and is fenced off the
+     * shared storage, so any write it starts is both refused at the smgr gate
+     * (53R51) and semantically invalid (it is no longer part of the cluster). We
+     * fail-closed at xid assignment with 53R64 (FATAL, non-retry): the node cannot
+     * rejoin by retrying — it must be re-admitted (un-fenced) by an operator.
+     * Read-only transactions assign no xid and are unaffected.
+     */
+    if (cluster_node_remove_self_is_removed())
+        ereport(FATAL,
+                (errcode(ERRCODE_CLUSTER_NODE_REMOVED_FENCED),
+                 errmsg("cannot start a writable transaction:"
+                        " this node has been permanently removed from the cluster"),
+                 errhint("this node must be re-admitted (un-fenced) by an operator"
+                         " before it can rejoin the cluster")));
+
     /*
      * PGRAC MODIFICATIONS (spec-5.15 §2.4 / INV-J9, node-local joiner write gate)
      *

src/backend/catalog/system_views.sql

Lines changed: 37 additions & 1 deletion
@@ -1609,13 +1609,17 @@ GRANT SELECT ON pg_cluster_reconfig_state TO PUBLIC;
 -- observed), last_admitted_incarnation (the monotonic floor, INV-J1),
 -- admitted_epoch (the membership epoch observed for the node). Backed by
 -- cluster_get_membership (OID 8962). cluster.enabled=off returns 0 rows.
+-- spec-5.18 D15 extends this with removed (permanently-removed flag, the
+-- terminal state) + removed_epoch (the removal epoch, 0 when not removed).
 CREATE VIEW pg_cluster_membership AS
     SELECT node_id,
            declared,
            state,
            presented_incarnation,
            last_admitted_incarnation,
-           admitted_epoch
+           admitted_epoch,
+           removed,
+           removed_epoch
     FROM cluster_get_membership();
 
 REVOKE ALL ON pg_cluster_membership FROM PUBLIC;
@@ -1649,6 +1653,38 @@ GRANT SELECT ON pg_cluster_clean_leave_state TO PUBLIC;
 -- superuser()); REVOKE EXECUTE FROM PUBLIC for defense-in-depth (L7).
 REVOKE ALL ON FUNCTION pg_cluster_clean_leave_request() FROM PUBLIC;
 
+-- PGRAC: pg_cluster_node_removal_state (spec-5.18 D15).
+-- Always-1-row view of permanent node-removal progress: phase (idle/requested/
+-- precheck/fence_arming/shrink_committing/cleanup/cleanup_blocked/committed/
+-- aborted/aborted_escalate), target_node_id (-1 when idle), coordinator_node_id,
+-- remove_epoch, fence_armed, membership_shrunk, grd_cleaned, pcm_cleaned,
+-- ack_count, deadline_us (NULL pre-cleanup), and lifetime counters. Backed by
+-- cluster_get_node_removal_state (OID 8963); read-only → public. enabled=off → 0 rows.
+CREATE VIEW pg_cluster_node_removal_state AS
+    SELECT phase,
+           target_node_id,
+           coordinator_node_id,
+           remove_epoch,
+           fence_armed,
+           membership_shrunk,
+           grd_cleaned,
+           pcm_cleaned,
+           ack_count,
+           deadline_us,
+           removal_committed_count,
+           cleanup_blocked_count,
+           leftover_detected_count,
+           zombie_write_rejected_count
+    FROM cluster_get_node_removal_state();
+
+REVOKE ALL ON pg_cluster_node_removal_state FROM PUBLIC;
+GRANT SELECT ON pg_cluster_node_removal_state TO PUBLIC;
+
+-- PGRAC: pg_cluster_remove_node(int) is a mutating operator entry (permanently
+-- removes a declared node). Superuser-only (the C body also gates on superuser());
+-- REVOKE EXECUTE FROM PUBLIC for defense-in-depth (L7).
+REVOKE ALL ON FUNCTION pg_cluster_remove_node(int) FROM PUBLIC;
+
 -- PGRAC: pg_cluster_ic_msg_types (spec-2.3 D8; 2026-05-08).
 -- Lists every IC message type registered in the process-local
 -- dispatch_table[] under cluster_ic_router.c. Diagnostic /

src/backend/cluster/Makefile

Lines changed: 6 additions & 1 deletion
@@ -131,6 +131,9 @@ OBJS = \
     cluster_lmon.o \
     cluster_lms.o \
     cluster_membership.o \
+    cluster_node_remove.o \
+    cluster_node_remove_policy.o \
+    cluster_node_remove_views.o \
     cluster_pcm_lock.o \
     cluster_pgstat.o \
     cluster_quorum_decision.o \
@@ -205,13 +208,15 @@ else
 OBJS = cluster_conf.o cluster_debug.o cluster_ic.o cluster_inject.o cluster_undo_srf.o \
     cluster_cr_srf.o cluster_block_apply_srf.o cluster_block_recovery_srf.o cluster_thread_recovery_apply_srf.o cluster_thread_recovery_replay_srf.o cluster_thread_recovery_driver_srf.o cluster_thread_recovery_orchestrator_srf.o cluster_pgstat.o cluster_scn.o cluster_views.o cluster_ges_mode_backend.o \
     cluster_ir_srf.o cluster_ts_srf.o cluster_ko_srf.o \
-    cluster_hang_resolve.o cluster_clean_leave_views.o
+    cluster_hang_resolve.o cluster_clean_leave_views.o cluster_node_remove_views.o
 # spec-5.12: cluster_hang_resolve.o provides the pg_cluster_hang_victims /
 # pg_cluster_hang_resolve SQL symbols (real bodies #ifdef USE_PGRAC_CLUSTER,
 # --disable-cluster stubs raise ERRCODE_FEATURE_NOT_SUPPORTED); the symbols
 # must resolve at link time in both build modes.
 # spec-5.13: cluster_clean_leave_views.o provides the cluster_get_clean_leave_state
 # SRF + pg_cluster_clean_leave_request UDF symbols, same unconditional-link reason.
+# spec-5.18: cluster_node_remove_views.o provides cluster_get_node_removal_state +
+# pg_cluster_remove_node, same unconditional-link reason.
 endif
 
 include $(top_srcdir)/src/backend/common.mk

src/backend/cluster/cluster_guc.c

Lines changed: 29 additions & 0 deletions
@@ -301,6 +301,10 @@ bool cluster_online_join = false;
 /* spec-5.15 D7 — join convergence / commit deadline (PGC_SIGHUP). */
 int cluster_join_convergence_timeout_ms = 30000;
 
+/* spec-5.18 D13 — permanent node removal opt-in + cleanup ACK-barrier deadline. */
+bool cluster_online_node_removal = false;
+int cluster_node_removal_cleanup_timeout_ms = 30000;
+
 /* spec-2.28 Sprint A Step 1 D7: 4 fence-lite GUCs (Q8 user approve). */
 bool cluster_self_fence_enabled = true; /* default fail-safe */
 int cluster_self_fence_grace_ms = 30000; /* 30s = 7.5x lease */
@@ -2230,6 +2234,31 @@ cluster_init_guc(void)
                             &cluster_join_convergence_timeout_ms, 30000, 5000, 120000, PGC_SIGHUP,
                             GUC_UNIT_MS, NULL, NULL, NULL);
 
+    /*
+     * cluster.online_node_removal -- opt-in permanent node removal (spec-5.18).
+     * Default OFF (5.6 opt-in paradigm): pg_cluster_remove_node() returns
+     * rejected:feature_disabled and no removal path runs. Permanent removal is a
+     * high-risk irreversible operation (fence + member-set shrink), so rollout is
+     * conservative; flipped on after the spec-5.19 4-node fault-matrix acceptance.
+     */
+    DefineCustomBoolVariable(
+        "cluster.online_node_removal",
+        gettext_noop("Enable permanent removal (decommission) of a declared node."),
+        gettext_noop("Off: pg_cluster_remove_node() is rejected (feature_disabled) and no "
+                     "removal/fence/cleanup path runs. On: an already-left or non-returning "
+                     "declared node is permanently fenced, shrunk out of the member set, and "
+                     "its GRD/GES/PCM leftover cleaned up cluster-wide."),
+        &cluster_online_node_removal, false, PGC_POSTMASTER, 0, NULL, NULL, NULL);
+
+    DefineCustomIntVariable("cluster.node_removal_cleanup_timeout_ms",
+                            gettext_noop("Deadline for the post-shrink cluster-wide removal cleanup."),
+                            gettext_noop("If verify_no_leftover + all-survivor cleanup ACKs do not "
+                                         "complete within this bound, the removal enters "
+                                         "CLEANUP_BLOCKED (resumable, fail-closed — never reports "
+                                         "complete). Range [5000, 120000] ms."),
+                            &cluster_node_removal_cleanup_timeout_ms, 30000, 5000, 120000, PGC_SIGHUP,
+                            GUC_UNIT_MS, NULL, NULL, NULL);
+
     /* spec-2.28 Sprint A Step 1 D7: 4 fence-lite GUCs (Q8 user approve). */
 
     /*

src/backend/cluster/cluster_inject.c

Lines changed: 13 additions & 0 deletions
@@ -595,6 +595,19 @@ static ClusterInjectPoint cluster_injection_points[] = {
      * (t/315 P1): the leaver must reject:preflight_incomplete, NOT fail open.
      */
     { .name = "cluster-clean-leave-survivor-suppress-preflight-ack" },
+    /*
+     * spec-5.18 D14 (7 NEW points) — permanent node-removal state-machine
+     * checkpoints, one per phase transition, so the cluster_tap can drive the
+     * fence-before-shrink ordering (INV-LF2), the three-phase marker recovery
+     * (INV-LF7), and the post-SHRUNK CLEANUP_BLOCKED resume (v0.4 P1).
+     */
+    { .name = "cluster-node-remove-request" },
+    { .name = "cluster-node-remove-precheck" },
+    { .name = "cluster-node-remove-fence-armed" },
+    { .name = "cluster-node-remove-shrink-committing" },
+    { .name = "cluster-node-remove-shrink-committed" },
+    { .name = "cluster-node-remove-cleanup-done" },
+    { .name = "cluster-node-remove-escalate" },
 };
 
 #define CLUSTER_INJECTION_COUNT lengthof(cluster_injection_points)

src/backend/cluster/cluster_lmon.c

Lines changed: 18 additions & 0 deletions
@@ -56,6 +56,7 @@
 #include "utils/timestamp.h"
 
 #include "cluster/cluster_clean_leave.h" /* cluster_clean_leave_register_ic_msg_types (spec-5.13 D8) */
+#include "cluster/cluster_node_remove.h" /* cluster_node_remove_lmon_tick + register (spec-5.18 D9/D10) */
 #include "cluster/cluster_conf.h"
 #include "cluster/cluster_cssd.h" /* cluster_cssd_outbound_slots (spec-2.5 D2.6) */
 #include "cluster/cluster_fence.h" /* cluster_fence_lmon_tick (spec-2.28 D5) */
@@ -414,6 +415,15 @@ cluster_lmon_shmem_init(void)
             clean_leave_registered = true;
         }
     }
+    /* spec-5.18 D10: register the permanent node-removal IC msg types. */
+    {
+        static bool node_remove_registered = false;
+
+        if (!node_remove_registered) {
+            cluster_node_remove_register_ic_msg_types();
+            node_remove_registered = true;
+        }
+    }
 }
 
 
@@ -972,6 +982,10 @@ LmonMain(void)
          * is recorded before reconfig builds its dead set (CL-I13). */
         cluster_clean_leave_lmon_tick();
 
+        /* spec-5.18 D9: drive permanent node removal before reconfig (a removal
+         * masks the node out of effective_dead, INV-LF1). */
+        cluster_node_remove_lmon_tick();
+
         cluster_reconfig_lmon_tick();
 
         /*
@@ -1522,6 +1536,10 @@ LmonMain(void)
         /* spec-5.13 D6: clean-leave orchestration before the reconfig tick. */
         cluster_clean_leave_lmon_tick();
 
+        /* spec-5.18 D9: drive permanent node removal before reconfig (a removal
+         * masks the node out of effective_dead, INV-LF1). */
+        cluster_node_remove_lmon_tick();
+
         cluster_reconfig_lmon_tick();
         /* spec-4.6 D1: GRD recovery sequence (see main-loop site). */
         cluster_grd_recovery_lmon_tick();
