sqlrush
diff --git a/docs/cluster/node-removal.md b/docs/cluster/node-removal.md
Lines changed: 126 additions & 0 deletions
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
Lines changed: 21 additions & 0 deletions
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
Lines changed: 37 additions & 1 deletion
diff --git a/src/backend/cluster/Makefile b/src/backend/cluster/Makefile
Lines changed: 6 additions & 1 deletion
diff --git a/src/backend/cluster/cluster_guc.c b/src/backend/cluster/cluster_guc.c
Lines changed: 29 additions & 0 deletions
diff --git a/src/backend/cluster/cluster_inject.c b/src/backend/cluster/cluster_inject.c
Lines changed: 13 additions & 0 deletions
diff --git a/src/backend/cluster/cluster_lmon.c b/src/backend/cluster/cluster_lmon.c
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,126 @@
+# Permanent Node Removal — User Manual
+
+> Status: introduced in linkdb stage 5.18.
+
+## Overview
+
+**Permanent node removal** decommissions a declared cluster node for good. It is
+the finishing step after a node has already left — either by a cooperative
+[clean leave](clean-leave.md) or by a crash (fail-stop) — and you have decided it
+will not come back.
+
+Removing a node does three things:
+
+1. **Shrinks the membership.** The node's effective membership state becomes
+   `removed`. Surviving nodes stop counting it as a member, stop expecting it in
+   membership barriers, and never again treat its heartbeats as a returning
+   member.
+2. **Fences it.** A removed node is write-fenced on the shared storage: even if a
+   stale instance of it comes back, it cannot write shared data, cannot fool a
+   surviving node, and cannot passively rejoin.
+3. **Cleans up after it cluster-wide.** Every lock-directory shard the removed
+   node mastered is permanently reassigned to a surviving member, its global lock
+   and page-lock leftovers are cleared, its voting-disk record is tombstoned, and
+   the cluster verifies that zero references to it remain anywhere.
+
+Removal is **opt-in** and **off by default**. It only works on a node that has
+already left: you cannot remove a live, actively serving node. You must
+clean-leave or stop it first, so that its in-memory data is not lost.
+
+## Configuration
+
+| GUC | Default | Range | Context | Description |
+|---|---|---|---|---|
+| `cluster.online_node_removal` | `off` | boolean | `POSTMASTER` | Enable permanent node removal on this node. When `off`, `pg_cluster_remove_node()` returns `rejected:feature_disabled` and no removal, fence, or cleanup runs. |
+| `cluster.node_removal_cleanup_timeout_ms` | `30000` | `[5000, 120000]` | `SIGHUP` | Deadline for the post-shrink cluster-wide cleanup. If cleanup does not finish (all surviving members acknowledge and zero leftovers are proven) within this window, the removal enters a resumable `cleanup_blocked` state — it is never reported complete until cleanup actually succeeds. |
+
+To use node removal, set `cluster.online_node_removal = on` in `pgrac.conf` (or
+`postgresql.conf`) on the surviving nodes and restart, then issue the request
+below on a surviving node.
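+
+After the restart, one way to confirm that the settings took effect is to check
+them from a session on each surviving node. This is a minimal sketch using the
+standard `SHOW` command, not a required step:
+
+```sql
+SHOW cluster.online_node_removal;               -- expect 'on'
+SHOW cluster.node_removal_cleanup_timeout_ms;   -- value is in milliseconds
+```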
+
+## Removing a node
+
+```sql
+SELECT pg_cluster_remove_node(<node_id>);
+```
+
+Superuser only. Run it on a **surviving** node (not the node being removed). The
+function returns a short text status:
+
+| Result | Meaning |
+|---|---|
+| `accepted` | The removal was accepted and is running. Watch `pg_cluster_node_removal_state` for progress until `phase` reaches `committed`. |
+| `noop:already_removed` | The node was already permanently removed. |
+| `resume:cleanup_pending` | The node is already shrunk out and fenced, but its cleanup had not finished; the request resumes and completes the cleanup. |
+| `rejected:feature_disabled` | `cluster.online_node_removal` is `off` on this node. |
+| `rejected:cannot_remove_self` | You cannot remove the node you are connected to. |
+| `rejected:not_declared` | `node_id` is not a declared cluster node. |
+| `rejected:node_not_drained` | The node is still active and has not left. Clean-leave or stop it first — removal never force-evicts a live node. |
+| `rejected:not_in_quorum` | This node is not currently in quorum. |
+| `rejected:removal_in_progress` | A different removal is already in progress. |
+
+After the request returns `accepted`, poll the progress view until `phase`
+reaches `committed`:
+
+```sql
+SELECT phase FROM pg_cluster_node_removal_state;   -- wait for 'committed'
+```
+
+## Progress: `pg_cluster_node_removal_state`
+
+An always-one-row view of the removal in progress (read-only; granted to PUBLIC).
+When idle, `phase` is `idle` and `target_node_id` is `-1`.
+
+| Column | Type | Description |
+|---|---|---|
+| `phase` | text | `idle`, `requested`, `precheck`, `fence_arming`, `shrink_committing`, `cleanup`, `cleanup_blocked`, `committed`, `aborted`, `aborted_escalate`. |
+| `target_node_id` | int4 | The node being removed, or `-1` when idle. |
+| `coordinator_node_id` | int4 | The surviving node driving the removal. |
+| `remove_epoch` | int8 | The membership epoch the removal is bound to. |
+| `fence_armed` | bool | The removed node's write fence is majority-durable. |
+| `membership_shrunk` | bool | The membership has been shrunk (the node is a non-member). |
+| `grd_cleaned` | bool | Lock-directory remaster + cleanup is done. |
+| `pcm_cleaned` | bool | Page-lock cleanup is done. |
+| `ack_count` | int4 | Surviving members that acknowledged the cleanup. |
+| `deadline_us` | int8 | Cleanup deadline, or `NULL` before cleanup starts. |
+| `removal_committed_count` | int8 | Lifetime count of completed removals. |
+| `cleanup_blocked_count` | int8 | Lifetime count of cleanups that hit the deadline and resumed. |
+| `leftover_detected_count` | int8 | Lifetime count of leftover-reference detections (fail-closed). |
+| `zombie_write_rejected_count` | int8 | Lifetime count of write attempts rejected from removed nodes. |
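+
+While a removal is running, the boolean columns give a quick picture of how far
+it has progressed. A sketch of such a query, using only the columns listed in
+the table above:
+
+```sql
+SELECT phase, target_node_id, fence_armed, membership_shrunk,
+       grd_cleaned, pcm_cleaned, ack_count
+  FROM pg_cluster_node_removal_state;
+```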
+
+A committed removal also surfaces in `pg_cluster_reconfig_state` with
+`reconfig_kind = 'node_removed'`, and in `pg_cluster_membership` the node's row
+shows `state = 'removed'`, `removed = true`, and a non-zero `removed_epoch`.
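+
+For example, to confirm the outcome from a surviving node, a query along these
+lines could be used. The node id `3` is a placeholder, and the sketch assumes
+your role is allowed to read `pg_cluster_membership`:
+
+```sql
+SELECT node_id, state, removed, removed_epoch
+  FROM pg_cluster_membership
+ WHERE node_id = 3;   -- placeholder id of the removed node
+```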
+
+## Error codes
+
+| Code | Name | Meaning | Retry |
+|---|---|---|---|
+| `53R63` | `cluster_node_removal_in_progress` | A writable transaction was rolled back while a removal epoch was being published. | **Yes.** Retrying is safe on the new epoch. |
+| `53R64` | `cluster_node_removed_fenced` | A removed node tried to serve or rejoin the cluster. | **No.** The node must be re-admitted by an operator before it can return (see below). |
+| `53R51` | `cluster_write_fenced` | A removed (fenced) node tried to write shared storage. | **No.** The write is refused; the node is no longer a member. |
+
+## A removed node does not come back automatically
+
+This is the most important operational point. **In this version, a removed node
+cannot rejoin on its own.** Restarting it, or reconnecting it to the cluster,
+does not bring it back — it stays fenced and is refused (`53R64`). Bringing a
+removed node back is a deliberate operator action (un-fencing it) that is **not
+provided as a command in this version**; it is a future operational procedure.
+
+A `join` alone is therefore only half of what a return would require: the node
+must first be explicitly un-fenced. This is intentional. A removed node is treated
+as gone until an operator decides, out of band, that it is safe to return.
+
+## Production note: external fencing
+
+The write fence applied by node removal is a **cooperative** fence: it relies on
+the removed node's own storage and startup paths to honor it. For a node that is
+healthy and correctly configured, that is sufficient — it will not write shared
+data after removal.
+
+For a hard guarantee against a malfunctioning or malicious instance (one that
+does not honor the cooperative fence), a production deployment should pair node
+removal with an **external fencer** at the node level (for example STONITH, IPMI
+power control, or a cloud power/network API). External fencing is outside the
+database and is configured by your cluster operations tooling.
@@ -155,6 +155,7 @@
 #include "cluster/cluster_undo_record_api.h"  /* PGRAC: spec-3.7 D16 PREPARE guard */
 #include "cluster/cluster_touched_peers.h"	  /* PGRAC: spec-5.14 D1 per-tx reset */
 #include "cluster/cluster_clean_leave.h"		  /* PGRAC: spec-5.13 §3.1 refuse-writes gate */
+#include "cluster/cluster_node_remove.h"		  /* PGRAC: spec-5.18 INV-LF9 self-demote gate */
 #include "cluster/cluster_reconfig.h"			  /* PGRAC: spec-5.15 §2.4 joiner write gate */
 #include "cluster/storage/cluster_undo_xlog.h" /* PGRAC: spec-3.18 D4.1 TT fold redo stamp */
 #endif
@@ -756,6 +757,26 @@ AssignTransactionId(TransactionState s)
 				 errhint("reconnect to a surviving node and retry;"
 						 " the retry is safe")));
 
+	/*
+	 * PGRAC MODIFICATIONS (spec-5.18 INV-LF9, self-demote gate)
+	 *
+	 * What changed: if THIS node observes that it has itself been permanently
+	 * removed from the cluster (durable removed set / membership_state==REMOVED),
+	 * it must stop serving — a removed node is a non-member and is fenced off the
+	 * shared storage, so any write it starts is both refused at the smgr gate
+	 * (53R51) and semantically invalid (it is no longer part of the cluster).  We
+	 * fail-closed at xid assignment with 53R64 (FATAL, non-retry): the node cannot
+	 * rejoin by retrying — it must be re-admitted (un-fenced) by an operator.
+	 * Read-only transactions assign no xid and are unaffected.
+	 */
+	if (cluster_node_remove_self_is_removed())
+		ereport(FATAL,
+				(errcode(ERRCODE_CLUSTER_NODE_REMOVED_FENCED),
+				 errmsg("cannot start a writable transaction:"
+						" this node has been permanently removed from the cluster"),
+				 errhint("this node must be re-admitted (un-fenced) by an operator"
+						 " before it can rejoin the cluster")));
+
 	/*
 	 * PGRAC MODIFICATIONS (spec-5.15 §2.4 / INV-J9, node-local joiner write gate)
 	 *
 
@@ -1609,13 +1609,17 @@ GRANT SELECT ON pg_cluster_reconfig_state TO PUBLIC;
 --   observed), last_admitted_incarnation (the monotonic floor, INV-J1),
 --   admitted_epoch (the membership epoch observed for the node).  Backed by
 --   cluster_get_membership (OID 8962).  cluster.enabled=off returns 0 rows.
+--   spec-5.18 D15 extends this with removed (permanently-removed flag, the
+--   terminal state) + removed_epoch (the removal epoch, 0 when not removed).
 CREATE VIEW pg_cluster_membership AS
     SELECT node_id,
            declared,
            state,
            presented_incarnation,
            last_admitted_incarnation,
-           admitted_epoch
+           admitted_epoch,
+           removed,
+           removed_epoch
       FROM cluster_get_membership();
 
 REVOKE ALL ON pg_cluster_membership FROM PUBLIC;
@@ -1649,6 +1653,38 @@ GRANT SELECT ON pg_cluster_clean_leave_state TO PUBLIC;
 -- superuser()); REVOKE EXECUTE FROM PUBLIC for defense-in-depth (L7).
 REVOKE ALL ON FUNCTION pg_cluster_clean_leave_request() FROM PUBLIC;
 
+-- PGRAC: pg_cluster_node_removal_state (spec-5.18 D15).
+--   Always-1-row view of permanent node-removal progress: phase (idle/requested/
+--   precheck/fence_arming/shrink_committing/cleanup/cleanup_blocked/committed/
+--   aborted/aborted_escalate), target_node_id (-1 when idle), coordinator_node_id,
+--   remove_epoch, fence_armed, membership_shrunk, grd_cleaned, pcm_cleaned,
+--   ack_count, deadline_us (NULL pre-cleanup), and lifetime counters.  Backed by
+--   cluster_get_node_removal_state (OID 8963); read-only → public.  enabled=off → 0 rows.
+CREATE VIEW pg_cluster_node_removal_state AS
+    SELECT phase,
+           target_node_id,
+           coordinator_node_id,
+           remove_epoch,
+           fence_armed,
+           membership_shrunk,
+           grd_cleaned,
+           pcm_cleaned,
+           ack_count,
+           deadline_us,
+           removal_committed_count,
+           cleanup_blocked_count,
+           leftover_detected_count,
+           zombie_write_rejected_count
+      FROM cluster_get_node_removal_state();
+
+REVOKE ALL ON pg_cluster_node_removal_state FROM PUBLIC;
+GRANT SELECT ON pg_cluster_node_removal_state TO PUBLIC;
+
+-- PGRAC: pg_cluster_remove_node(int) is a mutating operator entry (permanently
+-- removes a declared node).  Superuser-only (the C body also gates on superuser());
+-- REVOKE EXECUTE FROM PUBLIC for defense-in-depth (L7).
+REVOKE ALL ON FUNCTION pg_cluster_remove_node(int) FROM PUBLIC;
+
 -- PGRAC: pg_cluster_ic_msg_types (spec-2.3 D8; 2026-05-08).
 --   Lists every IC message type registered in the process-local
 --   dispatch_table[] under cluster_ic_router.c.  Diagnostic /
 
@@ -131,6 +131,9 @@ OBJS = \
 	cluster_lmon.o \
 	cluster_lms.o \
 	cluster_membership.o \
+	cluster_node_remove.o \
+	cluster_node_remove_policy.o \
+	cluster_node_remove_views.o \
 	cluster_pcm_lock.o \
 	cluster_pgstat.o \
 	cluster_quorum_decision.o \
@@ -205,13 +208,15 @@ else
 OBJS = cluster_conf.o cluster_debug.o cluster_ic.o cluster_inject.o cluster_undo_srf.o \
 	cluster_cr_srf.o cluster_block_apply_srf.o cluster_block_recovery_srf.o cluster_thread_recovery_apply_srf.o cluster_thread_recovery_replay_srf.o cluster_thread_recovery_driver_srf.o cluster_thread_recovery_orchestrator_srf.o cluster_pgstat.o cluster_scn.o cluster_views.o cluster_ges_mode_backend.o \
 	cluster_ir_srf.o cluster_ts_srf.o cluster_ko_srf.o \
-	cluster_hang_resolve.o cluster_clean_leave_views.o
+	cluster_hang_resolve.o cluster_clean_leave_views.o cluster_node_remove_views.o
 # spec-5.12: cluster_hang_resolve.o provides the pg_cluster_hang_victims /
 # pg_cluster_hang_resolve SQL symbols (real bodies #ifdef USE_PGRAC_CLUSTER,
 # --disable-cluster stubs raise ERRCODE_FEATURE_NOT_SUPPORTED); the symbols
 # must resolve at link time in both build modes.
 # spec-5.13: cluster_clean_leave_views.o provides the cluster_get_clean_leave_state
 # SRF + pg_cluster_clean_leave_request UDF symbols, same unconditional-link reason.
+# spec-5.18: cluster_node_remove_views.o provides cluster_get_node_removal_state +
+# pg_cluster_remove_node, same unconditional-link reason.
 endif
 
 include $(top_srcdir)/src/backend/common.mk
@@ -301,6 +301,10 @@ bool cluster_online_join = false;
 /* spec-5.15 D7 — join convergence / commit deadline (PGC_SIGHUP). */
 int cluster_join_convergence_timeout_ms = 30000;
 
+/* spec-5.18 D13 — permanent node removal opt-in + cleanup ACK-barrier deadline. */
+bool cluster_online_node_removal = false;
+int cluster_node_removal_cleanup_timeout_ms = 30000;
+
 /* spec-2.28 Sprint A Step 1 D7: 4 fence-lite GUCs (Q8 user approve). */
 bool cluster_self_fence_enabled = true;	   /* default fail-safe */
 int cluster_self_fence_grace_ms = 30000;   /* 30s = 7.5x lease */
@@ -2230,6 +2234,31 @@ cluster_init_guc(void)
 							&cluster_join_convergence_timeout_ms, 30000, 5000, 120000, PGC_SIGHUP,
 							GUC_UNIT_MS, NULL, NULL, NULL);
 
+	/*
+	 * cluster.online_node_removal -- opt-in permanent node removal (spec-5.18).
+	 * Default OFF (5.6 opt-in paradigm): pg_cluster_remove_node() returns
+	 * rejected:feature_disabled and no removal path runs.  Permanent removal is a
+	 * high-risk irreversible operation (fence + member-set shrink), so rollout is
+	 * conservative; flipped on after the spec-5.19 4-node fault-matrix acceptance.
+	 */
+	DefineCustomBoolVariable(
+		"cluster.online_node_removal",
+		gettext_noop("Enable permanent removal (decommission) of a declared node."),
+		gettext_noop("Off: pg_cluster_remove_node() is rejected (feature_disabled) and no "
+					 "removal/fence/cleanup path runs.  On: an already-left or non-returning "
+					 "declared node is permanently fenced, shrunk out of the member set, and "
+					 "its GRD/GES/PCM leftovers cleaned up cluster-wide."),
+		&cluster_online_node_removal, false, PGC_POSTMASTER, 0, NULL, NULL, NULL);
+
+	DefineCustomIntVariable("cluster.node_removal_cleanup_timeout_ms",
+							gettext_noop("Deadline for the post-shrink cluster-wide removal cleanup."),
+							gettext_noop("If verify_no_leftover + all-survivor cleanup ACKs do not "
+										 "complete within this bound, the removal enters "
+										 "CLEANUP_BLOCKED (resumable, fail-closed — never reports "
+										 "complete).  Range [5000, 120000] ms."),
+							&cluster_node_removal_cleanup_timeout_ms, 30000, 5000, 120000, PGC_SIGHUP,
+							GUC_UNIT_MS, NULL, NULL, NULL);
+
 	/* spec-2.28 Sprint A Step 1 D7: 4 fence-lite GUCs (Q8 user approve). */
 
 	/*
 
@@ -595,6 +595,19 @@ static ClusterInjectPoint cluster_injection_points[] = {
 	 * (t/315 P1): the leaver must reject:preflight_incomplete, NOT fail open.
 	 */
 	{ .name = "cluster-clean-leave-survivor-suppress-preflight-ack" },
+	/*
+	 * spec-5.18 D14 (7 NEW points) — permanent node-removal state-machine
+	 * checkpoints, one per phase transition, so the cluster_tap can drive the
+	 * fence-before-shrink ordering (INV-LF2), the three-phase marker recovery
+	 * (INV-LF7), and the post-SHRUNK CLEANUP_BLOCKED resume (v0.4 P1).
+	 */
+	{ .name = "cluster-node-remove-request" },
+	{ .name = "cluster-node-remove-precheck" },
+	{ .name = "cluster-node-remove-fence-armed" },
+	{ .name = "cluster-node-remove-shrink-committing" },
+	{ .name = "cluster-node-remove-shrink-committed" },
+	{ .name = "cluster-node-remove-cleanup-done" },
+	{ .name = "cluster-node-remove-escalate" },
 };
 
 #define CLUSTER_INJECTION_COUNT lengthof(cluster_injection_points)
 
@@ -56,6 +56,7 @@
 #include "utils/timestamp.h"
 
 #include "cluster/cluster_clean_leave.h" /* cluster_clean_leave_register_ic_msg_types (spec-5.13 D8) */
+#include "cluster/cluster_node_remove.h" /* cluster_node_remove_lmon_tick + register (spec-5.18 D9/D10) */
 #include "cluster/cluster_conf.h"
 #include "cluster/cluster_cssd.h"	   /* cluster_cssd_outbound_slots (spec-2.5 D2.6) */
 #include "cluster/cluster_fence.h"	   /* cluster_fence_lmon_tick (spec-2.28 D5) */
@@ -414,6 +415,15 @@ cluster_lmon_shmem_init(void)
 			clean_leave_registered = true;
 		}
 	}
+	/* spec-5.18 D10: register the permanent node-removal IC msg types. */
+	{
+		static bool node_remove_registered = false;
+
+		if (!node_remove_registered) {
+			cluster_node_remove_register_ic_msg_types();
+			node_remove_registered = true;
+		}
+	}
 }
 
 
@@ -972,6 +982,10 @@ LmonMain(void)
 			 * is recorded before reconfig builds its dead set (CL-I13). */
 			cluster_clean_leave_lmon_tick();
 
+			/* spec-5.18 D9: drive permanent node removal before reconfig (a removal
+			 * masks the node out of effective_dead, INV-LF1). */
+			cluster_node_remove_lmon_tick();
+
 			cluster_reconfig_lmon_tick();
 
 			/*
@@ -1522,6 +1536,10 @@ LmonMain(void)
 			/* spec-5.13 D6: clean-leave orchestration before the reconfig tick. */
 			cluster_clean_leave_lmon_tick();
 
+			/* spec-5.18 D9: drive permanent node removal before reconfig (a removal
+			 * masks the node out of effective_dead, INV-LF1). */
+			cluster_node_remove_lmon_tick();
+
 			cluster_reconfig_lmon_tick();
 			/* spec-4.6 D1:  GRD recovery sequence (see main-loop site). */
 			cluster_grd_recovery_lmon_tick();