| 1 | +# Permanent Node Removal — User Manual |
| 2 | + |
| 3 | +> Status: introduced in linkdb stage 5.18. |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +**Permanent node removal** decommissions a declared cluster node for good. It is |
| 8 | +the finishing step after a node has already left — either by a cooperative |
| 9 | +[clean leave](clean-leave.md) or by a crash (fail-stop) — and you have decided it |
| 10 | +will not come back. |
| 11 | + |
| 12 | +Removing a node does three things: |
| 13 | + |
| 14 | +1. **Shrinks the membership.** The node's effective membership state becomes |
| 15 | + `removed`. Surviving nodes stop counting it as a member, stop expecting it in |
| 16 | + membership barriers, and never again treat its heartbeats as a returning |
| 17 | + member. |
| 18 | +2. **Fences it.** A removed node is write-fenced on the shared storage: even if a |
| 19 | + stale instance of it comes back, it cannot write shared data, cannot fool a |
| 20 | + surviving node, and cannot passively rejoin. |
| 21 | +3. **Cleans up after it cluster-wide.** Every lock-directory shard the removed |
| 22 | + node mastered is permanently reassigned to a surviving member, its global lock |
| 23 | + and page-lock leftovers are cleared, its voting-disk record is tombstoned, and |
| 24 | + the cluster verifies that zero references to it remain anywhere. |
| 25 | + |
| 26 | +Removal is **opt-in** and **off by default**. It only works on a node that has |
| 27 | +already left: you cannot remove a live, actively serving node. Clean-leave or |
| 28 | +stop it first so that its in-memory data is not lost. |
| 29 | + |
| 30 | +## Configuration |
| 31 | + |
| 32 | +| GUC | Default | Range | Context | Description | |
| 33 | +|---|---|---|---|---| |
| 34 | +| `cluster.online_node_removal` | `off` | boolean | `POSTMASTER` | Enable permanent node removal on this node. When `off`, `pg_cluster_remove_node()` returns `rejected:feature_disabled` and no removal, fence, or cleanup runs. | |
| 35 | +| `cluster.node_removal_cleanup_timeout_ms` | `30000` | `[5000, 120000]` | `SIGHUP` | Deadline for the post-shrink cluster-wide cleanup. If cleanup does not finish (all surviving members acknowledge and zero leftovers are proven) within this window, the removal enters a resumable `cleanup_blocked` state — it is never reported complete until cleanup actually succeeds. | |
| 36 | + |
| 37 | +To use node removal, set `cluster.online_node_removal = on` in `pgrac.conf` (or |
| 38 | +`postgresql.conf`) on the surviving nodes and restart, then issue the request |
| 39 | +below on a surviving node. |
| 40 | + |
| 41 | +## Removing a node |
| 42 | + |
| 43 | +```sql |
| 44 | +SELECT pg_cluster_remove_node(<node_id>); |
| 45 | +``` |
| 46 | + |
| 47 | +Superuser only. Run it on a **surviving** node (not the node being removed). The |
| 48 | +function returns a short text status: |
| 49 | + |
| 50 | +| Result | Meaning | |
| 51 | +|---|---| |
| 52 | +| `accepted` | The removal was accepted and is running. Watch `pg_cluster_node_removal_state` for progress until `phase` reaches `committed`. | |
| 53 | +| `noop:already_removed` | The node was already permanently removed. | |
| 54 | +| `resume:cleanup_pending` | The node is already shrunk out and fenced, but its cleanup had not finished; the request resumes and completes the cleanup. | |
| 55 | +| `rejected:feature_disabled` | `cluster.online_node_removal` is `off` on this node. | |
| 56 | +| `rejected:cannot_remove_self` | You cannot remove the node you are connected to. | |
| 57 | +| `rejected:not_declared` | `node_id` is not a declared cluster node. | |
| 58 | +| `rejected:node_not_drained` | The node is still active and has not left. Clean-leave or stop it first — removal never force-evicts a live node. | |
| 59 | +| `rejected:not_in_quorum` | This node is not currently in quorum. | |
| 60 | +| `rejected:removal_in_progress` | A different removal is already in progress. | |
| 61 | + |
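An operator script usually has to branch on the returned status. The following is a minimal Python sketch that maps the documented status strings to a next step. The function name `next_step` and the labels `poll`, `done`, and `fix` are illustrative assumptions, not part of linkdb; treating `resume:cleanup_pending` like `accepted` (keep watching the progress view) is also an assumption based on the table above.

```python
# Hypothetical helper: maps the documented status strings returned by
# pg_cluster_remove_node() to a next step for an operator script.
def next_step(status: str) -> str:
    """Return 'poll', 'done', or 'fix' for a documented status string."""
    if status in ("accepted", "resume:cleanup_pending"):
        # The removal (or its resumed cleanup) is running: watch the view.
        return "poll"
    if status == "noop:already_removed":
        # Nothing left to do for this node.
        return "done"
    if status.startswith("rejected:"):
        # The text after the colon names the reason, e.g. 'node_not_drained'.
        return "fix"
    raise ValueError(f"unexpected status: {status!r}")
```

For a `fix` result, `status.split(":", 1)[1]` yields the reason code to look up in the table.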
| 62 | +After the request returns `accepted`, poll the progress view until `phase` |
| 63 | +reaches `committed`: |
| 64 | + |
| 65 | +```sql |
| 66 | +SELECT phase FROM pg_cluster_node_removal_state; -- wait for 'committed' |
| 67 | +``` |
| 68 | + |
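The polling step can be wrapped in a small loop. The sketch below is one possible shape, not a shipped tool: `wait_for_removal` and its parameters are hypothetical, and `fetch_phase` stands for any caller-supplied function that runs the query above and returns the `phase` text. Treating `aborted` and `aborted_escalate` as failures, and handing `cleanup_blocked` back to the caller so the request can be re-issued, are assumptions of this sketch.

```python
import time
from typing import Callable

# Assumption: these two phases mean the removal did not complete.
FAILED_PHASES = {"aborted", "aborted_escalate"}


def wait_for_removal(
    fetch_phase: Callable[[], str],
    timeout_s: float = 300.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the phase is 'committed' or 'cleanup_blocked'.

    Returns the phase that ended the wait. Raises RuntimeError on an
    aborted removal and TimeoutError if neither phase is reached in time.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        phase = fetch_phase()
        if phase in ("committed", "cleanup_blocked"):
            # 'cleanup_blocked' is resumable; the caller decides whether to
            # re-issue pg_cluster_remove_node() for the same node.
            return phase
        if phase in FAILED_PHASES:
            raise RuntimeError(f"removal ended in phase {phase!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still in phase {phase!r} after {timeout_s}s")
        sleep(interval_s)
```

Passing `sleep` and `fetch_phase` as arguments keeps the loop independent of any particular database driver and makes it easy to exercise without a cluster.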
| 69 | +## Progress: `pg_cluster_node_removal_state` |
| 70 | + |
| 71 | +A read-only view, granted to PUBLIC, that always returns exactly one row |
| 72 | +describing the current removal. When idle, `phase` is `idle` and `target_node_id` is `-1`. |
| 73 | + |
| 74 | +| Column | Type | Description | |
| 75 | +|---|---|---| |
| 76 | +| `phase` | text | `idle`, `requested`, `precheck`, `fence_arming`, `shrink_committing`, `cleanup`, `cleanup_blocked`, `committed`, `aborted`, `aborted_escalate`. | |
| 77 | +| `target_node_id` | int4 | The node being removed, or `-1` when idle. | |
| 78 | +| `coordinator_node_id` | int4 | The surviving node driving the removal. | |
| 79 | +| `remove_epoch` | int8 | The membership epoch the removal is bound to. | |
| 80 | +| `fence_armed` | bool | The removed node's write fence is majority-durable. | |
| 81 | +| `membership_shrunk` | bool | The membership has been shrunk (the node is a non-member). | |
| 82 | +| `grd_cleaned` | bool | Lock-directory remaster + cleanup is done. | |
| 83 | +| `pcm_cleaned` | bool | Page-lock cleanup is done. | |
| 84 | +| `ack_count` | int4 | Surviving members that acknowledged the cleanup. | |
| 85 | +| `deadline_us` | int8 | Cleanup deadline, or `NULL` before cleanup starts. | |
| 86 | +| `removal_committed_count` | int8 | Lifetime count of completed removals. | |
| 87 | +| `cleanup_blocked_count` | int8 | Lifetime count of cleanups that hit the deadline and resumed. | |
| 88 | +| `leftover_detected_count` | int8 | Lifetime count of leftover-reference detections (fail-closed). | |
| 89 | +| `zombie_write_rejected_count` | int8 | Lifetime count of write attempts rejected from removed nodes. | |
| 90 | + |
| 91 | +A committed removal also surfaces in `pg_cluster_reconfig_state` with |
| 92 | +`reconfig_kind = 'node_removed'`, and in `pg_cluster_membership` the node's row |
| 93 | +shows `state = 'removed'`, `removed = true`, and a non-zero `removed_epoch`. |
| 94 | + |
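A post-removal check can assert all three membership conditions at once. The sketch below assumes the node's `pg_cluster_membership` row has already been fetched into a Python dict keyed by column name; the function name `is_permanently_removed` is hypothetical.

```python
# Hypothetical check over one pg_cluster_membership row, passed as a dict
# keyed by the column names mentioned in this manual.
def is_permanently_removed(row: dict) -> bool:
    """True only if state, removed flag, and removed_epoch all agree."""
    return (
        row.get("state") == "removed"
        and row.get("removed") is True
        # A missing or NULL epoch is treated the same as zero.
        and (row.get("removed_epoch") or 0) != 0
    )
```

Checking all three fields together avoids reporting success on a row that is only partially updated.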
| 95 | +## Error codes |
| 96 | + |
| 97 | +| Code | Name | Meaning | Retry | |
| 98 | +|---|---|---|---| |
| 99 | +| `53R63` | `cluster_node_removal_in_progress` | A writable transaction was rolled back while a removal epoch was publishing. | **Yes.** Retrying is safe on the new epoch. | |
| 100 | +| `53R64` | `cluster_node_removed_fenced` | A removed node tried to serve or rejoin the cluster. | **No.** The node must be re-admitted by an operator before it can return (see below). | |
| 101 | +| `53R51` | `cluster_write_fenced` | A removed (fenced) node tried to write shared storage. | **No.** The write is refused; the node is no longer a member. | |
| 102 | + |
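Client code can encode the retry guidance from this table directly. The sketch below is a minimal example; `should_retry` is a hypothetical name, and treating `53R51` as non-retryable is an assumption drawn from the fact that the writing node is no longer a member.

```python
# SQLSTATE codes from the error table, grouped by retry guidance.
RETRYABLE = {"53R63"}               # rolled back while a removal epoch was publishing
NOT_RETRYABLE = {"53R64", "53R51"}  # raised for a removed, fenced node


def should_retry(sqlstate: str) -> bool:
    """Return True only for codes this manual marks as safe to retry."""
    if sqlstate in RETRYABLE:
        return True
    # Known fenced-node codes, and any code outside this sketch,
    # default to "do not retry".
    return False
```

Defaulting unknown codes to no retry is a conservative choice for this sketch; a real client would combine it with its existing handling of other SQLSTATE classes.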
| 103 | +## A removed node does not come back automatically |
| 104 | + |
| 105 | +This is the most important operational point. **In this version, a removed node |
| 106 | +cannot rejoin on its own.** Restarting it, or reconnecting it to the cluster, |
| 107 | +does not bring it back — it stays fenced and is refused (`53R64`). Bringing a |
| 108 | +removed node back is a deliberate operator action (un-fencing it) that is **not |
| 109 | +provided as a command in this version**; it is a future operational procedure. |
| 110 | + |
| 111 | +A single `join` is therefore only half of what a return would require — the node |
| 112 | +must first be explicitly un-fenced. This is intentional: a removed node is treated |
| 113 | +as gone until an operator decides, out of band, that it is safe to return. |
| 114 | + |
| 115 | +## Production note: external fencing |
| 116 | + |
| 117 | +The write fence applied by node removal is a **cooperative** fence: it relies on |
| 118 | +the removed node's own storage and startup paths to honor it. For a node that is |
| 119 | +healthy and correctly configured, that is sufficient — it will not write shared |
| 120 | +data after removal. |
| 121 | + |
| 122 | +For a hard guarantee against a malfunctioning or malicious instance (one that |
| 123 | +does not honor the cooperative fence), a production deployment should pair node |
| 124 | +removal with an **external fencer** at the node level (for example STONITH, IPMI |
| 125 | +power control, or a cloud power/network API). External fencing is outside the |
| 126 | +database and is configured by your cluster operations tooling. |