Commit d7236b7 (parent 8900bcb)

docs: import CloudNativePG main

1 file changed: website/docs/failover.md (43 additions, 0 deletions)

@@ -103,6 +103,49 @@ expected outage.
Enabling a new configuration option to delay failover provides a mechanism to
prevent premature failover during short-lived network or node instability.

## Detection of node-level failures

When the node hosting the primary becomes unreachable (for example, due to a
kubelet crash or a network partition between the node and the Kubernetes API
server), the operator relies on the pod's `Ready` condition to decide that the
primary is no longer serviceable. While the node is healthy, the kubelet keeps
that condition up to date from the readiness probe; once the node stops
reporting, it is the Kubernetes node lifecycle controller that flips the
condition to `False` as soon as it declares the node `Unknown`.
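
Both conditions can be observed from the outside with standard `kubectl`
JSONPath queries (the pod and node names below are placeholders):

```
# Ready condition of the (hypothetical) primary pod "cluster-example-1";
# flips to "False" once the node lifecycle controller patches it
kubectl get pod cluster-example-1 \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'

# Ready condition of the node itself; reported as "Unknown" once the
# node stops posting status updates
kubectl get node my-node \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
```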
With stock kube-controller-manager settings, the transition is governed by
`--node-monitor-grace-period` (default `40s` on Kubernetes 1.29-1.31, raised
to `50s` in 1.32 and later): after that window the controller marks the node
`Unknown` and, in the same monitoring pass, issues a patch per pod on that
node to flip the `Ready` condition. In practice, the operator observes the
primary as unready about **40 to 55 seconds** after the node becomes
unreachable (the grace period plus up to one `--node-monitor-period` poll,
default `5s`). Managed Kubernetes distributions (GKE, EKS, AKS) may tune
these values; consult your provider's documentation if the observed timing
does not match. After that, the failover procedure starts (further gated by
`.spec.failoverDelay`).
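
The arithmetic behind the **40 to 55 seconds** figure can be sketched with a
small helper (illustrative only; `detection_window` is a made-up name, not an
operator API):

```python
def detection_window(grace_period_s: int, monitor_period_s: int) -> tuple[int, int]:
    """Best/worst-case delay between node loss and the pod's Ready=False patch.

    Best case: the grace period expires exactly on a monitoring pass.
    Worst case: up to one extra --node-monitor-period poll elapses before
    the node lifecycle controller notices the expired grace period.
    """
    return grace_period_s, grace_period_s + monitor_period_s

# Defaults on Kubernetes 1.29-1.31
print(detection_window(40, 5))  # (40, 45)
# Defaults on Kubernetes 1.32 and later
print(detection_window(50, 5))  # (50, 55)
```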
The `Ready` condition flip is not subject to the rate limiters that throttle
pod *eviction* during partial-zonal or large-cluster disruptions
(`--node-eviction-rate`, `--secondary-node-eviction-rate`,
`--unhealthy-zone-threshold`). The operator reacts to the condition flip as
soon as the controller emits the patch, regardless of the zone or
cluster-wide health state.
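
For context, these limiters are plain kube-controller-manager flags. A sketch
of an invocation showing their upstream defaults (how they are set on your
cluster depends on the distribution):

```
kube-controller-manager \
  --node-monitor-period=5s \
  --node-eviction-rate=0.1 \
  --secondary-node-eviction-rate=0.01 \
  --unhealthy-zone-threshold=0.55
```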
Pod *eviction* (actual deletion from the unreachable node) is a separate
mechanism, driven by `tolerationSeconds` on the
`node.kubernetes.io/unreachable` `NoExecute` taint (`300s` by default). That
timer does not hold up the operator's failover decision: CloudNativePG
promotes a new primary as soon as the `Ready` condition flips. By that point,
the kubelet on the isolated node has already stopped the old PostgreSQL
container locally: with the default
`.spec.probes.liveness.isolationCheck.enabled: true`, the instance manager
fails its own liveness probe once it can reach neither the API server nor
the rest of the cluster, and the kubelet kills the container within
approximately three probe periods (`~30s`). Full high availability
(recreation of the old primary on a healthy node by the operator) is still
gated on the taint-based eviction actually deleting the pod.
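
Putting the two knobs mentioned above together, a minimal `Cluster` fragment
might look like this (values are illustrative; `isolationCheck` is shown at
its documented default):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: cluster-example
spec:
  instances: 3
  # Seconds to wait after the primary is observed unready before
  # starting the failover procedure (0 = fail over immediately).
  failoverDelay: 0
  probes:
    liveness:
      isolationCheck:
        # Self-fence: the instance manager fails its own liveness probe
        # when it can reach neither the API server nor its peers.
        enabled: true
```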
## Failover Quorum (Quorum-based Failover)

Failover quorum is a mechanism that enhances data durability and safety during
