docs/src/topics/health-checking.md: 93 additions, 0 deletions
@@ -23,3 +23,96 @@ on the infrastructure provider.
Refer to the [Cluster API documentation](https://cluster-api.sigs.k8s.io/tasks/automated-machine-management/healthchecking)
for further information on configuring and using `MachineHealthChecks`.

## Replacing Machines Scheduled for Maintenance

CAPL detects upcoming Linode infrastructure maintenance windows and sets a `MaintenanceScheduled` condition on
the corresponding CAPI `Machine` objects. This condition can be used as a trigger for `MachineHealthCheck` to
automatically replace machines before their maintenance window begins.

### How it works

During each `LinodeCluster` reconciliation, CAPL queries the Linode API for maintenance events scheduled within
the next 72 hours. For each Linode instance that matches a `LinodeMachine` in the cluster, CAPL sets:

```
condition:
  type: MaintenanceScheduled
  status: "True"
```

on the owning CAPI `Machine` object. A `MachineHealthCheck` with `unhealthyMachineConditions` targeting this
condition will then trigger remediation, replacing the machine before the maintenance window starts.
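
As a rough sketch of what this could look like in practice, the condition might appear in the status of the affected `Machine` along the following lines. Only `type` and `status` come from the description above. The machine name, `reason`, `message`, and timestamp are illustrative placeholders rather than values taken from CAPL, and the placement under `status.conditions` is an assumption.

```yaml
# Illustrative excerpt of a CAPI Machine. Only `type` and `status` are
# taken from the text above; every other value is a hypothetical placeholder.
apiVersion: cluster.x-k8s.io/v1beta2
kind: Machine
metadata:
  name: example-worker-abc12                      # hypothetical machine name
status:
  conditions:
    - type: MaintenanceScheduled
      status: "True"
      reason: ExampleReason                       # placeholder; the real reason may differ
      message: "example message"                  # placeholder
      lastTransitionTime: "2025-01-01T00:00:00Z"  # placeholder timestamp
```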

### Example MachineHealthCheck

The following `MachineHealthCheck` replaces worker machines when `MaintenanceScheduled=True` has been set for
more than 1 hour:

```yaml
apiVersion: cluster.x-k8s.io/v1beta2
kind: MachineHealthCheck
metadata:
  name: ${CLUSTER_NAME}-maintenance
spec:
  clusterName: ${CLUSTER_NAME}
  selector:
    matchLabels:
      cluster.x-k8s.io/deployment-name: ${CLUSTER_NAME}
  checks:
    unhealthyMachineConditions:
      - type: MaintenanceScheduled
        status: "True"
        timeoutSeconds: 3600
  remediation:
    triggerIf:
      unhealthyLessThanOrEqualTo: 1
```

For control plane machines managed by `KubeadmControlPlane`:

```yaml
apiVersion: cluster.x-k8s.io/v1beta2
kind: MachineHealthCheck
metadata:
  name: ${CLUSTER_NAME}-cp-maintenance
spec:
  clusterName: ${CLUSTER_NAME}
  selector:
    matchLabels:
      cluster.x-k8s.io/control-plane: ""
  checks:
    unhealthyMachineConditions:
      - type: MaintenanceScheduled
        status: "True"
        timeoutSeconds: 3600
  remediation:
    triggerIf:
      unhealthyLessThanOrEqualTo: 1
```

### Field reference

| Field | Description |
|-------|-------------|
| `checks.unhealthyMachineConditions` | Conditions checked on the CAPI `Machine` object (not the Node). `MaintenanceScheduled` is set here by CAPL. |
| `type: MaintenanceScheduled` | The condition type set by CAPL when a Linode maintenance event is scheduled within 72 hours. |
| `status: "True"` | The condition status that indicates maintenance is scheduled. |
| `timeoutSeconds` | How long, in seconds, the condition must be present before remediation is triggered. Set this to a value less than the expected lead time before the maintenance window starts. |
| `remediation.triggerIf.unhealthyLessThanOrEqualTo` | Allows remediation only while the number of unhealthy machines is at most this value, which prevents remediation if too many machines are already unhealthy. For control plane machines, set to `1` to avoid remediating multiple control plane nodes simultaneously and losing etcd quorum. |

### Choosing a timeout

CAPL sets `MaintenanceScheduled` up to 72 hours before the maintenance window. A `timeoutSeconds` of `3600`
(1 hour) means remediation begins 71 hours before the window at the earliest. Adjust this value based on
how much lead time your workloads require for graceful draining.
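
To illustrate the arithmetic, the fragment below sketches a longer timeout, assuming the condition is set the full 72 hours ahead of the window. The 24-hour value is only an example, not a recommendation.

```yaml
# Sketch only: same check as in the examples above, with a longer timeout.
checks:
  unhealthyMachineConditions:
    - type: MaintenanceScheduled
      status: "True"
      # 24 h * 3600 s/h = 86400 s.
      # Earliest remediation: 72 h - 24 h = 48 h before the window.
      timeoutSeconds: 86400
```

If the condition is set later than 72 hours ahead, the remaining lead time shrinks by the same amount, so the timeout should stay well below the shortest notice you expect.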

### Limitations

- Only machines owned by a `MachineSet` or `KubeadmControlPlane` can be remediated by a `MachineHealthCheck`.
  Standalone machines are not eligible.
- CAPL never explicitly clears the `MaintenanceScheduled` condition. The `MachineHealthCheck` is expected to
  replace the machine while the condition is still set, which is the intended behavior.
- Control plane remediation preserves etcd quorum: CAPI will not remediate a second control plane machine
  until the replacement for the first is healthy. Set `unhealthyLessThanOrEqualTo: 1` for control plane