Add node failure tolerations to all service operators and openstackclient #1545
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 120
wondering if we should have an interface to customize the tolerations, like we did for the resource limits/requests?
Sure, could do that as a followup I suppose
sounds good and is probably better to do it in a follow up, instead of increasing the size of this PR.
These changes ensure OpenStackClient pods are automatically rescheduled when nodes fail, instead of requiring manual intervention to delete stuck pods. The 120-second tolerations provide faster failover than the 5min default, while the stuck pod detection handles edge cases where normal eviction fails.

- Adds tolerations for faster pod eviction (120s vs 5min default)
  * Handle node.kubernetes.io/not-ready taints
  * Handle node.kubernetes.io/unreachable taints
- Force delete stuck pods with grace period 0

Note:
- Going lower than 120s could be too aggressive and result in pod eviction, e.g. during a network issue or a kubelet restart.
- In a follow-up, the same tolerations should be added to the operator controller manager deployments, since the openstack-operator-controller-manager is the one handling the openstackclient pod.

Jira: OSPRH-18450
Signed-off-by: Martin Schuppert <mschuppert@redhat.com>
This change adds 120s tolerations for the node.kubernetes.io/not-ready and node.kubernetes.io/unreachable taints to reduce pod failover time during a node failure. The total eviction time is ~160s (versus 5min+ by default). 120s was chosen to prevent pod rescheduling on transient events, e.g. kubelet restarts or network issues.

Jira: OSPRH-18450
Signed-off-by: Martin Schuppert <mschuppert@redhat.com>
/test functional
/retest
I do not see the functional test error when running locally
/retest
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: abays, stuggi
/retest
fcc5124 into openstack-k8s-operators:main
/cherry-pick 18.0-fr3
@stuggi: new pull request created: #1549
This change adds 120s tolerations for the node.kubernetes.io/not-ready and node.kubernetes.io/unreachable taints to reduce pod failover time during a node failure.
The total eviction time is ~160s (versus 5min+ by default). 120s was chosen to prevent pod rescheduling on transient events, e.g. kubelet restarts or network issues.
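As a rough illustration of the tolerations and the timing described above, here is a minimal Go sketch. The `Toleration` struct is a local stand-in that mirrors the relevant fields of `corev1.Toleration` from `k8s.io/api/core/v1`, so the example compiles without external modules. The helper names `nodeFailureTolerations` and `evictionDelay` are hypothetical and not taken from this PR. The 40s detection delay is an assumption, chosen only so that the sum matches the ~160s stated above. The actual delay depends on the cluster's node monitoring settings.

```go
package main

import (
	"fmt"
	"time"
)

// Toleration is a local stand-in mirroring fields of corev1.Toleration
// (k8s.io/api/core/v1); real operator code would use that type instead.
type Toleration struct {
	Key               string
	Operator          string
	Effect            string
	TolerationSeconds *int64
}

// nodeFailureTolerations (hypothetical helper) returns NoExecute tolerations
// for the not-ready and unreachable node taints, each limited to `seconds`.
func nodeFailureTolerations(seconds int64) []Toleration {
	keys := []string{
		"node.kubernetes.io/not-ready",
		"node.kubernetes.io/unreachable",
	}
	tolerations := make([]Toleration, 0, len(keys))
	for _, key := range keys {
		s := seconds // fresh copy so each toleration owns its pointer
		tolerations = append(tolerations, Toleration{
			Key:               key,
			Operator:          "Exists",
			Effect:            "NoExecute",
			TolerationSeconds: &s,
		})
	}
	return tolerations
}

// evictionDelay (hypothetical helper) estimates the time from node failure to
// pod eviction: the delay until the node is tainted plus the toleration period.
func evictionDelay(detection time.Duration, tolerationSeconds int64) time.Duration {
	return detection + time.Duration(tolerationSeconds)*time.Second
}

func main() {
	for _, t := range nodeFailureTolerations(120) {
		fmt.Printf("%s %s %s %ds\n", t.Key, t.Operator, t.Effect, *t.TolerationSeconds)
	}
	// ASSUMPTION: ~40s until the node is tainted; not stated in the PR.
	fmt.Println(evictionDelay(40*time.Second, 120)) // prints 2m40s
}
```

Lowering `tolerationSeconds` shortens the second term only, which is why very small values risk evicting pods during short-lived network issues or kubelet restarts.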
Also fix OpenStackClient pod relocation during node failures
These changes ensure OpenStackClient pods are automatically rescheduled when nodes fail, instead of requiring manual intervention to delete stuck pods. The 120-second tolerations provide faster failover compared to the 5min default, while the stuck pod detection handles edge cases where normal eviction fails.
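The PR text does not spell out how a stuck pod is detected, so the following Go sketch only illustrates one plausible check. The criteria here are assumptions: a pod counts as stuck when it is terminating, its node is not ready, and it has been terminating for longer than a threshold. The `podState` type, the `isStuck` function, and the threshold are all hypothetical. Only the grace period of 0 for the forced delete comes from the description.

```go
package main

import (
	"fmt"
	"time"
)

// forceDeleteGracePeriod is the grace period the PR uses for stuck pods.
const forceDeleteGracePeriod int64 = 0

// podState is a hypothetical, simplified view of a pod for this sketch.
type podState struct {
	Terminating    bool          // a deletion has been requested
	TerminatingFor time.Duration // time elapsed since the deletion request
	NodeReady      bool          // whether the pod's node reports Ready
}

// isStuck (hypothetical) reports whether a pod has been terminating on a
// not-ready node for longer than threshold, so normal eviction likely failed.
func isStuck(p podState, threshold time.Duration) bool {
	if !p.Terminating || p.NodeReady {
		return false
	}
	return p.TerminatingFor > threshold
}

func main() {
	pod := podState{Terminating: true, TerminatingFor: 5 * time.Minute, NodeReady: false}
	// ASSUMPTION: a 2 minute threshold, chosen only for this example.
	if isStuck(pod, 2*time.Minute) {
		fmt.Printf("force delete with grace period %d\n", forceDeleteGracePeriod)
	}
}
```

In an actual controller the forced delete would go through the Kubernetes API with the grace period set to 0. That call is omitted here to keep the sketch free of external dependencies.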
Note: going lower than 120s could be too aggressive and result in pod eviction, e.g. during a network issue or a kubelet restart.
Jira: OSPRH-18450