Add node failure tolerations to all service operators and openstackclient by stuggi · Pull Request #1545 · openstack-k8s-operators/openstack-operator

stuggi · 2025-07-24T16:06:06Z

This change adds 120s tolerations for the node.kubernetes.io/not-ready and node.kubernetes.io/unreachable taints to reduce pod failover time during a node failure.

The total eviction time is then ~160s (versus 5min+ by default). 120s was chosen to prevent pod rescheduling on short disruptions, e.g. kubelet restarts or network issues.

Also fix OpenStackClient pod relocation during node failures

These changes ensure OpenStackClient pods are automatically rescheduled when nodes fail, instead of requiring manual intervention to delete stuck pods. The 120-second tolerations provide faster failover compared to the 5min default, while the stuck pod detection handles edge cases where normal eviction fails.

- Adds tolerations for faster pod eviction (120s vs 5min default)
  * Handle node.kubernetes.io/not-ready taints
  * Handle node.kubernetes.io/unreachable taints
- Force delete stuck pods with grace period 0

Note: going lower than 120s could be too aggressive and result in pod eviction, e.g. during a network issue or a kubelet restart.
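For reference, a minimal sketch of what the pair of tolerations could look like in a pod spec. The not-ready entry matches the snippet quoted in the review; the unreachable entry is assumed to mirror it, and the placement under a Deployment's `spec.template.spec` is an assumption rather than a copy of the actual manifests in this PR.

```yaml
# Fragment of a pod spec (assumed location: spec.template.spec of a Deployment).
tolerations:
  # Tolerate the not-ready taint for 120s before the pod is evicted.
  - key: "node.kubernetes.io/not-ready"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 120
  # Assumed to mirror the entry above for the unreachable taint.
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 120
```

Without explicit entries like these, Kubernetes adds the same two tolerations with `tolerationSeconds: 300`, which is the 5min default referred to above.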

Jira: OSPRH-18450

stuggi · 2025-07-24T16:06:57Z

+      - key: "node.kubernetes.io/not-ready"
+        operator: "Exists"
+        effect: "NoExecute"
+        tolerationSeconds: 120


wondering if we should have an interface to customize the tolerations, like we did for the resource limits/requests?

Sure, could do that as a followup I suppose

sounds good and is probably better to do it in a follow up, instead of increasing the size of this PR.

These changes ensure OpenStackClient pods are automatically rescheduled when nodes fail, instead of requiring manual intervention to delete stuck pods. The 120-second tolerations provide faster failover compared to the 5min default, while the stuck pod detection handles edge cases where normal eviction fails.

- Adds tolerations for faster pod eviction (120s vs 5min default)
  * Handle node.kubernetes.io/not-ready taints
  * Handle node.kubernetes.io/unreachable taints
- Force delete stuck pods with grace period 0

Note:
- going lower than 120s could be too aggressive and result in pod eviction, e.g. during a network issue or a kubelet restart
- in a follow-up, the same tolerations should be added to the operator controller manager deployments, since the openstack-operator-controller-manager is the one handling the openstackclient pod.

Jira: OSPRH-18450
Signed-off-by: Martin Schuppert <mschuppert@redhat.com>

This change adds 120s tolerations for the node.kubernetes.io/not-ready and node.kubernetes.io/unreachable taints to reduce pod failover time during a node failure.

The total eviction time is then ~160s (versus 5min+ by default). 120s was chosen to prevent pod rescheduling on short disruptions, e.g. kubelet restarts or network issues.

Jira: OSPRH-18450
Signed-off-by: Martin Schuppert <mschuppert@redhat.com>

stuggi · 2025-07-25T07:17:35Z

/test functional

stuggi · 2025-07-25T08:21:41Z

/retest

stuggi · 2025-07-25T10:13:48Z

I do not see the functional test error when running locally.

stuggi · 2025-07-25T13:54:17Z

/retest

abays

/lgtm

openshift-ci · 2025-07-29T09:03:22Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abays, stuggi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [abays,stuggi]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

stuggi · 2025-07-30T10:55:13Z

/retest

stuggi · 2025-07-30T14:52:04Z

/cherry-pick 18.0-fr3

openshift-cherrypick-robot · 2025-07-30T14:52:43Z

@stuggi: new pull request created: #1549

Details

In response to this:

/cherry-pick 18.0-fr3

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-ci Bot requested review from fultonj and rabi July 24, 2025 16:06

openshift-ci Bot added the approved label Jul 24, 2025

stuggi commented Jul 24, 2025

stuggi requested review from abays, dprince and olliewalsh and removed request for fultonj and rabi July 24, 2025 16:07

stuggi added the do-not-merge/hold label Jul 24, 2025

stuggi added 2 commits July 25, 2025 09:09

stuggi force-pushed the OSPRH-18450 branch from dd690c3 to 0ef2dfb Compare July 25, 2025 07:10

abays approved these changes Jul 29, 2025

openshift-ci Bot assigned abays Jul 29, 2025

openshift-ci Bot added the lgtm label Jul 29, 2025

stuggi removed the do-not-merge/hold label Jul 30, 2025

openshift-merge-bot Bot merged commit fcc5124 into openstack-k8s-operators:main Jul 30, 2025
10 checks passed

openshift-cherrypick-robot mentioned this pull request Jul 30, 2025

[18.0-fr3] Add node failure tolerations to all service operators and openstackclient #1549

Merged

openshift-merge-bot[bot] merged 2 commits into openstack-k8s-operators:main from stuggi:OSPRH-18450

3 participants
