Add node failure tolerations to all service operators and openstackclient #1545

Merged
openshift-merge-bot[bot] merged 2 commits into openstack-k8s-operators:main from stuggi:OSPRH-18450 on Jul 30, 2025

stuggi commented Jul 24, 2025

This change adds 120s tolerations for the node.kubernetes.io/not-ready and node.kubernetes.io/unreachable taints to reduce pod failover time during a node failure.

The total eviction time is ~160s (vs. 5min+ with the default tolerations). 120s was chosen to prevent pod rescheduling on short disruptions, e.g. kubelet restarts or network issues.

Also fix OpenStackClient pod relocation during node failures

These changes ensure OpenStackClient pods are automatically rescheduled when nodes fail, instead of requiring manual intervention to delete stuck pods. The 120-second tolerations provide faster failover compared to the 5min default, while the stuck pod detection handles edge cases where normal eviction fails.

  • Adds tolerations for faster pod eviction (120s vs 5min default)
    • Handle node.kubernetes.io/not-ready taints
    • Handle node.kubernetes.io/unreachable taints
  • Force delete stuck pods with grace period 0

Note: going lower than 120s could be too aggressive and result in pod eviction, e.g. during a network issue or a kubelet restart.
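The two tolerations described above could be expressed along these lines in a pod spec. This is a minimal sketch rather than the actual diff: the `not-ready` entry matches the snippet under review in this PR, while the `unreachable` entry and the surrounding `spec` layout are assumed to mirror it.

```yaml
# Pod spec fragment (sketch, not the PR diff): tolerate both node-failure
# taints for 120s; once that time has passed the pod is evicted.
spec:
  tolerations:
    - key: "node.kubernetes.io/not-ready"
      operator: "Exists"
      effect: "NoExecute"
      tolerationSeconds: 120
    # Assumed to mirror the not-ready entry above.
    - key: "node.kubernetes.io/unreachable"
      operator: "Exists"
      effect: "NoExecute"
      tolerationSeconds: 120
```

When a pod carries no explicit entries for these taints, Kubernetes normally adds both tolerations with `tolerationSeconds: 300`, which is where the 5min default comes from.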

Jira: OSPRH-18450

openshift-ci Bot requested review from fultonj and rabi July 24, 2025 16:06

- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 120

Contributor Author

wondering if we should have an interface to customize the tolerations, like we did for the resource limits/requests?

Contributor

Sure, could do that as a followup I suppose

Contributor Author

sounds good and is probably better to do it in a follow up, instead of increasing the size of this PR.

stuggi requested review from abays, dprince and olliewalsh and removed request for fultonj and rabi July 24, 2025 16:07

stuggi added 2 commits July 25, 2025 09:09

These changes ensure OpenStackClient pods are automatically rescheduled
when nodes fail, instead of requiring manual intervention to delete
stuck pods. The 120-second tolerations provide faster failover compared
to the 5min default, while the stuck pod detection handles edge cases
where normal eviction fails.

- Adds tolerations for faster pod eviction (120s vs 5min default)
  * Handle node.kubernetes.io/not-ready taints
  * Handle node.kubernetes.io/unreachable taints
- Force delete stuck pods with grace period 0

Note:
- going lower than 120s could be too aggressive and result in pod
  eviction, e.g. during a network issue or a kubelet restart
- in a follow-up, the same tolerations should be added to the operator
  controller manager deployments, since the
  openstack-operator-controller-manager is the one handling the
  openstackclient pod.

Jira: OSPRH-18450

Signed-off-by: Martin Schuppert <mschuppert@redhat.com>
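
The follow-up mentioned in the note might look roughly like the fragment below, which places the same tolerations in a Deployment pod template. This is only a sketch: the Deployment name is taken from the note, and how the operator deployments are actually generated is not shown in this PR conversation.

```yaml
# Deployment fragment (sketch): selector, containers and other required
# fields are omitted; only the placement of the tolerations is shown.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: openstack-operator-controller-manager  # name taken from the commit note
spec:
  template:
    spec:
      tolerations:
        - key: "node.kubernetes.io/not-ready"
          operator: "Exists"
          effect: "NoExecute"
          tolerationSeconds: 120
        - key: "node.kubernetes.io/unreachable"
          operator: "Exists"
          effect: "NoExecute"
          tolerationSeconds: 120
```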

This change adds 120s tolerations for node.kubernetes.io/not-ready
and unreachable taints to reduce pod failover time during a node
failure.

The total eviction time is ~160s (vs. 5min+ with the default
tolerations). 120s was chosen to prevent pod rescheduling on short
disruptions, e.g. kubelet restarts or network issues.

Jira: OSPRH-18450

Signed-off-by: Martin Schuppert <mschuppert@redhat.com>
stuggi commented Jul 25, 2025

/test functional

stuggi commented Jul 25, 2025

/retest

stuggi commented Jul 25, 2025

I do not see the functional test error when running locally.

stuggi commented Jul 25, 2025

/retest

abays left a comment

/lgtm

openshift-ci Bot commented Jul 29, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abays, stuggi

stuggi commented Jul 30, 2025

/retest

openshift-merge-bot Bot merged commit fcc5124 into openstack-k8s-operators:main on Jul 30, 2025
10 checks passed

stuggi commented Jul 30, 2025

/cherry-pick 18.0-fr3

@openshift-cherrypick-robot

@stuggi: new pull request created: #1549