Skip to content

Leader election lease tunables#1442

Merged
openshift-merge-bot[bot] merged 1 commit into
openstack-k8s-operators:mainfrom
dprince:leader-election
May 15, 2025
Merged

Leader election lease tunables#1442
openshift-merge-bot[bot] merged 1 commit into
openstack-k8s-operators:mainfrom
dprince:leader-election

Conversation

@dprince

@dprince dprince commented May 14, 2025

Copy link
Copy Markdown
Contributor

Set leader election LeaseDuration, RenewDealine and RetryPeriod via env variables.

The values used are the ones mentioned as recommended in:
https://github.com/openshift/enhancements/blob/master/CONVENTIONS.md#high-availability

Co-authored-by: Dan Prince dprince@redhat.com

Jira: OSPRH-16335

@openshift-ci openshift-ci Bot requested review from frenzyfriday and rabi May 14, 2025 20:01
@dprince dprince requested review from hjensas and stuggi May 14, 2025 20:01
@dprince

dprince commented May 14, 2025

Copy link
Copy Markdown
Contributor Author

@hjensas Resurected and this PR works for me. I made a few adjustments. k8s env's like the values to be strings. I also split the envs into individual variables for the templates as this just seemed cleaner. Also the managers.yaml template is a bit special since it was nested loops.

I just reread the openshift guidance, and I would somewhat like to adjust our defaults a bit lower. As is it appears to cause an initial election of the controller-operator to immediately wait over a minute for election. And then it settles out. If you want to make a change with this PR you just edit the CSV for openstack-operator and it applies them to all our operators accordingly. Increasing the settings would have the impact of making things get applied slower I think so with these settings it takes over a minute to apply that config change :/

@dprince dprince requested a review from abays May 14, 2025 20:08
@dprince dprince force-pushed the leader-election branch from ddcdc9f to 8fbbbc4 Compare May 14, 2025 20:18
value: quay.io/openstack-k8s-operators/openstack-operator:latest
- name: ENABLE_WEBHOOKS
value: false
- name: LEASE_DURATION

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hjensas I'm inclined to decrease these back towards the defaults. Perhaps 30/20/5 would be a reasonable default for us. And then folks who want to try out OpenShift cluster high availability mode have that option which we could document. The defaults seem to be https://pkg.go.dev/k8s.io/client-go/tools/leaderelection 15/10/2 I think.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, yeah my eyes when blink blink a couple of times when I saw how much difference there was the openshift/library-go compared to https://pkg.go.dev/k8s.io/client-go/tools/leaderelection.

I also noticed in my testing that if the operators where already running - the "old" operator would terminate and then the new one would wait for over a minute to get the lock. It made me think, it would be good if controlled termination of a leader released/deleted the lease ...

Let's try it a bit lower, 30/20/5 sounds good. I can test that a bit, and if I need to patch the csv in labs to bump it a bit more that is ok.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, yeah my eyes when blink blink a couple of times when I saw how much difference there was the openshift/library-go compared to https://pkg.go.dev/k8s.io/client-go/tools/leaderelection.

thank you for that link. I was wondering how lose values were chosen

@hjensas hjensas left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this job I see: https://softwarefactory-project.io/zuul/t/rdoproject.org/build/2f399e8aa8da40cfa514ec52a9dda98a

openstack-operator-controller-operator

2025-05-14T20:50:08.193Z	INFO	setup	manager configured with lease duration	{"seconds": 137}
2025-05-14T20:50:08.193Z	INFO	setup	manager configured with renew deadline	{"seconds": 107}
2025-05-14T20:50:08.193Z	INFO	setup	manager configured with retry period	{"seconds": 26}

And it propagates to the other operators.
ironic-operator-controller-manager

2025-05-14T20:50:54Z	INFO	setup	manager configured with lease duration	{"seconds": 137}
2025-05-14T20:50:54Z	INFO	setup	manager configured with renew deadline	{"seconds": 107}
2025-05-14T20:50:54Z	INFO	setup	manager configured with retry period	{"seconds": 26}

@openshift-ci

openshift-ci Bot commented May 14, 2025

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dprince, hjensas

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@abays

abays commented May 15, 2025

Copy link
Copy Markdown
Contributor

/test openstack-operator-build-deploy-kuttl-4-18

Set leader election LeaseDuration, RenewDealine and RetryPeriod via env
variables.

Doubling the default values from here:
 https://pkg.go.dev/k8s.io/client-go/tools/leaderelection

See also: https://github.com/openshift/enhancements/blob/master/CONVENTIONS.md#high-availability

Co-authored-by: Dan Prince <dprince@redhat.com>

Jira: OSPRH-16335
@dprince dprince force-pushed the leader-election branch from 8fbbbc4 to aa32a2b Compare May 15, 2025 12:46
@openshift-ci openshift-ci Bot removed the lgtm label May 15, 2025
@openshift-ci

openshift-ci Bot commented May 15, 2025

Copy link
Copy Markdown
Contributor

New changes are detected. LGTM label has been removed.

@openshift-merge-bot openshift-merge-bot Bot merged commit 8e6d567 into openstack-k8s-operators:main May 15, 2025
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants