
fix(lb): prevent empty backend pools on IPAM failures #213

Merged
Tomy2e merged 1 commit into scaleway:master from kommodity-io:fix/prevent-empty-lb-backends-on-ipam-failure-cleaned2
Feb 17, 2026

Conversation

@andreaswachs
Contributor

@andreaswachs andreaswachs commented Feb 3, 2026

Summary

Prevents load balancer backend pools from being emptied when IPAM data is unavailable during node initialization.

Changes:

  • Add IPAM fallback query when node InternalIPs are empty
  • Refuse to clear existing backends when no replacement IPs are found
  • Use internal IPs when service annotation scw-loadbalancer-pn-ids is set (not just global PN_ID env var)

This was tested manually by:

  • Deploying a Scaleway CCM container with these changes
  • Manually removing all backends from the LB fronting our LoadBalancer-type K8s service
  • Waiting and validating that the node IPs are added back ✅
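
To illustrate the second change in the list above (refusing to clear existing backends when no replacement IPs are found), here is a minimal sketch of such a guard. It is not the code from this PR: `desiredBackendIPs` and `errNoTargetIPs` are hypothetical names, and the real CCM works on Scaleway LB backend objects rather than plain string slices.

```go
package main

import (
	"errors"
	"sort"
)

// errNoTargetIPs is a hypothetical sentinel error used only in this sketch.
var errNoTargetIPs = errors.New("no target IPs found, refusing to clear existing backends")

// desiredBackendIPs returns the IP list that should be applied to a backend pool.
// When no replacement IPs were resolved but the pool currently has servers, it
// returns the existing list unchanged together with errNoTargetIPs, so the
// caller can log the problem and retry later instead of emptying the pool.
func desiredBackendIPs(existing, resolved []string) ([]string, error) {
	if len(resolved) == 0 && len(existing) > 0 {
		return existing, errNoTargetIPs
	}
	// Copy before sorting so the caller's slice is left untouched.
	out := append([]string(nil), resolved...)
	sort.Strings(out)
	return out, nil
}
```

With this shape, a reconcile loop that gets `errNoTargetIPs` back would skip the update for that pass and leave the pool as it is.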

@andreaswachs andreaswachs marked this pull request as draft February 3, 2026 09:00
@andreaswachs andreaswachs force-pushed the fix/prevent-empty-lb-backends-on-ipam-failure-cleaned2 branch from 543753c to 51afd0e Compare February 3, 2026 09:09
@andreaswachs andreaswachs marked this pull request as ready for review February 3, 2026 09:18
@nlm nlm requested review from Tomy2e and jtherin February 5, 2026 16:02
@Tomy2e
Member

Tomy2e commented Feb 6, 2026

Hello, thanks for your contribution.

There was already a fix in #175 to handle internal IPs being missing during node initialization. It requires the node.cloudprovider.kubernetes.io/uninitialized taint to be present on the node. Do you start kubelet with the --cloud-provider=external flag?

Comment thread scaleway/loadbalancers.go Outdated
targetIPs = extractNodesInternalIps(nodes)

// Fallback: if no internal IPs and we're in private network mode, try IPAM directly
if len(targetIPs) == 0 && (l.pnID != "" || len(pnIDs) > 0) {
Member

extractNodesInternalIps() returns all private IPs, but these can include IPs from private networks that are not listed in pnIDs.
So when this annotation is set, I think it would be better not to use extractNodesInternalIps().
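
The concern in this review comment can be sketched as a filter that keeps only the IPs attached to the requested private networks. This is an illustration under assumptions, not the code that ended up in the PR: the `nodeAddress` type and the `ipsInPrivateNetworks` helper are hypothetical, and the real implementation may obtain the mapping from IP to private network differently.

```go
package main

// nodeAddress is a hypothetical record pairing a node's private IP with the
// ID of the private network it belongs to.
type nodeAddress struct {
	IP   string
	PNID string
}

// ipsInPrivateNetworks keeps only the addresses whose private network ID
// appears in pnIDs, preserving the input order.
func ipsInPrivateNetworks(addrs []nodeAddress, pnIDs []string) []string {
	allowed := make(map[string]struct{}, len(pnIDs))
	for _, id := range pnIDs {
		allowed[id] = struct{}{}
	}

	var ips []string
	for _, a := range addrs {
		if _, ok := allowed[a.PNID]; ok {
			ips = append(ips, a.IP)
		}
	}
	return ips
}
```

A node attached to two private networks would then contribute only the address from the network named in the annotation, which is the behaviour the comment asks for.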

Contributor Author

Thanks for the comment - that makes sense.

I have updated the PR to make it work without the function.

@andreaswachs
Contributor Author

> Hello, thanks for your contribution.
>
> There was already a fix in #175 to handle internal IPs being missing during node initialization. It requires the node.cloudprovider.kubernetes.io/uninitialized taint to be present on the node. Do you start kubelet with the --cloud-provider=external flag?

Thank you!

Yes, we start the kubelet with said flag.

@andreaswachs andreaswachs requested a review from jtherin February 9, 2026 14:26
@Tomy2e
Member

Tomy2e commented Feb 13, 2026

Is this PR still needed if we update this

klog.Warningf("error getting IPs for private NIC %s on node %s: %v", pNIC.ID, server.Name, err)
continue

to be:

return addresses, fmt.Errorf("unable to query ipam for node %s: %v", server.Name, err)

And also

klog.Warningf("error getting private network addresses for node %s: %v", server.Name, err)

to be:

return nil, fmt.Errorf("error getting private network addresses for node %s: %v", server.Name, err)

That would prevent Nodes from being initialized without InternalIPs when there is a transient IPAM failure.

If possible, I'd like to avoid putting additional pressure on the IPAM API.
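
The suggestion above (return an error instead of logging a warning and continuing) can be sketched roughly as follows. This is a simplified illustration, not the actual CCM code: `ipamLookup`, `collectPrivateIPs` and `errIPAMUnavailable` are hypothetical names, and the sketch wraps the error with `%w` and returns `nil` addresses, which differs slightly from the snippets quoted above.

```go
package main

import (
	"errors"
	"fmt"
)

// errIPAMUnavailable is a hypothetical error standing in for a transient IPAM failure.
var errIPAMUnavailable = errors.New("ipam unavailable")

// ipamLookup is a hypothetical stand-in for the IPAM query made per private NIC.
type ipamLookup func(nicID string) ([]string, error)

// collectPrivateIPs gathers the IPs of every private NIC of a node. It fails on
// the first lookup error instead of skipping the NIC, so the caller never sees
// a partial or empty address list that looks like a valid result.
func collectPrivateIPs(nodeName string, nicIDs []string, lookup ipamLookup) ([]string, error) {
	var addresses []string
	for _, id := range nicIDs {
		ips, err := lookup(id)
		if err != nil {
			return nil, fmt.Errorf("unable to query ipam for node %s: %w", nodeName, err)
		}
		addresses = append(addresses, ips...)
	}
	return addresses, nil
}
```

Because the error reaches the caller, node initialization fails for that attempt and can be retried later, rather than completing with an empty InternalIP list. No extra IPAM call is made on the load balancer side.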

@nfm-corti nfm-corti force-pushed the fix/prevent-empty-lb-backends-on-ipam-failure-cleaned2 branch 2 times, most recently from 7d1db9a to 73ea234 Compare February 16, 2026 22:54
Prevent nodes from being initialized without InternalIPs when IPAM
queries fail transiently. By returning errors instead of only logging
warnings, the node keeps its uninitialized taint and Kubernetes retries
until IPAM is available.

Additionally, use internal IPs when the service annotation
scw-loadbalancer-pn-ids is set (not just the global PN_ID env var),
and refuse to clear existing LB backends when no replacement IPs are
found.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
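
As a rough illustration of the second paragraph of the commit message above, the sketch below decides whether internal IPs should be used by checking both the global private network ID and the per-service annotation. The full annotation key and the comma-separated value format are assumptions made for this example, and `privateNetworkIDs` and `useInternalIPs` are hypothetical helper names.

```go
package main

import "strings"

// pnIDsAnnotation is assumed to be the full key of the scw-loadbalancer-pn-ids
// service annotation; check the CCM documentation for the exact name.
const pnIDsAnnotation = "service.beta.kubernetes.io/scw-loadbalancer-pn-ids"

// privateNetworkIDs parses the annotation value, assumed here to be a
// comma-separated list, and drops empty entries.
func privateNetworkIDs(annotations map[string]string) []string {
	var ids []string
	for _, id := range strings.Split(annotations[pnIDsAnnotation], ",") {
		if id = strings.TrimSpace(id); id != "" {
			ids = append(ids, id)
		}
	}
	return ids
}

// useInternalIPs reports whether backends should target internal IPs: either a
// global private network ID is configured, or the service annotation lists at
// least one private network.
func useInternalIPs(globalPNID string, annotations map[string]string) bool {
	return globalPNID != "" || len(privateNetworkIDs(annotations)) > 0
}
```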
@nfm-corti nfm-corti force-pushed the fix/prevent-empty-lb-backends-on-ipam-failure-cleaned2 branch from 73ea234 to 6eeac07 Compare February 16, 2026 22:56
@nfm-corti
Contributor

@andreaswachs I pushed some changes based on the above comment. Please test this and let us know the results. Once verified, this should be good to merge.

@andreaswachs
Contributor Author

> @andreaswachs I pushed some changes based on the above comment. Please test this and let us know the results. Once verified, this should be good to merge.

I have just manually tested this and confirmed that it works:

  1. Removed IPs from the backend pool of a Scaleway LB that fronts a LoadBalancer-type Kubernetes SVC 💣
  2. Triggered a full reconciliation by adding an annotation to said K8s SVC ⏯️
  3. Observed the CCM restoring all needed IPs on the SCW LB ✅

@Tomy2e Tomy2e merged commit 876732d into scaleway:master Feb 17, 2026
5 checks passed
@andreaswachs
Contributor Author

Thank you so much @Tomy2e and @jtherin! ❤️

andreaswachs added a commit to kommodity-io/scaleway-cloud-controller-manager that referenced this pull request Feb 23, 2026
* feat(instance): avoid missing Node Refs for Private Network Instances without pnNIC

Signed-off-by: Steffen Karlsson <steffen.karlsson@gmail.com>

* fix(instances,lb): propagate IPAM errors and harden backend pool updates (scaleway#213)

Prevent nodes from being initialized without InternalIPs when IPAM
queries fail transiently. By returning errors instead of only logging
warnings, the node keeps its uninitialized taint and Kubernetes retries
until IPAM is available.

Additionally, use internal IPs when the service annotation
scw-loadbalancer-pn-ids is set (not just the global PN_ID env var),
and refuse to clear existing LB backends when no replacement IPs are
found.

Co-authored-by: Nicklas Frahm <nfm@corti.ai>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* feat: env to disable tags sync from ccm

* fix(examples): update container image for CCM

Signed-off-by: Nicklas Frahm <nfm@corti.ai>

* Update Scaleway CCM image version

Signed-off-by: Andreas Wachs <awa@corti.ai>

* Update Scaleway CCM image version to v0.36.0-kommodity.3

Signed-off-by: Andreas Wachs <awa@corti.ai>

* WIP

Signed-off-by: Andreas Wachs <awa@corti.ai>

* Update k8s-scaleway-ccm-latest.yml

Signed-off-by: Andreas Wachs <awa@corti.ai>

---------

Signed-off-by: Steffen Karlsson <steffen.karlsson@gmail.com>
Signed-off-by: Nicklas Frahm <nfm@corti.ai>
Signed-off-by: Andreas Wachs <awa@corti.ai>
Co-authored-by: Steffen Karlsson <steffen.karlsson@gmail.com>
Co-authored-by: Jérémy THERIN <jtherin@users.noreply.github.com>
Co-authored-by: Nicklas Frahm <nfm@corti.ai>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Jeremy THERIN <jtherin@scaleway.com>
andreaswachs added a commit to kommodity-io/scaleway-cloud-controller-manager that referenced this pull request Apr 15, 2026
4 participants