/kind bug
What happened:
OCCM identifies an existing Octavia load balancer for a Service by name on
the first reconcile (getLoadbalancerByName in pkg/openstack/loadbalancer.go).
The name is built from kube_service_<cluster-name>_<namespace>_<service>,
where <cluster-name> defaults to kubernetes. When two Kubernetes clusters
share the same OpenStack project and use the same --cluster-name (which is
the default for many distributions: kubeadm, RKE2, etc.), Services with
identical namespace and name produce identical load balancer names.
Octavia does not require load balancer names to be unique inside a project,
so OCCM in cluster B happily picks up cluster A's load balancer, sets the
loadbalancer.openstack.org/load-balancer-id annotation on its own Service,
and starts driving cluster A's load balancer (rewriting listeners,
members, FIP, etc.). Cluster A then loses its load balancer.
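To make the collision concrete, here is a minimal sketch of the naming scheme described above. The helper lbName is hypothetical and written only for illustration; it is not the actual OCCM function and ignores any truncation or sanitising the real code may do.

```go
package main

import "fmt"

// lbName is a hypothetical helper mirroring the naming scheme described in
// this issue: kube_service_<cluster-name>_<namespace>_<service>.
// It is an illustration, not the OCCM implementation.
func lbName(clusterName, namespace, service string) string {
	return fmt.Sprintf("kube_service_%s_%s_%s", clusterName, namespace, service)
}

func main() {
	// Both clusters run with the default --cluster-name=kubernetes and
	// expose a Service called "web" in the "default" namespace.
	clusterA := lbName("kubernetes", "default", "web")
	clusterB := lbName("kubernetes", "default", "web")

	fmt.Println(clusterA)
	fmt.Println("names collide:", clusterA == clusterB)
}
```

Nothing in the name distinguishes the two clusters, so a lookup by name alone cannot tell their load balancers apart.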
This is the same root cause discussed in the closed issues #2241, #2571 and
#2624. The accepted upstream guidance is "use a unique --cluster-name",
which is correct but offers no protection when it is not followed: two
operators independently bootstrapping clusters in the same tenant will
keep hitting it.
What you expected to happen:
OCCM should never adopt a load balancer that is owned by a different
Kubernetes cluster, even when names collide. A unique --cluster-name
should be a recommendation, not the only safety mechanism.
How to reproduce it (as minimally and precisely as possible):
- Create two Kubernetes clusters (cluster A and cluster B) in the same
OpenStack project. Both run OCCM with the default
--cluster-name=kubernetes (or any matching value).
- On cluster A, run:
kubectl create deployment web --image nginx --port 80 && kubectl expose deployment web --type LoadBalancer --port 6666 --target-port 80
An Octavia LB named kube_service_kubernetes_default_web is created
in OpenStack.
- On cluster B: same commands, exposing the service on port 8888.
- Observe that no second load balancer is created. Instead OCCM in
cluster B locates cluster A's LB by name, annotates its own Service
with the same load-balancer-id, and rewrites the LB to point at
cluster B's nodes on port 8888. Cluster A's Service is now broken.
Anything else we need to know?:
I'd like to propose adding a stable cluster identifier (the UID of the
kube-system namespace) as a load balancer tag of the form
kube_cluster_id_<uid>. The lookup would treat a load balancer with a
foreign kube_cluster_id_* tag as not-found instead of adopting it, and
fall back to the legacy behaviour for load balancers that don't carry any
kube_cluster_id_* tag (existing deployments and externally-created LBs).
This is a strictly additive, backward-compatible change that defends
against the failure mode without forcing operators to coordinate
--cluster-name values. I have a working implementation and will open a
PR shortly that links this issue.
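For illustration only, the lookup rule described above can be sketched as follows. This is not the code from the upcoming PR: the loadBalancer type, the adoptable function and the sample UIDs are hypothetical, and the sketch assumes the tags of a candidate load balancer are already available as a list of strings.

```go
package main

import (
	"fmt"
	"strings"
)

// clusterIDTagPrefix is the tag prefix proposed in this issue.
const clusterIDTagPrefix = "kube_cluster_id_"

// loadBalancer is a hypothetical stand-in for an Octavia load balancer;
// only the fields needed for the ownership check are modelled.
type loadBalancer struct {
	ID   string
	Name string
	Tags []string
}

// adoptable reports whether a load balancer found by name may be adopted
// by the cluster identified by clusterUID:
//   - tagged with this cluster's ID: adopt
//   - no kube_cluster_id_* tag at all: adopt (legacy behaviour)
//   - tagged only with another cluster's ID: treat as not-found
func adoptable(lb loadBalancer, clusterUID string) bool {
	own := clusterIDTagPrefix + clusterUID
	hasClusterTag := false
	for _, tag := range lb.Tags {
		if tag == own {
			return true
		}
		if strings.HasPrefix(tag, clusterIDTagPrefix) {
			hasClusterTag = true
		}
	}
	return !hasClusterTag
}

func main() {
	name := "kube_service_kubernetes_default_web"
	owned := loadBalancer{ID: "lb-1", Name: name, Tags: []string{"kube_cluster_id_uid-a"}}
	legacy := loadBalancer{ID: "lb-2", Name: name}

	fmt.Println(adoptable(owned, "uid-a"))  // same cluster
	fmt.Println(adoptable(owned, "uid-b"))  // foreign cluster
	fmt.Println(adoptable(legacy, "uid-b")) // untagged, legacy fallback
}
```

With this rule, cluster B would skip cluster A's tagged load balancer and create its own, while untagged load balancers keep today's behaviour.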
Environment:
- openstack-cloud-controller-manager version: master (reproduced against
v1.33.0 as well)
- OpenStack version: any with Octavia tags support (>= API v2.5 / Stein)
- Others: N/A