[occm] Cross-cluster load balancer name collision when multiple Kubernetes clusters share an OpenStack project #3102

@enginrect

/kind bug

What happened:

OCCM identifies an existing Octavia load balancer for a Service by name on
the first reconcile (getLoadbalancerByName in pkg/openstack/loadbalancer.go).
The name is built from kube_service_<cluster-name>_<namespace>_<service>,
where <cluster-name> defaults to kubernetes. When two Kubernetes clusters
share the same OpenStack project and use the same --cluster-name (which is
the default for many distributions: kubeadm, RKE2, etc.), Services with
identical namespace and name produce identical load balancer names.
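To make the collision concrete, here is a minimal sketch of the naming scheme described above. The helper `lbName` is a hypothetical stand-in for illustration, not the actual OCCM function, and it ignores any length truncation the real code may apply.

```go
package main

import "fmt"

// lbName is a hypothetical helper that mirrors the naming scheme
// kube_service_<cluster-name>_<namespace>_<service> described in this issue.
func lbName(clusterName, namespace, service string) string {
	return fmt.Sprintf("kube_service_%s_%s_%s", clusterName, namespace, service)
}

func main() {
	// Two independent clusters, both left at the default --cluster-name.
	nameA := lbName("kubernetes", "default", "web") // cluster A
	nameB := lbName("kubernetes", "default", "web") // cluster B

	fmt.Println(nameA)          // kube_service_kubernetes_default_web
	fmt.Println(nameA == nameB) // true
}
```

Nothing in the name distinguishes the two clusters, so a lookup by name alone cannot tell their load balancers apart.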

Octavia does not require load balancer names to be unique inside a project,
so OCCM in cluster B happily picks up cluster A's load balancer, sets the
loadbalancer.openstack.org/load-balancer-id annotation on its own Service,
and starts driving cluster A's load balancer (rewriting listeners,
members, the floating IP, etc.). Cluster A then loses its load balancer.

This is the same root cause discussed in the closed issues #2241, #2571 and
#2624. The accepted upstream guidance is "use a unique --cluster-name".
That advice is correct, but it is a convention rather than a safeguard:
two operators independently bootstrapping clusters in the same tenant will
keep hitting the problem.

What you expected to happen:

OCCM should never adopt a load balancer that is owned by a different
Kubernetes cluster, even when names collide. A unique --cluster-name
should be a recommendation, not the only safety mechanism.

How to reproduce it (as minimally and precisely as possible):

  1. Create two Kubernetes clusters (cluster A and cluster B) in the same
    OpenStack project. Both run OCCM with the default
    --cluster-name=kubernetes (or any matching value).
  2. On cluster A, run:
       kubectl create deployment web --image nginx --port 80 && \
         kubectl expose deployment web --type LoadBalancer --port 6666 --target-port 80
    An Octavia LB named kube_service_kubernetes_default_web is created
    in OpenStack.
  3. On cluster B: run the same commands, but expose the Service on port
    8888 (--port 8888).
  4. Observe that no second load balancer is created. Instead OCCM in
    cluster B locates cluster A's LB by name, annotates its own Service
    with the same load-balancer-id, and rewrites the LB to point at
    cluster B's nodes on port 8888. Cluster A's Service is now broken.

Anything else we need to know?:

I'd like to propose adding a stable cluster identifier (the UID of the
kube-system namespace) as a load balancer tag of the form
kube_cluster_id_<uid>. The lookup would treat a load balancer with a
foreign kube_cluster_id_* tag as not-found instead of adopting it, and
fall back to the legacy behaviour for load balancers that don't carry any
kube_cluster_id_* tag (existing deployments and externally-created LBs).
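The lookup rule proposed above could be sketched roughly as follows. This is only an illustration of the decision logic, not the forthcoming PR: `classify`, the `decision` type and its constants are hypothetical names, and the choice to adopt a load balancer that carries both the cluster's own tag and a foreign one is an assumption this sketch makes.

```go
package main

import (
	"fmt"
	"strings"
)

// clusterIDTagPrefix is the tag prefix proposed in this issue.
const clusterIDTagPrefix = "kube_cluster_id_"

// decision is a hypothetical result type for the name-based lookup.
type decision int

const (
	adoptOwn        decision = iota // LB carries this cluster's tag
	adoptLegacy                     // LB has no kube_cluster_id_* tag at all
	treatAsNotFound                 // LB is tagged by a different cluster
)

// classify decides what to do with a load balancer found by name, given its
// tags and the UID of this cluster's kube-system namespace.
func classify(tags []string, ownUID string) decision {
	ownTag := clusterIDTagPrefix + ownUID
	foreign := false
	for _, t := range tags {
		if t == ownTag {
			return adoptOwn // assumption: own tag wins even if others exist
		}
		if strings.HasPrefix(t, clusterIDTagPrefix) {
			foreign = true
		}
	}
	if foreign {
		return treatAsNotFound
	}
	return adoptLegacy
}

func main() {
	own := "uid-b" // placeholder UID for illustration
	fmt.Println(classify([]string{"kube_cluster_id_uid-b"}, own) == adoptOwn)
	fmt.Println(classify([]string{"kube_cluster_id_uid-a"}, own) == treatAsNotFound)
	fmt.Println(classify([]string{"some_other_tag"}, own) == adoptLegacy)
}
```

With this rule, cluster B in the reproduction above would see cluster A's tagged load balancer as not-found and create its own, while untagged load balancers keep today's behaviour.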

This is a strictly additive, backward-compatible change that defends
against the failure mode without forcing operators to coordinate
--cluster-name values. I have a working implementation and will open a
PR shortly that links this issue.

Environment:

  • openstack-cloud-controller-manager version: master (reproduced against
    v1.33.0 as well)
  • OpenStack version: any with Octavia tags support (>= API v2.5 / Stein)
  • Others: N/A
