What steps did you take and what happened?
During initial spoke cluster deployment via CAPI with a Metal3 infrastructure provider, the MachineDeployment controller created 4 MachineSets (all with identical template specs) for a MachineDeployment with replicas: 2. This resulted in 8 worker Machines being created instead of 2, wasting bare-metal host resources and triggering an unnecessary scale-down.
Timeline from CAPI controller logs:
07:57:53 MachineSet spoke-0-worker-mlwjz created, scaled to 2 replicas
07:58:03 ERROR: "Failed to wait for cache to be up-to-date: timed out: MachineSet spoke-0-worker-mlwjz not found" (10s timeout)
07:58:03 MachineSet spoke-0-worker-zrgkk created, scaled to 2 replicas
07:58:13 ERROR: "Failed to wait for cache to be up-to-date: timed out: MachineSet spoke-0-worker-zrgkk not found" (10s timeout)
07:58:14 MachineSet spoke-0-worker-5qz58 created, scaled to 2 replicas
07:58:24 ERROR: "Failed to wait for cache to be up-to-date: timed out: MachineSet spoke-0-worker-5qz58 not found" (10s timeout)
07:58:25 MachineSet spoke-0-worker-vxl8q created, scaled to 2 replicas
07:58:29 Cache finally populates — all 4 MachineSets scale up simultaneously (8 Machines total)
Root cause: After creating a MachineSet, the controller waits up to 10 seconds for the informer cache to reflect it (internal/util/client/client.go). When this wait times out, the reconcile returns an error and is retried. On the retry, computeDesiredMachineSet cannot find the previously created MachineSet in the cache, so it computes a new "desired" MachineSet with a fresh random suffix on its machine-template-hash label and creates it. This repeats until the cache eventually syncs.
The machine-template-hash label includes a random suffix (by design, since #8585), so each new MachineSet gets a unique label value even though the template spec is identical. This prevents the controller from recognizing the previously created MachineSet as matching the desired state.
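To make the failure mode concrete, here is a minimal stdlib-only simulation of the loop described above. It is not the CAPI controller code: `staleCache`, `simulate`, and the `visibleAfter` parameter are hypothetical names used only to model a cache that stays empty for a number of reconcile attempts.

```go
package main

import "fmt"

// staleCache is a hypothetical model of an informer cache that only starts
// returning created objects once a given reconcile attempt is reached.
type staleCache struct {
	created      []string // MachineSets that already exist on the API server
	visibleAfter int      // first attempt at which the cache has caught up
}

// list returns what the cache can see at the given reconcile attempt.
func (c *staleCache) list(attempt int) []string {
	if attempt < c.visibleAfter {
		return nil // cache is still stale: nothing is visible yet
	}
	return c.created
}

// simulate runs one reconcile per suffix and creates a new MachineSet every
// time the cache shows none. suffixes stands in for the random suffix source.
func simulate(suffixes []string, visibleAfter int) []string {
	c := &staleCache{visibleAfter: visibleAfter}
	for attempt, suffix := range suffixes {
		if len(c.list(attempt)) > 0 {
			break // the cache finally shows an existing MachineSet
		}
		c.created = append(c.created, fmt.Sprintf("spoke-0-worker-%s", suffix))
	}
	return c.created
}
```

Calling `simulate` with the four suffixes from the timeline plus one spare and `visibleAfter` set to 4 yields four MachineSets, matching the observed behavior. With `visibleAfter` set to 1 only a single MachineSet is created.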
Evidence that all 4 MachineSets have identical specs:
spoke-0-worker-mlwjz: hash=694949418-mlwjz, revision=2, replicas=2
spoke-0-worker-zrgkk: hash=694949418-zrgkk, revision=1, replicas=0
spoke-0-worker-5qz58: hash=694949418-5qz58, revision=1, replicas=0
spoke-0-worker-vxl8q: hash=694949418-vxl8q, revision=1, replicas=2
All share the same base hash (694949418) — the template specs are identical. Only the random suffixes differ.
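The comparison above can be reproduced with a small helper. This is a sketch that assumes the label value always has the form `<base-hash>-<suffix>`, as in the four values listed; `baseHash` and `groupByBaseHash` are hypothetical helper names, not CAPI functions.

```go
package main

import (
	"sort"
	"strings"
)

// baseHash strips the trailing "-<suffix>" from a machine-template-hash
// label value. Assumption: the suffix is everything after the last dash.
func baseHash(label string) string {
	if i := strings.LastIndex(label, "-"); i >= 0 {
		return label[:i]
	}
	return label
}

// groupByBaseHash maps each base hash to the sorted names of the
// MachineSets whose label carries that base hash.
func groupByBaseHash(labelsByName map[string]string) map[string][]string {
	groups := make(map[string][]string)
	for name, label := range labelsByName {
		h := baseHash(label)
		groups[h] = append(groups[h], name)
	}
	for _, names := range groups {
		sort.Strings(names)
	}
	return groups
}
```

Feeding in the four name/label pairs from the list above produces a single group keyed by 694949418 that contains all four MachineSets, which is the duplicate condition this report describes.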
What did you expect to happen?
A MachineDeployment with replicas: 2 should create exactly 1 MachineSet with 2 replicas during initial deployment, even if the informer cache is slow to sync.
Cluster API version
v1beta2 (shipped via MCE 2.17, based on CAPI v1.12.x)
Kubernetes version
OpenShift 4.21 (k8s 1.34)
Anything else you would like to add?
Suggested fixes
- Guard against duplicate creation: Before calling computeDesiredMachineSet, query the API server directly (not just the cache) for existing MachineSets owned by this MachineDeployment. If one already exists with generation == 1, skip creation.
- Increase cache wait timeout: The 10-second timeout in internal/util/client/client.go may be too short for clusters under load. Consider making it configurable or increasing the default.
- Use MachineDeployment generation as guard: Track which generation created the current MachineSet. If the MachineDeployment observedGeneration hasn't changed since the last MachineSet creation, don't create a new one.
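As a rough illustration of the first suggestion, the sketch below checks the cache first and falls back to a live read before deciding to create. It is stdlib-only and not based on the actual controller code: `machineSet`, `lister`, `sliceLister`, `findMatching`, and `shouldCreate` are hypothetical, and matching on the base hash prefix is an assumption. In a real controller the live lister would presumably be backed by an uncached reader, for example the one controller-runtime exposes through the manager's `GetAPIReader()`.

```go
package main

import "strings"

// machineSet is a hypothetical, trimmed-down stand-in for a MachineSet.
type machineSet struct {
	Name  string
	Owner string // name of the owning MachineDeployment
	Hash  string // machine-template-hash label value, "<base>-<suffix>"
}

// lister abstracts a source of MachineSets (cached or live).
type lister interface {
	ListOwned(owner string) []machineSet
}

// sliceLister is an in-memory lister used for illustration.
type sliceLister []machineSet

func (s sliceLister) ListOwned(owner string) []machineSet {
	var out []machineSet
	for _, ms := range s {
		if ms.Owner == owner {
			out = append(out, ms)
		}
	}
	return out
}

// findMatching returns an owned MachineSet whose label starts with wantBase.
func findMatching(l lister, owner, wantBase string) (machineSet, bool) {
	for _, ms := range l.ListOwned(owner) {
		if strings.HasPrefix(ms.Hash, wantBase+"-") {
			return ms, true
		}
	}
	return machineSet{}, false
}

// shouldCreate reports whether a new MachineSet is needed: only when neither
// the cache nor the live read shows one with the desired base hash.
func shouldCreate(cached, live lister, owner, wantBase string) bool {
	if _, ok := findMatching(cached, owner, wantBase); ok {
		return false
	}
	_, ok := findMatching(live, owner, wantBase)
	return !ok
}
```

With an empty cache and a live source that already holds spoke-0-worker-mlwjz, `shouldCreate` returns false, so the second, third, and fourth MachineSets in the timeline would not have been created. The extra live read only happens when the cache shows no match, which keeps the added API server load limited to the stale-cache case.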