fix(proxmoxmachine): tolerate owner Machine NotFound to unblock stuck deletes #734
… deletes
The ProxmoxMachine reconciler fetches the owner Machine via
util.GetOwnerMachine and returns the error verbatim if the Machine is gone:
    machine, err := util.GetOwnerMachine(ctx, r.Client, proxmoxMachine.ObjectMeta)
    if err != nil {
        return ctrl.Result{}, err
    }
GetOwnerMachine returns nil, nil only when no Machine OwnerRef exists at all.
When the OwnerRef is present but the Machine object has been removed, it
calls client.Get and returns the apierrors.NotFound. The reconciler returns
that error, controller-runtime backs off exponentially, and the delete
branch lower in Reconcile (gated by DeletionTimestamp.IsZero) is never
reached — so the MachineFinalizer is never removed and the ProxmoxMachine
sits in Terminating forever. The parent ProxmoxCluster's
"waiting for machines to be deleted" loop never progresses, and the Cluster
itself can't finalize.
This deadlock is reachable any time the owner Machine is removed before
the InfraMachine is finalized: a manual finalizer strip, the
`cluster.x-k8s.io/force-delete-machine` annotation, or any other actor
reaping the Machine out-of-band. The known IONOS troubleshooting workaround
(docs/Troubleshooting.md "Machine deletion deadlock") is to manually strip
the ProxmoxMachine finalizer — i.e. work around this exact branch.
Fix:
- If GetOwnerMachine returns IsNotFound and the ProxmoxMachine has a
DeletionTimestamp, remove the MachineFinalizer and return cleanly so the
API server can reap the object. The underlying Proxmox VM may still
exist; the log line tells operators to verify, since we cannot construct
a full machineScope to call vmservice.DeleteVM without the owner Machine.
- If GetOwnerMachine returns IsNotFound and the ProxmoxMachine is not being
deleted, log and return nil instead of error-looping. Either the OwnerRef
will be corrected (re-reconcile via watch) or the ProxmoxMachine will be
deleted (re-reconcile via the delete event).
Includes an envtest covering the deletion-with-missing-owner case.
mcbenjemaa
left a comment
I wonder how you reproduced this.
The InfraMachine is deleted whenever its parent Machine is being deleted, which is why it has an ownerReference.
In fact, the CAPI machine controller waits for the InfraMachine to be deleted first, then removes the finalizer and deletes the Machine.
Do you have something specific that leaves the parent Machine gone while the InfraMachine remains?
Hi there... this seems to be a corner case. While creating a cluster, one worker VM got stuck in cloud-init during creation, so I destroyed the VM in Proxmox and then proceeded to delete the cluster. That is how I hit this behaviour.
// The owner Machine OwnerRef is set but the Machine object itself is
// gone. This is the stuck-finalizer deadlock: without this branch the
// reconciler returns the NotFound error to controller-runtime, which
// requeues with exponential backoff forever, never reaching
// reconcileDelete and never removing our finalizer. The ProxmoxMachine
// then blocks the parent ProxmoxCluster (waiting for machines) and the
// Cluster from finalizing.
//
// We can hit this whenever the owner Machine is removed before the
// InfraMachine is finalized: a manual finalizer strip, the
// `cluster.x-k8s.io/force-delete-machine` annotation, or any other
// actor that out-of-band reaps the Machine.
This comment (that looks AI-generated btw, please attribute it) is excessive. Convert the comments in your PR to one-liners and instead elaborate in the commit message.
What this fixes
The ProxmoxMachine reconciler currently deadlocks any ProxmoxMachine whose owner Machine has been removed before the InfraMachine was finalized. Reproduction: take any cluster, remove a Machine out-of-band (manual finalizer strip, cluster.x-k8s.io/force-delete-machine annotation, or any other actor) before its ProxmoxMachine finishes finalizing. The ProxmoxMachine then sits in Terminating forever and blocks the parent ProxmoxCluster and Cluster from finalizing.
Root cause
internal/controller/proxmoxmachine_controller.go#L97-L105:
util.GetOwnerMachine returns nil, nil only when there is no Machine OwnerRef at all. When the OwnerRef is present but the Machine object is gone, it calls client.Get and returns apierrors.NotFound. The reconciler returns that error verbatim, controller-runtime backs off exponentially, and the delete branch (if !proxmoxMachine.ObjectMeta.DeletionTimestamp.IsZero()) is never reached. The MachineFinalizer never comes off.
The log signature observed in production:
repeating every backoff cycle for hours.
This is a known class of issue: docs/Troubleshooting.md under "Machine deletion deadlock" already documents the manual workaround (strip the proxmoxmachine finalizer by hand). This PR addresses the underlying controller behaviour so the manual workaround is no longer needed.
What this PR does
In Reconcile, when GetOwnerMachine returns IsNotFound:
- ProxmoxMachine has a DeletionTimestamp: remove the MachineFinalizer and return cleanly. A log line tells the operator to verify the underlying Proxmox VM, since a full machineScope cannot be constructed without the owner Machine and vmservice.DeleteVM requires the scope. This is the conservative choice: unblock the controller-level deadlock without taking destructive action against the hypervisor under partial state.
- ProxmoxMachine is not being deleted: log and return nil instead of error-looping. Either the OwnerRef will be corrected (which triggers another reconcile via the watch) or the ProxmoxMachine will be deleted (which triggers another reconcile via the delete event).
All other failure modes still propagate as before.
Tests
Adds an envtest in proxmoxmachine_controller_test.go covering the deletion-with-missing-owner case: the test creates a ProxmoxMachine with a Machine OwnerRef pointing to a non-existent Machine and the MachineFinalizer set, deletes it (the API server records a deletionTimestamp since the finalizer is held), runs Reconcile, and asserts the object is reaped (i.e. the finalizer was removed).
Compatibility
Why I'm sending this
We hit this in production (CAPMOX v0.8.1) when a workload cluster's apiserver became unreachable mid-delete and a Machine ended up reaped before its InfraMachine finalized — the exact deadlock above. Walking the diagnosis backward led here. Happy to iterate on the patch shape, the log wording, or the test scaffolding.