
fix(proxmoxmachine): tolerate owner Machine NotFound to unblock stuck deletes #734

Open

raul-cq wants to merge 1 commit into ionos-cloud:main from raul-cq:fix/proxmoxmachine-owner-machine-notfound-deadlock

Conversation


raul-cq commented Apr 26, 2026

What this fixes

The ProxmoxMachine reconciler currently deadlocks any ProxmoxMachine whose owner Machine has been removed before the InfraMachine was finalized. Reproduction: take any cluster, remove a Machine out-of-band (manual finalizer strip, cluster.x-k8s.io/force-delete-machine annotation, or any other actor) before its ProxmoxMachine finishes finalizing. The ProxmoxMachine then sits in Terminating forever and blocks the parent ProxmoxCluster and Cluster from finalizing.

Root cause

internal/controller/proxmoxmachine_controller.go#L97-L105:

machine, err := util.GetOwnerMachine(ctx, r.Client, proxmoxMachine.ObjectMeta)
if err != nil {
    return ctrl.Result{}, err
}

util.GetOwnerMachine returns nil, nil only when there is no Machine OwnerRef at all. When the OwnerRef is present but the Machine object is gone, it calls client.Get and returns apierrors.NotFound. The reconciler returns that error verbatim, controller-runtime backs off exponentially, and the delete branch (if !proxmoxMachine.ObjectMeta.DeletionTimestamp.IsZero()) is never reached. The MachineFinalizer never comes off.
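
For reference, a paraphrased sketch of the two return paths (the real implementation lives in sigs.k8s.io/cluster-api/util; names and the OwnerRef matching here are simplified, not the verbatim upstream source):

// Paraphrased sketch of util.GetOwnerMachine; see sigs.k8s.io/cluster-api/util
// for the actual implementation.
func getOwnerMachine(ctx context.Context, c client.Client, obj metav1.ObjectMeta) (*clusterv1.Machine, error) {
    for _, ref := range obj.OwnerReferences {
        if ref.Kind != "Machine" || !strings.HasPrefix(ref.APIVersion, "cluster.x-k8s.io/") {
            continue
        }
        machine := &clusterv1.Machine{}
        key := client.ObjectKey{Namespace: obj.Namespace, Name: ref.Name}
        // OwnerRef present but Machine object gone: client.Get returns an
        // apierrors.NotFound error, which the reconciler loops on.
        if err := c.Get(ctx, key, machine); err != nil {
            return nil, err
        }
        return machine, nil
    }
    // Only reached when there is no Machine OwnerRef at all.
    return nil, nil
}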

The log signature observed in production:

"Reconciler error" err="Machine.cluster.x-k8s.io \"<name>\" not found"

repeating every backoff cycle for hours.

This is a known class of issue: docs/Troubleshooting.md under "Machine deletion deadlock" already documents the manual workaround (strip the proxmoxmachine finalizer by hand). This PR addresses the underlying controller behaviour so the manual workaround is no longer needed.

What this PR does

In Reconcile, when GetOwnerMachine returns IsNotFound:

  • If the ProxmoxMachine has a DeletionTimestamp: remove the MachineFinalizer and return cleanly. A log line tells the operator to verify the underlying Proxmox VM, since a full machineScope cannot be constructed without the owner Machine and vmservice.DeleteVM requires the scope. This is the conservative choice: unblock the controller-level deadlock without taking destructive action against the hypervisor under partial state.
  • If the ProxmoxMachine is not being deleted: log and return nil instead of error-looping. Either the OwnerRef will be corrected (which triggers another reconcile via the watch) or the ProxmoxMachine will be deleted (which triggers another reconcile via the delete event).

All other failure modes still propagate as before.
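
For concreteness, a minimal sketch of the shape of the new branch; identifiers like patchHelper, logger, and infrav1.MachineFinalizer are illustrative of the approach, not a verbatim copy of the diff:

machine, err := util.GetOwnerMachine(ctx, r.Client, proxmoxMachine.ObjectMeta)
if apierrors.IsNotFound(err) {
    if !proxmoxMachine.ObjectMeta.DeletionTimestamp.IsZero() {
        // Owner Machine is gone and we are terminating: drop the finalizer
        // so the API server can reap the object. The underlying Proxmox VM
        // may still exist, so tell the operator to verify it.
        logger.Info("owner Machine not found during deletion; removing finalizer, verify the underlying Proxmox VM")
        ctrlutil.RemoveFinalizer(proxmoxMachine, infrav1.MachineFinalizer)
        return ctrl.Result{}, patchHelper.Patch(ctx, proxmoxMachine)
    }
    // Not being deleted: return nil instead of error-looping. A corrected
    // OwnerRef or a delete event will trigger the next reconcile.
    logger.Info("owner Machine not found; waiting for OwnerRef correction or deletion")
    return ctrl.Result{}, nil
}
if err != nil {
    return ctrl.Result{}, err
}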

Tests

Adds an envtest in proxmoxmachine_controller_test.go covering the deletion-with-missing-owner case: the test creates a ProxmoxMachine with a Machine OwnerRef pointing to a non-existent Machine and the MachineFinalizer set, deletes it (the API server records a deletionTimestamp since the finalizer is held), runs Reconcile, and asserts the object is reaped (i.e. the finalizer was removed).

$ make test WHAT=./internal/controller/...
ok  github.com/ionos-cloud/cluster-api-provider-proxmox/internal/controller  30.534s  coverage: 50.6% of statements
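
A condensed sketch of the test's shape (Ginkgo/Gomega style; k8sClient, reconciler, and the elided spec fields are placeholders for the suite's actual scaffolding):

It("removes the finalizer when the owner Machine is gone", func() {
    pm := &infrav1.ProxmoxMachine{
        ObjectMeta: metav1.ObjectMeta{
            Name:       "orphaned",
            Namespace:  "default",
            Finalizers: []string{infrav1.MachineFinalizer},
            OwnerReferences: []metav1.OwnerReference{{
                APIVersion: clusterv1.GroupVersion.String(),
                Kind:       "Machine",
                Name:       "does-not-exist", // no such Machine object
                UID:        types.UID("0000-fake"),
            }},
        },
        // Spec fields elided; the real test fills in whatever the CRD requires.
    }
    Expect(k8sClient.Create(ctx, pm)).To(Succeed())
    // The finalizer holds the object, so Delete only sets deletionTimestamp.
    Expect(k8sClient.Delete(ctx, pm)).To(Succeed())

    _, err := reconciler.Reconcile(ctx, ctrl.Request{NamespacedName: client.ObjectKeyFromObject(pm)})
    Expect(err).NotTo(HaveOccurred())

    // With the finalizer removed, the API server reaps the object.
    Eventually(func() bool {
        return apierrors.IsNotFound(k8sClient.Get(ctx, client.ObjectKeyFromObject(pm), &infrav1.ProxmoxMachine{}))
    }).Should(BeTrue())
})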

Compatibility

  • No CRD changes.
  • No public API changes.
  • The IONOS Troubleshooting doc workaround still works as a runtime escape hatch for clusters that were stuck before this fix is deployed.

Why I'm sending this

We hit this in production (CAPMOX v0.8.1) when a workload cluster's apiserver became unreachable mid-delete and a Machine ended up reaped before its InfraMachine finalized — the exact deadlock above. Walking the diagnosis backward led here. Happy to iterate on the patch shape, the log wording, or the test scaffolding.

Commit message:

fix(proxmoxmachine): tolerate owner Machine NotFound to unblock stuck deletes

The ProxmoxMachine reconciler fetches the owner Machine via
util.GetOwnerMachine and returns the error verbatim if the Machine is gone:

    machine, err := util.GetOwnerMachine(ctx, r.Client, proxmoxMachine.ObjectMeta)
    if err != nil {
        return ctrl.Result{}, err
    }

GetOwnerMachine returns nil, nil only when no Machine OwnerRef exists at all.
When the OwnerRef is present but the Machine object has been removed, it
calls client.Get and returns the apierrors.NotFound. The reconciler returns
that error, controller-runtime backs off exponentially, and the delete
branch lower in Reconcile (gated by DeletionTimestamp.IsZero) is never
reached — so the MachineFinalizer is never removed and the ProxmoxMachine
sits in Terminating forever. The parent ProxmoxCluster's
"waiting for machines to be deleted" loop never progresses, and the Cluster
itself can't finalize.

This deadlock is reachable any time the owner Machine is removed before
the InfraMachine is finalized: a manual finalizer strip, the
`cluster.x-k8s.io/force-delete-machine` annotation, or any other actor
reaping the Machine out-of-band. The known IONOS troubleshooting workaround
(docs/Troubleshooting.md "Machine deletion deadlock") is to manually strip
the ProxmoxMachine finalizer — i.e. work around this exact branch.

Fix:
- If GetOwnerMachine returns IsNotFound and the ProxmoxMachine has a
  DeletionTimestamp, remove the MachineFinalizer and return cleanly so the
  API server can reap the object. The underlying Proxmox VM may still
  exist; the log line tells operators to verify, since we cannot construct
  a full machineScope to call vmservice.DeleteVM without the owner Machine.
- If GetOwnerMachine returns IsNotFound and the ProxmoxMachine is not being
  deleted, log and return nil instead of error-looping. Either the OwnerRef
  will be corrected (re-reconcile via watch) or the ProxmoxMachine will be
  deleted (re-reconcile via the delete event).

Includes an envtest covering the deletion-with-missing-owner case.
Member

@mcbenjemaa left a comment


I wonder how you reproduced this?
The InfraMachines will be deleted as long as the parent Machine is being deleted, which is why it has an ownerReference.
In fact, the CAPI machine controller waits for the InfraMachine to be deleted first, then cleans up the finalizer and deletes the Machine.

Do you have something specific that makes the parent Machine go away while the InfraMachine remains?

@raul-cq
Author

raul-cq commented Apr 27, 2026

Hi there... this seems to be a corner case. When creating a cluster, one worker VM got stuck in cloud-init, so I destroyed the VM in Proxmox and then proceeded to delete the cluster. That's how I hit this behaviour.

Comment on lines +100 to +111
// The owner Machine OwnerRef is set but the Machine object itself is
// gone. This is the stuck-finalizer deadlock: without this branch the
// reconciler returns the NotFound error to controller-runtime, which
// requeues with exponential backoff forever, never reaching
// reconcileDelete and never removing our finalizer. The ProxmoxMachine
// then blocks the parent ProxmoxCluster (waiting for machines) and the
// Cluster from finalizing.
//
// We can hit this whenever the owner Machine is removed before the
// InfraMachine is finalized: a manual finalizer strip, the
// `cluster.x-k8s.io/force-delete-machine` annotation, or any other
// actor that out-of-band reaps the Machine.
Collaborator

This comment (that looks AI-generated btw, please attribute it) is excessive. Convert the comments in your PR to one-liners and instead elaborate in the commit message.

@wikkyk added the deletion (Deletion bugs) label May 5, 2026