
fix(proxmoxmachine): tolerate owner Machine NotFound to unblock stuck deletes #734

Open

raul-cq wants to merge 1 commit into ionos-cloud:main from raul-cq:fix/proxmoxmachine-owner-machine-notfound-deadlock

Conversation


raul-cq commented Apr 26, 2026

What this fixes

The ProxmoxMachine reconciler currently deadlocks any ProxmoxMachine whose owner Machine has been removed before the InfraMachine was finalized. Reproduction: take any cluster, remove a Machine out-of-band (manual finalizer strip, cluster.x-k8s.io/force-delete-machine annotation, or any other actor) before its ProxmoxMachine finishes finalizing. The ProxmoxMachine then sits in Terminating forever and blocks the parent ProxmoxCluster and Cluster from finalizing.

Root cause

internal/controller/proxmoxmachine_controller.go#L97-L105:

machine, err := util.GetOwnerMachine(ctx, r.Client, proxmoxMachine.ObjectMeta)
if err != nil {
    return ctrl.Result{}, err
}

util.GetOwnerMachine returns nil, nil only when there is no Machine OwnerRef at all. When the OwnerRef is present but the Machine object is gone, it calls client.Get and returns apierrors.NotFound. The reconciler returns that error verbatim, controller-runtime backs off exponentially, and the delete branch (if !proxmoxMachine.ObjectMeta.DeletionTimestamp.IsZero()) is never reached. The MachineFinalizer never comes off.
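
For reference, a paraphrased sketch of the two return paths (the real implementation lives in sigs.k8s.io/cluster-api/util; names and the OwnerRef matching here are simplified, not the verbatim upstream source):

// Paraphrased sketch of util.GetOwnerMachine; see sigs.k8s.io/cluster-api/util
// for the actual implementation.
func getOwnerMachine(ctx context.Context, c client.Client, obj metav1.ObjectMeta) (*clusterv1.Machine, error) {
    for _, ref := range obj.OwnerReferences {
        if ref.Kind != "Machine" || !strings.HasPrefix(ref.APIVersion, "cluster.x-k8s.io/") {
            continue
        }
        machine := &clusterv1.Machine{}
        key := client.ObjectKey{Namespace: obj.Namespace, Name: ref.Name}
        // OwnerRef present but Machine object gone: client.Get returns an
        // apierrors.NotFound error, which the reconciler loops on.
        if err := c.Get(ctx, key, machine); err != nil {
            return nil, err
        }
        return machine, nil
    }
    // Only reached when there is no Machine OwnerRef at all.
    return nil, nil
}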

The log signature observed in production:

"Reconciler error" err="Machine.cluster.x-k8s.io \"<name>\" not found"

repeating every backoff cycle for hours.

This is a known class of issue: docs/Troubleshooting.md under "Machine deletion deadlock" already documents the manual workaround (strip the proxmoxmachine finalizer by hand). This PR addresses the underlying controller behaviour so the manual workaround is no longer needed.

What this PR does

In Reconcile, when GetOwnerMachine returns IsNotFound:

  • If the ProxmoxMachine has a DeletionTimestamp: remove the MachineFinalizer and return cleanly. A log line tells the operator to verify the underlying Proxmox VM, since a full machineScope cannot be constructed without the owner Machine and vmservice.DeleteVM requires the scope. This is the conservative choice: unblock the controller-level deadlock without taking destructive action against the hypervisor under partial state.
  • If the ProxmoxMachine is not being deleted: log and return nil instead of error-looping. Either the OwnerRef will be corrected (which triggers another reconcile via the watch) or the ProxmoxMachine will be deleted (which triggers another reconcile via the delete event).

All other failure modes still propagate as before.
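
For concreteness, a minimal sketch of the shape of the new branch; identifiers like patchHelper, logger, and infrav1.MachineFinalizer are illustrative of the approach, not a verbatim copy of the diff:

machine, err := util.GetOwnerMachine(ctx, r.Client, proxmoxMachine.ObjectMeta)
if apierrors.IsNotFound(err) {
    if !proxmoxMachine.ObjectMeta.DeletionTimestamp.IsZero() {
        // Owner Machine is gone and we are terminating: drop the finalizer
        // so the API server can reap the object. The underlying Proxmox VM
        // may still exist, so tell the operator to verify it.
        logger.Info("owner Machine not found during deletion; removing finalizer, verify the underlying Proxmox VM")
        ctrlutil.RemoveFinalizer(proxmoxMachine, infrav1.MachineFinalizer)
        return ctrl.Result{}, patchHelper.Patch(ctx, proxmoxMachine)
    }
    // Not being deleted: return nil instead of error-looping. A corrected
    // OwnerRef or a delete event will trigger the next reconcile.
    logger.Info("owner Machine not found; waiting for OwnerRef correction or deletion")
    return ctrl.Result{}, nil
}
if err != nil {
    return ctrl.Result{}, err
}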

Tests

Adds an envtest in proxmoxmachine_controller_test.go covering the deletion-with-missing-owner case: the test creates a ProxmoxMachine with a Machine OwnerRef pointing to a non-existent Machine and the MachineFinalizer set, deletes it (the API server records a deletionTimestamp since the finalizer is held), runs Reconcile, and asserts the object is reaped (i.e. the finalizer was removed).

$ make test WHAT=./internal/controller/...
ok  github.com/ionos-cloud/cluster-api-provider-proxmox/internal/controller  30.534s  coverage: 50.6% of statements
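
A condensed sketch of the test's shape (Ginkgo/Gomega style; k8sClient, reconciler, and the elided spec fields are placeholders for the suite's actual scaffolding):

It("removes the finalizer when the owner Machine is gone", func() {
    pm := &infrav1.ProxmoxMachine{
        ObjectMeta: metav1.ObjectMeta{
            Name:       "orphaned",
            Namespace:  "default",
            Finalizers: []string{infrav1.MachineFinalizer},
            OwnerReferences: []metav1.OwnerReference{{
                APIVersion: clusterv1.GroupVersion.String(),
                Kind:       "Machine",
                Name:       "does-not-exist", // no such Machine object
                UID:        types.UID("0000-fake"),
            }},
        },
        // Spec fields elided; the real test fills in whatever the CRD requires.
    }
    Expect(k8sClient.Create(ctx, pm)).To(Succeed())
    // The finalizer holds the object, so Delete only sets deletionTimestamp.
    Expect(k8sClient.Delete(ctx, pm)).To(Succeed())

    _, err := reconciler.Reconcile(ctx, ctrl.Request{NamespacedName: client.ObjectKeyFromObject(pm)})
    Expect(err).NotTo(HaveOccurred())

    // With the finalizer removed, the API server reaps the object.
    Eventually(func() bool {
        return apierrors.IsNotFound(k8sClient.Get(ctx, client.ObjectKeyFromObject(pm), &infrav1.ProxmoxMachine{}))
    }).Should(BeTrue())
})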

Compatibility

  • No CRD changes.
  • No public API changes.
  • The IONOS Troubleshooting doc workaround still works as a runtime escape hatch for clusters that were stuck before this fix is deployed.

Why I'm sending this

We hit this in production (CAPMOX v0.8.1) when a workload cluster's apiserver became unreachable mid-delete and a Machine ended up reaped before its InfraMachine finalized — the exact deadlock above. Walking the diagnosis backward led here. Happy to iterate on the patch shape, the log wording, or the test scaffolding.

Commit message:

fix(proxmoxmachine): tolerate owner Machine NotFound to unblock stuck deletes

The ProxmoxMachine reconciler fetches the owner Machine via
util.GetOwnerMachine and returns the error verbatim if the Machine is gone:

    machine, err := util.GetOwnerMachine(ctx, r.Client, proxmoxMachine.ObjectMeta)
    if err != nil {
        return ctrl.Result{}, err
    }

GetOwnerMachine returns nil, nil only when no Machine OwnerRef exists at all.
When the OwnerRef is present but the Machine object has been removed, it
calls client.Get and returns the apierrors.NotFound. The reconciler returns
that error, controller-runtime backs off exponentially, and the delete
branch lower in Reconcile (gated by DeletionTimestamp.IsZero) is never
reached — so the MachineFinalizer is never removed and the ProxmoxMachine
sits in Terminating forever. The parent ProxmoxCluster's
"waiting for machines to be deleted" loop never progresses, and the Cluster
itself can't finalize.

This deadlock is reachable any time the owner Machine is removed before
the InfraMachine is finalized: a manual finalizer strip, the
`cluster.x-k8s.io/force-delete-machine` annotation, or any other actor
reaping the Machine out-of-band. The known IONOS troubleshooting workaround
(docs/Troubleshooting.md "Machine deletion deadlock") is to manually strip
the ProxmoxMachine finalizer — i.e. work around this exact branch.

Fix:
- If GetOwnerMachine returns IsNotFound and the ProxmoxMachine has a
  DeletionTimestamp, remove the MachineFinalizer and return cleanly so the
  API server can reap the object. The underlying Proxmox VM may still
  exist; the log line tells operators to verify, since we cannot construct
  a full machineScope to call vmservice.DeleteVM without the owner Machine.
- If GetOwnerMachine returns IsNotFound and the ProxmoxMachine is not being
  deleted, log and return nil instead of error-looping. Either the OwnerRef
  will be corrected (re-reconcile via watch) or the ProxmoxMachine will be
  deleted (re-reconcile via the delete event).

Includes an envtest covering the deletion-with-missing-owner case.
Member

@mcbenjemaa left a comment


I wonder how you reproduced this?
The InfraMachines will be deleted as long as the parent Machine is being deleted, which is why it has an ownerReference.
In fact, the CAPI machine controller waits for the InfraMachine to be deleted first, then cleans up the finalizer and deletes the Machine.

Do you have something specific that makes the parent Machine go away while the InfraMachine remains?

@raul-cq
Author

raul-cq commented Apr 27, 2026

Hi there... this seems to be a corner case. When creating a cluster, one worker VM got stuck in cloud-init, so I destroyed the VM in Proxmox and then proceeded to delete the cluster. That's how I hit this behaviour.

Comment on lines +100 to +111
// The owner Machine OwnerRef is set but the Machine object itself is
// gone. This is the stuck-finalizer deadlock: without this branch the
// reconciler returns the NotFound error to controller-runtime, which
// requeues with exponential backoff forever, never reaching
// reconcileDelete and never removing our finalizer. The ProxmoxMachine
// then blocks the parent ProxmoxCluster (waiting for machines) and the
// Cluster from finalizing.
//
// We can hit this whenever the owner Machine is removed before the
// InfraMachine is finalized: a manual finalizer strip, the
// `cluster.x-k8s.io/force-delete-machine` annotation, or any other
// actor that out-of-band reaps the Machine.
Collaborator

This comment (that looks AI-generated btw, please attribute it) is excessive. Convert the comments in your PR to one-liners and instead elaborate in the commit message.

@wikkyk added the deletion (Deletion bugs) label May 5, 2026