ServerCreateFailedIrrecoverableErrorReason --> Remediation is halted. #1969

@guettli

Currently (v1.1.x), when ServerCreateFailedIrrecoverableErrorReason is set, the hcloud remediation controller simply stops reconciling:

	// Skip remediation for machines that failed to create with irrecoverable errors (e.g. invalid_input, resource_unavailable).
	// These errors cannot be fixed by rebooting or replacing the machine.
	// We return without error so the MHC does not keep retrying remediation.
	if conditions.IsFalse(hcloudMachine, infrav1.ServerCreateSucceededCondition) &&
		conditions.GetReason(hcloudMachine, infrav1.ServerCreateSucceededCondition) == infrav1.ServerCreateFailedIrrecoverableErrorReason {
		log.Info("Skipping remediation for machine with irrecoverable creation failure",
			"reason", conditions.GetMessage(hcloudMachine, infrav1.ServerCreateSucceededCondition),
		)

		// signal remediation done.
		return reconcile.Result{}, nil
	}

As a result, the HCloudRemediation resource does not even have a status (no Conditions):

apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: HCloudRemediation
metadata:
  annotations:
    cluster.x-k8s.io/cloned-from-groupkind: HCloudRemediationTemplate.infrastructure.cluster.x-k8s.io
    cluster.x-k8s.io/cloned-from-name: hetzner-apalla-1-35-v0-sha.ow82ztk-remediation-request
  creationTimestamp: "2026-04-15T15:17:37Z"
  generation: 1
  labels:
    cluster.x-k8s.io/cluster-name: tcs-guettli-tm9-1-35-v0-sha-ow82ztk
  name: tcs-guettli-tm9-1-35-v0-sha-ow82ztk-md-arm-r6f99-zccjj-6vn8l
  namespace: org-testing
  ownerReferences:
  - apiVersion: cluster.x-k8s.io/v1beta1
    kind: Machine
    name: tcs-guettli-tm9-1-35-v0-sha-ow82ztk-md-arm-r6f99-zccjj-6vn8l
    uid: a5f9ea94-1ca6-4eb3-ab48-2a84df0d217f
  resourceVersion: "2803968"
  uid: 0671de0c-2cc3-44e7-ab25-e2ca8857595e
spec:
  strategy:
    retryLimit: 1
    timeout: 3m0s
    type: Reboot

This is intentional: we don't want an endless loop if an HCloudMachine uses an invalid server-type/location combination.

In the current case, cax31 was unavailable for some time, but it should be available again now.

We need to communicate this state better, so that users can see why remediation did not happen.

Desired solution:

Create a Condition on the HCloudMachine with an appropriate error message, and create a Condition on the HCloudRemediation that explains why remediation was skipped.
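
A minimal sketch of the idea, not the actual controller code. It uses a simplified, hypothetical `Condition` struct instead of the real Cluster API condition helpers so that it compiles on its own. The string values of the two existing constants are assumed, and `RemediationSucceededCondition`, `RemediationSkippedReason`, and `markRemediationSkipped` are hypothetical names chosen for illustration:

```go
package main

// Condition is a simplified, hypothetical stand-in for a Cluster API condition.
type Condition struct {
	Type    string
	Status  string // "True", "False" or "Unknown"
	Reason  string
	Message string
}

const (
	// Assumed string values for the identifiers quoted in this issue.
	ServerCreateSucceededCondition             = "ServerCreateSucceeded"
	ServerCreateFailedIrrecoverableErrorReason = "ServerCreateFailedIrrecoverableError"

	// Hypothetical names for the proposed condition on the remediation object.
	RemediationSucceededCondition = "RemediationSucceeded"
	RemediationSkippedReason      = "SkippedIrrecoverableCreateFailure"
)

// findCondition returns a pointer to the condition of the given type, or nil.
func findCondition(conds []Condition, condType string) *Condition {
	for i := range conds {
		if conds[i].Type == condType {
			return &conds[i]
		}
	}
	return nil
}

// setCondition overwrites the condition of the same type or appends a new one.
func setCondition(conds []Condition, c Condition) []Condition {
	if existing := findCondition(conds, c.Type); existing != nil {
		*existing = c
		return conds
	}
	return append(conds, c)
}

// markRemediationSkipped records on the remediation why nothing was done.
// It reports true when the machine failed to create with an irrecoverable error.
func markRemediationSkipped(machineConds, remediationConds []Condition) ([]Condition, bool) {
	created := findCondition(machineConds, ServerCreateSucceededCondition)
	if created == nil || created.Status != "False" ||
		created.Reason != ServerCreateFailedIrrecoverableErrorReason {
		return remediationConds, false
	}
	return setCondition(remediationConds, Condition{
		Type:    RemediationSucceededCondition,
		Status:  "False",
		Reason:  RemediationSkippedReason,
		Message: "remediation skipped, server creation failed irrecoverably: " + created.Message,
	}), true
}
```

In the real controller, the early-return branch shown above would presumably set such a condition and patch the HCloudRemediation status before returning, so that the object shows why remediation was skipped instead of having an empty status.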
