
feat: implement CAPI failure domain contract for zones #720

Open

emanuelebosetti wants to merge 2 commits into ionos-cloud:main from ElmecOSS:feat/failure-domains

Conversation

@emanuelebosetti

Issue: #717
Description of changes:

Map existing CAPMOX zones to CAPI failure domains so that KubeadmControlPlane automatically distributes control plane nodes across Proxmox zones.

API changes (additive, backwards-compatible; a rough sketch follows the list):

  • Add Nodes []string to ZoneConfigSpec for per-zone node lists
  • Add ControlPlane *bool to ZoneConfigSpec (defaults to true)
  • Add FailureDomains to ProxmoxClusterStatus (CAPI v1beta2 contract)
  • Add spec.failureDomain to ProxmoxMachineSpec (InfraMachine contract)
  • Add GetZoneNodes() helper on ProxmoxCluster
  • Add FailureDomainNotReady condition reason
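
A rough Go sketch of the additive zone fields from the list above; the JSON tags, the presence of a Name field, and the free-function form of GetZoneNodes are assumptions for illustration (the real helper is a method on ProxmoxCluster):

```go
// Sketch only: Name is assumed to already exist on the type, other fields
// are elided, and the JSON tags are guesses rather than the PR's exact tags.
type ZoneConfigSpec struct {
	// Name identifies the zone (pre-existing field, shown for context).
	Name string `json:"name"`

	// Nodes is the per-zone list of Proxmox node names.
	// +optional
	Nodes []string `json:"nodes,omitempty"`

	// ControlPlane marks the zone as eligible for control plane Machines.
	// Treated as true when unset.
	// +optional
	ControlPlane *bool `json:"controlPlane,omitempty"`
}

// GetZoneNodes returns the node list configured for the named zone, or nil
// if the zone is not configured.
func GetZoneNodes(zones []ZoneConfigSpec, name string) []string {
	for _, z := range zones {
		if z.Name == name {
			return z.Nodes
		}
	}
	return nil
}
```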

Controller changes (the failure domain reconcile is sketched after this list):

  • Add reconcileFailureDomains in cluster controller (sorted by name to prevent reconcile loops)
  • Read Machine.Spec.FailureDomain in machine controller with retryable FailureDomainNotReady condition
  • Use effectiveZone/effectiveAllowedNodes on MachineScope to avoid spec mutation from the controller
  • Pass zone override to GetInClusterPools for IPAM pool selection
  • Set Zone on NodeLocation in createVM
  • Update scheduler to use GetEffectiveAllowedNodes()
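
To show why the sort matters: rebuilding the failure domain list in zone-name order keeps the status identical between reconciles, so writing it back does not churn and re-queue the cluster. This is a self-contained sketch with stand-in types, not the provider's actual signatures:

```go
package main

import (
	"fmt"
	"sort"
)

// Stand-ins for the CAPMOX/CAPI types named above; the real definitions live
// in the provider and cluster-api API packages.
type ZoneConfigSpec struct {
	Name         string
	Nodes        []string
	ControlPlane *bool
}

type FailureDomain struct {
	Name         string
	ControlPlane bool
}

// reconcileFailureDomains rebuilds the failure domain list from the zone
// config. Sorting by zone name keeps the result stable across reconciles,
// so updating the status does not keep re-triggering the controller.
func reconcileFailureDomains(zones []ZoneConfigSpec) []FailureDomain {
	sorted := append([]ZoneConfigSpec(nil), zones...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Name < sorted[j].Name })

	fds := make([]FailureDomain, 0, len(sorted))
	for _, z := range sorted {
		fds = append(fds, FailureDomain{
			Name: z.Name,
			// ControlPlane defaults to true when the zone leaves it unset.
			ControlPlane: z.ControlPlane == nil || *z.ControlPlane,
		})
	}
	return fds
}

func main() {
	workersOnly := false
	fmt.Println(reconcileFailureDomains([]ZoneConfigSpec{
		{Name: "zone-b", Nodes: []string{"pve06", "pve07"}, ControlPlane: &workersOnly},
		{Name: "zone-a", Nodes: []string{"pve01", "pve02"}},
	}))
}
```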

Conversion (the restore path is sketched after this list):

  • Restore Status.FailureDomains and Spec.FailureDomain in v1alpha1 conversion webhook
  • Add manual conversion stub for ProxmoxMachineSpec.FailureDomain
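
The restore follows the usual CAPI stash-and-restore conversion pattern: the hub object is stashed in an annotation on the spoke, and hub-only fields are re-applied on the way back so the hub→spoke→hub roundtrip does not drop them. A hedged sketch; the generated converter name and helper wiring are assumptions, not verbatim CAPMOX code:

```go
// Assumed imports: sigs.k8s.io/controller-runtime/pkg/conversion (conversion.Hub)
// and sigs.k8s.io/cluster-api/util/conversion (utilconversion).

// ConvertTo converts the v1alpha1 spoke to the v1alpha2 hub.
func (src *ProxmoxMachine) ConvertTo(dstRaw conversion.Hub) error {
	dst := dstRaw.(*v1alpha2.ProxmoxMachine)
	if err := Convert_v1alpha1_ProxmoxMachine_To_v1alpha2_ProxmoxMachine(src, dst, nil); err != nil {
		return err
	}

	// Re-apply hub-only fields stashed by ConvertFrom.
	restored := &v1alpha2.ProxmoxMachine{}
	if ok, err := utilconversion.UnmarshalData(src, restored); err != nil || !ok {
		return err
	}
	restoreProxmoxMachineSpec(&restored.Spec, &dst.Spec)
	return nil
}

// restoreProxmoxMachineSpec carries v1alpha2-only spec fields forward, such
// as the new FailureDomain field.
func restoreProxmoxMachineSpec(restored, dst *v1alpha2.ProxmoxMachineSpec) {
	dst.FailureDomain = restored.FailureDomain
}
```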

Testing performed:

  • Unit tests for API types, scheduler, VM service helpers, and machine scope
  • Validated on a real Proxmox cluster (10 nodes, 2 zones) with 3 CP + 2 worker nodes
  • KCP correctly distributes CP replicas round-robin across failure domains
  • Workers pinned to target zone, IPAM pools selected per-zone
  • make lint test verify passed
  • Full backwards compatibility verified — clusters without zoneConfig work as before

@wikkyk
Collaborator

wikkyk commented Apr 13, 2026

Thanks!
Please provide attribution if you used AI/coding agent(s) to develop your PR.

Map existing CAPMOX zones to CAPI failure domains so that
KubeadmControlPlane automatically distributes control plane
nodes across Proxmox zones.

API changes:
- Add Nodes []string to ZoneConfigSpec for per-zone node lists
- Add ControlPlane *bool to ZoneConfigSpec (defaults to true)
- Add FailureDomains to ProxmoxClusterStatus (CAPI v1beta2 contract)
- Add spec.failureDomain to ProxmoxMachineSpec (InfraMachine contract)
- Add GetZoneNodes() helper on ProxmoxCluster
- Add FailureDomainNotReady condition reason

Controller changes:
- Add reconcileFailureDomains in cluster controller (sorted by name)
- Read Machine.Spec.FailureDomain in machine controller with
  retryable FailureDomainNotReady condition
- Use effectiveZone/effectiveAllowedNodes on MachineScope to avoid
  spec mutation from the controller
- Pass zone override to GetInClusterPools for IPAM pool selection
- Set Zone on NodeLocation in createVM
- Update scheduler to use GetEffectiveAllowedNodes()

Conversion:
- Restore Status.FailureDomains and Spec.FailureDomain in v1alpha1
  conversion webhook
- Add manual conversion stub for ProxmoxMachineSpec.FailureDomain

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@emanuelebosetti
Author

I've amended the commit to include the Co-Authored-By trailer for the AI coding agent used during development.
Let me know if there are any additional attribution policies or guidelines I should follow for AI-assisted contributions to this project.

@wikkyk
Collaborator

wikkyk commented Apr 15, 2026

Thanks. No, we don't have any other policies. The field moves fast and we just didn't get to writing things down yet.

@emanuelebosetti
Author

> Thanks. No, we don't have any other policies. The field moves fast and we just didn't get to writing things down yet.

We're going through the same process internally — introducing AI across dev teams and figuring out contribution policies.

Three references we found particularly useful:

Could be useful input for #716 as well.

wikkyk mentioned this pull request Apr 16, 2026
@emanuelebosetti
Author

Hi @wikkyk, do you think it would be possible to include these failure domain changes in the upcoming v0.9.0 release? Happy to address any remaining feedback if needed. Thanks!

@wikkyk
Collaborator

wikkyk commented Apr 29, 2026

Sure. v0.9.0 is still fairly far off, though. This only adds a field, so it could reasonably be released in the 0.8 series.

However, as the linter points out, the FailureDomain field has the wrong type: it's optional, so it should be a pointer. You'll need to re-check your pointer handling.
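
For reference, a hedged sketch of what kube-api-linter is asking for here; the doc comment and JSON tag are illustrative, not copied from the PR:

```go
// ProxmoxMachineSpec fragment: an optional field with no validation must be
// a pointer so that "unset" and "empty string" stay distinguishable.
type ProxmoxMachineSpec struct {
	// FailureDomain is the failure domain (zone) this machine should be
	// placed in, per the CAPI InfraMachine contract.
	// +optional
	FailureDomain *string `json:"failureDomain,omitempty"`
}
```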

Address review feedback (ionos-cloud#720): the field is optional with no
validation, so kube-api-linter requires a pointer. Switch
ProxmoxMachineSpec.FailureDomain to *string in v1alpha2 and regen
deepcopy.

Move the v1alpha2-only restore of FailureDomain into
restoreProxmoxMachineSpec so both ProxmoxMachine and
ProxmoxMachineTemplate ConvertTo paths preserve the field across the
hub→spoke→hub roundtrip. The pointer change shifted the symptom from
"" to nil and exposed the pre-existing missing restore on the Template
path (TestFuzzyConversion/for_ProxmoxMachineTemplate/hub-spoke-hub).

Validated end-to-end on a 10-node Proxmox test cluster: KCP rolled
three fresh control plane Machines and both MachineDeployments rolled
their workers through the failureDomain path
(Machine.Spec.FailureDomain → ApplyFailureDomainNodes → IPAM zone
scope → createVM) without errors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@emanuelebosetti
Author

@wikkyk pushed the pointer fix as a separate commit. Could you re-trigger the workflows? Thanks!

