Agent Forge Operator bridges hosted-cluster autoscaling to VM capacity for HyperShift Agent platform clusters on vSphere.
HyperShift and CAPI remain the source of truth. Agent Forge watches the
AgentMachine objects rendered for a HyperShift NodePool; when they report
Ready=False with Reason=NoSuitableAgents, it creates vSphere VMs so matching
Assisted Installer Agent objects can appear.
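That waiting state is visible directly on the AgentMachines. A hedged example, assuming the hosted control plane namespace demo-demo used in the sample CR later in this README:

```bash
# Print each AgentMachine together with the reason on its Ready condition.
kubectl get agentmachines -n demo-demo \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].reason}{"\n"}{end}'
```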
- Watches a namespace-scoped VsphereAgentPool custom resource.
- Watches CAPI AgentMachine demand for one HyperShift NodePool.
- Creates capacity only for AgentMachine objects waiting on NoSuitableAgents.
- Creates vSphere VMs from an InfraEnv discovery ISO when more Agents are needed.
- Optionally approves matching Assisted Installer Agent objects.
- Deletes only owned VMs and stale unbound Agents when scale-down is allowed.
- Supports dry-run mode, status conditions, planned actions, and Kubernetes Events for operational visibility (see the example below).
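The quickest way to see that visibility in practice is to read the CR itself. A hedged example, using the placeholder names from the sample CR further down:

```bash
# Conditions, planned actions, and emitted Events all show up in describe output.
kubectl describe vsphereagentpool demo-worker -n demo
```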
Agent Forge is designed for OpenShift environments that use:
- HyperShift hosted clusters on the Agent platform.
- Assisted Installer InfraEnv and Agent resources.
- CAPI AgentMachine and Machine resources rendered for hosted cluster NodePools.
- vSphere as the VM provider.
It does not replace the hosted cluster autoscaler and does not scale NodePools directly. It reacts to AgentMachine demand that already exists.
flowchart TD
autoscaler["Hosted cluster autoscaler"] --> nodepool["HyperShift NodePool"]
nodepool --> agentmachine["CAPI AgentMachines\nReady=False / NoSuitableAgents"]
nodepool --> machine["CAPI Machines"]
pool["VsphereAgentPool\nnamespace-scoped config"] --> reconcile["Agent Forge reconcile"]
agentmachine --> reconcile
machine --> reconcile
infraenv["InfraEnv\nISO URL and Agent ownership"] --> reconcile
agents["Assisted Installer Agents"] --> reconcile
reconcile --> discover["Resolve inputs\nAgentMachines, Machines, InfraEnv, matching Agents, owned/adopted VMs"]
discover --> refresh["Refresh ownedVMs status\nProvisioning, Available, Bound, Released"]
refresh --> plan["Build plan\nAgentMachine count, waiting demand, buffer, available Agents, provisioning VMs"]
plan -->|deficit| iso["Ensure content-addressed ISO cache\nDownload, hash, upload if changed"]
iso --> create["Create vSphere VM\nfolder, datastore cluster, network, CPU, memory, disk, ISO"]
create --> boot["VM boots InfraEnv ISO"]
boot --> newagent["New Agent appears"]
newagent --> agents
plan -->|matching unbound Agent needs prep| prep["Patch Agent\nlabels, role, approved, hostname"]
prep --> bind["Agent CAPI provider binds Agent to Machine"]
bind --> agents
plan -->|scale-down| deleting{"Was the CAPI Machine\nobserved deleting and then gone?"}
deleting -->|no| wait["Wait\nno VM or Agent deletion"]
deleting -->|yes| deletevm["Delete paired vSphere VM\nprefer BIOS UUID, fallback inventory path"]
deletevm --> deleteagent{"VM gone or already missing?"}
deleteagent -->|yes| removeagent["Delete paired stale Agent"]
deleteagent -->|no| retry["Record condition/event\nretry later"]
removeagent --> agents
plan -->|surplus available owned Agent| surplus{"Any waiting or unready\nAgentMachines?"}
surplus -->|yes| wait
surplus -->|no| deletefree["Delete extra unbound VM and Agent\nrespect bufferAgents"]
deletefree --> agents
plan -->|no changes| noop["Noop\nstatus and conditions only"]
wait --> reconcile
retry --> reconcile
noop --> reconcile
The reconciliation loop is driven by watches on VsphereAgentPool,
AgentMachine, Machine, and matching Agent objects. That means new
NoSuitableAgents demand, Machine deletion, and newly discovered Agents are
handled immediately rather than waiting for the next periodic requeue.
Scale-up is demand driven. The controller counts AgentMachine objects for the
NodePool that report Ready=False with Reason=NoSuitableAgents, adds
spec.scaling.bufferAgents, subtracts available matching Agents and
already-provisioning owned VMs, then creates at most
spec.scaling.maxProvisioning VMs per reconcile. Each VM boots the active
InfraEnv ISO. When the Agent appears, the controller applies the configured
labels, role, approval, and hostname so the Agent CAPI provider can bind it to a
Machine.
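A worked illustration of that arithmetic with made-up numbers (a sketch of the calculation described above, not the controller's code):

```bash
# 3 AgentMachines waiting on NoSuitableAgents, bufferAgents=0, 1 matching Agent
# already available, 1 owned VM still provisioning, maxProvisioning=3.
waiting=3; buffer=0; available=1; provisioning=1; max_provisioning=3
deficit=$(( waiting + buffer - available - provisioning ))
(( deficit < 0 )) && deficit=0
(( deficit > max_provisioning )) && deficit=$max_provisioning
echo "VMs to create this reconcile: $deficit"   # -> 1
```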
Existing clusters are adopted through Agents. If a matching Agent already
exists, the controller records it in status.ownedVMs using the Agent hostname,
MAC address, and VMware BIOS UUID from inventory. Bound Agents are marked
Bound; unbound prepared Agents are marked Available; Agents that were
previously bound and then returned by CAPI are marked Released.
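The resulting bookkeeping can be inspected on the CR. A hedged example with placeholder names:

```bash
# status.ownedVMs lists the adopted and created VMs with their current state.
kubectl get vsphereagentpool demo-worker -n demo -o jsonpath='{.status.ownedVMs}{"\n"}'
```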
Scale-down is deliberately conservative. The controller does not choose random
VMs. It first observes a paired CAPI Machine entering deletion, waits until
that Machine has disappeared, and only then deletes the paired VM and stale
Agent. VM deletion prefers the VMware BIOS UUID and falls back to the full
inventory path. A missing VM is treated as already deleted so the stale Agent can
still be removed.
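To watch that sequence as it happens, a hedged example (the demo-demo namespace matches the sample CR below):

```bash
# Scale-down only begins after the paired CAPI Machine is seen deleting and then gone.
kubectl get machines.cluster.x-k8s.io -n demo-demo -w
```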
Install the latest release manifests:
kubectl apply -f https://github.com/containeroo/agent-forge-operator/releases/download/v0.0.9/crds.yaml
kubectl apply -k github.com/containeroo/agent-forge-operator//config/default?ref=v0.0.9

The published manager images are:
ghcr.io/containeroo/agent-forge-operator:v0.0.9
containeroo/agent-forge-operator:v0.0.9
For a local image build:
make docker-build docker-push IMG=<registry>/agent-forge-operator:<tag>
make deploy IMG=<registry>/agent-forge-operator:<tag>

Start with docs/getting-started.md. It covers the
required cluster objects, vSphere Secret, a dry-run VsphereAgentPool, status
inspection, and switching from dry-run to active reconciliation.
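When the dry-run output looks right, the switch described there amounts to flipping spec.dryRun. A hedged example with placeholder names:

```bash
# Turn off dry-run so the controller starts creating and deleting real resources.
kubectl patch vsphereagentpool demo-worker -n demo --type merge -p '{"spec":{"dryRun":false}}'
```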
For the complete CRD field contract and status model, see docs/vsphereagentpool-crd.md.
apiVersion: agentforge.containeroo.ch/v1alpha1
kind: VsphereAgentPool
metadata:
  name: demo-worker
  namespace: demo
spec:
  dryRun: true
  hostedClusterRef:
    name: demo
  nodePoolRef:
    name: demo-worker
  infraEnvRef:
    name: demo
  controlPlaneNamespace: demo-demo
  vsphere:
    credentialsSecretRef:
      name: vsphere-credentials
    datacenter: dc1
    datastoreCluster: workload-datastore-cluster
    isoDatastore: iso-datastore
    resourcePool: cluster/Resources
    folder: demo
    network: VM Network
    template:
      namePrefix: demo-worker
      numCPUs: 4
      memoryMiB: 16384
      diskGiB: 100
  agent:
    role: worker
    approve: true
    labels:
      agentclusterinstalls.extensions.hive.openshift.io/location: lab-a
      customer: example
      hypershift.openshift.io/nodepool-role: worker
  scaling:
    bufferAgents: 0
    maxProvisioning: 3
    deletePolicy: OwnedOnly
  iso:
    checkInterval: 10m
    retainVersions: 2
    pathPrefix: agent-forge/demo/demo-worker

The controller caches the InfraEnv discovery ISO by content digest. It downloads
and hashes the ISO at spec.iso.checkInterval, uploads a new <sha256>.iso
object only when the bytes changed or the datastore object is missing, and
inserts the active status.iso.path into every new VM. To force an immediate
refresh, annotate the CR with
agentforge.containeroo.ch/force-iso-refresh=<unique-value>.
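A hedged example of forcing that refresh, using the placeholder names from the sample CR:

```bash
# Any new annotation value triggers a re-check; a timestamp is a convenient unique value.
kubectl annotate vsphereagentpool demo-worker -n demo \
  agentforge.containeroo.ch/force-iso-refresh="$(date +%s)" --overwrite
```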
Requirements:
- Go 1.26 or newer.
- kubectl or oc.
- Docker or another compatible container tool.
- Access to a Kubernetes or OpenShift cluster for deployment tests.
Common commands:
make test
make lint
make manifests
make crd
make deploy IMG=<registry>/agent-forge-operator:<tag>

Run locally against the active kubeconfig:
make install
make run

The controller uses govc for vSphere operations. The container image includes
govc; local make run expects govc at /usr/local/bin/govc unless
GOVC_PATH is set.
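A hedged example of pointing a local run at an existing govc install:

```bash
# Use the govc binary already on PATH instead of /usr/local/bin/govc.
GOVC_PATH="$(command -v govc)" make run
```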
Releases are built by GoReleaser from pushed tags:
git tag v0.0.9
git push origin v0.0.9

The release workflow publishes multi-architecture images to GHCR and DockerHub
and attaches crds.yaml to the GitHub release.
Licensed under the Apache License, Version 2.0. See LICENSE.