AKS Flex Node

Overview

AKS Flex Node extends Azure Kubernetes Service (AKS) to customer-managed virtual machines and bare metal hosts, enabling them to run as AKS worker nodes outside standard AKS node pools. It is built on top of Azure Unbounded, which provides the host-side foundation for running and reconciling isolated Kubernetes node environments.

Status: AKS Flex Node is currently alpha software.

Key Features And Scenarios

  • Bootstrap and join virtual machines or bare metal hosts for both amd64 and arm64 as AKS worker nodes.
  • Support hybrid, lab, and specialized hardware scenarios.
  • Use flexible authentication modes, including Azure Arc, managed identity (MSI), and Kubernetes bootstrap token.
  • Automatically detect NVIDIA GPU devices and configure the container runtime for accelerated workloads.
  • Run blue-green in-place updates and upgrades while retaining the existing host.
  • Manage your Flex Node fleet through AKS management APIs for upgrade, repair, reset, and related lifecycle operations.
  • Remediate and repair agent and node state through first-class lifecycle operations.

Getting Started

Before you begin, create or choose an AKS cluster and a virtual machine or bare metal host to join as a Flex Node. This example assumes a Linux workstation with Azure CLI, kubectl, curl, and python3 installed. The target host must run systemd, permit installation as root, and be able to reach the AKS API server over outbound HTTPS. Choose a VM size with enough CPU and memory for systemd-nspawn startup and the Kubernetes components; the validated quickstart used a 4-vCPU Azure VM.
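A quick preflight check for the workstation tools listed above might look like the following. This is a minimal sketch: `missing_tools` is a hypothetical helper, not part of aks-flex-node, and `az` is the command name of the Azure CLI.

```shell
#!/bin/sh
# Print each command from the argument list that is not found on PATH.
# Prints nothing when every tool is available.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Example: check the workstation tools this quickstart assumes.
# missing_tools az kubectl curl python3
```

An empty result means all tools were found; anything printed should be installed before continuing.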

The flow below will:

  1. Apply the node bootstrap RBAC bindings on the AKS cluster.
  2. Create a Kubernetes bootstrap token while generating the Flex Node config from AKS cluster metadata.
  3. Install aks-flex-node on the target host as root.
  4. Copy the generated config to /etc/aks-flex-node/config.json.
  5. Start the host bootstrap flow and launch the aks-flex-node-agent systemd service.

Expected result: the target host appears in kubectl get nodes, and aks-flex-node-agent is running on the host.

On your workstation, save the config helper script, set up the node RBAC permissions, then generate the bootstrap-token config from AKS cluster metadata:

RESOURCE_GROUP="<resource-group>"
CLUSTER_NAME="<cluster-name>"
SUBSCRIPTION_ID="<subscription-id>"

curl -fsSLo ./aks-flex-config https://raw.githubusercontent.com/Azure/AKSFlexNode/main/scripts/aks-flex-config
chmod +x ./aks-flex-config

./aks-flex-config setup-node-rbac \
  --resource-group "$RESOURCE_GROUP" \
  --cluster-name "$CLUSTER_NAME" \
  --subscription "$SUBSCRIPTION_ID"

./aks-flex-config generate-node-config \
  --resource-group "$RESOURCE_GROUP" \
  --cluster-name "$CLUSTER_NAME" \
  --subscription "$SUBSCRIPTION_ID" \
  --bootstrap-token \
  --output ./aks-flex-node-config.json

generate-node-config takes one of the following auth mode options: --bootstrap-token; --identity; --service-principal with --username <client-id> and --password <client-secret>; or --arc.
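Because the commands above rely on three shell variables, it can help to fail early if any of them is empty or still holds a `<placeholder>` value. The sketch below is one way to do that; `is_placeholder` and `require_values` are hypothetical helpers, not part of the aks-flex-config script.

```shell
#!/bin/sh
# Succeed if the value is empty or still looks like an unreplaced <placeholder>.
is_placeholder() {
  case "$1" in
    ""|"<"*">") return 0 ;;
    *) return 1 ;;
  esac
}

# Return non-zero (with a message on stderr) if any argument is a placeholder.
require_values() {
  for value in "$@"; do
    if is_placeholder "$value"; then
      echo "replace placeholder or empty value: '$value'" >&2
      return 1
    fi
  done
}

# Example: run before calling ./aks-flex-config.
# require_values "$RESOURCE_GROUP" "$CLUSTER_NAME" "$SUBSCRIPTION_ID" || exit 1
```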

Example Config With Field Notes

The rendered config should look like this. Comments are shown here only to explain the fields; do not add comments to /etc/aks-flex-node/config.json.

{
  "azure": {
    "subscriptionId": "<subscription-id>", // Azure subscription that owns the AKS cluster.
    "tenantId": "<tenant-id>", // Microsoft Entra tenant for the subscription.
    "cloud": "AzurePublicCloud", // Azure cloud environment.
    "bootstrapToken": {
      "token": "<token-id>.<token-secret>" // Kubernetes bootstrap token created by generate-node-config.
    },
    "arc": { "enabled": false }, // Arc is disabled for this bootstrap-token flow.
    "targetCluster": {
      "resourceId": "<aks-resource-id>", // Full ARM resource ID of the AKS cluster.
      "location": "<aks-location>" // Azure region of the AKS cluster.
    }
  },
  "node": {
    "kubelet": {
      "serverURL": "https://<aks-api-server>", // AKS API server endpoint.
      "caCertData": "<base64-ca-data>" // Cluster CA bundle from kubeconfig.
    }
  },
  "agent": {
    "logLevel": "info", // Agent log verbosity.
    "logDir": "/var/log/aks-flex-node" // Host log directory.
  },
  "kubernetes": { "version": "<aks-kubernetes-version>" } // Kubelet version to install.
}
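Before copying the config, you may want to confirm that the token field holds a real bootstrap token rather than a placeholder. Kubernetes bootstrap tokens have the form `[a-z0-9]{6}.[a-z0-9]{16}`. The sketch below checks that shape; `is_bootstrap_token` is a hypothetical helper, and the commented python3 one-liner assumes the `azure.bootstrapToken.token` path shown in the example config.

```shell
#!/bin/sh
# Succeed if the argument matches the Kubernetes bootstrap token format:
# a 6-character token ID, a dot, and a 16-character token secret.
is_bootstrap_token() {
  printf '%s' "$1" | grep -Eq '^[a-z0-9]{6}\.[a-z0-9]{16}$'
}

# Example: read the token from the generated config and check it.
# token="$(python3 -c 'import json, sys; print(json.load(open(sys.argv[1]))["azure"]["bootstrapToken"]["token"])' ./aks-flex-node-config.json)"
# is_bootstrap_token "$token" || echo "unexpected token format" >&2
```

The python3 step also fails if the file is not valid JSON, which catches comments that were left in the config by mistake.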

Copy the generated config to the target host:

TARGET_HOST="<user>@<host>"

scp ./aks-flex-node-config.json "$TARGET_HOST:/tmp/aks-flex-node-config.json"

On the target host, install the agent and move the generated config into place:

sudo su
# Optional: set AKS_FLEX_NODE_VERSION=<release-tag> to install a specific release.
curl -fsSL https://raw.githubusercontent.com/Azure/AKSFlexNode/main/scripts/install.sh | bash
aks-flex-node version

umask 077
mkdir -p /etc/aks-flex-node
cp /tmp/aks-flex-node-config.json /etc/aks-flex-node/config.json
chmod 600 /etc/aks-flex-node/config.json

cat /etc/aks-flex-node/config.json
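The config contains the bootstrap token, so it is worth confirming that the file ended up readable by its owner only. A minimal sketch, using a hypothetical `is_private_file` helper that inspects the permission string printed by `ls`:

```shell
#!/bin/sh
# Succeed only if the path is a regular file with mode 600 (rw for owner only).
# The first 10 characters of `ls -ld` output are the type and permission bits.
is_private_file() {
  [ "$(ls -ld "$1" | cut -c1-10)" = "-rw-------" ]
}

# Example: check the installed config on the target host.
# is_private_file /etc/aks-flex-node/config.json || echo "config is too permissive" >&2
```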

After reviewing the config, bootstrap the node. This installs the long-running agent service and starts the local Kubernetes worker environment.

aks-flex-node start --config /etc/aks-flex-node/config.json

Verify the node from your workstation:

kubectl get nodes -o wide

Example output:

NAME                   STATUS   ROLES    AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
aks-flex-config-test   Ready    <none>   12s   v1.34.3   10.0.0.4      <none>        Ubuntu 24.04.4 LTS   6.17.0-1013-azure   containerd://2.0.4

The node name should match the target host's hostname unless you set agent.nodeName in the config.
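If you want to script this check, the STATUS column can be pulled out of the `kubectl get nodes` output. The sketch below is one way to do it; `node_status` is a hypothetical helper and `NODE_NAME` is an assumed variable holding the expected node name.

```shell
#!/bin/sh
# Read `kubectl get nodes` output on stdin and print the STATUS column for
# the named node. Exits non-zero if the node is not listed.
node_status() {
  awk -v name="$1" '$1 == name { print $2; found = 1 } END { exit !found }'
}

# Example: run from the workstation.
# NODE_NAME="aks-flex-config-test"
# kubectl get nodes --no-headers | node_status "$NODE_NAME"
```

A newly joined node can take a short while to report Ready, so a non-Ready status immediately after bootstrap is not necessarily a failure.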

On the target host, the agent service should be active:

systemctl is-active aks-flex-node-agent
journalctl -u aks-flex-node-agent -f

Example output:

active

Example logs:

Started aks-flex-node-agent.service - AKS Flex Node Agent.
aks-flex-node[3800]: level=INFO msg="running agent daemon" nodeName=aks-flex-config-test
aks-flex-node[3800]: level=INFO msg="machine state reconciled" status=healthy

AKS Flex Node runs the Kubernetes worker inside a local nspawn machine. You can inspect it from the host:

machinectl list
machinectl status kube1
journalctl -M kube1 -u kubelet -f
journalctl -M kube1 -u containerd -f

Try scheduling a test workload onto the node and watch kubelet/containerd logs to see how the nspawn-backed worker handles it.
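One way to schedule such a test workload is to render a small Pod manifest pinned to the Flex Node. This is a sketch under assumptions: `render_test_pod` and the pod name `flex-node-smoke-test` are hypothetical, the node is assumed to carry the standard `kubernetes.io/hostname` label with a value equal to its node name, and the pause image is used only because it is small.

```shell
#!/bin/sh
# Print a Pod manifest that is pinned to the given node via nodeSelector.
render_test_pod() {
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: flex-node-smoke-test
spec:
  nodeSelector:
    kubernetes.io/hostname: $1
  restartPolicy: Never
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.9
EOF
}

# Example: apply from the workstation, inspect, then clean up.
# render_test_pod aks-flex-config-test | kubectl apply -f -
# kubectl get pod flex-node-smoke-test -o wide
# kubectl delete pod flex-node-smoke-test
```

While the pod starts, the kubelet and containerd journals inside the kube1 machine should show the sandbox and container being created.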

Reset And Uninstall

To remove AKS Flex Node from the host, run the uninstall script as root:

curl -fsSL https://raw.githubusercontent.com/Azure/AKSFlexNode/main/scripts/uninstall.sh | bash -s -- --force

Example summary:

SUCCESS: Reset completed
SUCCESS: Removed directory: /var/lib/aks-flex-node
SUCCESS: Removed binary: /usr/local/bin/aks-flex-node
SUCCESS: Azure CLI removed successfully
SUCCESS: AKS Flex Node uninstallation completed!

Example reset details:

level=INFO msg="systemd service uninstalled" unit=aks-flex-node-agent.service
level=INFO msg="removing machine rootfs" machine=kube1 dir=/var/lib/machines/kube1
level=INFO msg="removed runtime directory" path=/etc/aks-flex-node
level=INFO msg="removed runtime directory" path=/var/log/aks-flex-node

After uninstall, the host should no longer have the agent service or nspawn machines:

systemctl is-active aks-flex-node-agent
machinectl list

Example output:

inactive
No machines.
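Note that `systemctl is-active` exits non-zero when the unit is inactive, so scripts should not treat that exit status as an error here. To also confirm that the directories and binary named in the uninstall output are gone, a small check like the following may help. `leftover_paths` is a hypothetical helper; the paths in the example are taken from the summary and reset logs above.

```shell
#!/bin/sh
# Print every path from the argument list that still exists.
# Prints nothing when all of them have been removed.
leftover_paths() {
  for path in "$@"; do
    [ -e "$path" ] && printf '%s\n' "$path"
  done
  return 0
}

# Example: run on the target host after uninstall.
# leftover_paths /var/lib/aks-flex-node /usr/local/bin/aks-flex-node \
#   /etc/aks-flex-node /var/log/aks-flex-node /var/lib/machines/kube1
```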

Finally, from your workstation, delete the Kubernetes Node object from the cluster:

kubectl delete node <node-name>

Usage Guides And Topics

  • Usage Guide - Installation, configuration, authentication modes, operations, and troubleshooting.
  • GPU Flex Node setup - GPU host image and driver contract, cluster GPU stack, validation, and troubleshooting.
  • Design Documentation - Architecture, lifecycle, Azure integration, and security model.

Development And Security

License

This project is licensed under the MIT License. See LICENSE for details.


🚀 Built with ❤️ for the Kubernetes community
