@@ -19,39 +19,170 @@ From a high-level, for each scenario,
1919
2020To write an E2E scenario,
2121
22- - choose a testing cluster. There are a few defined
23- in [ cluster.go] ( https://github.com/Azure/AgentBaker/blob/dev/e2e/cluster.go ) , e.g,
24- - ClusterKubenetAirgap
25- - ClusterAzureNetwork
22+ - choose a testing cluster. There are several defined
23+ in [cache.go](cache.go), e.g.,
2624 - ClusterKubenet
25+ - ClusterAzureNetwork
26+ - ClusterAzureOverlayNetwork
27+ - ClusterAzureOverlayNetworkDualStack
28+ - ClusterCiliumNetwork
29+ - ClusterLatestKubernetesVersion
30+ - ClusterAzureBootstrapProfileCache (private ACR)
31+ - ClusterAzureNetworkIsolated (no internet access)
2732- use `NodeBootstrappingConfiguration` (`nbc`) to set up your scenario. It is used to invoke the primary
2833 node-bootstrapping
2934 API [GetLatestNodeBootstrapping](https://github.com/Azure/AgentBaker/blob/2e730b5a498c5be9b082d912fd08ac9346582db9/pkg/agent/bakerapi.go#L14).
3035 To modify agentpool properties, you usually need to set both `nbc.containerService.properties.AgentPoolProfiles[0].xxx`
3136 and `nbc.agentPoolProfile`, because when RP invokes AgentBaker it sets the properties this way,
3237 and the e2e tests follow the same pattern.
3338- use ` VMConfigMutator ` to set VMSS properties such as SKU when needed.
34- Check [ vmss] ( https://github.com/Azure/AgentBaker/blob/dev/e2e/ vmss.go) for other configs.
39+ Check [vmss](vmss.go) for other configs.
3540 If you change the SKU, you must also set `nbc.agentPoolProfile.VMSize` to match the VMSS SKU.
3641- use `Validator` to include your own verification of the VM's live state, such as file existence, sysctl settings, etc.
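
The rule that agent pool properties must be set in two places, and that the VM size must agree with the VMSS SKU, can be illustrated with a small sketch. All types and helpers below (`NodeBootstrappingConfiguration`, `VMSS`, `setVMSize`, `checkVMSize`, `newConfig`) are simplified, hypothetical stand-ins written for this example. They are not the real AgentBaker or Azure SDK definitions, and the field names are assumptions.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the real AgentBaker and Azure SDK types.
type AgentPoolProfile struct{ VMSize string }

type Properties struct{ AgentPoolProfiles []*AgentPoolProfile }

type ContainerService struct{ Properties *Properties }

type NodeBootstrappingConfiguration struct {
	ContainerService *ContainerService
	AgentPoolProfile *AgentPoolProfile
}

type VMSS struct{ SKUName string }

// newConfig builds a consistent config and VMSS pair for the given size.
func newConfig(size string) (*NodeBootstrappingConfiguration, *VMSS) {
	nbc := &NodeBootstrappingConfiguration{
		ContainerService: &ContainerService{
			Properties: &Properties{
				AgentPoolProfiles: []*AgentPoolProfile{{VMSize: size}},
			},
		},
		AgentPoolProfile: &AgentPoolProfile{VMSize: size},
	}
	return nbc, &VMSS{SKUName: size}
}

// setVMSize writes one VM size to every place that has to agree:
// both agent pool profile copies and the VMSS SKU.
func setVMSize(nbc *NodeBootstrappingConfiguration, vmss *VMSS, size string) {
	nbc.ContainerService.Properties.AgentPoolProfiles[0].VMSize = size
	nbc.AgentPoolProfile.VMSize = size
	vmss.SKUName = size
}

// checkVMSize returns an error when the three copies have drifted apart.
func checkVMSize(nbc *NodeBootstrappingConfiguration, vmss *VMSS) error {
	inService := nbc.ContainerService.Properties.AgentPoolProfiles[0].VMSize
	inPool := nbc.AgentPoolProfile.VMSize
	if inService != inPool || inPool != vmss.SKUName {
		return fmt.Errorf("VM size mismatch: containerService=%q agentPoolProfile=%q vmss=%q",
			inService, inPool, vmss.SKUName)
	}
	return nil
}

func main() {
	nbc, vmss := newConfig("Standard_D2s_v3")

	// Changing only the VMSS SKU leaves the bootstrapping config stale.
	vmss.SKUName = "Standard_D4s_v3"
	fmt.Println("after VMSS-only change:", checkVMSize(nbc, vmss))

	// Updating all three places together restores consistency.
	setVMSize(nbc, vmss, "Standard_D4s_v3")
	fmt.Println("after setVMSize:", checkVMSize(nbc, vmss))
}
```

Funnelling the change through one helper is only one way to keep the copies in sync. The point of the sketch is that a change to the SKU touches three values, not one.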
3742
43+ ## Infrastructure Architecture
44+
45+ All E2E clusters share a single VNet and Azure Bastion in the `abe2e-{location}` resource group. This
46+ avoids creating a per-cluster Bastion (about 10 minutes each) and ensures all clusters are reachable from a
47+ single SSH entry point.
48+
49+ ``` mermaid
50+ graph TB
51+ subgraph RG["abe2e-{location} Resource Group"]
52+ subgraph VNET["abe2e-shared-vnet (10.0.0.0/8)"]
53+ BASTION_SUBNET["AzureBastionSubnet<br/>10.0.0.0/26"]
54+ FW_SUBNET["AzureFirewallSubnet<br/>10.0.1.0/24"]
55+ PE_SUBNET["abe2e-pe-subnet<br/>10.0.2.0/24<br/>(shared private endpoints)"]
56+ KUBENET_SUBNET["aks-subnet-abe2e-kubenet-v5<br/>10.x.x.0/20"]
57+ AZNET_SUBNET["aks-subnet-abe2e-azure-networkisolated-v2<br/>10.x.x.0/20"]
58+ MORE_SUBNETS["... more cluster subnets"]
59+ end
60+ BASTION["abe2e-shared-bastion<br/>(Standard SKU, Tunneling)"]
61+ FIREWALL["abe2e-fw<br/>(Azure Firewall)"]
62+ IDENTITY["abe2e-cluster-identity<br/>(User-Assigned MSI)"]
63+ PE_ACR["PE-for-abe2eprivate{location}<br/>PE-for-abe2eprivatenonanon{location}<br/>(shared ACR private endpoints)"]
64+ DNS_ZONE["privatelink.azurecr.io<br/>(Private DNS Zone)"]
65+ ACR_ANON["abe2eprivate{location}<br/>(Private ACR)"]
66+ ACR_NONANON["abe2eprivatenonanon{location}<br/>(Non-anonymous Private ACR)"]
67+ end
68+
69+ subgraph MC_KUBENET["MC_abe2e-kubenet-v5 Resource Group"]
70+ VMSS_K["VMSS (system pool)"]
71+ VMSS_K_TEST["VMSS (test VMs)"]
72+ RT_K["Route Table<br/>(pod routes + firewall)"]
73+ end
74+
75+ subgraph MC_NI["MC_abe2e-azure-networkisolated-v2 Resource Group"]
76+ VMSS_NI["VMSS (system pool)"]
77+ NSG_NI["NSG<br/>(blocks internet)"]
78+ end
79+
80+ BASTION --> BASTION_SUBNET
81+ FIREWALL --> FW_SUBNET
82+ PE_ACR --> PE_SUBNET
83+ DNS_ZONE -.->|VNet link| VNET
84+ VMSS_K --> KUBENET_SUBNET
85+ RT_K -.->|associated| KUBENET_SUBNET
86+ VMSS_NI --> AZNET_SUBNET
87+ NSG_NI -.->|associated| AZNET_SUBNET
88+
89+ DEV["Developer / CI"]
90+ DEV -->|SSH via tunnel| BASTION
91+ BASTION -->|"connects to any VM<br/>in shared VNet"| VMSS_K_TEST
92+ ```
93+
94+ ### Shared Infrastructure Setup
95+
96+ The shared infrastructure is created **automatically** on the first test run via cached idempotent
97+ functions, so no separate setup script is needed.
98+
99+ | Resource | Name | Details |
100+ | ----------| ------| ---------|
101+ | VNet | ` abe2e-shared-vnet ` | ` 10.0.0.0/8 ` — supports ~ 4096 ` /20 ` cluster subnets |
102+ | Bastion | ` abe2e-shared-bastion ` | Standard SKU with tunneling enabled for native SSH |
103+ | Bastion Subnet | ` AzureBastionSubnet ` | ` 10.0.0.0/26 ` (required by Azure Bastion) |
104+ | Firewall Subnet | ` AzureFirewallSubnet ` | ` 10.0.1.0/24 ` |
105+ | PE Subnet | ` abe2e-pe-subnet ` | ` 10.0.2.0/24 ` — hosts shared private endpoints for ACRs |
106+ | Identity | ` abe2e-cluster-identity ` | User-assigned MSI with Network Contributor on the VNet |
107+ | Private DNS Zone | ` privatelink.azurecr.io ` | Shared zone in ` abe2e-{location} ` RG, linked to the VNet |
108+
109+ Each AKS cluster gets its own ` /20 ` subnet (4091 usable IPs) in the shared VNet. The subnet is
110+ named ` aks-subnet-{clusterName} ` . CIDRs are auto-allocated from a hash of the cluster name to
111+ avoid collisions.
112+
113+ ### Cluster Types
114+
115+ All clusters use BYOV (Bring Your Own VNet) with the shared VNet. They differ in networking
116+ plugin, isolation level, and whether private ACR is needed.
117+
118+ | Cluster | Network Plugin | Special Features | Private ACR |
119+ | ---------| ---------------| -----------------| :-----------:|
120+ | ` abe2e-kubenet-v5 ` | Kubenet | Basic pod routing via route table | ❌ |
121+ | ` abe2e-azure-network-v4 ` | Azure CNI | Pods get IPs from subnet (MaxPods=30) | ❌ |
122+ | ` abe2e-azure-overlay-network-v4 ` | Azure CNI Overlay | Pods in virtual overlay, not subnet | ❌ |
123+ | ` abe2e-azure-overlay-dualstack-v4 ` | Azure CNI Overlay | IPv4+IPv6 dual-stack | ❌ |
124+ | ` abe2e-cilium-network-v4 ` | Azure CNI + Cilium | eBPF dataplane, replaces kube-proxy | ❌ |
125+ | ` abe2e-latest-kubernetes-version-v2 ` | Kubenet | Auto-discovers latest GA K8s version | ❌ |
126+ | ` abe2e-azure-bootstrapprofile-cache-v2 ` | Azure CNI | Bootstrap artifact caching from private ACR | ✅ |
127+ | ` abe2e-azure-networkisolated-v2 ` | Azure CNI | NSG blocks all internet except allowlist | ✅ |
128+
129+ The **network-isolated cluster** adds an NSG to its subnet that blocks all outbound traffic except
130+ `management.azure.com`, the cluster FQDN, and `packages.aks.azure.com`. Private endpoints for
131+ the ACRs are in the shared PE subnet, with DNS records in the shared `privatelink.azurecr.io` zone.
132+
133+ ### How It Works
134+
135+ 1. **`CachedEnsureSharedInfra`**: runs once per location per test run. Creates/verifies the shared
136+ VNet, Bastion, Firewall, PE subnet, and user-assigned identity.
137+ 2. **`configureSharedVNet`**: tags the cluster model for BYOV. After the cluster name is hashed,
138+ **`CachedEnsureClusterSubnet`** creates the cluster's dedicated `/20` subnet.
139+ 3. **`prepareCluster`**: creates/gets the AKS cluster, then runs a DAG of parallel tasks:
140+ - Bastion lookup (shared)
141+ - Firewall route table (non-isolated clusters)
142+ - NSG association (network-isolated cluster)
143+ - Private DNS zone + VNet link (if ACR needed, runs once before ACR tasks)
144+ - Private ACR + PE creation (bootstrapprofile-cache and network-isolated)
145+ - VMSS garbage collection
146+ - Debug daemonsets
147+ 4. SSH to test VMs goes through the shared Bastion, which can reach any VM in the VNet.
148+
149+ ### Test Flow
150+
38151``` mermaid
39152sequenceDiagram
40- E2E->>+ARM: Get or Create AKS Cluster
41- ARM-->>-E2E: Cluster details
42- E2E->>+AgentBakerCode: Fetch VM Configuration (include CSE)
43- AgentBakerCode-->>-E2E: VM Configuration
44- E2E->>+ARM: Create VM using fetched VM Config in cluster network
45- ARM-->>-E2E: VM instance
46- E2E->>+Bastion: Create SSH Tunnel
47- Bastion->>+VM: Forward SSH Connection
48- E2E->>VM: Healthcheck via SSH Tunnel
49- VM-->>E2E: Healthcheck OK
50- E2E->>+KubeAPI: Verify Node Ready
51- KubeAPI-->>-E2E: Node Ready
52- E2E->>VM: Execute test validators via SSH Tunnel
53- VM-->>-E2E: Test results
54- Bastion-->>-E2E: Close SSH Tunnel
153+ participant CI as Developer / CI
154+ participant Infra as Shared Infra (cached)
155+ participant ARM as Azure Resource Manager
156+ participant AB as AgentBaker API
157+ participant Bastion as Shared Bastion
158+ participant VM as Test VM
159+ participant K8s as Kube API Server
160+
161+ CI->>Infra: Ensure shared VNet + Bastion
162+ Infra-->>CI: Ready (cached after first run)
163+
164+ CI->>Infra: Ensure cluster subnet
165+ Infra-->>CI: Subnet ID
166+
167+ CI->>ARM: Create/Get AKS cluster (BYOV subnet)
168+ ARM-->>CI: Cluster details
169+
170+ CI->>AB: Generate CSE + CustomData
171+ AB-->>CI: VM configuration
172+
173+ CI->>ARM: Create VMSS in cluster subnet
174+ ARM-->>CI: VM instance
175+
176+ CI->>Bastion: SSH tunnel to VM private IP
177+ Bastion->>VM: Forward SSH connection
178+
179+ CI->>VM: Run health checks + validators
180+ VM-->>CI: Results
181+
182+ CI->>K8s: Verify node ready
183+ K8s-->>CI: Node ready ✓
184+
185+ Bastion-->>CI: Close tunnel
55186```
56187
57188## Running Locally