Commit 8315a13

fix: use one big vnet and attach AKS clusters to it to avoid creating bastion multiple times
1 parent f927481 commit 8315a13

10 files changed

Lines changed: 1167 additions & 508 deletions

.pipelines/.vsts-vhd-builder-pr-windows.yaml

Lines changed: 1 addition & 1 deletion
@@ -34,7 +34,7 @@ pr:
  - parts/windows
  - go.mod
  - go.sum
- - e2e/
+ - e2e/scenario_win_test.go
  - staging/cse/windows/

  exclude:

e2e/README.md

Lines changed: 151 additions & 20 deletions
@@ -19,39 +19,170 @@ From a high-level, for each scenario,
1919

2020
To write an E2E scenario,
2121

22-
- choose a testing cluster. There are a few defined
23-
in [cluster.go](https://github.com/Azure/AgentBaker/blob/dev/e2e/cluster.go), e.g,
24-
- ClusterKubenetAirgap
25-
- ClusterAzureNetwork
22+
- choose a testing cluster. There are several defined
23+
in [cache.go](cache.go), e.g,
2624
- ClusterKubenet
25+
- ClusterAzureNetwork
26+
- ClusterAzureOverlayNetwork
27+
- ClusterAzureOverlayNetworkDualStack
28+
- ClusterCiliumNetwork
29+
- ClusterLatestKubernetesVersion
30+
- ClusterAzureBootstrapProfileCache (private ACR)
31+
- ClusterAzureNetworkIsolated (no internet access)
2732
- use `NodeBootstrappingConfiguration` (`nbc`) to set up your scenario. It is used to invoke the primary
2833
node-bootstrapping
2934
API [GetLatestNodeBootstrapping](https://github.com/Azure/AgentBaker/blob/2e730b5a498c5be9b082d912fd08ac9346582db9/pkg/agent/bakerapi.go#L14).
3035
To modify agentpool properties, you usually need to set both `nbc.containerService.properties.AgentPoolProfiles[0].xxx`
3136
as well as `nbc.agentPoolProfile`. It is because when RP invokes AgentBaker, it will set the properties in this way
3237
and in e2e we follow the pattern.
3338
- use `VMConfigMutator` to set VMSS properties such as SKU when needed.
34-
Check [vmss](https://github.com/Azure/AgentBaker/blob/dev/e2e/vmss.go) for other configs.
39+
Check [vmss](vmss.go) for other configs.
3540
it is necessary to set `nbc.agentPoolProfile.VMSize` to match the VMSS SKU if you choose to change.
3641
- use `Validator` to include your own verification of the VM's live state, such as file existence, sysctl settings, etc.
3742

43+
## Infrastructure Architecture
44+
45+
All E2E clusters share a single VNet and Azure Bastion in the `abe2e-{location}` resource group. This
46+
avoids creating a per-cluster Bastion (~10 min each) and ensures all clusters are reachable from a
47+
single SSH entry point.
48+
49+
```mermaid
50+
graph TB
51+
subgraph RG["abe2e-{location} Resource Group"]
52+
subgraph VNET["abe2e-shared-vnet (10.0.0.0/8)"]
53+
BASTION_SUBNET["AzureBastionSubnet<br/>10.0.0.0/26"]
54+
FW_SUBNET["AzureFirewallSubnet<br/>10.0.1.0/24"]
55+
PE_SUBNET["abe2e-pe-subnet<br/>10.0.2.0/24<br/>(shared private endpoints)"]
56+
KUBENET_SUBNET["aks-subnet-abe2e-kubenet-v5<br/>10.x.x.0/20"]
57+
AZNET_SUBNET["aks-subnet-abe2e-azure-network-v4<br/>10.x.x.0/20"]
58+
MORE_SUBNETS["... more cluster subnets"]
59+
end
60+
BASTION["abe2e-shared-bastion<br/>(Standard SKU, Tunneling)"]
61+
FIREWALL["abe2e-fw<br/>(Azure Firewall)"]
62+
IDENTITY["abe2e-cluster-identity<br/>(User-Assigned MSI)"]
63+
PE_ACR["PE-for-abe2eprivate{location}<br/>PE-for-abe2eprivatenonanon{location}<br/>(shared ACR private endpoints)"]
64+
DNS_ZONE["privatelink.azurecr.io<br/>(Private DNS Zone)"]
65+
ACR_ANON["abe2eprivate{location}<br/>(Private ACR)"]
66+
ACR_NONANON["abe2eprivatenonanon{location}<br/>(Non-anonymous Private ACR)"]
67+
end
68+
69+
subgraph MC_KUBENET["MC_abe2e-kubenet-v5 Resource Group"]
70+
VMSS_K["VMSS (system pool)"]
71+
VMSS_K_TEST["VMSS (test VMs)"]
72+
RT_K["Route Table<br/>(pod routes + firewall)"]
73+
end
74+
75+
subgraph MC_NI["MC_abe2e-azure-networkisolated-v2 Resource Group"]
76+
VMSS_NI["VMSS (system pool)"]
77+
NSG_NI["NSG<br/>(blocks internet)"]
78+
end
79+
80+
BASTION --> BASTION_SUBNET
81+
FIREWALL --> FW_SUBNET
82+
PE_ACR --> PE_SUBNET
83+
DNS_ZONE -.->|VNet link| VNET
84+
VMSS_K --> KUBENET_SUBNET
85+
RT_K -.->|associated| KUBENET_SUBNET
86+
VMSS_NI --> AZNET_SUBNET
87+
NSG_NI -.->|associated| AZNET_SUBNET
88+
89+
DEV["Developer / CI"]
90+
DEV -->|SSH via tunnel| BASTION
91+
BASTION -->|"connects to any VM<br/>in shared VNet"| VMSS_K_TEST
92+
```
93+
94+
### Shared Infrastructure Setup
95+
96+
The shared infrastructure is created **automatically** on first test run via cached idempotent
97+
functions — no separate setup script is needed.
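
A "cached idempotent function" can be pictured as a helper that runs an expensive setup step at most once per key and returns the stored result to every later caller. The following is a minimal sketch of that idea only; `onceCache`, `get`, and the `westus3` key are hypothetical names chosen for illustration and are not taken from the AgentBaker code.

```go
package main

import (
	"fmt"
	"sync"
)

// onceCache is a hypothetical helper: it runs create at most once per key
// and returns the stored value (or error) to every later caller.
type onceCache[V any] struct {
	mu      sync.Mutex
	entries map[string]*entry[V]
}

type entry[V any] struct {
	once sync.Once
	val  V
	err  error
}

func (c *onceCache[V]) get(key string, create func() (V, error)) (V, error) {
	// The mutex only guards the map; the setup itself runs outside the lock
	// so that different keys can be initialised concurrently.
	c.mu.Lock()
	if c.entries == nil {
		c.entries = make(map[string]*entry[V])
	}
	e, ok := c.entries[key]
	if !ok {
		e = &entry[V]{}
		c.entries[key] = e
	}
	c.mu.Unlock()

	e.once.Do(func() { e.val, e.err = create() })
	return e.val, e.err
}

func main() {
	var infra onceCache[string]
	calls := 0
	ensure := func() (string, error) {
		calls++ // stands in for the slow "create or verify" call
		return "abe2e-shared-vnet", nil
	}
	for i := 0; i < 3; i++ {
		name, _ := infra.get("westus3", ensure) // assumed location key
		fmt.Println(name)
	}
	fmt.Println("create calls:", calls) // prints "create calls: 1"
}
```

Keying the cache by location matches the "once per location per test run" behaviour described in this README. The sketch also caches errors, which a real implementation may or may not want.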
98+
99+
| Resource | Name | Details |
100+
|----------|------|---------|
101+
| VNet | `abe2e-shared-vnet` | `10.0.0.0/8` — supports ~4096 `/20` cluster subnets |
102+
| Bastion | `abe2e-shared-bastion` | Standard SKU with tunneling enabled for native SSH |
103+
| Bastion Subnet | `AzureBastionSubnet` | `10.0.0.0/26` (required by Azure Bastion) |
104+
| Firewall Subnet | `AzureFirewallSubnet` | `10.0.1.0/24` |
105+
| PE Subnet | `abe2e-pe-subnet` | `10.0.2.0/24` — hosts shared private endpoints for ACRs |
106+
| Identity | `abe2e-cluster-identity` | User-assigned MSI with Network Contributor on the VNet |
107+
| Private DNS Zone | `privatelink.azurecr.io` | Shared zone in `abe2e-{location}` RG, linked to the VNet |
108+
109+
Each AKS cluster gets its own `/20` subnet (4091 usable IPs) in the shared VNet. The subnet is
110+
named `aks-subnet-{clusterName}`. CIDRs are auto-allocated from a hash of the cluster name to
111+
avoid collisions.
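
As a rough illustration of hash-based allocation, the sketch below maps a cluster name to one of the `/20` blocks inside `10.0.0.0/8`. It is an assumption-laden example, not the allocator used by the tests: the function name `subnetCIDR`, the choice of FNV-1a, and the decision to skip the first block (which contains the Bastion, Firewall, and PE subnets) are all made up for illustration.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// slots is the number of /20 blocks in a /8: 2^(20-8) = 4096.
const slots = 4096

// subnetCIDR is a hypothetical allocator: it hashes the cluster name with
// FNV-1a and maps the hash to a /20 inside 10.0.0.0/8.
// Slot 0 (10.0.0.0/20) is skipped because it overlaps the shared
// 10.0.0.0/26, 10.0.1.0/24 and 10.0.2.0/24 subnets.
func subnetCIDR(clusterName string) string {
	h := fnv.New32a()
	h.Write([]byte(clusterName))
	slot := 1 + h.Sum32()%(slots-1) // 1..4095

	// A /20 spans 16 values of the third octet, so the slot index splits
	// into the second octet (slot / 16) and the third octet ((slot % 16) * 16).
	return fmt.Sprintf("10.%d.%d.0/20", slot>>4, (slot&0xF)<<4)
}

func main() {
	for _, name := range []string{"abe2e-kubenet-v5", "abe2e-azure-network-v4"} {
		fmt.Printf("aks-subnet-%s -> %s\n", name, subnetCIDR(name))
	}
}
```

The mapping is deterministic, so the same cluster name always resolves to the same subnet across runs. A hash alone cannot rule out two names landing in the same slot, so an allocator along these lines would still need some check for an already-occupied range.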
112+
113+
### Cluster Types
114+
115+
All clusters use BYOV (Bring Your Own VNet) with the shared VNet. They differ in networking
116+
plugin, isolation level, and whether private ACR is needed.
117+
118+
| Cluster | Network Plugin | Special Features | Private ACR |
119+
|---------|---------------|-----------------|:-----------:|
120+
| `abe2e-kubenet-v5` | Kubenet | Basic pod routing via route table ||
121+
| `abe2e-azure-network-v4` | Azure CNI | Pods get IPs from subnet (MaxPods=30) ||
122+
| `abe2e-azure-overlay-network-v4` | Azure CNI Overlay | Pods in virtual overlay, not subnet ||
123+
| `abe2e-azure-overlay-dualstack-v4` | Azure CNI Overlay | IPv4+IPv6 dual-stack ||
124+
| `abe2e-cilium-network-v4` | Azure CNI + Cilium | eBPF dataplane, replaces kube-proxy ||
125+
| `abe2e-latest-kubernetes-version-v2` | Kubenet | Auto-discovers latest GA K8s version ||
126+
| `abe2e-azure-bootstrapprofile-cache-v2` | Azure CNI | Bootstrap artifact caching from private ACR | ✓ |
127+
| `abe2e-azure-networkisolated-v2` | Azure CNI | NSG blocks all internet except allowlist | ✓ |
128+
129+
**Network-isolated cluster** adds an NSG to its subnet that blocks all outbound traffic except
130+
`management.azure.com`, the cluster FQDN, and `packages.aks.azure.com`. Private endpoints for
131+
the ACRs are in the shared PE subnet, with DNS records in the shared `privatelink.azurecr.io` zone.
132+
133+
### How It Works
134+
135+
1. **`CachedEnsureSharedInfra`** — runs once per location per test run. Creates/verifies the shared
136+
VNet, Bastion, Firewall, PE subnet, and user-assigned identity.
137+
2. **`configureSharedVNet`** — tags the cluster model for BYOV. After the cluster name is hashed,
138+
**`CachedEnsureClusterSubnet`** creates the cluster's dedicated `/20` subnet.
139+
3. **`prepareCluster`** — creates/gets the AKS cluster, then runs a DAG of parallel tasks:
140+
- Bastion lookup (shared)
141+
- Firewall route table (non-isolated clusters)
142+
- NSG association (network-isolated cluster)
143+
- Private DNS zone + VNet link (if ACR needed, runs once before ACR tasks)
144+
- Private ACR + PE creation (bootstrapprofile-cache and network-isolated)
145+
- VMSS garbage collection
146+
- Debug daemonsets
147+
4. SSH to test VMs goes through the shared Bastion, which can reach any VM in the VNet.
148+
149+
### Test Flow
150+
38151
```mermaid
39152
sequenceDiagram
40-
E2E->>+ARM: Get or Create AKS Cluster
41-
ARM-->>-E2E: Cluster details
42-
E2E->>+AgentBakerCode: Fetch VM Configuration (include CSE)
43-
AgentBakerCode-->>-E2E: VM Configuration
44-
E2E->>+ARM: Create VM using fetched VM Config in cluster network
45-
ARM-->>-E2E: VM instance
46-
E2E->>+Bastion: Create SSH Tunnel
47-
Bastion->>+VM: Forward SSH Connection
48-
E2E->>VM: Healthcheck via SSH Tunnel
49-
VM-->>E2E: Healthcheck OK
50-
E2E->>+KubeAPI: Verify Node Ready
51-
KubeAPI-->>-E2E: Node Ready
52-
E2E->>VM: Execute test validators via SSH Tunnel
53-
VM-->>-E2E: Test results
54-
Bastion-->>-E2E: Close SSH Tunnel
153+
participant CI as Developer / CI
154+
participant Infra as Shared Infra (cached)
155+
participant ARM as Azure Resource Manager
156+
participant AB as AgentBaker API
157+
participant Bastion as Shared Bastion
158+
participant VM as Test VM
159+
participant K8s as Kube API Server
160+
161+
CI->>Infra: Ensure shared VNet + Bastion
162+
Infra-->>CI: Ready (cached after first run)
163+
164+
CI->>Infra: Ensure cluster subnet
165+
Infra-->>CI: Subnet ID
166+
167+
CI->>ARM: Create/Get AKS cluster (BYOV subnet)
168+
ARM-->>CI: Cluster details
169+
170+
CI->>AB: Generate CSE + CustomData
171+
AB-->>CI: VM configuration
172+
173+
CI->>ARM: Create VMSS in cluster subnet
174+
ARM-->>CI: VM instance
175+
176+
CI->>Bastion: SSH tunnel to VM private IP
177+
Bastion->>VM: Forward SSH connection
178+
179+
CI->>VM: Run health checks + validators
180+
VM-->>CI: Results
181+
182+
CI->>K8s: Verify node ready
183+
K8s-->>CI: Node ready ✓
184+
185+
Bastion-->>CI: Close tunnel
55186
```
56187

57188
## Running Locally
