Commit 29b12dd

feat: Add workflows for deploying and managing private AKS clusters with managed identity
- Created `deploy-private-aks.yml` for deploying a private AKS cluster, logging IPs, and teardown.
- Introduced `cleanup-safety-net.yml` to periodically delete stale resource groups.
- Updated `README.md` with detailed documentation on the PoC, architecture, and usage instructions.
- Added `deploy-private-aks.sh` script for standalone AKS deployment using managed identity.
- Implemented `log-ips.sh` for verifying managed identity traffic and logging IPs from various sources.
- Created `setup-runner-vm.sh` for provisioning a self-hosted GitHub Actions runner VM with managed identity.
- Added `teardown-runner-vm.sh` for deleting the runner VM and associated resources.
1 parent 0dc2556 commit 29b12dd

7 files changed

Lines changed: 1108 additions & 2 deletions

.github/workflows/cleanup-safety-net.yml

Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
name: Cleanup Safety Net

on:
  schedule:
    - cron: '0 * * * *'
  workflow_dispatch:

jobs:
  cleanup:
    runs-on: self-hosted

    steps:
      # ── 1. Azure Login ─────────────────────────────────────────
      - name: Azure Login (Managed Identity)
        uses: azure/login@v2
        with:
          auth-type: IDENTITY
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}

      # ── 2. Find and delete stale PoC resource groups ───────────
      - name: Delete stale PoC resource groups
        run: |
          echo "Checking for stale aks-poc resource groups..."
          STALE_COUNT=0
          # Feed the loop via process substitution: a pipe would run it in a subshell and lose STALE_COUNT.
          while IFS=$'\t' read -r name created; do
            if [ -z "$name" ]; then
              continue
            fi

            if [ -z "$created" ]; then
              echo "WARNING: Resource group '$name' has no 'created' tag timestamp, skipping"
              continue
            fi

            created_epoch=$(date -d "$created" +%s 2>/dev/null || echo 0)
            if [ "$created_epoch" -eq 0 ]; then
              echo "WARNING: Could not parse timestamp '$created' for resource group '$name', skipping"
              continue
            fi

            now_epoch=$(date +%s)
            age_minutes=$(( (now_epoch - created_epoch) / 60 ))

            if [ "$age_minutes" -gt 45 ]; then
              echo "Deleting stale resource group: $name (age: ${age_minutes} minutes)"
              az group delete --name "$name" --yes --no-wait || true
              STALE_COUNT=$((STALE_COUNT + 1))
            else
              echo "Keeping resource group: $name (age: ${age_minutes} minutes, under 45 min threshold)"
            fi
          done < <(az group list --tag purpose=aks-poc --query "[].{name:name, created:tags.created}" -o tsv)

          echo "Cleanup complete. Deletion initiated for ${STALE_COUNT} stale resource group(s)."

      # ── 3. Azure Logout ────────────────────────────────────────
      - name: Azure Logout
        if: always()
        run: az logout

.github/workflows/deploy-private-aks.yml

Lines changed: 165 additions & 0 deletions
@@ -0,0 +1,165 @@
name: Private AKS PoC - Deploy, Log, Teardown

on:
  workflow_dispatch:
    inputs:
      location:
        description: 'Azure region (canadacentral or canadaeast)'
        default: 'canadacentral'
        type: string
      wait_minutes:
        description: 'Minutes to wait before teardown (cost control)'
        default: '30'
        type: string

env:
  RESOURCE_GROUP: rg-aks-poc-${{ github.run_id }}
  CLUSTER_NAME: aks-poc-${{ github.run_id }}
  VNET_NAME: vnet-aks-poc
  SUBNET_NAME: subnet-aks
  LOCATION: ${{ github.event.inputs.location || 'canadacentral' }}

jobs:
  deploy-log-teardown:
    runs-on: self-hosted
    timeout-minutes: 60

    steps:
      # ── 1. Checkout ────────────────────────────────────────────
      - name: Checkout repository
        uses: actions/checkout@v4

      # ── 2. Azure Login ─────────────────────────────────────────
      - name: Azure Login (Managed Identity)
        uses: azure/login@v2
        with:
          auth-type: IDENTITY
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}

      # ── 3. Record Runner IP ────────────────────────────────────
      - name: Record runner IP
        run: |
          RUNNER_IP=$(curl -s ifconfig.me)
          echo "RUNNER_IP=$RUNNER_IP" >> "$GITHUB_ENV"
          echo "Runner public IP: $RUNNER_IP"

      # ── 4. Create Resource Group ───────────────────────────────
      - name: Create Resource Group
        run: |
          az group create \
            --name "$RESOURCE_GROUP" \
            --location "$LOCATION" \
            --tags purpose=aks-poc created=$(date -u +%Y-%m-%dT%H:%M:%SZ) run=${{ github.run_id }}

      # ── 5. Create VNet + Subnet ────────────────────────────────
      - name: Create VNet and Subnet
        run: |
          az network vnet create \
            --resource-group "$RESOURCE_GROUP" \
            --name "$VNET_NAME" \
            --address-prefixes 10.224.0.0/16 \
            --subnet-name "$SUBNET_NAME" \
            --subnet-prefixes 10.224.0.0/24

          SUBNET_ID=$(az network vnet subnet show \
            --resource-group "$RESOURCE_GROUP" \
            --vnet-name "$VNET_NAME" \
            --name "$SUBNET_NAME" \
            --query id -o tsv)
          echo "SUBNET_ID=$SUBNET_ID" >> "$GITHUB_ENV"

      # ── 6. Record Start Time ───────────────────────────────────
      - name: Record start time
        run: |
          echo "DEPLOY_START_TIME=$(date -u +%Y-%m-%dT%H:%M:%SZ)" >> "$GITHUB_ENV"

      # ── 7. Deploy Private AKS ──────────────────────────────────
      - name: Deploy Private AKS Cluster
        run: |
          az aks create \
            --resource-group "$RESOURCE_GROUP" \
            --name "$CLUSTER_NAME" \
            --node-count 1 \
            --node-vm-size Standard_B2s \
            --network-plugin azure \
            --vnet-subnet-id "$SUBNET_ID" \
            --enable-private-cluster \
            --enable-managed-identity \
            --generate-ssh-keys \
            --tier free \
            --no-wait || echo "DEPLOY_FAILED=true" >> "$GITHUB_ENV"

      # ── 8. Wait for Provisioning ───────────────────────────────
      - name: Wait for AKS provisioning
        if: env.DEPLOY_FAILED != 'true'
        run: |
          az aks wait \
            --resource-group "$RESOURCE_GROUP" \
            --name "$CLUSTER_NAME" \
            --created \
            --timeout 1200

      # ── 9. Log IPs (Activity Log) ─────────────────────────────
      - name: Log IPs (Activity Log)
        if: always()
        run: |
          echo "=== Runner VM Outbound IP ==="
          echo "Runner IP: $RUNNER_IP"
          echo ""

          echo "Waiting 60s for Activity Log propagation..."
          sleep 60

          echo "=== ARM Operation Caller IPs (ContainerService) ==="
          az monitor activity-log list \
            --resource-group "$RESOURCE_GROUP" \
            --start-time "$DEPLOY_START_TIME" \
            --query "[?contains(operationName.value, 'Microsoft.ContainerService')].{op:operationName.value, caller:caller, clientIp:httpRequest.clientIpAddress, status:status.value, time:eventTimestamp}" \
            -o table || echo "Activity log query failed for ContainerService"

          echo ""
          echo "=== ARM Operation Caller IPs (Network) ==="
          az monitor activity-log list \
            --resource-group "$RESOURCE_GROUP" \
            --start-time "$DEPLOY_START_TIME" \
            --query "[?contains(operationName.value, 'Microsoft.Network')].{op:operationName.value, caller:caller, clientIp:httpRequest.clientIpAddress, status:status.value, time:eventTimestamp}" \
            -o table || echo "Activity log query failed for Network"

          echo ""
          echo "=== IP Comparison ==="
          echo "Runner IP: $RUNNER_IP"
          echo "Compare the clientIp values above against the runner IP to verify traffic routes."

      # ── 10. Log IPs (Entra Sign-In) ────────────────────────────
      - name: Log IPs (Entra Sign-In Logs)
        if: always()
        continue-on-error: true
        run: |
          echo "=== Entra ID Sign-In IPs (requires P1/P2) ==="
          az rest --method get \
            --url "https://graph.microsoft.com/v1.0/auditLogs/signIns?\$filter=createdDateTime ge $DEPLOY_START_TIME and appId eq '${{ secrets.AZURE_CLIENT_ID }}'" \
            --query "value[].{ip:ipAddress, app:appDisplayName, time:createdDateTime, status:status.errorCode}" \
            -o table || echo "Sign-in log query failed (may require Entra P1/P2)"

      # ── 11. Wait Before Teardown ───────────────────────────────
      - name: Wait before teardown
        if: env.DEPLOY_FAILED != 'true'
        run: |
          WAIT=${{ github.event.inputs.wait_minutes || '30' }}
          echo "Waiting ${WAIT} minutes before teardown..."
          sleep $((WAIT * 60))

      # ── 12. Teardown ───────────────────────────────────────────
      - name: Teardown all resources
        if: always()
        run: |
          echo "Deleting resource group $RESOURCE_GROUP..."
          az group delete --name "$RESOURCE_GROUP" --yes --no-wait
          echo "Resource group deletion initiated."

      # ── 13. Azure Logout ───────────────────────────────────────
      - name: Azure Logout
        if: always()
        run: az logout

README.md

Lines changed: 142 additions & 2 deletions
@@ -1,3 +1,143 @@
-# aks-private-deployment
-https://learn.microsoft.com/en-us/azure/aks/private-clusters

---
title: Private AKS Deployment PoC with Managed Identity
description: Proof-of-concept demonstrating that managed identity bypasses Entra ID conditional access policies during private AKS cluster creation
author: devopsabcs-engineering
ms.date: 2026-04-01
ms.topic: concept
keywords:
  - azure kubernetes service
  - private clusters
  - managed identity
  - conditional access
  - proof of concept
---

## Overview

This proof of concept validates that Azure Managed Identity bypasses Entra ID conditional access (CA) policies when deploying private AKS clusters. Organizations with strict CA location policies can use managed identity to avoid authentication failures that occur with service principals.

## Problem Statement

When the AKS resource provider authenticates with a service principal's credentials during `az aks create`, the sign-in originates from Azure datacenter IPs, not from the customer's network. If the organization enforces conditional access policies that restrict authentication to known perimeter IPs, these policies block the service principal sign-in because the Azure datacenter IP falls outside the allowed range.

This is the confirmed root cause of deployment failures in environments with location-based conditional access for workload identities.

## Solution

Managed identity bypasses conditional access entirely. MI tokens are acquired internally via IMDS (`169.254.169.254`), not through `login.microsoftonline.com`. The CA engine does not evaluate managed identity token requests at all. Per Microsoft documentation: "Managed identities aren't covered by policy."

By running `az aks create --enable-managed-identity` from a self-hosted runner VM that itself authenticates via `az login --identity`, all authentication stays within the Azure fabric. No external sign-in occurs, so no CA policy evaluation is triggered.
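
To make the IMDS path concrete, the sketch below builds the token request that a VM with a managed identity sends to `169.254.169.254`. The helper names `imds_token_url` and `fetch_imds_token` are illustrative and not part of this repository; the endpoint path, `api-version=2018-02-01`, and the `Metadata: true` header follow the public IMDS documentation. The caller is assumed to pass an already URL-encoded resource.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of this repository.

IMDS_ENDPOINT="http://169.254.169.254/metadata/identity/oauth2/token"
IMDS_API_VERSION="2018-02-01"

# Build the IMDS token request URL.
#   $1: URL-encoded resource (audience) the token is requested for
#   $2: optional client ID of a user-assigned managed identity
imds_token_url() {
  url="${IMDS_ENDPOINT}?api-version=${IMDS_API_VERSION}&resource=$1"
  if [ -n "${2:-}" ]; then
    url="${url}&client_id=$2"
  fi
  printf '%s\n' "$url"
}

# Request a token from IMDS. This only works from inside an Azure VM,
# and IMDS rejects requests that lack the Metadata header.
fetch_imds_token() {
  curl -s -H "Metadata: true" "$(imds_token_url "$@")"
}
```

On the runner VM, `fetch_imds_token 'https%3A%2F%2Fmanagement.azure.com%2F' "$AZURE_CLIENT_ID"` would return a JSON document containing an `access_token` for ARM. The request never leaves the link-local address, which is why there is no public source IP for a location policy to inspect.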

## Architecture

```mermaid
graph TD
    A[GitHub workflow_dispatch] --> B[Self-Hosted Runner VM]
    B --> C[az login --identity]
    C --> D[Create Resource Group + VNet]
    D --> E[az aks create --enable-private-cluster --enable-managed-identity]
    E --> F{Deploy OK?}
    F -->|Yes| G[Log IPs from Activity Log + Sign-in Logs]
    G --> H[Wait 30 minutes]
    H --> I[az group delete - teardown]
    F -->|No| J[Log IPs + Error Info]
    J --> I

    subgraph "Identity Flow - No CA"
        C -.->|IMDS 169.254.169.254| K[Azure Fabric Token]
        E -.->|Cluster MI via IMDS| K
    end
```

## Authentication Flow Comparison

```text
SERVICE PRINCIPAL FLOW (PROBLEMATIC):
  Runner VM → az login --service-principal → login.microsoftonline.com (from Runner IP ✓)
  Runner VM → az aks create → ARM → AKS RP → login.microsoftonline.com (from Azure datacenter IP ✗)
                                              ↑ BLOCKED by CA

MANAGED IDENTITY FLOW (RECOMMENDED):
  Runner VM → az login --identity → IMDS 169.254.169.254 (internal, no CA ✓)
  Runner VM → az aks create → ARM → AKS RP → Azure fabric token (internal, no CA ✓)
                                              ↑ NOT evaluated by CA
```

The distinction is architectural: managed identities do not trigger conditional access because their credentials are managed by Azure and token issuance happens within the Azure fabric. There is no "source IP" for CA to evaluate.

## Prerequisites

* Azure subscription with permissions to create AKS clusters and managed identities
* Azure CLI v2.28.0 or later
* GitHub repository with Actions enabled
* Self-hosted runner VM in Azure with a user-assigned managed identity (`Contributor` + `User Access Administrator` on the subscription)
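
The CLI version requirement above can be checked before running anything else. This is a minimal sketch, assuming GNU `sort -V` is available on the runner; `version_ge` and `check_az_version` are hypothetical helper names.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of this repository.

# version_ge A B: succeeds when version A is greater than or equal to version B.
# Sorts both versions ascending and checks that B comes first (or ties).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_az_version [MIN]: compare the installed Azure CLI against MIN (default 2.28.0).
check_az_version() {
  min="${1:-2.28.0}"
  installed="$(az version --query '"azure-cli"' -o tsv)" || return 2
  if version_ge "$installed" "$min"; then
    echo "Azure CLI $installed meets the minimum $min"
  else
    echo "Azure CLI $installed is older than $min" >&2
    return 1
  fi
}
```

Calling `check_az_version` on the runner VM prints the installed version and returns non-zero when an upgrade is needed. A plain string comparison would get cases such as `2.9.0` versus `2.28.0` wrong, which is the reason for the version-aware sort.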

## Quick Start

1. Run `scripts/setup-runner-vm.sh` to provision the runner VM and managed identity in `rg-aks-poc-runner`.

2. SSH into the VM and configure the GitHub Actions self-hosted runner. Follow
   [Adding self-hosted runners](https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/adding-self-hosted-runners).

3. Add these GitHub Actions secrets to the repository:
   * `AZURE_CLIENT_ID`: The client ID of the managed identity `mi-aks-poc-deployer`
   * `AZURE_TENANT_ID`: Your Entra ID tenant ID
   * `AZURE_SUBSCRIPTION_ID`: Target Azure subscription ID

4. Trigger the **deploy-private-aks** workflow from the GitHub Actions UI (workflow_dispatch).

5. Alternatively, run `scripts/deploy-private-aks.sh` directly on any VM that has a managed identity with the required permissions.
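
Steps 3 and 4 can also be driven from a terminal with the GitHub CLI instead of the web UI. The sketch below is one possible approach, assuming `gh` is installed and authenticated against this repository. `dispatch_command` is a hypothetical helper that only prints the command, and its input validation is an assumption based on the workflow's input descriptions, not something the workflow enforces.

```bash
#!/usr/bin/env bash
# Sketch only: dispatch_command is a hypothetical helper, not part of this repository.
#
# Secrets from step 3 could be set with the GitHub CLI, for example:
#   gh secret set AZURE_CLIENT_ID --body "<client-id-of-mi-aks-poc-deployer>"

# dispatch_command [LOCATION] [WAIT_MINUTES]
# Validates the two workflow inputs and prints the gh command to run.
dispatch_command() {
  location="${1:-canadacentral}"
  wait_minutes="${2:-30}"

  # Assumption: only the two regions named in the input description are allowed.
  case "$location" in
    canadacentral|canadaeast) ;;
    *) echo "unsupported location: $location" >&2; return 1 ;;
  esac

  # wait_minutes feeds shell arithmetic in the workflow, so require digits only.
  case "$wait_minutes" in
    ''|*[!0-9]*) echo "wait_minutes must be a whole number: $wait_minutes" >&2; return 1 ;;
  esac

  printf 'gh workflow run deploy-private-aks.yml -f location=%s -f wait_minutes=%s\n' \
    "$location" "$wait_minutes"
}
```

Running `dispatch_command canadaeast 15` prints the command for review, and piping the output to `sh` would actually dispatch the run. Printing first keeps the helper free of side effects.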

## File Structure

```text
.
├── .github/
│   └── workflows/
│       ├── deploy-private-aks.yml     # Main deploy + log + teardown workflow
│       └── cleanup-safety-net.yml     # Hourly safety net for orphaned resources
├── scripts/
│   ├── setup-runner-vm.sh             # One-time: provision runner VM + MI
│   ├── teardown-runner-vm.sh          # One-time: delete runner VM
│   ├── deploy-private-aks.sh          # Standalone AKS deployment (reusable)
│   └── log-ips.sh                     # IP logging utility
└── README.md
```

## GitHub Actions Workflows

### deploy-private-aks.yml

A 13-step workflow triggered by `workflow_dispatch`. It authenticates via managed identity, creates a resource group (`rg-aks-poc-<run_id>`), deploys a private AKS cluster, and logs IP addresses from the Azure Activity Log (after a 60-second pause for log propagation) and from Entra ID sign-in logs. It then waits `wait_minutes` (30 by default) before tearing down all resources, which limits cost. The teardown step runs with `if: always()`, a dead man's switch that ensures cleanup runs even if intermediate steps fail.

### cleanup-safety-net.yml

An hourly cron-triggered workflow that lists resource groups tagged `purpose=aks-poc` and deletes any whose `created` tag is more than 45 minutes old. Groups with a missing or unparseable `created` tag are skipped with a warning. This acts as a safety net to delete orphaned resources left behind by failed or interrupted deployment runs.
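
The staleness rule is easy to exercise in isolation. The sketch below mirrors the age calculation from the cleanup workflow but takes the current time as an argument so it can be tested with fixed timestamps. `age_minutes` and `is_stale` are hypothetical helper names, and GNU `date -d` is assumed, as in the workflow itself.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of this repository.

# age_minutes CREATED NOW: whole minutes between two ISO 8601 UTC timestamps.
# Returns non-zero when either timestamp cannot be parsed.
age_minutes() {
  created_epoch=$(date -u -d "$1" +%s 2>/dev/null) || return 1
  now_epoch=$(date -u -d "$2" +%s 2>/dev/null) || return 1
  echo $(( (now_epoch - created_epoch) / 60 ))
}

# is_stale CREATED NOW [THRESHOLD]: succeeds when the age exceeds THRESHOLD
# minutes (default 45, matching the cleanup workflow).
is_stale() {
  age=$(age_minutes "$1" "$2") || return 2
  [ "$age" -gt "${3:-45}" ]
}
```

For example, `is_stale 2026-04-01T00:00:00Z 2026-04-01T01:00:00Z` succeeds because the group is 60 minutes old, while a group exactly 45 minutes old is kept because the comparison is strictly greater than the threshold.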

## IP Logging

The PoC captures IP addresses from multiple sources to confirm that managed identity authentication does not route through external endpoints:

* **Runner outbound IP**: Captured via `curl -s ifconfig.me`. This establishes the baseline public IP of the runner VM.
* **Azure Activity Log**: Queried via `az monitor activity-log list`. The `httpRequest.clientIpAddress` field shows which IP initiated each ARM operation. If these IPs match the runner IP, traffic is routing as expected.
* **Entra ID sign-in logs** (optional, requires P1/P2): Queried via Microsoft Graph API. Shows managed identity sign-in events and their source IPs under the "Managed identity sign-ins" category.

To verify correct behavior, compare the Activity Log IPs against the runner outbound IP. Matching IPs confirm that ARM calls originate from the runner VM rather than from unexpected Azure datacenter addresses.
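
The comparison described above can be automated rather than read off a table. This is a minimal sketch with a hypothetical helper, `compare_ips`, that takes the runner IP followed by the client IPs pulled from the Activity Log.

```bash
#!/usr/bin/env bash
# Sketch only: compare_ips is a hypothetical helper, not part of this repository.

# compare_ips RUNNER_IP IP...
# Prints one MATCH or MISMATCH line per IP and returns non-zero if any IP differs.
compare_ips() {
  runner_ip="$1"
  shift
  mismatches=0
  for ip in "$@"; do
    if [ "$ip" = "$runner_ip" ]; then
      echo "MATCH    $ip"
    else
      echo "MISMATCH $ip (runner is $runner_ip)"
      mismatches=$((mismatches + 1))
    fi
  done
  [ "$mismatches" -eq 0 ]
}
```

In the workflow, the IP list could come from the same query the logging step already runs, for example `az monitor activity-log list --resource-group "$RESOURCE_GROUP" --start-time "$DEPLOY_START_TIME" --query "[].httpRequest.clientIpAddress" -o tsv | sort -u`. Entries without a client IP would appear as mismatches and need manual review.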

## Cost Estimate

Each 30-minute PoC run costs approximately $0.05 to $0.08 with a single Standard_B2s node on the Free tier AKS control plane. The runner VM is the primary ongoing cost at approximately $0.042/hr when running.
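
To see why the runner VM dominates, the hourly figure above can be projected over longer periods. The sketch below is plain arithmetic on the quoted $0.042/hr rate. `runner_cost` is a hypothetical helper, and the 730-hour month is an assumption used only for illustration.

```bash
#!/usr/bin/env bash
# Sketch only: runner_cost is a hypothetical helper, not part of this repository.

# runner_cost RATE HOURS: cost of keeping the runner VM on, rounded to cents.
runner_cost() {
  awk -v rate="$1" -v hours="$2" 'BEGIN { printf "%.2f\n", rate * hours }'
}

runner_cost 0.042 24    # one day of uptime
runner_cost 0.042 730   # one month, assuming 730 hours
```

At the quoted rate this works out to about $1.01 per day and $30.66 per 730-hour month, so deallocating the runner VM between test sessions saves far more than trimming individual PoC runs.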

## Cleanup

1. Run `scripts/teardown-runner-vm.sh` to delete the runner VM, managed identity, and the `rg-aks-poc-runner` resource group.
2. Deregister the self-hosted runner from your GitHub repository under **Settings > Actions > Runners**.

> [!IMPORTANT]
> The cleanup-safety-net workflow handles PoC resource groups automatically, but the runner VM infrastructure requires manual teardown.

## Key References

* [Azure Private AKS Clusters](https://learn.microsoft.com/en-us/azure/aks/private-clusters)
* [Use Managed Identity with AKS](https://learn.microsoft.com/en-us/azure/aks/use-managed-identity)
* [Conditional Access for Workload Identities](https://learn.microsoft.com/en-us/entra/identity/conditional-access/workload-identity)
