This repo has been archived. Please review the new setup with Azure CLI extensions here.
A comprehensive Python tool for analyzing Azure Kubernetes Service (AKS) network configurations and diagnosing connectivity issues. Features a modular architecture with specialized analyzers for deep network troubleshooting.
- Comprehensive Analysis: 9 specialized analyzers for deep network diagnostics
- Active Testing: Optional connectivity probes from cluster nodes
- Multiple Output Formats: Console summary + detailed output + JSON export
- Security Focused: NSG compliance, inter-node traffic validation
- Modular Design: Modular architecture covered by 139 unit tests
- Detailed Reports: Actionable recommendations for every finding
- Prerequisites
- Installation & Usage
- Quick Start
- What It Analyzes
- Architecture
- Command Options
- Usage Examples
- Active Connectivity Tests
- Output Files
- Development
- Troubleshooting
- Python 3.7+ - Download
- Azure CLI 2.0+ - Installation guide
- Azure Authentication: Run `az login` before using the tool
- Permissions: Reader access to AKS cluster and related network resources
Download and run the pre-built .pyz file from Releases:
# Download the latest release
wget https://github.com/sturrent/aks-net-diagnostics/releases/latest/download/aks-net-diagnostics.pyz
# Run directly with Python
python aks-net-diagnostics.pyz -n myCluster -g myResourceGroup
# Or make it executable (Linux/macOS)
chmod +x aks-net-diagnostics.pyz
./aks-net-diagnostics.pyz -n myCluster -g myResourceGroup

Advantages:
- Single file (~55 KB)
- No installation required
- Just download and run
- All modules bundled inside
git clone https://github.com/sturrent/aks-net-diagnostics.git
cd aks-net-diagnostics
python aks-net-diagnostics.py -n myCluster -g myResourceGroup

To create the single-file distribution:
python tools/build_zipapp.py
# Creates: aks-net-diagnostics.pyz

# Using the .pyz file (recommended)
python aks-net-diagnostics.pyz -n my-cluster -g my-resource-group
# OR using the source code
python aks-net-diagnostics.py -n my-cluster -g my-resource-group
# With detailed output
python aks-net-diagnostics.pyz -n my-cluster -g my-resource-group --details
# Save JSON report with auto-generated filename
python aks-net-diagnostics.pyz -n my-cluster -g my-resource-group --json-report
# Save JSON report with custom filename
python aks-net-diagnostics.pyz -n my-cluster -g my-resource-group --json-report my-report.json
# Include connectivity testing from cluster nodes
python aks-net-diagnostics.pyz -n my-cluster -g my-resource-group --probe-test

- VNet Configuration: Topology, address spaces, peerings
- Outbound Connectivity: LoadBalancer, NAT Gateway, User Defined Routes
- DNS Configuration: Azure DNS, Custom DNS, Private DNS zones
- VMSS Network Profiles: Node subnet assignments, NIC configurations
- NSG Rules: Required AKS traffic, blocking rules, inter-node communication
- API Server Access: Authorized IP ranges, private endpoints
- Route Tables: UDR impact on AKS management traffic
- DNS Resolution: MCR, API server hostname lookup from nodes
- HTTPS Connectivity: Container registry, API server reachability
- Network Path: Validates full network path from nodes to Azure services
The tool uses a modular architecture with specialized analyzers:
- Data Collection: Gathers cluster info, VNets, VMSS configurations
- Network Analysis: NSG rules, DNS, routing, outbound connectivity
- Security Validation: API server access, authorized IPs
- Active Testing: Optional connectivity probes from nodes
- Reporting: Console output, JSON export, finding correlation
Key Modules: NSGAnalyzer, DNSAnalyzer, RouteTableAnalyzer, APIServerAccessAnalyzer, ConnectivityTester, OutboundConnectivityAnalyzer
For detailed architecture documentation, see docs/ARCHITECTURE.md.
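
To make the analyzer-plus-findings idea concrete, here is a minimal sketch of how such a pipeline could be shaped. It is not the tool's actual code: the `BaseAnalyzer` interface, the `DefaultRouteAnalyzer` class, the `DEFAULT_ROUTE_TO_NVA` code, and the layout of `cluster_data` are all assumptions made for illustration. Only the four finding fields (severity, code, message, recommendation) are taken from the JSON report format described in this README.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Finding:
    """One diagnostic result; field names follow the JSON report's findings entries."""
    severity: str
    code: str
    message: str
    recommendation: str


class BaseAnalyzer(ABC):
    """Hypothetical interface: inspect collected data and return findings."""

    @abstractmethod
    def analyze(self, cluster_data: Dict) -> List[Finding]:
        ...


class DefaultRouteAnalyzer(BaseAnalyzer):
    """Illustrative analyzer: flags a default route that points at a virtual appliance."""

    def analyze(self, cluster_data: Dict) -> List[Finding]:
        findings = []
        # "routes", "prefix" and "next_hop" are assumed keys, not the tool's real schema.
        for route in cluster_data.get("routes", []):
            if route.get("prefix") == "0.0.0.0/0" and route.get("next_hop") == "VirtualAppliance":
                findings.append(Finding(
                    severity="critical",
                    code="DEFAULT_ROUTE_TO_NVA",  # hypothetical finding code
                    message="Default route sends traffic to a firewall/NVA",
                    recommendation="Verify the appliance allows AKS management traffic",
                ))
        return findings


def run_analyzers(analyzers: List[BaseAnalyzer], cluster_data: Dict) -> List[Finding]:
    """Run every analyzer over the same collected data and merge the findings."""
    findings: List[Finding] = []
    for analyzer in analyzers:
        findings.extend(analyzer.analyze(cluster_data))
    return findings
```

Keeping each analyzer behind the same small interface is one way to let new checks be added and unit tested in isolation, which matches the modular design described above.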
| Issue | Severity | Description |
|---|---|---|
| Outbound IPs not in authorized ranges | Critical | Cluster can't reach API server |
| Default route to firewall/NVA | Critical | Breaks AKS management traffic |
| NSG blocking required traffic | Critical | Prevents node communication |
| NSG blocking inter-node traffic | Warning | Breaks system components (konnectivity, metrics-server) |
| DNS resolution failures | Critical | Nodes can't resolve Azure services |
| HTTPS connectivity blocked | Critical | SSL interception or firewall blocking |
| Private DNS zone VNet link missing | Critical | Private cluster name resolution fails |
| Custom DNS not forwarding to Azure DNS | Critical | Private endpoints unreachable |
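
As an illustration of the first row in the table, the check "outbound IPs not in authorized ranges" can be sketched with the standard `ipaddress` module. This is an assumption about how such a check might look, not the tool's implementation; the function name and the rule that an empty range list means "unrestricted" are hypothetical.

```python
import ipaddress
from typing import Iterable, List


def outbound_ips_missing_from_ranges(outbound_ips: Iterable[str],
                                     authorized_ranges: Iterable[str]) -> List[str]:
    """Return the outbound IPs that no authorized CIDR range covers."""
    # strict=False accepts entries such as "20.10.5.100/24" that have host bits set.
    networks = [ipaddress.ip_network(cidr, strict=False) for cidr in authorized_ranges]
    if not networks:
        # Assumption: no authorized ranges configured means access is unrestricted.
        return []
    missing = []
    for ip in outbound_ips:
        address = ipaddress.ip_address(ip)
        if not any(address in network for network in networks):
            missing.append(ip)
    return missing
```

For example, calling it with the outbound IP `20.10.5.100` and the range `203.0.113.0/24` returns that IP as missing, which corresponds to the critical case where the cluster cannot reach its own API server.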
| Option | Description | Example |
|---|---|---|
| `-n <NAME>` | AKS cluster name (required) | `-n my-cluster` |
| `-g <GROUP>` | Resource group name (required) | `-g my-rg` |
| `--details` | Show detailed analysis and test results | `--details` |
| `--probe-test` | Enable active connectivity tests from nodes | `--probe-test` |
| `--json-report [FILE]` | Save JSON report (optional filename) | `--json-report report.json` |
| `--subscription <ID>` | Override Azure subscription | `--subscription abc-123` |
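
The options above map naturally onto `argparse`. The sketch below is an assumption about how such a command line could be declared, not a copy of the tool's parser; in particular the `"auto"` sentinel for an auto-generated filename and the `dest` names are hypothetical. It mainly shows how a flag like `--json-report` can accept an optional filename.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the options listed in the table above."""
    parser = argparse.ArgumentParser(prog="aks-net-diagnostics")
    parser.add_argument("-n", dest="name", required=True, metavar="NAME",
                        help="AKS cluster name")
    parser.add_argument("-g", dest="group", required=True, metavar="GROUP",
                        help="Resource group name")
    parser.add_argument("--details", action="store_true",
                        help="Show detailed analysis and test results")
    parser.add_argument("--probe-test", action="store_true",
                        help="Enable active connectivity tests from nodes")
    # nargs="?" makes the filename optional: the flag alone yields const,
    # and omitting the flag entirely yields default (None).
    parser.add_argument("--json-report", nargs="?", const="auto", default=None,
                        metavar="FILE", help="Save JSON report")
    parser.add_argument("--subscription", metavar="ID",
                        help="Override Azure subscription")
    return parser
```

With this declaration, `--json-report` on its own parses to the sentinel value, `--json-report report.json` parses to the given filename, and leaving the flag out parses to `None`.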
Quick health check of cluster network configuration:
python aks-net-diagnostics.py -n production-cluster -g prod-rg

Get comprehensive details about all network components:
python aks-net-diagnostics.py -n production-cluster -g prod-rg --details

Test actual connectivity from cluster nodes (DNS + HTTPS):
python aks-net-diagnostics.py -n production-cluster -g prod-rg --probe-test

Export full analysis data for documentation or automation:
# Auto-generated filename
python aks-net-diagnostics.py -n production-cluster -g prod-rg --json-report
# Custom filename
python aks-net-diagnostics.py -n production-cluster -g prod-rg --json-report audit-2025-10-03.json

Comprehensive analysis with connectivity tests:
python aks-net-diagnostics.py -n failed-cluster -g troubleshooting-rg --details --probe-test

Analyze a cluster in a different subscription:
python aks-net-diagnostics.py -n cluster -g rg --subscription xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

python aks-net-diagnostics.py -g my-resource-group -n my-private-cluster --details

Output:
# AKS Network Assessment Report
**Cluster:** my-private-cluster
**Resource Group:** my-resource-group
**Generated:** 2025-10-03 14:17:24 UTC
## Cluster Overview
| Property | Value |
|----------|-------|
| Provisioning State | Failed |
| Power State | Running |
| Network Plugin | azure |
| Outbound Type | loadBalancer |
| Private Cluster | true |
## Network Configuration
### API Server Access
- **Type:** Private cluster
- **Private FQDN:** my-cluster-xxx.privatelink.eastus.azmk8s.io
- **Private DNS Zone:** system
- **Access Restrictions:** None (unrestricted public access)
### Outbound Connectivity
- **Type:** loadBalancer
- **Effective Public IPs:** 20.10.5.100
### Network Security Group (NSG) Analysis
- **NSGs Analyzed:** 2
- **Issues Found:** 0
- **Inter-node Communication:** [OK] Ok
**Subnet NSGs:**
- **aks-subnet** -> NSG: my-subnet-nsg-eastus
- Custom Rules: 0, Default Rules: 6
**NIC NSGs:**
- **my-nodepool-nsg** (used by: aks-nodepool1-vmss)
- Custom Rules: 0, Default Rules: 6
## Findings
**Findings Summary:**
- [CRITICAL] 4
### [CRITICAL] PRIVATE_DNS_MISCONFIGURED
**Message:** Private cluster is using custom DNS servers (10.1.0.10) that cannot resolve Azure private DNS zones
**Recommendation:** Configure DNS forwarding to 168.63.129.16 for '*.privatelink.*.azmk8s.io'
### [CRITICAL] CLUSTER_OPERATION_FAILURE
**Message:** Cluster failed with error: VMExtensionProvisioningError - agents are unable to resolve Kubernetes API server name
**Recommendation:** Check Azure Activity Log and custom DNS configuration
### [CRITICAL] NODE_POOL_FAILURE
**Message:** Node pools in failed state: nodepool1
**Recommendation:** Check node pool configuration and Azure Activity Log
### [CRITICAL] PDNS_DNS_HOST_VNET_LINK_MISSING
**Message:** DNS server 10.1.0.10 is hosted in VNet hub-vnet but this VNet is not linked to private DNS zone
**Recommendation:** Link VNet hub-vnet to private DNS zone for proper DNS resolution
This example shows the tool detecting a common private cluster misconfiguration where custom DNS servers aren't properly configured to resolve Azure private DNS zones.
When using --probe-test, the tool executes connectivity tests directly from cluster nodes using VMSS run-command.
| Test | Description | Purpose |
|---|---|---|
| MCR DNS Resolution | Resolves mcr.microsoft.com | Validates DNS for container registry |
| Internet Connectivity | HTTPS to MCR | Tests outbound internet access |
| API Server DNS | Resolves cluster API hostname | Validates private DNS configuration |
| API Server HTTPS | HTTPS to API server | Tests API server reachability |
- Tests use dependency checking: HTTPS tests skip if DNS fails
- Timeouts configured: 60s for MCR, 15s for API server
- Full error visibility: Detailed curl output shows exact failure points
- VMSS timeout: 5 minutes to account for queuing and execution
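
The dependency rule in the notes above (skip the HTTPS test when DNS fails) can be sketched as follows. This is a simplified local illustration under stated assumptions: the real tool runs its probes on cluster nodes through VMSS run-command and reports curl output, whereas this sketch uses Python sockets on the machine where it runs. The function names and the result dictionary layout are hypothetical.

```python
import socket
import ssl
from typing import Callable, Dict, List


def resolve_host(hostname: str) -> List[str]:
    """Resolve a hostname; raises socket.gaierror (an OSError) on failure."""
    infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def https_handshake(hostname: str, timeout: float) -> None:
    """Open a TLS connection to port 443; raises OSError on timeout or TLS failure."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, 443), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname):
            pass


def probe_endpoint(hostname: str, timeout: float,
                   resolve: Callable[[str], List[str]] = resolve_host,
                   https_check: Callable[[str, float], None] = https_handshake) -> Dict:
    """Run the DNS test first and only attempt HTTPS if DNS succeeded."""
    results: Dict = {}
    try:
        results["dns"] = {"status": "passed", "addresses": resolve(hostname)}
    except OSError as exc:
        results["dns"] = {"status": "failed", "error": str(exc)}
        # Dependency check: an HTTPS failure here would only repeat the DNS failure.
        results["https"] = {"status": "skipped", "reason": "DNS resolution failed"}
        return results
    try:
        https_check(hostname, timeout)
        results["https"] = {"status": "passed"}
    except OSError as exc:
        results["https"] = {"status": "failed", "error": str(exc)}
    return results
```

Using the timeouts listed above, a call such as `probe_endpoint("mcr.microsoft.com", timeout=60)` would model the MCR tests, and a 15 second timeout would model the API server tests. Passing `resolve` and `https_check` as parameters keeps the ordering logic testable without network access.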
### Connectivity Tests
**Test Results:**
- [PASS] MCR DNS Resolution - PASSED
- Resolved to: 150.171.70.10, 150.171.69.10
- [PASS] Internet Connectivity - PASSED
- Successfully connected to mcr.microsoft.com
- [PASS] API Server DNS Resolution - PASSED
- Resolved to: 10.0.0.10
- [FAIL] API Server HTTPS Connectivity - FAILED
- Error: Connection timeout after 15s
- Possible causes: Firewall blocking, NSG rules, routing issues
- Summary Mode (default): High-level findings and recommendations
- Detailed Mode (`--details`): Detailed analysis of all components
- Exit Codes:
  - 0: Analysis completed successfully
  - 1: Unexpected error
  - 2: Configuration/validation error
  - 3: File error
  - 4: Permission error
  - 130: Cancelled by user (Ctrl+C)
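
For automation, the exit codes listed above can be turned into readable messages by a small wrapper. The sketch below is only a suggestion: the wrapper functions are hypothetical, and the default script path assumes the `.pyz` file is in the current directory.

```python
import subprocess
import sys
from typing import Tuple

# Exit codes as documented in this README.
EXIT_CODES = {
    0: "Analysis completed successfully",
    1: "Unexpected error",
    2: "Configuration/validation error",
    3: "File error",
    4: "Permission error",
    130: "Cancelled by user (Ctrl+C)",
}


def describe_exit_code(code: int) -> str:
    """Translate a documented exit code into its description."""
    return EXIT_CODES.get(code, f"Unknown exit code {code}")


def run_diagnostics(cluster: str, group: str,
                    script: str = "aks-net-diagnostics.pyz") -> Tuple[int, str]:
    """Run the tool with the current interpreter and describe how it exited."""
    completed = subprocess.run([sys.executable, script, "-n", cluster, "-g", group])
    return completed.returncode, describe_exit_code(completed.returncode)
```

A CI job could call `run_diagnostics("my-cluster", "my-rg")` and fail the pipeline on any non-zero code while logging the description.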
Generated with --json-report, contains:
{
"metadata": {
"cluster_name": "my-cluster",
"resource_group": "my-rg",
"subscription": "xxx",
"generated": "2025-10-03T14:30:00Z",
"script_version": "1.2.0"
},
"cluster_info": { "..." },
"findings": [
{
"severity": "critical",
"code": "CLUSTER_OPERATION_FAILURE",
"message": "...",
"recommendation": "..."
}
],
"network_analysis": {
"vnets": [],
"outbound": {},
"nsgs": {},
"dns": {},
"api_server": {}
},
"connectivity_tests": []
}

- docs/DEVELOPMENT.md - Complete development setup, quality tools, testing
- CONTRIBUTING.md - Contribution guidelines and workflow
- docs/ARCHITECTURE.md - Technical architecture and design
- docs/PRE_PUSH_HOOK.md - Pre-push hook details
# Clone and setup
git clone https://github.com/sturrent/aks-net-diagnostics.git
cd aks-net-diagnostics
python -m venv venv
source venv/bin/activate # On Windows: .\venv\Scripts\Activate.ps1
pip install -r dev-requirements.txt
# Run quality checks
./tools/check_quality.sh # On Linux/Mac
.\tools\check_quality.ps1 # On Windows
# Run tests
pytest -v

This project maintains high code quality:
- Pylint: 9.96/10 score
- Flake8: Zero violations (excluding line length)
- Tests: 139 passing unit tests
- Coverage: 80%+ on new code
Azure CLI not found
# Verify Azure CLI installation
az --version
# Install if missing
# Windows: https://aka.ms/installazurecliwindows
# Linux: curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
# macOS: brew install azure-cli

Not logged in to Azure
# Login to Azure
az login
# Verify authentication
az account show
# Set specific subscription
az account set --subscription "My Subscription"

Python not found
# Check Python version
python --version # or python3 --version
# Requires Python 3.7+
# Download from: https://www.python.org/downloads/

Permission errors on Linux/macOS
# Make script executable
chmod +x aks-net-diagnostics.py
# Run with python explicitly
python3 aks-net-diagnostics.py -n my-cluster -g my-rg

VMSS run-command timeout
If connectivity tests time out:
- Cluster nodes may be under heavy load
- Network path may be experiencing latency
- Use `--details` to see detailed error messages
- Re-run without `--probe-test` for static analysis only
Module import errors
# Ensure you're in the project directory
cd aks-net-diagnostics
# Install any missing dependencies
pip install -r requirements.txt # if requirements.txt exists

- Issues: GitHub Issues
- Detailed Mode: Always use `--details` when reporting issues
- JSON Export: Attach JSON report (`--json-report`) for detailed diagnostics
MIT License - See LICENSE file for details
Built for Azure Kubernetes Service troubleshooting by the Azure community.
Note: This code has been developed with the assistance of AI tools.
Version: 1.2.0
Last Updated: October 2025
Maintained by: @sturrent