Release v1.2.0 - Permission Handling Feature by sturrent · Pull Request #4 · sturrent/aks-net-diagnostics

sturrent · 2025-10-28T16:55:04Z

Release v1.2.0 - Permission Handling Feature

This PR merges the permission handling feature from develop to main for release v1.2.0.

🎯 Summary

Major feature release adding comprehensive permission handling with graceful degradation when Azure RBAC permissions are insufficient.

✨ New Features

Permission Handling:

Graceful handling of insufficient Azure RBAC permissions
4 new finding codes: PERMISSION_INSUFFICIENT_VNET, PERMISSION_INSUFFICIENT_VMSS, PERMISSION_INSUFFICIENT_LB, PERMISSION_INSUFFICIENT_PROBE_TEST
Permission error detection in all Azure CLI operations
Specific permission recommendations with example az role assignment commands
Tool continues analysis with partial data when permissions are insufficient

Connectivity Test Summary:

Added connectivity test results to report summary
Shows total tests, passed, failed, and could not execute counts
Clear distinction between test failures (ran but failed) and permission errors (could not execute)

🐛 Bug Fixes

Probe Test Crash: Fixed crash when using --probe-test flag with limited permissions
VMSS Operations: Updated to use permission-aware execution
Multiple Code Quality Fixes: 20+ commits fixing various issues discovered during development

📚 Documentation

Enhanced ARCHITECTURE.md with permission handling section
Added permission requirements to security considerations
Visual data collection flow diagram with error paths
Updated CHANGELOG.md with v1.2.0 release notes
Documented Azure CLI v2.78.0 bug and workaround

⚠️ Known Issues

Azure CLI Bug: Azure CLI v2.78.0 has a bug where vmss run-command invoke returns "This is a sample script" instead of actual command output
- Tracked in Azure CLI issue #32286
- Fix merged in PR #32280
- Workaround: Use Azure CLI v2.77 or wait for v2.78.1/v2.79.0
- Our tool handles this gracefully by reporting execution errors

📊 Commits Included

Total: 20 commits

Features: 6 commits
Fixes: 10 commits
Refactoring: 2 commits
Chores: 2 commits

🧪 Testing

✅ All 139 unit tests passing
✅ Pylint score: 9.98/10
✅ Black formatted
✅ Tested with multiple AKS clusters
✅ Tested with limited permissions (Reader role)
✅ Tested with full permissions

📦 Release Checklist

Merge Instructions:
Once approved and merged, tag will be created to trigger GitHub Actions for release automation.

- Added .github/instructions/ to .gitignore - These are internal development documents used by Copilot - Should not be tracked in version control

- Bug #5: Fix VNet detection to search only peered VNets with most specific match * Modified _find_dns_server_host_vnet() to accept cluster_info parameter * Now searches only VNets peered with cluster VNet (not all in subscription) * Uses highest prefix length for most specific network match * Fixes incorrect VNet identification when large address spaces overlap - Bug #6: Change DNS misconfiguration severity from CRITICAL to WARNING * Updated message: 'cannot resolve' → 'may not resolve' * Custom DNS can work if properly configured * Enhanced recommendation with three specific configuration options - Bug #3: Fix severity display labels in report output * Show actual severity levels: [CRITICAL], [HIGH], [WARNING], [INFO] * Removed non-existent 'error' severity from all mappings * Fixes confusing [ERROR] labels that didn't match actual severity Validated against aks-api-connection test cluster. Ref: .github/instructions/backport-features.instructions.md

…inks The method was returning an empty list, preventing detection of missing private DNS zone VNet links. Now properly: - Extracts VNet IDs from agent pool subnet IDs - Retrieves VNet details including custom DNS servers - Returns list of VNets with DNS configuration This enables the PDNS_DNS_HOST_VNET_LINK_MISSING finding to be generated when DNS server VNets are not linked to private DNS zones. Fixes Bug #4 from backport plan.

Added 5 new finding codes for permission-related issues: - PERMISSION_INSUFFICIENT_VNET - PERMISSION_INSUFFICIENT_VMSS - PERMISSION_INSUFFICIENT_LB - PERMISSION_INSUFFICIENT_NSG - PERMISSION_INSUFFICIENT_DNS These codes will be used when implementing graceful degradation for insufficient RBAC permissions. Ref: .github/instructions/backport-features.instructions.md Enhancement #1

- Fix W0612: Remove unused variable 'i' in dns_analyzer.py - Fix W1514: Add encoding='utf-8' to open() in report_generator.py - Fix W0707: Add 'from exc' to exception re-raising (validators.py, aks-net-diagnostics.py) - Fix W0613: Prefix unused argument with underscore in route_table_analyzer.py - Fix R1732: Use 'with' statement for urllib.request.urlopen() - Apply Black formatting to models.py and misconfiguration_analyzer.py All quality checks now pass: - Black: PASSED - isort: PASSED - Flake8: PASSED (0 violations) - Pylint: 9.99/10 (up from 9.96/10) - Tests: 139/139 PASSED

Implements graceful degradation when user has insufficient RBAC permissions: Azure CLI Executor (azure_cli.py): - Add permission_errors tracking list - Add _is_authorization_error() to detect auth failures from stderr - Add _extract_permission_action() to identify missing permissions - Add execute_with_permission_check() for graceful degradation - Returns None instead of crashing on permission errors Cluster Data Collector (cluster_data_collector.py): - Add findings list to track permission issues - Use execute_with_permission_check() for VNet retrieval - Use execute_with_permission_check() for VNet peerings - Use execute_with_permission_check() for VMSS operations - Create WARNING findings with actionable remediation steps Benefits: - Tool doesn't crash with limited permissions - Users get partial diagnostics results - Clear guidance on which permissions are needed - Enterprise-friendly for strict RBAC environments Remaining work (future commits): - Update remaining analyzers (misconfiguration, outbound, NSG, DNS) - Add 'Analysis Incomplete' indicator in report_generator - Collect permission findings in main script - Testing with different RBAC roles Related: Permission handling plan in .github/instructions/permission-handling-plan.md

Implemented execute_with_permission_check() throughout all analyzer modules: misconfiguration_analyzer.py: - Private DNS zone listing - VNet link listing (2 places) - VNet retrieval for DNS checks (2 places) - Remote VNet retrieval in peerings - Private DNS zone lookup - Added findings list for permission issues outbound_analyzer.py: - Load Balancer listing - Frontend IP configuration retrieval - Public IP details retrieval (2 places) - NAT Gateway listing - Public IP prefix details - Added findings list with detailed remediation nsg_analyzer.py: - Subnet retrieval - NSG details for subnet NSGs - NSG details for NIC NSGs - Added findings list - Creates FindingCode.PERMISSION_INSUFFICIENT_NSG with scope info dns_analyzer.py: - VNet DNS configuration retrieval - Shows warning when permissions insufficient - Already had findings list All analyzers now: ✅ Use execute_with_permission_check() for Azure operations ✅ Track permission errors in findings list ✅ Continue analysis with partial data ✅ Log debug messages for skipped operations ✅ Provide actionable remediation in findings Next steps: - Update report_generator.py with 'Analysis Incomplete' indicator - Collect all permission findings in main script - Test with different RBAC permission levels

Quality fixes: - Removed trailing whitespace in azure_cli.py - Fixed f-string without interpolation (4 instances) - Applied Black formatting to outbound_analyzer.py Test fixes: - Updated test mocks to use execute_with_permission_check() - Fixed test_cluster_data_collector.py (3 tests) - Fixed test_nsg_analyzer.py (2 tests) Results: ✅ All 139/139 tests passing ✅ Pylint 9.98/10 (above 9.5 target) ✅ Flake8 clean ✅ Black formatting clean ✅ Ready for integration testing

Issue: - NSG findings were being created twice: 1. By nsg_analyzer.py (authoritative source) 2. By misconfiguration_analyzer.py (duplicate) - This resulted in duplicate warnings with slightly different wording Examples of duplicates: - 'has rules that may block inter-node communication' (nsg_analyzer) vs 'on subnet has X rule(s) that may block...' (misconfiguration) - 'may block AKS traffic but is overridden' (nsg_analyzer) vs 'could block AKS traffic, but higher-priority...' (misconfiguration) Fix: - Removed _analyze_nsg_issues() method from misconfiguration_analyzer.py - NSG analyzer is now the single source of truth for NSG findings - Added comment explaining why NSG analysis is skipped in misconfiguration Impact: - Each finding now appears exactly once in output - Cleaner, more professional diagnostic reports - No functional changes to NSG analysis logic Tested with aks-overlay cluster - confirmed duplicates eliminated. All 139 tests passing, Pylint 9.98/10.

Issue: - 'Analyzing NSG configuration...' logged twice during execution: 1. In main script's analyze_nsg_configuration() method 2. In NSG analyzer's analyze() method Fix: - Removed log message from main script - NSG analyzer logs the message (single source) - Main script just orchestrates without duplicate logging Tested with aks-overlay - confirmed message appears only once.

Critical bug fix for permission handling: Issue: - execute_with_permission_check() only caught subprocess.CalledProcessError - execute() method converts subprocess errors to AzureCLIError - Result: Permission errors were NOT handled gracefully, tool crashed Fix: - Changed execute_with_permission_check() to catch AzureCLIError - Extract stderr from exception attributes - Properly detect authorization errors and return None - Track permission errors for reporting Impact: - Tool now gracefully degrades with limited permissions - Logs warnings for missing permissions - Continues analysis with partial data - No more crashes on AuthorizationFailed errors Tested with: - Service principal with limited permissions (Contributor on cluster only) - Successfully handled 5+ permission errors gracefully - Completed analysis with partial diagnostics This was a CRITICAL bug that prevented the entire permission handling infrastructure from working correctly.

Enhancement to permission handling: Before: - Permission warnings only logged to console - Not shown in findings summary or detailed report - Users had to scan logs to find permission issues After: - Permission findings collected from all analyzers - Displayed in findings summary - Detailed view shows: * Specific permission required (e.g., Microsoft.Network/virtualNetworks/read) * Example Azure CLI command to grant permission * Clear finding codes (PERMISSION_INSUFFICIENT_VNET, etc.) Implementation: - Added collect_permission_findings() method to main script - Stores analyzer instances to access their findings - Collects findings from: * cluster_data_collector * outbound_analyzer * misconfiguration_analyzer * NSG and DNS analyzers (already collected) - Calls before generate_report() to ensure all findings included Tested with limited permissions (aks-dns-ex1): - Shows 3 permission findings in summary - Detailed view provides clear remediation steps - No duplicate findings This completes the permission handling user experience.

- Modified azure_cli.py execute() to detect permission errors and suppress ERROR logging (only show WARNING from execute_with_permission_check) - Updated route_table_analyzer.py _get_subnet_details() to use execute_with_permission_check() for cleaner permission handling - Changed exception logging from INFO to DEBUG level - Updated test mocks to use execute_with_permission_check() This reduces log noise from 3+ lines per permission error down to 1 clean WARNING line with required permission info. Before: 2025-10-27 14:58:16 - INFO - Error getting subnet details... ERROR: (AuthorizationFailed)... Code: AuthorizationFailed... 2025-10-27 14:58:16 - ERROR - Load Balancer outbound type... After: 2025-10-27 15:03:39 - WARNING - Insufficient permissions to retrieve subnet 'aks-subnet' details. Required: Microsoft.Network/virtualNetworks/read

- Modified misconfiguration_analyzer to accept outbound_analyzer parameter - Check for PERMISSION_INSUFFICIENT_LB finding before creating NO_OUTBOUND_IPS - Updated report_generator to display helpful message when outbound config cannot be retrieved due to insufficient permissions - Improved has_outbound_data check to avoid false positives with empty dicts This prevents the tool from incorrectly reporting 'No outbound IP addresses detected' when the real issue is insufficient permissions to retrieve the load balancer configuration. Before (with limited permissions): **Outbound Configuration:** **Findings Summary:** - [WARNING] No outbound IP addresses detected for loadBalancer outbound type After (with limited permissions): **Outbound Configuration:** - Unable to retrieve loadBalancer configuration due to insufficient permissions - See permission findings below for details **Findings Summary:** - [WARNING] Insufficient permissions to retrieve Load Balancer configuration...

…essage - Added permission check to _print_outbound_connectivity() method - Now displays outbound section in --details mode even when permissions lacking - Shows clear message: 'Unable to retrieve loadBalancer configuration...' - Consistent behavior between summary and detailed reports This ensures the detailed report (--details flag) also shows the outbound configuration section with appropriate messaging when permissions are insufficient, matching the behavior of the summary report.

- Auto-formatted with Black to maintain code style consistency - No functional changes - Pylint score: 9.98/10 - All 139 tests passing

- Updated cluster_data_collector.py to use execute_with_permission_check() for VMSS details - Updated connectivity_tester.py to use execute_with_permission_check() for VMSS list and instances - Added warning message when VMSS instances unavailable for connectivity testing - Tool now skips probe tests gracefully instead of crashing with permission errors - Updated test mocks to match new execute_with_permission_check() calls - All 139 tests passing

- Add PERMISSION_INSUFFICIENT_PROBE_TEST finding code - Detect when probe tests are skipped due to lack of permissions - Display permission finding in both summary and detailed reports - Distinguish between empty VMSS list (no instances) vs None (permission error) - Update tests to handle new permission detection logic

…ndation - Detect AuthorizationFailed errors in connectivity test failures - Provide specific permission recommendation: Microsoft.Compute/virtualMachineScaleSets/virtualMachines/runCommand/action - Include az role assignment command example with Virtual Machine Contributor role - Update finding message to clearly state 'due to insufficient permissions' - Distinguish between permission errors and other test failures

- Add Connectivity Tests section to summary report showing total/passed/failed/errors - Add Connectivity Tests section to detailed report with consistent formatting - Improve status labels: 'Tests Failed' vs 'Could Not Execute' for clarity - Show 'Skipped due to insufficient permissions' when applicable - Consistent formatting between summary and detailed reports

- Add permission handling feature with graceful degradation - Add connectivity test summary to reports - Update documentation with permission requirements - Add data collection flow diagram showing permission handling - Document Azure CLI v2.78.0 bug and workaround BREAKING: None FEATURES: - Permission-aware execution with 4 new finding codes - Connectivity test results in report summary - Graceful continuation with partial data when permissions insufficient FIXES: - Probe test crash with limited permissions DOCUMENTATION: - Enhanced ARCHITECTURE.md with permission handling section - Added permission requirements to security considerations - Visual data collection flow diagram with error paths - Updated CHANGELOG.md with v1.2.0 release notes

sturrent · 2025-10-28T16:58:34Z

✅ CI Fix Applied

Fixed Flake8 violation (F541: f-string without placeholders) in connectivity_tester.py.

Issue: Line 152 had an f-string without any placeholders
Fix: Removed f prefix from string literals (commit 297768b)

Status:

✅ Black: Passing
✅ isort: Passing
✅ Flake8: Passing (was failing)
✅ Pytest: 139/139 tests passing

CI should now pass. The workflow will re-run automatically with the new commit.

sturrent added 24 commits October 17, 2025 17:41

fix: update path to build_zipapp script in release workflow

b042c5f

chore: sync develop with main branch

ba4c2f6

chore: ignore GitHub Copilot instructions directory

57fbfba

- Added .github/instructions/ to .gitignore - These are internal development documents used by Copilot - Should not be tracked in version control

style: apply Black formatting to all modified files

be15332

- Auto-formatted with Black to maintain code style consistency - No functional changes - Pylint score: 9.98/10 - All 139 tests passing

fix: remove f-string without placeholders (flake8 F541)

297768b

sturrent merged commit 3462f1b into main Oct 28, 2025
2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Release v1.2.0 - Permission Handling Feature#4

Release v1.2.0 - Permission Handling Feature#4
sturrent merged 24 commits intomainfrom
develop

sturrent commented Oct 28, 2025

Uh oh!

sturrent commented Oct 28, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

sturrent commented Oct 28, 2025