Skip to content
This repository was archived by the owner on Nov 14, 2025. It is now read-only.

Release v1.2.0 - Permission Handling Feature#4

Merged
sturrent merged 24 commits intomainfrom
develop
Oct 28, 2025
Merged

Release v1.2.0 - Permission Handling Feature#4
sturrent merged 24 commits intomainfrom
develop

Conversation

@sturrent
Copy link
Copy Markdown
Owner

Release v1.2.0 - Permission Handling Feature

This PR merges the permission handling feature from develop to main for release v1.2.0.

🎯 Summary

Major feature release adding comprehensive permission handling with graceful degradation when Azure RBAC permissions are insufficient.

✨ New Features

Permission Handling:

  • Graceful handling of insufficient Azure RBAC permissions
  • 4 new finding codes: PERMISSION_INSUFFICIENT_VNET, PERMISSION_INSUFFICIENT_VMSS, PERMISSION_INSUFFICIENT_LB, PERMISSION_INSUFFICIENT_PROBE_TEST
  • Permission error detection in all Azure CLI operations
  • Specific permission recommendations with example az role assignment commands
  • Tool continues analysis with partial data when permissions are insufficient

Connectivity Test Summary:

  • Added connectivity test results to report summary
  • Shows total tests, passed, failed, and could not execute counts
  • Clear distinction between test failures (ran but failed) and permission errors (could not execute)

🐛 Bug Fixes

  • Probe Test Crash: Fixed crash when using --probe-test flag with limited permissions
  • VMSS Operations: Updated to use permission-aware execution
  • Multiple Code Quality Fixes: 20+ commits fixing various issues discovered during development

📚 Documentation

  • Enhanced ARCHITECTURE.md with permission handling section
  • Added permission requirements to security considerations
  • Visual data collection flow diagram with error paths
  • Updated CHANGELOG.md with v1.2.0 release notes
  • Documented Azure CLI v2.78.0 bug and workaround

⚠️ Known Issues

  • Azure CLI Bug: Azure CLI v2.78.0 has a bug where vmss run-command invoke returns "This is a sample script" instead of actual command output
    • Tracked in Azure CLI issue #32286
    • Fix merged in PR #32280
    • Workaround: Use Azure CLI v2.77 or wait for v2.78.1/v2.79.0
    • Our tool handles this gracefully by reporting execution errors

📊 Commits Included

Total: 20 commits

  • Features: 6 commits
  • Fixes: 10 commits
  • Refactoring: 2 commits
  • Chores: 2 commits

🧪 Testing

  • ✅ All 139 unit tests passing
  • ✅ Pylint score: 9.98/10
  • ✅ Black formatted
  • ✅ Tested with multiple AKS clusters
  • ✅ Tested with limited permissions (Reader role)
  • ✅ Tested with full permissions

📦 Release Checklist

  • Version bumped to 1.2.0 in all files
  • CHANGELOG.md updated
  • README.md updated
  • ARCHITECTURE.md updated
  • All tests passing
  • Code quality checks passing
  • PR merged to main
  • Tag v1.2.0 created
  • GitHub release created (will be automated by tag)

Merge Instructions:
Once approved and merged, tag will be created to trigger GitHub Actions for release automation.

- Added .github/instructions/ to .gitignore
- These are internal development documents used by Copilot
- Should not be tracked in version control
- Bug #5: Fix VNet detection to search only peered VNets with most specific match
  * Modified _find_dns_server_host_vnet() to accept cluster_info parameter
  * Now searches only VNets peered with cluster VNet (not all in subscription)
  * Uses highest prefix length for most specific network match
  * Fixes incorrect VNet identification when large address spaces overlap

- Bug #6: Change DNS misconfiguration severity from CRITICAL to WARNING
  * Updated message: 'cannot resolve' → 'may not resolve'
  * Custom DNS can work if properly configured
  * Enhanced recommendation with three specific configuration options

- Bug #3: Fix severity display labels in report output
  * Show actual severity levels: [CRITICAL], [HIGH], [WARNING], [INFO]
  * Removed non-existent 'error' severity from all mappings
  * Fixes confusing [ERROR] labels that didn't match actual severity

Validated against aks-api-connection test cluster.

Ref: .github/instructions/backport-features.instructions.md
…inks

The method was returning an empty list, preventing detection of missing
private DNS zone VNet links. Now properly:
- Extracts VNet IDs from agent pool subnet IDs
- Retrieves VNet details including custom DNS servers
- Returns list of VNets with DNS configuration

This enables the PDNS_DNS_HOST_VNET_LINK_MISSING finding to be generated
when DNS server VNets are not linked to private DNS zones.

Fixes Bug #4 from backport plan.
Added 5 new finding codes for permission-related issues:
- PERMISSION_INSUFFICIENT_VNET
- PERMISSION_INSUFFICIENT_VMSS
- PERMISSION_INSUFFICIENT_LB
- PERMISSION_INSUFFICIENT_NSG
- PERMISSION_INSUFFICIENT_DNS

These codes will be used when implementing graceful degradation
for insufficient RBAC permissions.

Ref: .github/instructions/backport-features.instructions.md Enhancement #1
- Fix W0612: Remove unused variable 'i' in dns_analyzer.py
- Fix W1514: Add encoding='utf-8' to open() in report_generator.py
- Fix W0707: Add 'from exc' to exception re-raising (validators.py, aks-net-diagnostics.py)
- Fix W0613: Prefix unused argument with underscore in route_table_analyzer.py
- Fix R1732: Use 'with' statement for urllib.request.urlopen()
- Apply Black formatting to models.py and misconfiguration_analyzer.py

All quality checks now pass:
- Black: PASSED
- isort: PASSED
- Flake8: PASSED (0 violations)
- Pylint: 9.99/10 (up from 9.96/10)
- Tests: 139/139 PASSED
Implements graceful degradation when user has insufficient RBAC permissions:

Azure CLI Executor (azure_cli.py):
- Add permission_errors tracking list
- Add _is_authorization_error() to detect auth failures from stderr
- Add _extract_permission_action() to identify missing permissions
- Add execute_with_permission_check() for graceful degradation
- Returns None instead of crashing on permission errors

Cluster Data Collector (cluster_data_collector.py):
- Add findings list to track permission issues
- Use execute_with_permission_check() for VNet retrieval
- Use execute_with_permission_check() for VNet peerings
- Use execute_with_permission_check() for VMSS operations
- Create WARNING findings with actionable remediation steps

Benefits:
- Tool doesn't crash with limited permissions
- Users get partial diagnostics results
- Clear guidance on which permissions are needed
- Enterprise-friendly for strict RBAC environments

Remaining work (future commits):
- Update remaining analyzers (misconfiguration, outbound, NSG, DNS)
- Add 'Analysis Incomplete' indicator in report_generator
- Collect permission findings in main script
- Testing with different RBAC roles

Related: Permission handling plan in .github/instructions/permission-handling-plan.md
Implemented execute_with_permission_check() throughout all analyzer modules:

misconfiguration_analyzer.py:
- Private DNS zone listing
- VNet link listing (2 places)
- VNet retrieval for DNS checks (2 places)
- Remote VNet retrieval in peerings
- Private DNS zone lookup
- Added findings list for permission issues

outbound_analyzer.py:
- Load Balancer listing
- Frontend IP configuration retrieval
- Public IP details retrieval (2 places)
- NAT Gateway listing
- Public IP prefix details
- Added findings list with detailed remediation

nsg_analyzer.py:
- Subnet retrieval
- NSG details for subnet NSGs
- NSG details for NIC NSGs
- Added findings list
- Creates FindingCode.PERMISSION_INSUFFICIENT_NSG with scope info

dns_analyzer.py:
- VNet DNS configuration retrieval
- Shows warning when permissions insufficient
- Already had findings list

All analyzers now:
✅ Use execute_with_permission_check() for Azure operations
✅ Track permission errors in findings list
✅ Continue analysis with partial data
✅ Log debug messages for skipped operations
✅ Provide actionable remediation in findings

Next steps:
- Update report_generator.py with 'Analysis Incomplete' indicator
- Collect all permission findings in main script
- Test with different RBAC permission levels
Quality fixes:
- Removed trailing whitespace in azure_cli.py
- Fixed f-string without interpolation (4 instances)
- Applied Black formatting to outbound_analyzer.py

Test fixes:
- Updated test mocks to use execute_with_permission_check()
- Fixed test_cluster_data_collector.py (3 tests)
- Fixed test_nsg_analyzer.py (2 tests)

Results:
✅ All 139/139 tests passing
✅ Pylint 9.98/10 (above 9.5 target)
✅ Flake8 clean
✅ Black formatting clean
✅ Ready for integration testing
Issue:
- NSG findings were being created twice:
  1. By nsg_analyzer.py (authoritative source)
  2. By misconfiguration_analyzer.py (duplicate)
- This resulted in duplicate warnings with slightly different wording

Examples of duplicates:
- 'has rules that may block inter-node communication' (nsg_analyzer)
  vs 'on subnet has X rule(s) that may block...' (misconfiguration)
- 'may block AKS traffic but is overridden' (nsg_analyzer)
  vs 'could block AKS traffic, but higher-priority...' (misconfiguration)

Fix:
- Removed _analyze_nsg_issues() method from misconfiguration_analyzer.py
- NSG analyzer is now the single source of truth for NSG findings
- Added comment explaining why NSG analysis is skipped in misconfiguration

Impact:
- Each finding now appears exactly once in output
- Cleaner, more professional diagnostic reports
- No functional changes to NSG analysis logic

Tested with aks-overlay cluster - confirmed duplicates eliminated.
All 139 tests passing, Pylint 9.98/10.
Issue:
- 'Analyzing NSG configuration...' logged twice during execution:
  1. In main script's analyze_nsg_configuration() method
  2. In NSG analyzer's analyze() method

Fix:
- Removed log message from main script
- NSG analyzer logs the message (single source)
- Main script just orchestrates without duplicate logging

Tested with aks-overlay - confirmed message appears only once.
Critical bug fix for permission handling:

Issue:
- execute_with_permission_check() only caught subprocess.CalledProcessError
- execute() method converts subprocess errors to AzureCLIError
- Result: Permission errors were NOT handled gracefully, tool crashed

Fix:
- Changed execute_with_permission_check() to catch AzureCLIError
- Extract stderr from exception attributes
- Properly detect authorization errors and return None
- Track permission errors for reporting

Impact:
- Tool now gracefully degrades with limited permissions
- Logs warnings for missing permissions
- Continues analysis with partial data
- No more crashes on AuthorizationFailed errors

Tested with:
- Service principal with limited permissions (Contributor on cluster only)
- Successfully handled 5+ permission errors gracefully
- Completed analysis with partial diagnostics

This was a CRITICAL bug that prevented the entire permission handling
infrastructure from working correctly.
Enhancement to permission handling:

Before:
- Permission warnings only logged to console
- Not shown in findings summary or detailed report
- Users had to scan logs to find permission issues

After:
- Permission findings collected from all analyzers
- Displayed in findings summary
- Detailed view shows:
  * Specific permission required (e.g., Microsoft.Network/virtualNetworks/read)
  * Example Azure CLI command to grant permission
  * Clear finding codes (PERMISSION_INSUFFICIENT_VNET, etc.)

Implementation:
- Added collect_permission_findings() method to main script
- Stores analyzer instances to access their findings
- Collects findings from:
  * cluster_data_collector
  * outbound_analyzer
  * misconfiguration_analyzer
  * NSG and DNS analyzers (already collected)
- Calls before generate_report() to ensure all findings included

Tested with limited permissions (aks-dns-ex1):
- Shows 3 permission findings in summary
- Detailed view provides clear remediation steps
- No duplicate findings

This completes the permission handling user experience.
- Modified azure_cli.py execute() to detect permission errors
  and suppress ERROR logging (only show WARNING from
  execute_with_permission_check)
- Updated route_table_analyzer.py _get_subnet_details() to use
  execute_with_permission_check() for cleaner permission handling
- Changed exception logging from INFO to DEBUG level
- Updated test mocks to use execute_with_permission_check()

This reduces log noise from 3+ lines per permission error down to
1 clean WARNING line with required permission info.

Before:
  2025-10-27 14:58:16 - INFO - Error getting subnet details...
  ERROR: (AuthorizationFailed)...
  Code: AuthorizationFailed...
  2025-10-27 14:58:16 - ERROR - Load Balancer outbound type...

After:
  2025-10-27 15:03:39 - WARNING - Insufficient permissions to retrieve subnet 'aks-subnet' details. Required: Microsoft.Network/virtualNetworks/read
- Modified misconfiguration_analyzer to accept outbound_analyzer parameter
- Check for PERMISSION_INSUFFICIENT_LB finding before creating NO_OUTBOUND_IPS
- Updated report_generator to display helpful message when outbound config
  cannot be retrieved due to insufficient permissions
- Improved has_outbound_data check to avoid false positives with empty dicts

This prevents the tool from incorrectly reporting 'No outbound IP addresses
detected' when the real issue is insufficient permissions to retrieve the
load balancer configuration.

Before (with limited permissions):
  **Outbound Configuration:**

  **Findings Summary:**
  - [WARNING] No outbound IP addresses detected for loadBalancer outbound type

After (with limited permissions):
  **Outbound Configuration:**
  - Unable to retrieve loadBalancer configuration due to insufficient permissions
  - See permission findings below for details

  **Findings Summary:**
  - [WARNING] Insufficient permissions to retrieve Load Balancer configuration...
…essage

- Added permission check to _print_outbound_connectivity() method
- Now displays outbound section in --details mode even when permissions lacking
- Shows clear message: 'Unable to retrieve loadBalancer configuration...'
- Consistent behavior between summary and detailed reports

This ensures the detailed report (--details flag) also shows the outbound
configuration section with appropriate messaging when permissions are insufficient,
matching the behavior of the summary report.
- Auto-formatted with Black to maintain code style consistency
- No functional changes
- Pylint score: 9.98/10
- All 139 tests passing
- Updated cluster_data_collector.py to use execute_with_permission_check() for VMSS details
- Updated connectivity_tester.py to use execute_with_permission_check() for VMSS list and instances
- Added warning message when VMSS instances unavailable for connectivity testing
- Tool now skips probe tests gracefully instead of crashing with permission errors
- Updated test mocks to match new execute_with_permission_check() calls
- All 139 tests passing
- Add PERMISSION_INSUFFICIENT_PROBE_TEST finding code
- Detect when probe tests are skipped due to lack of permissions
- Display permission finding in both summary and detailed reports
- Distinguish between empty VMSS list (no instances) vs None (permission error)
- Update tests to handle new permission detection logic
…ndation

- Detect AuthorizationFailed errors in connectivity test failures
- Provide specific permission recommendation: Microsoft.Compute/virtualMachineScaleSets/virtualMachines/runCommand/action
- Include az role assignment command example with Virtual Machine Contributor role
- Update finding message to clearly state 'due to insufficient permissions'
- Distinguish between permission errors and other test failures
- Add Connectivity Tests section to summary report showing total/passed/failed/errors
- Add Connectivity Tests section to detailed report with consistent formatting
- Improve status labels: 'Tests Failed' vs 'Could Not Execute' for clarity
- Show 'Skipped due to insufficient permissions' when applicable
- Consistent formatting between summary and detailed reports
- Add permission handling feature with graceful degradation
- Add connectivity test summary to reports
- Update documentation with permission requirements
- Add data collection flow diagram showing permission handling
- Document Azure CLI v2.78.0 bug and workaround

BREAKING: None
FEATURES:
- Permission-aware execution with 4 new finding codes
- Connectivity test results in report summary
- Graceful continuation with partial data when permissions insufficient

FIXES:
- Probe test crash with limited permissions

DOCUMENTATION:
- Enhanced ARCHITECTURE.md with permission handling section
- Added permission requirements to security considerations
- Visual data collection flow diagram with error paths
- Updated CHANGELOG.md with v1.2.0 release notes
@sturrent
Copy link
Copy Markdown
Owner Author

✅ CI Fix Applied

Fixed Flake8 violation (F541: f-string without placeholders) in connectivity_tester.py.

Issue: Line 152 had an f-string without any placeholders
Fix: Removed f prefix from string literals (commit 297768b)

Status:

  • ✅ Black: Passing
  • ✅ isort: Passing
  • ✅ Flake8: Passing (was failing)
  • ✅ Pytest: 139/139 tests passing

CI should now pass. The workflow will re-run automatically with the new commit.

@sturrent sturrent merged commit 3462f1b into main Oct 28, 2025
2 checks passed
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant