{AKS} Add provisioningState retry logic to AKS test base class #33130
deveshdama wants to merge 2 commits into Azure:dev
Conversation
✔️ AzureCLI-FullTest
✔️ AzureCLI-BreakingChangeTest
Thank you for your contribution! We will review the pull request and get back to you soon.

The git hooks are available for the azure-cli and azure-cli-extensions repos. They can help you run required checks before creating the PR. Please sync the latest code with the latest dev branch (for azure-cli) or main branch (for azure-cli-extensions).

```shell
pip install azdev --upgrade
azdev setup -c <your azure-cli repo path> -r <your azure-cli-extensions repo path>
```
Pull request overview
This PR adds opt-in retry logic to AKS live scenario tests to reduce intermittent failures caused by post-operation external updates (e.g., Azure Policy) that temporarily flip provisioningState away from Succeeded.
Changes:
- Override `AzureKubernetesServiceScenarioTest.cmd()` to detect `provisioningState == Succeeded` assertions and, when enabled, poll the resource with exponential backoff until `provisioningState` becomes terminal.
- Add helper methods to identify provisioning-state checks and decide whether polling should occur.
- Add unit tests covering `_should_retry_for_provisioning_state` and the retry/polling behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `src/azure-cli/azure/cli/command_modules/acs/tests/latest/test_aks_commands.py` | Adds the `cmd()` override and `_cmd_with_retry()` polling logic for provisioningState checks in live tests. |
| `src/azure-cli/azure/cli/command_modules/acs/tests/latest/test_aks_provisioning_retry.py` | Adds unit tests for the provisioningState retry decision and retry loop behavior. |
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
```python
if provisioning_checks:
    should_retry, resource_id = self._should_retry_for_provisioning_state(result)
    if should_retry:
        initial_data = result.get_output_in_json()
        initial_etag = initial_data.get('etag')
        last_seen_etag = initial_etag
        max_retries = max(1, int(os.environ.get('AZURE_CLI_TEST_PROVISIONING_MAX_RETRIES', '10')))
        base_delay = float(os.environ.get('AZURE_CLI_TEST_PROVISIONING_BASE_DELAY', '2.0'))

        # Poll with exponential backoff + jitter until terminal state
        for attempt in range(max_retries):
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)
            poll_result = execute(self.cli_ctx, f'resource show --ids {resource_id}', expect_failure=False)
            poll_data = poll_result.get_output_in_json()
            current_provisioning_state = poll_data.get('provisioningState')
            current_etag = poll_data.get('etag')

            # Track etag changes to detect external modifications during polling
            if current_etag and last_seen_etag and current_etag != last_seen_etag:
                logging.warning("ETag changed during polling (external modification detected)")
                last_seen_etag = current_etag

            if current_provisioning_state == 'Succeeded':
                break
            elif current_provisioning_state in {'Failed', 'Canceled'}:
                raise AssertionError(
                    f"provisioningState reached terminal failure: {current_provisioning_state}"
                )
        else:
            # for/else: ran all retries without breaking
            final_etag_msg = ""
            if initial_etag and last_seen_etag:
                final_etag_msg = f" (initial etag: {initial_etag}, final: {last_seen_etag})"
            raise TimeoutError(
                f"provisioningState did not reach 'Succeeded' after {max_retries} retries. "
                f"Final state: {current_provisioning_state}{final_etag_msg}"
            )

        # Provisioning checks already verified via polling, skip re-checking stale result

# Run all non-provisioning checks against the original result
if other_checks:
    result.assert_with_checks(other_checks)

return result
```
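For reference, the delay schedule in the loop above grows exponentially from `base_delay`, with up to one second of jitter per attempt. A standalone sketch of the same formula (`backoff_delays` is a hypothetical helper name, not part of the PR):

```python
import random

def backoff_delays(base_delay, max_retries, seed=0):
    """Compute the same schedule the retry loop uses:
    base_delay * 2**attempt plus uniform jitter in [0, 1)."""
    rng = random.Random(seed)  # seeded here only to make the illustration reproducible
    return [base_delay * (2 ** attempt) + rng.uniform(0, 1)
            for attempt in range(max_retries)]

delays = backoff_delays(2.0, 4)
# Deterministic parts are 2, 4, 8, 16 seconds; each gets under 1s of jitter
```

With the default `AZURE_CLI_TEST_PROVISIONING_MAX_RETRIES=10` and `base_delay=2.0`, the final attempt alone waits roughly 17 minutes, so the total worst-case wall time is substantial.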
_cmd_with_retry never executes provisioning_checks when _should_retry_for_provisioning_state returns False (e.g., initial provisioningState == 'Succeeded', or when the command output lacks id / provisioningState). This changes ScenarioTest.cmd() semantics: a JMESPathCheck('provisioningState','Succeeded') can be silently skipped and the test may pass even though the check would have failed.
To preserve behavior, ensure provisioning checks are still asserted when no retry is performed (e.g., run result.assert_with_checks(provisioning_checks) when should_retry is False; when polling occurs, run provisioning checks against the last poll_result that reached Succeeded). Consider adding a unit test for the “missing provisioningState/id” case to prevent regressions.
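One shape that fix could take, sketched as a self-contained toy (the `FakeResult` class and `run_provisioning_checks` helper are illustrative stand-ins, not the PR's actual types):

```python
class FakeResult:
    """Stand-in for a CLI execution result holding a provisioningState value."""
    def __init__(self, state):
        self.state = state

    def assert_with_checks(self, checks):
        for expected in checks:
            assert self.state == expected, f"expected {expected}, got {self.state}"

def run_provisioning_checks(result, provisioning_checks, should_retry, poll_result=None):
    if should_retry:
        # Polling happened: assert against the last poll result that reached Succeeded
        poll_result.assert_with_checks(provisioning_checks)
    else:
        # No retry performed: keep ScenarioTest.cmd() semantics and still assert
        result.assert_with_checks(provisioning_checks)

# Without retry, the check still runs against the original result
run_provisioning_checks(FakeResult('Succeeded'), ['Succeeded'], should_retry=False)
```

The key point is the `else` branch: the provisioning checks are never silently dropped, so a genuinely failing assertion still fails the test.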
Please take a look at this comment.
Besides, I think the concern I raised in the original PR is still valid: in the following scenario, the test would fail in replay mode.

```
======================================================================
Cassette has 3 recorded interactions.
======================================================================
Consumed: 1/3 interactions
Simulating JMESPathCheck('provisioningState', 'Succeeded')...
CHECK FAILED: Expected 'Succeeded', got 'Updating'
======================================================================
```

BUT the real problem is different: the CONSUMED entry (the initial … This only happens when the race condition occurred during recording.
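A minimal illustration of this reviewer's point, assuming a cassette whose first recorded response still reports a non-terminal state (all names and data here are hypothetical):

```python
# Hypothetical cassette: the race occurred during recording, so the initial
# response was captured with a non-terminal provisioningState, and the extra
# poll requests were recorded after it.
cassette = [
    {'provisioningState': 'Updating'},   # initial response consumed by cmd()
    {'provisioningState': 'Updating'},   # extra poll recorded during the live run
    {'provisioningState': 'Succeeded'},  # final poll recorded during the live run
]

# In replay mode the retry path is disabled (self.is_live is False), so the
# JMESPathCheck runs only against the first consumed interaction and fails,
# even though a later recorded interaction did reach 'Succeeded'.
consumed = cassette[0]
check_passes = consumed['provisioningState'] == 'Succeeded'
```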
Add opt-in retry logic to the AKS CLI tests for handling provisioningState race conditions caused by external modifications (e.g., Azure Policy) during live tests.
Related command
`az aks update`

Description
Problem:
provisioningState race condition caused by Azure Policy, seen in AKS CLI runners that use the Azure CLI test SDK. Live tests in Microsoft-managed subscriptions intermittently fail with:

    Query 'provisioningState' doesn't yield expected value 'Succeeded', instead the actual value is 'Updating'

Proposed Fix in the PR:
Overrides `cmd()` in `AzureKubernetesServiceScenarioTest` to poll via `az resource show --ids <id>` with exponential backoff when provisioningState is non-terminal. Scoped entirely within the ACS module.

Opt-in via `AZURE_CLI_TEST_RETRY_PROVISIONING_CHECK=true`. Only activates in live mode (`self.is_live`), so playback/recorded tests are unaffected. Extra poll requests recorded in cassettes during live runs are silently ignored on replay.

Testing Guide
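For example, a live run could opt in like this. The first variable is the PR's opt-in switch and the other two are the tuning knobs read by the retry loop; the exact test invocation will vary, so it is shown commented out:

```shell
# Opt in to provisioningState retry (only has an effect in live mode)
export AZURE_CLI_TEST_RETRY_PROVISIONING_CHECK=true

# Optional tuning knobs read by the retry loop
export AZURE_CLI_TEST_PROVISIONING_MAX_RETRIES=5
export AZURE_CLI_TEST_PROVISIONING_BASE_DELAY=1.0

# then run the live scenario tests, e.g.:
# azdev test test_aks_commands --live
```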
Unit tests in `test_aks_provisioning_retry.py`. To enable the retry path, set `AZURE_CLI_TEST_RETRY_PROVISIONING_CHECK=true`.

This checklist is used to make sure that common guidelines for a pull request are followed.
The PR title and description have followed the guideline in Submitting Pull Requests.
I adhere to the Command Guidelines.
I adhere to the Error Handling Guidelines.