
{AKS} Add provisioningState retry logic to AKS test base class#33130

Open
deveshdama wants to merge 2 commits into Azure:dev from deveshdama:aks-provisioningstate-retry-v2

Conversation

@deveshdama
Contributor

Add opt-in retry logic to the AKS CLI tests for handling provisioningState race conditions caused by external modifications (e.g., Azure Policy) during live tests.

Related command
az aks update

Description

Problem:

A provisioningState race condition caused by Azure Policy is seen in AKS CLI runners that use the Azure CLI test SDK. Live tests in Microsoft-managed subscriptions intermittently fail with:

Query 'provisioningState' doesn't yield expected value 'Succeeded', instead the actual value is 'Updating'

Time      Actor          Action                                           provisioningState
20:46:01  Azure CLI      PUT managed cluster (cli test)                   -
20:56:52  AKS            PUT completes successfully                       Succeeded
20:57:08  Azure Policy   PUT managed cluster (azure policy compliance)    Updating
20:57:14  Azure CLI SDK  GET managed cluster (test assertion)             Updating (cli test failure)

Proposed Fix in the PR:

Overrides cmd() in AzureKubernetesServiceScenarioTest to poll via az resource show --ids <id> with exponential backoff when provisioningState is non-terminal. Scoped entirely within the ACS module.
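The backoff described above can be sketched in isolation. This is a minimal illustration; `poll_delays` is a name of our own, while the 2.0s base delay, doubling per attempt, and up to 1 second of jitter mirror the parameters described in this PR:

```python
import random

def poll_delays(max_retries=10, base_delay=2.0, jitter=True):
    """Seconds to sleep before each poll attempt: base_delay * 2**attempt,
    plus up to 1 second of random jitter to avoid synchronized polling."""
    return [base_delay * (2 ** attempt) + (random.uniform(0, 1) if jitter else 0.0)
            for attempt in range(max_retries)]

# Deterministic schedule without jitter: the delay doubles each attempt.
print(poll_delays(max_retries=5, jitter=False))  # [2.0, 4.0, 8.0, 16.0, 32.0]
```

With the defaults (10 retries, 2.0s base) the worst-case total wait is roughly 2 * (2**10 - 1) seconds plus jitter, which bounds how long a stuck non-terminal state can hold up a live test.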

Opt-in via AZURE_CLI_TEST_RETRY_PROVISIONING_CHECK=true. Only activates in live mode (self.is_live), so playback/recorded tests are unaffected. Extra poll requests recorded in cassettes during live runs are silently ignored on replay.

Testing Guide
Unit tests are in test_aks_provisioning_retry.py. To enable the retry path:

AZURE_CLI_TEST_RETRY_PROVISIONING_CHECK=true \
azdev test test_aks_create_default_service_with_monitoring_addon_msi --live

This checklist is used to make sure that common guidelines for a pull request are followed.

Copilot AI review requested due to automatic review settings April 3, 2026 09:04
@azure-client-tools-bot-prd

azure-client-tools-bot-prd bot commented Apr 3, 2026

️✔️AzureCLI-FullTest
️✔️acr
️✔️latest
️✔️3.12
️✔️3.13
️✔️acs
️✔️latest
️✔️3.12
️✔️3.13
️✔️advisor
️✔️latest
️✔️3.12
️✔️3.13
️✔️ams
️✔️latest
️✔️3.12
️✔️3.13
️✔️apim
️✔️latest
️✔️3.12
️✔️3.13
️✔️appconfig
️✔️latest
️✔️3.12
️✔️3.13
️✔️appservice
️✔️latest
️✔️3.12
️✔️3.13
️✔️aro
️✔️latest
️✔️3.12
️✔️3.13
️✔️backup
️✔️latest
️✔️3.12
️✔️3.13
️✔️batch
️✔️latest
️✔️3.12
️✔️3.13
️✔️batchai
️✔️latest
️✔️3.12
️✔️3.13
️✔️billing
️✔️latest
️✔️3.12
️✔️3.13
️✔️botservice
️✔️latest
️✔️3.12
️✔️3.13
️✔️cdn
️✔️latest
️✔️3.12
️✔️3.13
️✔️cloud
️✔️latest
️✔️3.12
️✔️3.13
️✔️cognitiveservices
️✔️latest
️✔️3.12
️✔️3.13
️✔️compute_recommender
️✔️latest
️✔️3.12
️✔️3.13
️✔️computefleet
️✔️latest
️✔️3.12
️✔️3.13
️✔️config
️✔️latest
️✔️3.12
️✔️3.13
️✔️configure
️✔️latest
️✔️3.12
️✔️3.13
️✔️consumption
️✔️latest
️✔️3.12
️✔️3.13
️✔️container
️✔️latest
️✔️3.12
️✔️3.13
️✔️containerapp
️✔️latest
️✔️3.12
️✔️3.13
️✔️core
️✔️latest
️✔️3.12
️✔️3.13
️✔️cosmosdb
️✔️latest
️✔️3.12
️✔️3.13
️✔️databoxedge
️✔️latest
️✔️3.12
️✔️3.13
️✔️dls
️✔️latest
️✔️3.12
️✔️3.13
️✔️dms
️✔️latest
️✔️3.12
️✔️3.13
️✔️eventgrid
️✔️latest
️✔️3.12
️✔️3.13
️✔️eventhubs
️✔️latest
️✔️3.12
️✔️3.13
️✔️feedback
️✔️latest
️✔️3.12
️✔️3.13
️✔️find
️✔️latest
️✔️3.12
️✔️3.13
️✔️hdinsight
️✔️latest
️✔️3.12
️✔️3.13
️✔️identity
️✔️latest
️✔️3.12
️✔️3.13
️✔️iot
️✔️latest
️✔️3.12
️✔️3.13
️✔️keyvault
️✔️latest
️✔️3.12
️✔️3.13
️✔️lab
️✔️latest
️✔️3.12
️✔️3.13
️✔️managedservices
️✔️latest
️✔️3.12
️✔️3.13
️✔️maps
️✔️latest
️✔️3.12
️✔️3.13
️✔️marketplaceordering
️✔️latest
️✔️3.12
️✔️3.13
️✔️monitor
️✔️latest
️✔️3.12
️✔️3.13
️✔️mysql
️✔️latest
️✔️3.12
️✔️3.13
️✔️netappfiles
️✔️latest
️✔️3.12
️✔️3.13
️✔️network
️✔️latest
️✔️3.12
️✔️3.13
️✔️policyinsights
️✔️latest
️✔️3.12
️✔️3.13
️✔️postgresql
️✔️latest
️✔️3.12
️✔️3.13
️✔️privatedns
️✔️latest
️✔️3.12
️✔️3.13
️✔️profile
️✔️latest
️✔️3.12
️✔️3.13
️✔️rdbms
️✔️latest
️✔️3.12
️✔️3.13
️✔️redis
️✔️latest
️✔️3.12
️✔️3.13
️✔️relay
️✔️latest
️✔️3.12
️✔️3.13
️✔️resource
️✔️latest
️✔️3.12
️✔️3.13
️✔️role
️✔️latest
️✔️3.12
️✔️3.13
️✔️search
️✔️latest
️✔️3.12
️✔️3.13
️✔️security
️✔️latest
️✔️3.12
️✔️3.13
️✔️servicebus
️✔️latest
️✔️3.12
️✔️3.13
️✔️serviceconnector
️✔️latest
️✔️3.12
️✔️3.13
️✔️servicefabric
️✔️latest
️✔️3.12
️✔️3.13
️✔️signalr
️✔️latest
️✔️3.12
️✔️3.13
️✔️sql
️✔️latest
️✔️3.12
️✔️3.13
️✔️sqlvm
️✔️latest
️✔️3.12
️✔️3.13
️✔️storage
️✔️latest
️✔️3.12
️✔️3.13
️✔️synapse
️✔️latest
️✔️3.12
️✔️3.13
️✔️telemetry
️✔️latest
️✔️3.12
️✔️3.13
️✔️util
️✔️latest
️✔️3.12
️✔️3.13
️✔️vm
️✔️latest
️✔️3.12
️✔️3.13

@azure-client-tools-bot-prd

azure-client-tools-bot-prd bot commented Apr 3, 2026

️✔️AzureCLI-BreakingChangeTest
️✔️Non Breaking Changes

@yonzhan
Collaborator

yonzhan commented Apr 3, 2026

Thank you for your contribution! We will review the pull request and get back to you soon.

@github-actions

github-actions bot commented Apr 3, 2026

The git hooks are available for azure-cli and azure-cli-extensions repos. They could help you run required checks before creating the PR.

Please sync the latest code with latest dev branch (for azure-cli) or main branch (for azure-cli-extensions).
After that please run the following commands to enable git hooks:

pip install azdev --upgrade
azdev setup -c <your azure-cli repo path> -r <your azure-cli-extensions repo path>

Contributor

Copilot AI left a comment


Pull request overview

This PR adds opt-in retry logic to AKS live scenario tests to reduce intermittent failures caused by post-operation external updates (e.g., Azure Policy) that temporarily flip provisioningState away from Succeeded.

Changes:

  • Override AzureKubernetesServiceScenarioTest.cmd() to detect provisioningState == Succeeded assertions and, when enabled, poll the resource with exponential backoff until provisioningState becomes terminal.
  • Add helper methods to identify provisioning-state checks and decide whether polling should occur.
  • Add unit tests covering _should_retry_for_provisioning_state and the retry/polling behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

Files changed:

  • src/azure-cli/azure/cli/command_modules/acs/tests/latest/test_aks_commands.py: Adds the cmd() override and _cmd_with_retry() polling logic for provisioningState checks in live tests.
  • src/azure-cli/azure/cli/command_modules/acs/tests/latest/test_aks_provisioning_retry.py: Adds unit tests for the provisioningState retry decision and retry loop behavior.


Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.



Comment on lines +95 to +140
if provisioning_checks:
    should_retry, resource_id = self._should_retry_for_provisioning_state(result)
    if should_retry:
        initial_data = result.get_output_in_json()
        initial_etag = initial_data.get('etag')
        last_seen_etag = initial_etag
        max_retries = max(1, int(os.environ.get('AZURE_CLI_TEST_PROVISIONING_MAX_RETRIES', '10')))
        base_delay = float(os.environ.get('AZURE_CLI_TEST_PROVISIONING_BASE_DELAY', '2.0'))

        # Poll with exponential backoff + jitter until terminal state
        for attempt in range(max_retries):
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)
            poll_result = execute(self.cli_ctx, f'resource show --ids {resource_id}', expect_failure=False)
            poll_data = poll_result.get_output_in_json()
            current_provisioning_state = poll_data.get('provisioningState')
            current_etag = poll_data.get('etag')

            # Track etag changes to detect external modifications during polling
            if current_etag and last_seen_etag and current_etag != last_seen_etag:
                logging.warning("ETag changed during polling (external modification detected)")
                last_seen_etag = current_etag

            if current_provisioning_state == 'Succeeded':
                break
            elif current_provisioning_state in {'Failed', 'Canceled'}:
                raise AssertionError(
                    f"provisioningState reached terminal failure: {current_provisioning_state}"
                )
        else:
            # for/else: ran all retries without breaking
            final_etag_msg = ""
            if initial_etag and last_seen_etag:
                final_etag_msg = f" (initial etag: {initial_etag}, final: {last_seen_etag})"
            raise TimeoutError(
                f"provisioningState did not reach 'Succeeded' after {max_retries} retries. "
                f"Final state: {current_provisioning_state}{final_etag_msg}"
            )

        # Provisioning checks already verified via polling; skip re-checking the stale result

# Run all non-provisioning checks against the original result
if other_checks:
    result.assert_with_checks(other_checks)

return result

Copilot AI Apr 7, 2026


_cmd_with_retry never executes provisioning_checks when _should_retry_for_provisioning_state returns False (e.g., initial provisioningState == 'Succeeded', or when the command output lacks id / provisioningState). This changes ScenarioTest.cmd() semantics: a JMESPathCheck('provisioningState','Succeeded') can be silently skipped and the test may pass even though the check would have failed.

To preserve behavior, ensure provisioning checks are still asserted when no retry is performed (e.g., run result.assert_with_checks(provisioning_checks) when should_retry is False; when polling occurs, run provisioning checks against the last poll_result that reached Succeeded). Consider adding a unit test for the “missing provisioningState/id” case to prevent regressions.
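The suggested restructuring can be sketched with stand-in objects (FakeResult and apply_provisioning_checks are illustrative names, not the real ScenarioTest types):

```python
class FakeResult:
    """Stand-in for the CLI test result object; records what was asserted."""
    def __init__(self):
        self.asserted = []

    def assert_with_checks(self, checks):
        self.asserted.extend(checks)

def apply_provisioning_checks(result, provisioning_checks, should_retry, final_poll_result):
    # Never drop the checks: assert against the original result when no
    # polling happened, and against the converged poll result otherwise.
    if not should_retry:
        result.assert_with_checks(provisioning_checks)
    else:
        final_poll_result.assert_with_checks(provisioning_checks)

original, final_poll = FakeResult(), FakeResult()
apply_provisioning_checks(original, ["provisioningState == Succeeded"], False, final_poll)
print(original.asserted, final_poll.asserted)  # ['provisioningState == Succeeded'] []
```

Either branch runs the provisioning checks, so a failing JMESPathCheck can no longer be silently skipped when no retry is performed.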

Member


Please take a look at this comment.

@FumingZhang
Copy link
Copy Markdown
Member

Besides, I think the concern I raised in the original PR is still valid: in the following scenario, the test would fail in replay mode.

======================================================================
PHASE 1: LIVE RECORDING (simulating retry scenario)

  • Initial command returns provisioningState='Updating'
  • Retry polls until provisioningState='Succeeded'
======================================================================
[1] GET /managedClusters/myAKS -> provisioningState=Updating
[2] GET /resource-show-poll?attempt=1 -> provisioningState=Updating
[3] GET /resource-show-poll?attempt=2 -> provisioningState=Succeeded

Cassette has 3 recorded interactions.
First interaction response body: provisioningState=Updating

======================================================================
PHASE 2: REPLAY (simulating super().cmd() with check)

  • Retry is SKIPPED (is_live=False)
  • VCR replays the INITIAL 'Updating' response
  • Test asserts provisioningState == 'Succeeded'
    ======================================================================
    [1] GET /managedClusters/myAKS -> provisioningState=Updating

Consumed: 1/3 interactions
Unconsumed: 2

Simulating JMESPathCheck('provisioningState', 'Succeeded')...
Actual value from replayed response: 'Updating'

CHECK FAILED: Expected 'Succeeded', got 'Updating'
This is EXACTLY the failure that would occur in replay mode!
The JMESPathCheck assertion would raise:
AssertionError: Query 'provisioningState' doesn't yield expected
value 'Succeeded', instead the actual value is 'Updating'

======================================================================
CONCLUSION:
UNCONSUMED cassette entries don't cause VCR errors.

BUT the real problem is different: the CONSUMED entry (the initial
command response) has provisioningState='Updating'. In replay mode,
super().cmd() gets this 'Updating' response and runs
JMESPathCheck('provisioningState', 'Succeeded') against it.
That check FAILS — not a VCR error, but an assertion failure.

This only happens when the race condition occurred during recording.
If the initial response was 'Succeeded' (no race), replay works fine.
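The replay-mode failure above can be reproduced with plain dictionaries. This is a simulation of the check against the consumed cassette entry, not the real VCR or JMESPathCheck machinery:

```python
# The consumed cassette entry recorded during the race still says 'Updating'.
replayed_response = {'provisioningState': 'Updating'}

def jmespath_style_check(data, query, expected):
    """Simplified stand-in for a top-level-key JMESPath check assertion."""
    actual = data.get(query)
    if actual != expected:
        raise AssertionError(
            f"Query '{query}' doesn't yield expected value '{expected}', "
            f"instead the actual value is '{actual}'")

try:
    jmespath_style_check(replayed_response, 'provisioningState', 'Succeeded')
except AssertionError as exc:
    print(exc)  # the class of failure seen in replay mode
```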
