
{AKS} Add provisioningState retry logic to AKS test base class#33130

Open
deveshdama wants to merge 2 commits into Azure:dev from deveshdama:aks-provisioningstate-retry-v2

Conversation

@deveshdama
Contributor

Add opt-in retry logic to the AKS CLI tests for handling provisioningState race conditions caused by external modifications (e.g., Azure Policy) during live tests.

Related command
az aks update

Description

Problem:

A provisioningState race condition caused by Azure Policy is seen in AKS CLI runners that use the Azure CLI test SDK. Live tests in Microsoft-managed subscriptions intermittently fail with:

Query 'provisioningState' doesn't yield expected value 'Succeeded', instead the actual value is 'Updating'

Time      Actor          Action                                           provisioningState
20:46:01  Azure CLI      PUT managed cluster (cli test)                   -
20:56:52  AKS            PUT completes successfully                       Succeeded
20:57:08  Azure Policy   PUT managed cluster (azure policy compliance)    Updating
20:57:14  Azure CLI SDK  GET managed cluster (test assertion)             Updating (cli test failure)

Proposed Fix in the PR:

Overrides cmd() in AzureKubernetesServiceScenarioTest to poll via az resource show --ids <id> with exponential backoff when provisioningState is non-terminal. Scoped entirely within the ACS module.
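The backoff described above can be sketched in isolation. This is a minimal illustration; `poll_delays` is a name of our own, while the 2.0s base delay, doubling per attempt, and up to 1 second of jitter mirror the parameters described in this PR:

```python
import random

def poll_delays(max_retries=10, base_delay=2.0, jitter=True):
    """Seconds to sleep before each poll attempt: base_delay * 2**attempt,
    plus up to 1 second of random jitter to avoid synchronized polling."""
    return [base_delay * (2 ** attempt) + (random.uniform(0, 1) if jitter else 0.0)
            for attempt in range(max_retries)]

# Deterministic schedule without jitter: the delay doubles each attempt.
print(poll_delays(max_retries=5, jitter=False))  # [2.0, 4.0, 8.0, 16.0, 32.0]
```

With the defaults (10 retries, 2.0s base) the worst-case total wait is roughly 2 * (2**10 - 1) seconds plus jitter, which bounds how long a stuck non-terminal state can hold up a live test.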

Opt-in via AZURE_CLI_TEST_RETRY_PROVISIONING_CHECK=true. Only activates in live mode (self.is_live), so playback/recorded tests are unaffected. Extra poll requests recorded in cassettes during live runs are silently ignored on replay.

Testing Guide
Unit tests are in test_aks_provisioning_retry.py. To enable the retry path:

AZURE_CLI_TEST_RETRY_PROVISIONING_CHECK=true \
azdev test test_aks_create_default_service_with_monitoring_addon_msi --live

This checklist is used to make sure that common guidelines for a pull request are followed.

Copilot AI review requested due to automatic review settings April 3, 2026 09:04
@azure-client-tools-bot-prd

azure-client-tools-bot-prd bot commented Apr 3, 2026

️✔️AzureCLI-FullTest
️✔️acr
️✔️latest
️✔️3.12
️✔️3.13
️✔️acs
️✔️latest
️✔️3.12
️✔️3.13
️✔️advisor
️✔️latest
️✔️3.12
️✔️3.13
️✔️ams
️✔️latest
️✔️3.12
️✔️3.13
️✔️apim
️✔️latest
️✔️3.12
️✔️3.13
️✔️appconfig
️✔️latest
️✔️3.12
️✔️3.13
️✔️appservice
️✔️latest
️✔️3.12
️✔️3.13
️✔️aro
️✔️latest
️✔️3.12
️✔️3.13
️✔️backup
️✔️latest
️✔️3.12
️✔️3.13
️✔️batch
️✔️latest
️✔️3.12
️✔️3.13
️✔️batchai
️✔️latest
️✔️3.12
️✔️3.13
️✔️billing
️✔️latest
️✔️3.12
️✔️3.13
️✔️botservice
️✔️latest
️✔️3.12
️✔️3.13
️✔️cdn
️✔️latest
️✔️3.12
️✔️3.13
️✔️cloud
️✔️latest
️✔️3.12
️✔️3.13
️✔️cognitiveservices
️✔️latest
️✔️3.12
️✔️3.13
️✔️compute_recommender
️✔️latest
️✔️3.12
️✔️3.13
️✔️computefleet
️✔️latest
️✔️3.12
️✔️3.13
️✔️config
️✔️latest
️✔️3.12
️✔️3.13
️✔️configure
️✔️latest
️✔️3.12
️✔️3.13
️✔️consumption
️✔️latest
️✔️3.12
️✔️3.13
️✔️container
️✔️latest
️✔️3.12
️✔️3.13
️✔️containerapp
️✔️latest
️✔️3.12
️✔️3.13
️✔️core
️✔️latest
️✔️3.12
️✔️3.13
️✔️cosmosdb
️✔️latest
️✔️3.12
️✔️3.13
️✔️databoxedge
️✔️latest
️✔️3.12
️✔️3.13
️✔️dls
️✔️latest
️✔️3.12
️✔️3.13
️✔️dms
️✔️latest
️✔️3.12
️✔️3.13
️✔️eventgrid
️✔️latest
️✔️3.12
️✔️3.13
️✔️eventhubs
️✔️latest
️✔️3.12
️✔️3.13
️✔️feedback
️✔️latest
️✔️3.12
️✔️3.13
️✔️find
️✔️latest
️✔️3.12
️✔️3.13
️✔️hdinsight
️✔️latest
️✔️3.12
️✔️3.13
️✔️identity
️✔️latest
️✔️3.12
️✔️3.13
️✔️iot
️✔️latest
️✔️3.12
️✔️3.13
️✔️keyvault
️✔️latest
️✔️3.12
️✔️3.13
️✔️lab
️✔️latest
️✔️3.12
️✔️3.13
️✔️managedservices
️✔️latest
️✔️3.12
️✔️3.13
️✔️maps
️✔️latest
️✔️3.12
️✔️3.13
️✔️marketplaceordering
️✔️latest
️✔️3.12
️✔️3.13
️✔️monitor
️✔️latest
️✔️3.12
️✔️3.13
️✔️mysql
️✔️latest
️✔️3.12
️✔️3.13
️✔️netappfiles
️✔️latest
️✔️3.12
️✔️3.13
️✔️network
️✔️latest
️✔️3.12
️✔️3.13
️✔️policyinsights
️✔️latest
️✔️3.12
️✔️3.13
️✔️postgresql
️✔️latest
️✔️3.12
️✔️3.13
️✔️privatedns
️✔️latest
️✔️3.12
️✔️3.13
️✔️profile
️✔️latest
️✔️3.12
️✔️3.13
️✔️rdbms
️✔️latest
️✔️3.12
️✔️3.13
️✔️redis
️✔️latest
️✔️3.12
️✔️3.13
️✔️relay
️✔️latest
️✔️3.12
️✔️3.13
️✔️resource
️✔️latest
️✔️3.12
️✔️3.13
️✔️role
️✔️latest
️✔️3.12
️✔️3.13
️✔️search
️✔️latest
️✔️3.12
️✔️3.13
️✔️security
️✔️latest
️✔️3.12
️✔️3.13
️✔️servicebus
️✔️latest
️✔️3.12
️✔️3.13
️✔️serviceconnector
️✔️latest
️✔️3.12
️✔️3.13
️✔️servicefabric
️✔️latest
️✔️3.12
️✔️3.13
️✔️signalr
️✔️latest
️✔️3.12
️✔️3.13
️✔️sql
️✔️latest
️✔️3.12
️✔️3.13
️✔️sqlvm
️✔️latest
️✔️3.12
️✔️3.13
️✔️storage
️✔️latest
️✔️3.12
️✔️3.13
️✔️synapse
️✔️latest
️✔️3.12
️✔️3.13
️✔️telemetry
️✔️latest
️✔️3.12
️✔️3.13
️✔️util
️✔️latest
️✔️3.12
️✔️3.13
️✔️vm
️✔️latest
️✔️3.12
️✔️3.13

@azure-client-tools-bot-prd

azure-client-tools-bot-prd bot commented Apr 3, 2026

️✔️AzureCLI-BreakingChangeTest
️✔️Non Breaking Changes

@yonzhan
Collaborator

yonzhan commented Apr 3, 2026

Thank you for your contribution! We will review the pull request and get back to you soon.

@github-actions

github-actions bot commented Apr 3, 2026

The git hooks are available for azure-cli and azure-cli-extensions repos. They could help you run required checks before creating the PR.

Please sync the latest code with latest dev branch (for azure-cli) or main branch (for azure-cli-extensions).
After that please run the following commands to enable git hooks:

pip install azdev --upgrade
azdev setup -c <your azure-cli repo path> -r <your azure-cli-extensions repo path>

Contributor

Copilot AI left a comment


Pull request overview

This PR adds opt-in retry logic to AKS live scenario tests to reduce intermittent failures caused by post-operation external updates (e.g., Azure Policy) that temporarily flip provisioningState away from Succeeded.

Changes:

  • Override AzureKubernetesServiceScenarioTest.cmd() to detect provisioningState == Succeeded assertions and, when enabled, poll the resource with exponential backoff until provisioningState becomes terminal.
  • Add helper methods to identify provisioning-state checks and decide whether polling should occur.
  • Add unit tests covering _should_retry_for_provisioning_state and the retry/polling behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

Files changed:

  • src/azure-cli/azure/cli/command_modules/acs/tests/latest/test_aks_commands.py: Adds the cmd() override and _cmd_with_retry() polling logic for provisioningState checks in live tests.
  • src/azure-cli/azure/cli/command_modules/acs/tests/latest/test_aks_provisioning_retry.py: Adds unit tests for the provisioningState retry decision and retry loop behavior.


Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.



Comment on lines +95 to +140
if provisioning_checks:
    should_retry, resource_id = self._should_retry_for_provisioning_state(result)
    if should_retry:
        initial_data = result.get_output_in_json()
        initial_etag = initial_data.get('etag')
        last_seen_etag = initial_etag
        max_retries = max(1, int(os.environ.get('AZURE_CLI_TEST_PROVISIONING_MAX_RETRIES', '10')))
        base_delay = float(os.environ.get('AZURE_CLI_TEST_PROVISIONING_BASE_DELAY', '2.0'))

        # Poll with exponential backoff + jitter until terminal state
        for attempt in range(max_retries):
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)
            poll_result = execute(self.cli_ctx, f'resource show --ids {resource_id}', expect_failure=False)
            poll_data = poll_result.get_output_in_json()
            current_provisioning_state = poll_data.get('provisioningState')
            current_etag = poll_data.get('etag')

            # Track etag changes to detect external modifications during polling
            if current_etag and last_seen_etag and current_etag != last_seen_etag:
                logging.warning("ETag changed during polling (external modification detected)")
                last_seen_etag = current_etag

            if current_provisioning_state == 'Succeeded':
                break
            elif current_provisioning_state in {'Failed', 'Canceled'}:
                raise AssertionError(
                    f"provisioningState reached terminal failure: {current_provisioning_state}"
                )
        else:
            # for/else: ran all retries without breaking
            final_etag_msg = ""
            if initial_etag and last_seen_etag:
                final_etag_msg = f" (initial etag: {initial_etag}, final: {last_seen_etag})"
            raise TimeoutError(
                f"provisioningState did not reach 'Succeeded' after {max_retries} retries. "
                f"Final state: {current_provisioning_state}{final_etag_msg}"
            )

        # Provisioning checks already verified via polling; skip re-checking the stale result

# Run all non-provisioning checks against the original result
if other_checks:
    result.assert_with_checks(other_checks)

return result

Copilot AI Apr 7, 2026


_cmd_with_retry never executes provisioning_checks when _should_retry_for_provisioning_state returns False (e.g., initial provisioningState == 'Succeeded', or when the command output lacks id / provisioningState). This changes ScenarioTest.cmd() semantics: a JMESPathCheck('provisioningState','Succeeded') can be silently skipped and the test may pass even though the check would have failed.

To preserve behavior, ensure provisioning checks are still asserted when no retry is performed (e.g., run result.assert_with_checks(provisioning_checks) when should_retry is False; when polling occurs, run provisioning checks against the last poll_result that reached Succeeded). Consider adding a unit test for the “missing provisioningState/id” case to prevent regressions.
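The suggested restructuring can be sketched with stand-in objects (FakeResult and apply_provisioning_checks are illustrative names, not the real ScenarioTest types):

```python
class FakeResult:
    """Stand-in for the CLI test result object; records what was asserted."""
    def __init__(self):
        self.asserted = []

    def assert_with_checks(self, checks):
        self.asserted.extend(checks)

def apply_provisioning_checks(result, provisioning_checks, should_retry, final_poll_result):
    # Never drop the checks: assert against the original result when no
    # polling happened, and against the converged poll result otherwise.
    if not should_retry:
        result.assert_with_checks(provisioning_checks)
    else:
        final_poll_result.assert_with_checks(provisioning_checks)

original, final_poll = FakeResult(), FakeResult()
apply_provisioning_checks(original, ["provisioningState == Succeeded"], False, final_poll)
print(original.asserted, final_poll.asserted)  # ['provisioningState == Succeeded'] []
```

Either branch runs the provisioning checks, so a failing JMESPathCheck can no longer be silently skipped when no retry is performed.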

Member


Please take a look at this comment.

@FumingZhang
Copy link
Copy Markdown
Member

Besides, I think the concern I raised in the original PR is still valid: in the following scenario, the test would fail in replay mode.

======================================================================
PHASE 1: LIVE RECORDING (simulating retry scenario)

  • Initial command returns provisioningState='Updating'
  • Retry polls until provisioningState='Succeeded'
======================================================================
[1] GET /managedClusters/myAKS -> provisioningState=Updating
[2] GET /resource-show-poll?attempt=1 -> provisioningState=Updating
[3] GET /resource-show-poll?attempt=2 -> provisioningState=Succeeded

Cassette has 3 recorded interactions.
First interaction response body: provisioningState=Updating

======================================================================
PHASE 2: REPLAY (simulating super().cmd() with check)

  • Retry is SKIPPED (is_live=False)
  • VCR replays the INITIAL 'Updating' response
  • Test asserts provisioningState == 'Succeeded'
    ======================================================================
    [1] GET /managedClusters/myAKS -> provisioningState=Updating

Consumed: 1/3 interactions
Unconsumed: 2

Simulating JMESPathCheck('provisioningState', 'Succeeded')...
Actual value from replayed response: 'Updating'

CHECK FAILED: Expected 'Succeeded', got 'Updating'
This is EXACTLY the failure that would occur in replay mode!
The JMESPathCheck assertion would raise:
AssertionError: Query 'provisioningState' doesn't yield expected
value 'Succeeded', instead the actual value is 'Updating'

======================================================================
CONCLUSION:
UNCONSUMED cassette entries don't cause VCR errors.

BUT the real problem is different: the CONSUMED entry (the initial
command response) has provisioningState='Updating'. In replay mode,
super().cmd() gets this 'Updating' response and runs
JMESPathCheck('provisioningState', 'Succeeded') against it.
That check FAILS — not a VCR error, but an assertion failure.

This only happens when the race condition occurred during recording.
If the initial response was 'Succeeded' (no race), replay works fine.
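The replay-mode failure above can be reproduced with plain dictionaries. This is a simulation of the check against the consumed cassette entry, not the real VCR or JMESPathCheck machinery:

```python
# The consumed cassette entry recorded during the race still says 'Updating'.
replayed_response = {'provisioningState': 'Updating'}

def jmespath_style_check(data, query, expected):
    """Simplified stand-in for a top-level-key JMESPath check assertion."""
    actual = data.get(query)
    if actual != expected:
        raise AssertionError(
            f"Query '{query}' doesn't yield expected value '{expected}', "
            f"instead the actual value is '{actual}'")

try:
    jmespath_style_check(replayed_response, 'provisioningState', 'Succeeded')
except AssertionError as exc:
    print(exc)  # the class of failure seen in replay mode
```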
