cosmos: add probe-gated endpoint failback integration test #4628

Open
NaluTripician wants to merge 2 commits into Azure:main from NaluTripician:nalutripician/cosmos-probe-gated-failback-test

@NaluTripician

Contributor

Summary

Follow-up to #4604 (review thread r3431461745), closing #4622. PR #4604 made Cosmos account-level endpoint failback probe-gated: a marked-unavailable endpoint rejoins the routing rotation only after a background connectivity probe confirms it is reachable (the old time-based auto-clear was removed). That state machine is covered by unit tests using injected fake probe closures, but nothing exercised the real probe path end to end.

This PR adds an in-memory-emulator integration test that drives the driver's real connectivity-probe closure against the emulator and asserts all three behaviors the issue calls out:

  1. A marked regional endpoint stays out of rotation while the region is connection-blocked (the probe keeps failing).
  2. It is failed back only after a successful probe once connectivity is restored.
  3. A failed probe resets the cooldown, so the endpoint is not immediately re-probed / prematurely failed back.
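The three behaviors can be pictured with a small standalone model. This is only a sketch under stated assumptions: `EndpointState`, `mark_unavailable`, and `probe_once` are hypothetical names invented for illustration, the probe is a synchronous closure, and none of it is the driver's actual implementation.

```rust
use std::time::{Duration, Instant};

/// Hypothetical model of one endpoint's availability (not the driver's type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EndpointState {
    /// In the routing rotation.
    Available,
    /// Out of rotation; a probe may run at or after `next_probe_at`.
    Unavailable { next_probe_at: Instant },
}

/// Marks an endpoint unavailable and starts its cooldown.
pub fn mark_unavailable(now: Instant, cooldown: Duration) -> EndpointState {
    EndpointState::Unavailable { next_probe_at: now + cooldown }
}

/// Runs one probe-and-failback iteration for a single endpoint.
pub fn probe_once(
    state: EndpointState,
    now: Instant,
    cooldown: Duration,
    probe: impl FnOnce() -> bool,
) -> EndpointState {
    match state {
        // Nothing to do for an endpoint that is already in rotation.
        EndpointState::Available => state,
        // Still cooling down: do not probe, stay out of rotation.
        EndpointState::Unavailable { next_probe_at } if now < next_probe_at => state,
        EndpointState::Unavailable { .. } => {
            if probe() {
                // Behavior 2: fail back only after a successful probe.
                EndpointState::Available
            } else {
                // Behaviors 1 and 3: stay out of rotation and reset the cooldown.
                EndpointState::Unavailable { next_probe_at: now + cooldown }
            }
        }
    }
}
```

In this model, time passing alone never returns the endpoint to `Available`; only a probe that reports success does, which mirrors the removal of the time-based auto-clear described above.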

How

  • New test tests/in_memory_emulator_tests/endpoint_probe_failback.rs builds a two-region in-memory emulator, blocks one region with a region-scoped ConnectionError fault (which also blocks the probe's GET /probe), and toggles the fault to simulate outage and recovery.
  • Failback is driven deterministically through one probe-and-failback iteration rather than the 60-second background loop, via new doc-hidden, internal-feature-gated test hooks on CosmosDriver: run_endpoint_probe_once_for_testing, mark_region_endpoint_unavailable_for_testing, and is_endpoint_host_marked_unavailable_for_testing. These follow the existing *_for_testing convention and are gated on any(test, feature = "__internal_in_memory_emulator"). A short endpoint_unavailability_ttl makes the endpoint probe-eligible quickly.
  • The endpoint is seeded as unavailable via a hook; operation-driven marking (connection error → failover → mark) is already covered by regional_gateway_unreachable.rs, so this test isolates the new failback behavior and exercises the real probe.
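For readers unfamiliar with the hook convention, the gating looks roughly like the sketch below. The `Driver` type, its field, and the two ungated methods are hypothetical stand-ins invented for illustration; only the `cfg` expression, the `#[doc(hidden)]` attribute, and the `*_for_testing` suffix come from the description above.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the driver; not the real CosmosDriver.
#[derive(Default)]
pub struct Driver {
    unavailable_hosts: HashSet<String>,
}

impl Driver {
    /// Ordinary (ungated) method: records a host as unavailable.
    pub fn mark_host_unavailable(&mut self, host: &str) {
        self.unavailable_hosts.insert(host.to_owned());
    }

    /// Ordinary (ungated) method: hosts still eligible for routing.
    pub fn routable_hosts<'a>(&self, hosts: &[&'a str]) -> Vec<&'a str> {
        hosts
            .iter()
            .copied()
            .filter(|host| !self.unavailable_hosts.contains(*host))
            .collect()
    }

    /// Test-only hook: hidden from docs and compiled out of normal builds.
    /// It exists only under `cfg(test)` or the internal emulator feature.
    #[doc(hidden)]
    #[cfg(any(test, feature = "__internal_in_memory_emulator"))]
    pub fn is_endpoint_host_marked_unavailable_for_testing(&self, host: &str) -> bool {
        self.unavailable_hosts.contains(host)
    }
}
```

Because the hook is removed at compile time rather than merely hidden, a build without the feature cannot call it, which is what keeps the production surface unchanged.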

Notes

  • No production behavior changes. All hooks compile out unless cfg(test) or the __internal_in_memory_emulator feature is enabled; the only non-test change is a behavior-identical restructuring of the probe-closure construction so a clone can be retained for the hook.
  • The real (Docker) emulator harness still has no per-endpoint TCP/DNS connectivity-block hook; this scenario is covered against the in-memory emulator. A real-emulator block hook + live test remains a possible future enhancement (noted in the changelog).
  • Verified: cargo fmt, cargo clippy (default and --all-features), and cargo test --all-features for azure_data_cosmos_driver all pass.

Fixes #4622

Adds an in-memory-emulator integration test exercising the real account-level connectivity-probe-gated failback path end to end (PR Azure#4604 / issue Azure#4622): a marked regional endpoint is failed back only after a real connectivity probe succeeds, and a failed probe resets the cooldown so it is not immediately re-probed.

The existing unit tests only cover this state machine with injected fake probe closures; this test drives the driver's real probe closure against the emulator (connection-blocked vs. healthy via region-scoped fault injection). Failback is driven deterministically through new doc-hidden, internal-feature-gated test hooks on CosmosDriver (run_endpoint_probe_once_for_testing, mark_region_endpoint_unavailable_for_testing, is_endpoint_host_marked_unavailable_for_testing) plus a short endpoint_unavailability_ttl, so it never waits on the 60s probe loop.

No production behavior changes; all hooks compile out unless the test / __internal_in_memory_emulator feature is enabled.

Fixes Azure#4622

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 18, 2026 23:04
@NaluTripician NaluTripician requested a review from a team as a code owner June 18, 2026 23:04
@github-actions github-actions Bot added the Cosmos The azure_cosmos crate label Jun 18, 2026

Copilot AI left a comment

Contributor

Pull request overview

This PR adds an in-memory-emulator integration test that exercises the real connectivity-probe-gated account endpoint failback path end to end, closing the follow-up issue #4622 from the probe-gating work in #4604. Previously, the probe-gated failback state machine was only covered by unit tests using injected fake probe closures; this test drives the driver's actual probe closure against the emulator under simulated regional outage and recovery.

To make the test deterministic (instead of waiting on the 60-second background probe loop), the PR adds three doc-hidden, internal-feature-gated *_for_testing hooks on CosmosDriver and a supporting hook on LocationStateStore. The only non-test production change is a behavior-identical restructuring of the probe-closure construction so a clone can be retained for the test hook, wrapped in a small Debug-implementing newtype because CosmosDriver derives Debug.
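The Debug newtype mentioned here is a common Rust pattern: closures do not implement `Debug`, so a struct that derives `Debug` cannot hold one directly. A minimal sketch follows, assuming a synchronous probe signature; `ProbeFnHandle` and `Driver` are hypothetical names, and the real probe closure in the driver likely has a different (probably async) signature.

```rust
use std::fmt;
use std::sync::Arc;

/// Hypothetical wrapper around a shared probe closure.
/// Cloning it only bumps the Arc reference count.
#[derive(Clone)]
pub struct ProbeFnHandle(Arc<dyn Fn(&str) -> bool + Send + Sync>);

impl ProbeFnHandle {
    /// Wraps any thread-safe closure taking a host name.
    pub fn new(f: impl Fn(&str) -> bool + Send + Sync + 'static) -> Self {
        Self(Arc::new(f))
    }

    /// Invokes the wrapped probe for `host`.
    pub fn call(&self, host: &str) -> bool {
        (self.0)(host)
    }
}

/// Manual Debug impl: prints a placeholder instead of the closure body.
impl fmt::Debug for ProbeFnHandle {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str("ProbeFnHandle(<closure>)")
    }
}

/// Hypothetical owner type; deriving Debug now compiles because every
/// field, including the wrapped closure, implements Debug.
#[derive(Debug)]
pub struct Driver {
    pub name: String,
    pub probe: ProbeFnHandle,
}
```

Keeping the closure behind an `Arc` is what lets one clone go to the background loop while another is retained for a test hook, without changing what the closure does.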

Changes:

  • New endpoint_probe_failback.rs integration test covering the three behaviors from #4622 (stays out of rotation while blocked, fails back only after a successful probe, cooldown reset on a failed probe).
  • New internal *_for_testing hooks (run_endpoint_probe_once_for_testing, mark_region_endpoint_unavailable_for_testing, is_endpoint_host_marked_unavailable_for_testing) following the existing convention.
  • Behavior-identical refactor of the endpoint probe-closure construction in CosmosDriver::new to retain a clone for tests.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| tests/in_memory_emulator_tests/mod.rs | Registers the new endpoint_probe_failback module, gated on fault_injection (consistent with sibling fault-injection tests). |
| tests/in_memory_emulator_tests/endpoint_probe_failback.rs | New integration test driving the real probe through outage/recovery phases. |
| src/driver/routing/location_state_store.rs | Adds a pub(crate) test hook to seed an endpoint as unavailable using the live snapshot's endpoint object. |
| src/driver/cosmos_driver.rs | Adds Debug newtype + stored probe-fn clone, three doc-hidden test hooks, and a behavior-identical probe-closure restructuring. |
| CHANGELOG.md | Adds an "Other Changes" entry documenting the test and the real-emulator follow-up note. |

Comment thread sdk/cosmos/azure_data_cosmos_driver/CHANGELOG.md Outdated
…s-probe-gated-failback-test

# Conflicts:
#	sdk/cosmos/azure_data_cosmos_driver/CHANGELOG.md

Labels

Cosmos The azure_cosmos crate

Projects

Status: Todo

Development

Successfully merging this pull request may close these issues.

Add live test for Cosmos probe-gated endpoint failback

2 participants