Resilience improvements for instance discovery #5811
Pull request overview
Improves AAD instance discovery resilience and performance by avoiding repeated network instance discovery attempts when the endpoint is failing/unreachable, and by bounding discovery latency with a dedicated timeout.
Changes:
- Cache a fallback instance discovery entry when network instance discovery fails (non-invalid_instance) to avoid retrying discovery on subsequent token requests.
- Add a per-instance-discovery timeout (default 10s) by linking a timeout CancellationToken into the discovery request flow.
- Add unit tests covering caching-on-failure and timeout fallback behavior; add a rules doc for cross-SDK reference.
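The two behavioral changes listed above (a linked timeout token and fallback caching on failure) might look roughly like the sketch below. This is not the PR's code: `DiscoverySketch`, `InvalidInstanceException`, the string-valued cache, and the `fetch` delegate are hypothetical stand-ins, and the choice to propagate caller cancellation without caching is an assumption. Only the 10-second default and the linking of a timeout CancellationToken come from the summary.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical marker for an invalid_instance response (not an MSAL type).
public sealed class InvalidInstanceException : Exception { }

// Hypothetical stand-in for a discovery manager; names are illustrative only.
public sealed class DiscoverySketch
{
    // Default taken from the PR summary (10s).
    private static readonly TimeSpan DiscoveryTimeout = TimeSpan.FromSeconds(10);

    private readonly ConcurrentDictionary<string, string> _cache =
        new ConcurrentDictionary<string, string>();
    private readonly Func<string, CancellationToken, Task<string>> _fetch;

    // Counts how many times the network delegate was invoked.
    public int NetworkCalls;

    public DiscoverySketch(Func<string, CancellationToken, Task<string>> fetch)
    {
        _fetch = fetch;
    }

    public async Task<string> GetMetadataAsync(string host, CancellationToken userToken)
    {
        // A cached entry (real or fallback) short-circuits the network call.
        if (_cache.TryGetValue(host, out string cached))
        {
            return cached;
        }

        // Link the caller's token with a dedicated discovery timeout.
        using (var linked = CancellationTokenSource.CreateLinkedTokenSource(userToken))
        {
            linked.CancelAfter(DiscoveryTimeout);

            string entry;
            try
            {
                Interlocked.Increment(ref NetworkCalls);
                entry = await _fetch(host, linked.Token).ConfigureAwait(false);
            }
            catch (OperationCanceledException) when (userToken.IsCancellationRequested)
            {
                // Assumption: caller cancellation propagates and is not cached.
                throw;
            }
            catch (InvalidInstanceException)
            {
                // invalid_instance is excluded from fallback caching.
                throw;
            }
            catch (Exception)
            {
                // Timeout, 404, 502, unreachable host: use a fallback entry.
                entry = "fallback:" + host;
            }

            _cache[host] = entry;
            return entry;
        }
    }
}
```

With a `fetch` delegate that always fails, the first call in this sketch stores the fallback entry and later calls for the same host return it without invoking the delegate again.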
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/Microsoft.Identity.Test.Unit/PublicApiTests/InstanceDiscoveryTests.cs | Adds tests ensuring instance discovery failures/timeouts are cached and not retried. |
| src/client/Microsoft.Identity.Client/Internal/RequestContext.cs | Makes UserCancellationToken settable to support temporary override during instance discovery. |
| src/client/Microsoft.Identity.Client/Instance/Discovery/NetworkMetadataProvider.cs | Adds instance discovery timeout and links it into the outgoing request. |
| src/client/Microsoft.Identity.Client/Instance/Discovery/InstanceDiscoveryManager.cs | Caches fallback metadata on discovery failure; updates warning text. |
| docs/instance-discovery-rules.md | Adds a detailed description of instance discovery behavior and error-handling rules. |
Looks good, one follow-up test needed.
    httpManager.AddMockHandler(new MockHttpMessageHandler()
    {
        ExpectedMethod = HttpMethod.Get,
        ExceptionToThrow = new TaskCanceledException("simulated timeout")
    });
This test simulates an instance discovery timeout, but the mock GET handler does not set ExpectedUrl / expected query params. Adding them would make the test assert that the failing call is actually the instance discovery endpoint (and not some other GET) before verifying the fallback caching behavior.
    httpManager.AddMockHandler(new MockHttpMessageHandler()
    {
        ExpectedMethod = HttpMethod.Get,
        ResponseMessage = new HttpResponseMessage(HttpStatusCode.OK)
        {
            Content = new StringContent("{}")
        },
        AdditionalRequestValidation = _ => callerCts.Cancel()
    });
The instance discovery mock here cancels the caller token via AdditionalRequestValidation, but it does not constrain the request target. Setting ExpectedUrl (and expected query params) would ensure the cancellation is happening during the intended instance discovery call and not another GET in the flow.
    httpManager.AddMockHandler(new MockHttpMessageHandler()
    {
        ExpectedMethod = HttpMethod.Get,
        ResponseMessage = new HttpResponseMessage(errorStatusCode)
        {
            Content = new StringContent("error")
        }
    });
The instance-discovery mock handler is very permissive (only ExpectedMethod = GET), so any unexpected GET (e.g., to a different endpoint) could consume this handler and make the test less precise. Consider also setting ExpectedUrl (and, if feasible, the expected query params) for the instance discovery request so the test validates the correct endpoint is called before the failure is cached.
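To illustrate why pinning the URL matters, here is a minimal, self-contained sketch of a mock handler that rejects any request not aimed at the expected endpoint. `StrictMockHandler` is a hypothetical stand-in written for this example, not MSAL's `MockHttpMessageHandler`, and its matching rule (compare scheme, host, and path while ignoring the query string) is an assumption.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for a test mock; not MSAL's MockHttpMessageHandler.
public sealed class StrictMockHandler : HttpMessageHandler
{
    public HttpMethod ExpectedMethod { get; set; } = HttpMethod.Get;

    // Scheme + host + path of the only endpoint this handler will serve.
    public string ExpectedUrl { get; set; }

    public HttpResponseMessage ResponseMessage { get; set; }

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        if (request.Method != ExpectedMethod)
        {
            throw new InvalidOperationException("Unexpected method: " + request.Method);
        }

        // Compare everything up to the path; the query string is ignored here.
        string actual = request.RequestUri.GetLeftPart(UriPartial.Path);
        if (!string.Equals(actual, ExpectedUrl, StringComparison.OrdinalIgnoreCase))
        {
            throw new InvalidOperationException("Unexpected URL: " + actual);
        }

        return Task.FromResult(ResponseMessage);
    }
}

public static class StrictMockHandlerDemo
{
    // Sends one GET through the handler and returns the status code.
    public static HttpStatusCode Send(StrictMockHandler handler, string url)
    {
        using (var invoker = new HttpMessageInvoker(handler, disposeHandler: false))
        using (var request = new HttpRequestMessage(HttpMethod.Get, url))
        {
            return invoker.SendAsync(request, CancellationToken.None)
                          .GetAwaiter().GetResult().StatusCode;
        }
    }
}
```

A handler configured with `ExpectedUrl = "https://login.microsoftonline.com/common/discovery/instance"` would serve a GET to that path and throw for any other GET, so a test built on it fails loudly if some other request consumes the mock.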
Set ExpectedUrl on all instance discovery mock handlers to ensure the tests validate that the correct endpoint is called before verifying fallback/caching behavior.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fixes #5804 #5805
If instance discovery fails due to 404 or 502, it should not be attempted again
Instance discovery should have a reasonable timeout