Fix for #5804 - MSAL discovery resilience#5806
Conversation
| @@ -0,0 +1,324 @@ | |||
| # Instance Discovery Rules for MSAL | |||
There was a problem hiding this comment.
I generated these with the AI. I hope it'll help it next time. Also to be used with other MSALs.
There was a problem hiding this comment.
Pull request overview
Fixes #5804 by ensuring instance discovery failures (e.g., 404/502/network errors) are cached so MSAL doesn’t repeatedly hit the discovery endpoint on subsequent token acquisitions.
Changes:
- Cache the fallback instance discovery entry in the network metadata cache when the network discovery call fails (non-
invalid_instance). - Add a unit test validating that instance discovery failures are not retried on subsequent calls (including a forced STS call via different scopes).
- Add detailed documentation describing MSAL.NET instance discovery rules and error-handling/caching behavior.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/Microsoft.Identity.Test.Unit/PublicApiTests/InstanceDiscoveryTests.cs | Adds regression test verifying discovery failures are cached and not retried across subsequent token requests. |
| src/client/Microsoft.Identity.Client/Instance/Discovery/InstanceDiscoveryManager.cs | Caches fallback metadata entry after discovery failures to prevent repeated network instance discovery calls. |
| docs/instance-discovery-rules.md | Adds a rules/spec-style document describing instance discovery flows, caching, and error handling. |
You can also share your feedback on Copilot code review. Take the survey.
…overyTests.cs Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
You can also share your feedback on Copilot code review. Take the survey.
| httpManager.AddMockHandler(new MockHttpMessageHandler() | ||
| { | ||
| ExpectedMethod = HttpMethod.Get, | ||
| ResponseMessage = new HttpResponseMessage(errorStatusCode) | ||
| { | ||
| Content = new StringContent("error") | ||
| } | ||
| }); |
There was a problem hiding this comment.
The mock handlers for the instance discovery failure only assert the HTTP method (GET) and not the discovery endpoint URL/query parameters. This makes the test less precise (any unexpected GET could satisfy the handler). Consider setting ExpectedUrl (and optionally ExpectedQueryParams for api-version/authorization_endpoint) to the computed instance discovery endpoint (default trusted host) so the test verifies the intended call is what gets cached/not retried.
| ## 4. Discovery Endpoint Selection | ||
|
|
||
| When a network call is needed, MSAL must choose which host to call: | ||
|
|
||
| | Authority Host | Discovery Endpoint Host | | ||
| |---|---| | ||
| | Known environment (e.g., `login.microsoftonline.com`) | Same host: `https://{authority_host}/common/discovery/instance` | | ||
| | Unknown environment (e.g., `login.microsoft.new`) | Fallback to default trusted host: `https://login.microsoftonline.com/common/discovery/instance` | | ||
|
|
||
| The query parameters are: | ||
| ``` | ||
| GET https://{discovery_host}/common/discovery/instance | ||
| ?api-version=1.1 | ||
| &authorization_endpoint=https://{authority_host}/{tenant}/oauth2/v2.0/authorize | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ## 5. Provider Resolution Order | ||
|
|
||
| ### 5.1 Full Flow (GetMetadataEntryAsync — used during token acquisition) | ||
|
|
||
| Providers are consulted in strict priority order. The first non-null result wins. | ||
|
|
||
| ``` | ||
| 1. [If instance discovery is disabled (WithInstanceDiscovery(false))] | ||
| → Check region discovery provider (regions are not affected by the disable flag) | ||
| → If still null, return self-entry. STOP. | ||
|
|
||
| 2. Region discovery provider | ||
|
|
||
| 3. Network call (FetchNetworkMetadataOrFallback) | ||
| → On success: cache all entries by alias in the network cache | ||
| → On "invalid_instance" error: see §6 | ||
| → On any other error (404, 502, network failure, etc.): see §7 |
There was a problem hiding this comment.
This doc aims to describe MSAL.NET’s instance discovery rules for reimplementation, but it currently omits two behaviors that affect resolution order/endpoint selection: (1) user-supplied instance metadata via WithInstanceDiscoveryMetadata(...) (which short-circuits providers), and (2) user-supplied instance discovery URI (_userProvidedInstanceDiscoveryUri / WithInstanceDiscoveryUri) which overrides the default discovery endpoint host selection. Please document these cases so the rules are complete and match the implementation.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.
You can also share your feedback on Copilot code review. Take the survey.
|
@copilot open a new pull request to apply changes based on the comments in this thread |
|
@bgavrilMS I've opened a new pull request, #5847, to work on those changes. Once the pull request is ready, I'll request review from you. |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
You can also share your feedback on Copilot code review. Take the survey.
| | Authority Type | Instance Discovery Supported | | ||
| |---|---| | ||
| | AAD (Entra ID) | ✅ Yes | | ||
| | B2C | ❌ No — use self-entry | | ||
| | ADFS | ❌ No — use self-entry | | ||
| | CIAM / dSTS | ❌ No — use self-entry | | ||
|
|
There was a problem hiding this comment.
Several markdown tables start with || (e.g., || Authority Type | ... |), which introduces an empty first column and typically renders as a malformed table. Use standard table syntax starting with a single | for each row (and update the separator row accordingly) to ensure the tables render correctly.
| ?? CreateEntryForSingleAuthority(authorityUri); | ||
| ?? CreateEntryForSingleAuthority(authorityUri); | ||
|
|
||
| _networkCacheMetadataProvider.AddMetadata(authorityUri.Host, fallbackEntry); |
There was a problem hiding this comment.
When fallbackEntry comes from KnownMetadataProvider, it can contain multiple aliases (e.g., public cloud includes login.microsoftonline.com, login.windows.net, sts.windows.net, etc.). Caching it only under authorityUri.Host means requests using a different alias may still miss the network cache and re-trigger network instance discovery (especially when the known provider is bypassed due to an unknown environment in cache). Consider caching the fallback entry for each alias in fallbackEntry.Aliases (similar to NetworkMetadataProvider.CacheInstanceDiscoveryMetadata) rather than only for authorityUri.Host.
| _networkCacheMetadataProvider.AddMetadata(authorityUri.Host, fallbackEntry); | |
| if (fallbackEntry.Aliases != null && fallbackEntry.Aliases.Length > 0) | |
| { | |
| foreach (var alias in fallbackEntry.Aliases) | |
| { | |
| _networkCacheMetadataProvider.AddMetadata(alias, fallbackEntry); | |
| } | |
| } | |
| else | |
| { | |
| _networkCacheMetadataProvider.AddMetadata(authorityUri.Host, fallbackEntry); | |
| } |
|
Closing in favor of #5811, as that PR appears to cover the same issue and some others. |
Pull request was closed
Fixes #5804
Setup the SDK to use an Entra authority host that is not known to the SDK, e.g. login.microsoft.new
Block access to "login.microsoftonline.com/discovery" endpoint by having it return 404 or 502
Acquire token, e.g. S2S token repeatedly (from cache and from STS)
Examine the HTTP traffic
Actual: the SDK will perform instance discovery on the global endpoint on each token request
Expected: after instance discovery fails with 404 or 502, the SDK should stop calling instance discovery