Skip to content

Fix for #5804 - MSAL discovery resilience#5806

Closed
bgavrilMS wants to merge 11 commits into
mainfrom
bogavril/2804
Closed

Fix for #5804 - MSAL discovery resilience#5806
bgavrilMS wants to merge 11 commits into
mainfrom
bogavril/2804

Conversation

@bgavrilMS
Copy link
Copy Markdown
Member

@bgavrilMS bgavrilMS commented Mar 4, 2026

Fixes #5804

Setup the SDK to use an Entra authority host that is not known to the SDK, e.g. login.microsoft.new
Block access to "login.microsoftonline.com/discovery" endpoint by having it return 404 or 502
Acquire token, e.g. S2S token repeatedly (from cache and from STS)
Examine the HTTP traffic

Actual: the SDK will perform instance discovery on the global endpoint on each token request
Expected: after instance discovery fails with 404 or 502, the SDK should stop calling instance discovery

@bgavrilMS bgavrilMS requested a review from a team as a code owner March 4, 2026 14:07
@@ -0,0 +1,324 @@
# Instance Discovery Rules for MSAL
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I generated these with the AI. I hope it'll help it next time. Also to be used with other MSALs.

Copilot AI review requested due to automatic review settings March 6, 2026 12:01
@bgavrilMS bgavrilMS added this to the 4.83.0 milestone Mar 6, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes #5804 by ensuring instance discovery failures (e.g., 404/502/network errors) are cached so MSAL doesn’t repeatedly hit the discovery endpoint on subsequent token acquisitions.

Changes:

  • Cache the fallback instance discovery entry in the network metadata cache when the network discovery call fails (non-invalid_instance).
  • Add a unit test validating that instance discovery failures are not retried on subsequent calls (including a forced STS call via different scopes).
  • Add detailed documentation describing MSAL.NET instance discovery rules and error-handling/caching behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
tests/Microsoft.Identity.Test.Unit/PublicApiTests/InstanceDiscoveryTests.cs Adds regression test verifying discovery failures are cached and not retried across subsequent token requests.
src/client/Microsoft.Identity.Client/Instance/Discovery/InstanceDiscoveryManager.cs Caches fallback metadata entry after discovery failures to prevent repeated network instance discovery calls.
docs/instance-discovery-rules.md Adds a rules/spec-style document describing instance discovery flows, caching, and error handling.

You can also share your feedback on Copilot code review. Take the survey.

Comment thread tests/Microsoft.Identity.Test.Unit/PublicApiTests/InstanceDiscoveryTests.cs Outdated
@bgavrilMS bgavrilMS changed the title Fix for #2804 Fix for #5804 - instance metadata Mar 6, 2026
@bgavrilMS bgavrilMS changed the title Fix for #5804 - instance metadata Fix for #5804 - MSAL discovery resilience Mar 6, 2026
…overyTests.cs

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 6, 2026 12:13
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.


You can also share your feedback on Copilot code review. Take the survey.

Comment on lines +182 to +189
httpManager.AddMockHandler(new MockHttpMessageHandler()
{
ExpectedMethod = HttpMethod.Get,
ResponseMessage = new HttpResponseMessage(errorStatusCode)
{
Content = new StringContent("error")
}
});
Copy link

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mock handlers for the instance discovery failure only assert the HTTP method (GET) and not the discovery endpoint URL/query parameters. This makes the test less precise (any unexpected GET could satisfy the handler). Consider setting ExpectedUrl (and optionally ExpectedQueryParams for api-version/authorization_endpoint) to the computed instance discovery endpoint (default trusted host) so the test verifies the intended call is what gets cached/not retried.

Copilot uses AI. Check for mistakes.
Comment on lines +74 to +108
## 4. Discovery Endpoint Selection

When a network call is needed, MSAL must choose which host to call:

| Authority Host | Discovery Endpoint Host |
|---|---|
| Known environment (e.g., `login.microsoftonline.com`) | Same host: `https://{authority_host}/common/discovery/instance` |
| Unknown environment (e.g., `login.microsoft.new`) | Fallback to default trusted host: `https://login.microsoftonline.com/common/discovery/instance` |

The query parameters are:
```
GET https://{discovery_host}/common/discovery/instance
?api-version=1.1
&authorization_endpoint=https://{authority_host}/{tenant}/oauth2/v2.0/authorize
```

---

## 5. Provider Resolution Order

### 5.1 Full Flow (GetMetadataEntryAsync — used during token acquisition)

Providers are consulted in strict priority order. The first non-null result wins.

```
1. [If instance discovery is disabled (WithInstanceDiscovery(false))]
→ Check region discovery provider (regions are not affected by the disable flag)
→ If still null, return self-entry. STOP.

2. Region discovery provider

3. Network call (FetchNetworkMetadataOrFallback)
→ On success: cache all entries by alias in the network cache
→ On "invalid_instance" error: see §6
→ On any other error (404, 502, network failure, etc.): see §7
Copy link

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doc aims to describe MSAL.NET’s instance discovery rules for reimplementation, but it currently omits two behaviors that affect resolution order/endpoint selection: (1) user-supplied instance metadata via WithInstanceDiscoveryMetadata(...) (which short-circuits providers), and (2) user-supplied instance discovery URI (_userProvidedInstanceDiscoveryUri / WithInstanceDiscoveryUri) which overrides the default discovery endpoint host selection. Please document these cases so the rules are complete and match the implementation.

Copilot uses AI. Check for mistakes.
@gladjohn gladjohn modified the milestones: 4.83.0, 4.84.0 Mar 9, 2026
Copilot AI review requested due to automatic review settings March 11, 2026 20:53
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.


You can also share your feedback on Copilot code review. Take the survey.

@bgavrilMS bgavrilMS requested review from Avery-Dunn March 12, 2026 11:49
@bgavrilMS
Copy link
Copy Markdown
Member Author

@copilot open a new pull request to apply changes based on the comments in this thread

@bgavrilMS bgavrilMS enabled auto-merge (squash) March 12, 2026 11:52
Copy link
Copy Markdown
Contributor

Copilot AI commented Mar 12, 2026

@bgavrilMS I've opened a new pull request, #5847, to work on those changes. Once the pull request is ready, I'll request review from you.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 12, 2026 14:50
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.


You can also share your feedback on Copilot code review. Take the survey.

Comment on lines +41 to +47
| Authority Type | Instance Discovery Supported |
|---|---|
| AAD (Entra ID) | ✅ Yes |
| B2C | ❌ No — use self-entry |
| ADFS | ❌ No — use self-entry |
| CIAM / dSTS | ❌ No — use self-entry |

Copy link

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Several markdown tables start with || (e.g., || Authority Type | ... |), which introduces an empty first column and typically renders as a malformed table. Use standard table syntax starting with a single | for each row (and update the separator row accordingly) to ensure the tables render correctly.

Copilot uses AI. Check for mistakes.
?? CreateEntryForSingleAuthority(authorityUri);
?? CreateEntryForSingleAuthority(authorityUri);

_networkCacheMetadataProvider.AddMetadata(authorityUri.Host, fallbackEntry);
Copy link

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When fallbackEntry comes from KnownMetadataProvider, it can contain multiple aliases (e.g., public cloud includes login.microsoftonline.com, login.windows.net, sts.windows.net, etc.). Caching it only under authorityUri.Host means requests using a different alias may still miss the network cache and re-trigger network instance discovery (especially when the known provider is bypassed due to an unknown environment in cache). Consider caching the fallback entry for each alias in fallbackEntry.Aliases (similar to NetworkMetadataProvider.CacheInstanceDiscoveryMetadata) rather than only for authorityUri.Host.

Suggested change
_networkCacheMetadataProvider.AddMetadata(authorityUri.Host, fallbackEntry);
if (fallbackEntry.Aliases != null && fallbackEntry.Aliases.Length > 0)
{
foreach (var alias in fallbackEntry.Aliases)
{
_networkCacheMetadataProvider.AddMetadata(alias, fallbackEntry);
}
}
else
{
_networkCacheMetadataProvider.AddMetadata(authorityUri.Host, fallbackEntry);
}

Copilot uses AI. Check for mistakes.
@Avery-Dunn
Copy link
Copy Markdown
Contributor

Closing in favor of #5811, as that PR appears to cover the same issue and some others.

@Avery-Dunn Avery-Dunn closed this Apr 1, 2026
auto-merge was automatically disabled April 1, 2026 18:39

Pull request was closed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] If instance discovery fails due to 404 or 502, it should not be attempted again

5 participants