Detect orphaned KG certs via public key modulus comparison#6020
Detect orphaned KG certs via public key modulus comparison#6020Robbie-Microsoft wants to merge 5 commits into
Conversation
After a VM or app restart, KeyGuard regenerates the per-boot private key in the same CNG container while the old cert remains on disk. The cert appears valid (time-wise, HasPrivateKey=true) but the key material has changed. Fix: compare the cert's embedded public key modulus against the public key currently exported from the CNG container. A mismatch conclusively means the container was regenerated and the cert is orphaned. The check runs inside WindowsPersistentCertificateCache.Read during candidate selection, so the loop continues looking for a valid cert rather than returning an unusable one. Non-CNG keys (software CSP) are accepted on faith since KeyGuard-style regeneration does not apply to them. CryptographicException during key load is treated as orphaned (key inaccessible = unusable). The CNG container modified-time heuristic (Check 3 from the original proposal) is intentionally omitted. Both it and the modulus comparison detect the same event, but the modulus check is definitive: independently generated RSA keys sharing a modulus is computationally infeasible. The modified-time check is a heuristic with a known false-negative window and adds no additional coverage. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Adds a Windows KeyGuard-specific “orphaned cert” detection to MSAL’s persistent mTLS certificate selection logic by validating that the certificate’s embedded RSA public key matches the current RSA public key in the associated CNG container. This prevents MSAL from reusing a persisted cert whose key container was regenerated after reboot, avoiding repeated SCHANNEL handshake failures.
Changes:
- Add modulus/exponent comparison helpers (
IsCertKeyOrphaned,PublicKeyMatchesCert) to detect CNG-container/key mismatches. - Update
WindowsPersistentCertificateCache.Readto skip orphaned candidates and continue searching for a usable cert. - Add unit tests covering the new helper methods.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/Microsoft.Identity.Test.Unit/ManagedIdentityTests/PersistentCertificateStoreUnitTests.cs | Adds tests for orphan detection and RSA public key matching logic. |
| src/client/Microsoft.Identity.Client/ManagedIdentity/V2/WindowsPersistentCertificateCache.cs | Skips persisted cert candidates whose CNG container key no longer matches the cert’s embedded public key. |
| src/client/Microsoft.Identity.Client/ManagedIdentity/V2/MtlsCertificateCache.cs | Introduces orphan-detection helpers based on RSA modulus/exponent comparison with logging on crypto failures. |
Comments suppressed due to low confidence (1)
tests/Microsoft.Identity.Test.Unit/ManagedIdentityTests/PersistentCertificateStoreUnitTests.cs:912
- Same issue as the KeyMatches test: on
net48,RSA.Create(2048)commonly returnsRSACryptoServiceProvider, sokey2 as RSACngwill be null and the test will fail. Createkey2as anRSACngexplicitly (or gate the test onkey2 is RSACng) to keep the multi-targeted test suite passing.
using var key1 = RSA.Create(2048);
using var key2 = RSA.Create(2048);
var rsaCng2 = key2 as RSACng;
Assert.IsNotNull(rsaCng2, "Expected RSACng on Windows.");
RSA.Create(2048) does not return RSACng on ARM64 Windows (.NET 8). Use new RSACng(2048) explicitly so the cast is always valid. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When IsCertKeyOrphaned detects a cert whose CNG container key no longer matches the cert's public key (post-reboot KeyGuard regeneration), the cert is now removed from the Windows cert store instead of just skipped. Previously the orphaned cert stayed in the store indefinitely, causing a redundant modulus check on every Read call. Changes: - WindowsPersistentCertificateCache.Read: collect orphaned cert thumbprints during the scan loop; after the loop, open ReadWrite store and remove them (best-effort, wrapped in try/catch) - Added internal Read overload accepting Func<X509Certificate2, ILoggerAdapter, bool> isOrphaned delegate for testability; IPersistentCertificateCache interface unchanged - Added unit test: Read_RemovesOrphanedCert_FromStore Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
GetRSAPrivateKey() returns null for both non-RSA certs (ECDSA, DSA) and for RSA certs with an inaccessible private key. Previously both fell into the 'accept on faith' branch, which could allow a useless RSA cert to pass the orphan check. Fix: check GetRSAPublicKey() to distinguish RSA certs from non-RSA certs. An RSA cert with no accessible private key is treated as orphaned. Non-RSA certs continue to be accepted on faith (KG regeneration does not apply). Add unit test: IsCertKeyOrphaned_ReturnsTrue_For_RsaCert_WithNoPrivateKey Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
src/client/Microsoft.Identity.Client/ManagedIdentity/V2/WindowsPersistentCertificateCache.cs:190
RemoveByThumbprintsusesList<string>+thumbprints.Contains(...)inside a loop over all certificates, making removal O(storeSize * orphanCount). Since this can run on the read path after a reboot, consider using aHashSet<string>(e.g., OrdinalIgnoreCase) for thumbprints (and/orX509Certificate2Collection.Findby thumbprint) to keep the removal cost predictable.
int removed = 0;
foreach (var cert in items)
{
try
{
if (thumbprints.Contains(cert.Thumbprint))
{
store.Remove(cert);
removed++;
…ock orphan removal - Rename internal Read overload to public TryRead with out param last (public Read delegates to TryRead with the default IsCertKeyOrphaned check) - Extract SnapshotCertificates(store, logger) helper to eliminate 5 duplicate try/catch CopyTo fallback blocks across Read, Write, DeleteAllForAlias, PruneExpiredForAlias, and RemoveByThumbprints - Wrap RemoveByThumbprints in InterprocessLock.TryWithAliasLock (300ms, best-effort) — same pattern as Write/Delete/DeleteAllForAlias to prevent concurrent store writes during orphan cleanup - Update orphan removal test to call TryRead with the new signature Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Problem
After a VM or app restart, KeyGuard regenerates the per-boot private key in the same CNG container while the old cert remains on disk. The cert appears valid (time-wise,
HasPrivateKey=true) but the key material has changed. MSAL loads it from the persistent store, attempts an mTLS handshake, and fails.Fix
Compare the cert's embedded public key (modulus) against the public key currently in the CNG container. A mismatch conclusively means the container was regenerated — the cert is orphaned. The check runs inside
WindowsPersistentCertificateCache.Readduring candidate selection.Orphaned certs are removed from the store on detection (not just skipped) so they don't accumulate and don't require the modulus check on every subsequent
Read.Design notes
GetRSAPrivateKey()returning null for an RSA cert is treated as orphaned (inaccessible = unusable); non-RSA certs (ECDSA, etc.) are accepted on faithCryptographicExceptionduring key load is treated as orphaned (inaccessible = unusable)Readoverload accepts an injectableisOrphaneddelegate for testability;IPersistentCertificateCacheinterface is unchangedTesting
6 new unit tests in
PersistentCertificateStoreUnitTests.cs:IsCertKeyOrphaned_ReturnsTrue_For_NullCertIsCertKeyOrphaned_ReturnsFalse_For_ValidCertIsCertKeyOrphaned_ReturnsTrue_For_RsaCert_WithNoPrivateKeyPublicKeyMatchesCert_ReturnsTrue_When_KeyMatchesCertPublicKeyMatchesCert_ReturnsFalse_When_ModulusMismatch(passes a different RSACng to simulate post-reboot key replacement)Read_RemovesOrphanedCert_FromStore(injects always-true orphan check, verifies cert is deleted from store and Read returns false)Companion to #6017 (token_not_after OID eviction — different failure mode).