Fix FwLite OAuth: stop wiping refresh-token cache on transient errors#2308
Conversation
📝 WalkthroughWalkthroughThis PR refactors OAuthClient to add semaphore-guarded concurrency control and explicit MSAL error classification, preventing transient failures from permanently wiping cached credentials. It introduces IsSignedIn() to distinguish offline users from unauthenticated ones, adds a new Offline sync status, and updates downstream consumers to report accurate error states. ChangesOAuth Client Refactoring & Error Handling
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning Review ran into problems🔥 ProblemsGit: Failed to clone repository. Please run the Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
The latest updates on your projects. Learn more about Argos notifications ↗︎
|
UI unit Tests 1 files 62 suites 31s ⏱️ Results for commit 09ae216. ♻️ This comment has been updated with latest results. |
93f753a to
4d4a1e5
Compare
C# FwHeadless Unit Tests48 tests 48 ✅ 16s ⏱️ Results for commit 09ae216. ♻️ This comment has been updated with latest results. |
C# Unit Tests165 tests 165 ✅ 19s ⏱️ Results for commit 09ae216. ♻️ This comment has been updated with latest results. |
The comment said "all 3 scopes are required" but the list had 4 entries. The "3" referred to profile/openid/offline_access (the MSAL workaround); sendandreceive was added later as the backend API permission scope. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GetAuth used to call _application.RemoveAsync(account) on any caught exception, which destroys the persistent MSAL refresh token. A single network blip, cancellation, or unexpected error during silent refresh therefore permanently logged the user out — far more often than the 14-day refresh-token TTL would suggest. Narrow the policy: only remove the account on definitive failures (MsalUiRequired, multiple_matching_tokens_detected). Rethrow cancellations. Keep the cache on everything else and let the next call retry. Centralise the removal path so it also publishes AuthenticationChangedEvent for subscribers (e.g. SignalR teardown). Refs #2306 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two callers can reach the silent-refresh path simultaneously (SignalR's AccessTokenProvider keepalive + HttpClientRefreshDelegate's fire-and-forget refresh on the lexbox-jwt-updated header). With forceRefresh:true the in-MSAL dedupe is bypassed, both calls hit the token endpoint with the same refresh token, and OpenIddict's rolling-RT replay detection revokes the entire chain — the next call legitimately gets MsalUiRequired and the user is logged out. Wrap the slow path of GetAuth in a SemaphoreSlim with a double-check so the second caller either piggybacks on the first's fresh _authResult or refreshes serially against the rotated RT. Refs #2306 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds two test files plus the minimal seams in OAuthClient to enable them: - OAuthClientFailureClassifierTests asserts the per-exception policy in ClassifySilentAuthFailure against every realistic exception MSAL or the HTTP stack can surface. - OAuthClientSilentRefreshTests exercises the full GetAuth → catch/switch → RemoveAccountAsync glue end-to-end with a mocked IPublicClientApplication, covering the three SilentAuthFailureOutcome branches. Production-code changes are limited to test seams: an internal ctor that skips persistent-cache wiring, an internal virtual AcquireTokenSilentAsync, and promoting GetAuth from private to internal. Refs #2306 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AuthConfig.CacheFileName defaults to Path.GetFullPath("msal.json"),
which resolves relative to CWD at config-instantiation time. FwLiteWeb
was only stable because Program.cs chdirs to the assembly directory as
its first statement — any future startup-order change would silently
move the cache and make every user appear logged out. PostConfigure
the path explicitly against AppContext.BaseDirectory.
Refs #2306
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…th classifier GetAuth takes no CancellationToken, so any OCE is MSAL-internal (an HttpClient timeout surfacing as TaskCanceledException on a flaky network) and therefore transient. Rethrowing surprised call sites that never expect GetAuth to throw (GetLexboxProjects builds its HttpClient outside its try/catch; the SignalR Reconnecting handler), since the pre-branch code swallowed every exception. Let OCE fall through to KeepCachedCredentials and drop the now-dead Rethrow arm. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…boundary Add OAuthClient.IsSignedIn (a purely local MSAL cache read) and a new Offline member to ProjectSyncStatusErrorCode. At the user-facing "you're not logged in" sites where CreateHttpClient returns null, check IsSignedIn so a signed-in user who is merely offline is told they're offline rather than logged out: GetLexboxSyncStatus, AwaitLexboxSyncFinished, and SyncService.ExecuteSync. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The cache was pinned to AppContext.BaseDirectory (the binary directory). Platform.Bible extracts each extension version to a versioned cache dir (~/.platform.bible/cache/extensions/<name>_<version>/), so a binary-relative cache is orphaned on every update — the user is silently logged out per update — and a cache/ tree is also wipeable by cache cleaning. Store msal.json under LocalApplicationData/SIL/FwLiteWeb instead (resolves to %LOCALAPPDATA% on Windows, ~/.local/share on Linux, ~/Library/Application Support on macOS), mirroring FwLiteMaui's stable-per-user pattern. Create the directory up front since MsalCacheHelper needs a fully-qualified path with an existing parent. Best-effort migrate an existing binary-relative msal.json so standalone FwLiteWeb users are not logged out by this change. Refs #2306 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ogout On a transient silent-refresh failure GetAuth kept the in-memory _authResult, which is only ever reached when that result is null or expired — so callers got a stale bearer and 401s. Null it out (the refresh token stays in the MSAL cache, so the next GetAuth retries). Also take _authSemaphore in Logout so an in-flight AcquireTokenSilent cannot repopulate _authResult and silently sign the user back in. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…pData PostConfigure unconditionally overwrote Auth:CacheFileName, ignoring an explicit config value; skip the default when one is set. Also fall back to the binary dir when LocalApplicationData resolves empty, keeping the cache path fully qualified (BuildCacheProperties rejects relative paths). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ailure The pre-expiry refresh window (<5 min left) reaches the same transient- failure branch as an actually-expired token; dropping a still-valid token there spuriously reports logged-out (stopping the SignalR listener) during a network blip. Only clear the in-memory result once it has truly expired -- the freshness fast-path already guarantees a retained near-expiry token keeps retrying refresh on every call. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…h lock RemoveAccountAsync published AuthenticationChangedEvent while GetAuth still held _authSemaphore. Subject.OnNext dispatches synchronously, so a subscriber re-entering GetAuth/Logout would self-deadlock on the non-reentrant semaphore. Mirror Logout: GetAuth now captures the result under the lock, releases, then publishes outside it only when an account was removed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1078705 to
09ae216
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@backend/FwLite/FwLiteShared/Auth/OAuthClient.cs`:
- Around line 224-226: The current _logger.LogWarning call in OAuthClient.cs
includes raw usernames and HomeAccountId identifiers via string.Join over
accounts which leaks PII; change the log to avoid raw identifiers by logging
only the count of cached accounts or redacted identifiers (e.g., account index
or hashed/partial redaction) instead of a full username or HomeAccountId.
Specifically update the _logger.LogWarning invocation (the call that passes 'e'
and the "Silent token acquisition..." message and the
string.Join(accounts.Select(...))) to replace string.Join(...) with either
accounts.Count() or a safe projection like accounts.Select((a,i) =>
$"account#{i}") or a redacted substring/hash, ensuring no full usernames or
HomeAccountId.Identifier values are emitted. Ensure formatting and exception
parameter 'e' remain unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 41983a8c-d5ba-473b-8343-1f64734a65c7
📒 Files selected for processing (8)
backend/FwLite/FwLiteShared.Tests/Auth/OAuthClientFailureClassifierTests.csbackend/FwLite/FwLiteShared.Tests/Auth/OAuthClientIsSignedInTests.csbackend/FwLite/FwLiteShared.Tests/Auth/OAuthClientSilentRefreshTests.csbackend/FwLite/FwLiteShared/Auth/OAuthClient.csbackend/FwLite/FwLiteShared/Projects/LexboxProjectService.csbackend/FwLite/FwLiteShared/Sync/SyncService.csbackend/LexCore/Sync/SyncStatus.csfrontend/viewer/src/lib/dotnet-types/generated-types/LexCore/Sync/ProjectSyncStatusErrorCode.ts
hahn-kev
left a comment
There was a problem hiding this comment.
Looks good to me! I tested it out, I was still logged in, and had no trouble logging out and logging back in again.
Resolves #2306 — root-cause analysis lives there. This PR fixes all three client-side causes:
MsalUiRequired/multiple_matching_tokenslog out), and a still-valid access token survives a transient refresh failure instead of being discarded.GetAuthis serialized (semaphore + double-check) so concurrent force-refreshes can''t trip rolling-RT replay detection;Logouttakes the same lock so an in-flight refresh can''t sign the user back in; auth events publish after the lock is released.Supporting change:
IsSignedIn(local-only cache read) lets sync/status report a newOfflineerror code instead of a misleading not-logged-in state.Verified end-to-end with
FW_LITE_CHAOS=trueand a 30s token lifetime: 50/50 forced refreshes stayed signed in; reverting cause 1 logged the user out by poll #10. The remaining known logout cause (production cert rotation) is #2332.🤖 Generated with Claude Code