Skip to content

Fix FwLite OAuth: stop wiping refresh-token cache on transient errors#2308

Merged
myieye merged 14 commits into
developfrom
claude/fix-fwlite-auth-logout
Jun 11, 2026
Merged

Fix FwLite OAuth: stop wiping refresh-token cache on transient errors#2308
myieye merged 14 commits into
developfrom
claude/fix-fwlite-auth-logout

Conversation

@myieye

@myieye myieye commented May 28, 2026

Copy link
Copy Markdown
Collaborator

Resolves #2306 — root-cause analysis lives there. This PR fixes all three client-side causes:

  • Transient/unknown errors during silent refresh no longer wipe the MSAL account (only MsalUiRequired / multiple_matching_tokens log out), and a still-valid access token survives a transient refresh failure instead of being discarded.
  • GetAuth is serialized (semaphore + double-check) so concurrent force-refreshes can''t trip rolling-RT replay detection; Logout takes the same lock so an in-flight refresh can''t sign the user back in; auth events publish after the lock is released.
  • FwLiteWeb''s MSAL cache moved to a stable per-user location with a one-time migration off the Platform.Bible-versioned binary directory.

Supporting change: IsSignedIn (local-only cache read) lets sync/status report a new Offline error code instead of a misleading not-logged-in state.

Verified end-to-end with FW_LITE_CHAOS=true and a 30s token lifetime: 50/50 forced refreshes stayed signed in; reverting cause 1 logged the user out by poll #10. The remaining known logout cause (production cert rotation) is #2332.

🤖 Generated with Claude Code

@coderabbitai

coderabbitai Bot commented May 28, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

This PR refactors OAuthClient to add semaphore-guarded concurrency control and explicit MSAL error classification, preventing transient failures from permanently wiping cached credentials. It introduces IsSignedIn() to distinguish offline users from unauthenticated ones, adds a new Offline sync status, and updates downstream consumers to report accurate error states.

Changes

OAuth Client Refactoring & Error Handling

Layer / File(s) Summary
OAuthClient concurrency primitives and authentication flow refactoring
backend/FwLite/FwLiteShared/Auth/OAuthClient.cs
OAuthClient adds _authSemaphore for serialized GetAuth/Logout access; removes LexboxProjectService constructor dependency; introduces internal test-seam constructor; refactors GetAuth to synchronize token acquisition with expiry re-check and failure classification; rewrites Logout to acquire lock, clear cache, and publish events outside the critical section; documents IsSignedIn as local-cache-only; refactors CreateHttpClient to use object initializer for BaseAddress.
Silent auth failure classification logic
backend/FwLite/FwLiteShared/Auth/OAuthClient.cs
Defines SilentAuthFailureOutcome enum (RemoveAccount, KeepCachedCredentials) and ClassifySilentAuthFailure switch mapping MSAL exceptions to removal or retention decisions; MsalUiRequiredException and multiple_matching_tokens_detected error code trigger removal; transient/server-side failures (HTTP errors, MSAL service errors, cancellation) trigger retention; adds virtual AcquireTokenSilentAsync test seam.
OAuthClient error classification unit tests
backend/FwLite/FwLiteShared.Tests/Auth/OAuthClientFailureClassifierTests.cs
OAuthClientFailureClassifierTests verifies ClassifySilentAuthFailure behavior: MsalUiRequiredException and multiple_matching_tokens_detected error code result in RemoveAccount; transient/server-side exceptions (HTTP errors, MSAL service errors, cancellation) result in KeepCachedCredentials.
OAuthClient sign-in and silent refresh integration tests
backend/FwLite/FwLiteShared.Tests/Auth/OAuthClientIsSignedInTests.cs, backend/FwLite/FwLiteShared.Tests/Auth/OAuthClientSilentRefreshTests.cs
OAuthClientIsSignedInTests verify IsSignedIn() returns true/false based on account presence. OAuthClientSilentRefreshTests verify GetAuth() across scenarios: transient errors keep cached account with no event; invalid_grant removes account and publishes event; cancellation keeps account; expired/near-expiry tokens follow expiry-window semantics; Logout blocks on same semaphore as in-flight GetAuth and stays locked out after completion. Includes TestableOAuthClient subclass and AuthenticationResult helpers.
Sync status enum expansion
backend/LexCore/Sync/SyncStatus.cs, frontend/viewer/src/lib/dotnet-types/generated-types/LexCore/Sync/ProjectSyncStatusErrorCode.ts
ProjectSyncStatusErrorCode enum adds new Offline member (between NotLoggedIn and Unknown) to distinguish signed-in-but-unreachable users from unauthenticated users in JSON and frontend representations.
Downstream consumer integration
backend/FwLite/FwLiteShared/Sync/SyncService.cs, backend/FwLite/FwLiteShared/Projects/LexboxProjectService.cs
SyncService.ExecuteSync and LexboxProjectService.GetLexboxSyncStatus/AwaitLexboxSyncFinished now call IsSignedIn() when HTTP client creation fails to distinguish offline-but-signed-in (maps to Offline status and internet-check message) from not-logged-in (maps to NotLoggedIn status and auth-failure message).

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • hahn-kev
  • rmunn

Poem

🐰 A semaphore guards the token gate,
No more does a blip sign you out of your fate,
Classify each error with logic so clear,
Transient troubles won't make you disappear!
Offline or lost—now we know which is which,
One PR to fix them all—quite the switch! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: fixing OAuth to prevent wiping refresh-token cache on transient errors, which is the primary focus of all changes in the PR.
Description check ✅ Passed The description is directly related to the changeset, explaining the three client-side causes being fixed and the supporting changes for the Offline error code.
Linked Issues check ✅ Passed The PR directly addresses all three client-side root causes from #2306: (1) transient errors no longer wipe MSAL cache [OAuthClient.cs refactor], (2) concurrency control via SemaphoreSlim prevents replay detection [OAuthClient.cs + tests], (3) MSAL cache relocation to stable per-user location [mentioned in description], plus supporting IsSignedIn feature [OAuthClient.cs + tests + SyncStatus.cs].
Out of Scope Changes check ✅ Passed All changes are scoped to the stated objectives: test classes for new auth behaviors, OAuthClient refactoring, LexboxProjectService and SyncService updates to use IsSignedIn, SyncStatus enum addition, and generated TypeScript types. No unrelated changes detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch claude/fix-fwlite-auth-logout

Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions Bot added the 💻 FW Lite issues related to the fw lite application, not miniLcm or crdt related label May 28, 2026
@argos-ci

argos-ci Bot commented May 28, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Argos notifications ↗︎

Build Status Details Updated (UTC)
default (Inspect) ✅ No changes detected - Jun 10, 2026, 3:19 PM

@github-actions

github-actions Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor

UI unit Tests

  1 files   62 suites   31s ⏱️
186 tests 186 ✅ 0 💤 0 ❌
258 runs  258 ✅ 0 💤 0 ❌

Results for commit 09ae216.

♻️ This comment has been updated with latest results.

@myieye myieye force-pushed the claude/fix-fwlite-auth-logout branch 2 times, most recently from 93f753a to 4d4a1e5 Compare June 3, 2026 11:36
@github-actions github-actions Bot added the 📦 Lexbox issues related to any server side code, fw-headless included label Jun 3, 2026
@github-actions

github-actions Bot commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

C# FwHeadless Unit Tests

48 tests   48 ✅  16s ⏱️
 5 suites   0 💤
 1 files     0 ❌

Results for commit 09ae216.

♻️ This comment has been updated with latest results.

@github-actions

github-actions Bot commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

C# Unit Tests

165 tests   165 ✅  19s ⏱️
 23 suites    0 💤
  1 files      0 ❌

Results for commit 09ae216.

♻️ This comment has been updated with latest results.

myieye and others added 14 commits June 10, 2026 15:04
The comment said "all 3 scopes are required" but the list had 4 entries.
The "3" referred to profile/openid/offline_access (the MSAL workaround);
sendandreceive was added later as the backend API permission scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GetAuth used to call _application.RemoveAsync(account) on any caught
exception, which destroys the persistent MSAL refresh token. A single
network blip, cancellation, or unexpected error during silent refresh
therefore permanently logged the user out — far more often than the
14-day refresh-token TTL would suggest.

Narrow the policy: only remove the account on definitive failures
(MsalUiRequired, multiple_matching_tokens_detected). Rethrow
cancellations. Keep the cache on everything else and let the next call
retry. Centralise the removal path so it also publishes
AuthenticationChangedEvent for subscribers (e.g. SignalR teardown).

Refs #2306

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two callers can reach the silent-refresh path simultaneously (SignalR's
AccessTokenProvider keepalive + HttpClientRefreshDelegate's
fire-and-forget refresh on the lexbox-jwt-updated header). With
forceRefresh:true the in-MSAL dedupe is bypassed, both calls hit the
token endpoint with the same refresh token, and OpenIddict's rolling-RT
replay detection revokes the entire chain — the next call legitimately
gets MsalUiRequired and the user is logged out.

Wrap the slow path of GetAuth in a SemaphoreSlim with a double-check so
the second caller either piggybacks on the first's fresh _authResult or
refreshes serially against the rotated RT.

Refs #2306

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds two test files plus the minimal seams in OAuthClient to enable
them:

- OAuthClientFailureClassifierTests asserts the per-exception policy in
  ClassifySilentAuthFailure against every realistic exception MSAL or
  the HTTP stack can surface.
- OAuthClientSilentRefreshTests exercises the full GetAuth →
  catch/switch → RemoveAccountAsync glue end-to-end with a mocked
  IPublicClientApplication, covering the three SilentAuthFailureOutcome
  branches.

Production-code changes are limited to test seams: an internal ctor
that skips persistent-cache wiring, an internal virtual
AcquireTokenSilentAsync, and promoting GetAuth from private to internal.

Refs #2306

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AuthConfig.CacheFileName defaults to Path.GetFullPath("msal.json"),
which resolves relative to CWD at config-instantiation time. FwLiteWeb
was only stable because Program.cs chdirs to the assembly directory as
its first statement — any future startup-order change would silently
move the cache and make every user appear logged out. PostConfigure
the path explicitly against AppContext.BaseDirectory.

Refs #2306

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…th classifier

GetAuth takes no CancellationToken, so any OCE is MSAL-internal (an HttpClient
timeout surfacing as TaskCanceledException on a flaky network) and therefore
transient. Rethrowing surprised call sites that never expect GetAuth to throw
(GetLexboxProjects builds its HttpClient outside its try/catch; the SignalR
Reconnecting handler), since the pre-branch code swallowed every exception.
Let OCE fall through to KeepCachedCredentials and drop the now-dead Rethrow arm.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…boundary

Add OAuthClient.IsSignedIn (a purely local MSAL cache read) and a new Offline
member to ProjectSyncStatusErrorCode. At the user-facing "you're not logged in"
sites where CreateHttpClient returns null, check IsSignedIn so a signed-in user
who is merely offline is told they're offline rather than logged out:
GetLexboxSyncStatus, AwaitLexboxSyncFinished, and SyncService.ExecuteSync.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The cache was pinned to AppContext.BaseDirectory (the binary directory).
Platform.Bible extracts each extension version to a versioned cache dir
(~/.platform.bible/cache/extensions/<name>_<version>/), so a binary-relative
cache is orphaned on every update — the user is silently logged out per
update — and a cache/ tree is also wipeable by cache cleaning.

Store msal.json under LocalApplicationData/SIL/FwLiteWeb instead (resolves
to %LOCALAPPDATA% on Windows, ~/.local/share on Linux, ~/Library/Application
Support on macOS), mirroring FwLiteMaui's stable-per-user pattern. Create the
directory up front since MsalCacheHelper needs a fully-qualified path with an
existing parent. Best-effort migrate an existing binary-relative msal.json so
standalone FwLiteWeb users are not logged out by this change.

Refs #2306

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ogout

On a transient silent-refresh failure GetAuth kept the in-memory _authResult,
which is only ever reached when that result is null or expired — so callers got
a stale bearer and 401s. Null it out (the refresh token stays in the MSAL cache,
so the next GetAuth retries). Also take _authSemaphore in Logout so an in-flight
AcquireTokenSilent cannot repopulate _authResult and silently sign the user back in.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…pData

PostConfigure unconditionally overwrote Auth:CacheFileName, ignoring an explicit
config value; skip the default when one is set. Also fall back to the binary dir
when LocalApplicationData resolves empty, keeping the cache path fully qualified
(BuildCacheProperties rejects relative paths).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ailure

The pre-expiry refresh window (<5 min left) reaches the same transient-
failure branch as an actually-expired token; dropping a still-valid token
there spuriously reports logged-out (stopping the SignalR listener) during
a network blip. Only clear the in-memory result once it has truly expired
-- the freshness fast-path already guarantees a retained near-expiry token
keeps retrying refresh on every call.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…h lock

RemoveAccountAsync published AuthenticationChangedEvent while GetAuth still
held _authSemaphore. Subject.OnNext dispatches synchronously, so a subscriber
re-entering GetAuth/Logout would self-deadlock on the non-reentrant semaphore.
Mirror Logout: GetAuth now captures the result under the lock, releases, then
publishes outside it only when an account was removed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@myieye myieye force-pushed the claude/fix-fwlite-auth-logout branch from 1078705 to 09ae216 Compare June 10, 2026 15:16
@myieye myieye marked this pull request as ready for review June 10, 2026 15:16

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@backend/FwLite/FwLiteShared/Auth/OAuthClient.cs`:
- Around line 224-226: The current _logger.LogWarning call in OAuthClient.cs
includes raw usernames and HomeAccountId identifiers via string.Join over
accounts which leaks PII; change the log to avoid raw identifiers by logging
only the count of cached accounts or redacted identifiers (e.g., account index
or hashed/partial redaction) instead of a full username or HomeAccountId.
Specifically update the _logger.LogWarning invocation (the call that passes 'e'
and the "Silent token acquisition..." message and the
string.Join(accounts.Select(...))) to replace string.Join(...) with either
accounts.Count() or a safe projection like accounts.Select((a,i) =>
$"account#{i}") or a redacted substring/hash, ensuring no full usernames or
HomeAccountId.Identifier values are emitted. Ensure formatting and exception
parameter 'e' remain unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 41983a8c-d5ba-473b-8343-1f64734a65c7

📥 Commits

Reviewing files that changed from the base of the PR and between 4ccd57f and 09ae216.

📒 Files selected for processing (8)
  • backend/FwLite/FwLiteShared.Tests/Auth/OAuthClientFailureClassifierTests.cs
  • backend/FwLite/FwLiteShared.Tests/Auth/OAuthClientIsSignedInTests.cs
  • backend/FwLite/FwLiteShared.Tests/Auth/OAuthClientSilentRefreshTests.cs
  • backend/FwLite/FwLiteShared/Auth/OAuthClient.cs
  • backend/FwLite/FwLiteShared/Projects/LexboxProjectService.cs
  • backend/FwLite/FwLiteShared/Sync/SyncService.cs
  • backend/LexCore/Sync/SyncStatus.cs
  • frontend/viewer/src/lib/dotnet-types/generated-types/LexCore/Sync/ProjectSyncStatusErrorCode.ts

Comment thread backend/FwLite/FwLiteShared/Auth/OAuthClient.cs

@hahn-kev hahn-kev left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! I tested it out, I was still logged in, and had no trouble logging out and logging back in again.

@myieye myieye merged commit 78e3f97 into develop Jun 11, 2026
32 of 33 checks passed
@myieye myieye deleted the claude/fix-fwlite-auth-logout branch June 11, 2026 08:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

💻 FW Lite issues related to the fw lite application, not miniLcm or crdt related 📦 Lexbox issues related to any server side code, fw-headless included

Projects

None yet

Development

Successfully merging this pull request may close these issues.

FwLite signs out unexpectedly more often than the 14-day refresh-token TTL would suggest

2 participants