refactor(auth): remove navigator.locks-based mutex; introduce commit guard + dispose()#2387
refactor(auth): remove navigator.locks-based mutex; introduce commit guard + dispose()#2387Bewinxed wants to merge 1 commit into
Conversation
f698e42 to
d248bb2
Compare
…guard + dispose() Removes the `_acquireLock` mutex that wrapped every auth operation and the underlying `navigator.locks` / `processLock` machinery. Replaces it with two lighter primitives that target the specific synchronization needs each operation actually has: - `refreshingDeferred` (already existed) continues to single-flight the refresh path within an instance. - A storage-level commit guard in `_callRefreshToken` re-reads the storage refresh_token between the rotated-tokens response and `_saveSession`. If storage changed under us (e.g. a concurrent `signOut` ran `_removeSession`), the rotated tokens are discarded rather than written back. Returns `AuthRefreshDiscardedError` on the result. Cross-tab refresh races are handled server-side by GoTrue's v1 parent-of-active mechanism at `internal/tokens/service.go:376-385`, so no client-side coordination is needed. New `client.dispose()` tears down the auto-refresh interval, the `visibilitychange` listener, and the BroadcastChannel; clears registered `onAuthStateChange` subscribers. Idempotent. Call from cleanup hooks in React Strict Mode / HMR contexts to prevent stale tickers from outliving the client. The `lock` and `lockAcquireTimeout` constructor options are silently ignored for backwards compatibility; both are marked `@deprecated`. `navigatorLock`, `processLock`, and the `LockAcquireTimeoutError` family remain exported from `./lib/locks` for one major version. Stale lock references in JSDoc on `getSession`, `onAuthStateChange`, `_challengeAndVerify`, and `_listFactors` updated to match the new model. Test branch only, not for merge yet. See RFC `lockless_auth_coordination`. Resolves (test target): supabase#2013, supabase#936, supabase#2111
d248bb2 to
f3d4e71
Compare
@supabase/auth-js
@supabase/functions-js
@supabase/postgrest-js
@supabase/realtime-js
@supabase/storage-js
@supabase/supabase-js
commit: |
mandarini
left a comment
There was a problem hiding this comment.
Left some comments in the code, and also:
- "No breaking changes" checkbox is incorrect. There are two real breaking changes.
- Custom
lockfunctions are silently ignored. React Native users withprocessLockand Node multi-process setups with custom locks now race silently with no warning. Add aconsole.warn(gated on an opt-out) at construction when a non-nulllockis supplied. setSession/refreshSession({ refresh_token })with externally-supplied tokens against empty storage now error. Goes away once #1 is fixed.
Let's update the PR description.
_autoRefreshTokenTickbehavior change not flagged.
Old code: _acquireLock(0, ...) made the tick skip whenever any auth op held the lock. New code: only skips when refreshingDeferred is set. The tick can now fire concurrently with signOut/setSession/getUser/etc. for the first time. Add to "Behaviour changes worth flagging for review".
Side note: the early-return if (this.refreshingDeferred !== null) at the top of the tick duplicates dedup that _callRefreshToken already does. Keep as perf opt or drop, your call.
- RFC references in code.
Code comments and test names reference lockless_auth_coordination. If the RFC isn't landing as a separate doc, scrub the references; otherwise link it.
Thank you so much @Bewinxed !!! :D
| // discard the rotated tokens. Writing them back would undo what | ||
| // signOut just cleared. | ||
| const currentStored = (await getItemAsync(this.storage, this.storageKey)) as Session | null | ||
| if (!currentStored || currentStored.refresh_token !== refreshToken) { |
There was a problem hiding this comment.
Commit guard fires false-positives on empty storage.
To repro: fresh client, empty memoryLocalStorageAdapter, no signOut anywhere, mocked _refreshAccessToken returning rotated tokens without touching storage:
await client.setSession({ access_token: expiredAT, refresh_token: 'R-external' })
// returns: { data: { session: null }, error: AuthRefreshDiscardedError }
// error.message: "Refresh result discarded: session state changed mid-flight (e.g., concurrent signOut)"Breaks setSession and refreshSession({ refresh_token }) for any caller supplying an expired AT plus a refresh token that isn't already in storage. That's the entire SSR cookie-handoff / external-token-hydration flow.
Fix: snapshot storage before the fetch, only discard when a non-null snapshot got cleared.
const storedAtStart = await getItemAsync(this.storage, this.storageKey)
// ... _refreshAccessToken ...
const storedAfter = await getItemAsync(this.storage, this.storageKey)
const wasClearedMidFlight =
storedAtStart !== null &&
(storedAfter === null || storedAfter.refresh_token !== refreshToken)Add a test for the empty-storage path. Your existing 'discards rotated tokens when storage was cleared mid-flight' test plants R1 first, so it doesn't catch this.
| * | ||
| * Call this from cleanup hooks when the client is being replaced before | ||
| * its JS realm is destroyed. React Strict Mode and HMR are the common | ||
| * cases. After dispose, network operations will reject. Storage reads |
There was a problem hiding this comment.
The dispose() docstring claims behavior the code doesn't have. dispose() stops the ticker, closes the broadcast channel, removes the visibility listener, clears subscribers. In-flight fetches still complete and _saveSession still writes. New getSession/getUser calls after dispose() work fine. Per our earlier alignment to defer AbortController plumbing, delete the sentence.
| * @deprecated Due to the possibility of deadlocks with async functions as callbacks, use the version without an async function. | ||
| */ | ||
| onAuthStateChange(callback: (event: AuthChangeEvent, session: Session | null) => Promise<void>): { |
There was a problem hiding this comment.
I think removing @deprecated on the async onAuthStateChange overload is premature. One reentry deadlock remains: a subscriber that calls refreshSession (or anything routing through _callRefreshToken) from inside a TOKEN_REFRESHED handler. refreshingDeferred.resolve happens after _notifyAllSubscribers returns:
await this._notifyAllSubscribers('TOKEN_REFRESHED', data.session) // subscriber runs here
// ...
this.refreshingDeferred.resolve(result) // resolved AFTERSubscriber's inner refreshSession dedups onto refreshingDeferred, which won't resolve until the outer finishes awaiting subscribers. Deadlock.
Let's keep the @deprecated on the async overload, rewrite the reason to point at residual refreshingDeferred reentry instead of the old lock. The common cases (getUser, setSession, reading session from a callback) are genuinely fixed and worth saying.
Description
What changed?
GoTrueClientno longer uses a shared mutex to serialize auth operations._acquireLock, itspendingInLockqueue, and all 14 call-site wrappers are gone. In their place:refreshingDeferred(already existed) keeps single-flighting in-instance concurrent refreshes._callRefreshTokenre-reads storage after the rotated-token response and before_saveSession. If the inputrefresh_tokenisn't in storage anymore (because a concurrentsignOutran_removeSessionbetween fetch start and continuation), the rotated tokens are discarded and the result resolves to{ data: null, error: new AuthRefreshDiscardedError() }._recoverAndRefreshrecognises this error and skips its own_removeSession()call, so no duplicateSIGNED_OUTevent fires.AuthRefreshDiscardedError(with anisAuthRefreshDiscardedErrortype guard), is what the commit guard returns when it throws away a successfully-rotated token. It signals "the server gave us new tokens but the client decided not to keep them," distinct fromAuthRetryableFetchError(transient network) andAuthApiError(server rejection). Surfaces throughrefreshSession()andgetSession()results.client.dispose()tears down the auto-refresh interval, thevisibilitychangelistener, theBroadcastChannel, and registeredonAuthStateChangesubscribers. Idempotent. Designed for React Strict Mode and HMR cleanup hooks.internal/tokens/service.go:376-385, so the client doesn't need to coordinate. The current default is v1 (RefreshTokenAlgorithmVersion = 0); v2 (counter-based) is opt-in viaRefreshTokenUpgradePercentageand gated rollout. This PR's reasoning relies on v1 behaviour, which is what customers actually run today.lockandlockAcquireTimeoutconstructor options are accepted but silently ignored for backwards compatibility. Both are now@deprecatedin JSDoc, including the@supabase/supabase-jsre-exported fields.navigatorLock,processLock,LockAcquireTimeoutError,NavigatorLockAcquireTimeoutError,ProcessLockAcquireTimeoutError, andinternalsinlib/locks.tsremain exported for one major version. All@deprecated.getSession,onAuthStateChange,_useSession,_challengeAndVerify, and_listFactorsnow matches the new behaviour. The@deprecatedtag on the asynconAuthStateChangeoverload (its only justification was the deadlock that no longer exists) is removed.Why was this change needed?
The shared mutex has caused seven failure classes on the issue tracker for 18+ months. Each earlier patch fixed one symptom and created another:
#2235 (primitive swap to
processLock) was an earlier attempt at moving offnavigator.locksand is no longer being actively pursued. Community PRs #2016 and #2019 fixed individual symptoms and were closed waiting on a structural fix.Failure classes resolved by this PR:
INITIAL_SESSION/getSession()hangs with a persisted session (#936 @bugprone). The client no longer touchesnavigator.locks.AbortError: Lock broken by another request with the 'steal' option(#2013 @cpannwitz). The{ steal: true }fallback that produced these errors is gone with the lock.GoTrueClient.ts:3824). The cycle needed the lock; a callback that awaitsgetUser()now just runs.dispose()cleans them up.signOutblocked behind in-flight refresh. signOut now runs concurrently with the refresh, and the commit guard prevents the refresh from writing rotated tokens after_removeSession()cleared storage.Why no lock at all: cross-tab races are handled on the server (GoTrue's parent-of-active), in-instance refresh dedup is already handled by
refreshingDeferred, and the only job left for the lock was serializing subscriber callbacks. That's the deadlock this PR fixes. A narrower lock would still do that one job.Closes (test target, not for merge yet): #2013, #936, #2111
Related:
lockless_auth_coordinationRFC (to be opened separately).Examples
dispose()in a React appSubscriber callbacks can call other auth methods now
AuthRefreshDiscardedErrorfor the signOut-during-refresh raceBreaking changes
Behaviour changes worth flagging for review
onAuthStateChangeno longer warns against async callbacks. Calling other auth methods inside the callback used to deadlock; now it works. The@deprecatedmarker on the async overload signature is removed (its only justification was the deadlock).lockandlockAcquireTimeoutare accepted but ignored. A customlockfunction passed via constructor is never invoked. Code that depended on a custom lock being called will silently lose that behaviour. The@deprecatedJSDoc on both options flags this for the developer.signOutno longer waits for an in-flight refresh's HTTP and continuation to finish before its own fetch goes out. Both fetches now run concurrently, and the commit guard keeps storage consistent.Checklist
<type>(<scope>): <description>npx nx formatto ensure consistent code formattingdispose(), subscriber re-entry no-deadlock, custom lock no-op)Additional notes
Alternatives considered
Continuing to patch
navigator.locks(better steal recovery, lower timeouts, smarter error filtering). Each prior patch in this direction (#1962, #2106, #2178) fixed one symptom and produced another. The Web Locks API has no recovery for orphaned holders other than{ steal: true }, which leaves the previousfn()running concurrently with the new holder and creates the steal-cascade error storm. Each fix here swaps one failure for another instead of removing the contention that causes them.Swapping
navigatorLockforprocessLockas the default browser lock (the direction #2235 explored). Removes the cross-process orphan failure but keeps the same shape of the bug: one primitive serializing operations that don't all need the same synchronization. Subscriber re-entry deadlock, visibility-tick contention, init-tick contention, and in-process orphans of the new primitive all remain.Non-blocking subscriber notifications alone (#2016). Fixes subscriber re-entry. The other six failure classes still bite.
A "smaller, refactored" lock with narrower scope (e.g. lock only around storage writes, not around HTTP). Tempting on paper. Cross-tab refresh is already lockless on the server (parent-of-active returns the same rotated token to N tabs), and
refreshingDeferredalready single-flights refreshes inside a tab. The visibility-tick race is the lock fighting itself. The only thing a narrower lock could still do is serialize subscriber callbacks within a tab, which is the deadlock this PR fixes. No smaller version of this PR still fixes the seven failures above.Adding an
AbortControllerlayer for in-flight operation cancellation. Deferred to a follow-up, not included here. The commit guard catches the one race that affects correctness (signOut overwriting cleared storage with rotated tokens). AbortController would be a UX improvement on top: cancel the in-flight refresh as soon as signOut runs, instead of letting it finish and discarding the result. It also requires plumbingsignalthroughlib/fetch.tsand a decision onAbortSignal.anyruntime support (Node ≥ 20.3, Hermes ≥ 0.74, Safari ≥ 17.4). Out of scope for v1.Deferring the commit guard and accepting eventual consistency. Considered and rejected. Without a guard, a refresh that completes after
_removeSession()cleared storage will write the rotated tokens back. Subscribers then seeSIGNED_OUTfollowed byTOKEN_REFRESHED, with stale tokens in storage until the next refresh tick (~30s) fails against the server and clears them. The guard is about 15 lines. Better to land it with this PR than to ship the bug and patch it later.Server-side context
Why cross-tab is safe: GoTrue's parent-of-active path at
internal/tokens/service.go:376-385(the v1 branch,*models.RefreshToken). When a request arrives with a revoked refresh token whose child is the currently-active token, the server returns the active token instead of rejecting. Two tabs that POST the same refresh token concurrently both receive the same rotated token under DB row locking. This is the production-default behaviour (RefreshTokenAlgorithmVersion = 0, v1). v2 (counter-based, gated onRefreshTokenUpgradePercentage) is safe under the same N-tab concurrency, covered byTestConcurrentReuse.Open questions for review
_setSession's expired branch also filterAuthRefreshDiscardedError? Currently it propagates the error to the caller, which I think is correct (setSessionwas given new tokens; if a concurrent signOut just ran, the caller should be told). Confirming we're OK with that.AbortSignal.anyplumbing inlib/fetch.tsisn't done here (no AbortController layer to plumb yet). If we add the AbortController follow-up later, this becomes a prerequisite. Just so it's on the radar.@supabase/supabase-jstypes deprecation: I markedlockandlockAcquireTimeout@deprecatedin the supabase-js types file too, since they delegate to the auth client. Confirm that's the right surface to deprecate.Test coverage
New tests in
packages/core/auth-js/test/GoTrueClient.test.ts:'Lockless coordination'describe: deprecatedlockandlockAcquireTimeoutoptions accepted-but-ignored; subscriber callback may call other auth methods without deadlock.'dispose() lifecycle'describe: idempotency, subscriber clearing, ticker stopping.'Refresh commit guard (signOut-during-refresh race)'describe: simulates the race, assertsAuthRefreshDiscardedErroris returned and storage stays cleared.New tests in
packages/core/auth-js/test/GoTrueClient.browser.test.ts:'Lockless coordination: navigator.locks should NOT be invoked': confirms the browser API is no longer touched.'Lockless backwards-compatibility: deprecated lock option': a customlockfunction passed via constructor is never invoked.Removed:
'Lock functionality','Browser locks functionality','Lock Mechanism Branches'(tested a mechanism that no longer exists). Updated the existing'should handle custom lock implementation'test to assertnot.toHaveBeenCalled()instead oftoHaveBeenCalled().Status
This is on a test branch for now. Ready for review and discussion. Once the design is signed off, I'll open it against
develop.