Serialize OAuth token refresh per-account to prevent RTR logout.#2931
Conversation
| // election so that a no-match path (e.g. account removed during refresh) never | ||
| // marks a RefreshState as refreshing — preserving the no deadlock fix and | ||
| // logout-during-refresh semantics. | ||
| final UserAccountManager userAccountManager = SalesforceSDKManager.getInstance().getUserAccountManager(); |
There was a problem hiding this comment.
Should we add a helper function on UserAccountManager to return the matching user given a refresh token?
There was a problem hiding this comment.
I considered that but it won't be needed when we make the API change to ClientManager's constructor to associate it with a UserAccount from the start.
wmathurin
left a comment
There was a problem hiding this comment.
The core concurrency model is solid — the winner/loser election, spurious-wakeup loop, single finally publish path, and live-token read in refreshStaleToken are all correct. A few issues below worth addressing before merging.
| } | ||
| if (parked >= count) return | ||
| Thread.yield() | ||
| } |
There was a problem hiding this comment.
Thread.yield() in a tight deadline loop burns CPU and can be flaky on loaded CI machines where yield starvation delays state transitions. Thread.sleep(1) per iteration is cheaper and less sensitive to scheduler pressure.
There was a problem hiding this comment.
I updated to Thread.sleep(50) as I don't think 1ms is even feasible w/ schedule overhead.
| public synchronized void refreshAccessToken() throws IOException { | ||
| oAuthRefreshInterceptor.refreshAccessToken(); | ||
| } |
There was a problem hiding this comment.
@wmathurin @JohnsonEricAtSalesforce The is simple but important new API.
wmathurin
left a comment
There was a problem hiding this comment.
Great fix for the RTR logout bug overall.
I kept digging with Claude to see if having the two separate synchronized(state.lock) could lead to unwanted edge cases.
It came up with one. It's very unlikely to happen but the fix makes sense, so we should probably adopt it. Below are the details.
Consecutive-cycle race: a loser can miss a result that was already published
The gap
Between Block 1 (election) and Block 2 (publish), the lock is intentionally released so the network call doesn't hold it. refreshing=true is the "slot taken" flag during that gap. This is correct.
The issue arises at the transition between two consecutive refresh cycles. After Winner A publishes in Block 2:
A: refreshing=false, newAuthToken=published, notifyAll(), releases lock
← RACE: Loser X (just woken) and Winner B (arriving fresh) both compete for state.lock →
B: wins lock in Block 1, sees refreshing=false, sets refreshing=true, newAuthToken=null
X: acquires lock, sees refreshing=true (Winner B!), parks again with whatever deadline time remains
Winner B clears state.newAuthToken=null as part of becoming winner — correctly, to avoid re-publishing a stale result next cycle. But this means Loser X, even if it was nanoseconds behind, finds nothing to adopt and is now on the hook for Winner B's full network round-trip, counted against a deadline that started ticking during Winner A's cycle.
If Winner A consumed most of the 30s budget and Winner B's call takes longer than what remains, Loser X times out and returns null — even though Winner A's refresh succeeded.
Proposed fix
Add a freshness check inside Block 1, before committing to a new winner election:
synchronized (state.lock) {
if (state.refreshing) {
// existing loser path — unchanged
}
// If a refresh completed very recently, adopt its result instead of starting a new cycle.
// This closes the consecutive-cycle race where a loser woken by notifyAll() loses the
// lock race to a new winner, which clears the just-published result before the loser
// can read it.
if (state.newAuthToken != null &&
System.currentTimeMillis() - state.lastRefreshTime < RECENT_REFRESH_THRESHOLD_MILLIS) {
adoptWinnerResult(state);
return state.newAuthToken;
}
// Become winner — unchanged
state.refreshing = true;
state.newAuthToken = null;
...
}RECENT_REFRESH_THRESHOLD_MILLIS can be small — a few seconds is enough, since the race window is the time between notifyAll() and a loser re-acquiring the lock, which is microseconds in practice. The threshold just needs to be larger than that window.
This makes state.newAuthToken serve dual purpose: election result for waiting losers, and short-lived cache for threads that arrive just after a cycle completes. Winner B becomes a no-op adopter rather than starting a redundant network call, and Loser X's deadline is no longer at risk from a cycle it had no connection to.
…essary refresh to be requested.
@wmathurin Great find. I originally had a different approach to fix this (adding a generational counter) that I think better guarantees correctness in the worst case, but was definitely less elegant in some situations. I was able to combine the best of both into this fix. |
No description provided.