Skip to content

Serialize OAuth token refresh per-account to prevent RTR logout.#2931

Merged
brandonpage merged 3 commits into
forcedotcom:devfrom
brandonpage:fix-concurrent-rtr-logout
Jun 23, 2026
Merged

Serialize OAuth token refresh per-account to prevent RTR logout.#2931
brandonpage merged 3 commits into
forcedotcom:devfrom
brandonpage:fix-concurrent-rtr-logout

Conversation

@brandonpage

Copy link
Copy Markdown
Contributor

No description provided.

// election so that a no-match path (e.g. account removed during refresh) never
// marks a RefreshState as refreshing — preserving the no deadlock fix and
// logout-during-refresh semantics.
final UserAccountManager userAccountManager = SalesforceSDKManager.getInstance().getUserAccountManager();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a helper function on UserAccountManager to return the matching user given a refresh token?

@brandonpage brandonpage Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered that but it won't be needed when we make the API change to ClientManager's constructor to associate it with a UserAccount from the start.

@wmathurin wmathurin left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The core concurrency model is solid — the winner/loser election, spurious-wakeup loop, single finally publish path, and live-token read in refreshStaleToken are all correct. A few issues below worth addressing before merging.

}
if (parked >= count) return
Thread.yield()
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thread.yield() in a tight deadline loop burns CPU and can be flaky on loaded CI machines where yield starvation delays state transitions. Thread.sleep(1) per iteration is cheaper and less sensitive to scheduler pressure.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated to Thread.sleep(50) as I don't think 1ms is even feasible w/ schedule overhead.

@brandonpage brandonpage marked this pull request as ready for review June 17, 2026 22:26
Comment on lines +322 to +324
public synchronized void refreshAccessToken() throws IOException {
oAuthRefreshInterceptor.refreshAccessToken();
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wmathurin @JohnsonEricAtSalesforce The is simple but important new API.

@brandonpage brandonpage requested a review from wmathurin June 19, 2026 16:19

@wmathurin wmathurin left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great fix for the RTR logout bug overall.
I kept digging with Claude to see if having the two separate synchronized(state.lock) could lead to unwanted edge cases.
It came up with one. It's very unlikely to happen but the fix makes sense, so we should probably adopt it. Below are the details.

Consecutive-cycle race: a loser can miss a result that was already published

The gap

Between Block 1 (election) and Block 2 (publish), the lock is intentionally released so the network call doesn't hold it. refreshing=true is the "slot taken" flag during that gap. This is correct.

The issue arises at the transition between two consecutive refresh cycles. After Winner A publishes in Block 2:

A: refreshing=false, newAuthToken=published, notifyAll(), releases lock
   ← RACE: Loser X (just woken) and Winner B (arriving fresh) both compete for state.lock →
B: wins lock in Block 1, sees refreshing=false, sets refreshing=true, newAuthToken=null
X: acquires lock, sees refreshing=true (Winner B!), parks again with whatever deadline time remains

Winner B clears state.newAuthToken=null as part of becoming winner — correctly, to avoid re-publishing a stale result next cycle. But this means Loser X, even if it was nanoseconds behind, finds nothing to adopt and is now on the hook for Winner B's full network round-trip, counted against a deadline that started ticking during Winner A's cycle.

If Winner A consumed most of the 30s budget and Winner B's call takes longer than what remains, Loser X times out and returns null — even though Winner A's refresh succeeded.

Proposed fix

Add a freshness check inside Block 1, before committing to a new winner election:

synchronized (state.lock) {
    if (state.refreshing) {
        // existing loser path — unchanged
    }
    // If a refresh completed very recently, adopt its result instead of starting a new cycle.
    // This closes the consecutive-cycle race where a loser woken by notifyAll() loses the
    // lock race to a new winner, which clears the just-published result before the loser
    // can read it.
    if (state.newAuthToken != null &&
            System.currentTimeMillis() - state.lastRefreshTime < RECENT_REFRESH_THRESHOLD_MILLIS) {
        adoptWinnerResult(state);
        return state.newAuthToken;
    }
    // Become winner — unchanged
    state.refreshing = true;
    state.newAuthToken = null;
    ...
}

RECENT_REFRESH_THRESHOLD_MILLIS can be small — a few seconds is enough, since the race window is the time between notifyAll() and a loser re-acquiring the lock, which is microseconds in practice. The threshold just needs to be larger than that window.

This makes state.newAuthToken serve dual purpose: election result for waiting losers, and short-lived cache for threads that arrive just after a cycle completes. Winner B becomes a no-op adopter rather than starting a redundant network call, and Loser X's deadline is no longer at risk from a cycle it had no connection to.

@brandonpage

Copy link
Copy Markdown
Contributor Author

Great fix for the RTR logout bug overall. I kept digging with Claude to see if having the two separate synchronized(state.lock) could lead to unwanted edge cases. It came up with one. It's very unlikely to happen but the fix makes sense, so we should probably adopt it. Below are the details.

@wmathurin Great find. I originally had a different approach to fix this (adding a generational counter) that I think better guarantees correctness in the worst case, but was definitely less elegant in some situations. I was able to combine the best of both into this fix.

@brandonpage brandonpage merged commit 25b42de into forcedotcom:dev Jun 23, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants