fix(auth/codex): de-duplicate concurrent token refreshes via singleflight#3784
fix(auth/codex): de-duplicate concurrent token refreshes via singleflight#3784ppanphper wants to merge 1 commit into
Conversation
…ight OpenAI rotates the refresh token on every successful refresh and immediately invalidates the previous one. The Codex auth refresh path had no in-process de-duplication, so two concurrent refreshes of the same token would race: the first succeeds and rotates the token, while the second sends the now-stale token and fails with `refresh_token_reused`. Wrap RefreshTokens with a package-level singleflight.Group keyed by the refresh token (mirroring the existing Claude auth implementation in internal/auth/claude), so concurrent refreshes of the same token collapse into a single upstream call whose result is shared. The group is package-level on purpose: the executor constructs a fresh CodexAuth per refresh, so an instance-scoped group would not de-duplicate. context is detached with WithoutCancel so one caller's cancellation does not abort the shared refresh other callers depend on. Add a concurrency test asserting that N concurrent refreshes for the same token result in exactly one upstream call. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
This pull request targeted The base branch has been automatically changed to |
There was a problem hiding this comment.
Code Review
This pull request introduces singleflight to collapse concurrent token refresh requests for the same refresh token, preventing race conditions and "refresh_token_reused" errors. The review feedback highlights two critical issues with this implementation: an infinite hang risk due to stripping the context timeout with context.WithoutCancel on an untimeout-configured HTTP client, and ignored context cancellation for individual callers because of the synchronous Do call. It is recommended to use DoChan with a background context timeout instead.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| // Use WithoutCancel so a single caller's cancellation does not abort the shared | ||
| // refresh that other waiting callers depend on. | ||
| result, err, _ := codexRefreshGroup.Do(refreshToken, func() (interface{}, error) { | ||
| return o.refreshTokensSingleFlight(context.WithoutCancel(ctx), refreshToken) | ||
| }) | ||
| if err != nil { | ||
| return nil, err | ||
| } | ||
| tokenData, ok := result.(*CodexTokenData) | ||
| if !ok || tokenData == nil { | ||
| return nil, fmt.Errorf("token refresh failed: invalid single-flight result") | ||
| } | ||
| return tokenData, nil |
There was a problem hiding this comment.
There are two significant issues with the current singleflight implementation:
- Infinite Hang Risk: Since
context.WithoutCancel(ctx)strips any timeout/deadline from the caller's context, and thehttp.Clienthas no default timeout configured, an unresponsive upstream server will cause the HTTP request to hang indefinitely. This will permanently block thesingleflightkey, causing all subsequent token refreshes to hang forever. - Ignored Context Cancellation: Using
codexRefreshGroup.Doblocks the caller synchronously. If a caller's context is canceled, they will still remain blocked until the shared HTTP request completes.
Using DoChan combined with a background context timeout solves both issues. Individual callers can return immediately if their context is canceled, while the background request is protected by a 30-second timeout and properly cleaned up to avoid context leaks.
// Use DoChan and select on ctx.Done() so that if an individual caller's context
// is canceled, that caller can return immediately without blocking on the shared
// refresh. Meanwhile, the background refresh continues with a timeout to prevent
// infinite hangs and context leaks.
ch := codexRefreshGroup.DoChan(refreshToken, func() (interface{}, error) {
bgCtx, cancel := context.WithTimeout(context.WithoutCancel(ctx), 30*time.Second)
defer cancel()
return o.refreshTokensSingleFlight(bgCtx, refreshToken)
})
select {
case <-ctx.Done():
return nil, ctx.Err()
case res := <-ch:
if res.Err != nil {
return nil, res.Err
}
tokenData, ok := res.Val.(*CodexTokenData)
if !ok || tokenData == nil {
return nil, fmt.Errorf("token refresh failed: invalid single-flight result")
}
return tokenData, nil
}There was a problem hiding this comment.
Code Review
This pull request introduces single-flight de-duplication for concurrent token refreshes using singleflight.Group to prevent race conditions and "refresh_token_reused" errors caused by OpenAI's token rotation. It also adds a concurrent test to verify this behavior. The review feedback highlights two critical issues: first, using singleflight.Group.Do blocks the calling goroutine even if the caller's context is cancelled, which should be resolved by using DoChan and selecting on ctx.Done(); second, using context.WithoutCancel without an HTTP client timeout can cause requests to hang indefinitely, which should be mitigated by enforcing a safety timeout inside the single-flight function.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| result, err, _ := codexRefreshGroup.Do(refreshToken, func() (interface{}, error) { | ||
| return o.refreshTokensSingleFlight(context.WithoutCancel(ctx), refreshToken) | ||
| }) | ||
| if err != nil { | ||
| return nil, err | ||
| } | ||
| tokenData, ok := result.(*CodexTokenData) | ||
| if !ok || tokenData == nil { | ||
| return nil, fmt.Errorf("token refresh failed: invalid single-flight result") | ||
| } | ||
| return tokenData, nil |
There was a problem hiding this comment.
Using singleflight.Group.Do blocks the calling goroutine until the single-flight function completes, even if the caller's context (ctx) is cancelled or times out. This can lead to goroutine leaks and unresponsive clients under high load or network degradation.
To respect context cancellation for the caller while still allowing the shared background refresh to complete, we should use singleflight.Group.DoChan and select on ctx.Done() and the result channel.
ch := codexRefreshGroup.DoChan(refreshToken, func() (interface{}, error) {
return o.refreshTokensSingleFlight(context.WithoutCancel(ctx), refreshToken)
})
select {
case <-ctx.Done():
return nil, ctx.Err()
case res := <-ch:
if res.Err != nil {
return nil, res.Err
}
tokenData, ok := res.Val.(*CodexTokenData)
if !ok || tokenData == nil {
return nil, fmt.Errorf("token refresh failed: invalid single-flight result")
}
return tokenData, nil
}| func (o *CodexAuth) refreshTokensSingleFlight(ctx context.Context, refreshToken string) (*CodexTokenData, error) { | ||
| data := url.Values{ |
There was a problem hiding this comment.
Because context.WithoutCancel(ctx) strips any deadline or timeout from the parent context, and the CodexAuth HTTP client is initialized without a default timeout (&http.Client{}), the background HTTP request has no timeout. If the upstream server hangs or the connection drops silently, this request could block indefinitely.
To prevent indefinite hangs, we should enforce a safety timeout (e.g., 30 seconds) on the context inside refreshTokensSingleFlight. Since this timeout is applied inside the single-flight function, it won't be affected by individual caller cancellations.
| func (o *CodexAuth) refreshTokensSingleFlight(ctx context.Context, refreshToken string) (*CodexTokenData, error) { | |
| data := url.Values{ | |
| func (o *CodexAuth) refreshTokensSingleFlight(ctx context.Context, refreshToken string) (*CodexTokenData, error) { | |
| ctx, cancel := context.WithTimeout(ctx, 30*time.Second) | |
| defer cancel() | |
| data := url.Values{ |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 298b426a2e
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| result, err, _ := codexRefreshGroup.Do(refreshToken, func() (interface{}, error) { | ||
| return o.refreshTokensSingleFlight(context.WithoutCancel(ctx), refreshToken) |
There was a problem hiding this comment.
Respect canceled refresh contexts
When the caller's context is already canceled or reaches its deadline, this still enters the singleflight function with context.WithoutCancel(ctx), so the POST no longer observes cancellation or deadlines. In this path NewCodexAuthWithProxyURL constructs an http.Client{} without a timeout, so a canceled request or shutdown context can now block indefinitely on a stalled token endpoint instead of returning context.Canceled/DeadlineExceeded; use a non-cancelable shared context only for protecting other waiters while still letting each caller stop waiting (or at least skip starting the refresh when its context is already done).
Useful? React with 👍 / 👎.
What
Wrap
(*CodexAuth).RefreshTokensin a package-levelsingleflight.Groupkeyed by the refresh token, so concurrent refreshes for the same Codex (OpenAI) credential collapse into a single upstream call whose result is shared.Fixes #3783.
Why
OpenAI rotates refresh tokens on every successful refresh and immediately invalidates the previous one. The Codex refresh path had no in-process de-duplication, so two concurrent refreshes of the same token race: the first succeeds and rotates the token, while the second sends the now-stale token and fails with
refresh_token_reused.The Claude auth path (
internal/auth/claude/anthropic_auth.go) already solves this withsingleflight; this change brings Codex in line.How
internal/auth/codex/openai_auth.go:var codexRefreshGroup singleflight.Group.RefreshTokensnow callscodexRefreshGroup.Do(refreshToken, ...); the actual HTTP exchange moves into a new unexportedrefreshTokensSingleFlight.context.WithoutCancel(ctx)so one caller's cancellation does not abort the shared refresh that other waiting callers depend on.CodexAuthper refresh, so an instance-scoped group would not de-duplicate.Test
Added
TestRefreshTokens_ConcurrentSingleFlight: launches N concurrentRefreshTokenscalls for the same token against a blocking transport and asserts exactly one upstream call is made and all callers succeed.passes.
Scope / limitation
singleflightde-duplicates within a single process only. Multiple processes sharing the same auth store still race the rotating token and need a single-writer guarantee at the deployment level — noted in #3783.🤖 Generated with Claude Code