Skip to content

fix(auth/codex): de-duplicate concurrent token refreshes via singleflight#3784

Closed
ppanphper wants to merge 1 commit into
router-for-me:devfrom
ppanphper:fix/codex-refresh-token-singleflight
Closed

fix(auth/codex): de-duplicate concurrent token refreshes via singleflight#3784
ppanphper wants to merge 1 commit into
router-for-me:devfrom
ppanphper:fix/codex-refresh-token-singleflight

Conversation

@ppanphper

Copy link
Copy Markdown

What

Wrap (*CodexAuth).RefreshTokens in a package-level singleflight.Group keyed by the refresh token, so concurrent refreshes for the same Codex (OpenAI) credential collapse into a single upstream call whose result is shared.

Fixes #3783.

Why

OpenAI rotates refresh tokens on every successful refresh and immediately invalidates the previous one. The Codex refresh path had no in-process de-duplication, so two concurrent refreshes of the same token race: the first succeeds and rotates the token, while the second sends the now-stale token and fails with refresh_token_reused.

The Claude auth path (internal/auth/claude/anthropic_auth.go) already solves this with singleflight; this change brings Codex in line.

How

internal/auth/codex/openai_auth.go:

  • Add a package-level var codexRefreshGroup singleflight.Group.
  • RefreshTokens now calls codexRefreshGroup.Do(refreshToken, ...); the actual HTTP exchange moves into a new unexported refreshTokensSingleFlight.
  • The inner call uses context.WithoutCancel(ctx) so one caller's cancellation does not abort the shared refresh that other waiting callers depend on.
  • The group is package-level on purpose: the executor constructs a fresh CodexAuth per refresh, so an instance-scoped group would not de-duplicate.

Test

Added TestRefreshTokens_ConcurrentSingleFlight: launches N concurrent RefreshTokens calls for the same token against a blocking transport and asserts exactly one upstream call is made and all callers succeed.

go test -race -run 'TestRefreshTokens|TestNewCodexAuth' ./internal/auth/codex/

passes.

Scope / limitation

singleflight de-duplicates within a single process only. Multiple processes sharing the same auth store still race the rotating token and need a single-writer guarantee at the deployment level — noted in #3783.

🤖 Generated with Claude Code

…ight

OpenAI rotates the refresh token on every successful refresh and
immediately invalidates the previous one. The Codex auth refresh path had
no in-process de-duplication, so two concurrent refreshes of the same
token would race: the first succeeds and rotates the token, while the
second sends the now-stale token and fails with `refresh_token_reused`.

Wrap RefreshTokens with a package-level singleflight.Group keyed by the
refresh token (mirroring the existing Claude auth implementation in
internal/auth/claude), so concurrent refreshes of the same token collapse
into a single upstream call whose result is shared. The group is
package-level on purpose: the executor constructs a fresh CodexAuth per
refresh, so an instance-scoped group would not de-duplicate. context is
detached with WithoutCancel so one caller's cancellation does not abort
the shared refresh other callers depend on.

Add a concurrency test asserting that N concurrent refreshes for the same
token result in exactly one upstream call.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot changed the base branch from main to dev June 9, 2026 13:21
@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown

This pull request targeted main.

The base branch has been automatically changed to dev.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces singleflight to collapse concurrent token refresh requests for the same refresh token, preventing race conditions and "refresh_token_reused" errors. The review feedback highlights two critical issues with this implementation: an infinite hang risk due to stripping the context timeout with context.WithoutCancel on an untimeout-configured HTTP client, and ignored context cancellation for individual callers because of the synchronous Do call. It is recommended to use DoChan with a background context timeout instead.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +202 to +214
// Use WithoutCancel so a single caller's cancellation does not abort the shared
// refresh that other waiting callers depend on.
result, err, _ := codexRefreshGroup.Do(refreshToken, func() (interface{}, error) {
return o.refreshTokensSingleFlight(context.WithoutCancel(ctx), refreshToken)
})
if err != nil {
return nil, err
}
tokenData, ok := result.(*CodexTokenData)
if !ok || tokenData == nil {
return nil, fmt.Errorf("token refresh failed: invalid single-flight result")
}
return tokenData, nil

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

There are two significant issues with the current singleflight implementation:

  1. Infinite Hang Risk: Since context.WithoutCancel(ctx) strips any timeout/deadline from the caller's context, and the http.Client has no default timeout configured, an unresponsive upstream server will cause the HTTP request to hang indefinitely. This will permanently block the singleflight key, causing all subsequent token refreshes to hang forever.
  2. Ignored Context Cancellation: Using codexRefreshGroup.Do blocks the caller synchronously. If a caller's context is canceled, they will still remain blocked until the shared HTTP request completes.

Using DoChan combined with a background context timeout solves both issues. Individual callers can return immediately if their context is canceled, while the background request is protected by a 30-second timeout and properly cleaned up to avoid context leaks.

	// Use DoChan and select on ctx.Done() so that if an individual caller's context
	// is canceled, that caller can return immediately without blocking on the shared
	// refresh. Meanwhile, the background refresh continues with a timeout to prevent
	// infinite hangs and context leaks.
	ch := codexRefreshGroup.DoChan(refreshToken, func() (interface{}, error) {
		bgCtx, cancel := context.WithTimeout(context.WithoutCancel(ctx), 30*time.Second)
		defer cancel()
		return o.refreshTokensSingleFlight(bgCtx, refreshToken)
	})

	select {
	case <-ctx.Done():
		return nil, ctx.Err()
	case res := <-ch:
		if res.Err != nil {
			return nil, res.Err
		}
		tokenData, ok := res.Val.(*CodexTokenData)
		if !ok || tokenData == nil {
			return nil, fmt.Errorf("token refresh failed: invalid single-flight result")
		}
		return tokenData, nil
	}

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces single-flight de-duplication for concurrent token refreshes using singleflight.Group to prevent race conditions and "refresh_token_reused" errors caused by OpenAI's token rotation. It also adds a concurrent test to verify this behavior. The review feedback highlights two critical issues: first, using singleflight.Group.Do blocks the calling goroutine even if the caller's context is cancelled, which should be resolved by using DoChan and selecting on ctx.Done(); second, using context.WithoutCancel without an HTTP client timeout can cause requests to hang indefinitely, which should be mitigated by enforcing a safety timeout inside the single-flight function.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +204 to +214
result, err, _ := codexRefreshGroup.Do(refreshToken, func() (interface{}, error) {
return o.refreshTokensSingleFlight(context.WithoutCancel(ctx), refreshToken)
})
if err != nil {
return nil, err
}
tokenData, ok := result.(*CodexTokenData)
if !ok || tokenData == nil {
return nil, fmt.Errorf("token refresh failed: invalid single-flight result")
}
return tokenData, nil

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Using singleflight.Group.Do blocks the calling goroutine until the single-flight function completes, even if the caller's context (ctx) is cancelled or times out. This can lead to goroutine leaks and unresponsive clients under high load or network degradation.

To respect context cancellation for the caller while still allowing the shared background refresh to complete, we should use singleflight.Group.DoChan and select on ctx.Done() and the result channel.

	ch := codexRefreshGroup.DoChan(refreshToken, func() (interface{}, error) {
		return o.refreshTokensSingleFlight(context.WithoutCancel(ctx), refreshToken)
	})
	select {
	case <-ctx.Done():
		return nil, ctx.Err()
	case res := <-ch:
		if res.Err != nil {
			return nil, res.Err
		}
		tokenData, ok := res.Val.(*CodexTokenData)
		if !ok || tokenData == nil {
			return nil, fmt.Errorf("token refresh failed: invalid single-flight result")
		}
		return tokenData, nil
	}

Comment on lines +220 to 221
func (o *CodexAuth) refreshTokensSingleFlight(ctx context.Context, refreshToken string) (*CodexTokenData, error) {
data := url.Values{

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Because context.WithoutCancel(ctx) strips any deadline or timeout from the parent context, and the CodexAuth HTTP client is initialized without a default timeout (&http.Client{}), the background HTTP request has no timeout. If the upstream server hangs or the connection drops silently, this request could block indefinitely.

To prevent indefinite hangs, we should enforce a safety timeout (e.g., 30 seconds) on the context inside refreshTokensSingleFlight. Since this timeout is applied inside the single-flight function, it won't be affected by individual caller cancellations.

Suggested change
func (o *CodexAuth) refreshTokensSingleFlight(ctx context.Context, refreshToken string) (*CodexTokenData, error) {
data := url.Values{
func (o *CodexAuth) refreshTokensSingleFlight(ctx context.Context, refreshToken string) (*CodexTokenData, error) {
ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
defer cancel()
data := url.Values{

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 298b426a2e

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +204 to +205
result, err, _ := codexRefreshGroup.Do(refreshToken, func() (interface{}, error) {
return o.refreshTokensSingleFlight(context.WithoutCancel(ctx), refreshToken)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Respect canceled refresh contexts

When the caller's context is already canceled or reaches its deadline, this still enters the singleflight function with context.WithoutCancel(ctx), so the POST no longer observes cancellation or deadlines. In this path NewCodexAuthWithProxyURL constructs an http.Client{} without a timeout, so a canceled request or shutdown context can now block indefinitely on a stalled token endpoint instead of returning context.Canceled/DeadlineExceeded; use a non-cancelable shared context only for protecting other waiters while still letting each caller stop waiting (or at least skip starting the refresh when its context is already done).

Useful? React with 👍 / 👎.

@luispater luispater added the wontfix This will not be worked on label Jun 20, 2026
@luispater luispater closed this Jun 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

wontfix This will not be worked on

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Codex token refresh lacks concurrency de-duplication, causing refresh_token_reused

2 participants