Skip to content

Retry upstream "unexpected EOF" errors instead of returning them to the client#3726

Closed
hex-ci wants to merge 16 commits into
router-for-me:devfrom
hex-ci:main
Closed

Retry upstream "unexpected EOF" errors instead of returning them to the client#3726
hex-ci wants to merge 16 commits into
router-for-me:devfrom
hex-ci:main

Conversation

@hex-ci

@hex-ci hex-ci commented Jun 5, 2026

Copy link
Copy Markdown

Summary

When an upstream SSE/HTTP response is cut off mid-flight, executors surface the read failure as a plain unexpected EOF (e.g. scanner.Err() in claude_executor.go:512-519). This error carries no HTTP status (statusCodeFromError returns 0), so Manager.shouldRetryAfterError fell through its 429-only retry path and returned the error straight to the client. Some clients abort the whole request on seeing API Error: unexpected EOF.

This makes shouldRetryAfterError treat unexpected EOF like a normal retryable API error: it now retries immediately (honoring the configured request-retry count) rather than surfacing the error.

// shouldRetryAfterError, after the isRequestInvalidError check:
if isUnexpectedEOFError(err) {
    if !m.retryAllowed(attempt, providers) {
        return 0, false
    }
    return 0, true // immediate retry, no cooldown wait
}

isUnexpectedEOFError matches both errors.Is(err, io.ErrUnexpectedEOF) and a case-insensitive "unexpected eof" substring, so it works through wraps and across executors. Because the detection lives in the shared conductor retry loop, it covers non-streaming Execute and the streaming bootstrap phase (EOF before the first payload, surfaced via streamBootstrapError, whose Error()/Unwrap() preserve the cause).

Returning wait=0 is deliberate: a statusless EOF leaves the credential with NextRetryAfter == 0, so isAuthBlockedForModel keeps it immediately re-pickable; this also avoids the wait > maxWait early-return that would otherwise block a retry when the error happens to carry a 5xx status (which sets a 60s cooldown > the default 30s max-retry-interval).

Scope note: an EOF that occurs mid-stream (after bytes were already streamed to the client) is forwarded through wrapStreamResult and is intentionally not retried, since partial content has already been sent.

Tests

  • TestManager_ShouldRetryAfterError_RetriesUnexpectedEOF — statusless unexpected EOF and wrapped io.ErrUnexpectedEOF retry immediately (wait==0); stops once request-retry is exhausted.
  • TestManager_ShouldRetryAfterError_UnexpectedEOFRespectsRetryDisabled — no retry when request-retry=0.
  • go build ./..., go vet ./sdk/cliproxy/auth/, and go test ./... all pass.

devin-ai-integration Bot and others added 2 commits June 5, 2026 03:27
Retry upstream "unexpected EOF" errors instead of returning them to the client
@github-actions github-actions Bot changed the base branch from main to dev June 5, 2026 07:15
@github-actions

github-actions Bot commented Jun 5, 2026

Copy link
Copy Markdown

This pull request targeted main.

The base branch has been automatically changed to dev.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces handling and automatic retries for transient "unexpected EOF" errors in the auth conductor. It adds a helper function isUnexpectedEOFError to detect these errors and updates shouldRetryAfterError to trigger an immediate retry when they occur, provided retries are allowed. Additionally, unit tests have been added to verify this behavior and ensure that retry limits are respected. There are no review comments, so I have no feedback to provide.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7a863ebfa9

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread sdk/cliproxy/auth/conductor.go Outdated
Co-Authored-By: Hex <hex@codeigniter.org.cn>
Limit immediate unexpected-EOF retry to statusless errors

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5a0adba8ec

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread sdk/cliproxy/auth/conductor.go Outdated
Co-Authored-By: Hex <hex@codeigniter.org.cn>
Honor per-auth request_retry override for unexpected-EOF retries

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c67f875192

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread sdk/cliproxy/auth/conductor.go Outdated
devin-ai-integration Bot and others added 2 commits June 5, 2026 09:29
Co-Authored-By: Hex <hex@codeigniter.org.cn>
…auth

Record the EOF-producing auth at error time for per-auth retry overrides

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0473c0d3d6

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread sdk/cliproxy/auth/conductor.go

Copy link
Copy Markdown

I hit a Grok Build/xAI case that appears to match this PR while testing the v7.1.50 release.

Environment:

  • Release binary: v7.1.50 (4f55ecca, linux amd64)
  • Previous known-good binary on the same machine: v7.1.33 (33983b6f)
  • Config had request-retry: 3 and max-retry-interval: 30
  • Provider/auth type: xAI OAuth auth entries exposed as prefixed Grok Build routes
  • Endpoint/model shape: POST /v1/responses, model: grok-default/grok-build-0.1, stream: false

Observed during upgrade smoke:

  • After installing v7.1.50, the first core Grok Build smoke returned HTTP 500 with body containing unexpected EOF.
  • Rolling back to v7.1.33 on the same config/auth entries immediately restored the same Grok Build route.

I then reproduced safely with an isolated v7.1.50 process on another localhost port using a copied auth dir. Basic Grok Build requests were not deterministic failures: /v1/models listed the Grok routes, and 21 short non-streaming/streaming calls across grok-default, grok-green, and grok-yellow all returned 200. So this looks intermittent/transport-like rather than a stable request-shape rejection.

Code inspection of v7.1.50 also seems to line up with this PR: Manager.shouldRetryAfterError returns false for a statusless non-429 error after the cooldown lookup, so a plain upstream unexpected EOF can bubble straight to the client even with request-retry > 0. This PR’s status == 0 && isUnexpectedEOFError(err) immediate retry path seems like the right fix for the failure mode I saw.

No token or auth values are included here.

Copy link
Copy Markdown

I also built and tested this together with PR #3734 on top of the v7.1.50 release commit.

Build under test:

Focused tests passed:

go test -count=1 ./sdk/cliproxy/auth ./internal/runtime/executor

That includes the new conductor tests from this PR for statusless unexpected EOF retry behavior and per-auth retry handling.

Live smoke results against the combined build:

/v1/models listed grok-default/grok-build-0.1, grok-green/grok-build-0.1, grok-yellow/grok-build-0.1, and go-ds-flash
simple Grok Build /v1/responses for all three profiles -> 200
Grok Build streaming /v1/responses -> 200
non-Grok go-ds-flash /v1/chat/completions -> 200

I could not force a deterministic live unexpected EOF in the short smoke, which matches the intermittent nature of the original failure. But the code path and tests in #3726 address the failure mode I hit on the v7.1.50 upgrade: a statusless upstream unexpected EOF would otherwise fall through as a non-429/no-cooldown error and return to the client despite request-retry > 0.

No token or auth values are included here.

@hex-ci hex-ci closed this Jun 10, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants