fix(capacity-retry): expand markers to match real upstream transient errors

ImogeneOctaviap794 · ImogeneOctaviap794 · commit 6380626f84b4 · 2026-05-13T16:42:54.000+08:00
A 10-minute sample of production dialog_logs (886 requests, 149 response.failed events, roughly a 17% transient failure rate) shows that the existing marker list missed 100% of real failures:

  - "Our servers are currently overloaded. Please try again later." (90%)
  - "An error occurred while processing your request. You can retry..." (10%)

The "Selected model is at capacity. Please try a different model." message that users see in the codex CLI is a CLIENT-SIDE fallback rendering, not the real upstream payload. The previous marker list ("at capacity" / "try a different mode|model") therefore matched the CLI display but never the wire content. Net effect: zero transparent retries in 24h despite a ~17% upstream failure rate; every failure leaked to clients.

Fix: add markers covering both observed phrases plus the generic "try again later" tail. Negative test cases pin auth and context-length errors, which must NOT be retried.
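
The matching rule described above can be sketched as follows. This is a minimal illustration, not the code in proxy/capacity_retry.go: the body of isCapacityError is not part of this diff, so the lowercase substring check is an assumption based on the "compared in lowercase" note on the marker list, and the names transientMarkers and matchesTransientMarker are hypothetical.

```go
package main

import "strings"

// transientMarkers mirrors the marker list after this commit. All entries
// are lowercase because the message is lowercased before comparison.
var transientMarkers = []string{
	// Capacity family (the pre-existing markers).
	"at capacity",
	"try a different mode",
	"try a different model",
	// Server overload (the dominant phrase in the production sample).
	"currently overloaded",
	"servers are currently",
	"try again later",
	// Generic transient error.
	"an error occurred while processing your request",
}

// matchesTransientMarker is a hypothetical stand-in for isCapacityError.
// Assumption: a message is retryable when its lowercased form contains
// at least one marker as a substring.
func matchesTransientMarker(msg string) bool {
	if msg == "" {
		return false
	}
	lower := strings.ToLower(msg)
	for _, marker := range transientMarkers {
		if strings.Contains(lower, marker) {
			return true
		}
	}
	return false
}
```

Under this assumption, auth and context-length messages fall through the loop without matching, which is the behavior the negative test cases are meant to pin.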

The function keeps the name isCapacityError to minimize blast radius: the three call sites in handler.go (/v1/responses, /v1/chat/completions, /v1/compact) and the call site in handler_anthropic.go (/v1/messages) get the broader coverage automatically.
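
For readers unfamiliar with how the call sites use the predicate, here is a rough sketch of a transparent retry loop. It is not the handler code, which this diff does not show. It assumes the behavior stated in the marker list's doc comment: a retry switches to another account and is only allowed while no bytes have been written to the downstream client. The types and names (attemptResult, retryTransient, the call and retryable parameters) are hypothetical.

```go
package main

import "errors"

// attemptResult is a hypothetical summary of one upstream attempt.
type attemptResult struct {
	bytesWritten int    // bytes already forwarded to the downstream client
	errMsg       string // message from response.failed; empty on success
}

// retryTransient tries the accounts in order and returns the account that
// succeeded. It moves on to the next account only when the failure is
// classified as retryable AND nothing has been sent downstream yet.
// Otherwise the error is surfaced to the caller unchanged.
func retryTransient(
	accounts []string,
	call func(account string) attemptResult,
	retryable func(msg string) bool,
) (string, error) {
	lastErr := errors.New("no accounts available")
	for _, account := range accounts {
		res := call(account)
		if res.errMsg == "" {
			return account, nil
		}
		lastErr = errors.New(res.errMsg)
		if res.bytesWritten > 0 || !retryable(res.errMsg) {
			// Either the client has already seen partial output, or the
			// error is not transient (auth, context length, and so on).
			return "", lastErr
		}
	}
	return "", lastErr
}
```

In this sketch the predicate is injected, so widening the marker list changes retry behavior at every caller without touching the loop itself, which is the same reason the commit keeps the function name unchanged.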
diff --git a/proxy/capacity_retry.go b/proxy/capacity_retry.go
@@ -6,22 +6,38 @@ import (
 	"github.com/tidwall/gjson"
 )
 
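-// capacityErrorMarkers lists keywords that identify "capacity shortage" errors returned by upstream Codex (compared in lowercase).
+// capacityErrorMarkers lists keywords that identify "transient, retryable" errors returned by upstream Codex (compared in lowercase).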
 //
 // Typical error messages (carried by the response.failed event of the upstream Responses SSE stream):
 //
 //	"Selected model is at capacity. Please try a different mode"
 //	"The model you requested is at capacity..."
+//	"Our servers are currently overloaded. Please try again later."
+//	"An error occurred while processing your request. You can retry your request..."
 //
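-// A match on any marker classifies the error as a capacity error, allowing codex2api to transparently retry the request
+// A match on any marker classifies the error as transient and retryable, allowing codex2api to transparently retry the request
 // (with a different account), provided that no bytes of the response stream have been written to the downstream client yet.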
 //
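+// History: before v1.7.51 only the "at capacity" family was matched, but production
+// dialog_logs samples show that the "Selected model is at capacity. Please try a
+// different model." text rendered by the codex CLI is a client-side fallback message.
+// The real upstream text is "Our servers are currently overloaded" in 90% of cases and
+// "An error occurred while processing your request" in 10%. None of the old markers
+// matched either one, so transparent retry never fired and every error leaked to clients.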
+//
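 // Reference: GitHub Issue openai/codex#17014. During the regional gpt-5.4 capacity crunch of April 2026,
 // upstream often emitted response.failed after the stream was established but before any content was generated.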
 var capacityErrorMarkers = []string{
+	// Capacity errors (early OpenAI wording)
 	"at capacity",
 	"try a different mode",
 	"try a different model",
+	// Server overload (added in v1.7.51; the dominant production error)
+	"currently overloaded",
+	"servers are currently",
+	"try again later",
+	// Generic transient error (upstream explicitly says "you can retry")
+	"an error occurred while processing your request",
 }
 
 // isCapacityError reports whether the error message matches the "upstream capacity shortage" signature.
diff --git a/proxy/capacity_retry_test.go b/proxy/capacity_retry_test.go
@@ -17,6 +17,15 @@ func TestIsCapacityError(t *testing.T) {
 		{"try a different model", "Please try a different model.", true},
 		{"rate limit is NOT capacity", "Rate limit exceeded", false},
 		{"quota is NOT capacity", "You exceeded your current quota", false},
+		// Markers added in v1.7.51 (samples measured from production dialog_logs)
+		{"servers overloaded (90% of samples)",
+			"Our servers are currently overloaded. Please try again later.", true},
+		{"generic upstream error (10% of samples)",
+			"An error occurred while processing your request. You can retry your request, or contact us through our help center at help.openai.com if the error persists.", true},
+		{"only 'try again later' tail", "Service unavailable. Try again later.", true},
+		// Negative cases guarding against false positives
+		{"auth error must NOT retry", "Invalid authentication credentials", false},
+		{"context length must NOT retry", "This model's maximum context length is 128000 tokens", false},
 	}
 	for _, c := range cases {
 		t.Run(c.name, func(t *testing.T) {