fix(litellm): let the gateway absorb Claude Code GLM rate limiting

mudssky · mudssky · commit 943b0d341af8 · 2026-05-07T20:26:27.000+08:00
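Explicitly document LiteLLM's short-retry semantics for 429 / RateLimitError, and make the Claude Code GLM entry points continue to the DeepSeek fallback once retries fail. Also sync the Trellis task context; the local ignored config is kept out of the commit.

Constraint: Only covers the Claude Code GLM entry points; no backup upstreams are added for ordinary models

Rejected: Raising the retry count globally | under sustained rate limiting it only adds wait time and cannot switch to a healthy upstream

Confidence: medium

Scope-risk: moderate

Directive: Do not extend this to the generic gpt/gemini/compat models unless backup deployments and a cost policy are defined first

Tested: pnpm qa; YAML parsing of litellm.local.yaml/newapi.yaml; rumdl check ai/gateway/litellm/litellm.md

Not-tested: no integration test against a real upstream 429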
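
For orientation, the routing this commit describes can be sketched as a `router_settings` fragment. This is a minimal sketch, not the committed file: `litellm.local.yaml` is git-ignored and its real contents are not shown in this diff. The model names and values are the ones quoted in the task notes; the exact layout, and anything else the real file contains, is an assumption.

```yaml
# Hypothetical fragment; the real litellm.local.yaml is not part of this commit.
router_settings:
  # Short in-gateway retries before giving up on the current model group.
  num_retries: 2
  # Make the 429 / RateLimitError retry count explicit.
  retry_policy:
    RateLimitErrorRetries: 2
  # Per the task notes, a single failure sends a deployment into cooldown
  # (the GLM deployments are described as setting cooldown_time: 3600).
  allowed_fails: 1
  # Once retries are exhausted, switch to the DeepSeek backup model.
  fallbacks:
    - cc-glmplan-opus: ["claude-code-deepseek-v4-pro"]
    - cc-glmplan-haiku: ["claude-code-deepseek-v4-flash"]
```

If every retry and the fallback also fail, the client still receives the error; a fragment like this only narrows the window in which a transient 429 is visible.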
diff --git a/.trellis/tasks/05-07-litellm-fallback-retry-429/check.jsonl b/.trellis/tasks/05-07-litellm-fallback-retry-429/check.jsonl
@@ -0,0 +1,3 @@
+{"file": ".trellis/tasks/05-07-litellm-fallback-retry-429/prd.md", "reason": "Acceptance criteria and scope boundaries"}
+{"file": ".trellis/tasks/05-07-litellm-fallback-retry-429/research/litellm-retry-fallback-429.md", "reason": "Verify that the LiteLLM 429 retry/fallback configuration matches the official semantics"}
+{"file": ".trellis/spec/guides/index.md", "reason": "Thinking triggers for general quality checks"}
diff --git a/.trellis/tasks/05-07-litellm-fallback-retry-429/implement.jsonl b/.trellis/tasks/05-07-litellm-fallback-retry-429/implement.jsonl
@@ -0,0 +1,3 @@
+{"file": ".trellis/tasks/05-07-litellm-fallback-retry-429/prd.md", "reason": "Implementation scope, acceptance criteria, and technical decisions"}
+{"file": ".trellis/tasks/05-07-litellm-fallback-retry-429/research/litellm-retry-fallback-429.md", "reason": "Official LiteLLM 429 retry/fallback semantics and configuration recommendations"}
+{"file": ".trellis/spec/guides/index.md", "reason": "Thinking triggers for general code reuse and cross-layer concerns"}
diff --git a/.trellis/tasks/05-07-litellm-fallback-retry-429/prd.md b/.trellis/tasks/05-07-litellm-fallback-retry-429/prd.md
@@ -0,0 +1,103 @@
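+# LiteLLM Claude Code GLM entry points: transparent fallback on 429
+
+## Goal
+
+When the upstream returns 429 or is temporarily rate limited, the Claude Code GLM primary entry points in `ai/gateway/litellm/litellm.local.yaml` should first be retried by the LiteLLM gateway and then switched to the DeepSeek backup upstream via the existing fallback routes, so that Claude Code / frontend clients notice transient 429s as little as possible.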
+
+## What I already know
+
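+* The user wants `ai/gateway/litellm/litellm.local.yaml` to also retry when falling back, and to "cover requests that hit 429 so the frontend does not notice".
+* The current local config already has `litellm_settings.num_retries: 2` and `router_settings.num_retries: 2`.
+* The current fallbacks only cover `cc-glmplan-opus -> claude-code-deepseek-v4-pro` and `cc-glmplan-haiku -> claude-code-deepseek-v4-flash`.
+* The current GLM Claude Code entry points set `cooldown_time: 3600`, so a failed deployment cools down before it is probed again.
+* `qwen.yaml` and `docs/multi-newapi-routing.md` already contain fallback examples for fixed models.
+* The official LiteLLM docs classify 429 as `RateLimitError`, which ordinary `fallbacks` cover; a fallback triggers once retries on the current model are exhausted.
+* `router_settings.retry_policy.RateLimitErrorRetries` can declare the 429 retry count explicitly, so readers do not have to infer it from the global `num_retries`.
+* The other explicit models currently have no backup deployment/model; retries alone cannot hide sustained 429s, so this task does not extend to them.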
+
+## Requirements
+
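+* Solve this through LiteLLM YAML configuration first; do not add a proxy service or a custom middleware layer.
+* Only handle the Claude Code GLM entry points: `cc-glmplan-opus` and `cc-glmplan-haiku`.
+* The local LiteLLM config must perform short in-gateway retries on 429 / RateLimitError for the Claude Code GLM entry points.
+* After retries on a GLM entry point fail, the request should switch to the DeepSeek backup upstream via the existing fallback routes, so the client does not receive a 429 immediately.
+* Do not change the model names or the calling convention that the Claude Code client sends.
+* Config comments must explain the design intent of retry, fallback, and cooldown.
+* The docs must state clearly that an error is still returned to the client when all retries and fallbacks fail.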
+
+## Acceptance Criteria
+
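+* [ ] The 429-related retry / fallback configuration in `litellm.local.yaml` has clear semantics and matches official LiteLLM behavior.
+* [ ] `cc-glmplan-opus` retries on 429 and, if it still fails, falls back to `claude-code-deepseek-v4-pro`.
+* [ ] `cc-glmplan-haiku` retries on 429 and, if it still fails, falls back to `claude-code-deepseek-v4-flash`.
+* [ ] The docs describe the boundary of the gateway absorbing 429s: when all retries and fallbacks fail, the error is still exposed to the client.
+* [ ] The YAML parses and basic QA passes.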
+
+## Definition of Done (team quality bar)
+
+* Tests added/updated where behavior can be validated locally.
+* Lint / typecheck / CI green where applicable.
+* Docs/notes updated if behavior changes.
+* Rollout/rollback considered if risky.
+
+## Out of Scope (explicit)
+
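+* No new LiteLLM upstream providers and no new dependencies.
+* No 429 backup deployments for ordinary models such as `gpt-5.5`, `gemini-3.1-pro`, `compat/claude-*`, `GLM-*`, or `*`.
+* No promise of unlimited retries or of fully swallowing sustained quota exhaustion.
+* No changes to frontend client code.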
+
+## Research References
+
+* [`research/litellm-retry-fallback-429.md`](research/litellm-retry-fallback-429.md): confirms that a LiteLLM 429 follows the `RateLimitError` retry/fallback semantics and summarizes the recommended configuration boundaries.
+
+## Research Notes
+
+### Feasible approaches here
+
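+**Approach A: Make the existing Claude Code 429 strategy explicit (Recommended)**
+
+* How it works: keep the existing GLM -> DeepSeek fallback, and add `retry_policy.RateLimitErrorRetries` plus documentation.
+* Pros: small change; client model names stay the same; no new upstreams or dependencies; directly addresses the current GLM rate-limiting scenario.
+* Cons: only the Claude Code entry points that already have a backup model become truly transparent; sustained 429s on other models may still surface.
+
+**Approach B: Add backup deployments for more explicit models**
+
+* How it works: configure backup models/deployments for `gpt-5.5`, `gemini-3.1-pro`, `compat/claude-*`, and so on, then absorb 429s through `fallbacks` or `order` routing.
+* Pros: broader coverage; ordinary frontend models would also see fewer 429s.
+* Cons: needs more upstream credentials plus naming and cost policies; configuration complexity grows noticeably.
+
+**Approach C: Raise the global retry count**
+
+* How it works: increase `num_retries` or the 429 retry count to give the same upstream more chances to recover.
+* Pros: the smallest configuration change.
+* Cons: under sustained rate limiting it only lengthens the wait and cannot switch to a healthy backup upstream; unsuitable for interactive frontends / Claude Code.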
+
+## Decision (ADR-lite)
+
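+**Context**: The user confirmed that this round only covers transparent 429 fallback for the Claude Code GLM entry points. Other models have no backup deployment, and widening the scope would raise further questions about upstream credentials, cost, and routing policy.
+
+**Decision**: Adopt Approach A. Keep the existing DeepSeek fallback for `cc-glmplan-opus/haiku`, and explicitly add the 429 / `RateLimitError` retry policy and document its boundaries.
+
+**Consequences**: Transient 429s on the Claude Code GLM entry points are more likely to be absorbed by in-gateway retry/fallback. Sustained 429s on ordinary models are not hidden by this task; covering them later requires adding backup upstreams first.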
+
+## Technical Approach
+
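+* Explicitly configure a short retry policy for 429 / `RateLimitError` in `router_settings`, keeping overall interactive latency under control.
+* Keep the existing `fallbacks`: `cc-glmplan-opus -> claude-code-deepseek-v4-pro` and `cc-glmplan-haiku -> claude-code-deepseek-v4-flash`.
+* Keep `cooldown_time: 3600` on the GLM deployments, so recovery probing continues to follow the cooldown strategy for quota exhaustion / rate limiting.
+* Update `litellm.md` to explain that a 429 is first retried briefly, then falls back to DeepSeek, and that an error is still returned once every path fails.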
+
+## Implementation Plan
+
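+* PR1: Update the Router 429 retry policy and the Chinese comments in `litellm.local.yaml`.
+* PR2: Update the Claude Code GLM fallback description and failure boundary in `litellm.md`.
+* PR3: Run YAML parsing and project QA to confirm the config files are well-formed.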
+
+## Technical Notes
+
+* Inspected `ai/gateway/litellm/litellm.local.yaml`.
+* Inspected `ai/gateway/litellm/qwen.yaml`.
+* Inspected `ai/gateway/litellm/newapi.yaml`.
+* Inspected `ai/gateway/litellm/docs/multi-newapi-routing.md`.
+* Inspected `ai/gateway/litellm/litellm.md`.
diff --git a/.trellis/tasks/05-07-litellm-fallback-retry-429/research/litellm-retry-fallback-429.md b/.trellis/tasks/05-07-litellm-fallback-retry-429/research/litellm-retry-fallback-429.md
@@ -0,0 +1,80 @@
+# Research: LiteLLM retry and fallback for 429
+
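+- Query: How LiteLLM Proxy/Router retries, falls back, and cools down on an upstream 429 / RateLimitError, and how this repo's `ai/gateway/litellm/litellm.local.yaml` should be configured so that transient 429s are exposed to frontend clients as little as possible.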
+- Scope: mixed
+- Date: 2026-05-07
+
+## Findings
+
+### Files found
+
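+- `ai/gateway/litellm/litellm.local.yaml`: the current local default LiteLLM config; contains the Claude Code GLM primary entry points, the DeepSeek fallback, global retry, Router fallback, and cooldown.
+- `ai/gateway/litellm/newapi.yaml`: legacy/backup NewAPI config; already has `litellm_settings.num_retries` and `router_settings.num_retries`, but no Claude Code GLM fallback.
+- `ai/gateway/litellm/qwen.yaml`: a small fallback example showing how a primary model, a backup model, `router_settings.fallbacks`, `allowed_fails`, `cooldown_time`, and `num_retries` combine.
+- `ai/gateway/litellm/litellm.md`: user documentation; already describes GLM-first routing for Claude Code, the DeepSeek fallback, and the cooldown recovery-probe semantics.
+- `ai/gateway/litellm/docs/multi-newapi-routing.md`: design doc for multi-NewAPI routing; its Option D covers primary/backup fallback across multiple upstreams for the same model.
+- `ai/gateway/litellm/compose.yaml`: LiteLLM container entry point; the image currently uses `${LITELLM_IMAGE:-docker.litellm.ai/berriai/litellm:main-latest}`, so behavior drifts with `main-latest`.
+- `.trellis/tasks/05-07-litellm-fallback-retry-429/prd.md`: the PRD for this task; the goal is to have 429s absorbed by gateway retry/fallback first, while failures beyond that boundary may still reach the client.
+- `.trellis/spec/node-script/frontend/index.md`: the index for the current task's package; it is a placeholder frontend spec. This task mainly changes YAML and docs, and no more specific LiteLLM configuration spec was found.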
+
+### Code patterns
+
+- `ai/gateway/litellm/litellm.local.yaml:45`: `cc-glmplan-opus` is the Claude Code primary entry point; its actual upstream is Zhipu's Anthropic-compatible endpoint.
+- `ai/gateway/litellm/litellm.local.yaml:52`: `cc-glmplan-opus` sets `cooldown_time: 3600`, so the deployment cools down for 1 hour after a failure.
+- `ai/gateway/litellm/litellm.local.yaml:55`: `cc-glmplan-haiku` reuses the same Zhipu Anthropic-compatible endpoint.
+- `ai/gateway/litellm/litellm.local.yaml:62`: `cc-glmplan-haiku` also sets `cooldown_time: 3600`.
+- `ai/gateway/litellm/litellm.local.yaml:64`: `claude-code-deepseek-v4-pro` is the pro fallback deployment used when the GLM primary entry point fails.
+- `ai/gateway/litellm/litellm.local.yaml:72`: `claude-code-deepseek-v4-flash` is the lightweight fallback deployment for Haiku/subagent traffic.
+- `ai/gateway/litellm/litellm.local.yaml:116`: `litellm_settings` already has `num_retries: 2` and `request_timeout: 60`.
+- `ai/gateway/litellm/litellm.local.yaml:126`: `router_settings` enables `enable_pre_call_checks` and defines Router-level behavior.
+- `ai/gateway/litellm/litellm.local.yaml:130`: the current `fallbacks` only cover `cc-glmplan-opus -> claude-code-deepseek-v4-pro` and `cc-glmplan-haiku -> claude-code-deepseek-v4-flash`.
+- `ai/gateway/litellm/litellm.local.yaml:136`: `allowed_fails: 1` sends a deployment into cooldown after a single failure.
+- `ai/gateway/litellm/litellm.local.yaml:138`: `router_settings.num_retries: 2` matches the SDK-level retry count.
+- `ai/gateway/litellm/qwen.yaml:21`: the Qwen example sets `num_retries: 2` and `request_timeout: 25` in `litellm_settings`.
+- `ai/gateway/litellm/qwen.yaml:28`: the Qwen example uses `router_settings.fallbacks` to declare that traffic switches to the backup model after the primary model fails.
+- `ai/gateway/litellm/qwen.yaml:36`: the Qwen example uses `allowed_fails` and `cooldown_time` to keep a failing upstream from continuing to receive traffic.
+- `ai/gateway/litellm/litellm.md:181`: the docs already state that `cc-glmplan-opus` degrades to DeepSeek pro on failure.
+- `ai/gateway/litellm/litellm.md:183`: the docs already explain that the GLM entry points cool down for 1 hour, and that the next request after the cooldown ends tries GLM again.
+- `ai/gateway/litellm/docs/multi-newapi-routing.md:193`: Option D defines "primary/backup / disaster recovery across multiple NewAPI upstreams for the same model" as the recommended pattern for handling instability or rate limiting on a single upstream.
+- `ai/gateway/litellm/docs/multi-newapi-routing.md:227`: the Option D example uses `router_settings.fallbacks` together with `num_retries: 2`.
+- `ai/gateway/litellm/compose.yaml:4`: the LiteLLM image is not pinned to a version and follows `main-latest` by default.
+
+### External references
+
+- LiteLLM Routing / Router docs: basic Router reliability covers cooldown, fallback, timeout, and retry. `order` can express primary/backup deployments under the same `model_name`; when an upstream fails (including 404, 429, and similar), the Router first tries the next `order`, each order level has its own retries, and only after all orders are exhausted does it move on to the configured fallbacks. Reference: <https://docs.litellm.ai/docs/routing>
+- LiteLLM Routing / Cooldowns docs: cooldown applies to a single deployment, not to the whole model group. The docs list 429 rate limits as a cooldown trigger, with a default cooldown of 5 seconds; `cooldown_time` can also be set globally or per model. Reference: <https://docs.litellm.ai/docs/routing>
+- LiteLLM Routing / Retries docs: the Router supports retries for failed requests; it uses exponential backoff for `RateLimitError` and retries immediately for generic errors; `retry_after` sets the minimum wait before a retry. Reference: <https://docs.litellm.ai/docs/routing>
+- LiteLLM Routing / Advanced retry policy docs: `router_settings.retry_policy` configures retry counts per exception type, for example `RateLimitErrorRetries`; `allowed_fails_policy` configures, per exception type, how many failures are allowed before cooldown, for example `RateLimitErrorAllowedFails`. Reference: <https://docs.litellm.ai/docs/routing>
+- LiteLLM Fallbacks docs: a fallback happens when a call still fails after `num_retries`, usually switching from one `model_name` to another; ordinary `fallbacks` cover the remaining errors, including `litellm.RateLimitError`. Reference: <https://docs.litellm.ai/docs/proxy/reliability>
+- LiteLLM Fallbacks advanced docs: the docs state explicitly that a fallback + retry + timeout + cooldown configuration covers errors such as 429 and 500, and the example shows that `num_retries` is the retry count per `model_name`, with fallback triggered after the retries. Reference: <https://docs.litellm.ai/docs/proxy/reliability>
+- LiteLLM All settings docs: where `router_settings` and `litellm_settings` overlap, `router_settings` overrides `litellm_settings`; `router_settings` supports fields such as `retry_policy`, `allowed_fails_policy`, `fallbacks`, `num_retries`, `max_fallbacks`, and `retry_after`. Reference: <https://docs.litellm.ai/docs/proxy/config_settings>
+
+### Interpretation for this task
+
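+- For the two current Claude Code GLM entry points, the existing `router_settings.fallbacks` should in theory already cover 429, because the LiteLLM docs place 429 within the scope of ordinary fallbacks and state explicitly that ordinary `fallbacks` include `RateLimitError`.
+- The current `router_settings.num_retries: 2` means a GLM primary entry point is first retried within its own model group; only if those retries still fail does the request move to `cc-glmplan-opus -> claude-code-deepseek-v4-pro` or `cc-glmplan-haiku -> claude-code-deepseek-v4-flash`.
+- The current `allowed_fails: 1`, combined with `cooldown_time: 3600` on the GLM deployments, cools GLM down for 1 hour after a single failure. That is reasonable if a 429 means the Coding Plan quota is exhausted; if it is only a brief RPM spike, 1 hour may be too long.
+- If the goal is only to keep 429s on the Claude Code GLM entry points from reaching the client as far as possible, the minimal implementation is usually to keep the existing fallbacks, add an explicit `retry_policy.RateLimitErrorRetries`, add a `retry_after` comment where needed, and update the docs to describe the boundary.
+- If the goal extends to other models such as `gpt-5.5`, `gemini-3.1-pro`, `compat/claude-*`, `GLM-*`, or `*`, there is currently no corresponding backup `model_name` or backup deployment; `num_retries` alone only retries the same upstream and cannot move a sustained 429 to another provider.
+- When a single external model name needs multi-upstream failover, LiteLLM's official `order` mode may be more natural than explicit fallbacks: several deployments share one `model_name`, the primary deployment sets `order: 1`, the backup deployment sets `order: 2`, the client-facing model name stays the same, and the Router tries the next order after failures such as 429.
+- For cross-model degradation, such as switching to DeepSeek when GLM fails, the current `router_settings.fallbacks` is the better fit, because it switches from one business-facing entry point to a different internal backup `model_name`.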
+
+### Recommended implementation direction
+
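+- Keep the primary entry point and fallback model names unchanged, to avoid changing client configuration.
+- Under `router_settings`, explicitly add or confirm:
+  - `num_retries: 2`: keep short retries so a transient 429 does not fall back immediately.
+  - `retry_policy.RateLimitErrorRetries: 2`: make the 429 semantics explicit instead of relying on readers knowing that `num_retries` applies to all errors.
+  - `fallbacks`: keep the two current Claude Code GLM to DeepSeek fallbacks.
+  - `allowed_fails: 1` and `cooldown_time: 3600` on the GLM deployments: keep them, or adjust them based on the product call between "quota exhausted" and "brief rate limiting".
+- To reduce interactive latency, do not raise `num_retries` blindly; for interactive callers such as frontends / Claude Code, the more reliable strategy is usually a short retry followed by a prompt fallback.
+- Protecting other explicit models from 429 requires adding a backup model/deployment first and then adding `fallbacks` or `order` for those model groups; otherwise there is no healthy target to switch to.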
+
+## Caveats / Not Found
+
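+- No automated integration test for LiteLLM 429 retry/fallback was found in the repo; the official docs offer `mock_testing_fallbacks=true` for testing ordinary fallbacks, but that is not fully equivalent to a real upstream 429.
+- compose currently defaults to the `main-latest` image without pinning a LiteLLM version; these findings are based on the official online docs as of 2026-05-07, and actual runtime behavior may change as the image updates.
+- The official config settings docs say `retry_after` defaults to 0 and is overridden when `x-retry-after` is received; this round did not verify whether the current image also honors the standard `Retry-After` response header.
+- Fallback can only hide a 429 when a retry or a backup model eventually succeeds; if the primary model and every backup model are rate limited, fail authentication, run out of quota, or are unavailable, the error is still returned to the client.
+- For streaming requests, once some tokens have been sent to the client, it is hard for any gateway to switch to a backup model transparently; this round did not verify how the current LiteLLM image handles retry/fallback for a streaming 429.
+- `.trellis/spec/node-script/frontend/index.md` is a placeholder frontend spec with no constraints specific to LiteLLM YAML configuration; this task should instead follow the existing file patterns under `ai/gateway/litellm` and the official LiteLLM docs.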
diff --git a/.trellis/tasks/05-07-litellm-fallback-retry-429/task.json b/.trellis/tasks/05-07-litellm-fallback-retry-429/task.json
@@ -0,0 +1,26 @@
+{
+  "id": "litellm-fallback-retry-429",
+  "name": "litellm-fallback-retry-429",
+  "title": "brainstorm: LiteLLM fallback retry on 429",
+  "description": "",
+  "status": "in_progress",
+  "dev_type": null,
+  "scope": null,
+  "package": "node-script",
+  "priority": "P2",
+  "creator": "codex",
+  "assignee": "codex",
+  "createdAt": "2026-05-07",
+  "completedAt": null,
+  "branch": null,
+  "base_branch": "master",
+  "worktree_path": null,
+  "commit": null,
+  "pr_url": null,
+  "subtasks": [],
+  "children": [],
+  "parent": null,
+  "relatedFiles": [],
+  "notes": "",
+  "meta": {}
+}
diff --git a/ai/gateway/litellm/litellm.md b/ai/gateway/litellm/litellm.md
diff --git a/ai/gateway/litellm/newapi.yaml b/ai/gateway/litellm/newapi.yaml
