Skip to content

feat: per-channel and per-key cooldown overlay#5618

Draft
phanen wants to merge 4 commits into
QuantumNous:mainfrom
phanen:feat/cooldown
Draft

feat: per-channel and per-key cooldown overlay#5618
phanen wants to merge 4 commits into
QuantumNous:mainfrom
phanen:feat/cooldown

Conversation

@phanen

@phanen phanen commented Jun 20, 2026

Copy link
Copy Markdown
  • feat: per-channel and per-key cooldown overlay
  • feat: add POST /api/channel/:id/clear_cooldown operator escape hatch

⚠️ 提交说明 / PR Notice

Important

  • 请提供人工撰写的简洁摘要,避免直接粘贴未经整理的 AI 输出。

📝 变更描述 / Description

上游失败就冷却 channel/key 一段时间, 防止下游请求频繁失败

🚀 变更类型 / Type of change

  • 🐛 Bug 修复 (Bug fix) - 请关联对应 Issue,避免将设计取舍、理解偏差或预期不一致直接归类为 bug
  • ✨ 新功能 (New feature) - 重大特性建议先通过 Issue 沟通
  • ⚡ 性能优化 / 重构 (Refactor)
  • 📝 文档更新 (Documentation)

🔗 关联任务 / Related Issue

✅ 提交前检查项 / Checklist

  • 人工确认: 我已亲自整理并撰写此描述,没有直接粘贴未经处理的 AI 输出。
  • 非重复提交: 我已搜索现有的 IssuesPRs,确认不是重复提交。
  • Bug fix 说明: 若此 PR 标记为 Bug fix,我已提交或关联对应 Issue,且不会将设计取舍、预期不一致或理解偏差直接归类为 bug。
  • 变更理解: 我已理解这些更改的工作原理及可能影响。
  • 范围聚焦: 本 PR 未包含任何与当前任务无关的代码改动。
  • 本地验证: 已在本地运行并通过测试或手动验证,维护者可以据此复核结果。
  • 安全合规: 代码中无敏感凭据,且符合项目代码规范。

📸 运行证明 / Proof of Work

检测失败: key #2 of channel #38 hit a business error, 然后 1 小时不再发送请求

Jun 20 13:13:21 arch new-api[1412292]: [ERR] 2026/06/20 - 13:13:21 | 202606200513213808099318268d9d6oTDhGQGG | channel error (channel #38, status code: 400): Access
denied, please make sure your account is in good standing. For details, see: https://help.aliyun.com/zh/model-studio/error-code#overdue-payment
Jun 20 13:13:21 arch new-api[1412292]: [INFO] 2026/06/20 - 13:13:21 | SYSTEM | cooldown: channel #38 classified as business (status=400, msg="Access denied, please m
ake sure your account is in good standing. For details, see: https://help.aliyun.com/zh/model-studio/error-code#overdue-payment")
Jun 20 13:13:21 arch new-api[1412292]: [INFO] 2026/06/20 - 13:13:21 | SYSTEM | key #2 of channel #38 hit a business error (status=400), entering 3600s cooldown until
 2026-06-20T14:13:21+08:00
Jun 20 13:13:21 arch new-api[1412292]: [ERR] 2026/06/20 - 13:13:21 | 202606200513213808099318268d9d6oTDhGQGG | relay error: Access denied, please make sure your acco
unt is in good standing. For details, see: https://help.aliyun.com/zh/model-studio/error-code#overdue-payment

Summary by CodeRabbit

  • New Features
    • Added cooldown overlays for channels and API-key slots, with separate durations for business vs temporary upstream failures.
    • Added admin endpoint to clear cooldowns for a channel (including per-key entries).
    • Added error classification using configurable business status-code ranges and keyword matching; tuned channel selection to avoid repeated failures.
    • Added automatic cooldown garbage-collection to remove expired overlays.
  • Bug Fixes
    • Improved multi-key error/retry behavior so cooldown targets the correct key slot (no fallback-to-default key).
    • Refined “no available key” outcomes and retry behavior when cooldown eliminates all candidates.
  • Tests
    • Added unit/regression coverage for cooldown marking, selection filtering, and key scoping.

phanen added 2 commits June 20, 2026 13:17
Problem: a 400 'overdue-payment' from a billing-class upstream
used to permanently disable the channel via the legacy auto-ban
path, requiring the operator to manually re-enable it. For
multi-key pools where keys share one account, the whole
pool was locked out even after the operator fixed the
upstream. The old binary 'disable the whole channel on
first failure' was the wrong tool — recovery needs to be
automatic for transient upstream faults and scoped to a
single key for shared-account business errors.

Solution: introduce an in-memory cooldown overlay consulted
by the channel selector and the per-key distributor. Errors
are classified into 'business' (400/402/403/422/451 plus
quota / overdue keywords, default 3600s) and 'temp' (5xx /
408 / 429, default 30s). Both auto-recover when the deadline
elapses. The legacy `ShouldDisableChannel` / `DisableChannel`
path stays as the fallback when the new knobs are set to
0 or negative, so operators who want the old behaviour
still have it. Per-key cooldown is the default for every
channel (single-key and multi-key symmetric); the selector
and `GetNextEnabledKey` consult the overlay together so
no path hands a cooldowned key to the distributor. The
distributor always writes the key index into the request
context (single-key writes 0, multi-key writes the picked
slot) and the controller reads it verbatim — the previous
hard-coded `keyIdx=0` fallback caused the 'key #0 forever'
regression on a 10-key pool.
Problem: the channel cooldown overlay is process-local state.
Once a 400 business error triggers a 3600s cooldown, the only
ways to recover are to wait for the deadline, restart the
process, or manually re-enable the channel in the admin UI.
For a long business cooldown (overdue-payment, quota
exhausted, etc.) all three are unacceptable.

Solution: expose the existing `model.ClearChannelCooldown`
helper as POST /api/channel/:id/clear_cooldown under AdminAuth.
The handler removes every overlay entry for the channel —
both the channel-level entry and every per-key entry whose
channelId matches — and returns the (channelRemoved,
keyRemoved) counts. Other channels are untouched. No audit
log because the action is non-destructive (in-memory skip
only) and a user-cache lookup at handler time would force
the test layer to mock redis.
@coderabbitai

coderabbitai Bot commented Jun 20, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a831514d-5500-4f49-be3c-a6a338161604

📥 Commits

Reviewing files that changed from the base of the PR and between a05e42c and bea9ad6.

📒 Files selected for processing (10)
  • controller/channel_cooldown_test.go
  • controller/relay.go
  • model/channel_cache.go
  • model/channel_cooldown.go
  • model/option.go
  • relay/compatible_handler.go
  • service/channel_classifier.go
  • service/channel_classifier_test.go
  • setting/operation_setting/operation_setting.go
  • setting/operation_setting/status_code_ranges_test.go
💤 Files with no reviewable changes (1)
  • model/channel_cooldown.go
🚧 Files skipped from review as they are similar to previous changes (6)
  • model/option.go
  • service/channel_classifier.go
  • service/channel_classifier_test.go
  • controller/relay.go
  • controller/channel_cooldown_test.go
  • model/channel_cache.go

Walkthrough

This PR implements an in-memory channel cooldown overlay as an alternative to the hard auto-disable mechanism. When upstream errors occur, channels and individual API keys are placed into a timed cooldown instead of being disabled. The channel selectors skip cooled channels/keys during routing, and a new admin API endpoint allows clearing cooldowns manually.

Changes

Channel Cooldown Overlay

Layer / File(s) Summary
Cooldown configuration and error classification settings
setting/operation_setting/operation_setting.go, setting/operation_setting/status_code_ranges.go, setting/operation_setting/status_code_ranges_test.go, model/option.go
Adds TempErrorCooldownSeconds, BusinessErrorCooldownSeconds, BusinessErrorKeywords, and BusinessErrorStatusCodeRanges configuration variables with serialization helpers; wires them into InitOptionMap and updateOptionMap for dynamic option synchronization.
ChannelErrorKind classifier
service/channel_classifier.go, service/channel_classifier_test.go
New ClassifyChannelError function categorizes NewAPIError into unknown, business, or temp kinds using configured HTTP status code ranges and keyword lists; includes full test coverage for classification logic and precedence rules.
In-memory cooldown store
model/channel_cooldown.go, model/channel_cooldown_test.go, model/ability_cooldown_test.go
Two sync-protected maps (channel-level keyed by int, per-key keyed by packed uint64) with MarkCooldown/MarkKeyCooldown (longer-wins semantics), read helpers, periodic GC (StartCooldownGC), admin reset functions (ClearChannelCooldown, ClearAllCooldowns), and concurrency stress tests.
Cooldown-aware channel and key selection
model/channel.go, model/channel_cache.go, model/ability.go, model/channel_cache_cooldown_test.go
GetNextEnabledKey rejects cooled keys for both single-key and multi-key channels; GetRandomSatisfiedChannel snapshots cooldown IDs per call and skips channels/keys in cooldown, returning (nil, nil) when a priority bucket is fully cooled; DB-path GetChannel also filters by cooldown with full test coverage.
ChannelError KeyIndex and relay rewiring
types/channel_error.go, middleware/distributor.go, controller/relay.go, controller/channel_cooldown.go, controller/channel_cooldown_test.go
Adds optional KeyIndex *int to ChannelError; middleware now always writes ChannelMultiKeyIndex context key for both single and multi-key channels; buildChannelErrorFromContext populates KeyIndex from context; processChannelError delegates to handleChannelErrorCooldown, which classifies errors and marks cooldown at key or channel granularity, with legacy auto-disable fallback.
Admin API and startup wiring
controller/channel.go, router/api-router.go, main.go
Registers POST /:id/clear_cooldown route wired to ClearChannelCooldownHandler; starts model.StartCooldownGC(30) background GC task at application startup.
Debug logging cleanup
relay/compatible_handler.go
Disables debug log statement that printed request body content.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant Relay as Relay/RelayTask
  participant Distributor as SetupContextForSelectedChannel
  participant buildCtx as buildChannelErrorFromContext
  participant processErr as processChannelError
  participant cooldownHandler as handleChannelErrorCooldown
  participant classifier as ClassifyChannelError
  participant store as CooldownStore
  participant selector as GetRandomSatisfiedChannel

  Client->>Relay: API request
  Relay->>Distributor: select channel
  Distributor->>store: InCooldownIDs / InCooldownKeyIndices
  store-->>Distributor: cooled IDs
  Distributor->>Distributor: skip cooled channels and keys
  Distributor-->>Relay: channel + ChannelMultiKeyIndex in context

  Relay->>Relay: upstream call fails
  Relay->>buildCtx: channel + gin.Context
  buildCtx-->>Relay: ChannelError{KeyIndex from ctx}
  Relay->>processErr: ChannelError, err

  processErr->>cooldownHandler: ChannelError, err
  cooldownHandler->>classifier: NewAPIError
  classifier-->>cooldownHandler: ChannelErrorKind

  alt business/temp + cooldown > 0
    cooldownHandler->>store: MarkKeyCooldown or MarkCooldown
  else business cooldown == 0 or unknown
    cooldownHandler->>cooldownHandler: disableChannelAsync
  end

  Relay->>selector: retry next channel
  selector->>store: snapshot InCooldownIDs
  store-->>selector: cooled set
  selector-->>Relay: next available channel
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • QuantumNous/new-api#1360: Introduced ContextKeyChannelMultiKeyIndex and per-key types.ChannelError fields that this PR's buildChannelErrorFromContext and KeyIndex-scoped cooldown logic directly depend on.
  • QuantumNous/new-api#1775: Both PRs modify controller/relay.go's channel error handling—specifically the async "disable channel on error" flow (moving/guarding the ShouldDisableChannel && AutoBan goroutine logic in #1775 vs. replacing that path with the new cooldown-aware handler in the main PR).
  • QuantumNous/new-api#2647: The main PR's new cooldown/disable path still calls service.ShouldDisableChannel as a fallback, and the retrieved PR changes ShouldDisableChannel via status-code auto-disable configuration—so the retrieved PR directly affects when the main PR will schedule channel disable.

Suggested reviewers

  • seefs001

Poem

🐇 When channels stumble and APIs cry,
No ban — just a cool-down beneath the digital sky.
Mark the keys, let the ticker run high,
The rabbit's overlay keeps relays nearby.
Skip and retry till the moment flies by! ⏰✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 48.76% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title 'feat: per-channel and per-key cooldown overlay' accurately and concisely describes the main feature being added—a cooldown overlay system supporting both channel-level and per-key cooldown functionality.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (5)
setting/operation_setting/status_code_ranges_test.go (1)

100-108: ⚡ Quick win

Use assert for non-fatal value checks in these new tests.

These checks are non-setup expectations, so they should use assert.* while keeping require.* for setup/fatal preconditions.

As per coding guidelines, "**/*_test.go: Backend tests MUST use github.com/stretchr/testify/require for setup and fatal assertions, and github.com/stretchr/testify/assert for non-fatal value checks."

Also applies to: 116-121

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@setting/operation_setting/status_code_ranges_test.go` around lines 100 - 108,
In the test function in status_code_ranges_test.go, replace require.Truef with
assert.Truef and require.Falsef with assert.Falsef in both test loops that
validate the IsBusinessErrorStatusCode function behavior. These are non-fatal
value checks that should use assert rather than require, which should only be
used for setup and fatal preconditions.

Source: Coding guidelines

service/channel_classifier_test.go (1)

25-26: ⚡ Quick win

Switch non-fatal expectations to assert in these tests.

Most assertions here are value checks and should be assert.*; reserve require.* for setup/fatal guards.

As per coding guidelines, "**/*_test.go: Backend tests MUST use github.com/stretchr/testify/require for setup and fatal assertions, and github.com/stretchr/testify/assert for non-fatal value checks."

Also applies to: 33-35, 60-63, 73-75, 82-83, 90-91, 102-103, 106-112, 129-130

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@service/channel_classifier_test.go` around lines 25 - 26, In the
channel_classifier_test.go file, replace all non-fatal value check assertions
with assert functions instead of require functions. Specifically, change
require.Equal calls to assert.Equal for the value comparison of
ClassifyChannelError results and other non-fatal checks (at the lines mentioned
in the comment: 25-26, 33-35, 60-63, 73-75, 82-83, 90-91, 102-103, 106-112,
129-130). Reserve require.* only for setup operations and assertions that should
be fatal to the test execution.

Source: Coding guidelines

model/channel_cooldown.go (1)

237-250: 💤 Low value

Unused function: gcExpiredKeyCooldown is never called.

gcAllExpired handles both channel and key cooldown maps atomically, making this function dead code. Consider removing it to reduce maintenance surface.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@model/channel_cooldown.go` around lines 237 - 250, The function
`gcExpiredKeyCooldown` is never called and represents dead code that should be
removed. Since `gcAllExpired` already handles garbage collection of both channel
and key cooldown maps atomically, the separate `gcExpiredKeyCooldown` function
is redundant. Remove the entire `gcExpiredKeyCooldown` function definition to
reduce code maintenance burden and avoid confusion about which garbage
collection function should be used.
model/channel_cache.go (1)

130-162: 💤 Low value

Inconsistent indentation inside single-channel fast path.

Lines 144-159 appear to have incorrect indentation relative to the enclosing if channel, ok := ... block at line 137. The closing brace at line 160 aligns with line 137, but the inner code doesn't follow consistent indentation.

This doesn't affect correctness but impacts readability.

🔧 Suggested indentation fix
 	if len(channels) == 1 {
 		if channel, ok := channelsIDM[channels[0]]; ok {
 			// Single-channel fast path: respect cooldown so we don't keep
 			// hammering a temporarily broken channel. Returning (nil, nil)
 			// signals the caller to retry / move groups / fail. The debug
 			// log is the operator's confirmation that the filter is
 			// actually firing — without it, the symptom is just "the
 			// user got 400 anyway" and the cause is invisible.
-		if _, skip := cooldown[channel.Id]; skip {
-			logger.LogInfo(nil, fmt.Sprintf("selector skipped channel #%d: in cooldown", channel.Id))
-			return nil, nil
-		}
-		// Per-key cooldown overlay: skip a single-channel group
-		// whose only served key is in cooldown. Without this, the
-		// fast path hands the channel to the distributor, the
-		// distributor's GetNextEnabledKey returns NoAvailableKey,
-		// and the controller's retry loop picks the same channel
-		// again — an infinite no-channel loop that surfaces as
-		// repeated upstream 400s.
-		if !channelHasAnyAvailableKey(channel, time.Now()) {
-			logger.LogInfo(nil, fmt.Sprintf("selector skipped channel #%d: every key in cooldown", channel.Id))
-			return nil, nil
-		}
-		return channel, nil
+			if _, skip := cooldown[channel.Id]; skip {
+				logger.LogInfo(nil, fmt.Sprintf("selector skipped channel #%d: in cooldown", channel.Id))
+				return nil, nil
+			}
+			// Per-key cooldown overlay: skip a single-channel group
+			// whose only served key is in cooldown. Without this, the
+			// fast path hands the channel to the distributor, the
+			// distributor's GetNextEnabledKey returns NoAvailableKey,
+			// and the controller's retry loop picks the same channel
+			// again — an infinite no-channel loop that surfaces as
+			// repeated upstream 400s.
+			if !channelHasAnyAvailableKey(channel, time.Now()) {
+				logger.LogInfo(nil, fmt.Sprintf("selector skipped channel #%d: every key in cooldown", channel.Id))
+				return nil, nil
+			}
+			return channel, nil
 		}
 		return nil, fmt.Errorf("数据库一致性错误,渠道# %d 不存在,请联系管理员修复", channels[0])
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@model/channel_cache.go` around lines 130 - 162, The code block inside the
single-channel fast path starting with the `if channel, ok :=
channelsIDM[channels[0]]; ok {` check has inconsistent indentation. All
statements within this block, including the cooldown check with logger.LogInfo
calls, the channelHasAnyAvailableKey call, and the return statements, need to be
indented one level deeper to properly align as part of the if block. The closing
brace of the if channel block should remain aligned with the opening if
statement, but all content between the opening and closing braces must be
consistently indented inward.
controller/relay.go (1)

666-684: Unintended zero value for KeyIndex when context key is absent.

GetContextKeyInt() returns 0 (Go's zero value) when the context key is missing, so the idx >= 0 check on line 681 will incorrectly set KeyIndex = &0 instead of leaving it nil as intended by the comment. Since the distributor always writes the index in normal paths, consider either updating the comment to clarify this edge case behavior or adding an explicit check to distinguish between "key absent" and "key set to zero".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@controller/relay.go` around lines 666 - 684, The issue in the
buildChannelErrorFromContext function is that the condition `if idx >= 0` cannot
distinguish between a missing context key (which should leave KeyIndex nil) and
a context key with a zero value (which should set KeyIndex to 0). Since
GetContextKeyInt returns 0 as Go's zero value when the key is missing, the
current logic incorrectly sets KeyIndex to &0 when the context key is absent.
Modify the code to explicitly check whether the context key actually exists
before treating it as a valid index, either by using a method that returns both
value and a boolean (ok pattern), or by verifying the raw context presence
before calling GetContextKeyInt.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@model/option.go`:
- Around line 569-572: The strconv.Atoi() calls in the
"TempErrorCooldownSeconds" and "BusinessErrorCooldownSeconds" case statements
ignore errors and silently coerce invalid non-numeric values to 0, which
disables cooldown behavior. Instead of discarding the error with underscore,
capture the error returned by strconv.Atoi() and return a parse error when
conversion fails, preventing invalid values from being accepted and maintaining
the existing runtime value when parsing fails.

In `@setting/operation_setting/operation_setting.go`:
- Around line 80-88: The BusinessErrorKeywordsFromString function clears the
global BusinessErrorKeywords slice immediately and then repopulates it
incrementally, allowing concurrent readers to observe an empty or partial state.
Instead, build the new keywords into a local slice first, then atomically swap
it with the global BusinessErrorKeywords under a shared lock. Additionally,
ensure all reads of BusinessErrorKeywords also use the same synchronization lock
to prevent concurrent classifiers from seeing inconsistent states during
updates.

---

Nitpick comments:
In `@controller/relay.go`:
- Around line 666-684: The issue in the buildChannelErrorFromContext function is
that the condition `if idx >= 0` cannot distinguish between a missing context
key (which should leave KeyIndex nil) and a context key with a zero value (which
should set KeyIndex to 0). Since GetContextKeyInt returns 0 as Go's zero value
when the key is missing, the current logic incorrectly sets KeyIndex to &0 when
the context key is absent. Modify the code to explicitly check whether the
context key actually exists before treating it as a valid index, either by using
a method that returns both value and a boolean (ok pattern), or by verifying the
raw context presence before calling GetContextKeyInt.

In `@model/channel_cache.go`:
- Around line 130-162: The code block inside the single-channel fast path
starting with the `if channel, ok := channelsIDM[channels[0]]; ok {` check has
inconsistent indentation. All statements within this block, including the
cooldown check with logger.LogInfo calls, the channelHasAnyAvailableKey call,
and the return statements, need to be indented one level deeper to properly
align as part of the if block. The closing brace of the if channel block should
remain aligned with the opening if statement, but all content between the
opening and closing braces must be consistently indented inward.

In `@model/channel_cooldown.go`:
- Around line 237-250: The function `gcExpiredKeyCooldown` is never called and
represents dead code that should be removed. Since `gcAllExpired` already
handles garbage collection of both channel and key cooldown maps atomically, the
separate `gcExpiredKeyCooldown` function is redundant. Remove the entire
`gcExpiredKeyCooldown` function definition to reduce code maintenance burden and
avoid confusion about which garbage collection function should be used.

In `@service/channel_classifier_test.go`:
- Around line 25-26: In the channel_classifier_test.go file, replace all
non-fatal value check assertions with assert functions instead of require
functions. Specifically, change require.Equal calls to assert.Equal for the
value comparison of ClassifyChannelError results and other non-fatal checks (at
the lines mentioned in the comment: 25-26, 33-35, 60-63, 73-75, 82-83, 90-91,
102-103, 106-112, 129-130). Reserve require.* only for setup operations and
assertions that should be fatal to the test execution.

In `@setting/operation_setting/status_code_ranges_test.go`:
- Around line 100-108: In the test function in status_code_ranges_test.go,
replace require.Truef with assert.Truef and require.Falsef with assert.Falsef in
both test loops that validate the IsBusinessErrorStatusCode function behavior.
These are non-fatal value checks that should use assert rather than require,
which should only be used for setup and fatal preconditions.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4c80134f-2863-4ab3-995a-b2a93a6bfba8

📥 Commits

Reviewing files that changed from the base of the PR and between dfcb74b and a05e42c.

📒 Files selected for processing (21)
  • controller/channel.go
  • controller/channel_cooldown.go
  • controller/channel_cooldown_test.go
  • controller/relay.go
  • main.go
  • middleware/distributor.go
  • model/ability.go
  • model/ability_cooldown_test.go
  • model/channel.go
  • model/channel_cache.go
  • model/channel_cache_cooldown_test.go
  • model/channel_cooldown.go
  • model/channel_cooldown_test.go
  • model/option.go
  • router/api-router.go
  • service/channel_classifier.go
  • service/channel_classifier_test.go
  • setting/operation_setting/operation_setting.go
  • setting/operation_setting/status_code_ranges.go
  • setting/operation_setting/status_code_ranges_test.go
  • types/channel_error.go

Comment thread model/option.go Outdated
Comment thread setting/operation_setting/operation_setting.go
Problem: CodeRabbit flagged several issues against the
cooldown overlay PR — silent error swallowing on bad config
values, torn reads on the global keyword slice, dead code,
and value-check assertions that should be non-fatal. The
reviewer also caught a hard-to-spot regression in the
buildChannelErrorFromContext helper that, after the recent
rewrite, would still set KeyIndex=0 when the context key was
absent (the same shape the rewrite was supposed to fix).

Solution:

  model/option.go — capture strconv.Atoi errors for
  TempErrorCooldownSeconds and BusinessErrorCooldownSeconds
  and return them, so a typo in the system settings no longer
  silently coerces behaviour to 0 (which disables cooldown).
  Add the missing fmt import.

  setting/operation_setting/operation_setting.go — guard
  BusinessErrorKeywords behind a sync.RWMutex. The previous
  FromString cleared the global slice and then repopulated
  it, leaving concurrent classifier readers with an empty or
  partial list mid-update (silent misclassification that
  changes cooldown behaviour). The new path builds the
  replacement into a local slice first, then swaps under the
  write lock. ToString and the classifier consult the lock
  for read-side consistency. SetBusinessErrorKeywordsForTest
  is added so the classifier test can pin the keyword list
  without bypassing the lock.

  service/channel_classifier.go — read keywords via the
  exported BusinessErrorKeywordsSnapshot, which copies under
  the read lock so the classifier iterates a stable slice
  without holding the lock for the (potentially many)
  Contains() calls.

  controller/relay.go::buildChannelErrorFromContext — probe
  the context with c.Get (which returns ok) rather than
  c.GetInt, so a missing key no longer looks identical to
  key=0 and silently re-introduces the hard-coded-zero bug
  the rewrite was supposed to fix.

  model/channel_cooldown.go — remove the dead
  gcExpiredKeyCooldown helper. gcAllExpired already sweeps
  both maps atomically.

  model/channel_cache.go — fix the indentation inside the
  single-channel fast path so the body aligns with the
  enclosing if block.

  service/channel_classifier_test.go — switch non-fatal
  value checks from require.* to assert.* per the project
  guideline (require for setup, assert for value). Update
  withBusinessKeywords to use the new atomic setter.

  setting/operation_setting/status_code_ranges_test.go —
  same require/assert split. Also fix two assertions the
  rewrite inadvertently changed: 201, 301, 400 are NOT in
  the default retry ranges (the comment on the ranges says
  'no retry for 2xx' and explicitly excludes 400), so the
  original assertions (401, 429, 500, 599) are the correct
  ones.

  controller/channel_cooldown_test.go — TestProcessChannelError_PerKeyCooldown
  now sets ContextKeyChannelMultiKeyIndex=0 to mirror the
  real distributor path; without it the helper (which now
  distinguishes missing from zero) leaves KeyIndex nil and
  the test falls through to whole-channel cooldown. The
  comment on TestHandleChannelErrorCooldown_SingleKeyUsesKeyNotChannel
  is updated to reflect that the distributor is what writes
  KeyIndex=0, not the helper.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant