feat: per-channel and per-key cooldown overlay#5618
Conversation
Problem: a 400 'overdue-payment' from a billing-class upstream used to permanently disable the channel via the legacy auto-ban path, requiring the operator to manually re-enable it. For multi-key pools where keys share one account, the whole pool was locked out even after the operator fixed the upstream. The old binary 'disable the whole channel on first failure' was the wrong tool — recovery needs to be automatic for transient upstream faults and scoped to a single key for shared-account business errors. Solution: introduce an in-memory cooldown overlay consulted by the channel selector and the per-key distributor. Errors are classified into 'business' (400/402/403/422/451 plus quota / overdue keywords, default 3600s) and 'temp' (5xx / 408 / 429, default 30s). Both auto-recover when the deadline elapses. The legacy `ShouldDisableChannel` / `DisableChannel` path stays as the fallback when the new knobs are set to 0 or negative, so operators who want the old behaviour still have it. Per-key cooldown is the default for every channel (single-key and multi-key symmetric); the selector and `GetNextEnabledKey` consult the overlay together so no path hands a cooldowned key to the distributor. The distributor always writes the key index into the request context (single-key writes 0, multi-key writes the picked slot) and the controller reads it verbatim — the previous hard-coded `keyIdx=0` fallback caused the 'key #0 forever' regression on a 10-key pool.
Problem: the channel cooldown overlay is process-local state. Once a 400 business error triggers a 3600s cooldown, the only ways to recover are to wait for the deadline, restart the process, or manually re-enable the channel in the admin UI. For a long business cooldown (overdue-payment, quota exhausted, etc.) all three are unacceptable. Solution: expose the existing `model.ClearChannelCooldown` helper as POST /api/channel/:id/clear_cooldown under AdminAuth. The handler removes every overlay entry for the channel — both the channel-level entry and every per-key entry whose channelId matches — and returns the (channelRemoved, keyRemoved) counts. Other channels are untouched. No audit log because the action is non-destructive (in-memory skip only) and a user-cache lookup at handler time would force the test layer to mock redis.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (10)
💤 Files with no reviewable changes (1)
🚧 Files skipped from review as they are similar to previous changes (6)
WalkthroughThis PR implements an in-memory channel cooldown overlay as an alternative to the hard auto-disable mechanism. When upstream errors occur, channels and individual API keys are placed into a timed cooldown instead of being disabled. The channel selectors skip cooled channels/keys during routing, and a new admin API endpoint allows clearing cooldowns manually. ChangesChannel Cooldown Overlay
Sequence Diagram(s)sequenceDiagram
participant Client
participant Relay as Relay/RelayTask
participant Distributor as SetupContextForSelectedChannel
participant buildCtx as buildChannelErrorFromContext
participant processErr as processChannelError
participant cooldownHandler as handleChannelErrorCooldown
participant classifier as ClassifyChannelError
participant store as CooldownStore
participant selector as GetRandomSatisfiedChannel
Client->>Relay: API request
Relay->>Distributor: select channel
Distributor->>store: InCooldownIDs / InCooldownKeyIndices
store-->>Distributor: cooled IDs
Distributor->>Distributor: skip cooled channels and keys
Distributor-->>Relay: channel + ChannelMultiKeyIndex in context
Relay->>Relay: upstream call fails
Relay->>buildCtx: channel + gin.Context
buildCtx-->>Relay: ChannelError{KeyIndex from ctx}
Relay->>processErr: ChannelError, err
processErr->>cooldownHandler: ChannelError, err
cooldownHandler->>classifier: NewAPIError
classifier-->>cooldownHandler: ChannelErrorKind
alt business/temp + cooldown > 0
cooldownHandler->>store: MarkKeyCooldown or MarkCooldown
else business cooldown == 0 or unknown
cooldownHandler->>cooldownHandler: disableChannelAsync
end
Relay->>selector: retry next channel
selector->>store: snapshot InCooldownIDs
store-->>selector: cooled set
selector-->>Relay: next available channel
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (5)
setting/operation_setting/status_code_ranges_test.go (1)
100-108: ⚡ Quick winUse
assertfor non-fatal value checks in these new tests.These checks are non-setup expectations, so they should use
assert.*while keepingrequire.*for setup/fatal preconditions.As per coding guidelines, "
**/*_test.go: Backend tests MUST usegithub.com/stretchr/testify/requirefor setup and fatal assertions, andgithub.com/stretchr/testify/assertfor non-fatal value checks."Also applies to: 116-121
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@setting/operation_setting/status_code_ranges_test.go` around lines 100 - 108, In the test function in status_code_ranges_test.go, replace require.Truef with assert.Truef and require.Falsef with assert.Falsef in both test loops that validate the IsBusinessErrorStatusCode function behavior. These are non-fatal value checks that should use assert rather than require, which should only be used for setup and fatal preconditions.Source: Coding guidelines
service/channel_classifier_test.go (1)
25-26: ⚡ Quick winSwitch non-fatal expectations to
assertin these tests.Most assertions here are value checks and should be
assert.*; reserverequire.*for setup/fatal guards.As per coding guidelines, "
**/*_test.go: Backend tests MUST usegithub.com/stretchr/testify/requirefor setup and fatal assertions, andgithub.com/stretchr/testify/assertfor non-fatal value checks."Also applies to: 33-35, 60-63, 73-75, 82-83, 90-91, 102-103, 106-112, 129-130
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@service/channel_classifier_test.go` around lines 25 - 26, In the channel_classifier_test.go file, replace all non-fatal value check assertions with assert functions instead of require functions. Specifically, change require.Equal calls to assert.Equal for the value comparison of ClassifyChannelError results and other non-fatal checks (at the lines mentioned in the comment: 25-26, 33-35, 60-63, 73-75, 82-83, 90-91, 102-103, 106-112, 129-130). Reserve require.* only for setup operations and assertions that should be fatal to the test execution.Source: Coding guidelines
model/channel_cooldown.go (1)
237-250: 💤 Low valueUnused function:
gcExpiredKeyCooldownis never called.
gcAllExpiredhandles both channel and key cooldown maps atomically, making this function dead code. Consider removing it to reduce maintenance surface.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@model/channel_cooldown.go` around lines 237 - 250, The function `gcExpiredKeyCooldown` is never called and represents dead code that should be removed. Since `gcAllExpired` already handles garbage collection of both channel and key cooldown maps atomically, the separate `gcExpiredKeyCooldown` function is redundant. Remove the entire `gcExpiredKeyCooldown` function definition to reduce code maintenance burden and avoid confusion about which garbage collection function should be used.model/channel_cache.go (1)
130-162: 💤 Low valueInconsistent indentation inside single-channel fast path.
Lines 144-159 appear to have incorrect indentation relative to the enclosing
if channel, ok := ...block at line 137. The closing brace at line 160 aligns with line 137, but the inner code doesn't follow consistent indentation.This doesn't affect correctness but impacts readability.
🔧 Suggested indentation fix
if len(channels) == 1 { if channel, ok := channelsIDM[channels[0]]; ok { // Single-channel fast path: respect cooldown so we don't keep // hammering a temporarily broken channel. Returning (nil, nil) // signals the caller to retry / move groups / fail. The debug // log is the operator's confirmation that the filter is // actually firing — without it, the symptom is just "the // user got 400 anyway" and the cause is invisible. - if _, skip := cooldown[channel.Id]; skip { - logger.LogInfo(nil, fmt.Sprintf("selector skipped channel #%d: in cooldown", channel.Id)) - return nil, nil - } - // Per-key cooldown overlay: skip a single-channel group - // whose only served key is in cooldown. Without this, the - // fast path hands the channel to the distributor, the - // distributor's GetNextEnabledKey returns NoAvailableKey, - // and the controller's retry loop picks the same channel - // again — an infinite no-channel loop that surfaces as - // repeated upstream 400s. - if !channelHasAnyAvailableKey(channel, time.Now()) { - logger.LogInfo(nil, fmt.Sprintf("selector skipped channel #%d: every key in cooldown", channel.Id)) - return nil, nil - } - return channel, nil + if _, skip := cooldown[channel.Id]; skip { + logger.LogInfo(nil, fmt.Sprintf("selector skipped channel #%d: in cooldown", channel.Id)) + return nil, nil + } + // Per-key cooldown overlay: skip a single-channel group + // whose only served key is in cooldown. Without this, the + // fast path hands the channel to the distributor, the + // distributor's GetNextEnabledKey returns NoAvailableKey, + // and the controller's retry loop picks the same channel + // again — an infinite no-channel loop that surfaces as + // repeated upstream 400s. + if !channelHasAnyAvailableKey(channel, time.Now()) { + logger.LogInfo(nil, fmt.Sprintf("selector skipped channel #%d: every key in cooldown", channel.Id)) + return nil, nil + } + return channel, nil } return nil, fmt.Errorf("数据库一致性错误,渠道# %d 不存在,请联系管理员修复", channels[0]) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@model/channel_cache.go` around lines 130 - 162, The code block inside the single-channel fast path starting with the `if channel, ok := channelsIDM[channels[0]]; ok {` check has inconsistent indentation. All statements within this block, including the cooldown check with logger.LogInfo calls, the channelHasAnyAvailableKey call, and the return statements, need to be indented one level deeper to properly align as part of the if block. The closing brace of the if channel block should remain aligned with the opening if statement, but all content between the opening and closing braces must be consistently indented inward.controller/relay.go (1)
666-684: Unintended zero value forKeyIndexwhen context key is absent.
GetContextKeyInt()returns0(Go's zero value) when the context key is missing, so theidx >= 0check on line 681 will incorrectly setKeyIndex = &0instead of leaving itnilas intended by the comment. Since the distributor always writes the index in normal paths, consider either updating the comment to clarify this edge case behavior or adding an explicit check to distinguish between "key absent" and "key set to zero".🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@controller/relay.go` around lines 666 - 684, The issue in the buildChannelErrorFromContext function is that the condition `if idx >= 0` cannot distinguish between a missing context key (which should leave KeyIndex nil) and a context key with a zero value (which should set KeyIndex to 0). Since GetContextKeyInt returns 0 as Go's zero value when the key is missing, the current logic incorrectly sets KeyIndex to &0 when the context key is absent. Modify the code to explicitly check whether the context key actually exists before treating it as a valid index, either by using a method that returns both value and a boolean (ok pattern), or by verifying the raw context presence before calling GetContextKeyInt.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@model/option.go`:
- Around line 569-572: The strconv.Atoi() calls in the
"TempErrorCooldownSeconds" and "BusinessErrorCooldownSeconds" case statements
ignore errors and silently coerce invalid non-numeric values to 0, which
disables cooldown behavior. Instead of discarding the error with underscore,
capture the error returned by strconv.Atoi() and return a parse error when
conversion fails, preventing invalid values from being accepted and maintaining
the existing runtime value when parsing fails.
In `@setting/operation_setting/operation_setting.go`:
- Around line 80-88: The BusinessErrorKeywordsFromString function clears the
global BusinessErrorKeywords slice immediately and then repopulates it
incrementally, allowing concurrent readers to observe an empty or partial state.
Instead, build the new keywords into a local slice first, then atomically swap
it with the global BusinessErrorKeywords under a shared lock. Additionally,
ensure all reads of BusinessErrorKeywords also use the same synchronization lock
to prevent concurrent classifiers from seeing inconsistent states during
updates.
---
Nitpick comments:
In `@controller/relay.go`:
- Around line 666-684: The issue in the buildChannelErrorFromContext function is
that the condition `if idx >= 0` cannot distinguish between a missing context
key (which should leave KeyIndex nil) and a context key with a zero value (which
should set KeyIndex to 0). Since GetContextKeyInt returns 0 as Go's zero value
when the key is missing, the current logic incorrectly sets KeyIndex to &0 when
the context key is absent. Modify the code to explicitly check whether the
context key actually exists before treating it as a valid index, either by using
a method that returns both value and a boolean (ok pattern), or by verifying the
raw context presence before calling GetContextKeyInt.
In `@model/channel_cache.go`:
- Around line 130-162: The code block inside the single-channel fast path
starting with the `if channel, ok := channelsIDM[channels[0]]; ok {` check has
inconsistent indentation. All statements within this block, including the
cooldown check with logger.LogInfo calls, the channelHasAnyAvailableKey call,
and the return statements, need to be indented one level deeper to properly
align as part of the if block. The closing brace of the if channel block should
remain aligned with the opening if statement, but all content between the
opening and closing braces must be consistently indented inward.
In `@model/channel_cooldown.go`:
- Around line 237-250: The function `gcExpiredKeyCooldown` is never called and
represents dead code that should be removed. Since `gcAllExpired` already
handles garbage collection of both channel and key cooldown maps atomically, the
separate `gcExpiredKeyCooldown` function is redundant. Remove the entire
`gcExpiredKeyCooldown` function definition to reduce code maintenance burden and
avoid confusion about which garbage collection function should be used.
In `@service/channel_classifier_test.go`:
- Around line 25-26: In the channel_classifier_test.go file, replace all
non-fatal value check assertions with assert functions instead of require
functions. Specifically, change require.Equal calls to assert.Equal for the
value comparison of ClassifyChannelError results and other non-fatal checks (at
the lines mentioned in the comment: 25-26, 33-35, 60-63, 73-75, 82-83, 90-91,
102-103, 106-112, 129-130). Reserve require.* only for setup operations and
assertions that should be fatal to the test execution.
In `@setting/operation_setting/status_code_ranges_test.go`:
- Around line 100-108: In the test function in status_code_ranges_test.go,
replace require.Truef with assert.Truef and require.Falsef with assert.Falsef in
both test loops that validate the IsBusinessErrorStatusCode function behavior.
These are non-fatal value checks that should use assert rather than require,
which should only be used for setup and fatal preconditions.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 4c80134f-2863-4ab3-995a-b2a93a6bfba8
📒 Files selected for processing (21)
controller/channel.gocontroller/channel_cooldown.gocontroller/channel_cooldown_test.gocontroller/relay.gomain.gomiddleware/distributor.gomodel/ability.gomodel/ability_cooldown_test.gomodel/channel.gomodel/channel_cache.gomodel/channel_cache_cooldown_test.gomodel/channel_cooldown.gomodel/channel_cooldown_test.gomodel/option.gorouter/api-router.goservice/channel_classifier.goservice/channel_classifier_test.gosetting/operation_setting/operation_setting.gosetting/operation_setting/status_code_ranges.gosetting/operation_setting/status_code_ranges_test.gotypes/channel_error.go
Problem: CodeRabbit flagged several issues against the cooldown overlay PR — silent error swallowing on bad config values, torn reads on the global keyword slice, dead code, and value-check assertions that should be non-fatal. The reviewer also caught a hard-to-spot regression in the buildChannelErrorFromContext helper that, after the recent rewrite, would still set KeyIndex=0 when the context key was absent (the same shape the rewrite was supposed to fix). Solution: model/option.go — capture strconv.Atoi errors for TempErrorCooldownSeconds and BusinessErrorCooldownSeconds and return them, so a typo in the system settings no longer silently coerces behaviour to 0 (which disables cooldown). Add the missing fmt import. setting/operation_setting/operation_setting.go — guard BusinessErrorKeywords behind a sync.RWMutex. The previous FromString cleared the global slice and then repopulated it, leaving concurrent classifier readers with an empty or partial list mid-update (silent misclassification that changes cooldown behaviour). The new path builds the replacement into a local slice first, then swaps under the write lock. ToString and the classifier consult the lock for read-side consistency. SetBusinessErrorKeywordsForTest is added so the classifier test can pin the keyword list without bypassing the lock. service/channel_classifier.go — read keywords via the exported BusinessErrorKeywordsSnapshot, which copies under the read lock so the classifier iterates a stable slice without holding the lock for the (potentially many) Contains() calls. controller/relay.go::buildChannelErrorFromContext — probe the context with c.Get (which returns ok) rather than c.GetInt, so a missing key no longer looks identical to key=0 and silently re-introduces the hard-coded-zero bug the rewrite was supposed to fix. model/channel_cooldown.go — remove the dead gcExpiredKeyCooldown helper. gcAllExpired already sweeps both maps atomically. model/channel_cache.go — fix the indentation inside the single-channel fast path so the body aligns with the enclosing if block. service/channel_classifier_test.go — switch non-fatal value checks from require.* to assert.* per the project guideline (require for setup, assert for value). Update withBusinessKeywords to use the new atomic setter. setting/operation_setting/status_code_ranges_test.go — same require/assert split. Also fix two assertions the rewrite inadvertently changed: 201, 301, 400 are NOT in the default retry ranges (the comment on the ranges says 'no retry for 2xx' and explicitly excludes 400), so the original assertions (401, 429, 500, 599) are the correct ones. controller/channel_cooldown_test.go — TestProcessChannelError_PerKeyCooldown now sets ContextKeyChannelMultiKeyIndex=0 to mirror the real distributor path; without it the helper (which now distinguishes missing from zero) leaves KeyIndex nil and the test falls through to whole-channel cooldown. The comment on TestHandleChannelErrorCooldown_SingleKeyUsesKeyNotChannel is updated to reflect that the distributor is what writes KeyIndex=0, not the helper.
Important
📝 变更描述 / Description
上游失败就冷却 channel/key 一段时间, 防止下游请求频繁失败
🚀 变更类型 / Type of change
🔗 关联任务 / Related Issue
ClosesRelated: 建议增加自动禁用渠道的冷却半开恢复机制,避免依赖定时全量测试 #5420✅ 提交前检查项 / Checklist
Bug fix,我已提交或关联对应 Issue,且不会将设计取舍、预期不一致或理解偏差直接归类为 bug。📸 运行证明 / Proof of Work
检测失败:
key #2 of channel #38 hit a business error, 然后 1 小时不再发送请求Summary by CodeRabbit