Skip to content

Commit e5223ce

Browse files
authored
fix(vault): sticky pool failover (no main account, stops failover flap and notice spam) (#48)
* docs(plans): sticky pool failover plan * fix(vault): sticky pool failover selection (no main account) * test(pool): sticky-hold, flap, advance-wrap, swap-survival and spam regressions * docs(pools): describe sticky active-member selection * perf(vault): read-mostly fast path and epoch-scoped sticky pointer in ResolveActive * test(vault): cover sticky fast/advance paths, epoch clobber, and SetCurrentMembers prune * perf(vault): single time.Now snapshot per ResolveActive and drop write lock before degrade logging * test(vault): use realistic epoch>=1 in sticky-pointer tests and add concurrent degrade no-double-unlock test
1 parent 6d5e680 commit e5223ce

5 files changed

Lines changed: 732 additions & 25 deletions

File tree

CLAUDE.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -191,7 +191,7 @@ Auto-failover on 429/401 is primary; `pool rotate` is an override.
191191

192192
- **Single chokepoint (I2):** every `binding.Credential` / `OAuthIndex.Has` / `extractInjectableSecret` / persist consumer on the HTTP/HTTPS OAuth path routes through `PoolResolver.ResolveActive` (`resolveInjectionTarget` for pass-1 header + pass-2 swap; `resolveOAuthResponseAttribution` for response/persist). `idx.Has` is always called with the resolved member, never the pool. Plain credentials pass through unchanged; SSH/mail are non-OAuth, out of scope.
193193
- **QUIC scope:** the HTTP/1.x/HTTP/2 MITM addon implements the full feature set (R1, R3, Phase 2). The HTTP/3/QUIC path (`QUICProxy.buildPhantomPairs`, binding-header injection in `quic.go`) is a request-side buffered swap with **no response-side OAuth interception**, but IS pool-aware on the request side: `QUICProxy.resolvePoolTarget` (via `NewQUICProxy`'s `poolResolver`) selects the active member's real secret and routes through `buildPooledOAuthPhantomPairs` so the access phantom is the same pool-stable JWT (R3 holds over QUIC). QUIC does **not** do R1 attribution or Phase 2 failover — the injected member is whatever the HTTP path / `pool rotate` last made active, and a QUIC-only 429/401 or refresh is not acted on. Deployments needing R1/auto-failover must route the pooled upstream over HTTP/HTTPS.
194-
- **Active-member selection:** healthy or expired-cooldown members first, by configured position; if all are in cooldown, the soonest-recovering is returned with a WARNING (degrade, never hard-fail). Recovery is lazy (evaluated in `ResolveActive`, no scheduler).
194+
- **Active-member selection (sticky, no "main" account):** `ResolveActive` is sticky. There is no position-0 "main": once selection settles on a member sluice keeps returning it while it is healthy, even if a lower-position member recovers from cooldown. A new active member is chosen only when the current one cools. The chosen member is the next eligible one by position **starting after the current member and wrapping** (advance forward, never snap back to position 0). The sticky current-active pointer lives on the shared swap-surviving `PoolHealth` under the same mutex as the cooldown map, so it survives `NewPoolResolverShared` regeneration and atomic pointer swaps (CRITICAL-1) and a stale resolver generation cannot clobber it, exactly like cooldowns. This kills the live flap where a 60s `RateLimitCooldown` lapse re-selected an upstream-exhausted account and respammed `cred_failover` plus a Telegram notice every cooldown window. If every member is cooling, the existing degrade applies (operator-parked-but-healthy `ManualRotateReason` first, else soonest-recovering, with a WARNING, never hard-fail) and the sticky pointer is left untouched so a recovery advances forward rather than snapping to position 0. `sluice pool rotate` still works: it parks the active member, so the next `ResolveActive` advances to the next member and then stays there (no snap-back). Recovery is lazy (evaluated in `ResolveActive`, no scheduler). A selectable position-priority-vs-sticky strategy mode is a possible follow-up and is out of scope here.
195195
- **R1 refresh-token attribution / fail-closed:** when pass-2 swaps `SLUICE_PHANTOM:<pool>.refresh`, sluice records `realRefreshToken -> member` in a short-TTL map; on the token-endpoint response it recovers the member by that real refresh token and persists to it (`persistAddonOAuthTokens(member, ...)`, singleflight `"persist:"+member`). The join key is the real **refresh** token — never the access token, connection, or `OAuthIndex.Match` (two pooled members share `auth.openai.com`'s token URL and collide). Unrecoverable -> WARNING + skip the write (rotating refresh tokens are single-use; a mis-attributed write bricks both accounts). **Plain-credential disambiguation:** a plain OAuth credential sharing a pool's token URL also tags its injected refresh token `realRefreshToken -> <plain name>` (plain path in `buildPhantomPairs`/`buildOAuthPhantomPairs`'s `onRefreshInject`, incl. split-host expansion); on response a recovered non-member (`PoolForMember == ""`) is attributed 1:1 to that plain credential, NOT fail-closed as pooled. The pooled fail-closed path applies only when recovery fails or resolves to an actual member; `poolForResponse` gates the same on an independent `flowInjected` tag (set post-swap only if a pool phantom was present) before cooling a member.
196196
- **R3 pool-stable phantom JWT:** Codex access tokens are JWTs; the per-real-token `resignJWT` would emit a different phantom after each cross-member refresh, breaking "agent never notices". `poolStablePhantomAccess` (`internal/proxy/oauth_response.go`) builds the phantom JWT from a deterministic synthetic payload keyed on the **pool name** (`sub: sluice-pool:<pool>`, `iss: sluice-phantom`, fixed far-future `exp`, no `iat`), HMAC-SHA256 with the fixed key — byte-identical across switches, structurally valid. Pool name is JSON-marshaled (never concatenated) so quotes/control chars can't inject claims. Static-form fallback (`SLUICE_PHANTOM:<pool>.access`) only on the unreachable `json.Marshal` failure. Refresh phantom stays static `SLUICE_PHANTOM:<pool>.refresh`.
197197

Lines changed: 103 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,103 @@
1+
# Sticky Pool Failover (no "main" account)
2+
3+
## Problem
4+
5+
Live on knuth: `pool openai_pool failed over openai_oauth -> openai_oauth_2 (429)` repeats
6+
every few minutes (20+ messages). Hermes keeps working. Root cause is a flap:
7+
8+
- `openai_oauth` is position 0 (the de-facto "main"). Its OpenAI quota is exhausted, so it
9+
429s on every request.
10+
- On 429 sluice fails over to `openai_oauth_2` and parks `openai_oauth` for only
11+
`vault.RateLimitCooldown` = 60s.
12+
- `PoolResolver.ResolveActive` picks the **first member in position order** that is healthy
13+
or whose cooldown expired. 60s later `openai_oauth`'s cooldown lapses, so it is re-selected
14+
even though `openai_oauth_2` was serving fine.
15+
- `openai_oauth` is still quota-exhausted upstream -> 429 -> failover again -> another
16+
identical Telegram notice. Repeats on every request gap > 60s.
17+
18+
The real OpenAI Codex quota window is hours/weekly, so re-probing the exhausted account
19+
every 60s and snapping back to it is wrong.
20+
21+
## Desired behavior (user decision)
22+
23+
Sticky failover: there is no "main" account. Stay on whichever member is currently active
24+
until **that** member exhausts, then advance to the next member. A lower-position member
25+
recovering from cooldown must NOT cause a switch back to it. A new failover (and its single
26+
Telegram notice) happens only on a genuine exhaustion transition, not on cooldown lapse.
27+
28+
Mode toggle (position-priority vs sticky as selectable strategies) is explicitly out of
29+
scope for this change; sticky becomes the behavior. Note it as a possible follow-up.
30+
31+
## Design
32+
33+
`ResolveActive` is the single source of truth: `internal/proxy/pool_failover.go` calls
34+
`pr.ResolveActive(pool)` to compute the new active ("to") after cooling the old member, so
35+
making selection sticky fixes the flap, the failover events, and the notification spam in
36+
one place.
37+
38+
1. Add a per-pool current-active pointer to the swap-surviving shared `PoolHealth`
39+
(`internal/vault/pool.go`), guarded by the same mutex as the cooldown map, so it
40+
survives `NewPoolResolverShared` regeneration and atomic swaps (same lifecycle as
41+
cooldowns; carry it the way cooldowns are carried).
42+
2. Sticky `ResolveActive(pool)`:
43+
- Let `cur` = shared current-active for the pool, if set and still a member of this
44+
generation.
45+
- If `cur` is set and healthy (no active cooldown) -> return `cur` (sticky hold).
46+
- Else pick the next eligible member by position **starting after `cur`'s position and
47+
wrapping** (so we advance forward, never snap back to position 0); skip members in
48+
active cooldown; treat `ManualRotateReason` parks with the existing
49+
better-degrade-target semantics. Set shared current-active to the picked member and
50+
return it.
51+
- If every member is cooling: keep the existing degrade behavior (operator-parked-but-
52+
healthy first, else soonest-recovering) and do NOT move the sticky pointer, so when a
53+
member recovers the next call advances to a healthy one rather than the absolute
54+
position-0 member.
55+
3. Operator `sluice pool rotate` semantics unchanged at the surface: it still parks the
56+
current active so the next `ResolveActive` advances; with sticky that advance lands on
57+
the next member and stays there (no snap-back), which is the intended rotate behavior.
58+
4. Keep all existing concurrency invariants: sticky pointer reads/writes under the same
59+
`PoolHealth.mu`; never lost across atomic pointer swap; a stale resolver generation must
60+
not clobber it (mirror the cooldown CRITICAL-1 handling).
61+
62+
## Out of scope
63+
64+
- Strategy/mode toggle across CLI/REST/Telegram + schema (`credential_pools.strategy`) —
65+
follow-up only; note the synergy, do not build.
66+
- Honoring upstream `Retry-After` / changing `RateLimitCooldown` constant — sticky makes
67+
the short cooldown harmless (we no longer re-probe the cooled member until forced), so a
68+
cooldown-duration change is unnecessary for this fix.
69+
70+
## Testing strategy
71+
72+
- Unit (vault): sticky hold — active member stays selected across many `ResolveActive`
73+
calls while a lower-position member is healthy.
74+
- Unit (vault): flap regression — fail member A (cooldown), `ResolveActive` returns B; A's
75+
cooldown expires; `ResolveActive` still returns B (no snap-back). B then cools ->
76+
advance to next, wrapping.
77+
- Unit (vault): all-cooling degrade unchanged; operator-parked degrade-target preserved.
78+
- Unit (vault): sticky pointer survives `NewPoolResolverShared` regeneration + swap, and a
79+
stale generation cannot clobber it (extend the existing CRITICAL-1 style test).
80+
- Unit (proxy): the failover path emits exactly ONE `cred_failover` + one notice per real
81+
exhaustion transition, and emits NOTHING when a non-active member's cooldown merely
82+
lapses (the spam regression, fail-before/pass-after).
83+
- Full `go test ./...`, `-race` on `internal/vault` and `internal/proxy`, gofumpt,
84+
golangci-lint, `go vet ./...` and `-tags=e2e ./e2e/`.
85+
86+
## Steps
87+
88+
### Task 1: Sticky selection in vault.PoolResolver
89+
- [x] Add swap-surviving per-pool current-active to shared `PoolHealth` (same mutex)
90+
- [x] Rewrite `ResolveActive` to the sticky algorithm above; preserve degrade + parked semantics
91+
- [x] Preserve CRITICAL-1 invariants (no loss/clobber across swap; stale generation safe)
92+
- [x] Unit tests: sticky hold, flap regression (no snap-back), advance+wrap, degrade unchanged, swap-survival
93+
- [x] `go test ./internal/vault/ -race`, gofumpt, vet
94+
95+
### Task 2: Failover path + notification spam regression
96+
- [x] Confirm `pool_failover.go` from->to now changes only on real exhaustion (sticky source of truth); adjust only if it bypasses `ResolveActive`
97+
- [x] Test: one `cred_failover`+notice per real transition; zero events when a non-active member's cooldown lapses (fail-before/pass-after)
98+
- [x] `go test ./internal/proxy/ -race`, gofumpt, vet
99+
100+
### Task 3: Docs + final validation
101+
- [x] Update CLAUDE.md credential-pools section to describe sticky selection (replace the position-priority wording) and note the mode-toggle follow-up
102+
- [x] `gofumpt -l` clean; `golangci-lint run ./...` 0 issues; full `go test ./...`; `go vet ./...`; `go vet -tags=e2e ./e2e/`
103+
- [x] Independently verify committed HEAD builds and tests pass (do not trust subagent green)

internal/proxy/pool_failover_test.go

Lines changed: 104 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -193,11 +193,23 @@ func TestFailoverCooldownTTLAndLazyRecovery(t *testing.T) {
193193
t.Fatalf("auth-fail cooldown TTL = %v, want ~%v", gotTTL, vault.AuthFailCooldown)
194194
}
195195

196-
// Lazy recovery: force the cooldown to the past; ResolveActive must
197-
// treat memA as eligible again with no background scheduler involved.
196+
// Lazy recovery: force the cooldown to the past. memA becomes ELIGIBLE
197+
// again with no background scheduler involved (CooldownUntil reports it
198+
// is no longer cooling). Selection is STICKY, so memB - which we failed
199+
// over to and which is healthy - remains active: a recovered lower-
200+
// position member must NOT snap back (the flap fix). memA's eligibility
201+
// only matters as a future advance target if memB later exhausts.
198202
pr.MarkCooldown("memA", time.Now().Add(-time.Second), "expired")
203+
if _, cooling := pr.CooldownUntil("memA"); cooling {
204+
t.Fatal("after expiry memA must no longer be cooling (lazy recovery, no scheduler)")
205+
}
206+
if active, _ := pr.ResolveActive("codex_pool"); active != "memB" {
207+
t.Fatalf("after expiry active = %q, want memB (sticky: recovered memA must NOT snap back)", active)
208+
}
209+
// Confirm memA IS the advance target once memB exhausts (wrap forward).
210+
pr.MarkCooldown("memB", time.Now().Add(vault.RateLimitCooldown), "429")
199211
if active, _ := pr.ResolveActive("codex_pool"); active != "memA" {
200-
t.Fatalf("after expiry active = %q, want memA (lazy recovery)", active)
212+
t.Fatalf("after memB exhausts active = %q, want memA (advance forward to recovered member)", active)
201213
}
202214
}
203215

@@ -346,6 +358,95 @@ func TestFailoverAuditEvent(t *testing.T) {
346358
}
347359
}
348360

361+
// TestStickyFailoverNoSpamOnNonActiveCooldownLapse is the notification-spam
362+
// regression for the live knuth flap. memA (position 0) is upstream-
363+
// exhausted. Sequence:
364+
//
365+
// 1. memA 429 -> exactly ONE cred_failover (memA->memB) + one notice.
366+
// 2. memA's short cooldown lapses. Under the OLD position-priority
367+
// ResolveActive, the next request's injection would re-select memA
368+
// (lower position, no longer cooling), hit the still-exhausted upstream,
369+
// 429 again, and emit ANOTHER identical failover + Telegram notice -
370+
// repeating every cooldown window (the 20+ message spam).
371+
// 3. With sticky selection memB stays active after memA recovers, so the
372+
// subsequent request resolves to memB, succeeds (200), and produces NO
373+
// additional failover event or notice.
374+
//
375+
// Fail-before: step 3 would record a 2nd cred_failover (the flap).
376+
// Pass-after: exactly one cred_failover total, one notice total.
377+
func TestStickyFailoverNoSpamOnNonActiveCooldownLapse(t *testing.T) {
378+
dir := t.TempDir()
379+
logPath := filepath.Join(dir, "audit.log")
380+
logger, err := audit.NewFileLogger(logPath)
381+
if err != nil {
382+
t.Fatalf("NewFileLogger: %v", err)
383+
}
384+
t.Cleanup(func() { _ = logger.Close() })
385+
386+
addon, _, prPtr := setupPoolAddon(t, "memA", "memB")
387+
addon.auditLog = logger
388+
client := setupAddonConn(addon, "auth.example.com:443")
389+
pr := prPtr.Load()
390+
391+
var notices int
392+
addon.SetOnFailover(func(FailoverEvent) { notices++ })
393+
394+
if got, _ := pr.ResolveActive("codex_pool"); got != "memA" {
395+
t.Fatalf("pre = %q, want memA", got)
396+
}
397+
398+
// 1. memA exhausts -> one real failover to memB.
399+
f1 := newPoolRespFlow(client, 429, []byte(`{"error":"rate_limited"}`))
400+
addon.flowInjected.Tag(f1.Id, "memA")
401+
addon.Response(f1)
402+
if got, _ := pr.ResolveActive("codex_pool"); got != "memB" {
403+
t.Fatalf("after 429 active = %q, want memB", got)
404+
}
405+
406+
// 2. memA's short cooldown lapses (still upstream-exhausted in reality).
407+
pr.MarkCooldown("memA", time.Now().Add(-time.Second), "expired")
408+
409+
// The next request's injection resolves the pool. STICKY: it must stay
410+
// on memB, NOT snap back to the recovered-but-still-exhausted memA. This
411+
// is the single source of truth that kills the flap.
412+
if got, _ := pr.ResolveActive("codex_pool"); got != "memB" {
413+
t.Fatalf("FLAP: after memA cooldown lapse, injection resolves %q, want memB", got)
414+
}
415+
416+
// 3. The subsequent legitimate request therefore hits memB and succeeds;
417+
// a 200 is a documented no-op (no failover, no notice).
418+
f2 := newPoolRespFlow(client, 200, []byte(`{"ok":true}`))
419+
addon.flowInjected.Tag(f2.Id, "memB")
420+
addon.Response(f2)
421+
422+
if err := logger.Close(); err != nil {
423+
t.Fatalf("logger close: %v", err)
424+
}
425+
data, err := os.ReadFile(logPath)
426+
if err != nil {
427+
t.Fatalf("read audit log: %v", err)
428+
}
429+
failovers := 0
430+
for _, line := range strings.Split(strings.TrimSpace(string(data)), "\n") {
431+
if line == "" {
432+
continue
433+
}
434+
var evt audit.Event
435+
if uerr := json.Unmarshal([]byte(line), &evt); uerr != nil {
436+
t.Fatalf("unmarshal %q: %v", line, uerr)
437+
}
438+
if evt.Action == "cred_failover" || evt.Action == "pool_exhausted" {
439+
failovers++
440+
}
441+
}
442+
if failovers != 1 {
443+
t.Fatalf("got %d failover audit events, want exactly 1 (sticky must not flap on non-active cooldown lapse)\n%s", failovers, data)
444+
}
445+
if notices != 1 {
446+
t.Fatalf("got %d failover notices, want exactly 1 (no Telegram spam on cooldown lapse)", notices)
447+
}
448+
}
449+
349450
// TestPoolForResponseResolvesActiveMember sanity-checks the destination ->
350451
// pool reverse mapping used by handlePoolFailover.
351452
func TestPoolForResponseResolvesActiveMember(t *testing.T) {

0 commit comments

Comments
 (0)