
Commit 22243f4

docs(pools): describe sticky active-member selection
1 parent e1b7ad1 commit 22243f4

2 files changed

Lines changed: 12 additions & 12 deletions


CLAUDE.md

Lines changed: 1 addition & 1 deletion
@@ -191,7 +191,7 @@ Auto-failover on 429/401 is primary; `pool rotate` is an override.

- **Single chokepoint (I2):** every `binding.Credential` / `OAuthIndex.Has` / `extractInjectableSecret` / persist consumer on the HTTP/HTTPS OAuth path routes through `PoolResolver.ResolveActive` (`resolveInjectionTarget` for pass-1 header + pass-2 swap; `resolveOAuthResponseAttribution` for response/persist). `idx.Has` is always called with the resolved member, never the pool. Plain credentials pass through unchanged; SSH/mail are non-OAuth, out of scope.
- **QUIC scope:** the HTTP/1.x/HTTP/2 MITM addon implements the full feature set (R1, R3, Phase 2). The HTTP/3/QUIC path (`QUICProxy.buildPhantomPairs`, binding-header injection in `quic.go`) is a request-side buffered swap with **no response-side OAuth interception**, but IS pool-aware on the request side: `QUICProxy.resolvePoolTarget` (via `NewQUICProxy`'s `poolResolver`) selects the active member's real secret and routes through `buildPooledOAuthPhantomPairs` so the access phantom is the same pool-stable JWT (R3 holds over QUIC). QUIC does **not** do R1 attribution or Phase 2 failover — the injected member is whatever the HTTP path / `pool rotate` last made active, and a QUIC-only 429/401 or refresh is not acted on. Deployments needing R1/auto-failover must route the pooled upstream over HTTP/HTTPS.
-- **Active-member selection:** healthy or expired-cooldown members first, by configured position; if all are in cooldown, the soonest-recovering is returned with a WARNING (degrade, never hard-fail). Recovery is lazy (evaluated in `ResolveActive`, no scheduler).
+- **Active-member selection (sticky, no "main" account):** `ResolveActive` is sticky. There is no position-0 "main": once selection settles on a member sluice keeps returning it while it is healthy, even if a lower-position member recovers from cooldown. A new active member is chosen only when the current one cools. The chosen member is the next eligible one by position **starting after the current member and wrapping** (advance forward, never snap back to position 0). The sticky current-active pointer lives on the shared swap-surviving `PoolHealth` under the same mutex as the cooldown map, so it survives `NewPoolResolverShared` regeneration and atomic pointer swaps (CRITICAL-1) and a stale resolver generation cannot clobber it, exactly like cooldowns. This kills the live flap where a 60s `RateLimitCooldown` lapse re-selected an upstream-exhausted account and respammed `cred_failover` plus a Telegram notice every cooldown window. If every member is cooling, the existing degrade applies (operator-parked-but-healthy `ManualRotateReason` first, else soonest-recovering, with a WARNING, never hard-fail) and the sticky pointer is left untouched so a recovery advances forward rather than snapping to position 0. `sluice pool rotate` still works: it parks the active member, so the next `ResolveActive` advances to the next member and then stays there (no snap-back). Recovery is lazy (evaluated in `ResolveActive`, no scheduler). A selectable position-priority-vs-sticky strategy mode is a possible follow-up and is out of scope here.
- **R1 refresh-token attribution / fail-closed:** when pass-2 swaps `SLUICE_PHANTOM:<pool>.refresh`, sluice records `realRefreshToken -> member` in a short-TTL map; on the token-endpoint response it recovers the member by that real refresh token and persists to it (`persistAddonOAuthTokens(member, ...)`, singleflight `"persist:"+member`). The join key is the real **refresh** token — never the access token, connection, or `OAuthIndex.Match` (two pooled members share `auth.openai.com`'s token URL and collide). Unrecoverable -> WARNING + skip the write (rotating refresh tokens are single-use; a mis-attributed write bricks both accounts). **Plain-credential disambiguation:** a plain OAuth credential sharing a pool's token URL also tags its injected refresh token `realRefreshToken -> <plain name>` (plain path in `buildPhantomPairs`/`buildOAuthPhantomPairs`'s `onRefreshInject`, incl. split-host expansion); on response a recovered non-member (`PoolForMember == ""`) is attributed 1:1 to that plain credential, NOT fail-closed as pooled. The pooled fail-closed path applies only when recovery fails or resolves to an actual member; `poolForResponse` gates the same on an independent `flowInjected` tag (set post-swap only if a pool phantom was present) before cooling a member.
- **R3 pool-stable phantom JWT:** Codex access tokens are JWTs; the per-real-token `resignJWT` would emit a different phantom after each cross-member refresh, breaking "agent never notices". `poolStablePhantomAccess` (`internal/proxy/oauth_response.go`) builds the phantom JWT from a deterministic synthetic payload keyed on the **pool name** (`sub: sluice-pool:<pool>`, `iss: sluice-phantom`, fixed far-future `exp`, no `iat`), HMAC-SHA256 with the fixed key — byte-identical across switches, structurally valid. Pool name is JSON-marshaled (never concatenated) so quotes/control chars can't inject claims. Static-form fallback (`SLUICE_PHANTOM:<pool>.access`) only on the unreachable `json.Marshal` failure. Refresh phantom stays static `SLUICE_PHANTOM:<pool>.refresh`.
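
The R3 bullet above describes a phantom JWT that stays byte-identical across member switches. A minimal sketch of that idea follows, assuming a standard HS256 header and compact JWS encoding. The function names `poolStablePhantom` and `decodeClaims`, the key value, and the concrete `exp` timestamp are hypothetical placeholders, not the values used in `internal/proxy/oauth_response.go`.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// Hypothetical placeholders: the real fixed key and expiry are not shown in the source.
var phantomKey = []byte("example-fixed-key")

const farFutureExp int64 = 4102444800 // 2100-01-01T00:00:00Z

// phantomClaims is the synthetic payload: keyed on the pool name, no iat.
type phantomClaims struct {
	Iss string `json:"iss"`
	Sub string `json:"sub"`
	Exp int64  `json:"exp"`
}

// poolStablePhantom builds a deterministic HS256 JWT for a pool name.
// The pool name only ever reaches the payload through json.Marshal,
// so quotes or control characters in it cannot add or override claims.
func poolStablePhantom(pool string) (string, error) {
	header := []byte(`{"alg":"HS256","typ":"JWT"}`)
	payload, err := json.Marshal(phantomClaims{
		Iss: "sluice-phantom",
		Sub: "sluice-pool:" + pool,
		Exp: farFutureExp,
	})
	if err != nil {
		return "", err
	}
	enc := base64.RawURLEncoding
	signingInput := enc.EncodeToString(header) + "." + enc.EncodeToString(payload)
	mac := hmac.New(sha256.New, phantomKey)
	mac.Write([]byte(signingInput))
	return signingInput + "." + enc.EncodeToString(mac.Sum(nil)), nil
}

// decodeClaims parses the payload segment back out (no signature check),
// which is enough to inspect what a token actually claims.
func decodeClaims(token string) (phantomClaims, error) {
	var c phantomClaims
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return c, errors.New("token does not have three segments")
	}
	raw, err := base64.RawURLEncoding.DecodeString(parts[1])
	if err != nil {
		return c, err
	}
	err = json.Unmarshal(raw, &c)
	return c, err
}

func main() {
	first, _ := poolStablePhantom("codex")
	second, _ := poolStablePhantom("codex")
	fmt.Println(first == second) // true: nothing in the token depends on the real member token

	claims, _ := decodeClaims(first)
	fmt.Println(claims.Sub) // sluice-pool:codex
}
```

Because neither the payload nor the key depends on the real access token, a cross-member refresh leaves the phantom unchanged, which is the property the bullet calls "agent never notices".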

docs/plans/20260518-sticky-failover.md

Lines changed: 11 additions & 11 deletions
@@ -86,18 +86,18 @@ one place.
## Steps

### Task 1: Sticky selection in vault.PoolResolver
-- [ ] Add swap-surviving per-pool current-active to shared `PoolHealth` (same mutex)
-- [ ] Rewrite `ResolveActive` to the sticky algorithm above; preserve degrade + parked semantics
-- [ ] Preserve CRITICAL-1 invariants (no loss/clobber across swap; stale generation safe)
-- [ ] Unit tests: sticky hold, flap regression (no snap-back), advance+wrap, degrade unchanged, swap-survival
-- [ ] `go test ./internal/vault/ -race`, gofumpt, vet
+- [x] Add swap-surviving per-pool current-active to shared `PoolHealth` (same mutex)
+- [x] Rewrite `ResolveActive` to the sticky algorithm above; preserve degrade + parked semantics
+- [x] Preserve CRITICAL-1 invariants (no loss/clobber across swap; stale generation safe)
+- [x] Unit tests: sticky hold, flap regression (no snap-back), advance+wrap, degrade unchanged, swap-survival
+- [x] `go test ./internal/vault/ -race`, gofumpt, vet

### Task 2: Failover path + notification spam regression
-- [ ] Confirm `pool_failover.go` from->to now changes only on real exhaustion (sticky source of truth); adjust only if it bypasses `ResolveActive`
-- [ ] Test: one `cred_failover`+notice per real transition; zero events when a non-active member's cooldown lapses (fail-before/pass-after)
-- [ ] `go test ./internal/proxy/ -race`, gofumpt, vet
+- [x] Confirm `pool_failover.go` from->to now changes only on real exhaustion (sticky source of truth); adjust only if it bypasses `ResolveActive`
+- [x] Test: one `cred_failover`+notice per real transition; zero events when a non-active member's cooldown lapses (fail-before/pass-after)
+- [x] `go test ./internal/proxy/ -race`, gofumpt, vet

### Task 3: Docs + final validation
-- [ ] Update CLAUDE.md credential-pools section to describe sticky selection (replace the position-priority wording) and note the mode-toggle follow-up
-- [ ] `gofumpt -l` clean; `golangci-lint run ./...` 0 issues; full `go test ./...`; `go vet ./...`; `go vet -tags=e2e ./e2e/`
-- [ ] Independently verify committed HEAD builds and tests pass (do not trust subagent green)
+- [x] Update CLAUDE.md credential-pools section to describe sticky selection (replace the position-priority wording) and note the mode-toggle follow-up
+- [x] `gofumpt -l` clean; `golangci-lint run ./...` 0 issues; full `go test ./...`; `go vet ./...`; `go vet -tags=e2e ./e2e/`
+- [x] Independently verify committed HEAD builds and tests pass (do not trust subagent green)
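
The sticky selection that Task 1 implements, and the cases its unit tests name (sticky hold, no snap-back, advance+wrap, degrade), can be illustrated with a small single-threaded model. This is a sketch under stated assumptions, not the `vault.PoolResolver` code: the type `stickyPool` and its methods are hypothetical, time is modeled as integer seconds, and the mutex, the shared `PoolHealth`, and the `ManualRotateReason` preference during degrade are all omitted.

```go
package main

import "fmt"

// stickyPool is a hypothetical, simplified stand-in for the shared pool state.
type stickyPool struct {
	members       []string         // configured order
	cooldownUntil map[string]int64 // member -> second at which its cooldown ends
	current       int              // sticky pointer; -1 until the first selection
}

func newStickyPool(members ...string) *stickyPool {
	return &stickyPool{members: members, cooldownUntil: map[string]int64{}, current: -1}
}

// cool puts a member into cooldown until the given second.
func (p *stickyPool) cool(member string, until int64) { p.cooldownUntil[member] = until }

// eligible reports whether member i has no cooldown or an expired one.
// Recovery is lazy: it is only evaluated here, at resolve time.
func (p *stickyPool) eligible(i int, now int64) bool {
	until, cooling := p.cooldownUntil[p.members[i]]
	return !cooling || now >= until
}

// resolveActive returns the active member and whether the result is degraded.
func (p *stickyPool) resolveActive(now int64) (member string, degraded bool) {
	n := len(p.members)
	if n == 0 {
		return "", true
	}
	// Sticky hold: keep the current member while it is healthy.
	if p.current >= 0 && p.eligible(p.current, now) {
		return p.members[p.current], false
	}
	// Advance forward from the position after current, wrapping around.
	for step := 1; step <= n; step++ {
		i := (p.current + step) % n
		if p.eligible(i, now) {
			p.current = i
			return p.members[i], false
		}
	}
	// Degrade: everyone is cooling. Return the soonest-recovering member
	// and leave the sticky pointer untouched.
	best := 0
	for i := 1; i < n; i++ {
		if p.cooldownUntil[p.members[i]] < p.cooldownUntil[p.members[best]] {
			best = i
		}
	}
	return p.members[best], true
}

func main() {
	p := newStickyPool("a", "b", "c")
	m, _ := p.resolveActive(0)
	fmt.Println(m) // a
	p.cool("a", 60)
	m, _ = p.resolveActive(1)
	fmt.Println(m) // b (advance)
	m, _ = p.resolveActive(61)
	fmt.Println(m) // b (a recovered, but no snap-back)
	p.cool("b", 200)
	p.cool("c", 300)
	m, _ = p.resolveActive(151)
	fmt.Println(m) // a (c is cooling too, so the search wraps)
}
```

In this model a lapsed cooldown on a non-active member never changes the result of `resolveActive`, which is the behavior the Task 2 regression test checks through the `cred_failover` event count.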
