docs(pools): describe sticky active-member selection

nnemirovsky · nnemirovsky · commit 22243f400a86 · 2026-05-18T20:55:18.000+08:00
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -191,7 +191,7 @@ Auto-failover on 429/401 is primary; `pool rotate` is an override.
 
 - **Single chokepoint (I2):** every `binding.Credential` / `OAuthIndex.Has` / `extractInjectableSecret` / persist consumer on the HTTP/HTTPS OAuth path routes through `PoolResolver.ResolveActive` (`resolveInjectionTarget` for pass-1 header + pass-2 swap; `resolveOAuthResponseAttribution` for response/persist). `idx.Has` is always called with the resolved member, never the pool. Plain credentials pass through unchanged; SSH/mail are non-OAuth, out of scope.
 - **QUIC scope:** the HTTP/1.x/HTTP/2 MITM addon implements the full feature set (R1, R3, Phase 2). The HTTP/3/QUIC path (`QUICProxy.buildPhantomPairs`, binding-header injection in `quic.go`) is a request-side buffered swap with **no response-side OAuth interception**, but IS pool-aware on the request side: `QUICProxy.resolvePoolTarget` (via `NewQUICProxy`'s `poolResolver`) selects the active member's real secret and routes through `buildPooledOAuthPhantomPairs` so the access phantom is the same pool-stable JWT (R3 holds over QUIC). QUIC does **not** do R1 attribution or Phase 2 failover — the injected member is whatever the HTTP path / `pool rotate` last made active, and a QUIC-only 429/401 or refresh is not acted on. Deployments needing R1/auto-failover must route the pooled upstream over HTTP/HTTPS.
-- **Active-member selection:** healthy or expired-cooldown members first, by configured position; if all are in cooldown, the soonest-recovering is returned with a WARNING (degrade, never hard-fail). Recovery is lazy (evaluated in `ResolveActive`, no scheduler).
+- **Active-member selection (sticky, no "main" account):** `ResolveActive` is sticky. There is no position-0 "main": once selection settles on a member sluice keeps returning it while it is healthy, even if a lower-position member recovers from cooldown. A new active member is chosen only when the current one cools. The chosen member is the next eligible one by position **starting after the current member and wrapping** (advance forward, never snap back to position 0). The sticky current-active pointer lives on the shared swap-surviving `PoolHealth` under the same mutex as the cooldown map, so it survives `NewPoolResolverShared` regeneration and atomic pointer swaps (CRITICAL-1) and a stale resolver generation cannot clobber it, exactly like cooldowns. This kills the live flap where a 60s `RateLimitCooldown` lapse re-selected an upstream-exhausted account and respammed `cred_failover` plus a Telegram notice every cooldown window. If every member is cooling, the existing degrade applies (operator-parked-but-healthy `ManualRotateReason` first, else soonest-recovering, with a WARNING, never hard-fail) and the sticky pointer is left untouched so a recovery advances forward rather than snapping to position 0. `sluice pool rotate` still works: it parks the active member, so the next `ResolveActive` advances to the next member and then stays there (no snap-back). Recovery is lazy (evaluated in `ResolveActive`, no scheduler). A selectable position-priority-vs-sticky strategy mode is a possible follow-up and is out of scope here.
 - **R1 refresh-token attribution / fail-closed:** when pass-2 swaps `SLUICE_PHANTOM:<pool>.refresh`, sluice records `realRefreshToken -> member` in a short-TTL map; on the token-endpoint response it recovers the member by that real refresh token and persists to it (`persistAddonOAuthTokens(member, ...)`, singleflight `"persist:"+member`). The join key is the real **refresh** token — never the access token, connection, or `OAuthIndex.Match` (two pooled members share `auth.openai.com`'s token URL and collide). Unrecoverable -> WARNING + skip the write (rotating refresh tokens are single-use; a mis-attributed write bricks both accounts). **Plain-credential disambiguation:** a plain OAuth credential sharing a pool's token URL also tags its injected refresh token `realRefreshToken -> <plain name>` (plain path in `buildPhantomPairs`/`buildOAuthPhantomPairs`'s `onRefreshInject`, incl. split-host expansion); on response a recovered non-member (`PoolForMember == ""`) is attributed 1:1 to that plain credential, NOT fail-closed as pooled. The pooled fail-closed path applies only when recovery fails or resolves to an actual member; `poolForResponse` gates the same on an independent `flowInjected` tag (set post-swap only if a pool phantom was present) before cooling a member.
 - **R3 pool-stable phantom JWT:** Codex access tokens are JWTs; the per-real-token `resignJWT` would emit a different phantom after each cross-member refresh, breaking "agent never notices". `poolStablePhantomAccess` (`internal/proxy/oauth_response.go`) builds the phantom JWT from a deterministic synthetic payload keyed on the **pool name** (`sub: sluice-pool:<pool>`, `iss: sluice-phantom`, fixed far-future `exp`, no `iat`), HMAC-SHA256 with the fixed key — byte-identical across switches, structurally valid. Pool name is JSON-marshaled (never concatenated) so quotes/control chars can't inject claims. Static-form fallback (`SLUICE_PHANTOM:<pool>.access`) only on the unreachable `json.Marshal` failure. Refresh phantom stays static `SLUICE_PHANTOM:<pool>.refresh`.
 
diff --git a/docs/plans/20260518-sticky-failover.md b/docs/plans/20260518-sticky-failover.md
@@ -86,18 +86,18 @@ one place.
 ## Steps
 
 ### Task 1: Sticky selection in vault.PoolResolver
-- [ ] Add swap-surviving per-pool current-active to shared `PoolHealth` (same mutex)
-- [ ] Rewrite `ResolveActive` to the sticky algorithm above; preserve degrade + parked semantics
-- [ ] Preserve CRITICAL-1 invariants (no loss/clobber across swap; stale generation safe)
-- [ ] Unit tests: sticky hold, flap regression (no snap-back), advance+wrap, degrade unchanged, swap-survival
-- [ ] `go test ./internal/vault/ -race`, gofumpt, vet
+- [x] Add swap-surviving per-pool current-active to shared `PoolHealth` (same mutex)
+- [x] Rewrite `ResolveActive` to the sticky algorithm above; preserve degrade + parked semantics
+- [x] Preserve CRITICAL-1 invariants (no loss/clobber across swap; stale generation safe)
+- [x] Unit tests: sticky hold, flap regression (no snap-back), advance+wrap, degrade unchanged, swap-survival
+- [x] `go test ./internal/vault/ -race`, gofumpt, vet
 
 ### Task 2: Failover path + notification spam regression
-- [ ] Confirm `pool_failover.go` from->to now changes only on real exhaustion (sticky source of truth); adjust only if it bypasses `ResolveActive`
-- [ ] Test: one `cred_failover`+notice per real transition; zero events when a non-active member's cooldown lapses (fail-before/pass-after)
-- [ ] `go test ./internal/proxy/ -race`, gofumpt, vet
+- [x] Confirm `pool_failover.go` from->to now changes only on real exhaustion (sticky source of truth); adjust only if it bypasses `ResolveActive`
+- [x] Test: one `cred_failover`+notice per real transition; zero events when a non-active member's cooldown lapses (fail-before/pass-after)
+- [x] `go test ./internal/proxy/ -race`, gofumpt, vet
 
 ### Task 3: Docs + final validation
-- [ ] Update CLAUDE.md credential-pools section to describe sticky selection (replace the position-priority wording) and note the mode-toggle follow-up
-- [ ] `gofumpt -l` clean; `golangci-lint run ./...` 0 issues; full `go test ./...`; `go vet ./...`; `go vet -tags=e2e ./e2e/`
-- [ ] Independently verify committed HEAD builds and tests pass (do not trust subagent green)
+- [x] Update CLAUDE.md credential-pools section to describe sticky selection (replace the position-priority wording) and note the mode-toggle follow-up
+- [x] `gofumpt -l` clean; `golangci-lint run ./...` 0 issues; full `go test ./...`; `go vet ./...`; `go vet -tags=e2e ./e2e/`
+- [x] Independently verify committed HEAD builds and tests pass (do not trust subagent green)