diff --git a/docs/plans/20260515-approval-coalescing.md b/docs/plans/20260515-approval-coalescing.md new file mode 100644 index 0000000..1fd89f8 --- /dev/null +++ b/docs/plans/20260515-approval-coalescing.md @@ -0,0 +1,126 @@ +# Coalesce Duplicate Approval Prompts (broker-level dedup) + +## Overview + +When the agent fires many requests to the same target while the **first** +approval is still pending, the broker creates one independent waiter + one +Telegram message **per request** — the wall of identical "Hermes wants to +connect to: HTTPS cas-server.xethub.hf.co:443 …" prompts the user reported, +each demanding its own tap, even though "Always Allow" persists a rule keyed +only by `destination:port`. + +Fix: **coalesce pending approvals by their persistence-equivalent target** +(`dest:port`). The first request to a target opens one prompt; concurrent +requests to the same target while it is pending **attach to that same pending +approval**. On resolve, every attached waiter gets the same response. This +generalises the QUIC-only buffering pattern in `internal/proxy/server.go` to +the channel-agnostic broker so HTTP/HTTPS / gRPC / WS *and connection-level +SSH / IMAP / SMTP* (which all share one broker call site) get dedup. + +**Scoping decision (from review):** ship the actual fix — broker coalescing — +in **Phase 1**, with the final coalesced count folded into the *existing* +resolve/cancel message edit (zero extra Telegram API calls, zero throttle +logic). The live mid-burst "+N pending" counter, whose only complexity is Bot +API flood-control realism, is **Phase 2 (optional)** and orthogonal to fixing +the prompt wall. + +## Context + +Verified against the working tree on `main` (tip `20cc367`): + +- **Exactly three `broker.Request` call sites exist** (grep-confirmed): + 1. `internal/proxy/request_policy.go:299` (`resolveAsk`) — the **sole** Ask path for per-request HTTP/gRPC/WS **and** connection-level SSH/IMAP/SMTP. Connection-level Ask is deferred at `internal/proxy/server.go:376-393` and resolved through this same `CheckAndConsume → resolveAsk → broker.Request`. There is **no separate SSH/IMAP/SMTP call site.** + 2. `internal/proxy/server.go:2477` — QUIC (already has its own packet-buffering dedup; **left untouched**). + 3. `internal/mcp/gateway.go:228` — MCP tool calls, keyed by tool name + args, `port 0`. A `dest:port` dedup key would wrongly collapse semantically distinct tool calls (different `ToolArgs`, arg-sensitive ContentInspector/exec rules). **MCP must opt out.** +- `internal/channel/broker.go`: `Broker{waiters map[string]waiter}` `:32`; ids `req_N` auto-increment `:161`, **not keyed by target**; `broadcast` to all channels `:226`/`:277-289`; blocking `select` `:232-271`; primary `ch` is **buffered cap 1** `:162`. `Resolve(id,resp)` `:344-358` first-wins per id, deletes waiter under lock, releases lock, *then* `w.ch <- resp` + `cancelOnChannels`. +- Timeout path `:248-271` does `delete(b.waiters,id)` + `b.timedOut[id]=now`. A coalesced sub has **no id of its own**, so it must use a *different* select arm (detach-from-subs only) — the existing arm would tear down the shared primary waiter. +- `persistApprovalRule` `internal/proxy/server.go:506-530` writes `store.AddRule(verdict, RuleOpts{Destination:dest, Ports:[]int{port}, Source:"approval"})` — **dest:port only, no method/path**, guarded by `reloadMu`. `store.AddRule` is the only entry point; **there is no existing "rule exists?" query** — adding one is a plain SELECT, no migration. +- Telegram `internal/telegram/approval.go`: `msgMap sync.Map` of `approvalMsg{messageID, req}` `:163`; inline keyboard built `:141-150`; resolve edit `:332-353` and cancel edit `:181-199` both use `NewEditMessageText` **without** re-attaching markup (intentional — they remove the keyboard). Library is `go-telegram-bot-api`; **`EditMessageTextAndMarkup` does not exist** — keyboard preservation needs an explicit `tgbotapi.EditMessageTextConfig` with `.ReplyMarkup` set to the rebuilt inline keyboard. +- HTTP per-request already bypasses the 5/min limiter via `WithBypassRateLimit` (`request_policy.go:268`, `broker.go:152`), so coalescing won't interact with it. + +## Development Approach + +- **Testing approach**: Regular (code first, then tests). Tests enumerated as explicit per-task items. +- gofumpt before commit (`feedback_gofumpt`); PR to `main` (`feedback_pr_workflow`). +- Two phases. Phase 1 is the complete fix and is shippable alone. + +## Testing Strategy + +- **Unit (broker)**: N concurrent same-`dest:port` `Request` → exactly **one** `broadcast`; one `Resolve` fans to **all N**; late arrival between resolve and index-clear does **not** attach to a dead waiter (drives the exact interleave); a **sub whose caller timed out and detached does not block `Resolve` fan-out**; deny/timeout/shutdown fan to all subs; different `dest:port` not coalesced; `WithNoCoalesce` requests never coalesce; first-wins across channels still works on the primary id. +- **Unit (persist)**: M coalesced `ResponseAlwaysAllow` → rule inserted **once** (new `HasApprovalRule` SELECT under `reloadMu`); **MCP calls with different `ToolArgs` are NOT coalesced.** +- **Unit (Telegram)**: final count folded into the existing resolve/cancel edit (no extra Send); Phase 2 only — live counter edit preserves keyboard and honors `retry_after`. +- **E2e** (`e2e`): burst of HTTP requests to one Ask destination through the proxy with the webhook channel; assert one approval delivered, one response fans to all, all proceed. Confirm the webhook/HTTP channel either implements `CoalesceNotifier` or is an explicit no-op so the assertion is meaningful. + +## Phase 1 — Broker coalescing + final-count-on-resolve (the fix) + +### Task 1: Broker coalescing core + +**Files:** Modify `internal/channel/broker.go`; Modify `internal/channel/broker_test.go` + +- Add to `Broker`: `dedupIndex map[string]string` (`dedupKey → primary reqID`); extend `waiter` with `subs []chan Response`, `count int` (starts 1), and `dedupKey string`. +- Add `WithNoCoalesce()` request option (escape hatch). +- `Request`: compute `dedupKey := dest + ":" + strconv.Itoa(port)` (proto-agnostic — matches the proto-agnostic persisted rule). If `WithNoCoalesce` set, skip all dedup logic. +- Under `b.mu`: if `dedupIndex[dedupKey]` exists and that primary waiter is still in `waiters` → create a **buffered (cap 1)** response chan, append to `waiters[primary].subs`, `count++`, capture primary id + count; release lock; (Phase 2 only: notify channels of new count). Block on the sub chan using a **sub-specific select** arm: `case resp := <-subCh` (return it) / `case <-deadline.C` (under `b.mu`: remove this chan from `waiters[primary].subs` if still present, then return timeout deny — **no `waiters` delete, no `timedOut` entry**, must not tear down the shared waiter) / `case <-b.done` (drain subCh non-blocking then deny). +- Else: today's behavior — new id, register waiter, set `dedupIndex[dedupKey]=id`, record `dedupKey` on the waiter, `broadcast`. +- `Resolve(id,resp)`: under `b.mu` look up waiter, **snapshot the whole waiter (including the `subs` slice)**, delete `b.waiters[id]` **and** `delete(b.dedupIndex, w.dedupKey)` in the *same locked section* (this is what closes the late-attach race), release lock, then fan `resp` to `w.ch` and every chan in the snapshot `subs` (all buffered cap 1 so a send to a detached/timed-out sub never blocks), then `cancelOnChannels(id)`. First-wins preserved: only the call that finds the waiter present wins. +- Timeout/`done`/shutdown of the **primary**: fan the terminal response to all snapshot subs and clear `dedupIndex` under the same lock discipline. +- write tests: concurrent dedup → one broadcast; fan-out to all N; late-attach interleave (no attach to dead waiter); sub-timeout-detach does not block fan-out; deny/timeout/shutdown fan-out; distinct dest:port; `WithNoCoalesce`; cross-channel first-wins. +- run `go test ./internal/channel/...` — must pass before Task 2. + +### Task 2: Persist-once (idempotent approval rule) + +**Files:** Modify `internal/store/store.go`; Modify `internal/proxy/server.go`; Modify `internal/store/store_test.go` + +- Add `Store.HasApprovalRule(verdict, dest string, port int) (bool, error)` — plain SELECT against `rules` where `source='approval'` AND verdict/destination/port match. **No migration** (read-only query). +- In `persistApprovalRule` (`server.go:506`), under the existing `reloadMu`, call `HasApprovalRule` first and skip `AddRule` + engine recompile if present (M coalesced callers serialize on `reloadMu`; first inserts, rest no-op). This is the chosen design over "only primary persists" because it needs no broker→persist signaling and is robustly idempotent under concurrent resolve fan-out. (Scope note: this is *not* a vehicle for fixing any pre-existing manual double-tap dup-row behavior — that is out of scope and not a design driver.) +- write tests: M concurrent persists → exactly one row; existing single-persist path unchanged. +- run `go test ./internal/store/... ./internal/proxy/...` — must pass before Task 3. + +### Task 3: Route call sites; MCP opt-out + +**Files:** Modify `internal/mcp/gateway.go`; audit `internal/proxy/request_policy.go`, `internal/proxy/server.go` + +- `request_policy.go:299` (HTTP/gRPC/WS + connection-level SSH/IMAP/SMTP): coalesce **uniformly**. Rationale: an SSH/IMAP burst to one `dest:port` persists the *same* single `dest:port` rule as HTTP — the plan's own "persistence granularity = dedup granularity" thesis applies identically; no per-protocol special-casing, no `checkContext` plumbing. +- `mcp/gateway.go:228`: pass `WithNoCoalesce()` — distinct `ToolArgs` are semantically distinct and arg-sensitive (ContentInspector/exec). This is the call site that genuinely needs the escape hatch. +- QUIC (`server.go:2477`): untouched (its own buffering remains). +- write tests: MCP calls with differing `ToolArgs` produce distinct prompts (not coalesced); SSH-style connection-level Ask to same dest:port coalesces. +- run `go test ./...` — must pass before Task 4. + +### Task 4: Final count on the existing resolve/cancel edit + +**Files:** Modify `internal/telegram/approval.go`; Modify `internal/channel/broker.go` (expose final `count` on resolve); Modify `internal/telegram/approval_test.go` + +- Broker passes the final coalesced `count` to channels on cancel/resolve (extend the existing cancel/resolve notification path; no new Telegram Send). +- Telegram resolve edit (`:332-353`) / cancel edit (`:181-199`): when `count > 1`, render e.g. "Always allowed — applied to N requests at HH:MM:SS". **Zero extra API calls** — folded into the one edit that already happens. +- write tests: count rendered correctly for count==1 and count>1; no additional `Send` beyond the existing single edit. +- run `go test ./internal/telegram/...` — must pass. + +### Task 5: Verify acceptance + docs + +- verify the prompt-wall scenario: burst → one prompt → one tap dismisses all (e2e). +- run full suite `go test ./... -timeout 30s`; run e2e `go test -tags=e2e ./e2e/ -count=1 -timeout=300s`. +- update CLAUDE.md "Channel/approval abstraction" + "QUIC broker dedup" notes to mention broker-level coalescing. +- move plan to `docs/plans/completed/`. + +## Phase 2 (optional) — Live mid-burst counter + +Only if the live "+N more pending" indicator is wanted on top of Phase 1. + +**Files:** Modify `internal/channel` (new `CoalesceNotifier interface { ApprovalCoalesced(id string, count int) }`); Modify `internal/telegram/approval.go` + +- Broker, on attaching a sub, best-effort calls `ch.ApprovalCoalesced(primaryID, count)` for channels implementing the interface (panic-recover like `broadcast`). +- `TelegramChannel.ApprovalCoalesced`: edit the stored message appending `+%d more identical request(s) to this target pending`, **preserving the keyboard** via an explicit `tgbotapi.EditMessageTextConfig` with `.ReplyMarkup` = pointer to the inline keyboard rebuilt from `req.ID` (the `:141-150` shape) — `EditMessageTextAndMarkup` does not exist in the library. +- Throttle: ≥800ms between edits per message, coalesce intermediate bumps; **on HTTP 429 honor `retry_after`** and drop intermediate edits (the final count still rides Phase 1's resolve edit, so a dropped mid-burst edit loses nothing material). +- Tests: keyboard preserved across the edit; `retry_after` honored (no tight-loop retry); final count still correct via Phase 1 path. + +## Out of scope + +Cross-port coalescing of one host; wildcard "similar resource" grouping (persisted rule is exact host:port — dedup must match); aggregate burst audit metric (per-request audit already covers it); fixing any pre-existing manual double-tap dup-row behavior. + +## Risks / decisions + +- **Late-attach race** (arrival between resolve and dedupIndex clear): closed by deleting `waiters[id]` and `dedupIndex[key]` in the *same* locked section and fanning out to a subs **snapshot** taken under that lock; sends happen after unlock to buffered (cap 1) chans. Explicit interleave unit test (Task 1). +- **Sub timeout vs shared waiter**: subs use a dedicated select arm that detaches only themselves; never delete the primary waiter or write `timedOut`. Buffered sub chans guarantee `Resolve` fan-out never blocks on a departed sub. +- **Call-site topology**: only three `broker.Request` sites; SSH/IMAP/SMTP share `request_policy.go:299` with HTTP (coalesce uniformly); MCP `gateway.go:228` opts out; QUIC untouched. (Original plan's "SSH needs WithNoCoalesce" was based on a non-existent call site — corrected.) +- **Behavioral change**: an operator who wanted to deny individual requests now sees one prompt for the burst — matches persisted-rule granularity (one rule covers them all) and the user's explicit request; intended. +- **Telegram realism**: Phase 1 adds **zero** new API calls (final count rides the existing edit). Flood-control risk is entirely confined to optional Phase 2, which handles `retry_after`. diff --git a/docs/plans/20260515-credential-pool-failover.md b/docs/plans/20260515-credential-pool-failover.md new file mode 100644 index 0000000..dd36c29 --- /dev/null +++ b/docs/plans/20260515-credential-pool-failover.md @@ -0,0 +1,123 @@ +# Credential Pool with Auto-Failover (multi-account OAuth) + +## Overview + +Let a single phantom identity the agent sees be backed by **N real OAuth +credentials** (a "pool"), with sluice picking which real account to inject and +**auto-failing-over to the next member when the upstream returns 429 +(rate-limited) or 401 / `invalid_grant` (auth-failure)**. Primary use case: two +OpenAI Codex OAuth accounts driven by one Hermes agent, so quota exhaustion on +one account transparently rolls onto the other. + +The agent always holds **one pool-scoped phantom pair** +(`SLUICE_PHANTOM:.access` / `.refresh`). Sluice maps the pool phantom to +the *currently active member's* real token at injection time, and persists +refreshed tokens back to the member that actually issued them. + +**Phantom-stability decision (resolved — see Risk R3):** OpenAI Codex access +tokens are JWTs and `resignJWT` (`internal/proxy/oauth_response.go:49-69`) is +deterministic *per real token*, not per pool — so the naive design would emit a +**different** phantom JWT after every cross-member refresh, breaking the +"agent never notices" property. Phase 1 therefore makes pooled OAuth +credentials use a **pool-stable synthetic resign**: the phantom JWT is built +from a deterministic synthetic payload keyed on the *pool name* (stable +`sub`/`iss`, far-future `exp`), HMAC'd with the existing fixed key, so the +phantom is byte-identical across member switches while remaining a +structurally valid JWT. (Fallback if Codex/Hermes is verified to treat the +access token as opaque: emit the static `SLUICE_PHANTOM:.access` form. +The synthetic-JWT path is primary because `resignJWT` exists specifically +because *something* parses the JWT client-side; we must not assume opacity.) + +## Context + +Verified against the working tree on `main` (tip `20cc367`; all symbol/line +references checked against current files, not a fixed SHA): + +- `internal/vault/oauth.go:9-54` — `OAuthCredential{AccessToken, RefreshToken, TokenURL, ExpiresAt}`; `UpdateTokens()` recomputes `ExpiresAt`. `ExpiresAt` is stored but **never read** before injection today. +- `internal/proxy/phantom.go:42-44` — `PhantomToken(name) = "SLUICE_PHANTOM:"+name`. OAuth variant `SLUICE_PHANTOM:.access` / `.refresh`. Phantom strings derive purely from the credential name — this is what makes a *pool-scoped* phantom feasible. +- `internal/proxy/oauth_response.go` — `oauthPhantomAccess(credName, realToken...)` `:27`, `resignJWT(realToken)` `:49-69` (HMACs `header.payload` with a fixed key; **per-real-token deterministic, NOT per-pool** — root of Risk R3). +- `internal/vault/binding.go:78-134` — `BindingResolver.Resolve`/`ResolveForProtocol` return the **first** match only. `CredentialsForDestination` `:184-214` returns all matching cred names (used by phantom-swap pass 2). +- `internal/proxy/addon.go`: + - `injectHeaders` `:532`, pass-1 single-credential header inject `~:553` (`provider.Get(binding.Credential)`). + - response credential match `idx.Match(f.Request.URL)` at `~:768` and `~:895` — **no per-stream linkage to the request that was injected** (root of Risk R1). + - `swapOAuthTokens` `:1059`, `persistAddonOAuthTokens` `:1098-1162` (singleflight key `"persist:"+credName`), `connState` keyed by `ClientConn.Id` `:42` (one client conn, many h2 streams). + - `extractInjectableSecret` / `OAuthIndex.Has(credName)` gate JSON-envelope vs raw injection — **a pool name is not in `credential_meta`, so `Has(pool)` is false** unless pool indirection is applied at every consumer (Important I2). +- `internal/proxy/oauth_index.go:81-95` — `OAuthIndex.Match(url) -> (credName, ok)` strict **1:1** token_url→credential. `OAuthIndex.Has` `:114`. Two Codex accounts share `auth.openai.com`'s token URL → they collide here; response attribution **must not** use this for pooled creds (Risk R1). +- `internal/store/store.go`: `BindingRow` `:539`, `AddBinding` `:570-650`, UNIQUE `(credential, LOWER(destination))` (`migrations/000005_...sql`). `CredentialMeta{Name,CredType,TokenURL,CreatedAt}` `:1689`, `AddCredentialMeta` upsert `:1697`, `ListCredentialMeta` `:1740`. Latest migration is `000005`; new one is `000006`. +- Hot-reload: `cmd/sluice/main.go:649-716` `reloadAll` rebuilds engine, `StoreResolver`, `UpdateOAuthIndex`, env injection; triggered by SIGHUP and the **2s** `PRAGMA data_version` watcher (this 2s lag is a correctness amplifier for failover — Important I1). +- No existing failover / rotation / health / round-robin logic anywhere (grep-confirmed). + +Decision (confirmed with user): **auto-failover on 429/401**. Manual `pool +rotate` is an operator override, not the primary mechanism. + +## Development Approach + +- **Testing approach**: Regular (code first, then tests). Tests are enumerated as explicit per-task checklist items, not deferred to phase exits. +- gofumpt before every commit (`feedback_gofumpt`); PR to `main`, never direct push (`feedback_pr_workflow`). +- Phased — each phase independently shippable and testable. Do not start a phase before the previous one's tests pass. +- This is a `feat`; new migration `000006` (next in sequence). No major bump implied (no CLI/schema break to existing surface). + +## Testing Strategy + +- **Unit**: pool CRUD + member ordering; active-member selection (healthy / in-cooldown / all-down degrade); **pool→active-member resolution at the single chokepoint**; **OAuth refresh attribution by injected real refresh token, including the two-members-one-token-URL collision (must land on the correct member)**; **fail-closed when the member tag is absent (skip persist, log, no guess)**; **phantom access token byte-identical across a member switch** (synthetic-JWT path); failover classification (429 / 403+`insufficient_quota` / 401 / `invalid_grant` / 200 / 5xx-noop); cooldown expiry. +- **Unit**: migration up/down; CLI parse + error paths; pool/credential namespace mutual-exclusion. +- **E2e** (`e2e` tag): two fake OAuth upstreams behind one pool; assert (a) A used until it 429s, (b) sluice switches to B for the *next* request, (c) B's refreshed tokens land in B's vault entry not A's, (d) the phantom access token the agent receives is byte-identical before and after failover (test upstream must return real **JWT** tokens so this assertion is meaningful). + +## Phases + +### Phase 0 — Data model + CLI (no runtime behavior change) + +1. **Migration** `internal/store/migrations/000006_credential_pools.up.sql` (+`.down.sql`): + - `credential_pools(name TEXT PRIMARY KEY, strategy TEXT NOT NULL DEFAULT 'failover' CHECK(strategy IN ('failover')), created_at TEXT)`. + - `credential_pool_members(pool TEXT, credential TEXT, position INTEGER NOT NULL, PRIMARY KEY(pool,credential), FOREIGN KEY(pool) REFERENCES credential_pools(name) ON DELETE CASCADE)`. + - `credential_health(credential TEXT PRIMARY KEY, status TEXT NOT NULL DEFAULT 'healthy' CHECK(status IN ('healthy','cooldown')), cooldown_until TEXT, last_failure_reason TEXT, updated_at TEXT)`. +2. **Store API** `internal/store/store.go`: `CreatePool`, `AddPoolMember`, `ListPools`, `GetPool` (members ordered by `position`), `RemovePool`; `SetCredentialHealth`, `GetCredentialHealth`, `ListCredentialHealth`. App-layer CHECK: a member must be an existing `oauth` cred with non-empty `token_url`; reject `static`. + - **Orphan-member cleanup**: `cred remove ` of a pooled member must either cascade-remove the member row or mark it missing. Decision: `cred remove` errors if the credential is a live pool member ("remove it from pool `

` first"); document in CLI help. (No silent dangling rows.) +3. **CLI** `cmd/sluice/cred.go` new `pool` subtree: `pool create --members a,b[,c] [--strategy failover]`, `pool list`, `pool status ` (member order + health + active), `pool rotate ` (manual override), `pool remove `. +4. **Namespace**: pool names and credential names share one namespace. `pool create` rejects a name that collides with an existing credential; `cred add` rejects a name colliding with an existing pool. Bind a pool via `sluice binding add --destination ` (pool name stored verbatim in `bindings.credential`). + +Phase 0 exit: pools definable/inspectable; `reloadAll` loads pool + health tables into a new in-memory `PoolResolver` (atomic-pointer-swapped, parallel to `StoreResolver`), but injection does not consult it. + +### Phase 1 — Phantom indirection (pool phantom → active member) + +Active member changes only via `pool rotate` in this phase. + +1. **Single chokepoint for pool→member expansion** `internal/vault/pool.go` (new): + - `PoolResolver.IsPool(name) bool`; `ResolveActive(name) (member string, ok bool)` — if `name` is a pool, first member whose health is `healthy` or whose `cooldown_until <= now`, in `position` order; if all in cooldown, return the soonest-recovering member and log a WARNING. If `name` is a plain credential, return it unchanged. + - **Mandatory task: enumerate and route every `binding.Credential` / `OAuthIndex.Has` / `extractInjectableSecret` / `findAdder`/persist consumer through `ResolveActive` at one chokepoint** (grep `binding.Credential`, `\.Has(`, `extractInjectableSecret`). Do **not** scatter `IsPool` checks across pass-1/pass-2 only — that was the original gap. +2. **Injection** `internal/proxy/addon.go`: pass-1 header and pass-2 phantom swap call the chokepoint so the *real* value injected is the active member's, while the agent's pool-scoped phantom string is what gets matched/replaced. +3. **Per-request member tag — precise join key** (resolves Risk R1): + - When pass-2 swaps the agent's `SLUICE_PHANTOM:.refresh` to a real refresh token in an outbound token-endpoint request, sluice **records `realRefreshToken → member`** in a short-TTL map (the refresh token value is sluice's own injected bytes, unique per member, and is the field actually present in an RFC-6749 refresh-grant body — *not* the access token, which a refresh POST need not carry). `connState` keyed by `ClientConn.Id` is insufficient (one client conn multiplexes both members' h2 streams), so the join key is the real refresh-token value, not the connection. + - On the token-endpoint **response**, the handler recovers `member` from that map by the real refresh token sluice sent in the matching request. Persist refreshed tokens to *that member* (`persistAddonOAuthTokens(member,...)`, singleflight `"persist:"+member`). + - **Fail-closed (mandatory enumerated task + unit test):** if the member cannot be recovered, do **not** guess and do **not** fall back to `OAuthIndex.Match` for pooled token URLs — log a WARNING and skip the vault write so the next refresh retries. Dedicated unit test: two members, same token URL, assert a B-refresh never overwrites A's vault entry, and a missing tag results in zero writes. +4. **Pool-stable phantom** (resolves Risk R3): for pooled OAuth creds, `oauthPhantomAccess`/`resignJWT` produce a JWT from a deterministic synthetic payload keyed on the **pool name** (not the member's real token), so it is byte-identical across member switches. Enumerated unit test asserts byte-identity across a switch. Document the static-form fallback and the reason it is not the default. +5. `cmd/sluice/main.go:reloadAll` builds & swaps `PoolResolver` + health snapshot alongside the existing swaps. + +Phase 1 exit: `pool rotate` flips the backing account; agent's phantom unchanged byte-for-byte; refreshes attributed correctly; fail-closed proven by test. + +### Phase 2 — Auto-failover on 429 / 401 + +1. **Failure classification** in `SluiceAddon.Response` for pooled destinations: + - `429`, or `403` with body error `insufficient_quota`/quota-exhaustion → rate-limited. + - `401`, or token-endpoint body `invalid_grant`/`invalid_token` → auth-failure. + - `5xx` and everything else → no-op (upstream-side; failing over would thrash both accounts — documented choice). +2. **Prompt failover (resolves Important I1):** on classification, update the in-memory `PoolResolver` health **synchronously before the response returns** (atomic-pointer swap or dedicated mutex on the health map — call out the locking discipline), so the *very next* request injects the new active member. Also write `SetCredentialHealth(member, 'cooldown', now+ttl, reason)` to the store for durability; the 2s data-version watcher then merely reconciles. Do **not** rely on the 2s watcher for the active-member change — that lag was an error amplifier. + - Cooldown TTLs as named consts in `internal/vault/pool.go`: rate-limit 60s, auth-fail 300s (a broken refresh token will not self-heal quickly). Lazy recovery: `ResolveActive` treats expired cooldown as eligible — no scheduler. +3. **Audit**: emit `cred_failover` with `Reason = ":->:<429|403|401|invalid_grant>"`. +4. **Telegram notify** (best-effort, non-blocking, never blocks injection): one-line "pool `` failed over ``→`` ()". +5. **No in-flight retry** of the triggering request in Phase 2 (it returns its error; the next request uses the new member). Transparent retry is out of scope (needs body buffering; unsafe for non-idempotent calls). + +Phase 2 exit: e2e proves A 429 → next request uses B → B's refresh persists to B → phantom byte-unchanged. + +## Out of scope / future work + +Transparent in-flight retry; round-robin/weighted (`strategy` reserved, `failover` only); active health probes / half-open; multi-agent pools with independent active pointers. + +## Risks / decisions + +- **R1 (critical, resolved in Phase 1.3):** refresh-token mis-attribution corrupts both accounts (rotating refresh tokens are single-use; filing B's new token under A invalidates A and bricks B's old one). Join key is the real **refresh** token sluice injected, not the access token, not the client connection. Fail-closed, never guess. Dedicated collision unit test mandatory. +- **R3 (critical, resolved in Phase 1.4):** `resignJWT` is per-real-token, so the agent's phantom would change on every cross-member refresh — the headline guarantee. Resolved via pool-keyed synthetic JWT; byte-identity unit test mandatory; static-form fallback documented. +- **I1 (important, resolved in Phase 2.2):** the 2s data-version watcher must not gate the active-member switch — synchronous in-memory health update on `Response`, store write only reconciles. +- **I2 (important, resolved in Phase 1.1):** all `binding.Credential`/`OAuthIndex.Has`/`extractInjectableSecret`/`findAdder` consumers routed through one `ResolveActive` chokepoint, not just the two injection passes. +- Namespace collision resolved by mutual-exclusion at create time (Phase 0.4). Orphan pool members resolved by blocking `cred remove` of a live member (Phase 0.2). +- Alternative rejected for this use case: scheduled `sluice cred update` rotation — cannot react to a 429 in real time and races the async OAuth vault writer.