Skip to content

Commit a6b24e8

Browse files
committed
fix(vault): monotonic cooldown (in-memory + durable); doc pool phantom shapes
1 parent 914bc09 commit a6b24e8

6 files changed

Lines changed: 174 additions & 7 deletions

File tree

CLAUDE.md

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -214,7 +214,7 @@ Extends phantom swap to handle OAuth credentials bidirectionally. Static credent
214214

215215
### Credential pools and auto-failover
216216

217-
A **credential pool** lets one phantom identity the agent sees be backed by **N real OAuth credentials**. The agent always holds a single pool-scoped phantom pair (`SLUICE_PHANTOM:<pool>.access` / `SLUICE_PHANTOM:<pool>.refresh`); sluice maps it to the *currently active member's* real tokens at injection time and persists refreshed tokens back to the member that issued them. Primary use case: two OpenAI Codex OAuth accounts behind one agent so quota exhaustion on one account transparently rolls onto the other. Pool members must be `oauth` credentials — `static` members are rejected. `cred remove` errors on a credential that is a live pool member. **One credential belongs to at most one pool**: proxy attribution (`PoolResolver.PoolForMember`) maps a member back to a single pool, so a credential shared across pools would persist/audit a token response against the wrong pool's phantom and leave the agent with an unreplaceable phantom. `pool create` rejects a member that is already in another pool (enforced inside the same transaction as the member insert).
217+
A **credential pool** lets one phantom identity the agent sees be backed by **N real OAuth credentials**. The agent always holds a single pool-scoped phantom pair, byte-stable across member switches: the **access** phantom is a synthetic pool-stable JWT (HS256, `sub: sluice-pool:<pool>`, `iss: sluice-phantom`, fixed far-future `exp`, built by `poolStablePhantomAccess`) — byte-identical for a given pool regardless of which member is active; the **refresh** phantom is the static string `SLUICE_PHANTOM:<pool>.refresh` (from `oauthPhantomRefresh`'s request-side strip path). Sluice maps the pair to the *currently active member's* real tokens at injection time and persists refreshed tokens back to the member that issued them. Primary use case: two OpenAI Codex OAuth accounts behind one agent so quota exhaustion on one account transparently rolls onto the other. Pool members must be `oauth` credentials — `static` members are rejected. `cred remove` errors on a credential that is a live pool member. **One credential belongs to at most one pool**: proxy attribution (`PoolResolver.PoolForMember`) maps a member back to a single pool, so a credential shared across pools would persist/audit a token response against the wrong pool's phantom and leave the agent with an unreplaceable phantom. `pool create` rejects a member that is already in another pool (enforced inside the same transaction as the member insert).
218218

219219
**CLI:**
220220

@@ -235,13 +235,13 @@ Auto-failover on 429/401 is the primary mechanism; `pool rotate` is an operator
235235
- **Single chokepoint (I2):** every `binding.Credential` / `OAuthIndex.Has` / `extractInjectableSecret` / persist consumer on the HTTP/HTTPS OAuth path routes through `PoolResolver.ResolveActive` (`resolveInjectionTarget` for pass-1 header + pass-2 phantom swap; `resolveOAuthResponseAttribution` for the response/persist path). `idx.Has` is always called with the resolved member name, never the pool. Plain (non-pool) credentials pass through `ResolveActive` unchanged. SSH/mail/QUIC are non-OAuth and out of scope.
236236
- **Active-member selection:** healthy or expired-cooldown members first, by configured position; if all members are in cooldown, the soonest-recovering member is returned with a WARNING (degrade, never hard-fail). Recovery is lazy — evaluated in `ResolveActive`, no scheduler.
237237
- **R1 refresh-token attribution / fail-closed:** when pass-2 swaps `SLUICE_PHANTOM:<pool>.refresh`, sluice records `realRefreshToken → member` in a short-TTL map. On the token-endpoint response it recovers the member by that real refresh token and persists to that member (`persistAddonOAuthTokens(member, ...)`, singleflight key `"persist:"+member`). The join key is the real **refresh** token sluice injected — never the access token, the client connection, or `OAuthIndex.Match` (two pooled members share `auth.openai.com`'s token URL and collide there). If the member is unrecoverable: WARNING + skip the vault write, never guess. Rotating refresh tokens are single-use, so a mis-attributed write would brick both accounts — fail-closed is mandatory.
238-
- **R3 pool-stable phantom JWT:** Codex access tokens are JWTs and the per-real-token `resignJWT` would emit a *different* phantom after every cross-member refresh, breaking the "agent never notices" guarantee. Pooled OAuth `oauthPhantomAccess`/`resignJWT` instead build the phantom JWT from a deterministic synthetic payload keyed on the **pool name** (stable `sub`/`iss`, far-future `exp`), HMAC'd with the existing fixed key — byte-identical across member switches while still a structurally valid JWT. Static-form fallback (`SLUICE_PHANTOM:<pool>.access`) is documented for the case where the agent is verified to treat the access token as opaque.
238+
- **R3 pool-stable phantom JWT:** Codex access tokens are JWTs and the per-real-token `resignJWT` would emit a *different* phantom after every cross-member refresh, breaking the "agent never notices" guarantee. The dedicated `poolStablePhantomAccess` (in `internal/proxy/oauth_response.go`) instead builds the phantom JWT from a deterministic synthetic payload keyed on the **pool name** (`sub: sluice-pool:<pool>`, `iss: sluice-phantom`, fixed far-future `exp`, no `iat`), HMAC-SHA256'd with the existing fixed key — byte-identical across member switches while still a structurally valid JWT. The pool name is JSON-marshaled (never concatenated) so a name with quotes/control chars cannot inject claims. Static-form fallback (`SLUICE_PHANTOM:<pool>.access`) is emitted only on the unreachable `json.Marshal` failure of the fixed struct (and is documented as the equivalent for an agent verified to treat the access token as opaque). The **refresh** phantom is unaffected — it stays the static `SLUICE_PHANTOM:<pool>.refresh`.
239239

240240
**Phase 2 — auto-failover on 429 / 401:**
241241

242242
- **Classification** (`classifyFailover` in `internal/proxy/pool_failover.go`, called from `SluiceAddon.Response` for pooled destinations): `429` or `403 + insufficient_quota` → rate-limited; `401` or token-body `invalid_grant` / `invalid_token` → auth-failure; `5xx` / other → no-op. The token-endpoint body is only trusted when the request URL matched the OAuth index.
243243
- **Pool attribution for the response** (`poolForResponse`): a response is attributed to a pool either (a) when the flow's CONNECT host has a pooled binding (the API-host 429/403 path), **or** (b) when the request URL matches the OAuth token-URL index for a credential that is a pool member (the token-endpoint 401 / `invalid_grant` path). Case (b) is essential: an OAuth refresh hits the credential's token-URL host (e.g. `auth.openai.com`), which has no pool binding — only the API host (e.g. `api.openai.com`) does — so without the token-URL index match the token-endpoint classification would be dead code for the Codex deployment. `idx.Match` is strict 1:1 token_url→credential, so case (b) cools the exact member whose refresh token was injected.
244-
- **Synchronous in-memory failover (I1):** health is updated in-process *before* the response returns — `MarkCooldown` takes the resolver write lock, `ResolveActive` the read lock — so the active-member switch never waits on the 2s data-version watcher (which only reconciles). A detached `onFailover` callback also writes `SetCredentialHealth(member, 'cooldown', now+ttl, reason)` for durability. Cooldown TTLs: `vault.RateLimitCooldown` = 60s, `vault.AuthFailCooldown` = 300s. No in-flight retry — the next request uses the new member.
244+
- **Synchronous in-memory failover (I1):** health is updated in-process *before* the response returns — `MarkCooldown` takes the resolver write lock, `ResolveActive` the read lock — so the active-member switch never waits on the 2s data-version watcher (which only reconciles). A detached `onFailover` callback also writes `SetCredentialHealth(member, 'cooldown', now+ttl, reason)` for durability. Cooldown TTLs: `vault.RateLimitCooldown` = 60s, `vault.AuthFailCooldown` = 300s. **Cooldown extension is monotonic on both layers:** a member parked for an auth failure (300s) that subsequently trips a rate-limit (60s) keeps the LATER expiry — `MarkCooldown` (in-memory) and `SetCredentialHealth`'s `cooldown` upsert (durable, via a `CASE`/comparison against the stored future `cooldown_until`) both keep `max(existing-future, new)` so a known-bad credential is never made eligible early. Only the extend path is monotonic: an explicit clear (zero/past `until` in `MarkCooldown`) and any transition to `status='healthy'` still shorten/clear (recovery intact), and lazy expiry still wins over an already-expired stored cooldown. No in-flight retry — the next request uses the new member.
245245
- **Reload does not resurrect a cooled member:** because the durable `SetCredentialHealth` write is detached and best-effort, any reload (SIGHUP or the 2s data-version watcher firing on *any* unrelated DB write) rebuilds the resolver from store rows alone via `NewPoolResolver`. `Server.StorePool` therefore calls `PoolResolver.MergeLiveCooldowns(prev)` to carry forward still-active in-memory cooldowns from the resolver being replaced before the atomic swap. The merge is monotonic (a live cooldown is never shortened/erased by an unrelated reload) and drops cooldowns for credentials no longer in any pool.
246246
- **Audit:** a `cred_failover` event (Verdict `failover`, Credential = the cooled-down member) with `Reason = "<pool>:<from>-><to>:<429|403|401|invalid_grant>"`, emitted synchronously in `handlePoolFailover`.
247247
- **Telegram:** a best-effort non-blocking notice "pool <name> failed over <a> -> <b> (<reason>)" (plain text — `TelegramChannel.Notify` sends with no parse mode); the store write and every broker channel `Notify` are detached into their own goroutine so the response path never blocks.

README.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -288,7 +288,7 @@ github_pat static api.github.com
288288

289289
## Credential Pools
290290

291-
A credential pool lets a single phantom identity the agent sees be backed by **N real OAuth credentials**, with sluice auto-failing-over to the next member when the upstream rejects the active one. Primary use case: two OpenAI Codex OAuth accounts driven by one agent, so quota exhaustion on one account transparently rolls onto the other. The agent always holds one pool-scoped phantom pair (`SLUICE_PHANTOM:<pool>.access` / `.refresh`); sluice maps it to the currently active member's real token at injection time and persists refreshed tokens back to the member that issued them.
291+
A credential pool lets a single phantom identity the agent sees be backed by **N real OAuth credentials**, with sluice auto-failing-over to the next member when the upstream rejects the active one. Primary use case: two OpenAI Codex OAuth accounts driven by one agent, so quota exhaustion on one account transparently rolls onto the other. The agent always holds one pool-scoped phantom pair, byte-stable across member switches: the **access** phantom is a synthetic pool-stable JWT (HS256, `sub: sluice-pool:<name>`, `iss: sluice-phantom`, far-future `exp`) that is byte-identical for a given pool regardless of which member is active, so a cross-member failover never changes the access token the agent holds; the **refresh** phantom is the static string `SLUICE_PHANTOM:<pool>.refresh`. Sluice maps the pair to the currently active member's real token at injection time and persists refreshed tokens back to the member that issued them.
292292

293293
```bash
294294
sluice pool create <name> --members credA,credB[,credC] [--strategy failover]

internal/store/pools.go

Lines changed: 37 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -327,13 +327,47 @@ func (s *Store) SetCredentialHealth(credential, status string, cooldownUntil tim
327327
} else {
328328
cu = nil
329329
}
330+
// Monotonic extend for the durable row, mirroring MarkCooldown's
331+
// in-memory invariant. When the incoming write is a cooldown AND the
332+
// stored row already has a cooldown_until strictly in the future that
333+
// is LATER than the incoming one, keep the stored (longer) value: a
334+
// short rate-limit cooldown must never shorten a longer auth-failure
335+
// cooldown, even on the durable side, so restart durability matches
336+
// the resolver. Any transition to "healthy" (excluded.status =
337+
// 'healthy', whose cooldown_until is NULL) always overwrites, so the
338+
// recovery/heal path is intact. cooldown_until is always written as
339+
// UTC RFC3339 by this function, so the string comparison is a valid
340+
// chronological ordering; the datetime('now') guard makes an already
341+
// expired stored cooldown lose to the fresh future one (lazy expiry
342+
// preserved).
330343
_, err := s.db.Exec(
331344
`INSERT INTO credential_health (credential, status, cooldown_until, last_failure_reason, updated_at)
332345
VALUES (?, ?, ?, ?, datetime('now'))
333346
ON CONFLICT(credential) DO UPDATE SET
334-
status = excluded.status,
335-
cooldown_until = excluded.cooldown_until,
336-
last_failure_reason = excluded.last_failure_reason,
347+
cooldown_until = CASE
348+
WHEN excluded.status = 'cooldown'
349+
AND credential_health.cooldown_until IS NOT NULL
350+
AND credential_health.cooldown_until > strftime('%Y-%m-%dT%H:%M:%SZ', 'now')
351+
AND credential_health.cooldown_until > excluded.cooldown_until
352+
THEN credential_health.cooldown_until
353+
ELSE excluded.cooldown_until
354+
END,
355+
status = CASE
356+
WHEN excluded.status = 'cooldown'
357+
AND credential_health.cooldown_until IS NOT NULL
358+
AND credential_health.cooldown_until > strftime('%Y-%m-%dT%H:%M:%SZ', 'now')
359+
AND credential_health.cooldown_until > excluded.cooldown_until
360+
THEN credential_health.status
361+
ELSE excluded.status
362+
END,
363+
last_failure_reason = CASE
364+
WHEN excluded.status = 'cooldown'
365+
AND credential_health.cooldown_until IS NOT NULL
366+
AND credential_health.cooldown_until > strftime('%Y-%m-%dT%H:%M:%SZ', 'now')
367+
AND credential_health.cooldown_until > excluded.cooldown_until
368+
THEN credential_health.last_failure_reason
369+
ELSE excluded.last_failure_reason
370+
END,
337371
updated_at = excluded.updated_at`,
338372
credential, status, cu, nilIfEmpty(reason),
339373
)

internal/store/pools_test.go

Lines changed: 68 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -278,6 +278,74 @@ func TestCredentialHealthCRUD(t *testing.T) {
278278
}
279279
}
280280

281+
func TestSetCredentialHealthMonotonicCooldown(t *testing.T) {
282+
s := newTestStore(t)
283+
284+
// Seed a long auth-failure cooldown (now+300s).
285+
authUntil := time.Now().Add(300 * time.Second).UTC().Truncate(time.Second)
286+
if err := s.SetCredentialHealth("a", "cooldown", authUntil, "401 auth fail"); err != nil {
287+
t.Fatalf("seed cooldown: %v", err)
288+
}
289+
290+
// A subsequent shorter rate-limit cooldown (now+60s) must NOT shorten
291+
// the durable row — restart durability must match the resolver.
292+
rlUntil := time.Now().Add(60 * time.Second).UTC().Truncate(time.Second)
293+
if err := s.SetCredentialHealth("a", "cooldown", rlUntil, "429 rate limited"); err != nil {
294+
t.Fatalf("shorter cooldown write: %v", err)
295+
}
296+
h, _ := s.GetCredentialHealth("a")
297+
if h == nil || !h.CooldownUntil.Equal(authUntil) {
298+
t.Fatalf("after shorter write CooldownUntil = %v, want %v (NOT shortened)",
299+
cooldownOf(h), authUntil)
300+
}
301+
if h.LastFailureReason != "401 auth fail" {
302+
t.Errorf("reason = %q, want %q (longer cooldown's metadata kept)", h.LastFailureReason, "401 auth fail")
303+
}
304+
305+
// A strictly LATER cooldown does extend.
306+
laterUntil := authUntil.Add(120 * time.Second)
307+
if err := s.SetCredentialHealth("a", "cooldown", laterUntil, "429 again"); err != nil {
308+
t.Fatalf("later cooldown write: %v", err)
309+
}
310+
h, _ = s.GetCredentialHealth("a")
311+
if h == nil || !h.CooldownUntil.Equal(laterUntil) {
312+
t.Fatalf("after later write CooldownUntil = %v, want %v (extended)",
313+
cooldownOf(h), laterUntil)
314+
}
315+
316+
// Transition to healthy clears, even though a longer cooldown is active
317+
// (recovery/heal path must remain intact).
318+
if err := s.SetCredentialHealth("a", "healthy", time.Time{}, ""); err != nil {
319+
t.Fatalf("heal write: %v", err)
320+
}
321+
h, _ = s.GetCredentialHealth("a")
322+
if h == nil || h.Status != "healthy" || !h.CooldownUntil.IsZero() {
323+
t.Errorf("after heal = %+v, want healthy/zero (recovery must not be blocked by monotonicity)", h)
324+
}
325+
326+
// An already-expired stored cooldown loses to a fresh future one
327+
// (lazy expiry preserved at the durable layer too).
328+
pastUntil := time.Now().Add(-time.Hour).UTC().Truncate(time.Second)
329+
if err := s.SetCredentialHealth("b", "cooldown", pastUntil, "stale"); err != nil {
330+
t.Fatalf("seed stale cooldown: %v", err)
331+
}
332+
freshUntil := time.Now().Add(60 * time.Second).UTC().Truncate(time.Second)
333+
if err := s.SetCredentialHealth("b", "cooldown", freshUntil, "429"); err != nil {
334+
t.Fatalf("fresh cooldown write: %v", err)
335+
}
336+
h, _ = s.GetCredentialHealth("b")
337+
if h == nil || !h.CooldownUntil.Equal(freshUntil) {
338+
t.Errorf("fresh cooldown after stale = %v, want %v", cooldownOf(h), freshUntil)
339+
}
340+
}
341+
342+
func cooldownOf(h *CredentialHealth) interface{} {
343+
if h == nil {
344+
return nil
345+
}
346+
return h.CooldownUntil
347+
}
348+
281349
// TestMigration000006DownUp verifies the pool migration is reversible.
282350
func TestMigration000006DownUp(t *testing.T) {
283351
dir := t.TempDir()

internal/vault/pool.go

Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -261,6 +261,22 @@ func (pr *PoolResolver) MarkCooldown(credential string, until time.Time, reason
261261
delete(pr.health.health, credential)
262262
return
263263
}
264+
// Monotonic extend: a member parked for an auth failure (300s) that
265+
// subsequently trips a rate-limit (60s) must NOT have its cooldown
266+
// shortened — a known-bad credential would become eligible far too
267+
// early. Keep the LATER of the existing future cooldown and the new
268+
// one. This is ONLY the extend path: an explicit clear/recover (the
269+
// zero/past `until` branch above, and SetCredentialHealth "healthy"
270+
// on the durable side) still shortens/clears, and a strictly later
271+
// `until` still extends. Lazy expiry in ResolveActive/CooldownUntil
272+
// is unaffected because an expired existing cooldown is in the past
273+
// and `until.After(existing.cooldownUntil)` is true, so the fresh
274+
// future cooldown wins.
275+
if existing, ok := pr.health.health[credential]; ok &&
276+
existing.cooldownUntil.After(time.Now()) &&
277+
!until.After(existing.cooldownUntil) {
278+
return
279+
}
264280
pr.health.health[credential] = memberHealth{cooldownUntil: until, reason: reason}
265281
}
266282

0 commit comments

Comments
 (0)