Skip to content

Commit fb5528d

Browse files
authored
feat(proxy): credential pool auto-failover + approval-prompt coalescing (#43)
* feat(channel): coalesce duplicate approval prompts by dest:port * wip(proxy): checkpoint persist-once + store work before respawn * test(proxy): assert concurrent same-target asks coalesce to one prompt * feat(store): add credential pool + health schema and store API Migration 000006 adds credential_pools, credential_pool_members and credential_health tables. Store API: CreatePoolWithMembers (atomic; namespace mutual-exclusion + oauth-member validation in one tx), GetPool, ListPools, RemovePool, PoolExists, PoolsForMember, and Set/Get/ListCredentialHealth. Covered by CRUD/ordering/validation tests and an up/down migration reversibility test. * feat(vault): add PoolResolver pool->active-member chokepoint PoolResolver is the single place a bound pool name expands to a concrete credential. IsPool/ResolveActive (first healthy or expired-cooldown member in position order; degrade to soonest-recovering when all down) and MarkCooldown for Phase 2 synchronous in-memory failover. Locking discipline documented: membership immutable per instance (atomic-pointer swap on reload), health mutated in place under an RWMutex. RateLimit (60s) and AuthFail (300s) cooldown TTL consts. Nil-safe. * feat(cli): add pool subcommands and credential/pool namespace guards sluice pool create|list|status|rotate|remove. status computes the active member via the same PoolResolver logic the proxy uses so it never disagrees with what gets injected; rotate parks the active member (lazy-recovery cooldown) so the next member takes over. cred add rejects a name colliding with an existing pool; cred remove is blocked (before the vault delete) when the credential is a live pool member, so no dangling member rows or destroyed secrets. * feat(proxy): wire PoolResolver into server, addon and reloadAll Server gains an atomic PoolResolver pointer (parallel to the binding resolver), StorePool/PoolResolverPtr, and threads it into SluiceAddon via WithPoolResolver. addon.resolvePoolMember is the chokepoint helper (non-pool names passthrough). main.go registers the pool subcommand, loads the resolver at startup, and rebuilds+atomically swaps it in reloadAll alongside the binding/oauth reloads. Injection does not consult it yet (Phase 1). * feat(mcp): opt MCP tool calls out of approval coalescing * wip(telegram): checkpoint final-count edit before respawn * fix(telegram): use coalesced-count label in resolve edit body * docs(plans): convert tasks/phases to exec checkboxes; mark completed work * feat(channel): re-confirm broker coalescing tests green * feat(store): idempotent approval-rule persist * test(proxy): confirm full suite green after coalescing merge * feat(telegram): render coalesced count on resolve and cancel edits * docs(plans): complete approval-coalescing; move to completed * test(store): confirm credential-pool Phase 0 green post-merge * feat(proxy): pool phantom indirection + R1 attribution + R3 stable JWT * feat(proxy): auto-failover on 429/401 for credential pools * docs(plans): complete credential-pool-failover; move to completed * fix(proxy): address comprehensive review findings * fix(proxy): correct token-endpoint failover member attribution + harden cooldown durability * style: satisfy golangci-lint (errorlint, QF1002, unparam) * test(e2e): pool failover + approval coalescing end-to-end * fix(proxy): address Copilot review (per-request failover attribution, JWT payload marshaling, coalesced-count accuracy, single-pool membership) * fix: address Copilot re-review (failover callback registration, cred-remove fail-closed, CLAUDE.md accuracy) * fix(proxy): split-host pool OAuth refresh attribution + protocol-aware failover lookup * fix(proxy): classify token-endpoint 403 invalid_grant as auth-failure; audit detected protocol * fix(vault): monotonic cooldown (in-memory + durable); doc pool phantom shapes * fix(proxy): fail-closed unattributed pooled refresh; broker resolve/detach race; scope pool failover fallback to pooled requests * fix(cli): reject pool removal while bindings still reference the pool * fix(store): enforce namespace + pool-member + health-row invariants at store layer; engine-aware persist fast-path * fix(proxy): pool-namespace covered-set; shared-health prune for non-members; store-gated vault delete on cred remove * fix(store): guard CAS credential-rollback for pool members + health cleanup; fix e2e coalesce deadlock-on-failure * fix(store): RemovePool health cleanup; reject live-pool-member meta downgrade; CAS-aware health delete * fix(proxy): API-host failover requires per-flow pool-usage evidence (non-consuming peek, no blind fallback) * fix(cred): validate vault before store removal; cap coalesced subscribers; CLAUDE.md attribution accuracy * fix(store): failover health write no-ops for non-pool-member (atomic; no health-row resurrection) * fix: guard pool-rotate health write; atomic REST cred removal; gate MarkCooldown to current member set * fix(proxy): free flow tag after Response; tag token-host flow only on actual pool-phantom swap * fix: RemoveCredentialFully cleans health on partial-cleanup finish; exclude 5xx from token-endpoint failover classification * fix: pool+epoch-scoped health guards; CancelAll lost-wakeup; pooled-JWT phantom swap in query/path * fix(proxy): plain-cred refresh attribution on shared token URL; pool-aware QUIC injection * fix: R3-stable QUIC pool phantom; pool+epoch in refresh attribution; atomic RemovePoolIfUnreferenced; REST cred-remove missing-secret tolerant * fix(store): binding creation requires live credential or pool; CLAUDE.md QUIC R3 accuracy * test(telegram): deterministically sync cancel-edit assertion (deflake TestCancelApprovalRendersCoalescedCount) * fix: Telegram cred_meta on all adds; QUIC R3 comments; typed 409-vs-500 for RemoveCredentialFully * fix(telegram): roll back vault secret if cred-add metadata/binding fails; completed-plan R3 doc accuracy * fix(telegram): roll back credential_meta (CAS) as well as vault on env-var binding failure
1 parent 84a6dc6 commit fb5528d

52 files changed

Lines changed: 14363 additions & 357 deletions

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

CLAUDE.md

Lines changed: 41 additions & 1 deletion
Large diffs are not rendered by default.

README.md

Lines changed: 25 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -286,6 +286,31 @@ github_pat static api.github.com
286286

287287
**Supported response formats:** Both `application/json` and `application/x-www-form-urlencoded` token responses per RFC 6749.
288288

289+
## Credential Pools
290+
291+
A credential pool lets a single phantom identity the agent sees be backed by **N real OAuth credentials**, with sluice auto-failing-over to the next member when the upstream rejects the active one. Primary use case: two OpenAI Codex OAuth accounts driven by one agent, so quota exhaustion on one account transparently rolls onto the other. The agent always holds one pool-scoped phantom pair, byte-stable across member switches: the **access** phantom is a synthetic pool-stable JWT (HS256, `sub: sluice-pool:<name>`, `iss: sluice-phantom`, far-future `exp`) that is byte-identical for a given pool regardless of which member is active, so a cross-member failover never changes the access token the agent holds; the **refresh** phantom is the static string `SLUICE_PHANTOM:<pool>.refresh`. Sluice maps the pair to the currently active member's real token at injection time and persists refreshed tokens back to the member that issued them.
292+
293+
```bash
294+
sluice pool create <name> --members credA,credB[,credC] [--strategy failover]
295+
sluice pool list
296+
sluice pool status <name>
297+
sluice pool rotate <name> # operator override: force next member
298+
sluice pool remove <name>
299+
```
300+
301+
Members are existing OAuth credentials (static credentials are rejected). Member order is the failover order.
302+
303+
**Auto-failover behavior:**
304+
305+
- HTTP 429, or 403 with a quota-exhaustion body -> the active member is rate-limited; cooled down for **60s**.
306+
- HTTP 401, or a token-endpoint body of `invalid_grant` / `invalid_token` -> the active member's token is rejected; cooled down for **300s**.
307+
- 2xx, 5xx, and any other status -> no-op (a server-side error is not evidence the account is exhausted).
308+
- The active-member switch is **synchronous**: the cooldown is recorded in memory before the response returns, so the very next request injects the next member. The durable store write only reconciles for restarts.
309+
- **No in-flight retry**: the triggering request still returns its own upstream error to the agent; the agent's own retry resolves to the freshly-activated next member.
310+
- Every failover emits a `cred_failover` audit event (`Reason = "<pool>:<from>-><to>:<tag>"`) and a best-effort Telegram notice.
311+
312+
The phantom access token is **byte-identical across a member switch** (pooled OAuth credentials use a pool-keyed synthetic JWT resign), so the agent never observes the rollover.
313+
289314
## Approval Channels
290315

291316
Sluice broadcasts "ask" verdicts to all configured approval channels. The first channel to respond wins. Other channels get a cancellation notice.

cmd/sluice/binding_test.go

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -27,6 +27,16 @@ func setupBindingDB(t *testing.T) string {
2727
if err != nil {
2828
t.Fatalf("create test DB: %v", err)
2929
}
30+
// Seed the credentials these tests bind to. AddBinding /
31+
// AddRuleAndBinding now require the referenced credential (or pool) to
32+
// exist, mirroring the real flow where "sluice cred add" creates the
33+
// credential before "sluice binding add" binds to it. Without this, the
34+
// binding-CLI tests would be exercising an impossible state.
35+
for _, c := range []string{"mycred", "cred_a", "cred_b"} {
36+
if err := db.AddCredentialMeta(c, "static", ""); err != nil {
37+
t.Fatalf("seed credential meta %q: %v", c, err)
38+
}
39+
}
3040
_ = db.Close()
3141
return dbPath
3242
}

cmd/sluice/cred.go

Lines changed: 86 additions & 53 deletions
Original file line numberDiff line numberDiff line change
@@ -227,6 +227,15 @@ func handleCredAdd(args []string) error {
227227
}
228228
defer func() { _ = db.Close() }()
229229

230+
// Namespace mutual-exclusion: a credential must not shadow a pool. Pool
231+
// and credential names share one namespace so a bound destination
232+
// resolves unambiguously to either a pool or a plain credential.
233+
if exists, perr := db.PoolExists(name); perr != nil {
234+
return fmt.Errorf("check pool name collision: %w", perr)
235+
} else if exists {
236+
return fmt.Errorf("name %q is already a credential pool; pool and credential names share one namespace", name)
237+
}
238+
230239
// Inputs validated and DB is open. Now persist the credential.
231240
vs, err := openVaultStore(*dbPath)
232241
if err != nil {
@@ -553,13 +562,85 @@ func handleCredRemove(args []string) error {
553562
}
554563
name := fs.Arg(0)
555564

565+
// Removal order (Finding 1, round-13 + Finding 3, round-9):
566+
//
567+
// 1. Open/validate the vault store FIRST (no delete yet -- just
568+
// confirm it opens). If the configured backend cannot be opened
569+
// (e.g. a non-age provider unsupported by the CLI), abort BEFORE
570+
// any metadata is removed. Doing the store removal first and then
571+
// discovering the vault is unopenable would leave credential_meta
572+
// gone while the vault secret + bindings/rules are orphaned.
573+
//
574+
// 2. Run the store-layer pool-membership gate (RemoveCredentialMeta).
575+
// This is the atomic, fail-closed guard: it refuses inside its own
576+
// transaction if the credential is still a live pool member,
577+
// closing the TOCTOU window where a separate pre-check passes and a
578+
// concurrent caller then creates a pool with this credential before
579+
// the vault secret is deleted.
580+
//
581+
// 3. Only call vs.Remove AFTER that gate succeeds. If the gate
582+
// refuses, the vault secret is left untouched and no window exists
583+
// where the secret is gone but credential_pool_members still
584+
// references it.
585+
//
586+
// (1) precedes (2) so an unopenable vault aborts before any metadata is
587+
// removed; (2) still precedes (3) so the store gate always runs before
588+
// the actual secret delete.
556589
vs, err := openVaultStore(*dbPath)
557590
if err != nil {
558591
return err
559592
}
560593

561-
// Remove from vault. If already gone (previous partial cleanup),
562-
// continue to DB cleanup so stale rules/bindings can be removed.
594+
// Only consult/mutate the DB if it already exists (do not create it as
595+
// a side effect of a removal).
596+
dbExists := false
597+
if _, statErr := os.Stat(*dbPath); statErr == nil {
598+
dbExists = true
599+
} else if !os.IsNotExist(statErr) {
600+
return fmt.Errorf("access database %q for credential removal of %q (refusing to remove; a pool member may otherwise be orphaned): %w", *dbPath, name, statErr)
601+
}
602+
603+
var db *store.Store
604+
if dbExists {
605+
var derr error
606+
db, derr = store.New(*dbPath)
607+
if derr != nil {
608+
// Fail closed: the DB exists but cannot be opened, so the
609+
// pool-membership gate cannot run. Proceeding to delete the
610+
// vault secret would orphan any credential_pool_members row
611+
// pointing at this now-missing credential -- exactly what the
612+
// gate prevents. Refuse the removal instead.
613+
return fmt.Errorf("open database %q to check pool membership for %q (refusing to remove; a pool member may otherwise be orphaned): %w", *dbPath, name, derr)
614+
}
615+
defer func() { _ = db.Close() }()
616+
617+
// GATE + atomic store cleanup (Finding 2, round-15). This MUST run
618+
// before the vault delete. RemoveCredentialFully runs the
619+
// fail-closed pool-member guard AND deletes credential_meta,
620+
// credential_health, all bindings on the credential, and all
621+
// auto-created rules in ONE transaction. If the credential is still
622+
// a live pool member (or any store delete fails) it returns an
623+
// error with NOTHING removed and the vault secret below is never
624+
// touched — no partially-deleted-credential window.
625+
metaDeleted, rmBindings, rmRules, rmErr := db.RemoveCredentialFully(name)
626+
if rmErr != nil {
627+
return fmt.Errorf("remove credential store state for %q (refusing to delete the vault secret so the credential is not partially deleted): %w", name, rmErr)
628+
}
629+
if metaDeleted {
630+
fmt.Printf("removed credential metadata for %q\n", name)
631+
}
632+
if rmRules > 0 {
633+
fmt.Printf("removed %d auto-created rule(s) for credential %q\n", rmRules, name)
634+
}
635+
if rmBindings > 0 {
636+
fmt.Printf("removed %d binding(s) for %q\n", rmBindings, name)
637+
}
638+
}
639+
640+
// Store removal already succeeded (or the DB does not exist). Now it is
641+
// safe to delete the vault secret. If already gone (previous partial
642+
// cleanup), continue to DB cleanup so stale rules/bindings can be
643+
// removed.
563644
if err := vs.Remove(name); err != nil {
564645
if !os.IsNotExist(err) {
565646
return fmt.Errorf("remove: %w", err)
@@ -569,57 +650,9 @@ func handleCredRemove(args []string) error {
569650
fmt.Printf("credential %q removed\n", name)
570651
}
571652

572-
// Clean up associated bindings and auto-created rules. Only open the
573-
// store if the DB file exists to avoid creating it as a side effect of
574-
// a credential removal.
575-
if _, statErr := os.Stat(*dbPath); statErr != nil {
576-
if !os.IsNotExist(statErr) {
577-
log.Printf("warning: cannot access database %q for cleanup: %v (stale rules/bindings may remain)", *dbPath, statErr)
578-
}
579-
return nil
580-
}
581-
582-
db, err := store.New(*dbPath)
583-
if err != nil {
584-
log.Printf("warning: could not open database %q for cleanup: %v (stale rules/bindings may remain)", *dbPath, err)
585-
return nil
586-
}
587-
defer func() { _ = db.Close() }()
588-
589-
// Remove rules tagged either by "sluice cred add --destination"
590-
// (cred-add:<name>) or by "sluice binding add" (binding-add:<name>).
591-
// Both paths may have produced rules associated with this credential,
592-
// and failing to clean up either set leaves orphan allow rules in
593-
// the store.
594-
var total int64
595-
for _, src := range []string{
596-
store.CredAddSourcePrefix + name,
597-
store.BindingAddSourcePrefix + name,
598-
} {
599-
n, rmErr := db.RemoveRulesBySource(src)
600-
if rmErr != nil {
601-
log.Printf("warning: failed to remove rules with source %q for credential %q: %v", src, name, rmErr)
602-
continue
603-
}
604-
total += n
605-
}
606-
if total > 0 {
607-
fmt.Printf("removed %d auto-created rule(s) for credential %q\n", total, name)
608-
}
609-
removed, rmBindErr := db.RemoveBindingsByCredential(name)
610-
if rmBindErr != nil {
611-
log.Printf("warning: failed to remove bindings for %q: %v", name, rmBindErr)
612-
} else if removed > 0 {
613-
fmt.Printf("removed %d binding(s) for %q\n", removed, name)
614-
}
615-
616-
// Remove credential metadata (type, token_url).
617-
metaDeleted, rmMetaErr := db.RemoveCredentialMeta(name)
618-
if rmMetaErr != nil {
619-
log.Printf("warning: failed to remove credential meta for %q: %v", name, rmMetaErr)
620-
} else if metaDeleted {
621-
fmt.Printf("removed credential metadata for %q\n", name)
622-
}
653+
// Bindings and auto-created rules were already removed atomically with
654+
// credential_meta + health by RemoveCredentialFully above, before the
655+
// vault secret was deleted. Nothing left to clean up here.
623656
return nil
624657
}
625658

0 commit comments

Comments
 (0)