diff --git a/rfcs/THV-0072-vmcp-multi-upstream-partial-auth.md b/rfcs/THV-0072-vmcp-multi-upstream-partial-auth.md new file mode 100644 index 0000000..467338d --- /dev/null +++ b/rfcs/THV-0072-vmcp-multi-upstream-partial-auth.md @@ -0,0 +1,1020 @@ +# RFC-0072: Graceful Degradation for Multi-Upstream OAuth in vMCP + +- **Status**: Draft +- **Author(s)**: @tgrunnagle +- **Created**: 2026-05-13 +- **Last Updated**: 2026-05-13 +- **Target Repository**: toolhive +- **Related Issues**: [toolhive#5162](https://github.com/stacklok/toolhive/issues/5162) + +## Summary + +When a VirtualMCPServer's embedded authorization server fronts multiple upstream +Identity Providers, today's chain is all-or-nothing: any one failure invalidates +every collected upstream token and locks the user out of every backend, even +backends that depend only on the upstreams that already succeeded. This RFC +introduces opt-in partial-completion semantics, controlled by a single +`optional` flag (default `false`) on each upstream provider. Marking any +provider `optional: true` opts the deployment into partial completion for +that provider; leaving every provider at the default reproduces today's +all-or-nothing behavior exactly. Backends whose required upstream is missing +are filtered out of the vMCP's exposed catalog and refused at request time; +re-authentication is restart-all, with full identity-consistency checks +preserved. + +## Problem Statement + +The embedded auth server walks `EmbeddedAuthServerConfig.UpstreamProviders` in +order. After each upstream callback, the handler calls `nextMissingUpstream` +(pkg/authserver/server/handlers/handler.go:118-133) +to redirect to the next provider without a stored token, and only issues an +authorization code when every upstream has one +(pkg/authserver/server/handlers/callback.go:368-478). +If any upstream returns an error or the user declines a consent screen, +`handleUpstreamError` +(pkg/authserver/server/handlers/callback.go:325-362) +calls `DeleteUpstreamTokens` and the client receives a generic +`access_denied` with the hint `"upstream authentication failed"`. + +- **Current behavior**: one upstream IdP outage or one declined consent screen + blocks every backend on a vMCP — even backends with no dependency on the + failed upstream. +- **Affected users**: operators running vMCPs that aggregate backends across + multiple SaaS IdPs (e.g. `github` + `slack` + `google`). End users see "the + whole thing is down" when only one provider has a problem. +- **Worth solving**: vMCP's value proposition is composition across + heterogeneous backends; the auth layer should fail in proportion to what is + actually broken, not collapse the whole surface. + +The system already knows which backends bind to which upstream — the strategy +configs name them directly via `UpstreamInjectConfig.ProviderName`, +`TokenExchangeConfig.SubjectProviderName`, and +`AwsStsConfig.SubjectProviderName` +(pkg/vmcp/auth/types/types.go). That +information is just not consulted when deciding how to fail. + +## Goals + +- Let the chain complete when only a subset of upstreams succeed, when the + operator opts in. +- Preserve today's all-or-nothing semantics as the default so existing + deployments are unaffected. +- Keep the cross-leg identity-consistency invariant + (pkg/authserver/server/handlers/callback.go:387-410) + intact under partial completion and under re-authentication. +- Filter backends whose required upstream is missing out of the vMCP's + exposed tool/resource/prompt catalog, with a defense-in-depth refusal at + dispatch time. +- Define a single, well-specified path for recovering filtered backends: + client-driven `/authorize` restart-all. +- Handle RFC 6749 §4.1.2.1 upstream errors (`access_denied`, + `invalid_request`, `server_error`, `temporarily_unavailable`, …) cleanly: + optional upstreams are skipped, required upstreams fail the chain. + +## Non-Goals + +- **Per-upstream retry.** This RFC does not add a "retry just provider X" + path. Recovery is restart-all. See *Alternatives Considered*. +- **MCP-protocol-level signaling of filtered state.** The vMCP server does + not advertise to clients that *additional* tools would exist if more + upstreams were authorized; capability listings simply omit them. The + protocol has no field for "tool exists but is unauthorized" and we will + not add one. +- **Silent backend dropping on refresh-token expiry.** When a previously + satisfied upstream's refresh token expires, the affected backend's next + call surfaces an auth error that the client converts into a fresh + `/authorize` flow. We do not quietly remove the backend mid-session. +- **Dynamic addition of upstreams at runtime.** `UpstreamProviders` is still + reconciled from the CRD. +- **Changing the chain-walk order or making it parallel.** Sequential + walk is preserved; only the completion predicate and the failure path + change. + +## Proposed Solution + +### High-Level Design + +Three concrete changes, all driven by a new per-upstream `optional` flag +(default `false`): + +1. **Required vs. optional upstreams.** `UpstreamProviderConfig` gains an + `optional` bool (default `false`). An upstream with `optional: false` is + required; one with `optional: true` may be skipped. The provider named + by `authzConfig.inline.primaryUpstreamProvider` (explicit or + auto-selected) is *always* required — admission rejects a configuration + that sets `optional: true` on the primary, and runtime treats the + primary as required regardless of the field. Existing manifests that + never set `optional` reproduce today's behavior exactly: every upstream + defaults to required, so the chain is all-or-nothing. +2. **Skip-and-continue on optional-upstream failure.** When an upstream + marked `optional: true` returns an RFC 6749 §4.1.2.1 error (including + user-declined `access_denied`), the chain records the upstream as + *skipped for this session* and advances to the next missing upstream + instead of tearing the whole flow down. The auth code is issued when + every upstream has been attempted and every required upstream has a + token. +3. **Skip unauthorized backends at session init, fail-closed.** vMCP + aggregates capabilities once per MCP session, at `initialize` + time, and freezes the result on the session value + (pkg/vmcp/session/factory.go:420-506); + `tools/list`, `resources/list`, `resources/templates/list`, and + `prompts/list` later replay that cached snapshot. The filter + therefore runs *before* per-backend initialization, in the + pre-init backend filter loop of `makeBaseSession` + (pkg/vmcp/session/factory.go:427-435). + Each backend's required upstream is read off its `AuthConfig + *authtypes.BackendAuthStrategy` + (pkg/vmcp/types.go:346-349) + and checked against the per-session upstream-token set from + storage — looked up by the `tsid` claim on the incoming bearer + token, exactly as the existing TokenValidator middleware does at + pkg/auth/token.go:1184-1200. Backends + whose required upstream is missing are dropped from the slice + before any `initOneBackend` goroutine runs: no HTTP connection, + no handshake, no capability listing, no entry into the routing + table or the per-session SDK tool store. Upstream tokens + themselves never ride in the bearer token. No new dispatcher + gate is added: if a stored upstream token disappears from + storage between `initialize` and a later call (refresh-token + expiry, IdP admin revoke), the existing backend auth strategies + already fail closed — `UpstreamInjectStrategy`, + `TokenExchangeStrategy`, and `AwsStsStrategy` all return + `ErrUpstreamTokenNotFound` when the configured provider is + absent from `identity.UpstreamTokens` + (pkg/vmcp/auth/strategies/upstream_inject.go:80-83, + tokenexchange.go:124-127, aws_sts.go:136-139), + which surfaces as the existing unauthorized error to the client. + Recovery against a degraded session is whole-session: the client + re-runs `/authorize` for a new JWT and reconnects, which produces + a fresh `initialize` and a re-aggregated capability snapshot. + +Recovery is "restart-all" from the *user's* perspective — they re-walk +the chain end-to-end — but there is no special server-side restart +branch. Every `/authorize` already mints a fresh `SessionID` via +`rand.Text()` +(pkg/authserver/server/handlers/authorize.go:73-93), +so re-running `/authorize` simply creates a brand-new session and walks +the chain from leg one against that new session. Prior sessions (with +real tokens, skip rows, or a mix) sit untouched in storage and expire +on their normal session-end paths; nothing in the server needs to +reach across the two sessions. The identity-consistency check applies +across legs *within* a single session's walk — its job is to catch a +divergent upstream identity mid-chain, not to bridge two `/authorize` +invocations. + +```mermaid +flowchart TD + A[Client → /authorize] --> D[Mint fresh SessionID
start chain at leg 1] + D --> E[Leg N callback] + E --> F{Upstream error?} + F -- no --> G{More upstreams
unattempted?} + F -- "yes, required
(optional=false)" --> H[Fail entire chain
delete all rows
existing handleUpstreamError] + F -- "yes, optional
(optional=true)" --> I[Write skip row
continue chain] + I --> G + G -- yes --> J[nextUnattemptedUpstream] + G -- no --> N{All required
have tokens?} + J --> L[Redirect to next upstream] + L --> E + N -- yes --> K[Identity consistency check] + N -- no --> H + K -- ok --> M[Issue authorization code] + K -- mismatch --> H +``` + +### Detailed Design + +#### Component Changes + +##### Operator: CRD additions + +One new field on an existing type in +cmd/thv-operator/api/v1beta1/mcpexternalauthconfig_types.go: + +```go +// UpstreamProviderConfig gains an Optional flag. +type UpstreamProviderConfig struct { + // ... existing fields (Name, Type, OIDCConfig, OAuth2Config) ... + + // Optional marks this upstream as eligible to be skipped during chain + // completion. When false (the default), the upstream is required: any + // failure tears the chain down with today's all-or-nothing semantics. + // When true, an RFC 6749 §4.1.2.1 error from the upstream (or a user + // declining consent) is recorded as a per-session skip and the chain + // continues; backends that depend on this upstream are filtered from + // the vMCP catalog for that session. + // + // The provider named by spec.incomingAuth.authzConfig.inline.primaryUpstreamProvider + // (or, when unset, the first listed upstream) is always required + // regardless of this field; the admission webhook rejects a + // configuration that sets the primary upstream's Optional to true. + // + // Leaving Optional unset on every upstream reproduces today's + // all-or-nothing behavior exactly — existing manifests need no change. + // + // +kubebuilder:default=false + // +optional + Optional bool `json:"optional,omitempty"` +} +``` + +Validation extension to `MCPExternalAuthConfig` and `VirtualMCPServer` +admission webhooks: + +- Reject configurations where the primary upstream (resolved via + `AuthzConfigRef.ExplicitPrimaryUpstreamProvider()` — + cmd/thv-operator/api/v1beta1/mcpserver_types.go:657-668, + falling back to `upstreamProviders[0].Name`) has `Optional: true`. +- No advisory status condition is needed; rejection at admission is the + right level — degrading silently here would let a misconfigured policy + authorize against missing claims. + +A new validation rule on `MCPExternalAuthConfig.validateUpstreamProvider` +(cmd/thv-operator/api/v1beta1/mcpexternalauthconfig_types.go:1124) +is not required — `Optional` is a leaf bool with no cross-field semantics +at the per-provider level. Cross-field validation lives on the parent. + +##### Auth server: chain completion logic + +Three changes in +pkg/authserver/server/handlers/: + +1. **`nextMissingUpstream` → `nextUnattemptedUpstream`** + (pkg/authserver/server/handlers/handler.go:118-133). + A single `GetAllUpstreamTokens(sessionID)` call returns every row the + session has — real tokens and skip rows alike — and the helper walks + the configured `UpstreamProviders` in order, returning the first + provider with no row at all. Chain completion is a two-part predicate + evaluated after each successful leg: `nextUnattemptedUpstream` returns + empty *and* every required upstream (i.e. every upstream with + `Optional == false`) has a stored, non-skipped token. Optional + upstreams are still walked so the user has a chance to authorize + them; only after they have been attempted (and either succeeded or + been recorded as skipped) does the chain consider itself complete. + When every upstream is required — the default for unchanged manifests + — the predicate reduces to "every upstream has a non-skipped token", + identical to today's `nextMissingUpstream` behavior, and the storage + call shape is the same single bulk read. + +2. **Skipped-upstream record reuses the existing row.** + `storage.UpstreamTokens` + (pkg/authserver/storage/types.go:55-97) + gains two fields — `Skipped bool` and `SkippedReason string` — and a + "skip" is just an existing-shape row written via the existing + `StoreUpstreamTokens` API with `Skipped: true`, `SkippedReason` set + to the RFC 6749 error code (or `chain_failed`), empty token material, + and the same `UserID` / `UpstreamSubject` / `ClientID` binding fields + the row would have carried on success. No new storage-interface + methods are needed, the in-memory and Redis backends store the two + new fields as part of the row's normal serialization, and the + required-upstream-failure path is unchanged: + `handleUpstreamError`'s existing call to `DeleteUpstreamTokens` + already wipes every row for the session, skip rows included. + + Read-path call sites are updated to filter skipped rows out of any + map intended for token use: + + - `InProcessService.GetAllValidTokens` + (pkg/auth/upstreamtoken/service.go:83-125) + skips rows where `Skipped == true` when building its + `map[string]string` of provider → access token. The hydration + downstream of this — `TokenValidator.loadUpstreamTokens` / + `Middleware` populating `identity.UpstreamTokens` + (pkg/auth/token.go:1184-1248) + — therefore never sees a skip entry, so the inject, token-exchange, + and AWS STS strategies (and Cedar's primary-upstream binding) work + unchanged. + - `InProcessService.GetValidToken` (and the singular + `UpstreamTokenStorage.GetUpstreamTokens` path it sits over) treats a + skip row as `ErrNotFound` for callers that want token material, + matching the existing "row absent" contract. + - The chain handler is the one deliberate consumer of skip state: it + calls `GetAllUpstreamTokens` directly and inspects `.Skipped` on the + returned rows. + +3. **`handleUpstreamError` per-upstream branch** + (pkg/authserver/server/handlers/callback.go:325-362). + Look up the current upstream (carried on the pending authorization via + `PendingAuthorization.UpstreamProviderName` — + pkg/authserver/storage/types.go:397-453) + and decide: + + - If the upstream has `Optional == false` (including the primary): + preserve today's behavior — `DeleteUpstreamTokens`, write generic + `access_denied` to the client. + - If the upstream has `Optional == true`: write a skip row via + `StoreUpstreamTokens` with `Skipped: true`, `SkippedReason` set to + the upstream's RFC 6749 error code (or `chain_failed` for + transport-level failures), and the session's existing identity + binding fields populated; then invoke `continueChainOrComplete` + exactly as the success path does. The user is redirected onward to + the next upstream or, if none remains, sees a successful + authorization at the client. + + The handler is the single place that classifies §4.1.2.1 error codes; + we do not propagate distinct codes to the downstream client (the + existing generic `access_denied` posture is preserved for the + required-upstream path). + +4. **Identity consistency on completion** + (pkg/authserver/server/handlers/callback.go:387-410) + is unchanged in shape. It still walks every stored upstream row and + verifies the resolved subject matches; the walk learns to ignore rows + with `Skipped == true` (no token material to compare against) so the + subset under consistency check is exactly the set of required, real + upstream tokens. Skip rows still carry the session's `UserID` / + `UpstreamSubject` binding so the row is auditable on its own merits. + +##### Auth server: re-authorization is a fresh session + +`/authorize` requires no changes for this RFC. The existing handler +already generates a fresh `SessionID` for every call +(pkg/authserver/server/handlers/authorize.go:73-93) +and does not correlate the incoming request to any prior session — the +client supplies its own `client_id`, `state`, and PKCE challenge, none +of which name a server-side session. A user retrying after a failed or +declined upstream therefore lands in a brand-new session and walks the +full chain from leg one against rows that did not exist a moment ago. + +Whatever the prior session held — real tokens, skip rows, or both — +stays in storage at its old `SessionID` key and ages out under the +existing TTL and `cleanupExpired` paths +(pkg/authserver/storage/types.go:55-97, +the existing `SessionExpiresAt` lifetime). Because the new walk never +sees the prior session, no `DeleteUpstreamTokens` step is needed at +`/authorize` entry, and there is no cross-session identity-takeover +surface to defend against: the same `client_id` coming back as a +different upstream identity simply opens a fresh session that walks +the chain afresh, exactly as if it were a first-time client. + +The identity-consistency check +(pkg/authserver/server/handlers/callback.go:387-410) +remains a *within-session* invariant — it catches a divergent upstream +identity across legs of the same walk. It is not asked to bridge two +`/authorize` invocations. + +##### vMCP: backend → upstream binding + +Add a helper in +pkg/vmcp/auth/types/types.go: + +```go +// RequiredUpstreamProvider returns the upstream provider that this backend +// auth strategy requires a valid token from, or empty if the strategy is +// independent of the embedded auth server's upstream chain (e.g. service +// account, anonymous). +func (s *BackendAuthStrategy) RequiredUpstreamProvider() string { + switch { + case s.UpstreamInject != nil: + return s.UpstreamInject.ProviderName + case s.TokenExchange != nil: + return s.TokenExchange.SubjectProviderName + case s.AwsSts != nil: + return s.AwsSts.SubjectProviderName + default: + return "" + } +} +``` + +Backends whose strategy returns empty are always available — they have no +upstream dependency. + +##### vMCP: backend filtering at session creation + +The new filter lives at exactly one point in the vMCP lifecycle: +the pre-init backend filter loop in `makeBaseSession`, which decides +which backends will have their `initOneBackend` goroutine scheduled. +There is no separate dispatcher-level re-check, because the existing +per-call upstream-token lookup already fails closed when a token is +missing — see "Out-of-band revocation between `initialize` and call" +below. + +The filter reads the per-session upstream-token set from upstream +token storage; the bearer token's `tsid` claim names the session, but +the tokens themselves are loaded from storage by the existing +`TokenValidator` middleware +(pkg/auth/token.go:1184-1248) +and hydrated onto `identity.UpstreamTokens` before any vMCP handler +runs. The filter therefore consults `IdentityFromContext(ctx).UpstreamTokens` +— a storage-backed view, not a claim-derived one — and never decodes +or trusts upstream-token state out of the JWT itself. + +1. **Skip unauthorized backends during session initialization.** + Capabilities are aggregated and frozen at `initialize` time in + `makeBaseSession` + (pkg/vmcp/session/factory.go:420-506) + and replayed thereafter; `tools/list` and friends do not + re-aggregate. Hook the filter in the pre-init backend filter loop + (pkg/vmcp/session/factory.go:427-435): + for each backend, if `b.AuthConfig.RequiredUpstreamProvider()` + (pkg/vmcp/types.go:346-349) + is non-empty and absent from `identity.UpstreamTokens`, drop the + backend before scheduling its goroutine. `initOneBackend` never + runs, no connection or handshake is attempted, the aggregator + sees a smaller backend set, and the routing table / tools / + resources / prompts / SDK per-session tool store all exclude the + backend by construction. + +2. **Out-of-band revocation between `initialize` and call.** A + stored upstream token can disappear after the snapshot is built + (refresh-token expiry, IdP admin revoke). No new dispatcher gate + is added: the `TokenValidator` middleware re-hydrates + `identity.UpstreamTokens` from storage per request, and the + existing strategies — `UpstreamInjectStrategy` + (pkg/vmcp/auth/strategies/upstream_inject.go:80-83), + `TokenExchangeStrategy` + (pkg/vmcp/auth/strategies/tokenexchange.go:124-127), + `AwsStsStrategy` + (pkg/vmcp/auth/strategies/aws_sts.go:136-139) + — already return `ErrUpstreamTokenNotFound` when their configured + provider is missing. Same gate covers a misbehaving client + fabricating a call against a filtered tool. + +Filtering is structurally invisible to the MCP protocol: the client just +sees a smaller capability set, exactly as if those backends were never +configured. We do not extend the protocol to signal degraded state. End +users discover what they cannot do by trying to use it; the failing +tool/resource error includes the missing upstream's name for self-service +re-auth. + +#### API Changes + +No external HTTP API or MCP protocol changes, and the +`UpstreamTokenStorage` interface +(pkg/authserver/storage/types.go:495-516) +is unchanged. Skip rows ride on the existing `StoreUpstreamTokens` / +`GetUpstreamTokens` / `GetAllUpstreamTokens` / `DeleteUpstreamTokens` +methods. The only persistence-shape change is two new fields on the +already-stored row: + +```go +// UpstreamTokens (pkg/authserver/storage/types.go:55-97) gains: +type UpstreamTokens struct { + // ... existing fields (ProviderID, AccessToken, RefreshToken, IDToken, + // ExpiresAt, SessionExpiresAt, UserID, UpstreamSubject, ClientID) ... + + // Skipped marks this row as a tombstone for an optional upstream that + // was attempted for the session and either returned an RFC 6749 + // §4.1.2.1 error or had its consent screen declined by the user. + // When true, AccessToken / RefreshToken / IDToken / ExpiresAt are + // zero and read paths that build a live-token view (notably + // upstreamtoken.InProcessService.GetAllValidTokens) must skip the + // row. Binding fields (UserID, UpstreamSubject, ClientID) carry the + // session's identity so the row is auditable on its own merits. + Skipped bool + + // SkippedReason records why the upstream was skipped. Sourced from + // RFC 6749 §4.1.2.1 error codes (`access_denied`, `invalid_request`, + // `server_error`, `temporarily_unavailable`, …) plus a small + // internal allowlist (`chain_failed`, `network_error`). Empty when + // Skipped is false. Never set to upstream `error_description` text. + SkippedReason string +} +``` + +In-memory and Redis backends serialise the two new fields as part of +the row's existing encoding; no new keys, no new TTL semantics. Read +call sites are updated so the skip flag is observed exactly where it +matters: + +- `InProcessService.GetAllValidTokens` + (pkg/auth/upstreamtoken/service.go:83-125) + drops rows with `Skipped == true` from the `map[string]string` it + returns to `TokenValidator.loadUpstreamTokens` — so + `identity.UpstreamTokens` (the storage-backed view consumed by every + backend auth strategy, by the vMCP filter, and by Cedar) only ever + contains real, live tokens. +- The singular `GetValidToken` path treats a skip row as not-present, + matching its existing contract. +- The chain handler queries `GetAllUpstreamTokens` directly and + inspects `.Skipped` on each row to make the + `nextUnattemptedUpstream` and "every required upstream satisfied" + decisions in a single storage call per leg. + +#### Configuration Changes + +Example `MCPExternalAuthConfig` for a vMCP fronting github, slack, and +google backends, where github carries primary identity and slack and +google are best-effort: + +```yaml +apiVersion: toolhive.stacklok.dev/v1beta1 +kind: MCPExternalAuthConfig +metadata: + name: vmcp-multi-idp +spec: + type: embedded + embeddedAuthServer: + issuer: https://vmcp.example.com + upstreamProviders: + - name: github + type: oidc + # optional: false (default) — primary identity, must always succeed + oidcConfig: { ... } + - name: slack + type: oauth2 + optional: true # slack outage filters slack-mcp only + oauth2Config: { ... } + - name: google + type: oidc + optional: true # declined google consent filters gdrive-mcp only + oidcConfig: { ... } +``` + +A manifest that leaves `optional` unset on every provider behaves +identically to today: every upstream is required, and a single failure +tears down the whole chain. + +And on the `VirtualMCPServer`: + +```yaml +spec: + incomingAuth: + authzConfig: + inline: + primaryUpstreamProvider: github # webhook enforces github.optional==false + policies: [ ... ] +``` + +#### Data Model Changes + +The only persistence-shape change is the two new fields on +`storage.UpstreamTokens` described in *API Changes* above: `Skipped +bool` and `SkippedReason string`. A skipped upstream is just an +existing-shape row at the existing `(sessionID, providerName)` key with +`Skipped == true` and empty token material; no new map, hash, key +space, or storage object is introduced. Both backends gain the two +fields in their existing serialization path (the in-memory store +already holds `*UpstreamTokens` by reference; the Redis store +serialises the struct as one value per key) and the row's existing TTL +governs its lifetime. + +Skip rows have no independent retention: they live with the rest of +the session's upstream rows and share the session's TTL, ageing out +under the existing `cleanupExpired` and `SessionExpiresAt` paths. +Retrying a previously declined upstream requires the client to re-run +`/authorize`, which mints a fresh `SessionID` and walks the full +chain again against a new, empty session — consistent with the +whole-chain recovery model used elsewhere in this RFC. + +No changes to `PendingAuthorization` +(pkg/authserver/storage/types.go:397-453). +The chain-leg correlation still uses `UpstreamProviderName` on the +pending record. + +## Security Considerations + +### Threat Model + +- **Threat A — Cross-identity token binding.** A malicious or compromised + upstream could try to bind tokens for *its* identity into a session + established under a different identity. Mitigation: the existing + identity-consistency check + (pkg/authserver/server/handlers/callback.go:387-410) + walks every stored upstream token and compares + `UpstreamTokens.UserID`/`UpstreamSubject` against the resolved subject; + this check is unchanged. Skipped upstreams contribute no token and so + cannot inject a divergent identity. +- **Threat B — Cross-session identity adoption on re-`/authorize`.** + A user (or attacker) running a second `/authorize` flow could in + principle try to adopt or otherwise reuse the session state of an + earlier flow. Mitigation: not applicable — the auth server already + mints a fresh `SessionID` on every `/authorize` + (pkg/authserver/server/handlers/authorize.go:73-93) + and never correlates an incoming request to a prior session. The new + flow's `(sessionID, providerName)` rows are disjoint from any + earlier session's rows; there is no overlap to attack. The prior + session's rows age out under their existing TTL. Within the new + walk, the identity-consistency check + (pkg/authserver/server/handlers/callback.go:387-410) + continues to catch a divergent upstream identity across legs of the + *same* session, which is the property this RFC needs. +- **Threat C — Operator misconfiguration: optional primary.** An operator + could mark the authz-policy primary upstream as optional, in which case + Cedar policies would evaluate against an absent or wrong identity. + Mitigation: admission webhook rejects configurations where the + `primaryUpstreamProvider` has `optional: true`. Defense-in-depth: at + runtime, the chain treats the primary as required even if the field is + somehow `true` on a stored object. +- **Threat D — Stale-session backend invocation.** Because the + capability snapshot is frozen at session creation, a client that + enumerated tools at `initialize` could still attempt to invoke a + tool whose upstream token has since been revoked (refresh-token + expiry, IdP admin revoke) or otherwise removed from storage. + Mitigation: no new code. The existing `TokenValidator` middleware + re-hydrates `identity.UpstreamTokens` from storage on every + request, and every backend auth strategy + (`UpstreamInjectStrategy`, `TokenExchangeStrategy`, + `AwsStsStrategy`) already returns `ErrUpstreamTokenNotFound` + when its configured provider is absent from the map. The call + fails closed via the existing error path; no separate dispatcher + gate is needed. +- **Threat E — Information disclosure via filtered listings.** Filtered + listings reveal which providers a session has not authorized (by + absence). This is the same disclosure as a session that simply hasn't + authorized those providers yet — not a new leak. + +### Authentication and Authorization + +- No change to who can call `/authorize` or how tokens are validated. +- Authz policy evaluation is unchanged: Cedar still binds to claims from + the primary upstream, which is guaranteed present by the new admission + rule. +- Per-backend initialization is the new authorization gate: a + session whose `identity.UpstreamTokens` lacks a backend's required + upstream skips that backend's `initOneBackend` entirely, so it + never enters the routing table or the tool list — there is nothing + to authorize against. Existing backend auth strategies remain the + call-time fail-closed for backends that *were* initialized but + whose token later disappeared: they already refuse to mint a + request when the configured upstream token is absent + (`ErrUpstreamTokenNotFound`). This is *stricter* than today — today + the session would already have failed authorization entirely. + +### Data Security + +- Skip rows reuse the existing `UpstreamTokens` row at the existing + `(sessionID, providerName)` key but carry no token material — the + `AccessToken`, `RefreshToken`, `IDToken`, and `ExpiresAt` fields are + zero. They retain the session's identity-binding fields (`UserID`, + `UpstreamSubject`, `ClientID`) for audit and for the runtime guard + on the consistency check. They live with the rest of the session's + rows and expire on the same session-end paths (TTL, + `SessionExpiresAt`, and the existing required-upstream-failure + `DeleteUpstreamTokens`). +- The existing token-storage interface and its on-disk encoding are + unchanged in shape; the row simply gains two scalar fields. + +### Input Validation + +- `Optional` is a leaf bool, kubebuilder-defaulted to `false`. +- Cross-field rule: at admission, verify + `upstreamProviders[primary].Optional == false`. Implemented in the + existing `MCPExternalAuthConfig` and `VirtualMCPServer` admission + paths, with a Go-level guard in the reconciler for defense-in-depth. +- `SkippedReason` values are sourced from RFC 6749 §4.1.2.1 error codes + and a small allowlist of internal reasons (`chain_failed`, + `network_error`); we do not echo arbitrary upstream + `error_description` text into the stored row. + +### Secrets Management + +- No new secrets. Existing client secrets and signing keys remain + governed by `SigningKeySecretRefs`/`HMACSecretRefs`. + +### Audit and Logging + +Emit structured logs for: + +- `upstream_skipped` — `{session_id, provider, reason}` at INFO. This is + load-bearing for debugging — when a user reports a missing backend, the + log explains which provider was skipped and why. +- `chain_completed_partial` — `{session_id, required_count, skipped_count}` + at INFO once per chain completion where at least one optional upstream + was skipped. +- The existing `/authorize` entry log already records each new + `SessionID`; no separate re-authorization event is needed since + re-`/authorize` is structurally just "another new session." +- `backend_init_skipped_for_session` — `{session_id, backend_id, missing_provider}` + at INFO when `makeBaseSession`'s pre-init filter drops a backend + because its required upstream is absent from + `identity.UpstreamTokens`. INFO is appropriate (not DEBUG) because + the event fires at most once per backend per MCP session, not per + list call — the capability snapshot is built once. + +Metrics (counters, OTEL): + +- `authserver_chain_completions_total{mode="full|partial"}` +- `authserver_upstream_skipped_total{provider,reason}` +- `vmcp_backend_init_skipped_total{backend,missing_provider}` + +Cardinality note: `vmcp_backend_init_skipped_total` carries both +`backend` and `missing_provider` labels. `missing_provider` is +bounded by the upstream count (handful) but `backend` is bounded +only by the size of the vMCP — a deployment aggregating hundreds of +backends will produce a proportionally large series count. Operators +running large vMCPs should drop the `backend` label at scrape time +via Prometheus metric_relabel rules; we document this in the +operator runbook rather than enforcing it server-side, so small-vMCP +operators retain the per-backend breakdown by default. + +### Mitigations Summary + +| Threat | Mitigation | +|--------|------------| +| Cross-identity token binding | Existing identity-consistency check, unchanged scope | +| Cross-session identity adoption | Fresh `SessionID` on every `/authorize`; no cross-session correlation | +| Optional primary misconfig | Admission webhook rejection + runtime guard | +| Stale-session invocation (snapshot post-revocation) | Existing per-request token re-hydration + strategy `ErrUpstreamTokenNotFound` — no new gate | +| Information disclosure | No new leak vs. status quo | + +## Alternatives Considered + +### Alternative 1: Retry-failed-only + +The issue described a retry-failed-only mode where the user re-authorizes +only the provider that previously failed, keeping existing tokens. + +- **Pros**: Less re-consent for the user; cheaper. +- **Cons**: Introduces a window where the session's stored tokens reflect + one identity for some legs and a freshly-bound identity for others + before the consistency check fires. Even with the check, the storage + schema must explicitly tolerate a transient mixed state, which expands + the surface for mistakes. Also requires a new endpoint and a way for + the client to discover *which* provider failed. +- **Why not chosen**: Restart-all only avoids the transient mixed-state + hazard entirely. The user-visible cost (one extra round-trip through + already-authorized providers) is small for the IdP counts we expect, + and the storage and endpoint complexity saved is meaningful. + +### Alternative 2: MCP-protocol-level signaling of filtered state + +Add a `_meta.unavailable_backends` field to `tools/list` responses, or +publish an MCP notification when backends become unavailable. + +- **Pros**: Clients could prompt the user to re-authorize specifically + for missing upstreams. +- **Cons**: Requires MCP spec extension, client cooperation, and a + protocol surface for something inherently transient. The current MCP + spec offers no extensibility point that's both server-pushable and + client-aware. +- **Why not chosen**: vMCP does not need to communicate filtered state + via MCP. Self-service re-auth via the client's normal `/authorize` + flow is sufficient. + +### Alternative 3: Implicit primary via "first upstream wins" + +Rather than reading +`authzConfig.inline.primaryUpstreamProvider`, treat +`upstreamProviders[0]` as primary unconditionally. + +- **Pros**: Less config. +- **Cons**: Reorders sensitive to operator mistakes (someone shuffles the + YAML list, silently switches the primary). The + `primaryUpstreamProvider` field already exists for exactly this + resolution and the operator has already chosen to make it explicit + (cmd/thv-operator/api/v1beta1/mcpserver_types.go:752-767). +- **Why not chosen**: Reuse the established mechanism. When + `primaryUpstreamProvider` is unset, *then* fall back to first-listed + (today's auto-selection — already covered by + `ConditionReasonAuthzUpstreamAutoSelected`). + +### Alternative 4: Implicitly-promote optional primaries instead of rejecting + +Instead of admission rejection when +`upstreamProviders[primary].Optional == true`, silently treat the +primary as required at runtime. + +- **Pros**: Configs always reconcile. +- **Cons**: Hides operator intent. Operator might have meant to demote + the primary; either they get an admission error and reconsider, or + they ship a config that silently behaves differently than written. +- **Why not chosen**: Loud failure at admission is the right level. The + runtime guard remains as defense-in-depth but is not the primary + signal. + +## Compatibility + +### Backward Compatibility + +- `Optional` defaults to `false`, so an unchanged manifest treats every + upstream as required and reproduces today's all-or-nothing chain + exactly. +- The `UpstreamTokenStorage` interface is unchanged. The new `Skipped` + / `SkippedReason` fields on `UpstreamTokens` deserialise to their + zero values from rows written before the upgrade, so existing + in-memory and Redis data continues to load and existing tests and + callers continue to compile. +- Existing single-upstream deployments (MCPServer, MCPRemoteProxy) are + unaffected — there is no second upstream to mark optional, so the + partial-auth path is unreachable. + +### Forward Compatibility + +- If more expressive quorum semantics are ever needed (e.g. "any N of + these M optional upstreams must succeed"), they can be layered on by + adding a new sibling field — the per-provider `optional` bool stays + meaningful and existing manifests keep working. +- The `Skipped` / `SkippedReason` field shape on `UpstreamTokens` is + internal to the auth server; storage backends can evolve the encoding + freely. +- The vMCP filter runs in the pre-init backend filter loop of + `makeBaseSession`, which is upstream of every capability-aggregation + path. Future protocol versions or new capability lists slot in + without reworking the filter, provided they continue to flow + through per-backend initialization in the same factory pass. + +## Implementation Plan + +The runtime layer is built first and exercised end-to-end via programmatic +configuration; the CRD schema is then bolted on as a translation layer in a +later phase. This lets the chain-logic and filtering work land and be +tested without churning the CRD surface while the design settles. + +### Phase 1: Auth server chain logic + +- Extend the auth server runtime config (`Handler` constructor and + `NamedUpstream` struct in `pkg/authserver/server/handlers/`) with a + per-upstream `Optional` flag (default `false`). The field is populated + programmatically for now; CRD wiring lands in Phase 3. When every + upstream's `Optional` is `false` (i.e. unchanged manifests), the + partial-auth code paths reduce to today's all-or-nothing behavior. +- Extend `storage.UpstreamTokens` with `Skipped bool` and `SkippedReason + string`; update the in-memory and Redis encodings to round-trip the + two fields. The `UpstreamTokenStorage` interface itself is unchanged. +- Filter skip rows out of + `upstreamtoken.InProcessService.GetAllValidTokens` and the singular + `GetValidToken` so `identity.UpstreamTokens` and existing strategy + consumers behave exactly as today when no rows are skipped. +- Implement `nextUnattemptedUpstream` on top of a single + `GetAllUpstreamTokens(sessionID)` call, retiring `nextMissingUpstream` + (or wrapping it; either way the original is gone after this phase). +- Extend `handleUpstreamError` to branch on per-upstream `Optional`, + writing a `Skipped: true` row via the existing `StoreUpstreamTokens` + when the upstream is optional. +- Tests: golden cases for partial-success, all-required-fail, + optional-fail, identity-consistency under partial completion; round- + trip tests for the new fields on both storage backends; assertion + that `identity.UpstreamTokens` never contains skipped providers. + Re-authorization pinning tests (no new code, but lock in the + fresh-session behavior this RFC relies on): + - Two back-to-back `/authorize` calls from the same client produce + distinct `SessionID`s in `PendingAuthorization`. + - The second call's chain walk reads zero pre-existing upstream + rows under its new `SessionID`, even when the first call left + real tokens or skip rows behind under the old `SessionID`. + - Within-session identity consistency still fires when an + attacker-controlled second leg binds a different upstream + subject than the first leg of the *same* session. + +### Phase 2: vMCP backend filtering + +- Add `RequiredUpstreamProvider()` helper on `BackendAuthStrategy`. +- In `makeBaseSession` + (pkg/vmcp/session/factory.go:427-435), + extend the existing pre-init backend filter loop. For each + `b *vmcp.Backend`, after the nil-skip and before scheduling its + `initOneBackend` goroutine, check + `b.AuthConfig.RequiredUpstreamProvider()`; if non-empty and not + present in `identity.UpstreamTokens`, drop the backend from + `filtered` and emit an INFO log naming the backend and missing + provider. Because the goroutine for that backend never runs, no + HTTP connection, handshake, or capability listing is attempted — + the aggregator sees a strictly smaller backend set, and the + routing table, advertised tool set, resources, prompts, and + per-session SDK tool registration all exclude the backend by + construction. No post-aggregation filter is needed and no + list-handler wrapper is added. +- No dispatcher-level changes are needed. Out-of-band revocation + between session creation and call is already caught by the existing + per-request `identity.UpstreamTokens` re-hydration plus each backend + auth strategy's `ErrUpstreamTokenNotFound` path; this work simply + confirms (via tests) that those paths fire as expected when an + upstream is missing. +- Tests: + - Session creation under a partially-authorized identity skips + `initOneBackend` for unauthorized backends (assert via a mock + connector that no connection attempt fires for the dependent + backend) and produces a `defaultMultiSession` whose + `routingTable` / `tools` / `allTools` / `resources` / `prompts` + omit them. + - `tools/list` against that session returns the filtered set. + - A `tools/call` against a tool the client shouldn't have access + to is rejected by the existing strategy path with + `ErrUpstreamTokenNotFound`. + - Out-of-band storage delete between session creation and call is + caught at the strategy (same code path, no new gate). + +### Phase 3: CRD schema and admission + +- Add the `Optional` field to `UpstreamProviderConfig` with kubebuilder + markers and a `false` default. +- Wire the operator reconciler to translate the new field into the + auth server runtime config plumbed in Phase 1. +- Extend admission validation for the optional-primary rejection rule. +- Regenerate manifests, deepcopy, and API docs + (`task operator-generate`, `task operator-manifests`, + `task crdref-gen`). +- Tests: webhook accepts default, rejects `optional: true` on the + primary upstream, accepts `optional: true` on a non-primary upstream; + reconciler end-to-end test confirms field propagation into the + running auth server. + +### Phase 4: Documentation and observability + +- New section in operator docs covering the partial-auth model, required + vs. optional providers, and the re-auth UX. +- Add OTEL metric instrumentation per the audit/logging plan. +- Update `docs/arch/` if any auth-architecture document is impacted. + +### Dependencies + +- None on other teams; the work is contained in + `pkg/authserver/`, `pkg/vmcp/`, and `cmd/thv-operator/api/v1beta1/`. +- Documentation regeneration depends on existing `task docs`, + `task crdref-gen` tooling. + +## Testing Strategy + +- **Unit tests** (Go, `task test`): + - Chain completion predicates with mixed required/optional upstreams + (handler package). + - `handleUpstreamError` branching: required-fail vs. optional-skip + against each RFC 6749 §4.1.2.1 error code. + - Identity-consistency check under partial completion (skipped + upstream contributes no record; required ones do). + - vMCP session creation under a synthesised identity whose + `UpstreamTokens` map is missing one provider: the pre-init filter + drops the dependent backend, so the mock connector records *zero* + `initOneBackend` calls for it; the resulting + `defaultMultiSession`'s `routingTable` / `tools` / `allTools` / + `resources` / `prompts` all omit it. + - Existing strategy fail-closed under out-of-band revocation: build + a session while a provider's token is present, then delete the + row from storage and issue a `tools/call` against a backend that + depends on that provider. The existing `UpstreamInjectStrategy` + (or `TokenExchangeStrategy` / `AwsStsStrategy`) returns + `ErrUpstreamTokenNotFound`; the existing dispatcher error path + propagates that to the client. No new dispatcher gate is + exercised because none is added. + - `RequiredUpstreamProvider()` over each strategy type and the + no-binding case. +- **Integration tests**: + - End-to-end chain walk against fake OIDC/OAuth2 upstreams with one + declining consent in two configurations: (a) every upstream has + `optional: false` — entire chain fails, matching today; (b) the + declining upstream has `optional: true` — chain completes and the + dependent backend is filtered. + - Re-`/authorize` after a partial flow: the new call mints a fresh + `SessionID`, walks the full chain again, and observes no state from + the prior session (whose rows remain at the old `SessionID` key + until their TTL). + - Refresh-token expiry of one provider triggers a backend auth + error; the client re-runs `/authorize`, exchanges at `/token` for + a new JWT (new `tsid`), reconnects to vMCP, and the resulting + fresh `initialize` returns a capability snapshot that includes + the previously-filtered backend. +- **Admission webhook tests**: `optional: true` on the primary rejected; + `optional: true` on a non-primary accepted; manifests with no + `optional` field at all accepted and behave as required. +- **Manual / E2E**: + - With slack and google configured `optional: true`, user declines + consent on slack while github + google succeed; vMCP is reachable, + slack backends omitted from `tools/list`. + - After slack outage recovers, client re-runs `/authorize`, + exchanges for a new JWT, and reconnects to vMCP; the new MCP + session's `tools/list` includes the slack backend. Confirm the + *original* MCP session (still holding the old JWT) continues to + see the filtered snapshot — recovery requires a new session. + +## Documentation + +- New operator-facing guide section: "Partial upstream authorization for + multi-IdP vMCPs" (motivation, config keys, admission rules, + troubleshooting). +- CRD reference updates (auto-regenerated): `optional` field + description on `UpstreamProviderConfig`. +- Runbook update: how to identify which upstream caused a skipped + backend (log fields, metrics). +- `docs/arch/` update if the embedded auth server has an existing + architecture doc; otherwise no new arch doc is warranted. + +## Open Questions + +None. + +## References + +- [toolhive#5162](https://github.com/stacklok/toolhive/issues/5162) — source issue +- [RFC 6749 §4.1.2.1](https://datatracker.ietf.org/doc/html/rfc6749#section-4.1.2.1) — Authorization Code Grant Error Response +- pkg/authserver/server/handlers/callback.go — chain orchestration +- pkg/authserver/server/handlers/handler.go — `nextMissingUpstream` +- pkg/authserver/storage/types.go — `UpstreamTokens`, `UpstreamTokenStorage` +- pkg/auth/token.go — `loadUpstreamTokens`, `TokenValidator.Middleware` (storage hydration of `identity.UpstreamTokens` via the `tsid` claim) +- pkg/auth/upstreamtoken/service.go — `InProcessService.GetAllValidTokens`, `GetValidToken` (filter skip rows out of the live-token map) +- pkg/vmcp/auth/types/types.go — backend strategy → upstream binding +- pkg/vmcp/auth/strategies/upstream_inject.go — upstream token consumption +- pkg/vmcp/session/factory.go — `makeBaseSession` pre-init backend filter loop (lines 427-435) is the site for the per-session filter; `buildRoutingTableWithAggregator` consumes the already-filtered results +- pkg/vmcp/types.go — `vmcp.Backend.AuthConfig *authtypes.BackendAuthStrategy` (carries `RequiredUpstreamProvider()` for the filter check) +- pkg/vmcp/server/server.go — `handleSessionRegistrationImpl`, `OnRegisterSession` (where the per-session capability snapshot is registered with the SDK) +- pkg/vmcp/discovery/middleware.go — `handleInitializeRequest` / `handleSubsequentRequest` (capabilities discovered at `initialize`, cached for session lifetime) +- cmd/thv-operator/api/v1beta1/mcpexternalauthconfig_types.go — `EmbeddedAuthServerConfig`, `UpstreamProviderConfig` +- cmd/thv-operator/api/v1beta1/mcpserver_types.go — `AuthzConfigRef.ExplicitPrimaryUpstreamProvider()` + +--- + +## RFC Lifecycle + + + +### Review History + +| Date | Reviewer | Decision | Notes | +|------|----------|----------|-------| +| 2026-05-13 | — | Draft | Initial submission | + +### Implementation Tracking + +| Repository | PR | Status | +|------------|-----|--------| +| toolhive | — | Not started |