docs(security): STRIDE threat-model coverage matrix (#23)

babs · web-flow · commit c8d495d0f476 · 2026-04-28T00:20:46.000+02:00
New docs/threat-model.md enumerates 16 threats across STRIDE with
one row each linking the mitigation to code, test, and (where
applicable) runbook. Closes T1.4 from misc/next-steps.md and turns
"we thought about this" into a reviewable artifact.

Includes an honest out-of-scope section (IdP compromise, browser
XSS, MITM with TLS verify off, side-channel timing, post-quantum,
operator-bypass-of-PROD_MODE, per-mount RBAC) so future work can
pick up gaps explicitly rather than assuming coverage.

Closing section names how to keep the doc in sync during review:
when an auth/authz path changes, the matrix row's Mitigation column
must keep matching the code, or the row needs a doc update in the
same PR.

Cross-linked from README "Standards conformance" section, from
specs.md companion-docs line at the top, and from main.go's
/register rate-limit constant comment (row 1 is the load-bearing
DCR-abuse mitigation).
diff --git a/README.md b/README.md
@@ -76,7 +76,9 @@ Drop this in front, point it at your existing IdP, done.
 
 Full design notes live in [`specs.md`](./specs.md). The implementation
 claim matrix, compatibility notes, and current IdP evidence live in
-[`docs/conformance.md`](./docs/conformance.md).
+[`docs/conformance.md`](./docs/conformance.md). The STRIDE-style
+threat coverage (one row per threat → code + test + runbook) lives
+in [`docs/threat-model.md`](./docs/threat-model.md).
 
 ---
 
diff --git a/docs/threat-model.md b/docs/threat-model.md
@@ -0,0 +1,88 @@
+# Threat model
+
+Auditable mapping from identified threats to the code, tests, and
+runbooks that mitigate them. Companion to [`specs.md`](../specs.md)
+(what the proxy implements) and [`conformance.md`](conformance.md)
+(which RFCs it claims). Goal: when a reviewer asks *"what's the
+worst case here?"*, the answer is on a row in this document with a
+file path, a test name, and an operational runbook.
+
+Categories use STRIDE: **S**poofing, **T**ampering, **R**epudiation,
+**I**nformation disclosure, **D**enial of service, **E**levation of
+privilege.
+
+## Coverage matrix
+
+| # | STRIDE | Threat | Mitigation | Code | Test | Runbook |
+|---|---|---|---|---|---|---|
+| 1 | T / I | **DCR abuse** — mass client registration to exhaust quota or phish via attacker-controlled `client_name` | Per-IP rate limit on `/register` (10/min default); control-byte strip on `client_name` at registration; `mcp_auth_clients_registered_total` for trend monitoring | `handlers/register.go`, `main.go` (`registerLimit`) | `handlers/handlers_test.go` (`TestRegister_*`) | — |
+| 2 | S | **Active-IdP-session phishing** — malicious DCR client + an active IdP session = silent token issuance without user interaction | Proxy-rendered consent page (default `RENDER_CONSENT_PAGE=true`); per-render `jti` makes the consent token single-use (T1.2); `client_name` rendered via `html/template` contextual escaping; CSP locks the page (no JS, no remote subresources) | `handlers/authorize.go` (consent fork), `handlers/consent.go` | `handlers/consent_test.go`, `handlers/single_use_replay_test.go` | `docs/runbooks/consent-denials.md` |
+| 3 | T | **Redirect-URI / open-redirect abuse** — crafted `redirect_uri` matches loosely or smuggles attacker host | Exact-match against registered URIs with RFC 8252 loopback-port relaxation; fragment-bearing URIs rejected at DCR; fragment scrubbed again on the success redirect (defense in depth) | `handlers/helpers.go` (`redirectURIMatches`), `handlers/register.go`, `handlers/callback.go` (fragment scrub) | `handlers/handlers_test.go` (redirect tests) | — |
+| 4 | S | **Authorization-code replay** — captured authorization code submitted twice within its TTL | `replay.Store.ClaimOnce` on `TokenID` at `/token`, run after parameter validation but before token issuance so a malformed legitimate retry doesn't burn the code; on detected replay the family of refresh tokens seeded by that code is revoked (RFC 6749 §4.1.2 MUST) | `handlers/token.go` (`handleAuthorizationCode`), `replay/redis.go` | `handlers/handlers_test.go` (replay tests) | `docs/runbooks/redis-outage.md` |
+| 5 | S | **Refresh-token replay / family compromise** — captured refresh submitted after legit rotation | Atomic Lua `ClaimOrCheckFamily` on Redis: claim + family revoke happen as one EVAL so the invariant "alreadyClaimed ⇒ family revoked" cannot be violated by a client disconnect or Redis blip | `handlers/token.go` (`handleRefreshToken`), `replay/redis.go` | `handlers/handlers_test.go` (refresh tests), `replay/redis_test.go` | `docs/runbooks/revoke.md` |
+| 6 | S | **Consent-token replay** — captured `consent_token` POSTed twice within its 5-min TTL | `JTI` `ClaimOnce` at POST `/consent` before either Approve or Deny; per-render `JTI` so back-button = re-consent (a new claim slot) | `handlers/consent.go` | `handlers/single_use_replay_test.go` (`TestConsent_SingleUse_*`) | — |
+| 7 | S | **Callback-state replay** — captured `/callback` URL replayed (e.g. attacker-observable redirect) | `SessionID` `ClaimOnce` BEFORE the upstream OIDC token-endpoint exchange — replay never fans out to the IdP and never produces audit-log noise | `handlers/callback.go` | `handlers/single_use_replay_test.go` (`TestCallback_SingleUse_*`) | — |
+| 8 | D | **Redis outage** — replay store unreachable | Fail-closed at every claim site: 503 `server_error` + `error_code=replay_store_unavailable`. `mcp_auth_access_denied_total{reason="replay_store_unavailable"}` is the single alerting source. `PROD_MODE=true` validates `REDIS_REQUIRED=true` + `REDIS_URL` set at startup | `replay/redis.go`, `config/config.go` (`Validate`), every claim site | `config/config_test.go` (production posture) | `docs/runbooks/redis-outage.md` |
+| 9 | D | **IdP outage** — upstream OIDC unavailable | 10s context timeout on `oauth2Cfg.Exchange`; surfaces 502 `server_error`; readiness probe lives on the metrics port only so an unauthenticated public-listener flood cannot flip every replica out of the K8s Service | `handlers/callback.go` (`oidcExchangeTTL`), `main.go` (port split) | `handlers/handlers_test.go` (callback timeout tests) | `docs/runbooks/idp-outage.md` |
+| 10 | S | **XFF / proxy-header spoofing** — `X-Forwarded-For` set by an attacker to bypass per-IP rate limiting | `TRUSTED_PROXY_CIDRS` allowlist scopes which upstream hops can set forwarding headers; `TRUSTED_PROXY_HEADER` names the trusted header; `PROD_MODE=true` rejects `TRUST_PROXY_HEADERS=true` without a `TRUSTED_PROXY_CIDRS` allowlist | `config/config.go` (`TrustedProxyCIDRs`), `main.go` (`ipKeyFunc`) | `config/config_test.go` (XFF cases) | — |
+| 11 | R | **Multi-replica drift** — two proxy replicas issue inconsistent decisions on the same code, refresh, or callback state | All claim sites use a shared Redis store; `FamilyIssuedAt` + `REVOKE_BEFORE` propagate bulk-cutoff revocations across replicas without per-replica state | `replay/redis.go`, `token/` (`FamilyIssuedAt`) | `replay/redis_test.go` | `docs/runbooks/key-rotation.md` |
+| 12 | D | **Sealed-input parse exhaustion** — oversized sealed blob slows `token.Open()` under load | Per-input length cap on every sealed-token open call (PR #18); 1 MB body cap via `MaxBytesReader` on `/token`, `/consent`, `/register` | `token/seal.go`, every handler entry point | `token/seal_test.go` (`FuzzOpenJSON`, `FuzzValidate`) | — |
+| 13 | I | **`id_token` claim-shape drift** — IdP schema migration silently bypasses `ALLOWED_GROUPS` via empty-groups admit | Distinct `mcp_auth_groups_claim_shape_mismatch_total` counter and warn-level log on every shape mismatch — surfaces drift before it cascades into a `group` denial spike. Empty groups still trigger the standard `group` denial when `ALLOWED_GROUPS` is non-empty | `handlers/callback.go` | `handlers/handlers_test.go` (group-shape tests) | — |
+| 14 | I | **IdP `error_description` phishing** — crafted `/callback?error=…&error_description=<phishing text>` reflects on the proxy origin or in a legit MCP-client error UI | Fixed proxy-owned description on BOTH the validated-session redirect and the no-session JSON paths; RFC 6749 §4.1.2.1 `error` allowlist (anything else collapses to `server_error`) | `handlers/callback.go` (idpError branch) | `handlers/handlers_test.go` (`TestCallback_OIDCError_*`) | — |
+| 15 | E | **PKCE downgrade** — client without `code_challenge` at `/authorize` then supplies `code_verifier` at `/token`, papering over the missing PKCE binding | When `PKCE_REQUIRED=false` AND the client omitted `code_challenge`, the proxy mints its own server-side PKCE pair (H6) so the issued code is anchored to a verifier; `/token` rejects `code_verifier` against a code with no challenge (RFC 9700 §4.8.2) | `handlers/authorize.go`, `handlers/consent.go`, `handlers/token.go` | `handlers/handlers_test.go` (PKCE tests) | — |
+| 16 | T | **Signing-key compromise / bulk revoke** — proxy signing material exfiltrated, in-flight tokens still valid | `REVOKE_BEFORE` env var: any access or refresh token sealed before the cutoff is rejected by `middleware/auth.go` and `handlers/token.go` regardless of whether the replay store has seen it. Operational pattern: rotate the signing key, set `REVOKE_BEFORE` to the cutoff timestamp, redeploy | `token/`, `middleware/auth.go`, `handlers/token.go` | `middleware/auth_test.go` | `docs/runbooks/key-rotation.md` |
+
+## Out of scope / accepted residual risk
+
+These are documented gaps — the proxy does not claim to defend
+against them. Listed so future work can pick them up explicitly
+rather than assuming they're already covered.
+
+- **IdP compromise.** If Keycloak / Entra / Auth0 is owned, the
+  proxy still trusts what it says. The signed `id_token`
+  verification, `nonce` echo, and audience checks all assume the
+  IdP itself is honest. Out of scope for this proxy; in scope for
+  the IdP operator's own threat model.
+- **Browser-side XSS in the consent page.** The page is JS-free and
+  CSP-locked (`default-src 'none'`, `style-src 'unsafe-inline'`,
+  `script-src` defaults to none, `frame-ancestors 'none'`). A
+  browser-engine bug that escapes contextual HTML escaping is not
+  separately mitigated.
+- **Network-level MITM between proxy and IdP.** TLS verification is
+  on by default in the `oauth2` library; an operator who disables
+  it (or a CA compromise) lets a MITM observe the upstream code
+  exchange. Standard PKI assumption.
+- **Side-channel timing on AES-GCM.** Go's `crypto/aes` is constant
+  time on architectures with AES-NI (the production target). Not
+  separately tested.
+- **Quantum / cryptanalytic attacks on AES-GCM-256 or SHA-256.**
+  The proxy's sealed-token format is conventional symmetric crypto
+  (AES-GCM for authenticated encryption; SHA-256 for key derivation
+  and PKCE-S256). Not in scope.
+- **Operator misconfiguration that bypasses `PROD_MODE`.** The
+  validation rejects unsafe combinations *when `PROD_MODE=true`*.
+  An operator who runs in production with `PROD_MODE=false` and a
+  permissive config gets the legacy posture by their own choice.
+  The CI manifest gate (T1.3) blocks this for the shipped overlay
+  but cannot block a hand-rolled deployment.
+- **Per-resource-mount RBAC beyond `ALLOWED_GROUPS`.** The proxy
+  enforces a single group allowlist for the whole mount. Per-tool
+  or per-method RBAC inside the MCP server itself is the
+  responsibility of that server.
+
+## How to use this document during review
+
+When a code change touches an authentication or authorization path:
+
+1. Identify which row(s) in the matrix the change sits under.
+2. Check that the **Mitigation** column still describes what the
+   code does — if not, the row needs an update in the same PR.
+3. If the change introduces a new threat that doesn't map to an
+   existing row, add a row (with **Code** + **Test** + **Runbook**
+   columns filled in) before merge.
+4. If a row's **Code** path moves, update the link.
+5. If a row gains a test, update the **Test** column.
+
+The matrix is load-bearing: it's the artifact a security reviewer
+or auditor reads first. Drift in this document means drift in the
+project's claimed posture.
diff --git a/main.go b/main.go
@@ -298,6 +298,9 @@ func main() {
 	mcpLimit := passthrough
 	discoveryLimit := passthrough
 	if cfg.RateLimitEnabled {
+		// /register's 10/min cap is the load-bearing mitigation in
+		// docs/threat-model.md row 1 (DCR abuse). Keep this in sync
+		// with that document if the value changes.
 		registerLimit = rateLimiter(10, time.Minute, "register", ipKeyFunc)
 		authorizeLimit = rateLimiter(30, time.Minute, "authorize", ipKeyFunc)
 		// /consent has its own bucket so the user-driven approve/deny
diff --git a/specs.md b/specs.md
@@ -4,6 +4,8 @@
 
 Reverse proxy in Go acting as an OAuth 2.1 Authorization Server compatible with the MCP auth spec, federating authentication to any OIDC Identity Provider (Keycloak, Microsoft Entra ID, Auth0, Okta, Google...) via auto-discovery. It lets MCP clients (claude.ai, Claude Code) access a private MCP server through a standard PKCE flow.
 
+Companion docs: [`docs/conformance.md`](docs/conformance.md) (RFC claim matrix + IdP evidence), [`docs/threat-model.md`](docs/threat-model.md) (STRIDE coverage with code + test + runbook links).
+
 ---
 
 ## Standards Conformance