|
| 1 | +# Configuration reference |
| 2 | + |
| 3 | +Companion to the [README](../README.md). The README lists the |
| 4 | +required environment variables and the production-posture |
| 5 | +guarantees; this doc is the full reference for every knob, grouped |
| 6 | +by what they affect. |
| 7 | + |
| 8 | +All configuration is via environment variables. Defaults match the |
| 9 | +secure production posture (`PROD_MODE=true`); settings described |
| 10 | +below as rejected by `PROD_MODE` fail the startup validator unless |
| 11 | +`PROD_MODE=false` is set explicitly. |
| 12 | + |
| 13 | +## Required (no defaults) |
| 14 | + |
| 15 | +| Variable | Description | |
| 16 | +|---|---| |
| 17 | +| `OIDC_ISSUER_URL` | OIDC issuer auto-discovered via `/.well-known/openid-configuration`. | |
| 18 | +| `OIDC_CLIENT_ID` | Client registered on the IdP. | |
| 19 | +| `OIDC_CLIENT_SECRET` | IdP client secret. | |
| 20 | +| `PROXY_BASE_URL` | Public URL of this proxy. Audience-bound into every sealed token — two deployments accidentally sharing `TOKEN_SIGNING_SECRET` but differing on `PROXY_BASE_URL` cannot replay each other's tokens. | |
| 21 | +| `UPSTREAM_MCP_URL` | Upstream MCP URL with explicit path, e.g. `http://mcp:8000/api/v1/mcp`. The path is the proxy's mount AND forwarded verbatim to the upstream. Origin-only (`http://backend`), lone-`/`, query, fragment, userinfo, and paths colliding with control-plane routes (`/healthz`, `/register`, `/authorize`, `/callback`, `/token`, `/.well-known`) are rejected at startup. | |
| 22 | +| `TOKEN_SIGNING_SECRET` | ≥ 32 bytes AES-GCM key. Byte-identical across replicas. `PROD_MODE=true` additionally rejects secrets with < 16 distinct bytes (patterned values like `aaaa…` or `0123…`). Generate with `openssl rand -hex 32` or `manifests/scripts/generate-signing-secret.sh`. | |
| 23 | + |
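The `TOKEN_SIGNING_SECRET` rules above (minimum length, plus a distinct-byte floor under `PROD_MODE=true`) can be pictured with a small check. This is a minimal sketch, not the proxy's validator: `validateSigningSecret` is a hypothetical name, and it assumes the check runs on whatever byte slice the proxy treats as the key (the document does not say how the environment string is decoded).

```go
package main

import (
	"errors"
	"fmt"
)

// validateSigningSecret is a hypothetical helper mirroring the documented
// rules: at least 32 bytes, and in prod mode at least 16 distinct byte values.
func validateSigningSecret(secret []byte, prodMode bool) error {
	if len(secret) < 32 {
		return errors.New("secret shorter than 32 bytes")
	}
	if !prodMode {
		return nil
	}
	// Count distinct byte values with a fixed-size lookup table.
	var seen [256]bool
	distinct := 0
	for _, b := range secret {
		if !seen[b] {
			seen[b] = true
			distinct++
		}
	}
	if distinct < 16 {
		return fmt.Errorf("secret has only %d distinct bytes, want at least 16", distinct)
	}
	return nil
}

func main() {
	// A patterned value: long enough, but a single repeated byte.
	patterned := []byte("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa")
	fmt.Println(validateSigningSecret(patterned, true) != nil) // prints "true"
}
```

A patterned value passes the length check but fails the distinct-byte check, which is why the length rule alone is not treated as sufficient in production.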
| 24 | +## Listeners and logging |
| 25 | + |
| 26 | +| Variable | Default | Description | |
| 27 | +|---|---|---| |
| 28 | +| `LISTEN_ADDR` | `:8080` | Public bind address. | |
| 29 | +| `METRICS_ADDR` | `127.0.0.1:9090` | Prometheus + readiness bind address (separate listener). Loopback-only by default so `/metrics` and `/readyz` are never exposed on the public interface. Override (`:9090` / explicit interface) when a scraper must reach the pod. | |
| 30 | +| `LOG_LEVEL` | `info` | `debug` / `info` / `warn` / `error`. | |
| 31 | +| `SHUTDOWN_TIMEOUT` | `120s` | Graceful shutdown deadline. Must be ≥ longest expected SSE stream so rolling deploys don't chop streams mid-flight. Capped at 15 minutes — a longer value keeps a stuck pod lingering past the K8s `terminationGracePeriodSeconds` sweet spot, masking upstream bugs behind an apparently healthy rollout. | |
| 32 | + |
| 33 | +## Identity and authorization |
| 34 | + |
| 35 | +| Variable | Default | Description | |
| 36 | +|---|---|---| |
| 37 | +| `GROUPS_CLAIM` | `groups` | Flat claim name in the IdP id_token holding user groups. | |
| 38 | +| `ALLOWED_GROUPS` | (empty) | Comma-separated allowlist; empty = allow all authenticated users. | |
| 39 | +| `MCP_RESOURCE_NAME` | (empty) | Human-readable name advertised under `resource_name` in the RFC 9728 PRM (e.g. `"ACME MCP"`). Used by MCP clients for display / consent UI. Optional; field is omitted when unset. | |
| 40 | +| `UPSTREAM_AUTHORIZATION_HEADER` | (empty) | When set, sent verbatim as the `Authorization` header on every request to the upstream MCP backend. Full header value incl. scheme, e.g. `Bearer xyz`. Treat as a secret. | |
| 41 | + |
| 42 | +## Token signing and rotation |
| 43 | + |
| 44 | +| Variable | Default | Description | |
| 45 | +|---|---|---| |
| 46 | +| `TOKEN_SIGNING_SECRETS_PREVIOUS` | (empty) | Whitespace-separated retired signing secrets accepted on Open during a rolling rotation. New seals always use the primary `TOKEN_SIGNING_SECRET`; Open tries primary first, then each previous. See [`runbooks/key-rotation.md`](./runbooks/key-rotation.md). | |
| 47 | +| `REVOKE_BEFORE` | (empty) | RFC3339 timestamp. Bulk revocation cutoff: tokens with `iat` before this are rejected. Applies to access AND refresh tokens. | |
| 48 | +| `CLIENT_REGISTRATION_TTL` | `168h` (7d) | Lifetime of a sealed `client_id` minted by `POST /register`. Default matches the 7-day refresh-token TTL so a client holding a still-valid refresh can always exchange it; a shorter value silently kills long-running MCP clients (which treat DCR as one-shot at startup) the moment their access token first expires. Go duration syntax (`168h`, `720h`, …); capped at 90d. **Rolling-deploy note:** the TTL is sealed into each `client_id` at registration time, so bumping this env var only affects newly-issued client_ids — existing registrations stay on whatever TTL was in effect when they were minted. See [`runbooks/client-registration-expired.md`](./runbooks/client-registration-expired.md). | |
| 49 | + |
| 50 | +## Replay store (Redis) |
| 51 | + |
| 52 | +| Variable | Default | Description | |
| 53 | +|---|---|---| |
| 54 | +| `REDIS_URL` | (empty) | Enables single-use authz codes + refresh-rotation reuse detection + single-use consent / callback-state tokens. `rediss://` for TLS. See [`redis-production.md`](./redis-production.md). | |
| 55 | +| `REDIS_REQUIRED` | `true` | Fail startup when `REDIS_URL` is unset. Set `false` only for dev / single-replica; stateless mode leaves codes / refresh tokens replayable within their TTL. Rejected by `PROD_MODE`. | |
| 56 | +| `REDIS_KEY_PREFIX` | `mcp-auth-proxy:` | Key prefix for shared Redis. Set to empty to opt out of namespacing. | |
| 57 | +| `REFRESH_RACE_GRACE_SEC` | `2` | Grace window in seconds during which a refresh-rotation collision is treated as a benign concurrent submit (parallel-tab refresh, slow-network double-submit) and returns 429 `refresh_concurrent_submit` without revoking the family. Outside the window every collision still revokes. Range `[0, 10]`; `0` disables. The 10s ceiling is a security cap — wider windows are statistically attacker-shaped. | |
| 58 | +| `IDP_EXCHANGE_RATE_PER_SEC` | (disabled) | Cap on outbound proxy → IdP token-endpoint requests at `/callback`. Defense in depth: a flood of `/callback` hits that slips past the per-IP limiter (distributed sources, permissive XFF trust matrix) is bounded by this token bucket before reaching the IdP. Denied requests get 503 `temporarily_unavailable` + `error_code=idp_exchange_throttled` + `Retry-After: 1`. Set to a positive number (e.g. `20`) to enable. **Per-replica scope:** an `N`-replica deployment admits up to `N × IDP_EXCHANGE_RATE_PER_SEC` to the IdP — divide your IdP-side ceiling by replica count. | |
| 59 | +| `IDP_EXCHANGE_BURST` | `50` | Burst size for the IdP-exchange limiter when `IDP_EXCHANGE_RATE_PER_SEC > 0`. Higher burst absorbs a short spike (e.g. a deploy-time reconnect storm) without 503s; lower burst keeps the ceiling tighter. Ignored when `IDP_EXCHANGE_RATE_PER_SEC` is unset/zero. | |
| 60 | + |
| 61 | +## Rate limiting and proxy headers |
| 62 | + |
| 63 | +| Variable | Default | Description | |
| 64 | +|---|---|---| |
| 65 | +| `RATE_LIMIT_ENABLED` | `true` | Per-IP rate limiting on pre-auth endpoints and on the authenticated MCP route. Disable only behind a WAF that already enforces it. | |
| 66 | +| `TRUSTED_PROXY_CIDRS` | (empty) | Comma-separated CIDRs of peers whose forwarding header (default `X-Forwarded-For`) is walked right-to-left for rate-limit keying. The first hop NOT in the trusted set is the bucket key; everything left of it (typically appended by the client) is ignored. Other peers fall back to RemoteAddr. **Preferred over the legacy `TRUST_PROXY_HEADERS` bool.** | |
| 67 | +| `TRUSTED_PROXY_HEADER` | `X-Forwarded-For` | Pin which forwarding header carries the hop list. Allowlist: `X-Forwarded-For`, `X-Real-IP`, `True-Client-IP`. Pin `X-Real-IP` / `True-Client-IP` only when the trusted ingress is known to OVERWRITE (not append) that header; otherwise a client behind a passthrough ingress can send a different forged value on every request and land in a fresh rate-limit bucket each time, so the limit never applies. | |
| 68 | +| `TRUST_PROXY_HEADERS` | `false` | **Legacy.** Blanket trust of every peer's forwarded headers. Superseded by `TRUSTED_PROXY_CIDRS` when both are set; rejected entirely under `PROD_MODE=true` without `TRUSTED_PROXY_CIDRS` because the bucket key becomes attacker-spoofable. | |
| 69 | + |
| 70 | +**Per-replica scope:** the rate limiter is in-process. An `N`-replica |
| 71 | +deployment admits up to `N × <per-endpoint rate>` per IP; size your |
| 72 | +upstream WAF or external limiter accordingly. The optional outbound |
| 73 | +`IDP_EXCHANGE_RATE_PER_SEC` bucket has the same per-replica scope — |
| 74 | +divide your IdP-side ceiling by replica count when sizing it. |
| 75 | + |
| 76 | +## Resource management |
| 77 | + |
| 78 | +| Variable | Default | Description | |
| 79 | +|---|---|---| |
| 80 | +| `MCP_PER_SUBJECT_CONCURRENCY` | `16` | Per-subject in-flight cap on the authenticated MCP route. Excess requests get 503 `temporarily_unavailable` + `Retry-After: 1`. Idle subjects (no in-flight work for ≥5 min) are reclaimed by a background pruner. `0` disables. | |
| 81 | + |
| 82 | +## Production posture toggles |
| 83 | + |
| 84 | +`PROD_MODE=true` rejects unsafe combinations at startup. The |
| 85 | +relaxation toggles below are offered for dev / legacy-client paths; |
| 86 | +flipping any of them in production silently weakens a security |
| 87 | +control. |
| 88 | + |
| 89 | +| Variable | Default | Description | |
| 90 | +|---|---|---| |
| 91 | +| `PROD_MODE` | `true` | Fails startup if any compatibility flag that weakens a security control is set (`PKCE_REQUIRED=false`, `COMPAT_ALLOW_STATELESS=true`, `REDIS_REQUIRED=false`, `REDIS_URL` empty, `OIDC_ALLOW_INSECURE_HTTP=true`, or legacy `TRUST_PROXY_HEADERS=true` without `TRUSTED_PROXY_CIDRS`). Set `false` explicitly only for dev / single-replica work that needs one of the relaxation toggles. | |
| 92 | +| `PKCE_REQUIRED` | `true` | Set `false` for legacy clients that omit PKCE (Cursor, MCP Inspector, ChatGPT). Rejected by `PROD_MODE`. | |
| 93 | +| `COMPAT_ALLOW_STATELESS` | `false` | Synthesize a server-side `state` on `/authorize` when the client omits it. Strict mode refuses the request; counter `mcp_auth_access_denied_total{reason="state_missing"}` fires either way. Rejected by `PROD_MODE`. | |
| 94 | +| `RENDER_CONSENT_PAGE` | `true` | Render an explicit proxy-side consent page on `/authorize` so the user sees who's asking and where they'll be redirected before the IdP login. Closes the silent-token-issuance path where a malicious DCR client + an active IdP session = tokens issued without any user interaction. Plain HTML, no JavaScript. Set `false` to fall back to the legacy silent-redirect — only when every caller is non-interactive and known-trusted. | |
| 95 | +| `OIDC_ALLOW_INSECURE_HTTP` | `false` | Dev-only escape hatch for cleartext `http://` OIDC issuers (Docker Compose Keycloak demo). Rejected when `PROD_MODE=true`. | |
| 96 | + |
| 97 | +## Logging and observability |
| 98 | + |
| 99 | +| Variable | Default | Description | |
| 100 | +|---|---|---| |
| 101 | +| `MCP_LOG_BODY_MAX` | `65536` | Max bytes buffered per request for JSON-RPC method extraction into access logs. `0` disables buffering (no `rpc_method` / `rpc_tool` / `rpc_id` fields). Raise for large batches; lower or zero when tool names must stay out of logs. | |
| 102 | +| `ACCESS_LOG_SKIP_RE` | (empty) | **Go [RE2](https://pkg.go.dev/regexp/syntax) regexp** matched against `r.URL.Path` on the **public listener only**. Matching paths suppress the access-log line; handler response, Prometheus counters, and panic recovery are unaffected. Invalid pattern fails startup. RE2 is linear-time — no ReDoS. Typical: `^/healthz$` (liveness probe noise). **Always anchor with `^…$`** unless intentionally substring-matching: `healthz` (no anchors) also matches `/mcp/healthz-tool`. | |
| 103 | +| `MCP_TOOL_METRICS` | `false` | Emit per-tool Prometheus counters (`mcp_auth_rpc_calls_total{tool}`, etc.) on JSON-RPC `tools/call` requests. Disabled by default — the `tool` label increases series cardinality and reveals workflow patterns. | |
| 104 | +| `MCP_TOOL_METRICS_MAX_CARDINALITY` | `256` | Cap on distinct `tool` label values. Names past the cap collapse into `_overflow`; unparsed names land in `_unknown`. `0` disables the cap (only safe when the upstream enforces a tool allowlist). Only meaningful when `MCP_TOOL_METRICS=true`. | |
| 105 | + |
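The anchoring advice for `ACCESS_LOG_SKIP_RE` is easy to check with Go's standard `regexp` package, which implements the RE2 syntax named above. A minimal sketch; `skips` is a hypothetical helper and says nothing about how the proxy wires the pattern into its logger.

```go
package main

import (
	"fmt"
	"regexp"
)

// skips reports whether a path would match the given skip pattern.
// Hypothetical helper for illustration only.
func skips(pattern, path string) bool {
	return regexp.MustCompile(pattern).MatchString(path)
}

func main() {
	for _, path := range []string{"/healthz", "/mcp/healthz-tool"} {
		fmt.Printf("%-18s anchored=%v unanchored=%v\n",
			path, skips(`^/healthz$`, path), skips(`healthz`, path))
	}
}
```

The unanchored pattern matches both paths, so access-log lines for `/mcp/healthz-tool` would be suppressed along with the probe noise.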
| 106 | +--- |
| 107 | + |
| 108 | +## Observability |
| 109 | + |
| 110 | +Every series the proxy emits, with the alerting playbook for the |
| 111 | +ones operators most often want to wire up. |
| 112 | + |
| 113 | +### Token funnel |
| 114 | + |
| 115 | +- `mcp_auth_authorize_initiated_total{path}` — validated `/authorize` |
| 116 | + requests entering the consent (`path="consent"`) or silent-redirect |
| 117 | + (`path="silent"`) fork. Closes the GET-side of the funnel. |
| 118 | +- `mcp_auth_consent_decisions_total{decision}` — `approved` / |
| 119 | + `denied` clicks on the proxy-rendered consent page. Distinct from |
| 120 | + `access_denied_total`: a user clicking Deny is a normal interaction. |
| 121 | +- `mcp_auth_tokens_issued_total{grant_type}` — access tokens minted, |
| 122 | + by `authorization_code` / `refresh_token`. |
| 123 | +- `mcp_auth_clients_registered_total` — RFC 7591 registrations. |
| 124 | + |
| 125 | +PromQL recipes: |
| 126 | + |
| 127 | +```promql |
| 128 | +# Fraction of /authorize traffic taking the consent fork (sum() on both sides so the label sets match): |
| 129 | +sum(mcp_auth_authorize_initiated_total{path="consent"}) |
| 130 | + / sum(mcp_auth_authorize_initiated_total) |
| 131 | + |
| 132 | +# Consent abandonment (started but didn't click approve OR deny): |
| 133 | +1 - sum(mcp_auth_consent_decisions_total) |
| 134 | + / sum(mcp_auth_authorize_initiated_total{path="consent"}) |
| 135 | + |
| 136 | +# Approve-but-no-token (consent flow that died at the IdP or callback): |
| 137 | +sum(mcp_auth_consent_decisions_total{decision="approved"}) |
| 138 | + - sum(mcp_auth_tokens_issued_total{grant_type="authorization_code"}) |
| 139 | +``` |
| 140 | + |
| 141 | +### Denials |
| 142 | + |
| 143 | +- `mcp_auth_access_denied_total{reason}` — buckets: |
| 144 | + - `group` / `group_invalid` — user not in `ALLOWED_GROUPS`, or |
| 145 | + group name contained header-smuggling chars. |
| 146 | + - `email_unverified` — `email_verified=false` from the IdP. |
| 147 | + - `subject_missing` / `subject_concurrency_exceeded`. |
| 148 | + - `invalid_token` — forged / malformed / signature / AAD failures |
| 149 | + (**attack signal**). |
| 150 | + - `token_expired` — benign aging (separate bucket from |
| 151 | + `invalid_token` so the latter is unambiguously the attack channel). |
| 152 | + - `audience_mismatch` / `resource_mismatch`. |
| 153 | + - `token_revoked_iat_cutoff` — `REVOKE_BEFORE` rejection. |
| 154 | + - `id_token_verification_failed` — IdP signature / nonce / claim |
| 155 | + parse. |
| 156 | + - `replay_store_unavailable` — Redis down (fail-closed). |
| 157 | + - `state_missing` — `/authorize` without state in strict mode. |
| 158 | + - `refresh_family_revoked` / `refresh_concurrent_submit` — |
| 159 | + refresh-rotation outcomes. |
| 160 | +- `mcp_auth_replay_detected_total{kind}` — `code` / `refresh` / |
| 161 | + `consent` / `callback_state` replays caught by the Redis-backed |
| 162 | + store. |
| 163 | +- `mcp_auth_groups_claim_shape_mismatch_total` — id_token `groups` |
| 164 | + claim failed to decode as `[]string`. **No denial occurs** — user |
| 165 | + is admitted with empty groups; the counter surfaces an IdP schema |
| 166 | + regression before it cascades into a `group` denial spike. |
| 167 | + |
| 168 | +### Throttling |
| 169 | + |
| 170 | +- `mcp_auth_rate_limited_total{endpoint}` — httprate 429s by |
| 171 | + endpoint (`register` / `authorize` / `consent` / `callback` / |
| 172 | + `token` / `mcp` / `discovery`). |
| 173 | +- `mcp_auth_idp_exchange_throttled_total` — outbound proxy → IdP |
| 174 | + token-endpoint exchanges denied by the rate-limit bucket |
| 175 | + (`IDP_EXCHANGE_RATE_PER_SEC`). A spike under steady inbound |
| 176 | + traffic usually means a distributed flood is slipping past the |
| 177 | + per-IP limiter, or the IdP is slow enough that the bucket fills |
| 178 | + faster than it drains. |
| 179 | + |
| 180 | +### Crypto bookkeeping |
| 181 | + |
| 182 | +- `mcp_auth_token_seals_total{purpose}` — successful AES-GCM seal |
| 183 | + operations, by purpose (`client` / `session` / `code` / `access` / |
| 184 | + `refresh`). Aggregate across replicas to track cumulative seals |
| 185 | + per signing key. |
| 186 | + |
| 187 | + Alert before nonce-collision matters: |
| 188 | + |
| 189 | + ```promql |
| 190 | + sum(increase(mcp_auth_token_seals_total[7d])) > 2^28 |
| 191 | + ``` |
| 192 | + |
| 193 | + At `2^28` seals/key, rotate `TOKEN_SIGNING_SECRET` via |
| 194 | + `TOKEN_SIGNING_SECRETS_PREVIOUS` (see |
| 195 | + [`runbooks/key-rotation.md`](./runbooks/key-rotation.md)). The |
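| 196 | + AES-GCM 96-bit nonce is random, so the practical wall is the |
| 197 | + `2^32` seals-per-key limit that NIST SP 800-38D sets for random nonces, which keeps the nonce-collision probability at or below `2^-32`. |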
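To make the margin between the `2^28` alert threshold and the `2^32` wall concrete, here is a rough worked estimate. It assumes nonces are drawn uniformly at random from the 96-bit space and uses the standard birthday approximation for the probability of at least one repeated nonce after $n$ seals under one key:

$$
p(n) \approx \frac{n(n-1)}{2 \cdot 2^{96}} \approx \frac{n^{2}}{2^{97}}
$$

$$
p\left(2^{28}\right) \approx \frac{2^{56}}{2^{97}} = 2^{-41},
\qquad
p\left(2^{32}\right) \approx \frac{2^{64}}{2^{97}} = 2^{-33}
$$

Under these assumptions, rotating at `2^28` seals leaves roughly eight bits of headroom in collision probability relative to the `2^32` ceiling.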
| 198 | + |
| 199 | +### MCP RPC traffic (opt-in) |
| 200 | + |
| 201 | +When `MCP_TOOL_METRICS=true`, additional counters fire only on |
| 202 | +JSON-RPC `tools/call` requests — protocol-level methods |
| 203 | +(`initialize`, `notifications/*`, `tools/list`, `prompts/*`) do not |
| 204 | +contribute, so an alert on `_unknown` flags malformed `tools/call` |
| 205 | +payloads, not background chatter. |
| 206 | + |
| 207 | +- `mcp_auth_rpc_calls_total{tool}` |
| 208 | +- `mcp_auth_rpc_calls_failed_total{tool}` — status ≥ 400. |
| 209 | +- `mcp_auth_rpc_request_bytes_total{tool}` / |
| 210 | + `mcp_auth_rpc_response_bytes_total{tool}` — single-call only; |
| 211 | + per-call attribution is honest only outside batches. |
| 212 | +- `mcp_auth_rpc_batches_total`, |
| 213 | + `mcp_auth_rpc_batches_failed_total`, |
| 214 | + `mcp_auth_rpc_batch_bytes_total{direction}` — batch-shape |
| 215 | + counters, disjoint from the per-tool family above so the `tool` |
| 216 | + label stays clean. |
| 217 | + |
| 218 | +JSON-RPC batches fan out into one `rpc_calls_total` increment per |
| 219 | +`tools/call` entry, each carrying its own tool label. Cap distinct |
| 220 | +labels via `MCP_TOOL_METRICS_MAX_CARDINALITY` (default 256). |
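The cap behaviour (`_overflow` past the limit, `_unknown` for unparsed names, `0` disabling the cap) can be sketched as below. This is a minimal illustration, not the proxy's implementation: `toolLabeler` and its methods are hypothetical, an unparsed name is modelled as an empty string, and the sketch is not safe for concurrent use (a real request path would need a mutex or similar).

```go
package main

import "fmt"

// toolLabeler is a hypothetical sketch of a label-cardinality cap.
type toolLabeler struct {
	limit int                 // 0 disables the cap
	seen  map[string]struct{} // tool names already admitted as labels
}

func newToolLabeler(limit int) *toolLabeler {
	return &toolLabeler{limit: limit, seen: make(map[string]struct{})}
}

// label returns the Prometheus label value to use for a tool name.
func (l *toolLabeler) label(tool string) string {
	if tool == "" {
		return "_unknown" // assumption: empty string stands for "unparsed"
	}
	if _, ok := l.seen[tool]; ok {
		return tool // already admitted, keeps its own series
	}
	if l.limit > 0 && len(l.seen) >= l.limit {
		return "_overflow" // cap reached: collapse new names
	}
	l.seen[tool] = struct{}{}
	return tool
}

func main() {
	l := newToolLabeler(2)
	for _, tool := range []string{"search", "fetch", "search", "delete", ""} {
		fmt.Println(l.label(tool))
	}
}
```

With a cap of 2, the first two distinct names keep their own label, a repeat of an admitted name still resolves to itself, and the third distinct name collapses into `_overflow`.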
| 221 | + |
| 222 | +### Health probes |
| 223 | + |
| 224 | +- `GET /healthz` (public listener) — liveness; 200 while the process |
| 225 | + is up. |
| 226 | +- `GET /readyz` (metrics port only) — readiness. Reflects Redis |
| 227 | + reachability when `REDIS_URL` is set, cached ~1s to resist |
| 228 | + probe-flood amplification. |