|
1 | | -## Issue: MCPRemoteProxy fails to forward requests to upstream MCP servers returning 406 |
| 1 | +## Issue: MCPRemoteProxy fails to connect to upstream MCP servers |
2 | 2 |
|
3 | 3 | ### Summary |
4 | 4 |
|
5 | 5 | MCPRemoteProxy with embedded auth server (case 2: client auth, no upstream auth) successfully |
6 | | -authenticates users via Okta but fails to establish an MCP connection because the upstream |
7 | | -server rejects all proxied requests with HTTP 406 (Not Acceptable). |
| 6 | +authenticates users via Okta but fails to establish an MCP connection to the upstream server. |
8 | 7 |
|
9 | 8 | ### Environment |
10 | 9 |
|
11 | 10 | - ToolHive operator and proxyrunner from commit `3b834c81` (aron-muon/toolhive fork) |
12 | 11 | - Upstream MCP: `https://api.gov.nominal.io/docs/mcp` (Nominal documentation MCP, public, no auth required) |
| 12 | +- Proxy: `nominal-docs.mcp.staging.muonspace.com` (MCPRemoteProxy) |
13 | 13 | - Transport: `streamable-http` |
14 | 14 | - Auth: Embedded auth server with `disableUpstreamTokenInjection: true`, Okta OIDC upstream |
15 | 15 |
|
16 | | -### Root Cause (Hypothesized) |
| 16 | +### Root Cause |
17 | 17 |
|
18 | | -The upstream MCP server requires the `Accept: application/json, text/event-stream` header on |
19 | | -all POST requests (per the MCP streamable-http spec). Without it, the server returns: |
| 18 | +**Two issues combined** to prevent the proxy from reaching the upstream MCP server: |
20 | 19 |
|
21 | | -```json |
22 | | -{"jsonrpc":"2.0","id":"server-error","error":{"code":-32600,"message":"Not Acceptable: Client must accept both application/json and text/event-stream"}} |
23 | | -``` |
24 | | -HTTP 406. |
25 | | - |
26 | | -Direct curl to the upstream with the correct `Accept` header succeeds (HTTP 200 with valid |
27 | | -MCP response). Direct curl without it returns 406. |
28 | | - |
29 | | -### What We Tried |
30 | | - |
31 | | -1. **`headerForward.addPlaintextHeaders`** — configured `Accept: application/json, text/event-stream` |
32 | | - on the MCPRemoteProxy spec. The middleware is registered (`"header forward middleware |
33 | | - configured","headers":"Accept"`) and applied (`"applied middleware","name":"header-forward"`), |
34 | | - but the upstream still returns 406. |
35 | | - |
36 | | -2. **Middleware ordering verification** — the `header-forward` middleware is outermost (added |
37 | | - last in `PopulateMiddlewareConfigs`, applied in reverse order), so it should set the `Accept` |
38 | | - header before any other middleware processes the request. The Go `httputil.ReverseProxy` with |
39 | | - `Rewrite` creates `pr.Out` as a shallow clone of `pr.In`, so headers set by middleware should |
40 | | - propagate to the outbound request. |
41 | | - |
42 | | -### Observed Behavior |
43 | | - |
44 | | -- Proxy logs show `"applied middleware","name":"strip-auth"` and `"applied middleware","name":"header-forward"` — both are active. |
45 | | -- Authenticated client requests (Claude Code via Okta OAuth) pass through auth/authz successfully — audit events show `outcome: success` with user identity. |
46 | | -- All initialize requests complete in `duration_ms: 2` — far too fast for a real upstream round-trip, indicating the upstream rejects immediately. |
47 | | -- The proxy's internal health check returns `status_code: 401` because it hits the OIDC middleware without a token (separate known issue, non-blocking). |
48 | | -- After 155 attempts over 5 minutes, the proxy logs `"initialize not successful, but continuing"`. |
49 | | -- Claude Code reports `Failed to reconnect to nominal-docs-staging`. |
50 | | - |
51 | | -### Expected Behavior |
52 | | - |
53 | | -The proxy should forward the `Accept: application/json, text/event-stream` header (either from |
54 | | -the client's original request or from `headerForward` configuration) to the upstream, and the |
55 | | -upstream should respond with HTTP 200 containing a valid MCP initialize response. |
56 | | - |
57 | | ---- |
58 | | - |
59 | | -### Investigation Results |
60 | | - |
61 | | -Exhaustive code-level analysis of the entire request path was performed. Findings below. |
62 | | - |
63 | | -#### 1. Accept header DOES survive the reverse proxy (code is correct) |
64 | | - |
65 | | -Traced the full path from middleware to upstream: |
66 | | - |
67 | | -1. `pkg/transport/middleware/header_forward.go:130-134` — middleware calls |
68 | | - `r.Header.Set("Accept", "application/json, text/event-stream")` on the inbound request. |
69 | | -2. Go's `httputil.ReverseProxy.ServeHTTP` calls `outreq = req.Clone(ctx)` which calls |
70 | | - `r.Header.Clone()` — a **deep copy** of all headers, including Accept. |
71 | | -3. Before `Rewrite` is called, Go only strips `Forwarded`, `X-Forwarded-For`, |
72 | | - `X-Forwarded-Host`, `X-Forwarded-Proto` — Accept is **not** stripped. |
73 | | -4. The `Rewrite` function (`pkg/transport/proxy/transparent/transparent_proxy.go:505-545`) |
74 | | - only modifies URL (scheme, host, path), Host header, X-Forwarded-\*, query params, and |
75 | | - OTEL propagation headers — it **never touches Accept**. |
76 | | -5. After `Rewrite`, Go's `removeHopByHopHeaders` only strips `Connection`, `Keep-Alive`, |
77 | | - `Proxy-*`, `Te`, `Trailer`, `Transfer-Encoding`, `Upgrade` — **not Accept**. |
78 | | -6. `tracingTransport.RoundTrip` (`transparent_proxy.go:364-447`) only modifies `req.Host` — |
79 | | - **not Accept**. |
80 | | -7. The `WithContext` shallow copy in Rewrite shares the same Header map (pointer), so all |
81 | | - modifications are visible to the original `outreq` used by Go's ReverseProxy. |
82 | | - |
83 | | -**Conclusion: No code path strips or modifies the Accept header. The header-forward middleware |
84 | | -correctly propagates Accept through the reverse proxy to the upstream.** |
| 20 | +#### Issue 1: X-Forwarded-Host causes redirect loop (fixed in `258bd69c`) |
85 | 21 |
|
86 | | -#### 2. Nothing strips the Accept header |
| 22 | +The transparent proxy's `Rewrite` function called `pr.SetXForwarded()` for all upstreams, |
| 23 | +including remote third-party servers. This injected `X-Forwarded-Host: nominal-docs.mcp.staging.muonspace.com` |
| 24 | +(the proxy's own hostname) into outbound requests. The third-party upstream at |
| 25 | +`api.gov.nominal.io` used this header to construct a 307 redirect URL pointing back to the |
| 26 | +proxy, creating an infinite redirect loop: |
87 | 27 |
|
88 | | -The only header-stripping middleware is `stripAuthMiddleware` |
89 | | -(`pkg/runner/middleware.go:316-327`), which calls `r.Header.Del("Authorization")` — only |
90 | | -Authorization. Confirmed no other middleware or reverse proxy component touches Accept. |
91 | | - |
92 | | -#### 3. MCP parser middleware does NOT intercept requests |
93 | | - |
94 | | -`ParsingMiddleware` (`pkg/mcp/parser.go:69-105`) **always** calls `next.ServeHTTP(w, r)`. It |
95 | | -reads the body, parses JSON-RPC, stores the parsed data in request context, and passes |
96 | | -through. It never generates its own response or short-circuits the middleware chain. |
97 | | - |
98 | | -#### 4. Path rewriting works correctly |
99 | | - |
100 | | -The `Rewrite` function at `transparent_proxy.go:521-523` unconditionally replaces the outbound |
101 | | -path with `remoteBasePath` (`/docs/mcp`). Both client paths `/mcp` and `/docs/mcp` get |
102 | | -rewritten to `/docs/mcp`. The `WithContext` shallow copy shares the same `URL` pointer, so |
103 | | -modifications are visible to the original `outreq`. Tests in |
104 | | -`pkg/transport/proxy/transparent/remote_path_test.go` confirm this behavior. |
105 | | - |
106 | | -#### 5. The "2ms / 155 attempts" observations are from the RUNNER's self-check, NOT from client requests |
107 | | - |
108 | | -**This is the key finding that reframes the bug.** |
109 | | - |
110 | | -At `pkg/runner/runner.go:414`, the runner calls `waitForInitializeSuccess()` which sends |
111 | | -POST requests to **the proxy's own address** (`http://localhost:8080/docs/mcp`) at |
112 | | -`runner.go:810-820`. These self-check requests: |
113 | | - |
114 | | -- Have **no JWT token** (the runner doesn't authenticate with itself) |
115 | | -- Go through the **full middleware chain**, including the OIDC auth middleware |
116 | | -- Are immediately rejected by the auth middleware with **401 Unauthorized** (~2ms, no network) |
117 | | -- Are retried with exponential backoff for 5 minutes (155 attempts) |
118 | | -- Result in the `"initialize not successful, but continuing"` log at `runner.go:415` |
119 | | - |
120 | | -The runner's self-check (`runner.go:818-820`) **does** set the correct Accept header: |
121 | | -```go |
122 | | -req.Header.Set("Content-Type", "application/json") |
123 | | -req.Header.Set("Accept", "application/json, text/event-stream") |
124 | | -req.Header.Set("MCP-Protocol-Version", "2024-11-05") |
125 | 28 | ``` |
126 | | -But it never reaches the upstream because the auth middleware rejects it first. |
127 | | - |
128 | | -**The "2ms" timing and "155 attempts" are entirely explained by instant 401 rejections from |
129 | | -the OIDC auth middleware on the runner's unauthenticated self-check requests. They do not |
130 | | -reflect actual client-to-upstream behavior.** |
131 | | - |
132 | | -#### 6. Actual client requests (with JWTs) should work |
133 | | - |
134 | | -For real Claude Code requests with valid ToolHive JWTs, the flow is: |
| 29 | +Client → proxy (nominal-docs.mcp.staging.muonspace.com) |
| 30 | + → upstream (api.gov.nominal.io) |
| 31 | + → 307 redirect to nominal-docs.mcp.staging.muonspace.com ← loop! |
| 32 | +``` |
135 | 33 |
|
136 | | -1. Auth middleware validates JWT — **passes** (confirmed by audit: `outcome: success`) |
137 | | -2. `strip-auth` removes Authorization header — prevents JWT leaking to upstream |
138 | | -3. `header-forward` sets `Accept: application/json, text/event-stream` |
139 | | -4. Reverse proxy forwards to upstream with Accept header intact |
140 | | -5. Upstream should receive Accept and return 200 |
| 34 | +**Fix**: Skip `SetXForwarded()` when `isRemote` is true. Remote third-party upstreams don't |
| 35 | +need the proxy's internal hostname, and it causes redirect loops when the upstream constructs |
| 36 | +redirect URLs from `X-Forwarded-Host`. Local container backends still receive these headers. |
141 | 37 |
|
142 | | -**Based on code analysis, the header forwarding is correct for authenticated client requests.** |
| 38 | +#### Issue 2: Transparent proxy does not follow HTTP redirects (fixed in `31d868ae`) |
143 | 39 |
|
144 | | -### Revised Assessment |
| 40 | +After fixing the X-Forwarded-Host loop, the upstream still 307-redirected — this time a |
| 41 | +legitimate HTTPS→HTTP scheme redirect: |
145 | 42 |
|
146 | | -The original hypothesis ("the issue is in the HTTP header forwarding layer") appears to be |
147 | | -**incorrect**. The code correctly forwards the Accept header for authenticated requests. The |
148 | | -observed symptoms (2ms timing, 155 attempts, 401 status codes) are entirely explained by the |
149 | | -runner's unauthenticated self-check hitting the OIDC middleware. |
| 43 | +``` |
| 44 | +from: https://api.gov.nominal.io/docs/mcp/ |
| 45 | +location: http://api.gov.nominal.io/docs/mcp |
| 46 | +``` |
150 | 47 |
|
151 | | -The remaining failure (`Claude Code reports "Failed to reconnect"`) has a different root cause |
152 | | -than the Accept header. Likely explanations, in order of probability: |
| 48 | +Go's `http.Transport.RoundTrip` does not follow redirects (that is `http.Client`'s job), but |
| 49 | +`httputil.ReverseProxy` uses the Transport directly. The 307 response was passed through to |
| 50 | +the MCP client, which cannot follow upstream redirects through the proxy. |
153 | 51 |
|
154 | | -1. **Infrastructure between the proxy pod and the upstream** — a service mesh sidecar |
155 | | - (Istio/Envoy), egress proxy, network policy, or cloud NAT stripping or modifying the |
156 | | - Accept header before it reaches `api.gov.nominal.io`. This is the most likely cause in a |
157 | | - Kubernetes environment. |
| 52 | +**Fix**: Added `forwardFollowingRedirects` to `tracingTransport` that follows up to 10 |
| 53 | +redirects transparently. 307/308 preserve method and body (RFC 7538); 301/302/303 change to |
| 54 | +GET (RFC 7231). Each followed redirect is logged at WARN level. |
158 | 55 |
|
159 | | -2. **The runner's failed self-check delays client registration** — the 5-minute |
160 | | - `waitForInitializeSuccess` failure means the proxy never confirms upstream reachability |
161 | | - before registering with client managers (`runner.go:425-431`). If Claude Code connects |
162 | | - during or immediately after this window, it might experience session establishment issues |
163 | | - unrelated to the Accept header. |
| 56 | +### Investigation Timeline |
164 | 57 |
|
165 | | -3. **Observation conflation** — the 406 status was observed via direct curl tests (without |
166 | | - Accept header) during debugging, while the actual Claude Code failure may be a different |
167 | | - issue (OAuth flow completion, session management, or reconnection logic). |
| 58 | +#### Initial hypothesis: Accept header not reaching upstream (INCORRECT) |
168 | 59 |
|
169 | | -### Next Steps |
| 60 | +The original report assumed the upstream was returning 406 because the `Accept` header was |
| 61 | +missing. Code analysis proved the header-forward middleware correctly propagates Accept through |
| 62 | +the reverse proxy — no code path strips or modifies it. |
170 | 63 |
|
171 | | -1. **Add debug logging to `tracingTransport.RoundTrip`** at |
172 | | - `pkg/transport/proxy/transparent/transparent_proxy.go:364` to dump all outbound request |
173 | | - headers (`req.Header`). This will definitively confirm whether Accept reaches the wire |
174 | | - for authenticated client requests. |
| 64 | +#### Debug logging revealed the real status code |
175 | 65 |
|
176 | | -2. **Check for infrastructure header stripping** — inspect service mesh configs |
177 | | - (Istio/Envoy sidecar), egress policies, and network proxies in the cluster that sit |
178 | | - between the proxy pod and the upstream. |
| 66 | +Added outbound request and upstream response logging to `tracingTransport.RoundTrip`. This |
| 67 | +showed every upstream response was **307 Temporary Redirect**, not 406. The 406 observed |
| 68 | +during initial debugging was from direct curl tests without the Accept header — the proxy was |
| 69 | +never reaching the upstream because of the redirect issues. |
179 | 70 |
|
180 | | -3. **Capture actual network traffic** — use `tcpdump` or pod-level packet capture to see |
181 | | - the exact HTTP request sent to `api.gov.nominal.io` and verify the Accept header is |
182 | | - present on the wire. |
| 71 | +#### The "2ms / 155 attempts" were from the runner self-check |
183 | 72 |
|
184 | | -4. **Fix the runner self-check for OIDC-protected proxies** — `waitForInitializeSuccess` |
185 | | - should either skip when OIDC is configured, bypass the middleware chain by hitting the |
186 | | - `/health` endpoint, or send the request directly to the upstream (like the pinger does) |
187 | | - instead of through the proxy's own middleware. This is a separate bug that causes |
188 | | - misleading logs and delays proxy readiness. |
| 73 | +The `waitForInitializeSuccess()` function (`runner.go:414`) sends POST requests to the |
| 74 | +proxy's own address without a JWT. The OIDC auth middleware immediately rejects these with |
| 75 | +401 (~2ms, no network round-trip). This produced the misleading "initialize not successful" |
| 76 | +logs and fast timing that was initially attributed to the upstream rejecting requests. |
189 | 77 |
|
190 | | -### Reproduction |
| 78 | +### Fixes Applied |
191 | 79 |
|
192 | | -```bash |
193 | | -# Upstream works directly with correct Accept header: |
194 | | -curl -s -w "\n%{http_code}" -X POST https://api.gov.nominal.io/docs/mcp \ |
195 | | - -H "Content-Type: application/json" \ |
196 | | - -H "Accept: application/json, text/event-stream" \ |
197 | | - -d '{"jsonrpc":"2.0","method":"initialize","id":1,"params":{"protocolVersion":"2025-03-26","capabilities":{},"clientInfo":{"name":"test","version":"1.0"}}}' |
198 | | -# Returns 200 with valid MCP response |
| 80 | +| Commit | Description | |
| 81 | +|--------|-------------| |
| 82 | +| `c1c36ba1` | Add debug logging for outbound request headers | |
| 83 | +| `d3f45069` | Add upstream response status logging | |
| 84 | +| `a187848a` | Log upstream redirect responses with Location URL | |
| 85 | +| `258bd69c` | Strip X-Forwarded-* headers for remote upstream servers | |
| 86 | +| `31d868ae` | Follow upstream HTTP redirects in transparent proxy | |
199 | 87 |
|
200 | | -# Upstream rejects without Accept header: |
201 | | -curl -s -w "\n%{http_code}" -X POST https://api.gov.nominal.io/docs/mcp \ |
202 | | - -H "Content-Type: application/json" \ |
203 | | - -d '{"jsonrpc":"2.0","method":"initialize","id":1,"params":{"protocolVersion":"2025-03-26","capabilities":{},"clientInfo":{"name":"test","version":"1.0"}}}' |
204 | | -# Returns 406: "Not Acceptable: Client must accept both application/json and text/event-stream" |
205 | | -``` |
| 88 | +### Remaining Known Issues |
206 | 89 |
|
207 | | -### Impact |
| 90 | +1. **Runner self-check fails for OIDC-protected proxies** — `waitForInitializeSuccess` sends |
| 91 | + unauthenticated requests through the middleware chain, always getting 401. Should skip or |
| 92 | + bypass auth for OIDC-protected proxies. Non-blocking (logs warning and continues). |
208 | 93 |
|
209 | | -This blocks MCPRemoteProxy use case 2 (embedded auth server + public upstream MCP) from |
210 | | -working end-to-end. The auth changes (`disableUpstreamTokenInjection`, `strip-auth` middleware) |
211 | | -are functioning correctly. The header forwarding code is also correct. The remaining failure |
212 | | -is most likely environmental (infrastructure header stripping) or caused by the runner |
213 | | -self-check bug producing misleading diagnostics. |
| 94 | +2. **Debug logging should be removed** before merging to upstream — the outbound request and |
| 95 | + response logging in `tracingTransport.RoundTrip` was added for diagnosis and is noisy. |
0 commit comments