Problem
When an MCPServer has `backendReplicas > 1`, the proxy runner stores the ClusterIP service URL (e.g. `http://mcp-<name>:8080`) as `backend_url` in session metadata. After a proxy runner restart, it recovers the session from Redis but routes via the ClusterIP — kube-proxy may send the request to a backend pod that never handled the `initialize` for that session.
The backend pod returns HTTP 404 with JSON-RPC error `-32001` ("session not found") because it has no record of that session. The mcp-go client surfaces this as:

```
transport error: failed to send request: session terminated (404). need to re-initialize
```
The session is not actually terminated — it simply doesn't exist on the pod that received the request.
Parent issue: #4484
Failing acceptance test
PR #4574 adds an E2E test that demonstrates this bug. It fails on all 3 Kubernetes versions in CI:
Error from CI (v1.35.1):
```
[FAILED] Request 1/5 should succeed — session should route to the correct backend
Unexpected error:
    transport error: failed to send request: session terminated (404). need to re-initialize
In [It] at: mcpserver_scaling_test.go:382
```
Test flow
- Deploy MCPServer with `replicas=1`, `backendReplicas=2`, Redis session storage, `sessionAffinity=None`
- Initialize MCP session, call `tools/list` — succeeds
- Delete the proxy runner pod (the Deployment recreates it)
- Send 5 `tools/list` requests with the same session ID — fails on request 1

With 2 backends and random routing, P(all 5 hit the correct pod) ≈ 3%.
Root cause
- StatefulSet `serviceName` mismatch: `buildStatefulSetSpec` sets `serviceName` to `containerName`, but the headless service is named `mcp-<containerName>-headless`. These must match for Kubernetes to create pod DNS records.
- ClusterIP stored as `backend_url`: `RoundTrip` stores `targetURI` (the ClusterIP) as `backend_url`. After restart, `Rewrite` routes via the ClusterIP with `sessionAffinity=None`, hitting a random backend.
Proposed fix
Store a pod-specific headless DNS URL (e.g. `myserver-0.mcp-myserver-headless.default.svc.cluster.local:8080`) as `backend_url` instead of the ClusterIP:

- Fix StatefulSet `serviceName` to match the headless service name
- Add `HeadlessServiceConfig` to `ScalingConfig`
- Pre-select a random StatefulSet pod on initialize, store its headless DNS as `backend_url`
- Operator populates the config when `backendReplicas > 1`
- Backward compatible: when the config is nil (single backend), behavior is unchanged