Backend routing breaks after proxy runner restart with backendReplicas > 1 #4575

## Problem

When an MCPServer has `backendReplicas > 1`, the proxy runner stores the **ClusterIP service URL** (e.g. `http://mcp-<name>:8080`) as `backend_url` in session metadata. After a restart, the proxy runner recovers the session from Redis but still routes through the ClusterIP service, so kube-proxy may send the request to a backend pod that never handled the `initialize` for that session.

The backend pod returns HTTP 404 with JSON-RPC `-32001` ("session not found") because it has no record of that session. The mcp-go client surfaces this as:

```
transport error: failed to send request: session terminated (404). need to re-initialize
```

The session is not actually terminated — it simply doesn't exist on the pod that received the request.
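The failure mode can be reproduced with a toy model. The sketch below is not ToolHive code: `backendPod`, its methods, and the session ID are hypothetical stand-ins, and it assumes each backend keeps its sessions in process memory only.

```go
package main

import (
	"errors"
	"fmt"
)

// errSessionNotFound stands in for the backend's HTTP 404 / JSON-RPC -32001 reply.
var errSessionNotFound = errors.New("session not found (-32001)")

// backendPod is a hypothetical stand-in for one backend replica that
// tracks its MCP sessions in local memory only.
type backendPod struct {
	name     string
	sessions map[string]bool
}

func newBackendPod(name string) *backendPod {
	return &backendPod{name: name, sessions: make(map[string]bool)}
}

// initialize records the session on this pod and on no other pod.
func (p *backendPod) initialize(sessionID string) {
	p.sessions[sessionID] = true
}

// handle succeeds only if this pod previously saw initialize for the session.
func (p *backendPod) handle(sessionID string) error {
	if !p.sessions[sessionID] {
		return errSessionNotFound
	}
	return nil
}

func main() {
	pod0, pod1 := newBackendPod("backend-0"), newBackendPod("backend-1")
	pod0.initialize("sess-1")

	// The same session ID works on the pod that initialized it and is
	// unknown on the other pod, even though the session is still alive.
	fmt.Println(pod0.name, pod0.handle("sess-1"))
	fmt.Println(pod1.name, pod1.handle("sess-1"))
}
```

The session is healthy on `backend-0` throughout; only the routing decision determines whether the client sees an error.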

Parent issue: #4484

## Failing acceptance test

PR #4574 adds an E2E test that demonstrates this bug. It fails on all 3 Kubernetes versions in CI:

- [v1.33.7](https://github.com/stacklok/toolhive/actions/runs/24040941288/job/70112255691), [v1.34.3](https://github.com/stacklok/toolhive/actions/runs/24040941288/job/70112255684), [v1.35.1](https://github.com/stacklok/toolhive/actions/runs/24040941288/job/70112255710)

Error from CI (v1.35.1):
```
[FAILED] Request 1/5 should succeed — session should route to the correct backend
Unexpected error:
    transport error: failed to send request: session terminated (404). need to re-initialize
In [It] at: mcpserver_scaling_test.go:382
```

### Test flow

1. Deploy MCPServer with `replicas=1`, `backendReplicas=2`, Redis session storage, `sessionAffinity=None`
2. Initialize MCP session, call `tools/list` — succeeds
3. Delete the proxy runner pod (Deployment recreates it)
4. Send 5 `tools/list` requests with the same session ID — fails on request 1

With 2 backends and uniformly random routing, the probability that all 5 requests hit the correct pod is (1/2)^5 ≈ 3%, so the test fails on almost every run.
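The estimate can be checked with a few lines of Go. This is a rough model only: it assumes every request is routed independently and uniformly across the backends, which approximates kube-proxy with `sessionAffinity=None`. The function name `allHitProbability` is made up for this illustration.

```go
package main

import (
	"fmt"
	"math"
)

// allHitProbability returns the chance that every one of `requests`
// independently and uniformly routed requests lands on the single pod
// that owns the session, given `backends` pods.
func allHitProbability(backends, requests int) float64 {
	return math.Pow(1/float64(backends), float64(requests))
}

func main() {
	// 2 backends, 5 requests: (1/2)^5
	fmt.Printf("%.3f%%\n", 100*allHitProbability(2, 5)) // prints 3.125%
}
```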

## Root cause

1. **StatefulSet `serviceName` mismatch**: `buildStatefulSetSpec` sets `serviceName` to `containerName`, but the headless service is named `mcp-<containerName>-headless`. These must match for Kubernetes to create pod DNS records.

2. **ClusterIP stored as `backend_url`**: `RoundTrip` stores `targetURI` (ClusterIP) as `backend_url`. After restart, `Rewrite` routes via ClusterIP with `sessionAffinity=None`, hitting a random backend.
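To make the first cause concrete, the sketch below builds the per-pod DNS name that would result from each `serviceName` value. It assumes the StatefulSet is named after `containerName` (as the `myserver-0` example in this issue suggests) and that the cluster domain is `cluster.local`. Both helper functions are hypothetical and do not come from the ToolHive code base.

```go
package main

import "fmt"

// headlessServiceName mirrors the naming described above:
// mcp-<containerName>-headless.
func headlessServiceName(containerName string) string {
	return "mcp-" + containerName + "-headless"
}

// podDNSName builds the DNS name of a StatefulSet pod:
// <statefulset>-<ordinal>.<serviceName>.<namespace>.svc.cluster.local
// The cluster domain is assumed to be cluster.local.
func podDNSName(statefulSetName string, ordinal int, serviceName, namespace string) string {
	return fmt.Sprintf("%s-%d.%s.%s.svc.cluster.local",
		statefulSetName, ordinal, serviceName, namespace)
}

func main() {
	const containerName, namespace = "myserver", "default"

	// Current behavior per this issue: serviceName == containerName, so the
	// name points at a service that is not the headless service.
	current := podDNSName(containerName, 0, containerName, namespace)

	// With serviceName set to the headless service name, the pod name
	// lives under the headless service.
	fixed := podDNSName(containerName, 0, headlessServiceName(containerName), namespace)

	fmt.Println("current:", current)
	fmt.Println("fixed:  ", fixed)
}
```

Only the second name lives under the headless service, which is why the per-pod URL in the proposed fix depends on correcting `serviceName` first.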

## Proposed fix

Store a **pod-specific headless DNS URL** (e.g. `myserver-0.mcp-myserver-headless.default.svc.cluster.local:8080`) as `backend_url` instead of the ClusterIP service URL:

- Fix StatefulSet `serviceName` to match headless service name
- Add `HeadlessServiceConfig` to `ScalingConfig`
- Pre-select a random StatefulSet pod on initialize, store its headless DNS as `backend_url`
- Operator populates config when `backendReplicas > 1`
- Backward compatible: when config is nil (single backend), behavior is unchanged
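A minimal sketch of the selection step follows, under stated assumptions. The issue names `HeadlessServiceConfig` but does not define it, so every field below is hypothetical, as are `podURL` and `selectBackendURL`. The `http://` scheme and the `cluster.local` domain are assumed as well.

```go
package main

import (
	"fmt"
	"math/rand"
)

// HeadlessServiceConfig is named in the proposal; the fields here are
// assumptions made for this sketch, not the actual definition.
type HeadlessServiceConfig struct {
	StatefulSetName string // e.g. "myserver"
	ServiceName     string // e.g. "mcp-myserver-headless"
	Namespace       string
	Replicas        int
	Port            int
}

// podURL returns the headless DNS URL for one StatefulSet pod ordinal.
func podURL(cfg HeadlessServiceConfig, ordinal int) string {
	return fmt.Sprintf("http://%s-%d.%s.%s.svc.cluster.local:%d",
		cfg.StatefulSetName, ordinal, cfg.ServiceName, cfg.Namespace, cfg.Port)
}

// selectBackendURL returns the value to store as backend_url on initialize.
// With a nil config (single backend) it returns the ClusterIP URL unchanged.
// pick chooses an ordinal in [0, n); rand.Intn satisfies this signature.
func selectBackendURL(cfg *HeadlessServiceConfig, clusterIPURL string, pick func(n int) int) string {
	if cfg == nil || cfg.Replicas < 1 {
		// Also guards against calling pick with n <= 0.
		return clusterIPURL
	}
	return podURL(*cfg, pick(cfg.Replicas))
}

func main() {
	cfg := &HeadlessServiceConfig{
		StatefulSetName: "myserver",
		ServiceName:     "mcp-myserver-headless",
		Namespace:       "default",
		Replicas:        2,
		Port:            8080,
	}
	fmt.Println(selectBackendURL(cfg, "http://mcp-myserver:8080", rand.Intn))
	fmt.Println(selectBackendURL(nil, "http://mcp-myserver:8080", rand.Intn))
}
```

Passing the picker as a function keeps the random choice out of the URL logic, so a test can pin the ordinal and assert the exact `backend_url`.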
