Commit 63a7880

Address scalability review feedback in operator and vMCP guides
- Clarify SessionStorageWarning is advisory-only and the operator still applies the requested replica count
- Correct condition type (SessionStorageWarning) vs reason (SessionStorageMissingForReplicas) distinction
- Add warning that ClientIP session affinity fails silently behind NAT or shared egress IPs, with guidance to use None for stateless backends
- Fix MCPServer horizontal scaling section: backend is a StatefulSet, not a Deployment; add architecture overview and common scaling configs
- Note that SessionStorageWarning only fires for spec.replicas > 1, not backendReplicas
- Add connection draining note: 30s grace/drain period, no preStop hook, override via podTemplateSpec
- Add Redis address example comment to prompt users to update the value
- Clarify maxParallel fan-out is per-pod, not distributed across replicas
- Add tip on sizing workflow timeouts relative to maxIterations/maxParallel
1 parent 7fef985 commit 63a7880

3 files changed

Lines changed: 95 additions & 5 deletions

docs/toolhive/guides-k8s/run-mcp-k8s.mdx

Lines changed: 48 additions & 3 deletions
Original file line number | Diff line number | Diff line change
@@ -441,19 +441,35 @@ kubectl -n <NAMESPACE> describe mcpserver <NAME>
441441

442442
## Horizontal scaling
443443

444-
MCPServer creates two separate Deployments: one for the proxy runner and one for
445-
the MCP server backend. You can scale each independently:
444+
MCPServer creates two separate workloads: a proxy runner Deployment and a
445+
backend MCP server StatefulSet. You can scale each independently:
446446

447447
- `spec.replicas` controls the proxy runner pod count
448448
- `spec.backendReplicas` controls the backend MCP server pod count
449449

450+
The proxy runner handles authentication, MCP protocol framing, and session
451+
management; it is stateless with respect to tool execution. The backend runs the
452+
actual MCP server and executes tools.
453+
454+
Common configurations:
455+
456+
- **Scale only the proxy** (`replicas: N`, omit `backendReplicas`): useful when
457+
auth and connection overhead is the bottleneck with a single backend.
458+
- **Scale only the backend** (omit `replicas`, `backendReplicas: M`): useful
459+
when tool execution is CPU/memory-bound and the proxy is not a bottleneck. The
460+
backend StatefulSet uses client-IP session affinity to route repeated
461+
connections to the same pod — subject to the same NAT limitations as
462+
proxy-level affinity.
463+
- **Scale both** (`replicas: N`, `backendReplicas: M`): full horizontal scale.
464+
Redis session storage is required when `replicas > 1`.
465+
450466
```yaml title="MCPServer resource"
451467
spec:
452468
  replicas: 2
453469
  backendReplicas: 3
454470
  sessionStorage:
455471
    provider: redis
456-
    address: redis-master.toolhive-system.svc.cluster.local:6379
472+
    address: redis-master.toolhive-system.svc.cluster.local:6379 # Update to match your Redis Service location
457473
    db: 0
458474
    keyPrefix: mcp-sessions
459475
    passwordRef:
@@ -466,6 +482,35 @@ When running multiple replicas, configure
466482
across pods. If you omit `replicas` or `backendReplicas`, the operator defers
467483
replica management to an HPA or other external controller.
468484

485+
:::note The `SessionStorageWarning` condition fires only when
486+
`spec.replicas > 1`. Scaling only the backend (`backendReplicas > 1`) does not
487+
trigger a warning, but backend client-IP affinity is still unreliable behind NAT
488+
or shared egress IPs. :::
489+
490+
:::note[Connection draining on scale-down]
491+
492+
When a proxy runner pod is terminated (scale-in, rolling update, or node
493+
eviction), Kubernetes sends SIGTERM and the proxy drains in-flight requests for
494+
up to 30 seconds before force-closing connections. The grace period and drain
495+
timeout are both 30 seconds with no headroom, so long-lived SSE or streaming
496+
connections may be dropped if they exceed the drain window.
497+
498+
No preStop hook is injected by the operator. If your workload requires
499+
additional time — for example, to let kube-proxy propagate endpoint removal
500+
before the pod stops accepting traffic — override
501+
`terminationGracePeriodSeconds` via `podTemplateSpec`:
502+
503+
```yaml
504+
spec:
505+
  podTemplateSpec:
506+
    spec:
507+
      terminationGracePeriodSeconds: 60
508+
```
509+
510+
The same 30-second default applies to the backend StatefulSet.
511+
512+
:::
513+
469514
:::warning[Stdio transport limitation]
470515

471516
Backends using the `stdio` transport are limited to a single replica. The

docs/toolhive/guides-vmcp/composite-tools.mdx

Lines changed: 27 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -396,11 +396,38 @@ spec:
396396
| `step` | Inner step definition (tool call to execute per item) | — |
397397
| `onError` | Error handling: `abort` (stop) or `continue` (skip) | abort |
398398

399+
:::note
400+
401+
`forEach` does not support `onError.action: retry`. Use `retry` on regular tool
402+
steps. The `maxParallel` cap of 50 is enforced at runtime regardless of the
403+
configured value.
404+
405+
:::
406+
399407
Access the current item inside the inner step using
400408
`{{.forEach.<itemVar>.<field>}}`. In the example above, `{{.forEach.repo.name}}`
401409
accesses the `name` field of the current repository. You can also use
402410
`{{.forEach.index}}` to access the zero-based iteration index.
403411

412+
`maxParallel` controls how many iterations run concurrently **on the pod that
413+
received the composite tool request**. Iterations are not distributed across
414+
vMCP replicas — all parallel backend calls originate from a single pod
415+
regardless of `spec.replicas`. When sizing your deployment, account for the
416+
per-pod fan-out: a `maxParallel: 50` forEach step can open up to 50 simultaneous
417+
connections to backend MCP servers from one pod. Ensure both the vMCP pod
418+
resources and the backend MCP servers can handle that per-pod concurrency.
419+
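To make the per-pod fan-out concrete, here is a rough Python sketch of the worst-case number of simultaneous backend calls from one vMCP pod. It assumes each in-flight composite tool request runs a single `forEach` step at a time. `peak_backend_calls` and `concurrent_requests` are hypothetical names used for illustration and are not part of vMCP.

```python
# Runtime cap on maxParallel, as stated in the note above.
MAX_PARALLEL_CAP = 50


def peak_backend_calls(concurrent_requests: int, max_parallel: int) -> int:
    """Worst-case simultaneous backend calls originating from one pod.

    Assumption: every in-flight composite request is inside one forEach
    step, so each contributes at most min(max_parallel, cap) calls.
    """
    effective_parallel = min(max_parallel, MAX_PARALLEL_CAP)
    return concurrent_requests * effective_parallel


# Four composite requests handled by the same pod, each with maxParallel: 50.
print(peak_backend_calls(concurrent_requests=4, max_parallel=50))  # 200
```

Adding vMCP replicas spreads the composite requests across pods, but it does not reduce the fan-out of any single request.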
420+
:::tip[Plan your workflow timeouts]
421+
422+
With `maxIterations: 1000` and `maxParallel: 10` (the defaults), a forEach loop
423+
runs up to 100 serial batches. If each backend call takes a few seconds, the
424+
total duration can easily exceed a workflow-level timeout. Set the workflow
425+
`timeout` to at least
426+
`ceil(maxIterations / maxParallel) × expected step duration` to avoid silent
427+
truncation.
428+
429+
:::
430+
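The sizing rule in the tip can be worked through with a few lines of Python. This is a minimal sketch of the arithmetic only. The function name and the 3-second step duration are assumptions for illustration, not values taken from vMCP.

```python
import math


def min_workflow_timeout(max_iterations: int, max_parallel: int,
                         step_seconds: float) -> float:
    """Lower bound for the workflow timeout, in seconds.

    Iterations run in serial batches of at most max_parallel, so the
    loop needs ceil(max_iterations / max_parallel) batches.
    """
    if max_iterations < 0 or max_parallel < 1:
        raise ValueError("max_iterations must be >= 0 and max_parallel >= 1")
    batches = math.ceil(max_iterations / max_parallel)
    return batches * step_seconds


# 1000 iterations, 10 at a time, assumed 3 s per backend call:
# 100 batches x 3 s = 300 s.
print(min_workflow_timeout(1000, 10, 3.0))  # 300.0
```

Treat the result as a floor rather than a target, since it ignores retries, queuing, and slow outliers among the backend calls.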
404431
### Error handling
405432

406433
Configure behavior when steps fail:

docs/toolhive/guides-vmcp/scaling-and-performance.mdx

Lines changed: 20 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -89,8 +89,10 @@ for a complete Redis deployment guide.
8989
:::warning
9090

9191
If you configure multiple replicas without session storage, the operator sets a
92-
`SessionStorageMissingForReplicas` status condition on the resource. Ensure
93-
Redis is available before scaling beyond a single replica.
92+
`SessionStorageWarning` status condition on the resource but **still applies the
93+
replica count**. Pods will start, but requests routed to a replica that did not
94+
establish the session will fail. Ensure Redis is available before scaling beyond
95+
a single replica.
9496

9597
:::
9698

@@ -116,6 +118,22 @@ spec:
116118
sessionAffinity: ClientIP # default
117119
```
118120

121+
:::warning[ClientIP affinity is unreliable behind NAT or shared egress IPs]
122+
123+
`ClientIP` affinity relies on the source IP reaching kube-proxy. When clients
124+
sit behind a NAT gateway, corporate proxy, or cloud load balancer (common in
125+
EKS, GKE, and AKS), all traffic appears to originate from the same IP — routing
126+
every client to the same pod and eliminating the benefit of horizontal scaling.
127+
This fails silently: the deployment appears healthy but only one pod handles all
128+
load.
129+
130+
For stateless backends, set `sessionAffinity: None` so the Service load-balances
131+
freely. For stateful backends where true per-session routing is required,
132+
`ClientIP` affinity is a best-effort mechanism only. Prefer vertical scaling or
133+
a dedicated vMCP instance per team instead.
134+
135+
:::
136+
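The silent failure described in the warning can be illustrated with a toy model in Python. This is a deliberately simplified sketch of sticky routing keyed on source IP, not kube-proxy's actual implementation. The pod names and IP addresses are made up for the example.

```python
from collections import Counter
from itertools import cycle


def route(source_ips, pods):
    """Count requests per pod under sticky, source-IP-based routing.

    The first request from an IP is assigned the next pod in rotation;
    later requests from the same IP reuse that pod.
    """
    next_pod = cycle(pods)
    sticky = {}
    load = Counter()
    for ip in source_ips:
        if ip not in sticky:
            sticky[ip] = next(next_pod)
        load[sticky[ip]] += 1
    return load


pods = ["pod-a", "pod-b", "pod-c"]
direct = [f"10.0.0.{i}" for i in range(1, 7)]  # six clients, six source IPs
behind_nat = ["203.0.113.7"] * 6               # six clients, one egress IP

print(dict(route(direct, pods)))      # {'pod-a': 2, 'pod-b': 2, 'pod-c': 2}
print(dict(route(behind_nat, pods)))  # {'pod-a': 6}
```

In the second case all three pods would still pass health checks, which is why the imbalance tends to show up only in per-pod load metrics.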
119137
For stateful backends, vertical scaling or dedicated instances per team/use case
120138
are recommended instead of horizontal scaling.
121139
