Address scalability review feedback in operator and vMCP guides

yrobla · yrobla · commit 63a7880f7c35 · 2026-04-07T10:39:28.000+02:00
- Clarify SessionStorageWarning is advisory-only and the operator still
  applies the requested replica count
- Correct the distinction between the condition type (SessionStorageWarning)
  and its reason (SessionStorageMissingForReplicas)
- Add warning that ClientIP session affinity fails silently behind NAT
  or shared egress IPs, with guidance to use None for stateless backends
- Fix MCPServer horizontal scaling section: backend is a StatefulSet,
  not a Deployment; add architecture overview and common scaling configs
- Note that SessionStorageWarning only fires for spec.replicas > 1, not
  backendReplicas
- Add connection draining note: 30s grace/drain period, no preStop hook,
  override via podTemplateSpec
- Add Redis address example comment to prompt users to update the value
- Clarify maxParallel fan-out is per-pod, not distributed across replicas
- Add tip on sizing workflow timeouts relative to maxIterations/maxParallel
diff --git a/docs/toolhive/guides-k8s/run-mcp-k8s.mdx b/docs/toolhive/guides-k8s/run-mcp-k8s.mdx
@@ -441,19 +441,35 @@ kubectl -n <NAMESPACE> describe mcpserver <NAME>
 
 ## Horizontal scaling
 
-MCPServer creates two separate Deployments: one for the proxy runner and one for
-the MCP server backend. You can scale each independently:
+MCPServer creates two separate workloads: a proxy runner Deployment and a
+backend MCP server StatefulSet. You can scale each independently:
 
 - `spec.replicas` controls the proxy runner pod count
 - `spec.backendReplicas` controls the backend MCP server pod count
 
+The proxy runner handles authentication, MCP protocol framing, and session
+management; it is stateless with respect to tool execution. The backend runs the
+actual MCP server and executes tools.
+
+Common configurations:
+
+- **Scale only the proxy** (`replicas: N`, omit `backendReplicas`): useful when
+  auth and connection overhead is the bottleneck with a single backend.
+- **Scale only the backend** (omit `replicas`, `backendReplicas: M`): useful
+  when tool execution is CPU/memory-bound and the proxy is not a bottleneck. The
+  backend StatefulSet uses client-IP session affinity to route repeated
+  connections to the same pod — subject to the same NAT limitations as
+  proxy-level affinity.
+- **Scale both** (`replicas: N`, `backendReplicas: M`): full horizontal scale.
+  Redis session storage is required when `replicas > 1`.
+
 ```yaml title="MCPServer resource"
 spec:
   replicas: 2
   backendReplicas: 3
   sessionStorage:
     provider: redis
-    address: redis-master.toolhive-system.svc.cluster.local:6379
+    address: redis-master.toolhive-system.svc.cluster.local:6379 # Update to match your Redis Service location
     db: 0
     keyPrefix: mcp-sessions
     passwordRef:
@@ -466,6 +482,35 @@ When running multiple replicas, configure
 across pods. If you omit `replicas` or `backendReplicas`, the operator defers
 replica management to an HPA or other external controller.
 
+:::note
+The `SessionStorageWarning` condition fires only when `spec.replicas > 1`. Scaling only the backend (`backendReplicas > 1`) does not
+trigger a warning, but backend client-IP affinity is still unreliable behind NAT or shared egress IPs.
+:::
+
+:::note[Connection draining on scale-down]
+
+When a proxy runner pod is terminated (scale-in, rolling update, or node
+eviction), Kubernetes sends SIGTERM and the proxy drains in-flight requests for
+up to 30 seconds before force-closing connections. The grace period and drain
+timeout are both 30 seconds with no headroom, so long-lived SSE or streaming
+connections may be dropped if they exceed the drain window.
+
+No preStop hook is injected by the operator. If your workload requires
+additional time — for example, to let kube-proxy propagate endpoint removal
+before the pod stops accepting traffic — override
+`terminationGracePeriodSeconds` via `podTemplateSpec`:
+
+```yaml
+spec:
+  podTemplateSpec:
+    spec:
+      terminationGracePeriodSeconds: 60
+```
+
+The same 30-second default applies to the backend StatefulSet.
+
+:::
+
 :::warning[Stdio transport limitation]
 
 Backends using the `stdio` transport are limited to a single replica. The
diff --git a/docs/toolhive/guides-vmcp/composite-tools.mdx b/docs/toolhive/guides-vmcp/composite-tools.mdx
@@ -396,11 +396,38 @@ spec:
 | `step`          | Inner step definition (tool call to execute per item) | —       |
 | `onError`       | Error handling: `abort` (stop) or `continue` (skip)   | abort   |
 
+:::note
+
+`forEach` does not support `onError.action: retry`. Use `retry` on regular tool
+steps. The `maxParallel` cap of 50 is enforced at runtime regardless of the
+configured value.
+
+:::
+
 Access the current item inside the inner step using
 `{{.forEach.<itemVar>.<field>}}`. In the example above, `{{.forEach.repo.name}}`
 accesses the `name` field of the current repository. You can also use
 `{{.forEach.index}}` to access the zero-based iteration index.
 
+`maxParallel` controls how many iterations run concurrently **on the pod that
+received the composite tool request**. Iterations are not distributed across
+vMCP replicas — all parallel backend calls originate from a single pod
+regardless of `spec.replicas`. When sizing your deployment, account for the
+per-pod fan-out: a `maxParallel: 50` forEach step can open up to 50 simultaneous
+connections to backend MCP servers from one pod. Ensure both the vMCP pod
+resources and the backend MCP servers can handle that per-pod concurrency.
+
+:::tip[Plan your workflow timeouts]
+
+With `maxIterations: 1000` and `maxParallel: 10` (the defaults), a forEach loop
+runs up to 100 serial batches. If each backend call takes a few seconds, the
+total duration can easily exceed a workflow-level timeout. Set the workflow
+`timeout` to at least
+`ceil(maxIterations / maxParallel) × expected step duration` to avoid silent
+truncation.
+
+:::
+
 ### Error handling
 
 Configure behavior when steps fail:
diff --git a/docs/toolhive/guides-vmcp/scaling-and-performance.mdx b/docs/toolhive/guides-vmcp/scaling-and-performance.mdx
@@ -89,8 +89,10 @@ for a complete Redis deployment guide.
 :::warning
 
 If you configure multiple replicas without session storage, the operator sets a
-`SessionStorageMissingForReplicas` status condition on the resource. Ensure
-Redis is available before scaling beyond a single replica.
+`SessionStorageWarning` status condition on the resource but **still applies the
+replica count**. Pods will start, but requests routed to a replica that did not
+establish the session will fail. Ensure Redis is available before scaling beyond
+a single replica.
 
 :::
 
@@ -116,6 +118,22 @@ spec:
   sessionAffinity: ClientIP # default
 ```
 
+:::warning[ClientIP affinity is unreliable behind NAT or shared egress IPs]
+
+`ClientIP` affinity relies on the source IP reaching kube-proxy. When clients
+sit behind a NAT gateway, corporate proxy, or cloud load balancer (common in
+EKS, GKE, and AKS), all traffic appears to originate from the same IP — routing
+every client to the same pod and eliminating the benefit of horizontal scaling.
+This fails silently: the deployment appears healthy but only one pod handles all
+load.
+
+For stateless backends, set `sessionAffinity: None` so the Service load-balances
+freely. For stateful backends where true per-session routing is required,
+`ClientIP` affinity is a best-effort mechanism only. Prefer vertical scaling or
+a dedicated vMCP instance per team instead.
+
+:::
+
 For stateful backends, vertical scaling or dedicated instances per team/use case
 are recommended instead of horizontal scaling.
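
Reviewer note on the timeout tip in `composite-tools.mdx`: the sizing rule `ceil(maxIterations / maxParallel) × expected step duration` can be checked with a few lines of arithmetic. The sketch below is illustrative only. The function name and the `MAX_PARALLEL_CAP` constant are hypothetical helpers, not part of ToolHive, and the model assumes every batch takes roughly one step duration, which ignores scheduling overhead and slow outliers. The cap of 50 comes from the note in the same diff.

```python
import math

# Runtime cap on maxParallel, as stated in the forEach note in the diff.
# The constant name is hypothetical and used only for this sketch.
MAX_PARALLEL_CAP = 50


def min_workflow_timeout_seconds(
    max_iterations: int, max_parallel: int, step_seconds: float
) -> float:
    """Estimate a lower bound for the workflow timeout of a forEach loop.

    Iterations run in serial batches of up to `max_parallel` items, so the
    estimate is the batch count times the expected duration of one step.
    """
    if max_iterations <= 0 or max_parallel <= 0 or step_seconds < 0:
        raise ValueError("iterations and parallelism must be positive")
    effective_parallel = min(max_parallel, MAX_PARALLEL_CAP)
    batches = math.ceil(max_iterations / effective_parallel)
    return batches * step_seconds


# Defaults from the tip: 1000 iterations, 10 in parallel, i.e. 100 serial batches.
print(min_workflow_timeout_seconds(1000, 10, 3.0))  # 300.0
```

With the defaults and a 3-second backend call, the estimate is 300 seconds, so a workflow `timeout` shorter than that would likely cut the loop off early. Treat the result as a floor and add headroom for retries on regular steps and for slow backends.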