Commit 9ddce6a
RFC: Horizontal Scaling for vMCP and Proxy Runner (#47)
* Add draft RFC for vMCP and proxyrunner horizontal scaling
Introduces THV-XXXX covering background, problems, scope, high-level
solution, and requirements for enabling safe horizontal scale-out of
the vmcp and thv-proxyrunner components via externalized Redis session
storage and session-aware routing.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Address review feedback on vMCP horizontal scaling RFC
- Fix Mermaid \n → <br/> in both diagrams
- Update metadata layer description to include session IDs
- Strengthen re-initialization language ("destructive" not "may not be safe")
- Add current proxyrunner state context to §2.2
- Fix stdio scaling description: about concurrency, not exclusivity
- Add fungibility constraint note to §1.4 and §5.3 R-OP-1
- Fix §3.1: single MCPServer backed by multiple proxyrunner replicas
- Add vMCP scale-in to §3.1 in-scope
- Update §3.2: proxyrunner scale-in only; proxyrunner:StatefulSet N:1 ratio
- Add §3.3 Scaling Summary table
- Update §4.1 diagram to show one:many proxyrunner→backend pods
- Update vMCP session record to backends[] array with per-backend URLs/session IDs
- Simplify proxyrunner session record to session→backend-pod mapping
- Update §4.3 routing to reflect multi-backend session model
- Add §4.6 proxyrunner value proposition note
- Remove redundant R-PR-7
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
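The vMCP session record change above (a backends[] array carrying per-backend URLs and session IDs) can be pictured with a minimal Go sketch. The type names, field names, and URLs here are hypothetical and chosen for illustration only; the RFC text defines the actual record.

```go
package main

import "fmt"

// BackendBinding is a hypothetical per-backend entry in a vMCP session:
// the URL requests are forwarded to and the session ID that backend issued.
type BackendBinding struct {
	Name      string
	URL       string
	SessionID string
}

// VMCPSession is a hypothetical shape for the externalized vMCP session
// record, holding one binding per backend the session has touched.
type VMCPSession struct {
	ID       string
	Backends []BackendBinding
}

// Lookup returns the binding for the named backend, if the session has one.
func (s VMCPSession) Lookup(name string) (BackendBinding, bool) {
	for _, b := range s.Backends {
		if b.Name == name {
			return b, true
		}
	}
	return BackendBinding{}, false
}

func main() {
	s := VMCPSession{ID: "vmcp-1", Backends: []BackendBinding{
		{Name: "github", URL: "http://github-proxy:8080", SessionID: "gh-abc"},
		{Name: "fetch", URL: "http://fetch-proxy:8080", SessionID: "ft-xyz"},
	}}
	if b, ok := s.Lookup("fetch"); ok {
		fmt.Println(b.URL, b.SessionID)
	}
}
```

Because each binding carries its own URL and session ID, any vMCP replica that loads the record can forward a request to the right backend without relying on in-process state.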
* Address second round of review feedback
- §1.1 diagram: use subgraphs to show logical MCPServer boundary
(one MCPServer = one proxyrunner Deployment + its StatefulSet)
- §1.4: replace vague "This constraint" with specific statement that
a stdio backend couples itself to a specific proxyrunner process
- §2.2: correct current-state description — controller already supports
multiple proxyrunner replicas for sse/streamable-http transports;
the problem is lack of session-aware routing, not lack of replica support
- §3.2: correct proxyrunner:StatefulSet ratio — each replica manages
its own StatefulSet (1:1), not a shared StatefulSet (N:1)
- §3.3: update Scaling Summary table to reflect 1:1 replica:StatefulSet
- §4.1: update architecture diagram to show per-replica StatefulSets
- §4.2: proxyrunner session record now includes identity subject for
session hijacking prevention (per session-scoped work THV-0038)
- §5.5: add Security Requirements (R-SEC-1, R-SEC-2) for session
hijacking prevention
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
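As a rough illustration of the §4.2 and §5.5 items above, the following Go sketch binds a proxyrunner session to an identity subject and refuses to route when the caller's subject differs. All names (ProxySession, Store, Route, the error values) are assumptions made for this example, and the in-memory map merely stands in for the external Redis store the RFC proposes.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical error values for the two rejection cases.
var (
	ErrUnknownSession  = errors.New("unknown session")
	ErrSubjectMismatch = errors.New("session bound to a different subject")
)

// ProxySession is a hypothetical proxyrunner session record: the backend
// pod that owns the session and the identity subject that created it.
type ProxySession struct {
	BackendPod string
	Subject    string
}

// Store stands in for the external session store (Redis in the RFC).
type Store map[string]ProxySession

// Route returns the backend pod for a session, but only when the caller's
// subject matches the subject recorded at session creation.
func (st Store) Route(sessionID, subject string) (string, error) {
	rec, ok := st[sessionID]
	if !ok {
		return "", ErrUnknownSession
	}
	if rec.Subject != subject {
		return "", ErrSubjectMismatch
	}
	return rec.BackendPod, nil
}

func main() {
	st := Store{"s1": {BackendPod: "backend-0", Subject: "alice"}}
	pod, err := st.Route("s1", "alice")
	fmt.Println(pod, err)
	_, err = st.Route("s1", "mallory")
	fmt.Println(err)
}
```

Checking the subject on every routed request means a leaked session ID alone is not enough to reach another user's backend pod.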
* Correct proxyrunner StatefulSet model based on code review
All replicas of a proxyrunner Deployment share a single StatefulSet —
they converge on the same desired state via Kubernetes server-side
apply (field manager: toolhive-container-manager), with no leader
election. The previous edit assumed a 1:1 replica:StatefulSet ratio,
which is incorrect.
Updated sections:
- §1.1: add explanation of shared StatefulSet and server-side apply
mechanics; note stdio replica cap vs sse/streamable-http
- §2.2: correct current-state description — replicas share one
StatefulSet; the problem is missing session-to-pod routing
- §3.2: correct ratio back to N:1 (N replicas, 1 StatefulSet)
- §3.3: update Scaling Summary table accordingly
- §4.1: revert architecture diagram to single shared StatefulSet
subgraph with multiple pods
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Scope CRD replica fields and correct current-state description
Neither MCPServer nor VirtualMCPServer CRDs have a replicas field;
both Deployments and the StatefulSet are hardcoded to 1. Add this as
a core deliverable: spec.replicas (proxyrunner/vMCP pod count) and
spec.backendReplicas (StatefulSet pod count) for declarative scaling.
Explicitly document the one-StatefulSet-per-MCPServer invariant.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
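A minimal Go sketch of how the two proposed fields might look in a CRD spec type. The struct name, JSON tags, and the default-to-1 behavior are assumptions for illustration (the default simply mirrors the currently hardcoded value); the real types belong to the operator's API package and may differ.

```go
package main

import "fmt"

// ScalingSpec is a hypothetical excerpt of an MCPServer spec carrying the
// two replica fields proposed above. Pointers distinguish "unset" from 0.
type ScalingSpec struct {
	// Replicas is the proxyrunner (or vMCP) pod count.
	Replicas *int32 `json:"replicas,omitempty"`
	// BackendReplicas is the pod count of the single backend StatefulSet.
	BackendReplicas *int32 `json:"backendReplicas,omitempty"`
}

// orOne returns 1 for an unset field, matching the value that is
// hardcoded today (an assumed default, not confirmed by the RFC).
func orOne(v *int32) int32 {
	if v == nil {
		return 1
	}
	return *v
}

// Effective resolves both counts, applying the assumed default of 1.
func (s ScalingSpec) Effective() (proxy, backend int32) {
	return orOne(s.Replicas), orOne(s.BackendReplicas)
}

func main() {
	three := int32(3)
	spec := ScalingSpec{Replicas: &three}
	proxy, backend := spec.Effective()
	fmt.Println(proxy, backend)
}
```

Keeping both counts on one spec is consistent with the one-StatefulSet-per-MCPServer invariant: the proxyrunner replicas scale independently, while the backend count always refers to the same shared StatefulSet.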
* Address second round of review comments
- §1.1 diagram: remove replica count labels from nodes
- §3.1: add proxyrunner scale-in (non-stdio) to in scope
- §3.2: note 1:1 StatefulSet ratio as future stdio scaling path
- §3.2: clarify inter-proxyrunner routing is best-effort
- §3.2: replace proxyrunner scale-in out-of-scope bullet with
graceful drain and backend StatefulSet scale-in bullets
- §3.3: update table to reflect proxyrunner scale-in is in scope
- §4.1: simplify diagram (no individual pod nodes)
- §5.1: remove R-VMCP-6 (vMCP pod DNS exposure)
- §5.4: fix R-DEP-4 to focus on backend scale-in as disruptive
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* tweaks
* address comments
* self-review
* Add Required Changes section (§6) to RFC-51
Catalogs 16 concrete code changes needed to implement horizontal
scaling for vMCP and proxyrunner, organized by component: CRD/operator
changes (RC-1 through RC-5), transport session layer (RC-6, RC-7),
vMCP session management (RC-8 through RC-10, RC-16), proxyrunner
routing (RC-11 through RC-13), operational concerns (RC-14), and
security (RC-15). Each change is mapped to requirements from §5 and
documents the current state of the code.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Rename RFC to match PR number (THV-0051 → THV-0047) and set status to In Review
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent 7c46970
1 file changed
Lines changed: 649 additions & 0 deletions