RFC: Horizontal Scaling for vMCP and Proxy Runner #47
4987889 to b9e00fd
Introduces THV-XXXX covering background, problems, scope, high-level solution, and requirements for enabling safe horizontal scale-out of the vmcp and thv-proxyrunner components via externalized Redis session storage and session-aware routing.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
b9e00fd to dab8883
- Fix Mermaid \n → <br/> in both diagrams
- Update metadata layer description to include session IDs
- Strengthen re-initialization language ("destructive" not "may not be safe")
- Add current proxyrunner state context to §2.2
- Fix stdio scaling description: about concurrency, not exclusivity
- Add fungibility constraint note to §1.4 and §5.3 R-OP-1
- Fix §3.1: single MCPServer backed by multiple proxyrunner replicas
- Add vMCP scale-in to §3.1 in-scope
- Update §3.2: proxyrunner scale-in only; proxyrunner:StatefulSet N:1 ratio
- Add §3.3 Scaling Summary table
- Update §4.1 diagram to show one:many proxyrunner→backend pods
- Update vMCP session record to backends[] array with per-backend URLs/session IDs
- Simplify proxyrunner session record to session→backend-pod mapping
- Update §4.3 routing to reflect multi-backend session model
- Add §4.6 proxyrunner value proposition note
- Remove redundant R-PR-7
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
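The vMCP session record described in the list above (a backends[] array carrying per-backend URLs and session IDs) could look roughly like the following. This is a minimal sketch, not the RFC's schema: the type names, JSON field names, and the example URL are hypothetical, and the record is serialized as JSON only as an assumption about how a value might be stored in a shared store.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// BackendSession is a hypothetical per-backend entry in a vMCP session record.
type BackendSession struct {
	Name      string `json:"name"`      // logical backend name (assumed field)
	URL       string `json:"url"`       // backend URL for this session
	SessionID string `json:"sessionId"` // session ID issued by that backend
}

// VMCPSession is a hypothetical shape for the record kept per client session.
type VMCPSession struct {
	SessionID string           `json:"sessionId"`
	Backends  []BackendSession `json:"backends"`
}

// BackendFor returns the entry for the named backend, if the session has one.
func (s VMCPSession) BackendFor(name string) (BackendSession, bool) {
	for _, b := range s.Backends {
		if b.Name == name {
			return b, true
		}
	}
	return BackendSession{}, false
}

// EncodeSession serializes a record to the string a shared store would hold.
func EncodeSession(s VMCPSession) (string, error) {
	raw, err := json.Marshal(s)
	return string(raw), err
}

// DecodeSession parses a stored value back into a record.
func DecodeSession(raw string) (VMCPSession, error) {
	var s VMCPSession
	err := json.Unmarshal([]byte(raw), &s)
	return s, err
}

func main() {
	rec := VMCPSession{
		SessionID: "client-session-1",
		Backends: []BackendSession{
			{Name: "github", URL: "http://github-backend:8080", SessionID: "gh-123"},
		},
	}
	raw, err := EncodeSession(rec)
	if err != nil {
		panic(err)
	}
	restored, err := DecodeSession(raw)
	if err != nil {
		panic(err)
	}
	if b, ok := restored.BackendFor("github"); ok {
		fmt.Println("route to", b.URL, "with backend session", b.SessionID)
	}
}
```

Because any replica can decode the same record, any replica can route a request for that session to the right backend, which is the property the multi-backend session model relies on.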
- §1.1 diagram: use subgraphs to show logical MCPServer boundary (one MCPServer = one proxyrunner Deployment + its StatefulSet)
- §1.4: replace vague "This constraint" with specific statement that a stdio backend couples itself to a specific proxyrunner process
- §2.2: correct current-state description: controller already supports multiple proxyrunner replicas for sse/streamable-http transports; the problem is lack of session-aware routing, not lack of replica support
- §3.2: correct proxyrunner:StatefulSet ratio: each replica manages its own StatefulSet (1:1), not a shared StatefulSet (N:1)
- §3.3: update Scaling Summary table to reflect 1:1 replica:StatefulSet
- §4.1: update architecture diagram to show per-replica StatefulSets
- §4.2: proxyrunner session record now includes identity subject for session hijacking prevention (per session-scoped work THV-0038)
- §5.5: add Security Requirements (R-SEC-1, R-SEC-2) for session hijacking prevention
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All replicas of a proxyrunner Deployment share a single StatefulSet; they converge on the same desired state via Kubernetes server-side apply (field manager: toolhive-container-manager), with no leader election. The previous edit assumed a 1:1 replica:StatefulSet ratio, which is incorrect.
Updated sections:
- §1.1: add explanation of shared StatefulSet and server-side apply mechanics; note stdio replica cap vs sse/streamable-http
- §2.2: correct current-state description: replicas share one StatefulSet; the problem is missing session-to-pod routing
- §3.2: correct ratio back to N:1 (N replicas, 1 StatefulSet)
- §3.3: update Scaling Summary table accordingly
- §4.1: revert architecture diagram to single shared StatefulSet subgraph with multiple pods
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Neither MCPServer nor VirtualMCPServer CRDs have a replicas field; both Deployments and the StatefulSet are hardcoded to 1. Add this as a core deliverable: spec.replicas (proxyrunner/vMCP pod count) and spec.backendReplicas (StatefulSet pod count) for declarative scaling. Explicitly document the one-StatefulSet-per-MCPServer invariant.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
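As an illustration of the two proposed fields, the sketch below shows how spec.replicas and spec.backendReplicas might appear on a CRD spec type. The `ScalingSpec` type and `effectiveReplicas` helper are hypothetical, and defaulting an unset field to 1 is an assumption chosen to match the currently hardcoded value; the real types and defaulting rules are for the RFC to define.

```go
package main

import "fmt"

// ScalingSpec is a hypothetical fragment of an MCPServer spec.
// Pointer fields distinguish "unset" from an explicit value.
type ScalingSpec struct {
	// Replicas is the proxyrunner (or vMCP) pod count.
	Replicas *int32 `json:"replicas,omitempty"`
	// BackendReplicas is the pod count of the single backend StatefulSet.
	BackendReplicas *int32 `json:"backendReplicas,omitempty"`
}

// effectiveReplicas returns the requested count, or 1 when the field is unset
// (assumption: preserve today's hardcoded behavior for existing resources).
func effectiveReplicas(v *int32) int32 {
	if v == nil {
		return 1
	}
	return *v
}

func main() {
	three := int32(3)
	spec := ScalingSpec{Replicas: &three} // BackendReplicas left unset

	fmt.Println("proxyrunner replicas:", effectiveReplicas(spec.Replicas))
	fmt.Println("backend replicas:", effectiveReplicas(spec.BackendReplicas))
}
```

Keeping both counts on one spec reflects the one-StatefulSet-per-MCPServer invariant: there is exactly one backend count to declare per MCPServer, independent of how many proxyrunner pods front it.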
- §1.1 diagram: remove replica count labels from nodes
- §3.1: add proxyrunner scale-in (non-stdio) to in scope
- §3.2: note 1:1 StatefulSet ratio as future stdio scaling path
- §3.2: clarify inter-proxyrunner routing is best-effort
- §3.2: replace proxyrunner scale-in out-of-scope bullet with graceful drain and backend StatefulSet scale-in bullets
- §3.3: update table to reflect proxyrunner scale-in is in scope
- §4.1: simplify diagram (no individual pod nodes)
- §5.1: remove R-VMCP-6 (vMCP pod DNS exposure)
- §5.4: fix R-DEP-4 to focus on backend scale-in as disruptive
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ChrisJBurns
left a comment
Looks ok, with a couple of comments, some of which may be blockers. I can relay the HMAC and subject-only comments to others with more expertise to find out whether they are blocking or non-issues. I'm more interested in how the ProxyRunner scales at all without vMCP, or whether we even want to make a more controversial decision to mandate vMCP for use at scale.
Also, should we define what observability looks like for this? Redis ops, routing decisions/cross-pod proxy success, distributed trace propagation across pod boundaries, etc.? Or is that a later thing? (This is fine too.)
address: "redis:6379"
db: 0
keyPrefix: "vmcp:session:"
passwordRef:
Are we sending credentials in plaintext? Or are we using mTLS?
I'm not sure. I think whatever we use today should be sufficient, but if you feel strongly then we'll need to modify the behavior wherever else Redis is used.
Catalogs 16 concrete code changes needed to implement horizontal scaling for vMCP and proxyrunner, organized by component: CRD/operator changes (RC-1 through RC-5), transport session layer (RC-6, RC-7), vMCP session management (RC-8 through RC-10, RC-16), proxyrunner routing (RC-11 through RC-13), operational concerns (RC-14), and security (RC-15). Each change is mapped to requirements from §5 and documents the current state of the code.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… In Review Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
THV-0047: Manual Horizontal Scaling for vMCP and Proxy Runner
This RFC defines an approach to enable safe horizontal scale-out of vmcp and thv-proxyrunner by externalizing session state to a shared Redis store and implementing session-aware routing at each layer.
Key Sections
RestoreSession, LRU eviction, backend expiry sync
Related