`rfcs/THV-0038-session-scoped-client-lifecycle.md`

Lines changed: 73 additions & 39 deletions
```diff
@@ -616,7 +616,7 @@ The default factory implementation follows this pattern:
 - **Client initialization includes MCP handshake**: Each client sends `InitializeRequest` to its backend, and the backend responds with capabilities and its own `Mcp-Session-Id`. The client stores the session ID for protocol compliance (includes it in subsequent request headers).
 - **Capture backend session IDs**: Factory also captures each backend's session ID (via `client.SessionID()`) for observability, storing them in a map to pass to the session
 - **Performance requirement**: Use parallel initialization (e.g., `errgroup` with bounded concurrency) to avoid sequential latency accumulation. Connection initialization (TCP handshake + TLS negotiation + MCP protocol handshake) can take tens to hundreds of milliseconds per backend depending on network latency and backend responsiveness. With 20 backends, sequential initialization could easily exceed acceptable session creation latency.
-- **Bounded concurrency**: Limit parallel goroutines (e.g., 10 concurrent initializations) to avoid resource exhaustion. This limit is **per-session-creation** (not global), implemented as a semaphore inside the factory. It should be a configurable vMCP server-level parameter (e.g., `max_backend_init_concurrency`, default: 10). Operators with many backends on a fast private network can raise it; resource-constrained deployments or backends with expensive initialization should lower it. A global limit across concurrent session creations is not necessary — the per-session semaphore already bounds the worst case per event.
+- **Bounded concurrency**: Limit parallel goroutines (e.g., 10 concurrent initializations) per session to avoid resource exhaustion. This limit is **per-session-creation**, implemented as a semaphore inside the factory via a configurable vMCP server-level parameter (`max_backend_init_concurrency`, default: 10). Operators should note that the aggregate system load is protected by the global `TOOLHIVE_MAX_SESSIONS` limit (see [Resource Exhaustion & DoS Protection](#concurrency--resource-safety)). For additional safety during traffic spikes, the factory may optionally implement a global initialization semaphore (e.g., `max_global_backend_init_concurrency`, default: 100) to cap the total number of simultaneous connection attempts across all active session creations (preventing, for example, 100 concurrent session requests from triggering 1,000 backend initializations).
 - **Per-backend timeout**: Apply context timeout (e.g., 5s per backend) so one slow backend doesn't block session creation
 - **Partial initialization**: If some backends fail, log warnings and continue with successfully initialized backends (failed backends not added to clients map)
 - Clients are connection-ready and stateful (each maintains its backend session for protocol use)
```
```diff
@@ -706,6 +706,7 @@ The default session implementation stores:
 - Used for: logging, metrics, health checks, debugging, and explicit session cleanup
 - Updated when clients are re-initialized (e.g., after backend session expiration)
 - RWMutex for thread-safe access (read lock for queries/calls, write lock for Close)
+- `singleflight.Group` (or per-backend locks) to coordinate concurrent re-initialization of backend sessions without stalling the whole session
 
 **Backend session ID lifecycle management:**
 
```
```diff
@@ -965,6 +966,9 @@ The session **continues operating** with remaining backends. If a backend client
 func (s *Session) CallTool(ctx context.Context, name string, arguments map[string]any) (*ToolResult, error) {
 	backend := s.routingTable.Lookup(name)
 	client := s.clients[backend.ID]
+	if client == nil {
+		return nil, fmt.Errorf("no client found for backend %s", backend.ID)
+	}
 
 	// Call backend client - if it fails, return the error
```
```diff
 - **Approach**: Best effort - attempt keepalive, gracefully handle backends that don't support it
 - **Configuration**: Enable per backend, configurable interval (default: 5 min)
 
-The preferred keepalive method is the MCP spec-defined `ping` protocol request, which is side-effect-free and supported by all compliant servers; explicit tool calls should only be used as a fallback. Keepalive failures must not affect healthy sessions — after N consecutive failures the feature should be disabled for that backend, with a periodic probe to re-enable on recovery. Keepalive should default to disabled for stateless backends or where TTL alignment already covers the session lifetime. The keepalive goroutine must hold the backend lock to avoid races with session re-initialization. Operators should be able to observe keepalive health via per-backend metrics covering attempt counts, failure reasons, and auto-disable events.
+The preferred keepalive method is the MCP spec-defined `ping` protocol request, which is side-effect-free and supported by all compliant servers; explicit tool calls should only be used as a fallback. Keepalive failures must not affect healthy sessions — after N consecutive failures the feature should be disabled for that backend, with a periodic probe to re-enable on recovery. Keepalive should default to disabled for stateless backends or where TTL alignment already covers the session lifetime. The keepalive goroutine must use the same in-flight counter (`sync.WaitGroup`) approach as other operations to avoid races with session re-initialization while ensuring no locks are held during network I/O. Operators should be able to observe keepalive health via per-backend metrics covering attempt counts, failure reasons, and auto-disable events.
 
 2. **Session TTL alignment**:
 - Configure backend session TTLs longer than vMCP session TTL
```
```diff
@@ -1231,10 +1264,11 @@ For initial implementation, we assume most backends use long-lived credentials (
 
 **Required for production deployment**:
 
 1. **Session binding to authentication token**:
-- Store a cryptographic hash of the original authentication token (e.g., `SHA256(bearerToken)`) in the session during creation
-- On each request, validate that the current auth token hash matches the session's bound token hash
-- If mismatch, reject with "session authentication mismatch" error and terminate session
-- This prevents stolen session IDs from being used with different credentials
+- Store a secure cryptographic hash of the original authentication token in the session during creation. To prevent offline attacks if session state is leaked (e.g., from Redis/Valkey), prefer a keyed hash (e.g., `HMAC-SHA256` with a server-managed secret and a per-session salt).
+- On each request, validate the current auth token against the session's bound hash using a constant-time comparison to prevent timing attacks.
+- If mismatch, reject with "session authentication mismatch" error and terminate session.
+- Ensure the hash value is treated as sensitive and is never logged or exposed in traces.
+- This prevents stolen session IDs from being used with different credentials.
 
 2. **TLS-only enforcement**:
 - Require TLS for all vMCP connections (prevent session ID interception)
```
```diff
@@ -1503,7 +1537,7 @@ if config.SessionManagementV2 {
 - Verify old code path still works when flag is disabled (no regressions)
 
 **Security (blocking for production rollout)**:
-- Implement token hash binding during `CreateSession()`: store `SHA256(bearerToken)` in the session, validate on each subsequent request, reject with "session authentication mismatch" and terminate on mismatch (see Security Considerations → Session Hijacking Prevention). This must be completed before the feature flag is enabled in any production environment.
+- Implement token hash binding during `CreateSession()`: store a secure keyed hash (e.g., `HMAC-SHA256` with a server-managed secret and a per-session salt) of the original authentication token in the session. Validate this hash on each subsequent request using constant-time comparison. Reject with "session authentication mismatch" and terminate the session on mismatch (see Security Considerations → Session Hijacking Prevention). This must be completed before the feature flag is enabled in any production environment.
 
 **Files Modified**:
 - `pkg/vmcp/server/server.go` - Add conditional logic based on `sessionManagementV2` flag
```
```diff
@@ -1537,7 +1571,7 @@ if config.SessionManagementV2 {
 - `pkg/vmcp/server/server.go` - Remove old code path and feature flag conditionals
 - `pkg/vmcp/discovery/middleware.go` - Delete (replaced by SessionFactory)
 - `pkg/vmcp/client/client.go` - Remove `httpBackendClient` (replaced by Session ownership)
-- `pkg/transport/session/manager.go` - Update `DeleteExpired()` to call `session.Close()` before removing from storage (fixes resource leak). Because the storage layer operates on the `Session` interface (which has no `Close()` method), use an optional interface check: `if closer, ok := sess.(io.Closer); ok { closer.Close() }`. This avoids adding `Close()` to the base interface (which would require all existing session types to implement it) while still dispatching cleanup to sessions that carry resources.
+- `pkg/transport/session/manager.go` - Update `DeleteExpired()` to call `session.Close()` before removing from storage (fixes resource leak). Because the storage layer operates on the base `transportsession.Session` interface (which lacks a `Close()` method, unlike the vMCP-specific `Session` interface defined in this RFC), use an optional interface check: `if closer, ok := sess.(io.Closer); ok { _ = closer.Close() }`. This avoids adding `Close()` to the base `transportsession.Session` interface (which would require all existing session types to implement it) while still dispatching cleanup to sessions that carry active resources.
 - Delete old `VMCPSession` implementation files
 
 **Rationale**: Once the new code path is validated, delete the old code entirely rather than maintaining both paths. This avoids technical debt and ongoing maintenance burden.
```