Skip to content

Commit f8c8edd

Browse files
committed
changes from review
1 parent 500ce67 commit f8c8edd

1 file changed

Lines changed: 73 additions & 39 deletions

File tree

rfcs/THV-0038-session-scoped-client-lifecycle.md

Lines changed: 73 additions & 39 deletions
Original file line numberDiff line numberDiff line change
@@ -616,7 +616,7 @@ The default factory implementation follows this pattern:
616616
- **Client initialization includes MCP handshake**: Each client sends `InitializeRequest` to its backend, and the backend responds with capabilities and its own `Mcp-Session-Id`. The client stores the session ID for protocol compliance (includes it in subsequent request headers).
617617
- **Capture backend session IDs**: Factory also captures each backend's session ID (via `client.SessionID()`) for observability, storing them in a map to pass to the session
618618
- **Performance requirement**: Use parallel initialization (e.g., `errgroup` with bounded concurrency) to avoid sequential latency accumulation. Connection initialization (TCP handshake + TLS negotiation + MCP protocol handshake) can take tens to hundreds of milliseconds per backend depending on network latency and backend responsiveness. With 20 backends, sequential initialization could easily exceed acceptable session creation latency.
619-
- **Bounded concurrency**: Limit parallel goroutines (e.g., 10 concurrent initializations) to avoid resource exhaustion. This limit is **per-session-creation** (not global), implemented as a semaphore inside the factory. It should be a configurable vMCP server-level parameter (e.g., `max_backend_init_concurrency`, default: 10). Operators with many backends on a fast private network can raise it; resource-constrained deployments or backends with expensive initialization should lower it. A global limit across concurrent session creations is not necessary — the per-session semaphore already bounds the worst case per event.
619+
- **Bounded concurrency**: Limit parallel goroutines (e.g., 10 concurrent initializations) per session to avoid resource exhaustion. This limit is **per-session-creation**, implemented as a semaphore inside the factory via a configurable vMCP server-level parameter (`max_backend_init_concurrency`, default: 10). Operators should note that the aggregate system load is protected by the global `TOOLHIVE_MAX_SESSIONS` limit (see [Resource Exhaustion & DoS Protection](#concurrency--resource-safety)). For additional safety during traffic spikes, the factory may optionally implement a global initialization semaphore (e.g., `max_global_backend_init_concurrency`, default: 100) to cap the total number of simultaneous connection attempts across all active session creations (preventing, for example, 100 concurrent session requests from triggering 1,000 backend initializations).
620620
- **Per-backend timeout**: Apply context timeout (e.g., 5s per backend) so one slow backend doesn't block session creation
621621
- **Partial initialization**: If some backends fail, log warnings and continue with successfully initialized backends (failed backends not added to clients map)
622622
- Clients are connection-ready and stateful (each maintains its backend session for protocol use)
@@ -706,6 +706,7 @@ The default session implementation stores:
706706
- Used for: logging, metrics, health checks, debugging, and explicit session cleanup
707707
- Updated when clients are re-initialized (e.g., after backend session expiration)
708708
- RWMutex for thread-safe access (read lock for queries/calls, write lock for Close)
709+
- `singleflight.Group` (or per-backend locks) to coordinate concurrent re-initialization of backend sessions without stalling the whole session
709710

710711
**Backend session ID lifecycle management:**
711712

@@ -965,6 +966,9 @@ The session **continues operating** with remaining backends. If a backend client
965966
func (s *Session) CallTool(ctx context.Context, name string, arguments map[string]any) (*ToolResult, error) {
966967
backend := s.routingTable.Lookup(name)
967968
client := s.clients[backend.ID]
969+
if client == nil {
970+
return nil, fmt.Errorf("no client found for backend %s", backend.ID)
971+
}
968972

969973
// Call backend client - if it fails, return the error
970974
result, err := client.CallTool(ctx, name, arguments)
@@ -1055,9 +1059,19 @@ When vMCP detects backend session expiration (404 or "session expired" error), a
10551059
func (s *Session) CallTool(ctx context.Context, name string, arguments map[string]any) (*ToolResult, error) {
10561060
backend := s.routingTable.Lookup(name)
10571061

1058-
// Hold RLock only while reading from shared state; release before network I/O.
1062+
// Check if session is closed and increment in-flight counter
10591063
s.mu.RLock()
1064+
if s.closed {
1065+
s.mu.RUnlock()
1066+
return nil, fmt.Errorf("session closed")
1067+
}
10601068
client := s.clients[backend.ID]
1069+
if client == nil {
1070+
s.mu.RUnlock()
1071+
return nil, fmt.Errorf("no client found for backend %s", backend.ID)
1072+
}
1073+
s.inFlight.Add(1)
1074+
defer s.inFlight.Done()
10611075
s.mu.RUnlock()
10621076

10631077
result, err := client.CallTool(ctx, name, arguments)
@@ -1067,19 +1081,23 @@ func (s *Session) CallTool(ctx context.Context, name string, arguments map[strin
10671081
log.Warn("Backend %s session expired, re-initializing", backend.ID)
10681082

10691083
// Re-initialize the backend session (acquires write lock internally).
1070-
if reinitErr := s.reinitializeBackend(ctx, backend.ID); reinitErr != nil {
1084+
// Pass the failed client to avoid redundant re-initialization if multiple
1085+
// goroutines detect the failure simultaneously.
1086+
if reinitErr := s.reinitializeBackend(ctx, backend.ID, client); reinitErr != nil {
10711087
return nil, fmt.Errorf("failed to reinitialize backend %s: %w",
10721088
backend.ID, reinitErr)
10731089
}
10741090

10751091
// Retry the operation once with the re-initialized session.
1076-
// No further retries: if the new session also fails, surface the error
1077-
// immediately to avoid infinite loops or resource exhaustion.
10781092
// Re-read under RLock: reinitializeBackend replaced the client under write lock.
10791093
s.mu.RLock()
10801094
client = s.clients[backend.ID]
10811095
s.mu.RUnlock()
10821096

1097+
if client == nil {
1098+
return nil, fmt.Errorf("no client found for backend %s after re-initialization", backend.ID)
1099+
}
1100+
10831101
result, err = client.CallTool(ctx, name, arguments)
10841102
if err != nil {
10851103
return nil, fmt.Errorf("backend %s failed after re-initialization: %w",
@@ -1091,38 +1109,53 @@ func (s *Session) CallTool(ctx context.Context, name string, arguments map[strin
10911109
return result, err
10921110
}
10931111

1094-
func (s *Session) reinitializeBackend(ctx context.Context, backendID string) error {
1095-
s.mu.Lock()
1096-
defer s.mu.Unlock()
1097-
1098-
backend := s.getBackendConfig(backendID)
1099-
1100-
// Close old client (with expired backend session)
1101-
if oldClient := s.clients[backendID]; oldClient != nil {
1102-
oldClient.Close()
1103-
}
1104-
1105-
// Create NEW client (triggers new Initialize handshake with backend)
1106-
// This sends InitializeRequest to backend MCP server
1107-
newClient, err := s.clientFactory.CreateClient(ctx, backend, s.identity)
1108-
if err != nil {
1109-
return fmt.Errorf("client creation failed: %w", err)
1112+
func (s *Session) reinitializeBackend(ctx context.Context, backendID string, failedClient *Client) error {
1113+
// 1. Double-check if re-init is still needed under read lock
1114+
s.mu.RLock()
1115+
if s.clients[backendID] != failedClient {
1116+
s.mu.RUnlock()
1117+
return nil
11101118
}
1119+
s.mu.RUnlock()
11111120

1112-
// Backend responds with new session ID (in Mcp-Session-Id header)
1113-
// The client stores it for subsequent requests (protocol requirement)
1121+
// 2. Coordinate concurrent re-initialization (e.g., using singleflight)
1122+
// to ensure only one network initialization proceeds for this backendID.
1123+
_, err, _ := s.reinitGroup.Do(backendID, func() (interface{}, error) {
1124+
// Re-check after acquiring re-initialization group ownership
1125+
s.mu.RLock()
1126+
if s.clients[backendID] != failedClient {
1127+
s.mu.RUnlock()
1128+
return nil, nil
1129+
}
1130+
s.mu.RUnlock()
11141131

1115-
// Capture the new backend session ID for observability
1116-
newBackendSessionID := newClient.SessionID() // Assuming client exposes this method
1132+
backend := s.getBackendConfig(backendID)
11171133

1118-
// Update session state
1119-
oldSessionID := s.backendSessions[backendID]
1120-
s.backendSessions[backendID] = newBackendSessionID
1121-
s.clients[backendID] = newClient
1134+
// 3. Create NEW client (network I/O happens OUTSIDE the session-wide lock)
1135+
newClient, err := s.clientFactory.CreateClient(ctx, backend, s.identity)
1136+
if err != nil {
1137+
return nil, fmt.Errorf("client creation failed: %w", err)
1138+
}
11221139

1123-
log.Info("Backend %s session re-initialized: old=%s, new=%s",
1124-
backendID, oldSessionID, newBackendSessionID)
1125-
return nil
1140+
// 4. Update session state under write lock (short critical section)
1141+
s.mu.Lock()
1142+
oldSessionID := s.backendSessions[backendID]
1143+
newBackendSessionID := newClient.SessionID()
1144+
s.backendSessions[backendID] = newBackendSessionID
1145+
s.clients[backendID] = newClient
1146+
s.mu.Unlock()
1147+
1148+
// 5. Close the old client OUTSIDE any locks
1149+
if failedClient != nil {
1150+
failedClient.Close()
1151+
}
1152+
1153+
log.Info("Backend %s session re-initialized: old=%s, new=%s",
1154+
backendID, oldSessionID, newBackendSessionID)
1155+
return nil, nil
1156+
})
1157+
1158+
return err
11261159
}
11271160
```
11281161

@@ -1148,7 +1181,7 @@ func (s *Session) reinitializeBackend(ctx context.Context, backendID string) err
11481181
- **Approach**: Best effort - attempt keepalive, gracefully handle backends that don't support it
11491182
- **Configuration**: Enable per backend, configurable interval (default: 5 min)
11501183

1151-
The preferred keepalive method is the MCP spec-defined `ping` protocol request, which is side-effect-free and supported by all compliant servers; explicit tool calls should only be used as a fallback. Keepalive failures must not affect healthy sessions — after N consecutive failures the feature should be disabled for that backend, with a periodic probe to re-enable on recovery. Keepalive should default to disabled for stateless backends or where TTL alignment already covers the session lifetime. The keepalive goroutine must hold the backend lock to avoid races with session re-initialization. Operators should be able to observe keepalive health via per-backend metrics covering attempt counts, failure reasons, and auto-disable events.
1184+
The preferred keepalive method is the MCP spec-defined `ping` protocol request, which is side-effect-free and supported by all compliant servers; explicit tool calls should only be used as a fallback. Keepalive failures must not affect healthy sessions — after N consecutive failures the feature should be disabled for that backend, with a periodic probe to re-enable on recovery. Keepalive should default to disabled for stateless backends or where TTL alignment already covers the session lifetime. The keepalive goroutine must use the same in-flight counter (`sync.WaitGroup`) approach as other operations to avoid races with session re-initialization while ensuring no locks are held during network I/O. Operators should be able to observe keepalive health via per-backend metrics covering attempt counts, failure reasons, and auto-disable events.
11521185

11531186
2. **Session TTL alignment**:
11541187
- Configure backend session TTLs longer than vMCP session TTL
@@ -1231,10 +1264,11 @@ For initial implementation, we assume most backends use long-lived credentials (
12311264

12321265
**Required for production deployment**:
12331266
1. **Session binding to authentication token**:
1234-
- Store a cryptographic hash of the original authentication token (e.g., `SHA256(bearerToken)`) in the session during creation
1235-
- On each request, validate that the current auth token hash matches the session's bound token hash
1236-
- If mismatch, reject with "session authentication mismatch" error and terminate session
1237-
- This prevents stolen session IDs from being used with different credentials
1267+
- Store a secure cryptographic hash of the original authentication token in the session during creation. To prevent offline attacks if session state is leaked (e.g., from Redis/Valkey), prefer a keyed hash (e.g., `HMAC-SHA256` with a server-managed secret and a per-session salt).
1268+
- On each request, validate the current auth token against the session's bound hash using a constant-time comparison to prevent timing attacks.
1269+
- If mismatch, reject with "session authentication mismatch" error and terminate session.
1270+
- Ensure the hash value is treated as sensitive and is never logged or exposed in traces.
1271+
- This prevents stolen session IDs from being used with different credentials.
12381272

12391273
2. **TLS-only enforcement**:
12401274
- Require TLS for all vMCP connections (prevent session ID interception)
@@ -1503,7 +1537,7 @@ if config.SessionManagementV2 {
15031537
- Verify old code path still works when flag is disabled (no regressions)
15041538

15051539
**Security (blocking for production rollout)**:
1506-
- Implement token hash binding during `CreateSession()`: store `SHA256(bearerToken)` in the session, validate on each subsequent request, reject with "session authentication mismatch" and terminate on mismatch (see Security Considerations → Session Hijacking Prevention). This must be completed before the feature flag is enabled in any production environment.
1540+
- Implement token hash binding during `CreateSession()`: store a secure keyed hash (e.g., `HMAC-SHA256` with a server-managed secret and a per-session salt) of the original authentication token in the session. Validate this hash on each subsequent request using constant-time comparison. Reject with "session authentication mismatch" and terminate the session on mismatch (see Security Considerations → Session Hijacking Prevention). This must be completed before the feature flag is enabled in any production environment.
15071541

15081542
**Files Modified**:
15091543
- `pkg/vmcp/server/server.go` - Add conditional logic based on `sessionManagementV2` flag
@@ -1537,7 +1571,7 @@ if config.SessionManagementV2 {
15371571
- `pkg/vmcp/server/server.go` - Remove old code path and feature flag conditionals
15381572
- `pkg/vmcp/discovery/middleware.go` - Delete (replaced by SessionFactory)
15391573
- `pkg/vmcp/client/client.go` - Remove `httpBackendClient` (replaced by Session ownership)
1540-
- `pkg/transport/session/manager.go` - Update `DeleteExpired()` to call `session.Close()` before removing from storage (fixes resource leak). Because the storage layer operates on the `Session` interface (which has no `Close()` method), use an optional interface check: `if closer, ok := sess.(io.Closer); ok { closer.Close() }`. This avoids adding `Close()` to the base interface (which would require all existing session types to implement it) while still dispatching cleanup to sessions that carry resources.
1574+
- `pkg/transport/session/manager.go` - Update `DeleteExpired()` to call `session.Close()` before removing from storage (fixes resource leak). Because the storage layer operates on the base `transportsession.Session` interface (which lacks a `Close()` method, unlike the vMCP-specific `Session` interface defined in this RFC), use an optional interface check: `if closer, ok := sess.(io.Closer); ok { _ = closer.Close() }`. This avoids adding `Close()` to the base `transportsession.Session` interface (which would require all existing session types to implement it) while still dispatching cleanup to sessions that carry active resources.
15411575
- Delete old `VMCPSession` implementation files
15421576

15431577
**Rationale**: Once the new code path is validated, delete the old code entirely rather than maintaining both paths. This avoids technical debt and ongoing maintenance burden.

0 commit comments

Comments
 (0)