Skip to content

Commit 895c6c8

Browse files
committed
changes from review
1 parent 500ce67 commit 895c6c8

1 file changed

Lines changed: 21 additions & 9 deletions

File tree

rfcs/THV-0038-session-scoped-client-lifecycle.md

Lines changed: 21 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -616,7 +616,7 @@ The default factory implementation follows this pattern:
616616
- **Client initialization includes MCP handshake**: Each client sends `InitializeRequest` to its backend, and the backend responds with capabilities and its own `Mcp-Session-Id`. The client stores the session ID for protocol compliance (includes it in subsequent request headers).
617617
- **Capture backend session IDs**: Factory also captures each backend's session ID (via `client.SessionID()`) for observability, storing them in a map to pass to the session
618618
- **Performance requirement**: Use parallel initialization (e.g., `errgroup` with bounded concurrency) to avoid sequential latency accumulation. Connection initialization (TCP handshake + TLS negotiation + MCP protocol handshake) can take tens to hundreds of milliseconds per backend depending on network latency and backend responsiveness. With 20 backends, sequential initialization could easily exceed acceptable session creation latency.
619-
- **Bounded concurrency**: Limit parallel goroutines (e.g., 10 concurrent initializations) to avoid resource exhaustion. This limit is **per-session-creation** (not global), implemented as a semaphore inside the factory. It should be a configurable vMCP server-level parameter (e.g., `max_backend_init_concurrency`, default: 10). Operators with many backends on a fast private network can raise it; resource-constrained deployments or backends with expensive initialization should lower it. A global limit across concurrent session creations is not necessary — the per-session semaphore already bounds the worst case per event.
619+
- **Bounded concurrency**: Limit parallel goroutines (e.g., 10 concurrent initializations) per session to avoid resource exhaustion. This limit is **per-session-creation**, implemented as a semaphore inside the factory via a configurable vMCP server-level parameter (`max_backend_init_concurrency`, default: 10). Operators should note that the aggregate system load is protected by the global `TOOLHIVE_MAX_SESSIONS` limit (see [Resource Exhaustion & DoS Protection](#concurrency--resource-safety)). For additional safety during traffic spikes, the factory may optionally implement a global initialization semaphore (e.g., `max_global_backend_init_concurrency`, default: 100) to cap the total number of simultaneous connection attempts across all active session creations (preventing, for example, 100 concurrent session requests from triggering 1,000 backend initializations).
620620
- **Per-backend timeout**: Apply context timeout (e.g., 5s per backend) so one slow backend doesn't block session creation
621621
- **Partial initialization**: If some backends fail, log warnings and continue with successfully initialized backends (failed backends not added to clients map)
622622
- Clients are connection-ready and stateful (each maintains its backend session for protocol use)
@@ -1055,9 +1055,15 @@ When vMCP detects backend session expiration (404 or "session expired" error), a
10551055
func (s *Session) CallTool(ctx context.Context, name string, arguments map[string]any) (*ToolResult, error) {
10561056
backend := s.routingTable.Lookup(name)
10571057

1058-
// Hold RLock only while reading from shared state; release before network I/O.
1058+
// Check if session is closed and increment in-flight counter
10591059
s.mu.RLock()
1060+
if s.closed {
1061+
s.mu.RUnlock()
1062+
return nil, fmt.Errorf("session closed")
1063+
}
10601064
client := s.clients[backend.ID]
1065+
s.inFlight.Add(1)
1066+
defer s.inFlight.Done()
10611067
s.mu.RUnlock()
10621068

10631069
result, err := client.CallTool(ctx, name, arguments)
@@ -1067,14 +1073,14 @@ func (s *Session) CallTool(ctx context.Context, name string, arguments map[strin
10671073
log.Warn("Backend %s session expired, re-initializing", backend.ID)
10681074

10691075
// Re-initialize the backend session (acquires write lock internally).
1070-
if reinitErr := s.reinitializeBackend(ctx, backend.ID); reinitErr != nil {
1076+
// Pass the failed client to avoid redundant re-initialization if multiple
1077+
// goroutines detect the failure simultaneously.
1078+
if reinitErr := s.reinitializeBackend(ctx, backend.ID, client); reinitErr != nil {
10711079
return nil, fmt.Errorf("failed to reinitialize backend %s: %w",
10721080
backend.ID, reinitErr)
10731081
}
10741082

10751083
// Retry the operation once with the re-initialized session.
1076-
// No further retries: if the new session also fails, surface the error
1077-
// immediately to avoid infinite loops or resource exhaustion.
10781084
// Re-read under RLock: reinitializeBackend replaced the client under write lock.
10791085
s.mu.RLock()
10801086
client = s.clients[backend.ID]
@@ -1091,15 +1097,21 @@ func (s *Session) CallTool(ctx context.Context, name string, arguments map[strin
10911097
return result, err
10921098
}
10931099

1094-
func (s *Session) reinitializeBackend(ctx context.Context, backendID string) error {
1100+
func (s *Session) reinitializeBackend(ctx context.Context, backendID string, failedClient *Client) error {
10951101
s.mu.Lock()
10961102
defer s.mu.Unlock()
10971103

1104+
// Check if another goroutine already re-initialized this backend while we were waiting for the lock
1105+
if currentClient := s.clients[backendID]; currentClient != failedClient {
1106+
log.Debug("Backend %s already re-initialized by another goroutine", backendID)
1107+
return nil
1108+
}
1109+
10981110
backend := s.getBackendConfig(backendID)
10991111

11001112
// Close old client (with expired backend session)
1101-
if oldClient := s.clients[backendID]; oldClient != nil {
1102-
oldClient.Close()
1113+
if failedClient != nil {
1114+
failedClient.Close()
11031115
}
11041116

11051117
// Create NEW client (triggers new Initialize handshake with backend)
@@ -1148,7 +1160,7 @@ func (s *Session) reinitializeBackend(ctx context.Context, backendID string) err
11481160
- **Approach**: Best effort - attempt keepalive, gracefully handle backends that don't support it
11491161
- **Configuration**: Enable per backend, configurable interval (default: 5 min)
11501162

1151-
The preferred keepalive method is the MCP spec-defined `ping` protocol request, which is side-effect-free and supported by all compliant servers; explicit tool calls should only be used as a fallback. Keepalive failures must not affect healthy sessions — after N consecutive failures the feature should be disabled for that backend, with a periodic probe to re-enable on recovery. Keepalive should default to disabled for stateless backends or where TTL alignment already covers the session lifetime. The keepalive goroutine must hold the backend lock to avoid races with session re-initialization. Operators should be able to observe keepalive health via per-backend metrics covering attempt counts, failure reasons, and auto-disable events.
1163+
The preferred keepalive method is the MCP spec-defined `ping` protocol request, which is side-effect-free and supported by all compliant servers; explicit tool calls should only be used as a fallback. Keepalive failures must not affect healthy sessions — after N consecutive failures the feature should be disabled for that backend, with a periodic probe to re-enable on recovery. Keepalive should default to disabled for stateless backends or where TTL alignment already covers the session lifetime. The keepalive goroutine must use the same in-flight counter (`sync.WaitGroup`) approach as other operations to avoid races with session re-initialization while ensuring no locks are held during network I/O. Operators should be able to observe keepalive health via per-backend metrics covering attempt counts, failure reasons, and auto-disable events.
11521164

11531165
2. **Session TTL alignment**:
11541166
- Configure backend session TTLs longer than vMCP session TTL

0 commit comments

Comments
 (0)