| title | Failure handling |
|---|---|
| description | Configure circuit breaker and partial failure modes to handle backend failures gracefully. |
Virtual MCP Server (vMCP) implements failure handling patterns to prevent cascading failures and provide graceful degradation when backends become unavailable. This guide covers circuit breaker configuration and partial failure modes.
:::tip
For backend health status monitoring and the /status endpoint, see
Backend discovery modes.
:::
When backends fail due to crashes, network issues, or rate limiting, vMCP provides circuit breaker and partial failure modes to handle failures gracefully:
- Circuit breaker: Prevents cascading failures by immediately rejecting requests to failing backends instead of waiting for timeouts
- Partial failure modes: Choose whether to fail entire requests or continue with available backends
- Automatic recovery: Backends are automatically restored when they recover
:::tip
Enable circuit breaker for production environments where backends may experience temporary failures (deployments, restarts, rate limits). For highly stable backends, health checks alone may be sufficient.
:::
The circuit breaker tracks backend failures and transitions through three states:
- Closed (normal operation): Requests pass through to the backend. Failures are counted.
- Open (failing state): After exceeding the failure threshold, the circuit opens. Requests fail immediately without contacting the backend.
- Half-open (recovery testing): After a timeout period, the circuit allows exactly one test request through. While this request is in progress, all other requests are rejected (circuit remains half-open). If the test succeeds, the circuit closes immediately and normal operation resumes. If it fails, the circuit reopens for another timeout period.
stateDiagram-v2
[*] --> Closed
Closed --> Open: Failure threshold exceeded
Open --> HalfOpen: Timeout elapsed
HalfOpen --> Closed: Request succeeds
HalfOpen --> Open: Request fails
Closed --> Closed: Success (reset count)
Configure circuit breaker in the VirtualMCPServer resource:
apiVersion: toolhive.stacklok.dev/v1beta1
kind: VirtualMCPServer
metadata:
name: my-vmcp
namespace: toolhive-system
spec:
groupRef:
name: my-group
config:
# highlight-start
operational:
failureHandling:
healthCheckInterval: 30s
unhealthyThreshold: 3
circuitBreaker:
enabled: true
failureThreshold: 5
timeout: 60s
# highlight-end
incomingAuth:
type: anonymous| Field | Description | Default |
|---|---|---|
healthCheckInterval |
Time between health checks for each backend | 30s |
unhealthyThreshold |
Consecutive failures before marking backend unhealthy | 3 |
healthCheckTimeout |
Maximum duration for a single health check | 10s |
statusReportingInterval |
Interval for reporting status to Kubernetes | 30s |
| Circuit breaker | ||
enabled |
Enable circuit breaker | false |
failureThreshold |
Number of failures before opening the circuit | 5 |
timeout |
Duration to wait before testing recovery | 60s |
:::note
Circuit breaker is disabled by default. Health checks run independently of the
circuit breaker and mark backends as healthy/unhealthy based on
unhealthyThreshold.
:::
:::note[Two failure thresholds]
vMCP uses two thresholds:
unhealthyThreshold(default: 3): Consecutive health check failures before marking backend unhealthyfailureThreshold(default: 5): Consecutive request failures before opening circuit breaker
Health checks detect failures during idle periods (max detection time:
healthCheckInterval × unhealthyThreshold). Circuit breaker provides fast
failure protection during active traffic.
:::
Configure how vMCP behaves when some backends are unavailable:
apiVersion: toolhive.stacklok.dev/v1beta1
kind: VirtualMCPServer
metadata:
name: my-vmcp
namespace: toolhive-system
spec:
groupRef:
name: my-group
config:
operational:
failureHandling:
# highlight-next-line
partialFailureMode: best_effort
incomingAuth:
type: anonymousfail(default): Entire request fails if any required backend is unavailable. Use when all backends must be operational.best_effort: Return results from healthy backends even if some fail. Tools from failed backends are omitted from responses. Use for graceful degradation.
With partialFailureMode: best_effort, if the GitHub backend is down but Fetch
is healthy, the tools/list response only includes tools from healthy backends:
{
"jsonrpc": "2.0",
"result": {
"tools": [{ "name": "fetch_url", "description": "Fetch URL content" }]
},
"id": 1
}GitHub tools are omitted from the response because the circuit breaker is open. The client doesn't see unavailable backend tools, preventing timeout errors when attempting to call them.
Check backend health and circuit state:
kubectl get virtualmcpserver my-vmcp -n toolhive-system -o yamlStatus includes health information and circuit breaker state:
status:
phase: Degraded # Ready|Degraded if some backends unhealthy
backendCount: 2 # Only counts ready backends (fetch-mcp, jira-mcp)
discoveredBackends:
- name: github-mcp
status: unavailable
lastHealthCheck: '2025-02-09T10:29:45Z'
message: 'connection timeout'
circuitBreakerState: open # Circuit breaker state: closed|open|half-open
circuitLastChanged: '2025-02-09T10:28:30Z' # When circuit opened
consecutiveFailures: 8 # Current failure count
- name: fetch-mcp
status: ready
lastHealthCheck: '2025-02-09T10:30:05Z'
circuitBreakerState: closed
consecutiveFailures: 0
- name: jira-mcp
status: ready
lastHealthCheck: '2025-02-09T10:30:03Z'
circuitBreakerState: half-open # Testing recovery
circuitLastChanged: '2025-02-09T10:30:00Z'
consecutiveFailures: 2 # Reduced after partial recoveryStatus fields:
status: Backend health (ready, degraded, unavailable, unknown)circuitBreakerState: Circuit state (closed, open, half-open) - empty if circuit breaker disabledcircuitLastChanged: When the circuit breaker state last changedconsecutiveFailures: Count of consecutive health check failuresmessage: Additional information about backend status or errors
The /status HTTP endpoint provides a simplified view:
curl http://localhost:4483/status{
"backends": [
{
"name": "github-mcp",
"health": "unhealthy",
"transport": "sse",
"auth_type": "token_exchange"
},
{
"name": "fetch-mcp",
"health": "healthy",
"transport": "streamable-http",
"auth_type": "unauthenticated"
}
],
"healthy": false,
"version": "v1.2.3",
"group_ref": "my-group"
}:::info
The /status endpoint provides basic health information but does not include
circuit breaker state. For detailed circuit breaker information
(circuitBreakerState, consecutiveFailures, circuitLastChanged), use the
Kubernetes status shown above. See
Backend discovery modes for
more details on the /status endpoint.
:::
Detect failures quickly and fail fast:
apiVersion: toolhive.stacklok.dev/v1beta1
kind: VirtualMCPServer
metadata:
name: production-vmcp
namespace: toolhive-system
spec:
groupRef:
name: production-backends
config:
operational:
failureHandling:
# Check every 10 seconds
healthCheckInterval: 10s
# Mark unhealthy after 2 failures (20 seconds)
unhealthyThreshold: 2
healthCheckTimeout: 5s
# Open circuit after 3 failures
circuitBreaker:
enabled: true
failureThreshold: 3
timeout: 30s
# Fail requests if any backend down
partialFailureMode: fail
incomingAuth:
type: oidc
oidcConfigRef:
name: my-issuer
audience: my-vmcpContinue with available backends:
apiVersion: toolhive.stacklok.dev/v1beta1
kind: VirtualMCPServer
metadata:
name: dev-vmcp
namespace: toolhive-system
spec:
groupRef:
name: dev-backends
config:
operational:
failureHandling:
healthCheckInterval: 30s
unhealthyThreshold: 3
circuitBreaker:
enabled: true
failureThreshold: 5
timeout: 60s
# Continue with healthy backends
partialFailureMode: best_effort
incomingAuth:
type: anonymousCircuit breaker opens too frequently
If the circuit breaker is too sensitive:
Increase failure threshold:
operational:
failureHandling:
circuitBreaker:
failureThreshold: 10 # Require more failures before openingIncrease timeout:
operational:
failureHandling:
circuitBreaker:
timeout: 120s # Give backends more time to recoverBackends not recovering automatically
If backends stay unhealthy after recovering:
-
Test backend connectivity
Verify the backend MCP server is accessible from vMCP:
kubectl exec -n toolhive-system deployment/vmcp-my-vmcp -- \ curl -v http://my-backend:8080/mcpThe backend should respond with MCP protocol headers.
-
Increase circuit breaker timeout
operational: failureHandling: circuitBreaker: timeout: 90s # Allow more time for full recovery
-
Review vMCP logs
kubectl logs -n toolhive-system deployment/vmcp-my-vmcp
Look for circuit breaker state transitions:
WARN Circuit breaker for backend github-mcp OPENED (threshold exceeded) INFO Circuit breaker for backend github-mcp CLOSED (recovery successful)
Healthy backends marked unhealthy
If backends are incorrectly marked unhealthy:
Increase health check timeout:
operational:
failureHandling:
healthCheckTimeout: 20s # Allow slower responsesIncrease unhealthy threshold:
operational:
failureHandling:
unhealthyThreshold: 5 # Allow more failures before marking unhealthy- Monitor vMCP activity with OpenTelemetry tracing and metrics
- Set up audit logging to track requests and failures
- Backend discovery modes - backend health status and
/statusendpoint - Configuration guide
- VirtualMCPServer CRD specification