[FEATURE] Graceful MCP server startup failures - fail open #1481

@mkmeral

Description

Problem Statement

When loading multiple MCP servers (especially from a config file), if any single MCP server fails to start, the entire agent initialization fails. This creates a poor developer experience where one misconfigured or unavailable MCP server prevents the agent from using any other tools.

Currently, if you have 5 MCP servers configured and the 3rd one fails to connect (wrong path, server down, auth issue), the exception aborts initialization and the agent ends up unusable, instead of running with the 4 servers that would have worked.

Proposed Solution

Implement a "fail open" strategy for MCP client initialization:

  1. When starting multiple MCP clients, catch startup failures per-client
  2. Log a warning for failed clients but continue loading others
  3. Return successfully loaded tools from healthy clients
  4. Optionally provide a callback or return value indicating which servers failed

Example behavior:

# Current behavior (fail_fast=True, the default) - unchanged
clients = [mcp1, mcp2_broken, mcp3]  # Throws exception, agent unusable

# New opt-in behavior (fail_fast=False) - graceful degradation
clients = load_mcp_clients(config, fail_fast=False)
# Logs: "WARNING: Failed to start MCP server 'mcp2_broken', skipping: Connection refused"
# Agent works with tools from mcp1 and mcp3

To maintain backwards compatibility, add a fail_fast parameter that defaults to True, which preserves the current strict behavior. Users can opt in to graceful degradation with fail_fast=False.
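
A minimal sketch of how the proposed steps and the fail_fast switch could fit together. Everything here is an assumption for illustration, not the SDK's API: the `StartableClient` protocol only mirrors the `start()`, `stop()` and `list_tools_sync()` calls used elsewhere in this issue, `LoadResult` and `on_failure` are hypothetical names, and the function takes a mapping of name to client factory rather than the `config` object shown above.

```python
import logging
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional, Protocol

logger = logging.getLogger(__name__)


class StartableClient(Protocol):
    """Minimal client interface assumed by this sketch (hypothetical)."""

    def start(self) -> Any: ...
    def stop(self) -> Any: ...
    def list_tools_sync(self) -> List[Any]: ...


@dataclass
class LoadResult:
    """Hypothetical return type: healthy clients, their tools, per-server errors."""

    clients: List[StartableClient] = field(default_factory=list)
    tools: List[Any] = field(default_factory=list)
    failures: Dict[str, Exception] = field(default_factory=dict)


def load_mcp_clients(
    factories: Dict[str, Callable[[], StartableClient]],
    fail_fast: bool = True,
    on_failure: Optional[Callable[[str, Exception], None]] = None,
) -> LoadResult:
    """Start each client; re-raise on the first error unless fail_fast is False."""
    result = LoadResult()
    for name, factory in factories.items():
        client = None
        try:
            client = factory()
            client.start()
            tools = client.list_tools_sync()
        except Exception as exc:
            # Best-effort cleanup of a partially started client.
            if client is not None:
                try:
                    client.stop()
                except Exception:
                    pass
            if fail_fast:
                raise  # strict mode: the first failure propagates to the caller
            logger.warning("Failed to start MCP server '%s', skipping: %s", name, exc)
            result.failures[name] = exc
            if on_failure is not None:
                on_failure(name, exc)
            continue
        result.clients.append(client)
        result.tools.extend(tools)
    return result
```

Returning the failures alongside the healthy clients covers step 4 without forcing a callback on every caller. Note that in strict mode this sketch does not stop clients that had already started before the failure; a real implementation would need to decide who owns that cleanup.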

Use Case

  • Config-driven agents: Loading MCP servers from mcp.json where some servers may be optional or environment-specific
  • Development workflows: Testing with partial MCP availability without needing all servers running
  • Production resilience: Agent continues functioning even if one MCP server has temporary issues
  • Multi-tenant setups: Different users may have access to different MCP servers

Alternative Solutions

  1. Wrap each MCPClient in try/except manually - works but verbose and error-prone
  2. Pre-validate MCP configs before loading - doesn't help with runtime failures
  3. Lazy loading of MCP clients - more complex, changes tool discovery timing
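
To illustrate why alternative 2 falls short, here is a rough sketch of a static pre-validation pass. The config shape (a per-server dict with a `command` or `url` key) is an assumption based on common mcp.json layouts, and `prevalidate` is a hypothetical helper, not an existing function.

```python
import shutil
from typing import Any, Dict, List


def prevalidate(servers: Dict[str, Dict[str, Any]]) -> Dict[str, List[str]]:
    """Return static config problems per server name (empty dict if none found)."""
    problems: Dict[str, List[str]] = {}
    for name, cfg in servers.items():
        issues: List[str] = []
        command = cfg.get("command")
        url = cfg.get("url")
        if not command and not url:
            issues.append("neither 'command' nor 'url' is set")
        elif command and shutil.which(command) is None:
            # Catches a wrong path or a missing binary before startup.
            issues.append(f"command not found on PATH: {command}")
        if issues:
            problems[name] = issues
    return problems
```

A check like this can flag a missing executable, but a URL-based entry passes even when nothing is listening on that address, so a server that is down or rejects credentials still fails only at startup.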

Additional Context

I implemented this pattern in a wrapper project and it significantly improved DX:

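# Fail open approach
import logging

logger = logging.getLogger(__name__)
successful_servers = []

# servers: mapping of server name to its config (e.g. parsed from mcp.json)
for name, server_config in servers.items():
    client = None
    try:
        # build_client is a hypothetical helper that creates a client from one config entry
        client = build_client(server_config)
        client.start()
        tools = client.list_tools_sync()
        successful_servers.append(client)  # append: client is a single object, not an iterable
        logger.info(f"Loaded {len(tools)} tools from MCP server: {name}")
    except Exception as e:
        logger.warning(f"Failed to start MCP server {name}, skipping: {e}")
        if client is not None:
            try:
                client.stop()  # best-effort cleanup of a partially started client
            except Exception:
                pass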

This becomes especially important with #482 (config file loading) since users will likely have multiple servers defined.

Metadata

Labels: area-config (related to config-based agents or mcp-config), area-mcp (MCP related), enhancement (new feature or request)