Skip to content

fix: return MCP connection errors to LLM instead of raising#1531

Open
jsonmp-k8 wants to merge 4 commits intokagent-dev:mainfrom
jsonmp-k8:fix/1530-mcp-tool-call-cpu-spin
Open

fix: return MCP connection errors to LLM instead of raising#1531
jsonmp-k8 wants to merge 4 commits intokagent-dev:mainfrom
jsonmp-k8:fix/1530-mcp-tool-call-cpu-spin

Conversation

@jsonmp-k8
Copy link
Copy Markdown
Contributor

@jsonmp-k8 jsonmp-k8 commented Mar 20, 2026

Summary

  • Wrap McpTool instances with ConnectionSafeMcpTool that catches persistent connection errors and returns them as error text to the LLM
  • Catches ConnectionError (stdlib), TimeoutError (stdlib), httpx.TransportError (httpx network/timeout/protocol errors), and McpError (MCP session stream drops and read timeouts)
  • The error message includes the tool name, error type, and instructs the LLM not to retry
  • KAgentMcpToolset.get_tools() automatically wraps all McpTool instances

Root cause

When an MCP HTTP tool call fails with "connection reset by peer", the error propagates up to the ADK flow handler, which sends it back to the LLM as a function error. The LLM interprets this as a transient failure and retries the same tool call — creating a tight loop of LLM call → tool call → connection error → LLM call for up to max_llm_calls (500) iterations, burning 100% CPU.

The MCP client wraps transport-level errors into McpError via mcp.shared.session.send_request() before they reach the tool, so catching only stdlib/httpx errors is insufficient — McpError must also be handled.

Testing

  • python -m pytest python/packages/kagent-adk/tests/unittests/test_mcp_connection_error_handling.py -v (10 tests)
  • python -m pytest python/packages/kagent-adk/tests/unittests/ -v (170 passed)

Test coverage:

  • ConnectionResetError, ConnectionRefusedError, TimeoutError — caught, returned as error dict
  • httpx.ConnectError, httpx.ReadError, httpx.ConnectTimeout — caught via httpx.TransportError
  • McpError (session read timeout) — caught, returned as error dict
  • ValueError, CancelledError — still raised (not connection errors)
  • KAgentMcpToolset.get_tools() wraps McpToolConnectionSafeMcpTool

Fixes #1530

Copilot AI review requested due to automatic review settings March 20, 2026 20:23
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Wraps MCP tools to gracefully surface persistent connection failures to the LLM as normal tool output (instead of raising), preventing tight retry loops and high CPU usage in static agent runs (Fixes #1530).

Changes:

  • Add ConnectionSafeMcpTool that catches connection-related exceptions and returns an error payload instructing the LLM not to retry.
  • Update KAgentMcpToolset.get_tools() to wrap returned McpTool instances with ConnectionSafeMcpTool.
  • Add unit tests covering connection vs non-connection error behavior (including CancelledError propagation).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
python/packages/kagent-adk/src/kagent/adk/_mcp_toolset.py Introduces ConnectionSafeMcpTool and wraps MCP tools returned by KAgentMcpToolset to avoid raising persistent connection failures.
python/packages/kagent-adk/tests/unittests/test_mcp_connection_error_handling.py Adds pytest coverage ensuring connection errors are returned as error text while other exceptions still propagate.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@jsonmp-k8 jsonmp-k8 force-pushed the fix/1530-mcp-tool-call-cpu-spin branch 3 times, most recently from 41c3c17 to aa46632 Compare March 20, 2026 21:11
jmhbh
jmhbh previously approved these changes Mar 23, 2026
@jsonmp-k8 jsonmp-k8 force-pushed the fix/1530-mcp-tool-call-cpu-spin branch 2 times, most recently from 9fa4fe9 to ae91dd0 Compare March 24, 2026 21:37
@jsonmp-k8
Copy link
Copy Markdown
Contributor Author

@EItanya can you please review this ?

@jsonmp-k8 jsonmp-k8 force-pushed the fix/1530-mcp-tool-call-cpu-spin branch 3 times, most recently from 5e717fc to f4a74cd Compare April 3, 2026 06:08
Copy link
Copy Markdown
Contributor

@EItanya EItanya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the thorough root cause analysis and fix, @jsonmp-k8! The approach of catching connection errors at the tool level to prevent the retry spin loop makes sense. A few observations:

Concerns

1. McpError may be too broad

McpError is the general MCP protocol error class, not just connection/transport errors. Catching all McpError could silently swallow legitimate protocol-level errors (e.g., invalid tool arguments, server-side validation failures) that the LLM should actually see as actionable errors to correct its behavior. Could this be narrowed — e.g., by inspecting the error code or message to distinguish transport-level failures from protocol-level ones?

2. The __new__ + __dict__ copy pattern is fragile

The wrapping creates ConnectionSafeMcpTool instances by bypassing __init__ and copying __dict__. This works today since McpTool uses plain instance attributes, but it's tightly coupled to the internal implementation of McpTool (from google-adk). If upstream adds __slots__, properties, or post-init hooks, this will break silently. A composition approach (delegating to the inner tool rather than re-typing the instance) would be more robust against upstream changes.

3. Add exc_info to the error log

The logger.error(error_message) call loses the traceback, which makes debugging intermittent network issues harder for operators. Consider:

logger.error(error_message, exc_info=error)

4. Minor style: Dict import unnecessary

The file already has from __future__ import annotations, so dict[str, Any] works without importing Dict from typing.


Comment left by Claude on behalf of @EItanya

jsonmp-k8 added a commit to jsonmp-k8/kagent that referenced this pull request Apr 7, 2026
Address review feedback on kagent-dev#1531:
- Only catch transport-level McpErrors (timeouts, stream drops) via
  keyword inspection; protocol-level McpErrors (invalid args, validation)
  now propagate so the LLM can correct its behavior
- Replace fragile __new__ + __dict__ copy with composition pattern:
  store inner McpTool and delegate via __getattr__
- Add exc_info to logger.error() for operator-visible tracebacks
- Remove unused Dict import, use lowercase dict[str, Any]
- Simplify test setup: mock inner tool directly, remove patch boilerplate
- Add test_protocol_mcp_error_still_raises to verify narrowing

Signed-off-by: Jaison Paul <paul.jaison@gmail.com>
When an MCP HTTP tool call fails with a persistent connection error
(e.g. "connection reset by peer"), the error propagates to the LLM as
a function error. The LLM interprets this as transient and retries the
same tool call, creating a tight loop that burns 100% CPU for up to
max_llm_calls (500) iterations.

Wrap McpTool instances with ConnectionSafeMcpTool that catches
connection errors (ConnectionError, TimeoutError, httpx.TransportError,
McpError) and returns them as error text. This lets the LLM inform the
user about the failure instead of retrying indefinitely.

Fixes kagent-dev#1530

Signed-off-by: Jaison Paul <paul.jaison@gmail.com>
Address review feedback on kagent-dev#1531:
- Only catch transport-level McpErrors (timeouts, stream drops) via
  keyword inspection; protocol-level McpErrors (invalid args, validation)
  now propagate so the LLM can correct its behavior
- Replace fragile __new__ + __dict__ copy with composition pattern:
  store inner McpTool and delegate via __getattr__
- Add exc_info to logger.error() for operator-visible tracebacks
- Remove unused Dict import, use lowercase dict[str, Any]
- Simplify test setup: mock inner tool directly, remove patch boilerplate
- Add test_protocol_mcp_error_still_raises to verify narrowing

Signed-off-by: Jaison Paul <paul.jaison@gmail.com>
@jsonmp-k8 jsonmp-k8 force-pushed the fix/1530-mcp-tool-call-cpu-spin branch from 996ceb3 to 48764d8 Compare April 7, 2026 04:15
@jsonmp-k8
Copy link
Copy Markdown
Contributor Author

Thanks for the detailed review, @EItanya! All four points addressed in 996ceb3:

1. McpError narrowing

Removed McpError from _CONNECTION_ERROR_TYPES and now catch it separately. Added _is_transport_mcp_error() that checks the error message for transport-related keywords (timeout, connection, reset, closed, stream, etc.). Protocol-level McpErrors (e.g., invalid tool arguments) are re-raised so the LLM can correct itself. Added test_protocol_mcp_error_still_raises to verify.

2. Composition over __new__ + __dict__

ConnectionSafeMcpTool now stores the inner tool and delegates via __getattr__. Wrapping is just ConnectionSafeMcpTool(tool). This is robust against upstream __slots__, properties, or post-init changes.

3. exc_info on error log

Done — logger.error(error_message, exc_info=error).

4. Dict import

Removed — using lowercase dict[str, Any] with from __future__ import annotations.

jsonmp-k8 and others added 2 commits April 7, 2026 00:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] kagent-adk static agent spins at 100% CPU on persistent HTTP MCP tool call failures

4 participants