fix: return MCP connection errors to LLM instead of raising#1531
fix: return MCP connection errors to LLM instead of raising#1531jsonmp-k8 wants to merge 4 commits intokagent-dev:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
Wraps MCP tools to gracefully surface persistent connection failures to the LLM as normal tool output (instead of raising), preventing tight retry loops and high CPU usage in static agent runs (Fixes #1530).
Changes:
- Add
ConnectionSafeMcpToolthat catches connection-related exceptions and returns an error payload instructing the LLM not to retry. - Update
KAgentMcpToolset.get_tools()to wrap returnedMcpToolinstances withConnectionSafeMcpTool. - Add unit tests covering connection vs non-connection error behavior (including
CancelledErrorpropagation).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
python/packages/kagent-adk/src/kagent/adk/_mcp_toolset.py |
Introduces ConnectionSafeMcpTool and wraps MCP tools returned by KAgentMcpToolset to avoid raising persistent connection failures. |
python/packages/kagent-adk/tests/unittests/test_mcp_connection_error_handling.py |
Adds pytest coverage ensuring connection errors are returned as error text while other exceptions still propagate. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
python/packages/kagent-adk/tests/unittests/test_mcp_connection_error_handling.py
Outdated
Show resolved
Hide resolved
41c3c17 to
aa46632
Compare
9fa4fe9 to
ae91dd0
Compare
|
@EItanya can you please review this ? |
5e717fc to
f4a74cd
Compare
EItanya
left a comment
There was a problem hiding this comment.
Thanks for the thorough root cause analysis and fix, @jsonmp-k8! The approach of catching connection errors at the tool level to prevent the retry spin loop makes sense. A few observations:
Concerns
1. McpError may be too broad
McpError is the general MCP protocol error class, not just connection/transport errors. Catching all McpError could silently swallow legitimate protocol-level errors (e.g., invalid tool arguments, server-side validation failures) that the LLM should actually see as actionable errors to correct its behavior. Could this be narrowed — e.g., by inspecting the error code or message to distinguish transport-level failures from protocol-level ones?
2. The __new__ + __dict__ copy pattern is fragile
The wrapping creates ConnectionSafeMcpTool instances by bypassing __init__ and copying __dict__. This works today since McpTool uses plain instance attributes, but it's tightly coupled to the internal implementation of McpTool (from google-adk). If upstream adds __slots__, properties, or post-init hooks, this will break silently. A composition approach (delegating to the inner tool rather than re-typing the instance) would be more robust against upstream changes.
3. Add exc_info to the error log
The logger.error(error_message) call loses the traceback, which makes debugging intermittent network issues harder for operators. Consider:
logger.error(error_message, exc_info=error)4. Minor style: Dict import unnecessary
The file already has from __future__ import annotations, so dict[str, Any] works without importing Dict from typing.
Comment left by Claude on behalf of @EItanya
Address review feedback on kagent-dev#1531: - Only catch transport-level McpErrors (timeouts, stream drops) via keyword inspection; protocol-level McpErrors (invalid args, validation) now propagate so the LLM can correct its behavior - Replace fragile __new__ + __dict__ copy with composition pattern: store inner McpTool and delegate via __getattr__ - Add exc_info to logger.error() for operator-visible tracebacks - Remove unused Dict import, use lowercase dict[str, Any] - Simplify test setup: mock inner tool directly, remove patch boilerplate - Add test_protocol_mcp_error_still_raises to verify narrowing Signed-off-by: Jaison Paul <paul.jaison@gmail.com>
When an MCP HTTP tool call fails with a persistent connection error (e.g. "connection reset by peer"), the error propagates to the LLM as a function error. The LLM interprets this as transient and retries the same tool call, creating a tight loop that burns 100% CPU for up to max_llm_calls (500) iterations. Wrap McpTool instances with ConnectionSafeMcpTool that catches connection errors (ConnectionError, TimeoutError, httpx.TransportError, McpError) and returns them as error text. This lets the LLM inform the user about the failure instead of retrying indefinitely. Fixes kagent-dev#1530 Signed-off-by: Jaison Paul <paul.jaison@gmail.com>
Address review feedback on kagent-dev#1531: - Only catch transport-level McpErrors (timeouts, stream drops) via keyword inspection; protocol-level McpErrors (invalid args, validation) now propagate so the LLM can correct its behavior - Replace fragile __new__ + __dict__ copy with composition pattern: store inner McpTool and delegate via __getattr__ - Add exc_info to logger.error() for operator-visible tracebacks - Remove unused Dict import, use lowercase dict[str, Any] - Simplify test setup: mock inner tool directly, remove patch boilerplate - Add test_protocol_mcp_error_still_raises to verify narrowing Signed-off-by: Jaison Paul <paul.jaison@gmail.com>
996ceb3 to
48764d8
Compare
|
Thanks for the detailed review, @EItanya! All four points addressed in 996ceb3: 1. McpError narrowingRemoved 2. Composition over
|
Signed-off-by: Jaison Paul <paul.jaison@gmail.com>
Summary
McpToolinstances withConnectionSafeMcpToolthat catches persistent connection errors and returns them as error text to the LLMConnectionError(stdlib),TimeoutError(stdlib),httpx.TransportError(httpx network/timeout/protocol errors), andMcpError(MCP session stream drops and read timeouts)KAgentMcpToolset.get_tools()automatically wraps allMcpToolinstancesRoot cause
When an MCP HTTP tool call fails with "connection reset by peer", the error propagates up to the ADK flow handler, which sends it back to the LLM as a function error. The LLM interprets this as a transient failure and retries the same tool call — creating a tight loop of LLM call → tool call → connection error → LLM call for up to
max_llm_calls(500) iterations, burning 100% CPU.The MCP client wraps transport-level errors into
McpErrorviamcp.shared.session.send_request()before they reach the tool, so catching only stdlib/httpx errors is insufficient —McpErrormust also be handled.Testing
python -m pytest python/packages/kagent-adk/tests/unittests/test_mcp_connection_error_handling.py -v(10 tests)python -m pytest python/packages/kagent-adk/tests/unittests/ -v(170 passed)Test coverage:
ConnectionResetError,ConnectionRefusedError,TimeoutError— caught, returned as error dicthttpx.ConnectError,httpx.ReadError,httpx.ConnectTimeout— caught viahttpx.TransportErrorMcpError(session read timeout) — caught, returned as error dictValueError,CancelledError— still raised (not connection errors)KAgentMcpToolset.get_tools()wrapsMcpTool→ConnectionSafeMcpToolFixes #1530