fix: exit MCP anyio contexts in the lifecycle task on disable#9132
fix: exit MCP anyio contexts in the lifecycle task on disable#9132kldsjfas wants to merge 3 commits into
Conversation
AstrBotDevs#9070 moved MCP shutdown cleanup into the per-server lifecycle task so that the anyio cancel scopes entered in connect_to_server() are exited from the task that entered them. Running the cleanup through asyncio.shield() defeats that: shield wraps the coroutine in a new task, so disabling a server still fails with "Attempted to exit cancel scope in a different task than it was entered in" and leaves the scope state corrupted -- the failure mode behind the CPU spin reported in AstrBotDevs#9068. Await _terminate_mcp_client() directly instead. The graceful disable path has no cancellation in flight, and on forced shutdown the lifecycle task still runs its finally block in-task, so the shield only served to break task affinity. Closes AstrBotDevs#9068
There was a problem hiding this comment.
Hey - I've found 1 issue
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location path="astrbot/core/provider/func_tool_manager.py" line_range="727-729" />
<code_context>
- # Cleanup in the same task that entered the anyio contexts
- await asyncio.shield(self._terminate_mcp_client(name))
+ # Cleanup in the same task that entered the anyio contexts.
+ # No asyncio.shield() here: it would run the cleanup in a new
+ # task, and anyio cancel scopes cannot exit across tasks (#9068).
+ await self._terminate_mcp_client(name)
lifecycle_task = asyncio.create_task(
</code_context>
<issue_to_address>
**issue (bug_risk):** Revisit removal of `asyncio.shield` and the explanation about it running in a new task
The updated comment misdescribes `asyncio.shield`: it doesn’t create a new task, it only prevents the current task’s awaitable from being cancelled. With the change, `await self._terminate_mcp_client(name)` can now be cancelled directly, which risks leaving cleanup incomplete if the lifecycle task is cancelled mid-termination.
If we still want to avoid crossing anyio cancel scopes while ensuring cleanup completes, consider adding shielding/cancellation handling inside `_terminate_mcp_client` instead. If partial cleanup on cancellation is acceptable, then the comment should be revised so it doesn’t imply that `asyncio.shield` runs the cleanup in a separate task.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| # No asyncio.shield() here: it would run the cleanup in a new | ||
| # task, and anyio cancel scopes cannot exit across tasks (#9068). | ||
| await self._terminate_mcp_client(name) |
There was a problem hiding this comment.
issue (bug_risk): Revisit removal of asyncio.shield and the explanation about it running in a new task
The updated comment misdescribes asyncio.shield: it doesn’t create a new task, it only prevents the current task’s awaitable from being cancelled. With the change, await self._terminate_mcp_client(name) can now be cancelled directly, which risks leaving cleanup incomplete if the lifecycle task is cancelled mid-termination.
If we still want to avoid crossing anyio cancel scopes while ensuring cleanup completes, consider adding shielding/cancellation handling inside _terminate_mcp_client instead. If partial cleanup on cancellation is acceptable, then the comment should be revised so it doesn’t imply that asyncio.shield runs the cleanup in a separate task.
There was a problem hiding this comment.
Code Review
This pull request removes asyncio.shield() from the MCP client termination call in the finally block of connect_and_lifecycle to ensure cleanup runs in the same task, addressing an issue where AnyIO cancel scopes cannot exit across tasks. It also adds a corresponding unit test. The review feedback highlights a potential issue where task cancellation during shutdown could prevent the cleanup from completing because the first await in _terminate_mcp_client would raise CancelledError. A suggestion is provided to use task.uncancel() to safely run the cleanup before re-cancelling.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| finally: | ||
| # Cleanup in the same task that entered the anyio contexts | ||
| await asyncio.shield(self._terminate_mcp_client(name)) | ||
| # Cleanup in the same task that entered the anyio contexts. | ||
| # No asyncio.shield() here: it would run the cleanup in a new | ||
| # task, and anyio cancel scopes cannot exit across tasks (#9068). | ||
| await self._terminate_mcp_client(name) |
There was a problem hiding this comment.
If the lifecycle task is cancelled (e.g., due to a timeout during startup or shutdown), the finally block is entered with a pending cancellation. Without shielding or uncancelling, the first await inside _terminate_mcp_client (such as async with self._runtime_lock:) will immediately raise asyncio.CancelledError. This prevents the cleanup from completing, leaving the server in a half-broken state (not removed from self._mcp_server_runtime and its tools not removed from self.func_list).
Since asyncio.shield() cannot be used here (as it runs the cleanup in a new task, violating AnyIO's task-affinity requirement), we can use asyncio.current_task().uncancel() (available in Python 3.11+) to temporarily clear the cancellation state during the cleanup, and then re-cancel the task afterwards.
finally:
# Cleanup in the same task that entered the anyio contexts.
# No asyncio.shield() here: it would run the cleanup in a new
# task, and anyio cancel scopes cannot exit across tasks (#9068).
task = asyncio.current_task()
is_cancelled = False
if hasattr(task, "cancelling") and task.cancelling() > 0:
is_cancelled = True
task.uncancel()
try:
await self._terminate_mcp_client(name)
finally:
if is_cancelled:
task.cancel()A cancellation delivered while the lifecycle task is already running its finally-block cleanup would abort _terminate_mcp_client() halfway, stranding the runtime entry and the MCP transport. Retry the cleanup in the same task and uncancel() the absorbed request instead, so a forced shutdown can no longer skip it.
|
Addressed the cancellation concern in 419c448: if a cancellation lands while the finally-block cleanup is running, the cleanup is now retried in the same task with On the |
|
@gemini-code-assist /gemini review |
There was a problem hiding this comment.
Code Review
This pull request refactors the MCP client cleanup logic in func_tool_manager.py to run within the same task that initiated the connection, avoiding issues with asyncio.shield and anyio cancel scopes. It also introduces a loop to absorb late cancellations and uses task.uncancel() to ensure cleanup completes. Corresponding unit tests have been added. Feedback highlights a compatibility issue where task.uncancel() is not supported in Python 3.10, potentially causing an AttributeError, and suggests guarding the call with hasattr.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| if task is not None: | ||
| task.uncancel() |
There was a problem hiding this comment.
Since the project supports Python 3.10 (as indicated by the python:3.10-slim Dockerfile and documentation), calling task.uncancel() directly will raise an AttributeError because uncancel() was introduced in Python 3.11.
To ensure compatibility with Python 3.10, we should guard the call by checking if the uncancel method exists on the task object.
| if task is not None: | |
| task.uncancel() | |
| if task is not None and hasattr(task, 'uncancel'): | |
| task.uncancel() |
|
On the Python 3.10 compatibility concern: the project's supported floor is Python 3.12 — |
Hi, many users are still using 3.10.x, it would be better if the change can compatible to 3.10 |
Task.uncancel() only exists on Python 3.11+. On 3.10 absorbing the CancelledError is sufficient, so skip the call when unavailable.
|
added the hasattr guard in 86094f2, so on 3.10 uncancel() is just skipped. should be fine there — catching the CancelledError is already enough for the retry, uncancel() is only 3.11+ bookkeeping. rest of the change doesn't use any 3.11+ API. |
#9068 is still reproducible on current master: disabling an MCP server exits the anyio cancel scopes from a task other than the one that entered them, which corrupts their state — the failure mode behind the reported 100% CPU spin.
#9070 introduced the per-server lifecycle task precisely so that cleanup happens in the task that entered the anyio contexts, but running the cleanup through
asyncio.shield()defeats that:shield()wraps the coroutine in a new task, soexit_stack.aclose()still runs outside the owning task and fails withAttempted to exit cancel scope in a different task than it was entered in(visible at debug level) on every plain enable → disable, no cancellation involved.Modifications / 改动点
astrbot/core/provider/func_tool_manager.py: await_terminate_mcp_client()directly in the lifecycle task'sfinallyinstead of wrapping it inasyncio.shield(). The graceful disable path has no cancellation in flight, and on forced shutdown the lifecycle task still runs itsfinallyin-task, so the shield only served to break task affinity.tests/unit/test_func_tool_manager.py: regression test asserting that shutdown cleanup runs in the same task that connected (fails on master, passes with this fix).Related: a reconnect from a tool call can still create an exit stack owned by a different task (#8056, addressed by the larger rework in #8307). This PR deliberately keeps to the disable path and does not touch
mcp_client.py, so it does not conflict with #8307.Screenshots or Test Results / 运行截图或测试结果
Enable → disable of a real stdio MCP server (FastMCP echo server), instrumented to record which task runs the cleanup:
On master (20008f1):
With this fix:
uv run pytest tests/unit: 666 passed; the 6 failures on this Windows machine (test_computer, test_message_tools, test_astr_main_agent) are identical on clean master.ruff format --checkandruff checkpass on the changed files.Checklist / 检查清单
😊 If there are new features added in the PR, I have discussed it with the authors through issues/emails, etc.
/ 如果 PR 中有新加入的功能,已经通过 Issue / 邮件等方式和作者讨论过。
👀 My changes have been well-tested, and "Verification Steps" and "Screenshots" have been provided above.
/ 我的更改经过了良好的测试,并已在上方提供了“验证步骤”和“运行截图”。
🤓 I have ensured that no new dependencies are introduced, OR if new dependencies are introduced, they have been added to the appropriate locations in
requirements.txtandpyproject.toml./ 我确保没有引入新依赖库,或者引入了新依赖库的同时将其添加到
requirements.txt和pyproject.toml文件相应位置。😮 My changes do not introduce malicious code.
/ 我的更改没有引入恶意代码。
Summary by Sourcery
Ensure MCP server disable cleanup runs in the same lifecycle task that established the connection to avoid anyio cancel scope corruption and CPU spin.
Bug Fixes:
Tests: