Summary
claude_agent_sdk._internal.query.Query.close(), via SubprocessCLITransport.close(), waits on the CLI subprocess (transport._process) and calls terminate() / kill() on it, but does not enumerate or signal the CLI's children. MCP servers spawned by the CLI are reparented to PID 1 after close and accumulate across repeated query() / ClaudeSDKClient usage. In a long-running parent process (multi-hour pipeline, daemon, test suite) this leaks one MCP server process per configured server per query.
Environment
- SDK: claude-agent-sdk==0.1.68 (also reproducible on 0.1.63 and 0.1.58; behavior unchanged).
- Python: 3.12.x.
- Platform: macOS (Darwin 25.x) and Linux (Ubuntu 22.04) both affected.
Expected
After await query_obj.close() (or exiting async with ClaudeSDKClient(...)), no descendant processes of the CLI remain running.
Actual
MCP server subprocesses are reparented to PID 1 (or the launcher's init) and persist until the parent Python process exits. On a host that spawns thousands of queries over hours, this manifests as OOM, file-descriptor exhaustion, and (for MCP servers that poll) CPU burn.
Evidence in the SDK source (0.1.68)
_internal/transport/subprocess_cli.py:512-564 (SubprocessCLITransport.close, abbreviated):
async def close(self) -> None:
    ...
    # Wait for graceful shutdown after stdin EOF, then terminate if needed.
    if self._process.returncode is None:
        try:
            with anyio.fail_after(5):
                await self._process.wait()
        except TimeoutError:
            # Graceful shutdown timed out — force terminate
            with suppress(ProcessLookupError):
                self._process.terminate()
            try:
                with anyio.fail_after(5):
                    await self._process.wait()
            except TimeoutError:
                # SIGTERM handler blocked — force kill (SIGKILL)
                with suppress(ProcessLookupError):
                    self._process.kill()
                with suppress(Exception):
                    await self._process.wait()
    ...
Only self._process (the CLI) is waited-on and signalled. There is no pgrep -P <cli_pid> / psutil.Process.children(recursive=True) walk, and the subprocess was not spawned in a new process group (no start_new_session=True / preexec_fn=os.setsid) that would let a single signal reach descendants via killpg.
_internal/query.py:807-828 (Query.close) calls through to self.transport.close() without adding any child-process cleanup of its own.
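The same effect can be reproduced without the SDK. The following is a minimal sketch, not SDK code: a stand-in parent (playing the CLI) starts a long-lived child (playing an MCP server), and the script then signals only the parent, as SubprocessCLITransport.close does. PARENT_CODE and terminate_parent_only are hypothetical names used for illustration, and a POSIX host is assumed.

```python
import os
import signal
import subprocess
import sys

# Stand-in for the CLI: starts one long-lived child (the "MCP server"),
# prints the child's PID, then idles.
PARENT_CODE = (
    "import subprocess, sys, time\n"
    "child = subprocess.Popen([sys.executable, '-c', 'import time; time.sleep(60)'])\n"
    "print(child.pid, flush=True)\n"
    "time.sleep(60)\n"
)


def terminate_parent_only() -> tuple[int, bool]:
    """Terminate the parent alone and report whether its child outlived it."""
    proc = subprocess.Popen(
        [sys.executable, "-c", PARENT_CODE],
        stdout=subprocess.PIPE,
        text=True,
    )
    child_pid = int(proc.stdout.readline())
    proc.terminate()  # SIGTERM is delivered to the parent PID only
    proc.wait()
    try:
        os.kill(child_pid, 0)  # signal 0 only checks that the PID exists
        child_alive = True
    except ProcessLookupError:
        child_alive = False
    if child_alive:
        os.kill(child_pid, signal.SIGKILL)  # clean up the orphan this demo created
    proc.stdout.close()
    return child_pid, child_alive
```

Calling terminate_parent_only() should report the child as still alive after the parent has been reaped, which is the situation the MCP servers end up in.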
Minimal repro
# bug_A_repro.py
import asyncio
import subprocess
import sys
from contextlib import aclosing

from claude_agent_sdk import ClaudeAgentOptions, query

MCP_SERVERS = {
    "echo": {
        "command": sys.executable,
        "args": ["-c", "import time; time.sleep(3600)"],
    },
}


async def main():
    options = ClaudeAgentOptions(mcp_servers=MCP_SERVERS, allowed_tools=[])
    # aclosing() finalizes the query() generator deterministically on exit,
    # so Query.close has run before the survivor check below.
    async with aclosing(query(prompt="hello", options=options)) as messages:
        async for _ in messages:
            break  # close early to accelerate the leak

    # Check for survivors. pgrep -f takes a regex, so the dot and
    # parentheses are escaped to match the literal command line.
    out = subprocess.run(
        ["pgrep", "-P", "1", "-f", r"time\.sleep\(3600\)"],
        capture_output=True,
        text=True,
    )
    survivors = [pid for pid in out.stdout.split() if pid.isdigit()]
    print(f"PID 1-reparented MCP servers: {survivors}")
    assert not survivors, "MCP server subprocess leaked after Query.close"


if __name__ == "__main__":
    asyncio.run(main())
Run: python bug_A_repro.py. The output lists one or more PIDs instead of an empty list, and the assertion fails.
Workaround
Monkey-patch Query.close to capture the CLI PID and its children before calling transport.close(), then os.kill any that survive:
async def _patched_query_close(self):
    cli_pid = None
    children = []
    transport = getattr(self, "transport", None)
    if transport:
        proc = getattr(transport, "_process", None)
        if proc and getattr(proc, "pid", None):
            cli_pid = proc.pid
            children = await _get_child_pids(cli_pid)
    # ... proceed with upstream close logic ...
    await self.transport.close()
    if cli_pid or children:
        await _force_kill_pids(cli_pid, children)
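where _get_child_pids shells out to pgrep -P <pid> and _force_kill_pids sends SIGKILL after a short wait. pgrep -P lists direct children only, so anything an MCP server spawns itself is not covered by this workaround.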
Impact
A long-running Python process that makes many query() calls, each with multiple MCP servers configured, leaks one child process per configured server per call. For example, 1,000 calls with 3 servers each leave 3,000 orphaned processes. Over hours the host runs out of memory.
Suggested fix
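In SubprocessCLITransport.close(), after the CLI process has been waited on, enumerate the CLI's descendants and best-effort-kill them. psutil is probably too heavy as a required dep; os.killpg() with a process group set at CLI spawn time (preexec_fn=os.setsid or start_new_session=True) would let close send a single SIGKILL to the whole group. Of the two, start_new_session=True looks preferable, since the Python documentation warns that preexec_fn is not safe in the presence of threads. Both options are POSIX-only, so Windows would need a separate path.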
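A rough sketch of the process-group direction, as an illustration under assumptions rather than a patch against the SDK: it uses subprocess.Popen with a stand-in parent instead of the SDK's own spawn call, and PARENT_CODE, spawn_in_new_group and kill_group are hypothetical names.

```python
import contextlib
import os
import signal
import subprocess
import sys

# Stand-in for the CLI: starts one long-lived child (the "MCP server"),
# prints the child's PID, then idles.
PARENT_CODE = (
    "import subprocess, sys, time\n"
    "child = subprocess.Popen([sys.executable, '-c', 'import time; time.sleep(60)'])\n"
    "print(child.pid, flush=True)\n"
    "time.sleep(60)\n"
)


def spawn_in_new_group() -> tuple[subprocess.Popen, int]:
    """Start the stand-in parent as leader of a new session and process group."""
    proc = subprocess.Popen(
        [sys.executable, "-c", PARENT_CODE],
        stdout=subprocess.PIPE,
        text=True,
        start_new_session=True,  # setsid() in the child, so its pgid equals its pid
    )
    child_pid = int(proc.stdout.readline())
    return proc, child_pid


def kill_group(proc: subprocess.Popen) -> None:
    """Send SIGKILL to every process still in the group led by `proc`."""
    with contextlib.suppress(ProcessLookupError, PermissionError):
        os.killpg(proc.pid, signal.SIGKILL)
    proc.wait()  # reap the leader; orphaned members are reaped by init
```

Because the child inherits the parent's process group, kill_group reaches it without any enumeration step. One trade-off to weigh: a CLI started in its own session no longer receives terminal-generated signals such as Ctrl-C, so the SDK would have to forward those itself.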
Happy to send a PR if that direction is welcome.