
Query.close() does not reap MCP subprocess grandchildren #889

@cisco-noor

Summary

claude_agent_sdk._internal.query.Query.close() waits on the CLI subprocess (transport._process) and calls terminate() / kill() on it, but does not enumerate or signal the CLI's children. MCP servers spawned by the CLI are reparented to PID 1 after close and accumulate across repeated query() / ClaudeSDKClient usage. In a long-running parent process (multi-hour pipeline, daemon, test suite) this leaks one process per configured MCP server per dispatch.

Environment

  • SDK: claude-agent-sdk==0.1.68 (also reproducible on 0.1.63, 0.1.58 — behavior unchanged).
  • Python: 3.12.x.
  • Platform: macOS (Darwin 25.x) and Linux (Ubuntu 22.04) both affected.

Expected

After await query_obj.close() (or exiting async with ClaudeSDKClient(...)), no descendant processes of the CLI remain running.

Actual

MCP server subprocesses are reparented to PID 1 (or to the nearest subreaper, such as a container's or session's init process) and persist until the parent Python process exits. On a host that spawns thousands of queries over hours, this manifests as OOM, file-descriptor exhaustion, and (for MCP servers that poll) CPU burn.

Evidence in the SDK source (0.1.68)

_internal/transport/subprocess_cli.py:512-564 (SubprocessCLITransport.close, abbreviated):

async def close(self) -> None:
    ...
    # Wait for graceful shutdown after stdin EOF, then terminate if needed.
    if self._process.returncode is None:
        try:
            with anyio.fail_after(5):
                await self._process.wait()
        except TimeoutError:
            # Graceful shutdown timed out — force terminate
            with suppress(ProcessLookupError):
                self._process.terminate()
            try:
                with anyio.fail_after(5):
                    await self._process.wait()
            except TimeoutError:
                # SIGTERM handler blocked — force kill (SIGKILL)
                with suppress(ProcessLookupError):
                    self._process.kill()
                with suppress(Exception):
                    await self._process.wait()
    ...

Only self._process (the CLI) is waited on and signalled. There is no pgrep -P <cli_pid> / psutil.Process.children(recursive=True) walk to find descendants, and the subprocess is not spawned in a new process group (no start_new_session=True / preexec_fn=os.setsid), so a single killpg call cannot reach them either.

_internal/query.py:807-828 (Query.close) calls through to self.transport.close() without adding any child-process cleanup of its own.
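The underlying mechanism can be shown without the SDK. The following is a minimal POSIX-only sketch, not SDK code: PARENT_SRC is a hypothetical stand-in for the CLI, and its sleeping child stands in for an MCP server. It illustrates that terminate() on a process signals that process alone, so anything it spawned keeps running.

```python
import os
import signal
import subprocess
import sys

# Hypothetical stand-in for the CLI: spawns a long-lived child (the "MCP
# server"), reports the child's PID on stdout, then idles.
PARENT_SRC = (
    "import subprocess, sys, time\n"
    "child = subprocess.Popen([sys.executable, '-c', 'import time; time.sleep(60)'])\n"
    "print(child.pid, flush=True)\n"
    "time.sleep(60)\n"
)


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (signal 0 only probes)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but belongs to another user
    return True


def grandchild_survives_terminate() -> bool:
    """Terminate the parent only and report whether its child is still alive."""
    parent = subprocess.Popen(
        [sys.executable, "-c", PARENT_SRC], stdout=subprocess.PIPE, text=True
    )
    grandchild_pid = int(parent.stdout.readline())
    parent.terminate()  # SIGTERM is delivered to the parent only
    parent.wait(timeout=5)
    parent.stdout.close()
    survived = pid_alive(grandchild_pid)
    if survived:
        os.kill(grandchild_pid, signal.SIGKILL)  # clean up the demo's own leak
    return survived


if __name__ == "__main__":
    print("grandchild survived terminate():", grandchild_survives_terminate())
```

The sketch kills the orphan itself at the end; in the SDK case nothing plays that role, which is the leak described above.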

Minimal repro

# bug_A_repro.py
import asyncio
import subprocess
import sys
from contextlib import aclosing

from claude_agent_sdk import ClaudeAgentOptions, query

MCP_SERVERS = {
    "echo": {
        "command": sys.executable,
        "args": ["-c", "import time; time.sleep(3600)"],
    },
}


async def main():
    options = ClaudeAgentOptions(mcp_servers=MCP_SERVERS, allowed_tools=[])
    # query() is an async generator, not a context manager. aclosing() closes it
    # deterministically so that Query.close has run before the check below.
    async with aclosing(query(prompt="hello", options=options)) as messages:
        async for _ in messages:
            break  # close early to accelerate the leak

    # Check for survivors reparented to PID 1. pgrep -f takes a regex, so the
    # dot and parentheses are escaped to match the literal command line.
    out = subprocess.run(
        ["pgrep", "-P", "1", "-f", r"time\.sleep\(3600\)"],
        capture_output=True,
        text=True,
    )
    survivors = [pid for pid in out.stdout.split() if pid.isdigit()]
    print(f"PID 1-reparented MCP servers: {survivors}")
    assert not survivors, "MCP server subprocess leaked after Query.close"


if __name__ == "__main__":
    asyncio.run(main())

Run: python bug_A_repro.py. Output shows one or more PIDs instead of an empty list.

Workaround

Monkey-patch Query.close to capture the CLI PID and its children before calling transport.close(), then os.kill any that survive:

async def _patched_query_close(self):
    cli_pid = None
    children = []
    transport = getattr(self, "transport", None)
    if transport:
        proc = getattr(transport, "_process", None)
        if proc and getattr(proc, "pid", None):
            cli_pid = proc.pid
            children = await _get_child_pids(cli_pid)
    # ... proceed with upstream close logic ...
    await self.transport.close()
    if cli_pid or children:
        await _force_kill_pids(cli_pid, children)

where _get_child_pids shells out to pgrep -P <pid> and _force_kill_pids sends SIGKILL with a short wait.

Impact

A long-running Python process that makes dozens of query() calls, each with multiple MCP servers configured, leaks one child process per configured server per dispatch. Over hours, the accumulated processes exhaust memory and the host OOMs.

Suggested fix

In SubprocessCLITransport.close(), after the CLI process has been waited on, enumerate the CLI's descendants and best-effort-kill them. psutil is probably too heavy as a required dep; os.killpg() with a process group set at CLI spawn time (preexec_fn=os.setsid or start_new_session=True) would let close send a single SIGKILL to the whole group.
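A rough sketch of the process-group variant, to make the proposal concrete. It uses plain asyncio rather than the SDK's anyio-based transport, and spawn_in_new_session and close_process_group are hypothetical names, not existing SDK functions. It sends SIGTERM to the group first and SIGKILL afterwards, on the assumption that a graceful attempt is still wanted.

```python
import asyncio
import os
import signal
from contextlib import suppress


async def spawn_in_new_session(*cmd: str, **kwargs) -> asyncio.subprocess.Process:
    """Start cmd as the leader of a new session, and so of a new process group."""
    return await asyncio.create_subprocess_exec(*cmd, start_new_session=True, **kwargs)


async def close_process_group(
    proc: asyncio.subprocess.Process, grace: float = 5.0
) -> None:
    """SIGTERM the whole group, wait for the leader, then SIGKILL what is left."""
    pgid = proc.pid  # a session leader's process group ID equals its PID
    with suppress(ProcessLookupError):
        os.killpg(pgid, signal.SIGTERM)
    try:
        await asyncio.wait_for(proc.wait(), timeout=grace)
    except asyncio.TimeoutError:
        pass  # leader ignored SIGTERM; fall through to SIGKILL
    # Descendants that ignored SIGTERM or outlived the leader are still in
    # the group, so one more signal reaches all of them.
    with suppress(ProcessLookupError):
        os.killpg(pgid, signal.SIGKILL)
    await proc.wait()
```

Some trade-offs to weigh: this is POSIX-only, a descendant that calls setsid() itself leaves the group and would still leak, and a CLI started in its own session no longer receives terminal-generated signals such as Ctrl-C along with the parent, so the parent would need to forward them.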

Happy to send a PR if that direction is welcome.
