Query.close() does not reap MCP subprocess grandchildren #889

## Summary

`claude_agent_sdk._internal.query.Query.close()` delegates to the transport, which waits on the CLI subprocess (`transport._process`) and calls `terminate()` / `kill()` on it, but never enumerates or signals the CLI's children. MCP servers spawned by the CLI are reparented to PID 1 after close and accumulate across repeated `query()` / `ClaudeSDKClient` usage. In a long-running parent process (multi-hour pipeline, daemon, test suite) this leaks one process per configured MCP server per query.

## Environment

- SDK: `claude-agent-sdk==0.1.68` (also reproducible on 0.1.63, 0.1.58 — behavior unchanged).
- Python: 3.12.x.
- Platform: macOS (Darwin 25.x) and Linux (Ubuntu 22.04) both affected.

## Expected

After `await query_obj.close()` (or exiting `async with ClaudeSDKClient(...)`), no descendant processes of the CLI remain running.

## Actual

MCP server subprocesses are reparented to PID 1 (or the launcher's init) and persist until the parent Python process exits. On a host that spawns thousands of queries over hours, this manifests as OOM, file-descriptor exhaustion, and (for MCP servers that poll) CPU burn.

## Evidence in the SDK source (0.1.68)

`_internal/transport/subprocess_cli.py:512-564` (`SubprocessCLITransport.close`, abbreviated):

```python
async def close(self) -> None:
    ...
    # Wait for graceful shutdown after stdin EOF, then terminate if needed.
    if self._process.returncode is None:
        try:
            with anyio.fail_after(5):
                await self._process.wait()
        except TimeoutError:
            # Graceful shutdown timed out — force terminate
            with suppress(ProcessLookupError):
                self._process.terminate()
            try:
                with anyio.fail_after(5):
                    await self._process.wait()
            except TimeoutError:
                # SIGTERM handler blocked — force kill (SIGKILL)
                with suppress(ProcessLookupError):
                    self._process.kill()
                with suppress(Exception):
                    await self._process.wait()
    ...
```

Only `self._process` (the CLI) is waited-on and signalled. There is no `pgrep -P <cli_pid>` / `psutil.Process.children(recursive=True)` walk, and the subprocess was not spawned in a new process group (no `start_new_session=True` / `preexec_fn=os.setsid`) that would let a single signal reach descendants via `killpg`.
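The underlying behavior can be shown without the SDK. The following is a minimal POSIX-only sketch, not SDK code: `PARENT_SRC` is a hypothetical stand-in for the CLI, and its sleeping child stands in for an MCP server. Killing the direct child, as `close()` does, leaves the grandchild running.

```python
import os
import signal
import subprocess
import sys

# Hypothetical stand-in for the CLI: spawns one long-lived child (the
# "MCP server"), prints that child's PID, then waits on it.
PARENT_SRC = (
    "import subprocess, sys\n"
    "child = subprocess.Popen([sys.executable, '-c', 'import time; time.sleep(60)'])\n"
    "print(child.pid, flush=True)\n"
    "child.wait()\n"
)


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (signal 0 only probes)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but owned by another user
    return True


def grandchild_survives_parent_kill() -> bool:
    parent = subprocess.Popen(
        [sys.executable, "-c", PARENT_SRC], stdout=subprocess.PIPE, text=True
    )
    grandchild_pid = int(parent.stdout.readline())
    parent.kill()  # signals the direct child only, as transport.close() does
    parent.wait()
    parent.stdout.close()
    survived = pid_alive(grandchild_pid)
    if survived:
        os.kill(grandchild_pid, signal.SIGKILL)  # clean up the orphan
    return survived


if __name__ == "__main__":
    print("grandchild survived:", grandchild_survives_parent_kill())
```

The signal sent by `kill()` targets a single PID, so nothing in the parent's teardown reaches processes it spawned unless the parent forwards the signal itself.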

`_internal/query.py:807-828` (`Query.close`) calls through to `self.transport.close()` without adding any child-process cleanup of its own.

## Minimal repro

```python
# bug_A_repro.py
import asyncio
import subprocess
import sys
from contextlib import aclosing

from claude_agent_sdk import ClaudeAgentOptions, query

MCP_SERVERS = {
    "echo": {
        "command": sys.executable,
        "args": ["-c", "import time; time.sleep(3600)"],
    },
}


async def main():
    options = ClaudeAgentOptions(mcp_servers=MCP_SERVERS, allowed_tools=[])
    # aclosing() finalizes the async generator on exit. A bare `break` would
    # defer cleanup, so the check below could run before close does.
    async with aclosing(query(prompt="hello", options=options)) as messages:
        async for _ in messages:
            break  # close early to accelerate the leak

    # The query() generator has been closed; Query.close ran.
    # Check for survivors. pgrep -f takes a regex, so the dot and
    # parentheses must be escaped to match the literal command line.
    out = subprocess.run(
        ["pgrep", "-P", "1", "-f", r"time\.sleep\(3600\)"],
        capture_output=True,
        text=True,
    )
    survivors = [pid for pid in out.stdout.split() if pid.isdigit()]
    print(f"PID 1-reparented MCP servers: {survivors}")
    assert not survivors, "MCP server subprocess leaked after Query.close"


if __name__ == "__main__":
    asyncio.run(main())
```

Run: `python bug_A_repro.py`. Output shows one or more PIDs instead of an empty list.

## Workaround

Monkey-patch `Query.close` to capture the CLI PID and its children before calling `transport.close()`, then `os.kill` any that survive:

```python
async def _patched_query_close(self):
    cli_pid = None
    children = []
    transport = getattr(self, "transport", None)
    if transport:
        proc = getattr(transport, "_process", None)
        if proc and getattr(proc, "pid", None):
            cli_pid = proc.pid
            children = await _get_child_pids(cli_pid)
    # ... proceed with upstream close logic ...
    await self.transport.close()
    if cli_pid or children:
        await _force_kill_pids(cli_pid, children)
```

where `_get_child_pids` shells out to `pgrep -P <pid>` and `_force_kill_pids` sends SIGKILL with a short wait.

## Impact

A long-running Python process that makes dozens of `query()` calls, each with several MCP servers configured, leaks one process per configured server per call. Over hours the host runs out of memory.

## Suggested fix

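In `SubprocessCLITransport.close()`, enumerate the CLI's descendants and best-effort-kill them once the CLI process has been waited on. The enumeration has to happen while the CLI is still alive, because orphaned descendants are reparented and no longer show up under the CLI's PID. `psutil` is probably too heavy as a required dep; spawning the CLI in its own process group (`start_new_session=True`, or `preexec_fn=os.setsid`) avoids enumeration altogether, since close can then send a single `SIGKILL` to the whole group with `os.killpg()`.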
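A rough sketch of the process-group variant, assuming plain asyncio rather than the SDK's actual transport code. `spawn_in_new_group` and `close_group` are hypothetical names, and the 5-second grace period simply mirrors the timeouts in the existing `close()`. It is POSIX-only, and a descendant that calls `setsid()` itself would leave the group and escape the signal.

```python
import asyncio
import contextlib
import os
import signal
import sys


async def spawn_in_new_group(*argv: str, **kwargs) -> asyncio.subprocess.Process:
    """Start argv as the leader of a new session, and so of a new process group."""
    return await asyncio.create_subprocess_exec(
        *argv, start_new_session=True, **kwargs
    )


async def close_group(proc: asyncio.subprocess.Process, grace: float = 5.0) -> None:
    """SIGTERM the whole group, wait up to `grace` seconds, then SIGKILL it."""
    pgid = proc.pid  # with start_new_session=True the leader's PID is the PGID
    with contextlib.suppress(ProcessLookupError):
        os.killpg(pgid, signal.SIGTERM)
    try:
        await asyncio.wait_for(proc.wait(), timeout=grace)
    except asyncio.TimeoutError:
        pass
    # Descendants may ignore SIGTERM or outlive the leader: SIGKILL the group.
    # ProcessLookupError here means the group is already empty.
    with contextlib.suppress(ProcessLookupError):
        os.killpg(pgid, signal.SIGKILL)
    await proc.wait()


async def main() -> None:
    proc = await spawn_in_new_group(sys.executable, "-c", "import time; time.sleep(60)")
    await close_group(proc, grace=2.0)
    print("leader exit code:", proc.returncode)


if __name__ == "__main__":
    asyncio.run(main())
```

`start_new_session=True` is generally the safer of the two spawn options, since the `subprocess` documentation warns that `preexec_fn` is not safe in the presence of threads. The group ID stays valid for as long as any member is alive, which is why the kill still works after the leader itself has exited.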

Happy to send a PR if that direction is welcome.

