SubprocessCLITransport stderr task group leaks cancel scope on query() completion — same bug #454 / #776, fix #746 incomplete

## Summary

`SubprocessCLITransport` in `_internal/transport/subprocess_cli.py` manages its stderr reader with the same `anyio.create_task_group()` + manual `__aenter__`/`__aexit__` anti-pattern that [PR #746](https://github.com/anthropics/claude-agent-sdk-python/pull/746) fixed in `query.py`. The fix was not applied to this sibling file. As a result, every normal completion of `async for message in query(...)` triggers a cross-task cancel scope exit during async-generator finalization, which cancels the caller's task from an unrelated context and pins anyio's `_deliver_cancellation` in a 100% CPU loop until the process is restarted.

This is the same failure mode as [#454](https://github.com/anthropics/claude-agent-sdk-python/issues/454) and [#776](https://github.com/anthropics/claude-agent-sdk-python/issues/776), just in a different file.

## Affected version

`claude-agent-sdk==0.1.58` (current latest). Also present on `main` as of 2026-04-12 — I pulled `src/claude_agent_sdk/_internal/transport/subprocess_cli.py` directly from the `main` branch and the offending lines are still there.

## The offending code

[`subprocess_cli.py` lines 395–399](https://github.com/anthropics/claude-agent-sdk-python/blob/main/src/claude_agent_sdk/_internal/transport/subprocess_cli.py#L395-L399) in `connect()`:

```python
if should_pipe_stderr and self._process.stderr:
    self._stderr_stream = TextReceiveStream(self._process.stderr)
    self._stderr_task_group = anyio.create_task_group()
    await self._stderr_task_group.__aenter__()
    self._stderr_task_group.start_soon(self._handle_stderr)
```

And [`subprocess_cli.py` lines 458–462](https://github.com/anthropics/claude-agent-sdk-python/blob/main/src/claude_agent_sdk/_internal/transport/subprocess_cli.py#L458-L462) in `close()`:

```python
if self._stderr_task_group:
    with suppress(Exception):
        self._stderr_task_group.cancel_scope.cancel()
        await self._stderr_task_group.__aexit__(None, None, None)
    self._stderr_task_group = None
```

Anyio cancel scopes have *task affinity* — they must be exited by the same async task that entered them. `connect()` enters the scope in whatever task is consuming the `query()` generator. `close()` is called from the generator's `finally` clause, which on normal completion runs in an asyncio-created finalizer task (via `sys.set_asyncgen_hooks` → `async_generator_athrow`), not the original consumer. The cross-task `cancel_scope.cancel()` + `__aexit__(None, None, None)` then cancels the scope's host task — the original consumer — from an unrelated task context.
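To illustrate the task switch described above without involving the SDK or anyio, here is a minimal stdlib-only sketch. It assumes CPython's asyncio, whose async-generator finalizer hook schedules `agen.aclose()` as a new task. The generator `numbers` and the `record` dict are hypothetical names used only for this demo, and the sketch forces finalization with an early `break`; it does not claim to reproduce the exact path `query()` takes internally.

```python
import asyncio


async def numbers(record):
    """Hypothetical async generator whose cleanup records the task it runs in."""
    try:
        for i in range(3):
            yield i
    finally:
        # Runs when the generator is finalized (GeneratorExit is thrown in).
        record["finalizer_task"] = asyncio.current_task()


async def main():
    record = {}
    consumer = asyncio.current_task()
    async for _ in numbers(record):
        # Leave the generator suspended; once the loop drops its reference,
        # asyncio's finalizer hook schedules agen.aclose() as a new task.
        break
    # Give the event loop time to create and run the finalizer task.
    await asyncio.sleep(0.05)
    return consumer, record.get("finalizer_task")


consumer_task, finalizer_task = asyncio.run(main())
ran_in_other_task = (
    finalizer_task is not None and finalizer_task is not consumer_task
)
print("cleanup ran in a different task:", ran_in_other_task)
```

Any resource with task affinity that was entered in `consumer_task` and is released inside that `finally` block would therefore be released from the wrong task.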

## Observed failure mode

`async for message in query(...)` runs to completion. Immediately after the `async for` loop exits, asyncio's asyncgen finalizer schedules a task (`Task-<N>`) to run `async_generator_athrow(GeneratorExit)`. That task drives `process_query`'s `finally`, which calls `transport.close()`, which hits the broken cleanup path.

With diagnostic instrumentation added at the `except asyncio.CancelledError` points in my application code, I captured this traceback on two independent completions (different scope IDs, different caller Task IDs, identical structure):

```
Cancelled via cancel scope 77a038de8cb0 by <Task pending
  name='Task-12426' coro=<<async_generator_athrow without __name__>()>>
```

```
Cancelled via cancel scope 77a038034320 by <Task pending
  name='Task-39833' coro=<<async_generator_athrow without __name__>()>>
```
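For readers who want to capture similar evidence, the instrumentation mentioned above could look roughly like the following sketch. It is not the code from my application: `worker`, `log`, and the cancel message are hypothetical, and it only relies on the stdlib behavior that a message passed to `Task.cancel(msg=...)` (Python 3.9+) shows up in the `CancelledError` raised inside the cancelled task.

```python
import asyncio


async def worker(log):
    """Hypothetical long-running coroutine with a diagnostic hook."""
    try:
        await asyncio.sleep(10)
    except asyncio.CancelledError as exc:
        # Record which task was cancelled and the message attached to the
        # cancellation, then re-raise so cancellation still propagates.
        log.append((asyncio.current_task().get_name(), str(exc)))
        raise


async def main():
    log = []
    task = asyncio.create_task(worker(log), name="demo-worker")
    await asyncio.sleep(0.01)  # let the worker start and suspend
    task.cancel(msg="demo cancel reason")
    try:
        await task
    except asyncio.CancelledError:
        pass
    return log


captured = asyncio.run(main())
print(captured)
```

Logging the message before re-raising keeps cancellation semantics intact while preserving the origin of the cancel request.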

Once the cancellation is delivered from the non-host task, the scope's state is inconsistent: it has a pending cancel but the host task never exited the scope. Anyio's `_deliver_cancellation` reschedules itself via `call_soon` on every event loop tick, pinning one CPU core at 100% for the remaining lifetime of the process. Each completed agent leaves one stuck scope behind, so they accumulate.

This fires on **every** normal completion of the documented `async for message in query(...)` usage pattern. It is deterministic and reproducible.

## Proposed fix

Apply the same fix [PR #746](https://github.com/anthropics/claude-agent-sdk-python/pull/746) already applied to `query.py`: replace the anyio task group with a plain `asyncio.create_task()`, since asyncio tasks have no cancel-scope affinity and can be safely cancelled from any task context.

The diff is mechanical:

- Add `self._stderr_task: asyncio.Task | None = None` to `__init__`
- In `connect()`, replace the task-group-create + `__aenter__` + `start_soon` block with `self._stderr_task = asyncio.create_task(self._handle_stderr(), name="claude-sdk-stderr-reader")`
- In `close()`, replace the task-group cleanup block with:
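  ```python
  if self._stderr_task is not None:
      if not self._stderr_task.done():
          # Plain asyncio tasks can be cancelled from any task context.
          self._stderr_task.cancel()
          with suppress(asyncio.CancelledError, Exception):
              await self._stderr_task
      # Clear the reference even if the reader had already finished.
      self._stderr_task = None
  ```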
- Remove the `_stderr_task_group` attribute entirely

`_handle_stderr()` itself needs no changes — it already handles `ClosedResourceError` and generic `Exception` cleanly.
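The shape of the proposed change can be sketched in isolation as follows. `DemoTransport` is a hypothetical stand-in, not the SDK class, and its `_handle_stderr` just loops on `asyncio.sleep` instead of reading a real stream. The point it shows is that `close()` can run in a different task than `connect()` without any affinity problem.

```python
import asyncio
from contextlib import suppress


class DemoTransport:
    """Hypothetical stand-in for the transport; not the SDK implementation."""

    def __init__(self):
        self._stderr_task = None
        self.ticks = 0

    async def _handle_stderr(self):
        # Placeholder for the real stderr reader: runs until cancelled.
        while True:
            await asyncio.sleep(0.01)
            self.ticks += 1

    async def connect(self):
        self._stderr_task = asyncio.create_task(
            self._handle_stderr(), name="demo-stderr-reader"
        )

    async def close(self):
        task, self._stderr_task = self._stderr_task, None
        if task is not None and not task.done():
            task.cancel()
            with suppress(asyncio.CancelledError, Exception):
                await task


async def main():
    transport = DemoTransport()
    await transport.connect()  # reader task created in the main task
    reader = transport._stderr_task
    await asyncio.sleep(0.03)
    # Close from a *different* task, mimicking the finalizer-task scenario.
    closer = asyncio.create_task(transport.close())
    await closer
    return reader.cancelled(), closer.cancelled(), transport._stderr_task is None


reader_cancelled, closer_cancelled, task_cleared = asyncio.run(main())
print(reader_cancelled, closer_cancelled, task_cleared)
```

Here the reader ends up cancelled while the task that called `close()` finishes normally, which is the behavior the task-group version fails to provide.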

## Reference implementation

I have a working subclass that applies this exact fix as a downstream workaround: it subclasses `SubprocessCLITransport`, delegates to `super().connect()` for all the subprocess setup, then immediately tears down the anyio task group *in the same task frame that entered it* (so the `__aexit__` is legal), and replaces it with `asyncio.create_task(self._handle_stderr())`. Happy to port this into a PR against `main` if it would help.

After deploying the subclass, I verified with 4+ consecutive agent completions on the same process:

| Before fix | After fix |
|---|---|
| Every completion → cancel scope leak → CPU at 100%+ until process restart | Every completion → clean event pipeline, CPU stays at <15% |

The diagnostic traceback from above disappears entirely.

## Why the incomplete fix likely slipped through

[PR #746](https://github.com/anthropics/claude-agent-sdk-python/pull/746)'s description explains that `query.py` used a `TaskGroup` with manual `__aenter__`/`__aexit__` and hit the cross-task affinity issue. The PR fixed that file, but the same pattern exists in `subprocess_cli.py` as a separate, smaller task group for stderr reading. It was apparently not surfaced by the test case in #746, which tested cross-task close of `query`, not of the subprocess transport. The stderr task group has the same affinity semantics and the same trigger.

