SubprocessCLITransport stderr task group leaks cancel scope on query() completion — same bug #454 / #776, fix #746 incomplete #810

@davidcyze

Summary

SubprocessCLITransport in _internal/transport/subprocess_cli.py uses the exact anyio.create_task_group() + manual __aenter__/__aexit__ anti-pattern that was fixed in query.py by PR #746 for the stderr reader. The fix was not applied to the sibling file. Every normal completion of async for message in query(...) now triggers a cross-task cancel scope exit on async-generator finalization, which cancels the caller's task from an unrelated context and pins anyio's _deliver_cancellation in a 100% CPU loop until the process restarts.

This is the same failure mode as #454 and #776, just in a different file.

Affected version

claude-agent-sdk==0.1.58 (current latest). Also present on main as of 2026-04-12 — I pulled src/claude_agent_sdk/_internal/transport/subprocess_cli.py directly from the main branch and the offending lines are still there.

The offending code

subprocess_cli.py lines 395–399 in connect():

if should_pipe_stderr and self._process.stderr:
    self._stderr_stream = TextReceiveStream(self._process.stderr)
    self._stderr_task_group = anyio.create_task_group()
    await self._stderr_task_group.__aenter__()
    self._stderr_task_group.start_soon(self._handle_stderr)

And subprocess_cli.py lines 458–462 in close():

if self._stderr_task_group:
    with suppress(Exception):
        self._stderr_task_group.cancel_scope.cancel()
        await self._stderr_task_group.__aexit__(None, None, None)
    self._stderr_task_group = None

Anyio cancel scopes have task affinity: they must be exited by the same async task that entered them. connect() enters the scope in whatever task is consuming the query() generator. close() is called from the generator's finally clause, which on normal completion runs in an asyncio-created finalizer task (via sys.set_asyncgen_hooks → async_generator_athrow), not the original consumer. The cross-task cancel_scope.cancel() + __aexit__(None, None, None) then cancels the scope's host task (the original consumer) from an unrelated task context.
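The task switch described above can be observed with asyncio alone. The following is a minimal sketch, not SDK code: gen() and consumer() are hypothetical stand-ins, and it assumes the consumer stops iterating while the generator is still suspended, so that cleanup is left to asyncio's asyncgen finalizer hook. It records which task runs the generator body and which task runs its finally clause.

```python
import asyncio
import gc


async def finalizer_runs_in_same_task() -> bool:
    """Return True if the generator's finally ran in the consumer's task."""
    seen = {}

    async def gen():
        # Hypothetical stand-in for an async generator that owns resources.
        seen["enter"] = asyncio.current_task()
        try:
            yield 1
            yield 2
        finally:
            # Stand-in for the cleanup that would call transport.close().
            seen["exit"] = asyncio.current_task()

    async def consumer():
        async for _ in gen():
            break  # stop early; the generator is left suspended, not closed

    await asyncio.create_task(consumer(), name="consumer")
    gc.collect()
    # Give the event loop a few iterations to run the finalizer task.
    for _ in range(5):
        await asyncio.sleep(0)
    return seen["exit"] is seen["enter"]


if __name__ == "__main__":
    print("finally ran in consumer task:", asyncio.run(finalizer_runs_in_same_task()))
```

Any cleanup in that finally clause that exits an anyio cancel scope would therefore do so from a task other than the one that entered it.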

Observed failure mode

async for message in query(...) runs to completion. Immediately after the async for loop exits, asyncio's asyncgen finalizer schedules a task (Task-<N>) to run async_generator_athrow(GeneratorExit). That task drives process_query's finally, which calls transport.close(), which hits the broken cleanup path.

With diagnostic instrumentation added at the except asyncio.CancelledError points in my application code, I captured this traceback on two independent completions (different scope IDs, different caller Task IDs, identical structure):

Cancelled via cancel scope 77a038de8cb0 by <Task pending
  name='Task-12426' coro=<<async_generator_athrow without __name__>()>>
Cancelled via cancel scope 77a038034320 by <Task pending
  name='Task-39833' coro=<<async_generator_athrow without __name__>()>>
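
For anyone who wants to capture similar output, here is a minimal sketch of this kind of instrumentation. It assumes the text above arrives as the message attached to the CancelledError; run_instrumented, cancel_demo, and the "demo cancel" message are hypothetical names used only for illustration.

```python
import asyncio


async def run_instrumented(record: list) -> None:
    try:
        await asyncio.sleep(3600)  # stand-in for application work
    except asyncio.CancelledError as exc:
        # Record which task was cancelled and the message the canceller attached.
        task = asyncio.current_task()
        record.append((task.get_name(), exc.args))
        raise  # re-raise so the cancellation is not swallowed


async def cancel_demo() -> list:
    record: list = []
    task = asyncio.create_task(run_instrumented(record), name="worker")
    await asyncio.sleep(0)  # let the worker reach its await
    task.cancel(msg="demo cancel")  # hypothetical message
    try:
        await task
    except asyncio.CancelledError:
        pass
    return record


if __name__ == "__main__":
    print(asyncio.run(cancel_demo()))
```

In a real application the record.append call would be a log statement placed at each existing except asyncio.CancelledError site.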

Once the cancellation is delivered from the non-host task, the scope's state is inconsistent: it has a pending cancel but the host task never exited the scope. Anyio's _deliver_cancellation reschedules itself via call_soon on every event loop tick, pinning one CPU core at 100% for the remaining lifetime of the process. One stuck scope per completed agent — they accumulate.

This fires on every normal completion of the documented async for message in query(...) usage pattern. It is deterministic and reproducible.

Proposed fix

Apply the same fix PR #746 already applied to query.py: replace the anyio task group with a plain asyncio.create_task(), since asyncio tasks have no cancel-scope affinity and can be safely cancelled from any task context.

The diff is mechanical:

  • Add self._stderr_task: asyncio.Task | None = None to __init__
  • In connect(), replace the task-group-create + __aenter__ + start_soon block with self._stderr_task = asyncio.create_task(self._handle_stderr(), name="claude-sdk-stderr-reader")
  • In close(), replace the task-group cleanup block with:
    if self._stderr_task is not None and not self._stderr_task.done():
        self._stderr_task.cancel()
        with suppress(asyncio.CancelledError, Exception):
            await self._stderr_task
        self._stderr_task = None
  • Remove the _stderr_task_group attribute entirely

_handle_stderr() itself needs no changes — it already handles ClosedResourceError and generic Exception cleanly.
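
The steps above can be sketched in isolation. This is an illustrative stand-in rather than the actual transport: StderrReaderSketch and close_from_other_task are hypothetical, and _handle_stderr here just blocks until cancelled instead of reading a stream. It shows the reader task being created in one task and cancelled from another without the creating task being affected.

```python
import asyncio
from contextlib import suppress
from typing import Optional


class StderrReaderSketch:
    """Hypothetical stand-in; only the stderr-task lifecycle is modelled."""

    def __init__(self) -> None:
        self._stderr_task: Optional[asyncio.Task] = None

    async def _handle_stderr(self) -> None:
        # Placeholder for the real reader loop: block until cancelled.
        await asyncio.Event().wait()

    async def connect(self) -> None:
        self._stderr_task = asyncio.create_task(
            self._handle_stderr(), name="claude-sdk-stderr-reader"
        )

    async def close(self) -> None:
        # Clear the attribute first so repeated close() calls are harmless.
        task, self._stderr_task = self._stderr_task, None
        if task is not None and not task.done():
            task.cancel()
            with suppress(asyncio.CancelledError, Exception):
                await task


async def close_from_other_task() -> tuple:
    transport = StderrReaderSketch()
    opener = asyncio.create_task(transport.connect(), name="opener")
    await opener
    reader = transport._stderr_task
    # Close from a different task than the one that called connect().
    closer = asyncio.create_task(transport.close(), name="closer")
    await closer
    return reader.cancelled(), opener.cancelled()


if __name__ == "__main__":
    print(asyncio.run(close_from_other_task()))
```

Unlike the snippet in the list, this sketch clears _stderr_task even when the task has already finished; that is a small design choice of the sketch, not part of the proposal.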

Reference implementation

I have a working subclass that applies this exact fix as a downstream workaround: it subclasses SubprocessCLITransport, delegates to super().connect() for all the subprocess setup, then immediately tears down the anyio task group in the same task frame that entered it (so the __aexit__ is legal), and replaces it with asyncio.create_task(self._handle_stderr()). Happy to port this into a PR against main if it would help.

After deploying the subclass, I verified with 4+ consecutive agent completions on the same process:

Before fix: Every completion → cancel scope leak → CPU at 100%+ until process restart
After fix: Every completion → clean event pipeline, CPU stays at <15%

The diagnostic traceback from above disappears entirely.

Why the incomplete fix likely slipped through

PR #746's description explains that query.py used a TaskGroup with manual __aenter__/__aexit__ and hit the cross-task affinity issue. The PR fixed that file but the same pattern exists in subprocess_cli.py as a separate, smaller task group for stderr reading — it was apparently not surfaced by the test case in #746 (which tested cross-task close of query, not of the subprocess transport). The stderr task group has the same affinity semantics and the same trigger.
