Summary
Two tests in tests/test_transcript_mirror.py covering session_store_flush="eager" fail reliably on Python 3.11.14 / macOS arm64 against current main (HEAD 9aafd84). The implementation is correct; the tests are timing-fragile — they assume 2 await asyncio.sleep(0) yields are enough for the second eager flush to reach store.append, but the actual path needs ~4 yields when there's lock contention between consecutive drains.
PR #905 (May 3) merged with all CI green; the failures appear locally despite no intervening commits to transcript_mirror_batcher.py. This suggests CI got lucky on event-loop scheduling at merge time.
Affected tests
tests/test_transcript_mirror.py::TestBuildMirrorBatcherFlushMode::test_eager_mode_flushes_per_frame (unit-level)
tests/test_transcript_mirror.py::TestReceiveLoopFramePeeling::test_eager_flush_mode_appends_per_frame_before_result (integration-level)
Both fail with the same symptom: assert appends_at_assistant == 2 (or len(store.append_calls) == 2) sees only 1 append.
Reproducer
5/5 consecutive failures on my machine:
$ for i in 1 2 3 4 5; do uv run pytest tests/test_transcript_mirror.py::TestBuildMirrorBatcherFlushMode::test_eager_mode_flushes_per_frame -q 2>&1 | tail -2 | head -1; done
FAILED tests/test_transcript_mirror.py::TestBuildMirrorBatcherFlushMode::test_eager_mode_flushes_per_frame
FAILED tests/test_transcript_mirror.py::TestBuildMirrorBatcherFlushMode::test_eager_mode_flushes_per_frame
FAILED tests/test_transcript_mirror.py::TestBuildMirrorBatcherFlushMode::test_eager_mode_flushes_per_frame
FAILED tests/test_transcript_mirror.py::TestBuildMirrorBatcherFlushMode::test_eager_mode_flushes_per_frame
FAILED tests/test_transcript_mirror.py::TestBuildMirrorBatcherFlushMode::test_eager_mode_flushes_per_frame
Sweeping the yield count with a standalone probe (same _RecordingStore and build_mirror_batcher setup) shows the boundary clearly:
yields= 2: after_first=1, after_second=1 <-- FAILS (test asserts 2)
yields= 4: after_first=1, after_second=2
yields= 6: after_first=1, after_second=2
yields= 10: after_first=1, after_second=2
yields=100: after_first=1, after_second=2
The implementation works — it just needs more event-loop turns than the tests provide.
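For readers without the repository checked out, the shape of that probe can be sketched as below. This is not the probe I ran: RecordingStore and ToyEagerBatcher are hypothetical stand-ins for _RecordingStore and whatever build_mirror_batcher returns, assuming only what the report describes (one drain task per enqueue, a shared lock, and store.append wrapped in asyncio.wait_for). The exact boundary it prints depends on the Python version and asyncio internals, so no expected output is shown.

```python
import asyncio


class RecordingStore:
    """Stand-in for the test suite's _RecordingStore: records each append."""

    def __init__(self):
        self.append_calls = []

    async def append(self, batch):
        self.append_calls.append(list(batch))


class ToyEagerBatcher:
    """Hypothetical eager batcher: every enqueue spawns its own drain task."""

    def __init__(self, store):
        self._store = store
        self._lock = asyncio.Lock()
        self._pending = []
        self._tasks = set()

    def enqueue(self, frame):
        self._pending.append(frame)
        task = asyncio.ensure_future(self._drain())
        self._tasks.add(task)  # keep a reference so the task is not collected
        task.add_done_callback(self._tasks.discard)

    async def _drain(self):
        async with self._lock:  # consecutive drains serialize here
            batch, self._pending = self._pending, []
            if batch:
                await asyncio.wait_for(self._store.append(batch), timeout=5)

    async def aclose(self):
        # Let any still-running drains finish so nothing is cancelled at teardown.
        await asyncio.gather(*list(self._tasks))


async def probe(yields):
    """Enqueue two frames, yielding `yields` times after each; return both counts."""
    store = RecordingStore()
    batcher = ToyEagerBatcher(store)
    counts = []
    for frame in ("first", "second"):
        batcher.enqueue(frame)
        for _ in range(yields):
            await asyncio.sleep(0)
        counts.append(len(store.append_calls))
    await batcher.aclose()
    return tuple(counts)


def sweep(yield_counts=(2, 4, 6, 10, 100)):
    return {n: asyncio.run(probe(n)) for n in yield_counts}


if __name__ == "__main__":
    for n, (first, second) in sweep().items():
        print(f"yields={n:>3}: after_first={first}, after_second={second}")
```

The only property this sketch is meant to show is that the second count reaches 2 once enough yields are given; the smallest sufficient value is an artifact of scheduling, not of the batcher's contract.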
Root cause
The first enqueue → drain happens cleanly within 2 yields because there's no contention. The second drain has to traverse:
- First drain's await asyncio.wait_for(store.append, ...) returns (1 yield, since wait_for wraps in an inner Task)
- First drain exits _do_flush, releases the async with self._lock (1 yield)
- Second drain acquires the lock (1 yield)
- Second drain's wait_for(store.append) schedules its inner task (1 yield)
- _RecordingStore.append records synchronously into append_calls
That's ~4 yields. Tests allot 2. On the unit test, the second drain task gets cancelled at event-loop teardown before reaching store.append — visible in the asyncio.exceptions.CancelledError traceback the failing test prints from add_done_callback(lambda t: t.exception()) at src/claude_agent_sdk/_internal/transcript_mirror_batcher.py:91.
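The CancelledError traceback follows from documented asyncio behavior: Task.exception() raises CancelledError when the task was cancelled, and an exception raised inside a done callback is handed to the loop's exception handler. A minimal standalone sketch of that mechanism, with slow_drain as a hypothetical stand-in for the second drain and only the callback shape taken from the report:

```python
import asyncio


async def demo():
    loop = asyncio.get_running_loop()
    reported = []
    # Capture what the loop would otherwise log as "Exception in callback ...".
    loop.set_exception_handler(
        lambda _loop, ctx: reported.append(ctx.get("exception"))
    )

    async def slow_drain():
        # Stands in for a drain that is still parked when the test ends.
        await asyncio.sleep(3600)

    task = asyncio.create_task(slow_drain())
    # Same callback shape as quoted above from the batcher.
    task.add_done_callback(lambda t: t.exception())

    await asyncio.sleep(0)  # let the task start running
    task.cancel()           # what teardown does to still-pending tasks
    await asyncio.gather(task, return_exceptions=True)
    return reported


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Here the callback runs on a cancelled task, t.exception() raises, and the loop reports it. That matches the symptom in the failing unit test without implying any defect in the flush logic itself.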
Severity
Test-suite reliability / contributor experience. Production behavior is unaffected — eager flush works correctly given enough event-loop time. But running the full suite locally fails out of the box, which makes contribution awkward.
Suggested fixes (in order of preference)
- Replace the fixed yield count with a deterministic wait. Loop await asyncio.sleep(0) with a deadline until the expected condition holds:

    import asyncio
    import time

    async def _wait_until(predicate, timeout=1.0):
        # Yield to the event loop until predicate() holds or the deadline passes.
        deadline = time.monotonic() + timeout
        while not predicate():
            if time.monotonic() > deadline:
                raise AssertionError("timed out waiting")
            await asyncio.sleep(0)

- Expose a test-only wait_quiescent() on the batcher that awaits _flush_task if set, and use it between enqueues in the tests.
- Make the test await batcher.flush() after each enqueue. Defeats the original intent of "verify automatic eager flush triggers", so least preferred.
Happy to send a PR for option (1) if it's the preferred direction.
Environment
- claude-agent-sdk-python @ 9aafd84 (current main)
- Python 3.11.14 (uv-managed venv)
- pytest 9.0.3, pytest-asyncio 1.3.0, anyio 4.13.0
- macOS 25.3.0 (Darwin arm64)