Flaky tests: eager-flush transcript-mirror tests fail with 2 sleep(0) yields (test_transcript_mirror.py) #928

@seeincodes

Summary

Two tests in tests/test_transcript_mirror.py covering session_store_flush="eager" fail reliably on Python 3.11.14 / macOS arm64 against current main (HEAD 9aafd84). The implementation is correct; the tests are timing-fragile — they assume 2 await asyncio.sleep(0) yields are enough for the second eager flush to reach store.append, but the actual path needs ~4 yields when there's lock contention between consecutive drains.

PR #905 (May 3) merged with all CI green, yet the failures appear locally despite no intervening commits to transcript_mirror_batcher.py. This suggests CI got lucky on event-loop scheduling at merge time.

Affected tests

  • tests/test_transcript_mirror.py::TestBuildMirrorBatcherFlushMode::test_eager_mode_flushes_per_frame (unit-level)
  • tests/test_transcript_mirror.py::TestReceiveLoopFramePeeling::test_eager_flush_mode_appends_per_frame_before_result (integration-level)

Both fail with the same symptom: assert appends_at_assistant == 2 (or len(store.append_calls) == 2) sees only 1 append.

Reproducer

5/5 consecutive failures on my machine:

$ for i in 1 2 3 4 5; do uv run pytest tests/test_transcript_mirror.py::TestBuildMirrorBatcherFlushMode::test_eager_mode_flushes_per_frame -q 2>&1 | tail -2 | head -1; done
FAILED tests/test_transcript_mirror.py::TestBuildMirrorBatcherFlushMode::test_eager_mode_flushes_per_frame
FAILED tests/test_transcript_mirror.py::TestBuildMirrorBatcherFlushMode::test_eager_mode_flushes_per_frame
FAILED tests/test_transcript_mirror.py::TestBuildMirrorBatcherFlushMode::test_eager_mode_flushes_per_frame
FAILED tests/test_transcript_mirror.py::TestBuildMirrorBatcherFlushMode::test_eager_mode_flushes_per_frame
FAILED tests/test_transcript_mirror.py::TestBuildMirrorBatcherFlushMode::test_eager_mode_flushes_per_frame

Sweeping the yield count with a standalone probe (same _RecordingStore and build_mirror_batcher setup) shows the boundary clearly:

yields=  2: after_first=1, after_second=1   <-- FAILS (test asserts 2)
yields=  4: after_first=1, after_second=2
yields=  6: after_first=1, after_second=2
yields= 10: after_first=1, after_second=2
yields=100: after_first=1, after_second=2

The implementation works — it just needs more event-loop turns than the tests provide.
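
A probe of this kind can be approximated without the SDK. The sketch below is a minimal illustration, not the probe used for the table above: yields_until and slow_writer are hypothetical names, and slow_writer merely stands in for a drain task that needs a number of event-loop turns before it records its result.

```python
import asyncio


async def yields_until(predicate, limit=100):
    """Count the sleep(0) yields spent before predicate() is true.

    Returns None if the predicate is still false after `limit` yields.
    """
    used = 0
    while not predicate():
        if used >= limit:
            return None
        await asyncio.sleep(0)
        used += 1
    return used


async def slow_writer(log, hops):
    # Hypothetical stand-in for a drain task: it gives up the loop
    # `hops` times before recording its result.
    for _ in range(hops):
        await asyncio.sleep(0)
    log.append("written")


async def probe(hops):
    log = []
    task = asyncio.create_task(slow_writer(log, hops))
    needed = await yields_until(lambda: len(log) == 1)
    await task  # the task has finished by now; this just collects it
    return needed


if __name__ == "__main__":
    for hops in (0, 1, 3):
        print(f"hops={hops}: yields needed={asyncio.run(probe(hops))}")
```

Measuring the count rather than guessing it makes the boundary visible, but the number is a property of the scheduler, so it is better used for diagnosis than baked into an assertion.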

Root cause

The first enqueue → drain happens cleanly within 2 yields because there's no contention. The second drain has to traverse:

  1. First drain's await asyncio.wait_for(store.append, ...) returns (1 yield, since wait_for wraps in an inner Task)
  2. First drain exits _do_flush, releases the async with self._lock (1 yield)
  3. Second drain acquires the lock (1 yield)
  4. Second drain's wait_for(store.append) schedules its inner task (1 yield)
  5. _RecordingStore.append records synchronously into append_calls

That's ~4 yields. Tests allot 2. On the unit test, the second drain task gets cancelled at event-loop teardown before reaching store.append — visible in the asyncio.exceptions.CancelledError traceback the failing test prints from add_done_callback(lambda t: t.exception()) at src/claude_agent_sdk/_internal/transcript_mirror_batcher.py:91.
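
The chain above can be modeled with a toy batcher. This is a sketch under assumptions, not the SDK's implementation: ToyEagerBatcher, RecordingStore, join, and yields_until are hypothetical names, and the model only mirrors the lock-plus-wait_for shape described in the steps. The exact yield counts it reports depend on the Python version and asyncio internals, so none are shown here.

```python
import asyncio


class RecordingStore:
    """Hypothetical stand-in for the test suite's _RecordingStore."""

    def __init__(self):
        self.append_calls = []

    async def append(self, batch):
        # Records synchronously; the coroutine never suspends.
        self.append_calls.append(list(batch))


class ToyEagerBatcher:
    """Toy model only: one drain task per enqueue, serialized by a lock."""

    def __init__(self, store, timeout=1.0):
        self._store = store
        self._timeout = timeout
        self._lock = asyncio.Lock()
        self._pending = []
        self._tasks = set()

    def enqueue(self, frame):
        self._pending.append(frame)
        task = asyncio.create_task(self._drain())
        self._tasks.add(task)  # keep a strong reference to the task
        task.add_done_callback(self._tasks.discard)

    async def _drain(self):
        # A second drain blocks here until the first one releases the lock.
        async with self._lock:
            batch, self._pending = self._pending, []
            if batch:
                # Per the issue, wait_for runs this in an inner Task on 3.11.
                await asyncio.wait_for(self._store.append(batch), self._timeout)

    async def join(self):
        # Wait for every drain task that is still running.
        await asyncio.gather(*self._tasks)


async def yields_until(predicate, limit=100):
    """Count the sleep(0) yields spent before predicate() is true."""
    used = 0
    while not predicate():
        if used >= limit:
            return None
        await asyncio.sleep(0)
        used += 1
    return used


async def main():
    store = RecordingStore()
    batcher = ToyEagerBatcher(store)
    batcher.enqueue("frame-1")
    first = await yields_until(lambda: len(store.append_calls) == 1)
    batcher.enqueue("frame-2")
    second = await yields_until(lambda: len(store.append_calls) == 2)
    await batcher.join()
    return first, second, store.append_calls


if __name__ == "__main__":
    first, second, calls = asyncio.run(main())
    print(f"yields for first append={first}, second append={second}")
    print(calls)
```

The second enqueue is issued as soon as the first append is recorded, which is before the first drain has necessarily left wait_for and released the lock. That is the contention window the steps describe.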

Severity

Test-suite reliability / contributor experience. Production behavior is unaffected — eager flush works correctly given enough event-loop time. But running the full suite locally fails out of the box, which makes contribution awkward.

Suggested fixes (in order of preference)

  1. Replace the fixed yield count with a condition-based wait. Loop on await asyncio.sleep(0) until the expected condition holds or a deadline passes:
    import asyncio
    import time

    async def _wait_until(predicate, timeout=1.0):
        # Yield one event-loop turn at a time until predicate() is true.
        deadline = time.monotonic() + timeout
        while not predicate():
            if time.monotonic() > deadline:
                raise AssertionError("timed out waiting")
            await asyncio.sleep(0)
  2. Expose a test-only wait_quiescent() on the batcher that awaits _flush_task if set, and use it between enqueues in the tests.
  3. Make the test await batcher.flush() after each enqueue. Defeats the original intent of "verify automatic eager flush triggers", so least preferred.
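
As a rough illustration of option (1) in use, the sketch below repeats the helper so that it runs on its own and waits on a condition instead of a hard-coded number of yields. background_flush is a hypothetical stand-in for an eager flush that needs several event-loop turns; it is not part of the SDK.

```python
import asyncio
import time


async def _wait_until(predicate, timeout=1.0):
    # Yield one event-loop turn at a time until predicate() is true.
    deadline = time.monotonic() + timeout
    while not predicate():
        if time.monotonic() > deadline:
            raise AssertionError("timed out waiting")
        await asyncio.sleep(0)


async def demo():
    calls = []

    async def background_flush():
        # Hypothetical flush that takes five loop turns to land.
        for _ in range(5):
            await asyncio.sleep(0)
        calls.append("batch")

    task = asyncio.create_task(background_flush())
    # The test no longer needs to know how many turns the flush takes.
    await _wait_until(lambda: len(calls) == 1)
    await task
    return calls


async def demo_timeout():
    # A condition that never holds fails with AssertionError at the deadline.
    try:
        await _wait_until(lambda: False, timeout=0.05)
    except AssertionError:
        return True
    return False


if __name__ == "__main__":
    print(asyncio.run(demo()))
    print(asyncio.run(demo_timeout()))
```

The wait returns as soon as the condition holds, so the timeout only costs time when the test is genuinely failing.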

Happy to send a PR for option (1) if it's the preferred direction.

Environment

  • claude-agent-sdk-python @ 9aafd84 (current main)
  • Python 3.11.14 (uv-managed venv)
  • pytest 9.0.3, pytest-asyncio 1.3.0, anyio 4.13.0
  • macOS 25.3.0 (Darwin arm64)

Metadata

  • Labels: bug (Something isn't working)
