Skip to content

fix(anam): make AnamAvatarPublisher.start() idempotent#550

Draft
aliev wants to merge 1 commit into
mainfrom
fix/anam-idempotent-start
Draft

fix(anam): make AnamAvatarPublisher.start() idempotent#550
aliev wants to merge 1 commit into
mainfrom
fix/anam-idempotent-start

Conversation

@aliev
Copy link
Copy Markdown
Member

@aliev aliev commented May 13, 2026

Why

Agent._apply(\"start\") runs subsystem starts sequentially. AnamAvatarPublisher.start() blocks for ~3-5 seconds (Anam REST session creation + WebRTC handshake + first frame), so the avatar appears ~5-7 seconds after the user dials in — and that wait sits in the critical path of agent.join().

Callers want to kick avatar.start() off as a background task before agent.join() so the Anam connection runs in parallel with SFU join. The framework then calls start() again from _apply — that second call needs to be a no-op.

The current _connect() is not safe to call twice. It clears _connected / _session_ready events in finally, so a second call deadlocks waiting on a cleared event. Concurrent callers also race into the exit-stack enter_async_context, double-opening the session.

Changes

  • Wrap start() in an asyncio.Lock and a _started flag. First successful call runs _connect(); subsequent calls short-circuit. Concurrent callers serialize on the lock.
  • Tests: idempotent re-call, concurrent calls fire _connect once, failed connect leaves _started=False so retries still work.

_connect() itself is unchanged — the lock and flag live one level up in start(), which is the only public entrypoint.

`Agent._apply("start")` calls subsystem starts sequentially. Anam's
`start()` blocks for ~3-5 seconds (REST session creation + WebRTC
handshake + first video frame), which sits in the critical path of
`agent.join()` and makes the avatar appear ~5-7 seconds after the
user dials in.

Callers can move that latency out of the join path by kicking off
`avatar.start()` as a background task before `agent.join()` — but only
if `start()` is safe to call twice (once eagerly, once from the
framework's `_apply`).

Guard `start()` with an `asyncio.Lock` and a `_started` flag: the
first call runs `_connect()`, concurrent callers wait on the lock,
and a successful start short-circuits subsequent calls. A failed
connect leaves `_started=False` so retries still work.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 13, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

AnamAvatarPublisher.start() is made idempotent and concurrent-safe by adding _start_lock (AsyncLock) and _started (bool) state variables. The start() method now acquires the lock, returns immediately if already started, and calls _connect() exactly once before setting _started to True. Three async test cases validate that repeated calls are deduplicated, concurrent callers are properly serialized, and failed attempts permit subsequent retries.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 373c5ffd-daca-4e03-b6c0-2a4031cce4e5

📥 Commits

Reviewing files that changed from the base of the PR and between 33dbd3c and b0e4cd5.

📒 Files selected for processing (2)
  • plugins/anam/tests/test_anam_plugin.py
  • plugins/anam/vision_agents/plugins/anam/anam_avatar_publisher.py

Comment on lines +95 to +106
async def fake_connect():
nonlocal call_count
call_count += 1

pub._connect = fake_connect # type: ignore[method-assign]

await pub.start()
await pub.start()
await pub.start()

assert call_count == 1
assert pub._started is True
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Replace call-path spying with behavior/state assertions.

These tests validate _connect() invocation count by overriding a private method and asserting call_count, which is call-path testing and effectively mocking. Rework them to assert externally observable behavior/state transitions of start() without spying on private method invocations.

As per coding guidelines: "Never mock in tests" and "ALWAYS test behavior, not calling a path. Assert on outputs and state, not method calls."

Also applies to: 119-133, 142-157

nonlocal call_count
call_count += 1

pub._connect = fake_connect # type: ignore[method-assign]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Remove # type: ignore[method-assign] from tests.

Please avoid # type: ignore here; use a typed test double approach (fixture/class) so the assignment is type-safe without suppressions.

As per coding guidelines: "Avoid # type: ignore comments."

Also applies to: 124-124, 148-148

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant