Adds Grok TTS integration by amosgyamfi · Pull Request #433 · GetStream/Vision-Agents

amosgyamfi · 2026-03-20T12:26:03Z

Grok TTS plugin support

Summary by CodeRabbit

New Features
- Added text-to-speech (TTS) integration with multiple voice options, codec support, and speech markup capabilities.
- Added customer support voice agent example implementation.
- Added environment configuration template for xAI integration.
Bug Fixes
- Improved error handling and exception specificity in realtime connections.
- Enhanced server event processing and cancellation handling.
- Simplified and fixed tool call extraction logic.
Documentation
- Added TTS usage guide with configuration options and speech tag documentation.
- Updated example documentation with new TTS example instructions.

coderabbitai · 2026-03-20T12:26:12Z

Caution

Review failed

Pull request was closed or merged during review

📝 Walkthrough

A new Text-to-Speech (TTS) provider implementation for the xAI plugin is added, featuring the XAITTS class with support for multiple voices, codecs, and audio formats. The xAI Realtime agent is updated to use 24kHz audio, improved event handling, and VAD interrupt response configuration. Documentation, examples, and comprehensive tests accompany the new feature alongside minor refinements to LLM tool-call extraction and agent audio event handling.

Changes

| Cohort / File(s) | Summary |
|---|---|
| TTS Implementation `plugins/xai/vision_agents/plugins/xai/tts.py` | New `XAITTS` class providing async TTS streaming via xAI REST API with retry logic (up to 3 attempts on 429/500/503), exponential backoff, and audio decoding support for PCM, WAV, MP3, G.711 mu-law/A-law codecs; includes cancellation and cleanup methods. |
| Realtime Agent Updates `plugins/xai/vision_agents/plugins/xai/xai_realtime.py` | Updated default sample rate from 48kHz to 24kHz; added `vad_interrupt_response` configuration; expanded session payload with model, modalities, and input transcription fields; improved exception handling with narrowed catch types; enhanced event processing with cancellation handling, special-case logging, and `plugin_name` attribution; refined tool-call filtering for server-executed functions. |
| TTS Example & Documentation `plugins/xai/example/xai_tts_customer_support_example.py`, `plugins/xai/example/README.md`, `plugins/xai/README.md` | Added TTS example script demonstrating a customer-support voice agent with Deepgram STT and xAI LLM; updated README sections documenting TTS class, voice options, codec/sample-rate configuration, speech tags, and MP3 requirements. |
| Configuration & Dependencies `plugins/xai/.env.example`, `plugins/xai/pyproject.toml` | Added environment variable template for xAI/Stream API keys and Deepgram STT; introduced optional `mp3` dependency group for `pydub>=0.25`. |
| Tests `plugins/xai/tests/test_xai_tts.py`, `plugins/xai/tests/test_xai_realtime.py` | Added comprehensive TTS test suite covering constructor propagation, payload building, audio decoding (PCM, WAV, G.711), voice descriptions, and optional API integration tests; updated Realtime tests to expect 24kHz sample rate and assert `vad_interrupt_response` configuration. |
| Module Exports `plugins/xai/vision_agents/plugins/xai/__init__.py` | Extended `__all__` to export `TTS` (alias for `XAITTS`), `Voice`, and `VOICE_DESCRIPTIONS` from the new `.tts` module. |
| LLM Simplification `plugins/xai/vision_agents/plugins/xai/llm.py` | Streamlined `_extract_tool_calls_from_response` by removing defensive attribute access and fallback logic; assumes `response.tool_calls` is iterable and extracts fields directly. |
| Core Agent Fix `agents-core/vision_agents/core/agents/agents.py` | Updated `on_audio_done` event handler to use named parameter and condition flush behavior on `event.interrupted` flag instead of always flushing when audio track exists. |
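
The retry behavior summarized for `tts.py` (up to 3 attempts on 429/500/503 with exponential backoff) could look roughly like the sketch below. This is not the PR's code: `post_with_retry`, the injected `send` callable, and the `base_delay` value are hypothetical names chosen for illustration, and the HTTP call is abstracted away so the example runs without network access.

```python
import asyncio
from typing import Awaitable, Callable, Tuple

# Status codes the walkthrough lists as retryable.
RETRYABLE_STATUSES = {429, 500, 503}


async def post_with_retry(
    send: Callable[[], Awaitable[Tuple[int, bytes]]],
    max_attempts: int = 3,
    base_delay: float = 0.5,  # assumed value, not taken from the PR
) -> bytes:
    """Call `send` until it returns HTTP 200 or the attempts run out.

    `send` is a hypothetical coroutine returning (status_code, body).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = await send()
        if status == 200:
            return body
        is_last_attempt = attempt == max_attempts - 1
        if status not in RETRYABLE_STATUSES or is_last_attempt:
            raise RuntimeError(f"TTS request failed with HTTP {status}")
        # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
        await asyncio.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("retry loop exited unexpectedly")
```

Non-retryable statuses (for example 400 or 401) fail immediately in this sketch, which avoids burning attempts on requests that cannot succeed.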

Sequence Diagram

sequenceDiagram
    participant Client as Client
    participant TTS as XAITTS
    participant API as xAI TTS API
    participant Decoder as Audio Decoder
    
    Client->>TTS: stream_audio(text, **kwargs)
    TTS->>TTS: Prepare request payload<br/>(voice, codec, sample_rate, etc.)
    
    loop Retry Logic (up to 3 attempts)
        TTS->>API: POST /tts/generate<br/>(with exponential backoff on 429/500/503)
        API-->>TTS: Audio bytes (PCM/WAV/MP3/G.711)
    end
    
    TTS->>Decoder: _decode_audio(bytes, codec)
    
    alt codec == "pcm"
        Decoder->>Decoder: Pass through raw PCM
    else codec == "mulaw" or "alaw"
        Decoder->>Decoder: Numpy-based G.711 decoding
    else codec == "wav"
        Decoder->>Decoder: wave module unpacking
    else codec == "mp3"
        Decoder->>Decoder: pydub MP3 decoding
    end
    
    Decoder-->>TTS: PcmData
    TTS-->>Client: PcmData | Iterator | AsyncIterator
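
To make the `mulaw` branch of the diagram concrete, here is a minimal sketch of G.711 mu-law expansion to 16-bit PCM. The diagram says the PR does this with NumPy; the version below swaps in a plain lookup table from the standard library so that it has no dependencies. `decode_mulaw` and `MULAW_TABLE` are illustrative names, not identifiers from the PR.

```python
import struct


def _mulaw_byte_to_pcm16(code: int) -> int:
    """Expand one G.711 mu-law byte to a signed 16-bit sample."""
    code = ~code & 0xFF               # mu-law bytes are stored bit-inverted
    exponent = (code >> 4) & 0x07     # 3-bit segment number
    mantissa = code & 0x0F            # 4-bit step within the segment
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if code & 0x80 else magnitude


# Precompute all 256 possible values once; decoding is then a table lookup.
MULAW_TABLE = [_mulaw_byte_to_pcm16(b) for b in range(256)]


def decode_mulaw(data: bytes) -> bytes:
    """Return little-endian 16-bit PCM for a buffer of mu-law bytes."""
    samples = [MULAW_TABLE[b] for b in data]
    return struct.pack(f"<{len(samples)}h", *samples)
```

Each input byte becomes two output bytes, so the decoded buffer is twice the size of the encoded one.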

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

The bell jar fills with Grok's new voice,
Five tongues emerge from codecs' choice—
Retry loops beat like a darkening heart,
While 24kHz audio tears the silence apart,
And in the xAI garden, five voices converge. 🎤

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 56.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title 'Adds Grok TTS integration' directly and concisely describes the main change: adding Grok text-to-speech functionality to the xAI plugin, which is evident across multiple new TTS files, documentation, examples, and related updates. |


llm.py: drop the `getattr` chain in `_extract_tool_calls_from_response` in favor of direct attribute access. `Response.tool_calls` always returns a list and the `ToolCall` proto fields (id, function, name, arguments) are guaranteed-present. Removes the dead `call_id` fallback (no such field on the proto) and narrows the bare `except Exception` to `json.JSONDecodeError`.

xai_realtime.py:
- Refresh the stale "as of xai-sdk 1.5.0" docstrings; verified xai-sdk 1.11 still ships no realtime/voice/websocket wrapper, so the raw `websockets` implementation remains correct.
- Bump cosmetic `DEFAULT_MODEL` from "grok-3-fast" to "grok-4" (per the existing docstring this value is informational and not sent to the API).
- Hoist `aiohttp` import to the module top.
- Narrow each `except Exception` to specific tuples: `OSError`/`WebSocketException`/`TimeoutError` for connect, `ConnectionError`/`WebSocketException` for send paths, and the processing loop now swallows only transient transport/decode errors so programming bugs surface instead of being silently logged.
- Pass `plugin_name="xai"` on the `LLMResponseChunkEvent` emitted from `_handle_response_done`, matching every other event in the file.
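
As a rough illustration of the `llm.py` change described in this commit, the sketch below extracts tool calls by direct attribute access and catches only `json.JSONDecodeError`. The dataclasses are hypothetical stand-ins for the xAI SDK's `Response` and `ToolCall` types, the output dictionary shape is invented for the example, and falling back to an empty argument dict on bad JSON is an assumption rather than the PR's confirmed behavior.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List


# Hypothetical stand-ins for the SDK types named in the commit message.
@dataclass
class Function:
    name: str
    arguments: str  # JSON-encoded argument object


@dataclass
class ToolCall:
    id: str
    function: Function


@dataclass
class Response:
    tool_calls: List[ToolCall] = field(default_factory=list)


def extract_tool_calls(response: Response) -> List[Dict[str, Any]]:
    """Read fields directly; no getattr chain and no `call_id` fallback."""
    calls: List[Dict[str, Any]] = []
    for tool_call in response.tool_calls:
        try:
            arguments = json.loads(tool_call.function.arguments or "{}")
        except json.JSONDecodeError:
            # Assumed fallback: treat malformed JSON as "no arguments".
            arguments = {}
        calls.append(
            {
                "id": tool_call.id,
                "name": tool_call.function.name,
                "arguments": arguments,
            }
        )
    return calls
```

Because only `json.JSONDecodeError` is caught, an unexpected `AttributeError` or `TypeError` would propagate instead of being silently swallowed, which is the stated point of narrowing the handler.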

Session config now mirrors the livekit xAI plugin's known-working shape:
- Send model name ("grok-4-1-fast-non-reasoning") in session.update
- Include input_audio_transcription for server-side transcription
- Expand turn_detection from bare {"type":"server_vad"} to full ServerVad config with threshold, padding, duration, and interrupt_response=False (prevents mic echo from cancelling the agent's own response mid-sentence)
- Fix DEFAULT_SAMPLE_RATE from 48000 to 24000: xAI's realtime model emits PCM at 24 kHz; tagging frames as 48 kHz caused 2x playback speed and premature buffer drain
- Hoist aiohttp import to module level

Diagnostics:
- Explicitly handle response.cancelled / response.cancel events with a WARNING log so server-initiated interrupts are visible
- Bump unhandled event types from DEBUG to INFO for runtime visibility
- Handle rate_limits.updated at DEBUG
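
The 2x playback speed mentioned in the sample-rate fix follows from simple arithmetic: playback duration is the frame count divided by the declared sample rate. The helper below is a hypothetical illustration (not code from the PR), assuming 16-bit mono PCM.

```python
def playback_seconds(
    pcm: bytes,
    sample_rate: int,
    channels: int = 1,
    sample_width: int = 2,  # bytes per sample; 2 for 16-bit PCM
) -> float:
    """Duration a player assumes for `pcm` at the declared sample rate."""
    frames = len(pcm) // (channels * sample_width)
    return frames / sample_rate


# One second of 16-bit mono audio as produced at 24 kHz.
one_second_at_24k = bytes(24_000 * 2)

correct = playback_seconds(one_second_at_24k, 24_000)    # 1.0
mislabeled = playback_seconds(one_second_at_24k, 48_000)  # 0.5
```

Audio tagged as 48 kHz is consumed in half the intended time, so it plays at twice the speed and the buffer empties early.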

on_audio_done was calling _audio_track.flush() unconditionally on every RealtimeAudioOutputDoneEvent. flush() discards the buffer immediately ("Playback stops immediately"), which truncates audio when the server finishes sending faster than real-time playback drains. Now flush() is only called when event.interrupted is True (barge-in). On normal completion the buffer drains naturally through playback. This only affects realtime plugins that deliver audio via WebSocket events through the _audio_track buffer (currently xAI). OpenAI and Gemini use WebRTC where audio bypasses this buffer path entirely.
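
The conditional flush described above can be sketched as follows. `AudioDoneEvent` and `BufferedTrack` are simplified, hypothetical stand-ins for `RealtimeAudioOutputDoneEvent` and the agent's `_audio_track`; only the `interrupted` flag and the `flush()` semantics are taken from the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AudioDoneEvent:
    """Stand-in for RealtimeAudioOutputDoneEvent."""
    interrupted: bool = False


class BufferedTrack:
    """Stand-in audio track that queues PCM until playback drains it."""

    def __init__(self) -> None:
        self.buffer = bytearray()

    def write(self, pcm: bytes) -> None:
        self.buffer.extend(pcm)

    def flush(self) -> None:
        # Discards queued audio so playback stops immediately.
        self.buffer.clear()


def on_audio_done(event: AudioDoneEvent, track: Optional[BufferedTrack]) -> None:
    """Flush only on barge-in; otherwise let the buffer drain naturally."""
    if track is not None and event.interrupted:
        track.flush()
```

With this shape, a normal completion leaves queued audio untouched, while an interrupted response drops it at once.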

Adds Grok TTS integration

c1022f4

amosgyamfi assigned dangusev and d3xvn Mar 20, 2026

github-actions Bot added dependencies plugins config docs project-info labels Mar 20, 2026

Merge branch 'main' into GrokTTS_integration

f843766

Nash0x7E2 self-requested a review April 15, 2026 21:51

Nash0x7E2 added 7 commits April 15, 2026 16:04

Migrate grok_tts to XAI package

afa3f14

Use faster model with streaming

965fa9a

Fix duplicate messages and server side tool calling

f6058a8

Fix ruff and mypy issues

bc3b91d

github-actions Bot added the agents-core label Apr 16, 2026

Nash0x7E2 marked this pull request as ready for review April 16, 2026 16:52

Nash0x7E2 merged commit d887697 into main Apr 16, 2026
5 of 6 checks passed

Nash0x7E2 deleted the GrokTTS_integration branch April 16, 2026 16:56

Nash0x7E2 merged 9 commits into main from GrokTTS_integration
