Adds Grok TTS integration #433

Merged
Nash0x7E2 merged 9 commits into main from GrokTTS_integration on Apr 16, 2026
@amosgyamfi (Member) commented Mar 20, 2026

Grok TTS plugin support

Summary by CodeRabbit

  • New Features

    • Added text-to-speech (TTS) integration with multiple voice options, codec support, and speech markup capabilities.
    • Added customer support voice agent example implementation.
    • Added environment configuration template for xAI integration.
  • Bug Fixes

    • Improved error handling and exception specificity in realtime connections.
    • Enhanced server event processing and cancellation handling.
    • Simplified and fixed tool call extraction logic.
  • Documentation

    • Added TTS usage guide with configuration options and speech tag documentation.
    • Updated example documentation with new TTS example instructions.

coderabbitai Bot commented Mar 20, 2026

Caution

Review failed

Pull request was closed or merged during review

📝 Walkthrough

A new Text-to-Speech (TTS) provider implementation for the xAI plugin is added, featuring the XAITTS class with support for multiple voices, codecs, and audio formats. The xAI Realtime agent is updated to use 24kHz audio, improved event handling, and VAD interrupt response configuration. Documentation, examples, and comprehensive tests accompany the new feature alongside minor refinements to LLM tool-call extraction and agent audio event handling.

Changes

| Cohort / File(s) | Summary |
|---|---|
| TTS Implementation<br>`plugins/xai/vision_agents/plugins/xai/tts.py` | New `XAITTS` class providing async TTS streaming via xAI REST API with retry logic (up to 3 attempts on 429/500/503), exponential backoff, and audio decoding support for PCM, WAV, MP3, G.711 mu-law/A-law codecs; includes cancellation and cleanup methods. |
| Realtime Agent Updates<br>`plugins/xai/vision_agents/plugins/xai/xai_realtime.py` | Updated default sample rate from 48kHz to 24kHz; added `vad_interrupt_response` configuration; expanded session payload with model, modalities, and input transcription fields; improved exception handling with narrowed catch types; enhanced event processing with cancellation handling, special-case logging, and `plugin_name` attribution; refined tool-call filtering for server-executed functions. |
| TTS Example & Documentation<br>`plugins/xai/example/xai_tts_customer_support_example.py`, `plugins/xai/example/README.md`, `plugins/xai/README.md` | Added TTS example script demonstrating a customer-support voice agent with Deepgram STT and xAI LLM; updated README sections documenting TTS class, voice options, codec/sample-rate configuration, speech tags, and MP3 requirements. |
| Configuration & Dependencies<br>`plugins/xai/.env.example`, `plugins/xai/pyproject.toml` | Added environment variable template for xAI/Stream API keys and Deepgram STT; introduced optional `mp3` dependency group for `pydub>=0.25`. |
| Tests<br>`plugins/xai/tests/test_xai_tts.py`, `plugins/xai/tests/test_xai_realtime.py` | Added comprehensive TTS test suite covering constructor propagation, payload building, audio decoding (PCM, WAV, G.711), voice descriptions, and optional API integration tests; updated Realtime tests to expect 24kHz sample rate and assert `vad_interrupt_response` configuration. |
| Module Exports<br>`plugins/xai/vision_agents/plugins/xai/__init__.py` | Extended `__all__` to export `TTS` (alias for `XAITTS`), `Voice`, and `VOICE_DESCRIPTIONS` from the new `.tts` module. |
| LLM Simplification<br>`plugins/xai/vision_agents/plugins/xai/llm.py` | Streamlined `_extract_tool_calls_from_response` by removing defensive attribute access and fallback logic; assumes `response.tool_calls` is iterable and extracts fields directly. |
| Core Agent Fix<br>`agents-core/vision_agents/core/agents/agents.py` | Updated `on_audio_done` event handler to use named parameter and condition flush behavior on `event.interrupted` flag instead of always flushing when audio track exists. |
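The retry behaviour summarised in the TTS Implementation row (up to 3 attempts, retrying on 429/500/503 with exponential backoff) could look roughly like the sketch below. This is not the code from `tts.py`: `post_with_retry`, the `send` callable, the `base_delay` value, and the injectable `sleep` are all hypothetical names chosen for illustration.

```python
import asyncio
from typing import Awaitable, Callable, Tuple

# Values taken from the summary above; everything else is an assumption.
RETRYABLE_STATUSES = {429, 500, 503}
MAX_ATTEMPTS = 3


async def post_with_retry(
    send: Callable[[], Awaitable[Tuple[int, bytes]]],
    *,
    base_delay: float = 0.5,  # hypothetical starting delay in seconds
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> bytes:
    """Call send() until it succeeds, a non-retryable status arrives,
    or MAX_ATTEMPTS is exhausted. send() returns (http_status, body)."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        status, body = await send()
        if status == 200:
            return body
        if status not in RETRYABLE_STATUSES or attempt == MAX_ATTEMPTS:
            raise RuntimeError(f"TTS request failed with HTTP {status}")
        # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
        await sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter is only a convenience so the backoff schedule can be checked in tests without waiting.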

Sequence Diagram

sequenceDiagram
    participant Client as Client
    participant TTS as XAITTS
    participant API as xAI TTS API
    participant Decoder as Audio Decoder
    
    Client->>TTS: stream_audio(text, **kwargs)
    TTS->>TTS: Prepare request payload<br/>(voice, codec, sample_rate, etc.)
    
    loop Retry Logic (up to 3 attempts)
        TTS->>API: POST /tts/generate<br/>(with exponential backoff on 429/500/503)
        API-->>TTS: Audio bytes (PCM/WAV/MP3/G.711)
    end
    
    TTS->>Decoder: _decode_audio(bytes, codec)
    
    alt codec == "pcm"
        Decoder->>Decoder: Pass through raw PCM
    else codec == "mulaw" or "alaw"
        Decoder->>Decoder: Numpy-based G.711 decoding
    else codec == "wav"
        Decoder->>Decoder: wave module unpacking
    else codec == "mp3"
        Decoder->>Decoder: pydub MP3 decoding
    end
    
    Decoder-->>TTS: PcmData
    TTS-->>Client: PcmData | Iterator | AsyncIterator
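For the `mulaw` branch of the diagram, a minimal G.711 mu-law decoder is sketched below. The diagram says the plugin decodes with NumPy; this sketch swaps in a pure-Python lookup table so it has no dependencies, and `decode_mulaw` is a hypothetical name rather than a function from the PR.

```python
import struct


def _mulaw_byte_to_int16(u: int) -> int:
    """Expand one G.711 mu-law byte to a signed 16-bit sample."""
    u = ~u & 0xFF                # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    # 0x84 (132) is the standard mu-law bias.
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


# Precompute all 256 possible values once.
_MULAW_TABLE = [_mulaw_byte_to_int16(b) for b in range(256)]


def decode_mulaw(data: bytes) -> bytes:
    """Return little-endian 16-bit PCM for a buffer of mu-law bytes."""
    samples = [_MULAW_TABLE[b] for b in data]
    return struct.pack(f"<{len(samples)}h", *samples)
```

A NumPy version would typically index a precomputed `int16` table with the byte array, which is the same idea vectorised.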

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

The bell jar fills with Grok's new voice,
Five tongues emerge from codecs' choice—
Retry loops beat like a darkening heart,
While 24kHz audio tears the silence apart,
And in the xAI garden, five voices converge. 🎤

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 56.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title 'Adds Grok TTS integration' directly and concisely describes the main change: adding Grok text-to-speech functionality to the xAI plugin, which is evident across multiple new TTS files, documentation, examples, and related updates. |


@Nash0x7E2 self-requested a review April 15, 2026 21:51

llm.py: drop the `getattr` chain in `_extract_tool_calls_from_response`
in favor of direct attribute access — `Response.tool_calls` always
returns a list and the `ToolCall` proto fields (id, function, name,
arguments) are guaranteed-present. Removes the dead `call_id` fallback
(no such field on the proto) and narrows the bare `except Exception`
to `json.JSONDecodeError`.
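A rough sketch of the direct-access shape described above. The dataclasses are hypothetical stand-ins for the xai-sdk proto types, the output dictionary keys are illustrative, and falling back to an empty dict on malformed JSON is an assumption; the commit message only says which exception is caught.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class FunctionCall:  # hypothetical stand-in for the proto's function field
    name: str
    arguments: str


@dataclass
class ToolCall:  # hypothetical stand-in for the ToolCall proto
    id: str
    function: FunctionCall


@dataclass
class Response:  # hypothetical stand-in; tool_calls is always a list
    tool_calls: List[ToolCall] = field(default_factory=list)


def extract_tool_calls(response: Response) -> List[Dict[str, Any]]:
    """Read fields directly; only malformed JSON arguments are tolerated."""
    calls = []
    for tc in response.tool_calls:
        try:
            args = json.loads(tc.function.arguments) if tc.function.arguments else {}
        except json.JSONDecodeError:
            args = {}  # assumption: bad JSON degrades to empty arguments
        calls.append({"id": tc.id, "name": tc.function.name, "arguments": args})
    return calls
```

Because only `json.JSONDecodeError` is caught, a missing attribute or wrong type raises instead of being swallowed, which is the point of narrowing the handler.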

xai_realtime.py:
- Refresh the stale "as of xai-sdk 1.5.0" docstrings; verified xai-sdk
  1.11 still ships no realtime/voice/websocket wrapper, so the raw
  `websockets` implementation remains correct.
- Bump cosmetic `DEFAULT_MODEL` from "grok-3-fast" to "grok-4" (per
  the existing docstring this value is informational and not sent to
  the API).
- Hoist `aiohttp` import to the module top.
- Narrow each `except Exception` to specific tuples — `OSError`/
  `WebSocketException`/`TimeoutError` for connect, `ConnectionError`/
  `WebSocketException` for send paths, and the processing loop now
  swallows only transient transport/decode errors so programming bugs
  surface instead of being silently logged.
- Pass `plugin_name="xai"` on the `LLMResponseChunkEvent` emitted from
  `_handle_response_done`, matching every other event in the file.

Session config now mirrors the livekit xAI plugin's known-working shape:
- Send model name ("grok-4-1-fast-non-reasoning") in session.update
- Include input_audio_transcription for server-side transcription
- Expand turn_detection from bare {"type":"server_vad"} to full
  ServerVad config with threshold, padding, duration, and
  interrupt_response=False (prevents mic echo from cancelling the
  agent's own response mid-sentence)
- Fix DEFAULT_SAMPLE_RATE from 48000 to 24000 — xAI's realtime
  model emits PCM at 24 kHz; tagging frames as 48 kHz caused 2x
  playback speed and premature buffer drain
- Hoist aiohttp import to module level
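The 2x playback speed follows directly from the sample-rate arithmetic. The helper below is a hypothetical illustration, not code from the plugin:

```python
def playback_seconds(num_samples: int, tagged_rate_hz: int) -> float:
    """Duration a player assumes for a mono buffer at the tagged rate."""
    return num_samples / tagged_rate_hz


# One second of audio as the server actually produced it (24 kHz PCM).
samples = 24_000

correct = playback_seconds(samples, 24_000)     # tagged with the true rate
mislabeled = playback_seconds(samples, 48_000)  # tagged as 48 kHz by mistake

# The same samples are consumed in half the time, so speech plays twice
# as fast and the buffer empties earlier than expected.
speedup = correct / mislabeled
```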

Diagnostics:
- Explicitly handle response.cancelled / response.cancel events with
  a WARNING log so server-initiated interrupts are visible
- Bump unhandled event types from DEBUG to INFO for runtime visibility
- Handle rate_limits.updated at DEBUG
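The log-level choices listed above can be sketched as a small mapping. `log_level_for_event` and `log_server_event` are hypothetical names, the logger name is made up, and the sketch covers only the three cases from this commit message; event types that the plugin handles elsewhere are out of scope.

```python
import logging

logger = logging.getLogger("xai_realtime_sketch")  # hypothetical logger name

CANCEL_EVENTS = {"response.cancelled", "response.cancel"}


def log_level_for_event(event_type: str) -> int:
    """Pick a log level for a server event type."""
    if event_type in CANCEL_EVENTS:
        return logging.WARNING  # server-initiated interrupts must be visible
    if event_type == "rate_limits.updated":
        return logging.DEBUG    # frequent and low-signal
    return logging.INFO         # otherwise-unhandled events


def log_server_event(event: dict) -> None:
    """Log a decoded server event at the level chosen above."""
    event_type = event.get("type", "")
    logger.log(log_level_for_event(event_type), "server event: %s", event_type)
```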

on_audio_done was calling _audio_track.flush() unconditionally on every
RealtimeAudioOutputDoneEvent. flush() discards the buffer immediately
("Playback stops immediately"), which truncates audio when the server
finishes sending faster than real-time playback drains.

Now flush() is only called when event.interrupted is True (barge-in).
On normal completion the buffer drains naturally through playback.

This only affects realtime plugins that deliver audio via WebSocket
events through the _audio_track buffer (currently xAI). OpenAI and
Gemini use WebRTC where audio bypasses this buffer path entirely.
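A minimal sketch of the conditional flush, using hypothetical stand-ins for `RealtimeAudioOutputDoneEvent` and the audio track. The real handler lives in `agents.py` and may well be async; this synchronous version only shows the branching.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AudioDoneEvent:  # hypothetical stand-in for RealtimeAudioOutputDoneEvent
    interrupted: bool = False


class FakeAudioTrack:  # hypothetical stand-in that counts flush() calls
    def __init__(self) -> None:
        self.flush_calls = 0

    def flush(self) -> None:
        self.flush_calls += 1


def on_audio_done(event: AudioDoneEvent, audio_track: Optional[FakeAudioTrack]) -> None:
    """Discard buffered audio only on barge-in; otherwise let it drain."""
    if audio_track is not None and event.interrupted:
        audio_track.flush()
```

With `interrupted=False` the buffer is left alone, so audio that arrived faster than real time still plays to the end.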

@Nash0x7E2 marked this pull request as ready for review April 16, 2026 16:52
@Nash0x7E2 merged commit d887697 into main Apr 16, 2026
5 of 6 checks passed
@Nash0x7E2 deleted the GrokTTS_integration branch April 16, 2026 16:56