Add together STT and TTS services #4054
@blainekasten, we've changed enough with the base TTS service class and how settings are handled that it's probably worth collaborating on my branch here. I've updated this branch to align with those recent changes. In revisiting it today, I see that the TTS service is not returning any audio. I've poked around a bit and have written some standalone tests where I can't get audio returned when following the examples in your docs. Was there an API change, or is this perhaps something to do with my account or API key? Can you help?
@markbackman sorry, let me put this in draft mode. I'm actively working on this.
fe84a88 to 7e22e23
I pushed a number of changes to these classes and have a working 07e version.
I think the issue was three-fold:
- The STT service seems to take a while to warm up before it yields a transcript. Do you see this?
- The Qwen model is really slow at inference.
- The STT is very sensitive and prone to outputting false positives. This is a classic symptom of Whisper models: lots of "You" or "Thank you" produced. This is actually a pretty big problem for production use, so if you can tune this to reduce false positives, it will really help with adoption.
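Until the model itself is tuned, one client-side stopgap for the false positives described above is to drop transcripts that consist only of a known filler phrase. This is a minimal sketch, not part of Pipecat or the Together API: `FILLER_PHRASES` and `is_probable_false_positive` are hypothetical names, and the phrase list contains only the two examples mentioned in this thread.

```python
import string

# Hypothetical phrase list: only the two examples mentioned in this thread.
FILLER_PHRASES = {"you", "thank you"}


def is_probable_false_positive(transcript: str) -> bool:
    """Return True if the transcript is empty or is exactly a known filler phrase."""
    # Normalize case and strip surrounding whitespace and punctuation.
    normalized = transcript.strip().lower().strip(string.punctuation + " ")
    return normalized == "" or normalized in FILLER_PHRASES
```

The trade-off is that a genuine, standalone "Thank you" from the user is dropped too, so this only makes sense as a temporary measure.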
Also, I rebased on the latest main.
| api_key=os.getenv("TOGETHER_API_KEY"), | ||
| settings=TogetherLLMService.Settings( | ||
| model="Qwen/Qwen3.5-9B", | ||
| model="openai/gpt-oss-120b", |
I found the Qwen model to be really slow and to have trouble producing inference. I've used the gpt-oss-120b model for another project and it's worked well. It seems to work well here too.
| # 1. Initialize default_settings with hardcoded defaults | ||
| default_settings = self.Settings(model=model) | ||
| default_settings = self.Settings( | ||
| model="openai/gpt-oss-120b", |
Making "openai/gpt-oss-120b" the default. This is the pattern we follow when initializing settings.
| from pipecat.utils.tracing.service_decorators import traced_stt | ||
|
|
||
| # Together requires 16 kHz 16-bit mono PCM input. | ||
| _TOGETHER_SAMPLE_RATE = 16000 |
Together only supports 16 kHz, so we're setting a constant and then using a resampler. This keeps the service working if a user sets a different sample rate via the PipelineParams.
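To illustrate what the resampling step does, here is a standalone sketch that converts 16-bit mono PCM to 16 kHz by linear interpolation. It is not the resampler used in this PR; `resample_pcm16_mono` and `TARGET_SAMPLE_RATE` are hypothetical names, and the code assumes the audio bytes are in the machine's native byte order.

```python
from array import array

TARGET_SAMPLE_RATE = 16000  # same value as the constant in the diff above


def resample_pcm16_mono(audio: bytes, in_rate: int, out_rate: int = TARGET_SAMPLE_RATE) -> bytes:
    """Linearly interpolate 16-bit mono PCM from in_rate to out_rate."""
    if in_rate == out_rate or not audio:
        return audio
    samples = array("h")  # signed 16-bit, native byte order (assumption)
    samples.frombytes(audio)
    n_out = len(samples) * out_rate // in_rate
    out = array("h")
    for i in range(n_out):
        pos = i * in_rate / out_rate  # fractional position in the input
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        # Weighted average of the two neighbouring input samples.
        out.append(round(samples[left] * (1 - frac) + samples[right] * frac))
    return out.tobytes()
```

Plain interpolation applies no low-pass filter, so downsampling this way can alias; a real resampler should be preferred outside of a quick test.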
| """ | ||
| return True | ||
|
|
||
| async def _update_settings(self, delta: STTSettings) -> dict[str, Any]: |
This allows for runtime updates, where supported.
| """ | ||
|
|
||
| _settings: TogetherTTSSettings | ||
| Settings = TogetherTTSSettings |
Same settings pattern here.
| logger.trace(f"{self}: flushing audio (context_id={context_id})") | ||
| await self._ws_send({"type": "input_text_buffer.commit"}) | ||
| ctx_id = context_id or self._context_id | ||
| if not ctx_id or not self.audio_context_available(ctx_id): |
Lots of context changes, which are now required. I applied the latest in this class.
| SARVAM_TTFS_P99: float = 1.17 | ||
| SONIOX_TTFS_P99: float = 0.35 | ||
| SPEECHMATICS_TTFS_P99: float = 0.74 | ||
| TOGETHER_TTFS_P99: float = 2.028 |
I wonder if this value is so high because of the warm-up problem I think I observed in testing. SOTA services have a p50 of around 0.3 s and a p99 of around 0.4 s. Ideally, this p99 latency would be lower to remain competitive.
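To show why a warm-up delay would inflate p99 while leaving p50 mostly untouched, here is a small nearest-rank percentile sketch. The function name `percentile_nearest_rank` is hypothetical, the sample latencies in the usage note are made up for illustration, and this is not necessarily how the TTFS constants above were measured.

```python
def percentile_nearest_rank(samples: list[float], pct: int) -> float:
    """Return the pct-th percentile (1..100) using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in 1..100")
    ordered = sorted(samples)
    # Integer form of ceil(pct * n / 100), avoiding float rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]
```

With the made-up samples `[0.30, 0.31, 0.29, 0.35, 2.0]`, the p50 is 0.31 while the p99 is 2.0: a single slow first response is enough to set the tail.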
Rename 07e-interruptible-together.py to voice-together.py, add transcription-together.py, remove unused OpenAI import, and register voice-together in release evals.
fc7a19a to a2fffcb
Move model/language out of __init__ args into settings-based configuration with default-then-apply-delta pattern. Add Settings class attribute, language_to_service_language(), and _update_settings() with reconnect support.
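The default-then-apply-delta pattern described in this commit can be sketched roughly as follows. This is a simplified illustration with hypothetical names (`ExampleSettings`, `ExampleService`, `apply_delta`, `reconnect_count`) and placeholder defaults; it does not reproduce Pipecat's actual base classes, and the reconnect is represented by a counter only.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional


@dataclass
class ExampleSettings:
    # None means "not specified", so a delta can carry only the changed fields.
    model: Optional[str] = None
    language: Optional[str] = None

    def apply_delta(self, delta: "ExampleSettings") -> dict[str, Any]:
        """Copy every non-None field from delta; return the fields that changed."""
        changed: dict[str, Any] = {}
        for f in fields(self):
            new = getattr(delta, f.name)
            if new is not None and new != getattr(self, f.name):
                setattr(self, f.name, new)
                changed[f.name] = new
        return changed


class ExampleService:
    def __init__(self, settings: Optional[ExampleSettings] = None):
        # 1. Initialize with hardcoded defaults (placeholder values).
        self._settings = ExampleSettings(model="default-model", language="en")
        # 2. Apply whatever the caller overrode on top of the defaults.
        if settings is not None:
            self._settings.apply_delta(settings)
        self.reconnect_count = 0

    def update_settings(self, delta: ExampleSettings) -> dict[str, Any]:
        """Apply a runtime delta and reconnect only if something changed."""
        changed = self._settings.apply_delta(delta)
        if changed:
            self.reconnect_count += 1  # stand-in for a real reconnect
        return changed
```

Keeping model and language out of the constructor arguments means the same delta object works both at construction time and for runtime updates.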
- Use settings-based configuration (remove model/voice/language from __init__ args)
- Enable push_stop_frames, push_start_frame, pause_frame_processing to let base class manage TTS frame lifecycle
- Use append_to_audio_context() and get_active_audio_context_id() instead of manual context tracking
- Fix flush_audio signature to match base class
- Add language_to_service_language() and _update_settings() with reconnect support
- Remove unused voice arg from example
| ) | ||
| headers = { | ||
| "Authorization": f"Bearer {self._api_key}", | ||
| "OpenAI-Beta": "realtime=v1", |