Add together STT and TTS services #4054

Draft
blainekasten wants to merge 4 commits into pipecat-ai:main from blainekasten:add_together_stt_tts

@blainekasten

Please describe the changes in your PR. If it is addressing an issue, please reference that as well.

@markbackman
Contributor

markbackman commented Mar 17, 2026

@blainekasten, we've changed enough in the base TTS service class and in how settings are handled that it's probably worth collaborating on my branch here:
#3904

I've updated this branch to align with those recent changes.

In revisiting it today, I see that the TTS service is not returning any audio. I've poked around a bit and written some standalone tests that follow the examples in your docs, and I still can't get any audio returned. Was there an API change, or is this perhaps something to do with my account or API key? Can you help?

@markbackman markbackman self-requested a review March 17, 2026 14:29
@blainekasten
Author

@markbackman sorry, let me put this in draft mode. I'm actively working on this.

@blainekasten blainekasten marked this pull request as draft March 17, 2026 20:04
@markbackman markbackman force-pushed the add_together_stt_tts branch from fe84a88 to 7e22e23 Compare March 21, 2026 02:25
Contributor

@markbackman markbackman left a comment

I pushed a number of changes to these classes and now have a working version of the 07e example (07e-interruptible-together.py).

I think the issue was three-fold:

  1. The STT service seems to take a while to warm up before it will yield a transcript. Do you see this?
  2. The Qwen model is really slow at inference.
  3. The STT is very sensitive and prone to outputting false positives. This is a classic symptom of Whisper models: lots of "You" or "Thank you" transcripts get produced. This is actually a pretty big problem for production use, so if you can tune this to reduce false positives, it will really help with adoption.
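
To make the false-positive point concrete, here is a minimal client-side sketch of the kind of guard an application could apply while the model itself is being tuned. It is not part of this PR or of pipecat: the phrase list, the `speech_seconds` input (assumed to come from a VAD), and the 0.6 second threshold are all hypothetical.

```python
import re

# Hypothetical list: the two phrases called out in the review above.
_SUSPECT_PHRASES = {"you", "thank you"}


def _normalize(text: str) -> str:
    """Lowercase and drop punctuation so "Thank you." matches "thank you"."""
    return re.sub(r"[^a-z0-9' ]+", "", text.lower()).strip()


def is_probable_false_positive(
    text: str, speech_seconds: float, min_speech_seconds: float = 0.6
) -> bool:
    """Flag a transcript that is a known filler phrase backed by very little speech.

    `speech_seconds` is assumed to be the amount of speech a VAD detected
    for this utterance; the threshold is an illustrative guess.
    """
    return _normalize(text) in _SUSPECT_PHRASES and speech_seconds < min_speech_seconds
```

A filter like this only hides the symptom. It would also drop a genuine, very short "Thank you", which is why tuning on the service side is the better fix.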

Also, I rebased on the latest main.

      api_key=os.getenv("TOGETHER_API_KEY"),
      settings=TogetherLLMService.Settings(
-         model="Qwen/Qwen3.5-9B",
+         model="openai/gpt-oss-120b",
Contributor

I found the Qwen model to be really slow and to have trouble producing inference output. I've used the gpt-oss-120b model for another project and it worked well. It seems to work well here too.

Comment thread examples/transcription/transcription-together.py
Comment thread src/pipecat/services/together/llm.py Outdated
  # 1. Initialize default_settings with hardcoded defaults
- default_settings = self.Settings(model=model)
+ default_settings = self.Settings(
+     model="openai/gpt-oss-120b",
Contributor

Making "openai/gpt-oss-120b" the default. This follows the pattern we use to initialize settings.
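
For readers unfamiliar with the pattern, a rough standalone sketch of "hardcoded defaults first, then apply whatever the caller set" could look like the following. `LLMSettingsSketch` and `resolve_settings` are hypothetical names used only for illustration; they are not the pipecat classes touched by this diff.

```python
from dataclasses import dataclass, fields, replace
from typing import Optional


@dataclass
class LLMSettingsSketch:
    """Hypothetical stand-in for a service Settings class; None means "not set"."""

    model: Optional[str] = None
    temperature: Optional[float] = None


def resolve_settings(user: Optional[LLMSettingsSketch] = None) -> LLMSettingsSketch:
    # 1. Start from hardcoded defaults.
    resolved = LLMSettingsSketch(model="openai/gpt-oss-120b")
    # 2. Apply only the fields the caller actually set (the "delta").
    if user is not None:
        delta = {
            f.name: getattr(user, f.name)
            for f in fields(user)
            if getattr(user, f.name) is not None
        }
        resolved = replace(resolved, **delta)
    return resolved
```

With this shape, a caller who only sets `temperature` still gets the default model, and a caller who passes a model overrides it.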

Comment thread src/pipecat/services/together/stt.py Outdated
from pipecat.utils.tracing.service_decorators import traced_stt

# Together requires 16 kHz 16-bit mono PCM input.
_TOGETHER_SAMPLE_RATE = 16000
Contributor

Together only supports 16 kHz input, so we're setting a constant and then using a resampler. This keeps the service working if a user sets a different sample rate via the PipelineParams.
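
As an illustration of what the resampling step does, here is a small pure-Python sketch that converts 16-bit mono PCM to 16 kHz by linear interpolation. It is not the resampler used in the PR, and linear interpolation without a low-pass filter is only adequate for a demo. `resample_pcm16_mono` is a hypothetical helper name.

```python
from array import array

TARGET_SAMPLE_RATE = 16000  # mirrors _TOGETHER_SAMPLE_RATE in the diff


def resample_pcm16_mono(pcm: bytes, in_rate: int, out_rate: int = TARGET_SAMPLE_RATE) -> bytes:
    """Resample 16-bit mono PCM by linear interpolation (illustrative only)."""
    if in_rate == out_rate or not pcm:
        return pcm
    src = array("h")
    src.frombytes(pcm)
    out_len = max(1, len(src) * out_rate // in_rate)
    out = array("h")
    for i in range(out_len):
        pos = i * in_rate / out_rate  # fractional index into the source samples
        lo = int(pos)
        hi = min(lo + 1, len(src) - 1)
        frac = pos - lo
        # Blend the two neighbouring samples; the result stays within int16 range.
        out.append(round(src[lo] * (1 - frac) + src[hi] * frac))
    return out.tobytes()
```

For example, 10 ms of 48 kHz audio (480 samples) comes out as 160 samples, which is the 16 kHz equivalent.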

Comment thread src/pipecat/services/together/stt.py Outdated
"""
return True

async def _update_settings(self, delta: STTSettings) -> dict[str, Any]:
Contributor

This allows for runtime updates, where supported.
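
A rough sketch of how a runtime update with reconnect support might behave, assuming the returned dict holds the fields that actually changed and that model and language are fixed per connection. Both are assumptions on my part, and `STTSettingsSketch` and `STTServiceSketch` are hypothetical stand-ins rather than the classes in this diff.

```python
import asyncio
from dataclasses import dataclass, fields
from typing import Any, Optional


@dataclass
class STTSettingsSketch:
    """Hypothetical settings delta; None means "leave unchanged"."""

    model: Optional[str] = None
    language: Optional[str] = None


class STTServiceSketch:
    # Assumption: these fields are bound to the connection, so changing them needs a reconnect.
    RECONNECT_FIELDS = {"model", "language"}

    def __init__(self) -> None:
        self.settings = STTSettingsSketch(model="default-model", language="en")
        self.reconnects = 0

    async def _reconnect(self) -> None:
        # A real service would close and reopen its websocket here.
        self.reconnects += 1

    async def update_settings(self, delta: STTSettingsSketch) -> dict[str, Any]:
        changed: dict[str, Any] = {}
        for f in fields(delta):
            new = getattr(delta, f.name)
            if new is not None and new != getattr(self.settings, f.name):
                setattr(self.settings, f.name, new)
                changed[f.name] = new
        # Reconnect once, and only if a connection-bound field changed.
        if changed.keys() & self.RECONNECT_FIELDS:
            await self._reconnect()
        return changed
```

Calling `update_settings` twice with the same language reconnects only the first time, since the second call produces an empty delta.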

"""

_settings: TogetherTTSSettings
Settings = TogetherTTSSettings
Contributor

Same settings pattern here.

Comment thread src/pipecat/services/together/tts.py Outdated
logger.trace(f"{self}: flushing audio (context_id={context_id})")
await self._ws_send({"type": "input_text_buffer.commit"})
ctx_id = context_id or self._context_id
if not ctx_id or not self.audio_context_available(ctx_id):
Contributor

Lots of audio context changes, which are now required. I applied the latest pattern in this class.

Comment thread src/pipecat/services/together/stt.py
Comment thread src/pipecat/services/stt_latency.py Outdated
  SARVAM_TTFS_P99: float = 1.17
  SONIOX_TTFS_P99: float = 0.35
  SPEECHMATICS_TTFS_P99: float = 0.74
+ TOGETHER_TTFS_P99: float = 2.028
Contributor

I wonder if this value is so high because of the warm-up problem I think I observed in testing. SOTA services have a p50 of around 0.3 s and a p99 of around 0.4 s. Ideally, the p99 latency would be lower to remain competitive.
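
For anyone re-measuring this, a minimal sketch of deriving p50 and p99 from a list of latency samples with the standard library is below. `ttfs_percentiles` is a hypothetical helper, and how the samples are collected (and whether warm-up requests are excluded) is left open.

```python
import statistics


def ttfs_percentiles(samples: list[float]) -> dict[str, float]:
    """Return p50 and p99 for a list of latency samples, in the samples' own unit."""
    # quantiles(n=100) yields 99 cut points; index 49 is p50 and index 98 is p99.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {"p50": cuts[49], "p99": cuts[98]}
```

Comparing the result with and without the first few requests of a session would show how much of the 2.028 figure comes from warm-up.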

blainekasten and others added 2 commits April 1, 2026 18:15
Rename 07e-interruptible-together.py to voice-together.py, add
transcription-together.py, remove unused OpenAI import, and register
voice-together in release evals.
@markbackman markbackman force-pushed the add_together_stt_tts branch from fc7a19a to a2fffcb Compare April 1, 2026 22:22
Move model/language out of __init__ args into settings-based
configuration with default-then-apply-delta pattern. Add Settings
class attribute, language_to_service_language(), and _update_settings()
with reconnect support.
- Use settings-based configuration (remove model/voice/language from
  __init__ args)
- Enable push_stop_frames, push_start_frame, pause_frame_processing
  to let base class manage TTS frame lifecycle
- Use append_to_audio_context() and get_active_audio_context_id()
  instead of manual context tracking
- Fix flush_audio signature to match base class
- Add language_to_service_language() and _update_settings() with
  reconnect support
- Remove unused voice arg from example
)
headers = {
"Authorization": f"Bearer {self._api_key}",
"OpenAI-Beta": "realtime=v1",
Contributor

Is this required?
