Reference: speech-sdk (@speech-sdk/core)
- Cartesia engine (`sonic-3`, `sonic-2`) with audio tag / emotion-to-SSML support
- Deepgram engine (`aura-2`) with static voice list
- ElevenLabs v3 audio tag passthrough (`[laugh]`, `[sigh]`, etc.)
- Generic property pass-through via `properties` / `propertiesJson`
- Hume engine (`octave-2`, `octave-1`) with streaming via separate `/tts/stream/file` endpoint
- xAI engine (`grok-tts`) with native audio tag passthrough, language config
- Fish Audio engine (`s2-pro`) with audio tag passthrough, model-as-header pattern
- Mistral engine (`voxtral-mini-tts-2603`) with SSE streaming, base64 chunk parsing
- Murf engine (`GEN2`, `FALCON`) with dual model/endpoints, base64 GEN2 / binary FALCON
- Unreal Speech engine with two-step URI non-streaming, direct streaming
- Resemble engine with base64 JSON non-streaming, direct streaming

| Engine | Models | Key Features | Notes |
|---|---|---|---|
| fal | f5-tts, kokoro, dia-tts, orpheus-tts, index-tts-2 | Voice cloning, open-source | No streaming, many sub-models |
| Google Gemini TTS | gemini-2.5-flash-preview-tts, gemini-2.5-pro-preview-tts | Pseudo-streaming, 23 languages | Different from existing Google Cloud TTS |

Unified [tag] syntax mapped to provider-specific representations:
- ElevenLabs v3: native passthrough (done)
- Cartesia sonic-3: emotions to `<emotion value="..."/>` SSML (done)
- OpenAI gpt-4o-mini-tts: tags to natural language `instructions`
- xAI grok-tts: native passthrough
- Fish Audio s2-pro: native passthrough
- All others: strip tags with warnings
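
The mapping above could be sketched roughly as follows. This is only an illustration: `TagStrategy`, `TagResult`, `applyAudioTags`, the tag regex, and the wording of the generated instructions are all hypothetical names and choices, not existing APIs. It also treats every tag as an emotion for the Cartesia case, which is a simplification.

```typescript
// Hypothetical strategy names; the real engine identifiers may differ.
type TagStrategy = "passthrough" | "cartesia-ssml" | "openai-instructions" | "strip";

interface TagResult {
  text: string;          // text to send to the provider
  instructions?: string; // natural-language hint (OpenAI-style strategy only)
  warnings: string[];    // one entry per tag that was dropped
}

// Matches tags such as [laugh] or [sigh]; the exact tag grammar is an assumption.
const TAG_PATTERN = /\[([a-z][a-z -]*)\]/gi;

function collapseSpaces(text: string): string {
  return text.replace(/\s{2,}/g, " ").trim();
}

function applyAudioTags(input: string, strategy: TagStrategy): TagResult {
  if (strategy === "passthrough") {
    // Provider understands [tag] natively, so nothing changes.
    return { text: input, warnings: [] };
  }
  if (strategy === "cartesia-ssml") {
    // Simplification: every tag becomes an <emotion/> element.
    const ssml = input.replace(TAG_PATTERN, (_match, tag: string) => `<emotion value="${tag}"/>`);
    return { text: ssml, warnings: [] };
  }

  // Remaining strategies remove the tags from the spoken text.
  const tags: string[] = [];
  const stripped = collapseSpaces(
    input.replace(TAG_PATTERN, (_match, tag: string) => {
      tags.push(tag);
      return "";
    })
  );

  if (strategy === "openai-instructions") {
    const result: TagResult = { text: stripped, warnings: [] };
    if (tags.length > 0) {
      result.instructions = `Deliver the line with: ${tags.join(", ")}.`;
    }
    return result;
  }

  // "strip": drop the tags and report each one.
  return {
    text: stripped,
    warnings: tags.map((tag) => `Audio tag [${tag}] is not supported and was removed`),
  };
}
```

Keeping the strategy as a plain union makes it cheap to attach to per-model metadata later, so an engine only has to declare which representation it expects.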
Add per-model capability metadata (from speech-sdk pattern):
- `streaming`: supports real-time audio streaming
- `audio-tags`: supports `[tag]` syntax
- `inline-voice-cloning`: accepts reference audio inline
- `open-source`: model is open source

Enables runtime capability checks via hasFeature().
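
A minimal sketch of what such metadata and a `hasFeature()` check might look like. The `ModelFeature` and `ModelInfo` types, the `MODELS` registry, and the `(modelId, feature)` signature are assumptions made for illustration; only the four feature names and the function name come from the notes above.

```typescript
type ModelFeature = "streaming" | "audio-tags" | "inline-voice-cloning" | "open-source";

interface ModelInfo {
  id: string;
  features: readonly ModelFeature[];
}

// Illustrative registry (hypothetical). The feature sets follow what these
// notes state about each model; a real registry would cover every engine.
const MODELS: readonly ModelInfo[] = [
  { id: "sonic-3", features: ["streaming", "audio-tags", "inline-voice-cloning"] },
  { id: "aura-2", features: ["streaming"] },
  { id: "f5-tts", features: ["inline-voice-cloning", "open-source"] },
];

// Returns false for unknown models instead of throwing, so callers can
// use it as a simple guard before enabling an optional code path.
function hasFeature(modelId: string, feature: ModelFeature): boolean {
  return MODELS.some((m) => m.id === modelId && m.features.indexOf(feature) !== -1);
}
```

A caller could then branch on `hasFeature("sonic-3", "audio-tags")` before deciding whether to pass tags through or strip them.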
Current: engine-specific voice IDs
Proposed: string | { url: string } | { audio: string | Uint8Array }
- `string`: standard voice ID
- `{ url }`: voice cloning from URL
- `{ audio }`: voice cloning from inline audio

Providers that support inline voice cloning:
- Cartesia sonic-3
- Hume octave-2
- Fish Audio s2-pro
- Resemble
- Mistral voxtral-mini-tts-2603
- fal (f5-tts, dia-tts, index-tts-2)
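
The proposed voice type can be narrowed with ordinary TypeScript type guards. The sketch below assumes a hypothetical `resolveVoice` helper and `ResolvedVoice` shape; only the `VoiceInput` union itself is taken from the proposal.

```typescript
// The union proposed above.
type VoiceInput = string | { url: string } | { audio: string | Uint8Array };

// Hypothetical normalized form an engine could switch on.
type ResolvedVoice =
  | { kind: "id"; voiceId: string }
  | { kind: "clone-url"; url: string }
  | { kind: "clone-audio"; audio: string | Uint8Array };

function resolveVoice(voice: VoiceInput): ResolvedVoice {
  if (typeof voice === "string") {
    // Plain string: a standard, engine-specific voice ID.
    return { kind: "id", voiceId: voice };
  }
  if ("url" in voice) {
    // Reference audio hosted at a URL.
    return { kind: "clone-url", url: voice.url };
  }
  // Reference audio supplied inline (encoded string or raw bytes).
  return { kind: "clone-audio", audio: voice.audio };
}
```

An engine that does not support inline cloning could reject the `clone-audio` case up front rather than failing at the provider.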
- Cartesia: true streaming (already pipes response.body)
- Deepgram: true streaming (already pipes response.body)
- ElevenLabs: true streaming (fixed — pipes response.body when not using timestamps)
- Polly: true streaming for MP3/OGG (already pipes AudioStream; WAV requires buffering for header)
- Standardize `synthToBytestream` to return actual streaming responses where supported
- Google Cloud TTS: SDK returns all audio at once; would need StreamingSynthesize beta API
- Google Gemini TTS: pseudo-streaming via SSE base64 chunks (new engine, not yet implemented)
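
For the SSE-based engines, pseudo-streaming comes down to decoding each base64 payload as it arrives. The sketch below shows that decoding step on an already-received SSE text. The event shape (`{ "audio": "<base64>" }`) and the `[DONE]` sentinel are assumptions for illustration, and real provider payloads will differ. A small hand-written base64 decoder is used only to keep the example free of runtime-specific APIs.

```typescript
const B64_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Decodes standard base64 (with or without padding) into raw bytes.
function decodeBase64(input: string): Uint8Array {
  const clean = input.replace(/=+$/, "");
  const out = new Uint8Array(Math.floor((clean.length * 6) / 8));
  let buffer = 0;
  let bits = 0;
  let index = 0;
  for (let i = 0; i < clean.length; i++) {
    const value = B64_ALPHABET.indexOf(clean.charAt(i));
    if (value < 0) throw new Error(`Invalid base64 character at position ${i}`);
    // Keep only the low 16 bits; at most 13 bits are ever pending.
    buffer = ((buffer << 6) | value) & 0xffff;
    bits += 6;
    if (bits >= 8) {
      bits -= 8;
      out[index++] = (buffer >> bits) & 0xff;
    }
  }
  return out;
}

// Extracts audio chunks from SSE text. Assumes each event is a single
// `data:` line holding JSON with a base64 `audio` field (hypothetical shape).
function parseSseAudioChunks(sseText: string): Uint8Array[] {
  const chunks: Uint8Array[] = [];
  for (const line of sseText.split(/\r?\n/)) {
    if (line.indexOf("data:") !== 0) continue; // skip comments, blank lines, other fields
    const payload = line.slice(5).trim();
    if (payload === "" || payload === "[DONE]") continue; // assumed end-of-stream sentinel
    const event = JSON.parse(payload) as { audio?: string };
    if (typeof event.audio === "string") {
      chunks.push(decodeBase64(event.audio));
    }
  }
  return chunks;
}
```

In a real engine the same per-line logic would run incrementally over `response.body`, enqueueing each decoded chunk as soon as its event is complete.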
From speech-sdk pattern — add per-provider subpath exports in package.json:
```json
{
  "exports": {
    ".": "./dist/esm/index.js",
    "./cartesia": "./dist/esm/engines/cartesia.js",
    "./deepgram": "./dist/esm/engines/deepgram.js"
  }
}
```

Standardize errors across engines with rich context (statusCode, model, responseBody).
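
One possible shape for such a shared error type is sketched below. The class name `SpeechEngineError`, the `engine` field, and the `fromHttpFailure` helper are hypothetical; only `statusCode`, `model`, and `responseBody` come from the note above.

```typescript
interface SpeechErrorContext {
  engine: string;        // hypothetical extra field identifying the provider
  model?: string;
  statusCode?: number;
  responseBody?: string;
}

class SpeechEngineError extends Error {
  readonly engine: string;
  readonly model?: string;
  readonly statusCode?: number;
  readonly responseBody?: string;

  constructor(message: string, context: SpeechErrorContext) {
    super(message);
    // Restore the prototype chain so `instanceof` works on older compile targets.
    Object.setPrototypeOf(this, SpeechEngineError.prototype);
    this.name = "SpeechEngineError";
    this.engine = context.engine;
    this.model = context.model;
    this.statusCode = context.statusCode;
    this.responseBody = context.responseBody;
  }
}

// Builds a consistent message from an HTTP failure (hypothetical helper).
function fromHttpFailure(context: SpeechErrorContext): SpeechEngineError {
  const target = context.model ? `${context.engine}/${context.model}` : context.engine;
  const status = context.statusCode !== undefined ? ` (HTTP ${context.statusCode})` : "";
  return new SpeechEngineError(`Request to ${target} failed${status}`, context);
}
```

Callers can then catch a single error type and inspect `statusCode` or `responseBody` without knowing which engine produced the failure.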
| Engine | Update Needed |
|---|---|
| OpenAI | Add gpt-4o-mini-tts model with instructions/audio tag support |
| Google | Add Gemini-based TTS alongside existing Cloud TTS |
| ElevenLabs | Close issue #24 (already fixed) |