Commit 472cf60

Merge pull request #31 from willwade/feat/new-tts-engines
Feat/new tts engines
2 parents b6c9a20 + ef9d6b8 commit 472cf60

64 files changed

Lines changed: 5831 additions & 958 deletions

BACKLOG.md

Lines changed: 101 additions & 0 deletions
@@ -0,0 +1,101 @@
+# js-tts-wrapper Engine & Feature Backlog
+
+Reference: [speech-sdk](https://github.com/Jellypod-Inc/speech-sdk) (`@speech-sdk/core`)
+
+## Completed
+
+- [x] Cartesia engine (`sonic-3`, `sonic-2`) with audio tag / emotion-to-SSML support
+- [x] Deepgram engine (`aura-2`) with static voice list
+- [x] ElevenLabs v3 audio tag passthrough (`[laugh]`, `[sigh]`, etc.)
+- [x] Generic property pass-through via `properties` / `propertiesJson`
+- [x] Hume engine (`octave-2`, `octave-1`) with streaming via separate `/tts/stream/file` endpoint
+- [x] xAI engine (`grok-tts`) with native audio tag passthrough, language config
+- [x] Fish Audio engine (`s2-pro`) with audio tag passthrough, model-as-header pattern
+- [x] Mistral engine (`voxtral-mini-tts-2603`) with SSE streaming, base64 chunk parsing
+- [x] Murf engine (`GEN2`, `FALCON`) with dual model/endpoints, base64 GEN2 / binary FALCON
+- [x] Unreal Speech engine with two-step URI non-streaming, direct streaming
+- [x] Resemble engine with base64 JSON non-streaming, direct streaming
+
+## New Engines to Add
+
+### Lower Priority (Open-Source / Niche)
+
+| Engine | Models | Key Features | Notes |
+|--------|--------|-------------|-------|
+| **fal** | `f5-tts`, `kokoro`, `dia-tts`, `orpheus-tts`, `index-tts-2` | Voice cloning, open-source | No streaming, many sub-models |
+| **Google Gemini TTS** | `gemini-2.5-flash-preview-tts`, `gemini-2.5-pro-preview-tts` | Pseudo-streaming, 23 languages | Different from existing Google Cloud TTS |
+
+## Cross-Cutting Features
+
+### Audio Tags (Cross-Provider Abstraction)
+
+Unified `[tag]` syntax mapped to provider-specific representations:
+- **ElevenLabs v3** — native passthrough (done)
+- **Cartesia sonic-3** — emotions to `<emotion value="..."/>` SSML (done)
+- **OpenAI gpt-4o-mini-tts** — tags to natural language `instructions`
+- **xAI grok-tts** — native passthrough
+- **Fish Audio s2-pro** — native passthrough
+- **All others** — strip tags with warnings
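
The per-provider mapping listed above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the provider keys, the `applyAudioTags` name, and the tag-to-emotion table are all hypothetical, and the OpenAI `instructions` branch is left out for brevity.

```javascript
// Sketch only: provider keys and the emotion table are illustrative assumptions.
const NATIVE_TAG_PROVIDERS = new Set(['elevenlabs-v3', 'xai-grok-tts', 'fishaudio-s2-pro']);

// Hypothetical mapping from audio tags to Cartesia emotion values.
const CARTESIA_EMOTIONS = { laugh: 'happy', sigh: 'sad' };

// Matches bracketed tags such as [laugh] or [sigh].
const TAG_PATTERN = /\[([a-z][a-z -]*)\]/gi;

// Collapse the double spaces that removing a tag can leave behind.
function tidy(text) {
  return text.replace(/\s{2,}/g, ' ').trim();
}

function applyAudioTags(text, provider, warn = console.warn) {
  if (NATIVE_TAG_PROVIDERS.has(provider)) {
    return text; // native passthrough: the provider understands [tag] itself
  }
  if (provider === 'cartesia-sonic-3') {
    // Known tags become <emotion value="..."/>; unknown tags are dropped.
    return tidy(text.replace(TAG_PATTERN, (match, tag) => {
      const emotion = CARTESIA_EMOTIONS[tag.toLowerCase()];
      return emotion ? `<emotion value="${emotion}"/>` : '';
    }));
  }
  // All other providers: strip each tag and emit a warning.
  return tidy(text.replace(TAG_PATTERN, (match) => {
    warn(`Audio tag ${match} is not supported by ${provider}; stripping it.`);
    return '';
  }));
}
```

Passing `warn` as a parameter keeps the function easy to test and lets callers route warnings to their own logger.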
+
+### Model-Level Feature Declarations
+
+Add per-model capability metadata (from speech-sdk pattern):
+- `streaming` — supports real-time audio streaming
+- `audio-tags` — supports `[tag]` syntax
+- `inline-voice-cloning` — accepts reference audio inline
+- `open-source` — model is open source
+
+Enables runtime capability checks via `hasFeature()`.
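
A runtime check of this kind might look like the sketch below. The backlog only names `hasFeature()`; its signature, the `MODEL_FEATURES` table, and the feature assignments shown here are assumptions for illustration (the entries loosely follow the lists elsewhere in this backlog).

```javascript
// Hypothetical capability table keyed by model ID. Assignments are illustrative.
const MODEL_FEATURES = new Map([
  ['sonic-3', ['streaming', 'audio-tags', 'inline-voice-cloning']],
  ['aura-2', ['streaming']],
  ['f5-tts', ['inline-voice-cloning', 'open-source']],
]);

// Assumed signature: returns true only when the model is known and declares the feature.
function hasFeature(modelId, feature) {
  const features = MODEL_FEATURES.get(modelId);
  return features !== undefined && features.includes(feature);
}

// Example: choose a code path based on declared capabilities.
function pickSynthesisMode(modelId) {
  return hasFeature(modelId, 'streaming') ? 'stream' : 'buffered';
}
```

Unknown models return `false` for every feature, so callers fall back to the most conservative behaviour by default.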
+
+### Unified Voice Type
+
+Current: engine-specific voice IDs
+Proposed: `string | { url: string } | { audio: string | Uint8Array }`
+- `string` — standard voice ID
+- `{ url }` — voice cloning from URL
+- `{ audio }` — voice cloning from inline audio
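
To show how an engine could branch on the proposed union, here is a small sketch. The `classifyVoice` helper and the `kind` labels are hypothetical; only the three input shapes come from the proposal above.

```javascript
/**
 * The proposed union, written as a JSDoc type.
 * @typedef {string | { url: string } | { audio: string | Uint8Array }} UnifiedVoice
 */

/**
 * Hypothetical helper: normalizes a voice value into a tagged object.
 * @param {UnifiedVoice} voice
 */
function classifyVoice(voice) {
  if (typeof voice === 'string') {
    return { kind: 'id', id: voice }; // standard voice ID
  }
  if (voice !== null && typeof voice === 'object') {
    if (typeof voice.url === 'string') {
      return { kind: 'clone-url', url: voice.url }; // cloning from a URL
    }
    if (typeof voice.audio === 'string' || voice.audio instanceof Uint8Array) {
      return { kind: 'clone-audio', audio: voice.audio }; // cloning from inline audio
    }
  }
  throw new TypeError('Unsupported voice value');
}
```

Throwing on anything outside the union surfaces malformed voice arguments early instead of sending them to a provider.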
+
+### Voice Cloning Support
+
+Providers that support inline voice cloning:
+- Cartesia sonic-3
+- Hume octave-2
+- Fish Audio s2-pro
+- Resemble
+- Mistral voxtral-mini-tts-2603
+- fal (f5-tts, dia-tts, index-tts-2)
+
+### Streaming Improvements
+
+- [x] Cartesia: true streaming (already pipes response.body)
+- [x] Deepgram: true streaming (already pipes response.body)
+- [x] ElevenLabs: true streaming (fixed — pipes response.body when not using timestamps)
+- [x] Polly: true streaming for MP3/OGG (already pipes AudioStream; WAV requires buffering for header)
+- [x] Standardize `synthToBytestream` to return actual streaming responses where supported
+- Google Cloud TTS: SDK returns all audio at once — would need StreamingSynthesize beta API
+- Google Gemini TTS: pseudo-streaming via SSE base64 chunks (new engine, not yet implemented)
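
The "SSE base64 chunks" pattern mentioned for Mistral and Gemini can be sketched as below. The event shape (`data: {"audio":"<base64>"}` plus an optional `data: [DONE]` sentinel) and the `decodeSseAudio` name are assumptions, not any provider's documented wire format. The sketch decodes an already buffered SSE body; real streaming would process lines as they arrive. It relies on the Node.js `Buffer` global.

```javascript
// Assumed event shape: one JSON object per `data:` line with a base64 `audio` field.
function decodeSseAudio(sseText) {
  const chunks = [];
  for (const line of sseText.split(/\r?\n/)) {
    if (!line.startsWith('data:')) continue; // skip comments, event names, blank lines
    const payload = line.slice('data:'.length).trim();
    if (payload === '' || payload === '[DONE]') continue; // hypothetical end marker
    const event = JSON.parse(payload);
    if (typeof event.audio === 'string') {
      chunks.push(Buffer.from(event.audio, 'base64')); // decode one audio chunk
    }
  }
  // Concatenate all decoded chunks into a single byte array.
  return new Uint8Array(Buffer.concat(chunks));
}
```

Because each chunk is complete base64 on its own, chunks can be decoded independently and appended in order.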
+
+### Tree-Shakeable Subpath Exports
+
+From speech-sdk pattern — add per-provider subpath exports in package.json:
+```json
+{
+  "exports": {
+    ".": "./dist/esm/index.js",
+    "./cartesia": "./dist/esm/engines/cartesia.js",
+    "./deepgram": "./dist/esm/engines/deepgram.js"
+  }
+}
+```
+
+### Unified Error Hierarchy
+
+Standardize errors across engines with rich context (statusCode, model, responseBody).
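
One possible shape for such a hierarchy is sketched here. The class names, the `errorFromStatus` helper, and the status-code mapping are hypothetical; only the three context fields (`statusCode`, `model`, `responseBody`) come from the backlog item.

```javascript
// Hypothetical base class carrying the context fields named in the backlog.
class TTSError extends Error {
  constructor(message, { statusCode, model, responseBody } = {}) {
    super(message);
    this.name = new.target.name; // subclasses report their own class name
    this.statusCode = statusCode;
    this.model = model;
    this.responseBody = responseBody;
  }
}

// Illustrative subclasses; a real hierarchy may differ.
class TTSAuthError extends TTSError {}
class TTSRateLimitError extends TTSError {}

// Maps an HTTP status to an error class (assumed mapping).
function errorFromStatus(statusCode, context = {}) {
  const details = { ...context, statusCode };
  if (statusCode === 401 || statusCode === 403) {
    return new TTSAuthError('Authentication failed', details);
  }
  if (statusCode === 429) {
    return new TTSRateLimitError('Rate limit exceeded', details);
  }
  return new TTSError(`Request failed with status ${statusCode}`, details);
}
```

Callers can then use `instanceof TTSError` to catch any engine failure, or a subclass to handle one case such as rate limiting.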
+
+## Existing Engine Updates Needed
+
+| Engine | Update Needed |
+|--------|--------------|
+| **OpenAI** | Add `gpt-4o-mini-tts` model with instructions/audio tag support |
+| **Google** | Add Gemini-based TTS alongside existing Cloud TTS |
+| **ElevenLabs** | Close issue #24 (already fixed) |

README.md

Lines changed: 143 additions & 1 deletion
@@ -57,6 +57,15 @@ A JavaScript/TypeScript library that provides a unified API for working with mul
 | `espeak-wasm` | `EspeakBrowserTTSClient` | Both | eSpeak NG | `mespeak` (Node.js) or meSpeak.js (browser) |
 | `sapi` | `SAPITTSClient` | Node.js | Windows Speech API (SAPI) | None (uses PowerShell) |
 | `witai` | `WitAITTSClient` | Both | Wit.ai | None (uses fetch API) |
+| `cartesia` | `CartesiaTTSClient` | Both | Cartesia | None (uses fetch API) |
+| `deepgram` | `DeepgramTTSClient` | Both | Deepgram | None (uses fetch API) |
+| `hume` | `HumeTTSClient` | Both | Hume AI | None (uses fetch API) |
+| `xai` | `XAITTSClient` | Both | xAI (Grok) | None (uses fetch API) |
+| `fishaudio` | `FishAudioTTSClient` | Both | Fish Audio | None (uses fetch API) |
+| `mistral` | `MistralTTSClient` | Both | Mistral AI | None (uses fetch API) |
+| `murf` | `MurfTTSClient` | Both | Murf AI | None (uses fetch API) |
+| `unrealspeech` | `UnrealSpeechTTSClient` | Both | Unreal Speech | None (uses fetch API) |
+| `resemble` | `ResembleTTSClient` | Both | Resemble AI | None (uses fetch API) |
 
 **Factory Name**: Use with `createTTSClient('factory-name', credentials)`
 **Class Name**: Use with direct import `import { ClassName } from 'js-tts-wrapper'`
@@ -90,6 +99,15 @@ A JavaScript/TypeScript library that provides a unified API for working with mul
 | **SherpaOnnx** || Estimated || Low |
 | **SherpaOnnx-WASM** || Estimated || Low |
 | **SAPI** || Estimated || Low |
+| **Cartesia** || Estimated || Low |
+| **Deepgram** || Estimated || Low |
+| **Hume** || Estimated || Low |
+| **xAI** || Estimated || Low |
+| **Fish Audio** || Estimated || Low |
+| **Mistral** || Estimated || Low |
+| **Murf** || Estimated || Low |
+| **Unreal Speech** || Estimated || Low |
+| **Resemble** || Estimated || Low |
 
 **Character-Level Timing**: Only ElevenLabs provides precise character-level timing data via the `/with-timestamps` endpoint, enabling the most accurate word highlighting and speech synchronization.
 
@@ -253,7 +271,7 @@ async function runExample() {
 runExample().catch(console.error);
 ```
 
-The factory supports all engines: `'azure'`, `'google'`, `'polly'`, `'elevenlabs'`, `'openai'`, `'modelslab'`, `'playht'`, `'watson'`, `'witai'`, `'sherpaonnx'`, `'sherpaonnx-wasm'`, `'espeak'`, `'espeak-wasm'`, `'sapi'`, etc.
+The factory supports all engines: `'azure'`, `'google'`, `'polly'`, `'elevenlabs'`, `'openai'`, `'modelslab'`, `'playht'`, `'watson'`, `'witai'`, `'sherpaonnx'`, `'sherpaonnx-wasm'`, `'espeak'`, `'espeak-wasm'`, `'sapi'`, `'cartesia'`, `'deepgram'`, `'hume'`, `'xai'`, `'fishaudio'`, `'mistral'`, `'murf'`, `'unrealspeech'`, `'resemble'`, etc.
 
 ## Core Functionality
 
@@ -471,6 +489,15 @@ The following engines **automatically strip SSML tags** and convert to plain tex
 - **PlayHT** - SSML tags are removed, plain text is synthesized
 - **ModelsLab** - SSML tags are removed, plain text is synthesized
 - **SherpaOnnx/SherpaOnnx-WASM** - SSML tags are removed, plain text is synthesized
+- **Cartesia** - SSML tags removed; audio tags (`[laugh]`, `[sigh]`, etc.) mapped to `<emotion>` for sonic-3, stripped for others
+- **Deepgram** - SSML tags are removed, plain text is synthesized
+- **Hume** - SSML tags are removed, plain text is synthesized
+- **xAI** - SSML tags are removed; audio tags passed natively for grok-tts
+- **Fish Audio** - SSML tags removed; audio tags passed natively for s2-pro
+- **Mistral** - SSML tags are removed, plain text is synthesized
+- **Murf** - SSML tags are removed, plain text is synthesized
+- **Unreal Speech** - SSML tags are removed, plain text is synthesized
+- **Resemble** - SSML tags are removed, plain text is synthesized
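
The "SSML tags are removed, plain text is synthesized" behaviour listed above can be illustrated with a simplified sketch. This is not the library's actual converter: `stripSsml` is a hypothetical name, and a regex-based approach like this ignores edge cases such as comments or CDATA sections.

```javascript
// Simplified, hypothetical SSML-to-text conversion.
function stripSsml(ssml) {
  return ssml
    // Replace every tag with a space so words on either side stay separated.
    // Trade-off: a tag in the middle of a word would split that word.
    .replace(/<[^>]+>/g, ' ')
    // Decode the five predefined XML entities; &amp; goes last to avoid double decoding.
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&apos;/g, "'")
    .replace(/&amp;/g, '&')
    // Collapse the whitespace left behind by removed tags.
    .replace(/\s+/g, ' ')
    .trim();
}
```

Note that bracketed audio tags such as `[laugh]` are untouched by this step, which is why engines with native tag support can still receive them.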
 
 ### Usage Examples
 
@@ -667,6 +694,15 @@ When disabled, js-tts-wrapper falls back to the lightweight built-in converter (
 | OpenAI | ✅ Converted | → SSML → Plain text |
 | PlayHT | ✅ Converted | → SSML → Plain text |
 | SherpaOnnx | ✅ Converted | → SSML → Plain text |
+| Cartesia | ✅ Converted | → SSML → Plain text |
+| Deepgram | ✅ Converted | → SSML → Plain text |
+| Hume | ✅ Converted | → SSML → Plain text |
+| xAI | ✅ Converted | → SSML → Plain text |
+| Fish Audio | ✅ Converted | → SSML → Plain text |
+| Mistral | ✅ Converted | → SSML → Plain text |
+| Murf | ✅ Converted | → SSML → Plain text |
+| Unreal Speech | ✅ Converted | → SSML → Plain text |
+| Resemble | ✅ Converted | → SSML → Plain text |
 
 ### Speech Markdown vs Raw SSML: When to Use Each
 
@@ -1069,6 +1105,112 @@ await tts.speak('Hello from Windows SAPI!');
 
 > **Note**: This engine is **Windows-only**
 
+### Cartesia
+
+```javascript
+import { CartesiaTTSClient } from 'js-tts-wrapper';
+
+const tts = new CartesiaTTSClient({ apiKey: 'your-api-key' });
+await tts.setVoice('sonic-3'); // or 'sonic-2'
+await tts.speak('Hello from Cartesia!');
+```
+
+> Audio tags like `[laugh]`, `[sigh]` are mapped to `<emotion>` SSML for sonic-3, stripped for other models.
+
+### Deepgram
+
+```javascript
+import { DeepgramTTSClient } from 'js-tts-wrapper';
+
+const tts = new DeepgramTTSClient({ apiKey: 'your-api-key' });
+await tts.setVoice('aura-2-asteria-en');
+await tts.speak('Hello from Deepgram!');
+```
+
+> Uses a static voice list. Model and voice are combined in the URL parameter.
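
As a rough illustration of "model and voice combined in the URL parameter", the sketch below builds a request URL from a combined ID such as `aura-2-asteria-en`. The endpoint URL and the `model` query parameter name are assumptions based on Deepgram's public REST API rather than anything shown in this commit, and `buildDeepgramSpeakUrl` is a hypothetical helper.

```javascript
// Assumption: endpoint and `model` parameter follow Deepgram's public REST API.
function buildDeepgramSpeakUrl(voiceId, baseUrl = 'https://api.deepgram.com/v1/speak') {
  const url = new URL(baseUrl);
  // The combined model-and-voice ID travels as a single query parameter.
  url.searchParams.set('model', voiceId);
  return url.toString();
}
```

Using `URL` and `searchParams` rather than string concatenation ensures that any unusual characters in the ID are percent-encoded.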
+
+### Hume AI
+
+```javascript
+import { HumeTTSClient } from 'js-tts-wrapper';
+
+const tts = new HumeTTSClient({ apiKey: 'your-api-key' });
+await tts.setVoice('ito'); // or any Hume voice name
+await tts.speak('Hello from Hume!');
+```
+
+> Supports `octave-2` and `octave-1` models. Streaming uses a separate `/tts/stream/file` endpoint.
+
+### xAI (Grok)
+
+```javascript
+import { XAITTSClient } from 'js-tts-wrapper';
+
+const tts = new XAITTSClient({ apiKey: 'your-api-key' });
+await tts.speak('Hello from xAI!');
+```
+
+> Native audio tag passthrough for grok-tts model. Language can be configured via properties.
+
+### Fish Audio
+
+```javascript
+import { FishAudioTTSClient } from 'js-tts-wrapper';
+
+const tts = new FishAudioTTSClient({ apiKey: 'your-api-key' });
+await tts.setVoice('your-voice-reference-id');
+await tts.speak('Hello from Fish Audio!');
+```
+
+> Model ID is passed as a header. Audio tags passed natively for s2-pro model.
+
+### Mistral
+
+```javascript
+import { MistralTTSClient } from 'js-tts-wrapper';
+
+const tts = new MistralTTSClient({ apiKey: 'your-api-key' });
+await tts.speak('Hello from Mistral!');
+```
+
+> Uses SSE streaming with base64 audio chunks. Non-streaming returns base64 JSON.
+
+### Murf
+
+```javascript
+import { MurfTTSClient } from 'js-tts-wrapper';
+
+const tts = new MurfTTSClient({ apiKey: 'your-api-key' });
+await tts.setVoice('en-US-natalie');
+await tts.speak('Hello from Murf!');
+```
+
+> Two models: GEN2 (base64 response) and FALCON (binary streaming). Uses static voice list.
+
+### Unreal Speech
+
+```javascript
+import { UnrealSpeechTTSClient } from 'js-tts-wrapper';
+
+const tts = new UnrealSpeechTTSClient({ apiKey: 'your-api-key' });
+await tts.setVoice('Scarlett');
+await tts.speak('Hello from Unreal Speech!');
+```
+
+> Non-streaming uses two-step URI-based flow. Streaming returns audio directly.
+
+### Resemble
+
+```javascript
+import { ResembleTTSClient } from 'js-tts-wrapper';
+
+const tts = new ResembleTTSClient({ apiKey: 'your-api-key' });
+await tts.setVoice('your-voice-id');
+await tts.speak('Hello from Resemble!');
+```
+
+> Non-streaming returns base64 JSON. Streaming returns raw binary audio.
+
 ## API Reference
 
 ### Factory Function
