Commit 429b3a8

Merge pull request #54 from willwade/feat/53-gemini-flash-tts
Add Gemini Flash TTS engine support
2 parents cfb656a + 1e0fb5d commit 429b3a8

11 files changed

Lines changed: 835 additions & 3 deletions

README.md

Lines changed: 43 additions & 1 deletion
@@ -44,6 +44,7 @@ A JavaScript/TypeScript library that provides a unified API for working with mul
 |--------------|------------|-------------|----------|-------------|
 | `azure` | `AzureTTSClient` | Both | Microsoft Azure Cognitive Services | `@azure/cognitiveservices-speechservices`, `microsoft-cognitiveservices-speech-sdk` |
 | `google` | `GoogleTTSClient` | Both | Google Cloud Text-to-Speech | `@google-cloud/text-to-speech` |
+| `gemini` | `GeminiTTSClient` | Both | Gemini Flash TTS | None (uses fetch API) |
 | `elevenlabs` | `ElevenLabsTTSClient` | Both | ElevenLabs | `node-fetch@2` (Node.js only) |
 | `watson` | `WatsonTTSClient` | Both | IBM Watson | None (uses fetch API) |
 | `openai` | `OpenAITTSClient` | Both | OpenAI | `openai` |
@@ -271,7 +272,7 @@ async function runExample() {
 runExample().catch(console.error);
 ```

-The factory supports all engines: `'azure'`, `'google'`, `'polly'`, `'elevenlabs'`, `'openai'`, `'modelslab'`, `'playht'`, `'watson'`, `'witai'`, `'sherpaonnx'`, `'sherpaonnx-wasm'`, `'espeak'`, `'espeak-wasm'`, `'sapi'`, `'cartesia'`, `'deepgram'`, `'hume'`, `'xai'`, `'fishaudio'`, `'mistral'`, `'murf'`, `'unrealspeech'`, `'resemble'`, etc.
+The factory supports all engines: `'azure'`, `'google'`, `'gemini'`, `'polly'`, `'elevenlabs'`, `'openai'`, `'modelslab'`, `'playht'`, `'watson'`, `'witai'`, `'sherpaonnx'`, `'sherpaonnx-wasm'`, `'espeak'`, `'espeak-wasm'`, `'sapi'`, `'cartesia'`, `'deepgram'`, `'hume'`, `'xai'`, `'fishaudio'`, `'mistral'`, `'murf'`, `'unrealspeech'`, `'resemble'`, etc.

 ## Core Functionality

@@ -492,6 +493,7 @@ The following engines **automatically strip SSML tags** and convert to plain tex
 - **Cartesia** - SSML tags removed; audio tags (`[laugh]`, `[sigh]`, etc.) mapped to `<emotion>` for sonic-3, stripped for others
 - **Deepgram** - SSML tags are removed, plain text is synthesized
 - **Hume** - SSML tags are removed, plain text is synthesized
+- **Gemini** - SSML tags are removed; Gemini audio tags are passed natively
 - **xAI** - SSML tags are removed; audio tags passed natively for grok-tts
 - **Fish Audio** - SSML tags removed; audio tags passed natively for s2-pro
 - **Mistral** - SSML tags are removed, plain text is synthesized
@@ -697,6 +699,7 @@ When disabled, js-tts-wrapper falls back to the lightweight built-in converter (
 | Cartesia | ✅ Converted | → SSML → Plain text |
 | Deepgram | ✅ Converted | → SSML → Plain text |
 | Hume | ✅ Converted | → SSML → Plain text |
+| Gemini | ✅ Converted | → SSML → Plain text |
 | xAI | ✅ Converted | → SSML → Plain text |
 | Fish Audio | ✅ Converted | → SSML → Plain text |
 | Mistral | ✅ Converted | → SSML → Plain text |
@@ -842,6 +845,44 @@ Notes:
 - For true timings, use service account credentials (Node) where the beta client can be used.
 - Environment variable supported by examples/tests: `GOOGLECLOUDTTS_API_KEY`.

+### Gemini Flash TTS
+
+Gemini Flash TTS uses the Gemini API, not Google Cloud Text-to-Speech. Configure `GEMINI_API_KEY` or pass `apiKey` directly.
+
+Enable the **Gemini API** (`generativelanguage.googleapis.com`) in your Google Cloud project. Google Cloud Text-to-Speech (`texttospeech.googleapis.com`) is not used by this engine.
+
+#### ESM
+```javascript
+import { GeminiTTSClient } from 'js-tts-wrapper';
+
+const tts = new GeminiTTSClient({
+  apiKey: process.env.GEMINI_API_KEY,
+  model: 'gemini-3.1-flash-tts-preview',
+  voice: 'Kore'
+});
+
+const audio = await tts.synthToBytes('Say cheerfully: Have a wonderful day!');
+```
+
+#### Factory
+```javascript
+import { createTTSClient } from 'js-tts-wrapper';
+
+const tts = createTTSClient('gemini', {
+  apiKey: process.env.GEMINI_API_KEY,
+  voice: 'Puck'
+});
+
+await tts.speak('[excitedly] Hello from Gemini Flash TTS!');
+```
+
+Notes:
+- Supported models: `gemini-3.1-flash-tts-preview` (default) and `gemini-2.5-flash-preview-tts`.
+- Supported voices: Zephyr, Puck, Charon, Kore, Fenrir, Leda, Orus, Aoede, Callirrhoe, Autonoe, Enceladus, Iapetus, Umbriel, Algieba, Despina, Erinome, Algenib, Rasalgethi, Laomedeia, Achernar, Alnilam, Schedar, Gacrux, Pulcherrima, Achird, Zubenelgenubi, Vindemiatrix, Sadachbia, Sadaltager, Sulafat.
+- Gemini TTS does not support SSML; SSML tags are stripped before synthesis.
+- Gemini TTS does not provide true streaming; `synthToBytestream()` wraps the completed audio bytes in a stream.
+- Output is WAV by default. Use `{ format: 'pcm' }` to return raw PCM.
+- Gemini audio tags can be included directly in text, such as `[whispers]`, `[laughs]`, or `[excitedly]`.

 ### AWS Polly

@@ -1441,6 +1482,7 @@ cd your-project
 # Install specific engine dependencies
 npx js-tts-wrapper@latest run install:azure
 npx js-tts-wrapper@latest run install:google
+npx js-tts-wrapper@latest run install:gemini # no additional dependencies
 npx js-tts-wrapper@latest run install:polly
 npx js-tts-wrapper@latest run install:openai
 npx js-tts-wrapper@latest run install:sherpaonnx
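
The README notes in this diff say that SSML tags are stripped while Gemini audio tags such as `[laughs]` pass through unchanged. A minimal sketch of that behavior follows; `stripSsmlKeepAudioTags` is a hypothetical helper written for illustration, not the library's actual `prepareText()` implementation, and it assumes that removing every angle-bracket tag is sufficient.

```typescript
// Hypothetical helper (an assumption, not code from this commit):
// removes XML-style SSML tags and leaves square-bracket audio tags untouched.
function stripSsmlKeepAudioTags(input: string): string {
  return input
    .replace(/<[^>]+>/g, " ") // drop tags such as <speak> or <break time="500ms"/>
    .replace(/\s+/g, " ") // collapse the whitespace left behind by removed tags
    .trim();
}

const prepared = stripSsmlKeepAudioTags(
  '<speak>Hello <break time="500ms"/> [laughs] world</speak>'
);
console.log(prepared); // Hello [laughs] world
```

Because the pattern only matches `<...>` sequences, bracketed tags like `[whispers]` or `[excitedly]` never match and reach the API as written.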

bin/cli.js

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -61,6 +61,7 @@ async function installEngine(engine) {
6161
const engineDeps = {
6262
azure: ["microsoft-cognitiveservices-speech-sdk"],
6363
google: ["@google-cloud/text-to-speech"],
64+
gemini: [],
6465
elevenlabs: ["node-fetch@2"],
6566
playht: ["node-fetch@2"],
6667
polly: ["@aws-sdk/client-polly"],
@@ -136,6 +137,7 @@ Commands:
 Available engines:
   azure Microsoft Azure TTS
   google Google Cloud TTS
+  gemini Gemini Flash TTS (direct REST, no dependencies)
   elevenlabs ElevenLabs TTS
   playht PlayHT TTS
   polly AWS Polly TTS
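
The CLI help describes the Gemini engine as "direct REST, no dependencies". The sketch below shows roughly what such a request could look like, using the URL suffix and `generationConfig` fields that the tests in this commit assert on. `buildGeminiTtsRequest`, the default `baseURL`, and sending the key in an `x-goog-api-key` header are assumptions for illustration; the engine in this commit may build the request differently.

```typescript
// Hypothetical request builder; names and defaults are illustrative assumptions.
interface GeminiTtsRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

function buildGeminiTtsRequest(
  text: string,
  apiKey: string,
  model = "gemini-3.1-flash-tts-preview",
  voice = "Kore",
  baseURL = "https://generativelanguage.googleapis.com/v1beta" // assumed default
): GeminiTtsRequest {
  return {
    url: `${baseURL}/models/${model}:generateContent`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json", "x-goog-api-key": apiKey },
      body: JSON.stringify({
        contents: [{ parts: [{ text }] }],
        generationConfig: {
          // Ask for audio output only, spoken with a prebuilt voice.
          responseModalities: ["AUDIO"],
          speechConfig: { voiceConfig: { prebuiltVoiceConfig: { voiceName: voice } } },
        },
      }),
    },
  };
}

const req = buildGeminiTtsRequest("Say cheerfully: Hello", "YOUR_API_KEY", undefined, "Puck");
console.log(req.url);
```

The result can be passed straight to `fetch(req.url, req.init)`; the base64 audio would then be read from `candidates[0].content.parts[0].inlineData.data`, the response shape mocked in the tests.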

package.json

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -60,6 +60,7 @@
6060
"test:azure": "node run-tts-tests.cjs azure",
6161
"test:elevenlabs": "node run-tts-tests.cjs elevenlabs",
6262
"test:google": "node run-tts-tests.cjs google",
63+
"test:gemini": "node run-tts-tests.cjs gemini",
6364
"test:polly": "node run-tts-tests.cjs polly",
6465
"test:openai": "node run-tts-tests.cjs openai",
6566
"test:playht": "node run-tts-tests.cjs playht",
@@ -99,6 +100,7 @@
     "install:deps": "echo 'Use npm install js-tts-wrapper[engine] instead. For example: npm install js-tts-wrapper[azure]'",
     "install:azure": "npm install microsoft-cognitiveservices-speech-sdk",
     "install:google": "npm install @google-cloud/text-to-speech",
+    "install:gemini": "echo 'Gemini TTS uses direct REST API calls; no additional dependencies required.'",
     "install:polly": "npm install @aws-sdk/client-polly",
     "install:openai": "npm install openai",
     "install:elevenlabs": "npm install @elevenlabs/elevenlabs-js",
@@ -118,6 +120,7 @@
     "text-to-speech",
     "azure",
     "google",
+    "gemini",
     "polly",
     "elevenlabs",
     "ibm",
@@ -247,6 +250,7 @@
     "google": {
       "@google-cloud/text-to-speech": "^6.4.0"
     },
+    "gemini": {},
     "elevenlabs": {
       "@elevenlabs/elevenlabs-js": "^2.32.0"
     },

run-tts-tests.cjs

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -20,14 +20,15 @@ const engineName = process.argv[2];
2020

2121
if (!engineName) {
2222
console.error('Usage: node run-tts-tests.cjs <engine-name>');
23-
console.error('Available engines: azure, google, polly, openai, elevenlabs, playht, upliftai, sherpaonnx, sherpaonnx-wasm, sapi, espeak, system');
23+
console.error('Available engines: azure, google, gemini, polly, openai, elevenlabs, playht, upliftai, sherpaonnx, sherpaonnx-wasm, sapi, espeak, system');
2424
process.exit(1);
2525
}
2626

2727
// Map engine names to test patterns
2828
const engineTestPatterns = {
2929
'azure': 'azure',
3030
'google': 'google',
31+
'gemini': 'gemini',
3132
'polly': 'polly',
3233
'openai': 'openai',
3334
'elevenlabs': 'elevenlabs',

src/__tests__/gemini.test.ts

Lines changed: 232 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,232 @@
1+
import { afterEach, beforeEach, describe, expect, it, jest } from "@jest/globals";
2+
import { GeminiTTSClient } from "../engines/gemini";
3+
import { createBrowserTTSClient } from "../factory-browser";
4+
import { createTTSClient } from "../factory";
5+
6+
const originalFetch = globalThis.fetch;
7+
8+
function response(body: any, init: { ok?: boolean; status?: number; statusText?: string } = {}) {
9+
return {
10+
ok: init.ok ?? true,
11+
status: init.status ?? 200,
12+
statusText: init.statusText ?? "OK",
13+
headers: {} as Headers,
14+
body: null as any,
15+
json: async () => body,
16+
text: async () => (typeof body === "string" ? body : JSON.stringify(body)),
17+
arrayBuffer: async () => new ArrayBuffer(0),
18+
};
19+
}
20+
21+
function audioResponse(base64Audio: string) {
22+
return response({
23+
candidates: [
24+
{
25+
content: {
26+
parts: [
27+
{
28+
inlineData: {
29+
data: base64Audio,
30+
},
31+
},
32+
],
33+
},
34+
},
35+
],
36+
});
37+
}
38+
39+
function b64(bytes: number[]): string {
40+
return Buffer.from(new Uint8Array(bytes)).toString("base64");
41+
}
42+
43+
describe("GeminiTTSClient", () => {
44+
let client: GeminiTTSClient;
45+
46+
beforeEach(() => {
47+
client = new GeminiTTSClient({ apiKey: "test-api-key" });
48+
});
49+
50+
afterEach(() => {
51+
globalThis.fetch = originalFetch;
52+
jest.restoreAllMocks();
53+
});
54+
55+
it("initializes with default values", () => {
56+
expect(client.getProperty("model")).toBe("gemini-3.1-flash-tts-preview");
57+
expect(client.getProperty("voice")).toBe("Kore");
58+
});
59+
60+
it("initializes with custom model and voice", () => {
61+
const c = new GeminiTTSClient({
62+
apiKey: "test",
63+
model: "gemini-2.5-flash-preview-tts",
64+
voice: "Puck",
65+
});
66+
67+
expect(c.getProperty("model")).toBe("gemini-2.5-flash-preview-tts");
68+
expect(c.getProperty("voice")).toBe("Puck");
69+
});
70+
71+
it("initializes with properties object", () => {
72+
const c = new GeminiTTSClient({
73+
apiKey: "test",
74+
properties: { model: "gemini-2.5-flash-preview-tts", voice: "Zephyr" },
75+
});
76+
77+
expect(c.getProperty("model")).toBe("gemini-2.5-flash-preview-tts");
78+
expect(c.getProperty("voice")).toBe("Zephyr");
79+
});
80+
81+
it("initializes with propertiesJson string", () => {
82+
const c = new GeminiTTSClient({
83+
apiKey: "test",
84+
propertiesJson: JSON.stringify({ voice: "Sulafat" }),
85+
});
86+
87+
expect(c.getProperty("voice")).toBe("Sulafat");
88+
});
89+
90+
it("sets and gets model, voice, and baseURL", () => {
91+
client.setProperty("model", "gemini-2.5-flash-preview-tts");
92+
client.setProperty("voice", "Puck");
93+
client.setProperty("baseURL", "https://example.test/v1beta");
94+
95+
expect(client.getProperty("model")).toBe("gemini-2.5-flash-preview-tts");
96+
expect(client.getProperty("voice")).toBe("Puck");
97+
expect(client.getProperty("baseURL")).toBe("https://example.test/v1beta");
98+
});
99+
100+
it("requires apiKey credential", () => {
101+
expect((client as any).getRequiredCredentials()).toEqual(["apiKey"]);
102+
});
103+
104+
it("returns false for checkCredentials without api key", async () => {
105+
expect(await new GeminiTTSClient({}).checkCredentials()).toBe(false);
106+
});
107+
108+
it("checks credentials against model list", async () => {
109+
globalThis.fetch = jest.fn(async () =>
110+
response({
111+
models: [{ name: "models/gemini-3.1-flash-tts-preview" }],
112+
})
113+
) as any;
114+
115+
expect(await client.checkCredentials()).toBe(true);
116+
});
117+
118+
it("gets static voices", async () => {
119+
const voices = await client.getVoices();
120+
121+
expect(voices).toHaveLength(30);
122+
expect(voices[0]).toHaveProperty("id", "Zephyr");
123+
expect(voices[0]).toHaveProperty("provider", "gemini");
124+
});
125+
126+
it("filters voices by supported languages", async () => {
127+
expect((await client.getVoicesByLanguage("en")).length).toBeGreaterThan(0);
128+
expect((await client.getVoicesByLanguage("fr")).length).toBeGreaterThan(0);
129+
});
130+
131+
it("creates via node and browser factories", () => {
132+
expect(createTTSClient("gemini", { apiKey: "test" })).toBeInstanceOf(GeminiTTSClient);
133+
expect(createBrowserTTSClient("gemini", { apiKey: "test" })).toBeInstanceOf(GeminiTTSClient);
134+
});
135+
136+
it("strips SSML while preserving Gemini audio tags", async () => {
137+
const result = await (client as any).prepareText(
138+
"<speak>Hello <break time=\"500ms\"/> [laughs] world</speak>"
139+
);
140+
141+
expect(result).not.toContain("<speak>");
142+
expect(result).not.toContain("<break");
143+
expect(result).toContain("[laughs]");
144+
});
145+
146+
it("returns WAV bytes by default and sends the Gemini request shape", async () => {
147+
const pcm = b64([0, 0, 1, 0]);
148+
globalThis.fetch = jest.fn(async () => audioResponse(pcm)) as any;
149+
150+
const bytes = await client.synthToBytes("Say cheerfully: Hello", { voice: "Puck" });
151+
const request = JSON.parse(((globalThis.fetch as any).mock.calls[0][1] as any).body);
152+
153+
expect(String.fromCharCode(bytes[0], bytes[1], bytes[2], bytes[3])).toBe("RIFF");
154+
expect((globalThis.fetch as any).mock.calls[0][0]).toContain(
155+
"/models/gemini-3.1-flash-tts-preview:generateContent"
156+
);
157+
expect(request.generationConfig.responseModalities).toEqual(["AUDIO"]);
158+
expect(
159+
request.generationConfig.speechConfig.voiceConfig.prebuiltVoiceConfig.voiceName
160+
).toBe("Puck");
161+
});
162+
163+
it("returns raw PCM when requested", async () => {
164+
globalThis.fetch = jest.fn(async () => audioResponse(b64([0, 0, 1, 0]))) as any;
165+
166+
const bytes = await client.synthToBytes("Hello", { format: "pcm" });
167+
168+
expect(Array.from(bytes)).toEqual([0, 0, 1, 0]);
169+
});
170+
171+
it("uses selected model in request URL", async () => {
172+
globalThis.fetch = jest.fn(async () => audioResponse(b64([0, 0]))) as any;
173+
174+
await client.synthToBytes("Hello", { model: "gemini-2.5-flash-preview-tts" });
175+
176+
expect((globalThis.fetch as any).mock.calls[0][0]).toContain(
177+
"/models/gemini-2.5-flash-preview-tts:generateContent"
178+
);
179+
});
180+
181+
it("throws useful error for HTTP failures", async () => {
182+
globalThis.fetch = jest.fn(async () =>
183+
response("bad request", { ok: false, status: 400, statusText: "Bad Request" })
184+
) as any;
185+
186+
await expect(client.synthToBytes("Hello")).rejects.toThrow(
187+
"Gemini TTS API error: 400 Bad Request"
188+
);
189+
});
190+
191+
it("throws useful error for missing audio data", async () => {
192+
globalThis.fetch = jest.fn(async () =>
193+
response({
194+
candidates: [
195+
{
196+
finishReason: "STOP",
197+
content: { parts: [{ text: "not audio" }] },
198+
},
199+
],
200+
})
201+
) as any;
202+
203+
await expect(client.synthToBytes("Hello")).rejects.toThrow(
204+
"Gemini TTS response did not include audio data"
205+
);
206+
});
207+
208+
it("wraps synthesized bytes in a stream and returns estimated word boundaries", async () => {
209+
globalThis.fetch = jest.fn(async () => audioResponse(b64([0, 0, 1, 0]))) as any;
210+
211+
const result = await client.synthToBytestream("Hello world", { useWordBoundary: true });
212+
const reader = result.audioStream.getReader();
213+
const chunk = await reader.read();
214+
215+
expect(chunk.done).toBe(false);
216+
expect(chunk.value?.length).toBeGreaterThan(0);
217+
expect(result.wordBoundaries).toHaveLength(2);
218+
});
219+
220+
it("provides credential status", async () => {
221+
globalThis.fetch = jest.fn(async () =>
222+
response({
223+
models: [{ name: "models/gemini-3.1-flash-tts-preview" }],
224+
})
225+
) as any;
226+
227+
const status = await client.getCredentialStatus();
228+
229+
expect(status.engine).toBe("gemini");
230+
expect(status.requiresCredentials).toBe(true);
231+
});
232+
});
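
The tests above expect `synthToBytes()` to return bytes beginning with `RIFF` by default and the untouched PCM when `format: "pcm"` is requested. One way to get there is to prepend a 44-byte WAV header to the decoded PCM, sketched below. `pcmToWav` is a hypothetical helper, and the 24 kHz, mono, 16-bit defaults are assumptions about the audio format rather than values taken from this commit.

```typescript
// Hypothetical helper: wraps raw PCM in a minimal 44-byte RIFF/WAVE header.
// The sample rate, channel count, and bit depth defaults are assumptions.
function pcmToWav(
  pcm: Uint8Array,
  sampleRate = 24000,
  channels = 1,
  bitsPerSample = 16
): Uint8Array {
  const blockAlign = (channels * bitsPerSample) / 8; // bytes per sample frame
  const byteRate = sampleRate * blockAlign; // bytes per second of audio
  const out = new Uint8Array(44 + pcm.length);
  const view = new DataView(out.buffer);
  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) view.setUint8(offset + i, text.charCodeAt(i));
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + pcm.length, true); // size of everything after this field
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // fmt chunk size for PCM
  view.setUint16(20, 1, true); // audio format 1 = uncompressed PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, byteRate, true);
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitsPerSample, true);
  writeAscii(36, "data");
  view.setUint32(40, pcm.length, true); // size of the sample data
  out.set(pcm, 44); // samples follow the header unchanged
  return out;
}

const wav = pcmToWav(new Uint8Array([0, 0, 1, 0]));
const magic = String.fromCharCode(wav[0], wav[1], wav[2], wav[3]);
console.log(magic, wav.length); // RIFF 48
```

Returning the input unchanged for the `pcm` format and calling a helper like this otherwise would satisfy both the "returns WAV bytes by default" and "returns raw PCM when requested" cases.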

src/browser.ts

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -13,6 +13,7 @@ export { DeepgramTTSClient } from "./engines/deepgram";
1313
export { ElevenLabsTTSClient } from "./engines/elevenlabs";
1414
export { EspeakBrowserTTSClient } from "./engines/espeak-wasm";
1515
export { FishAudioTTSClient } from "./engines/fishaudio";
16+
export { GeminiTTSClient } from "./engines/gemini";
1617
export { GoogleTTSClient } from "./engines/google";
1718
export { HumeTTSClient } from "./engines/hume";
1819
export { MistralTTSClient } from "./engines/mistral";
