
Commit 54bec55

feat!: text-to-speech x LLM integration (#936)
## Description

This pull request introduces a few changes to the Text-to-Speech module:

- Improved streaming mode by allowing an incrementally expanded text input. This change focuses on integrating T2S with text generation models (e.g. Llama 3.2).
- Added simple test cases for the T2S module.

### Introduces a breaking change?

- [x] Yes
- [ ] No

### Type of change

- [ ] Bug fix (change which fixes an issue)
- [x] New feature (change which adds functionality)
- [ ] Documentation update (improves or adds clarity to existing documentation)
- [x] Other (chores, tests, code style improvements etc.)

### Tested on

- [ ] iOS
- [x] Android

### Testing instructions

To test the Text-to-Speech module, run the set of tests for this module. To test the new streaming mode and its integration with text generation models, use the 'text-to-speech-llm' demo app.

### Related issues

#773 #897

### Checklist

- [x] I have performed a self-review of my code
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] I have updated the documentation accordingly
- [x] My changes generate no new warnings
1 parent effbfff commit 54bec55

File tree

20 files changed: +786 additions, -124 deletions


apps/speech/App.tsx

Lines changed: 12 additions & 1 deletion
```diff
@@ -5,6 +5,7 @@ import { SpeechToTextScreen } from './screens/SpeechToTextScreen';
 import ColorPalette from './colors';
 import ExecutorchLogo from './assets/executorch.svg';
 import { Quiz } from './screens/Quiz';
+import { TextToSpeechLLMScreen } from './screens/TextToSpeechLLMScreen';
 import { initExecutorch } from 'react-native-executorch';
 import { ExpoResourceFetcher } from '@react-native-executorch/expo-resource-fetcher';
@@ -14,7 +15,7 @@ initExecutorch({

 export default function App() {
   const [currentScreen, setCurrentScreen] = useState<
-    'menu' | 'speech-to-text' | 'text-to-speech' | 'quiz'
+    'menu' | 'speech-to-text' | 'text-to-speech' | 'quiz' | 'text-to-speech-llm'
   >('menu');

   const goToMenu = () => setCurrentScreen('menu');
@@ -31,6 +32,10 @@ export default function App() {
     return <Quiz onBack={goToMenu} />;
   }

+  if (currentScreen === 'text-to-speech-llm') {
+    return <TextToSpeechLLMScreen onBack={goToMenu} />;
+  }
+
   return (
     <View style={styles.container}>
       <ExecutorchLogo width={64} height={64} />
@@ -54,6 +59,12 @@ export default function App() {
       >
         <Text style={styles.buttonText}>Text to Speech - Quiz</Text>
       </TouchableOpacity>
+      <TouchableOpacity
+        style={styles.button}
+        onPress={() => setCurrentScreen('text-to-speech-llm')}
+      >
+        <Text style={styles.buttonText}>Text to Speech - LLM Streaming</Text>
+      </TouchableOpacity>
     </View>
   </View>
 );
```
apps/speech/screens/TextToSpeechLLMScreen.tsx

Lines changed: 323 additions & 0 deletions (new file)

```tsx
import React, { useEffect, useState, useRef } from 'react';
import {
  View,
  Text,
  StyleSheet,
  TouchableOpacity,
  ScrollView,
} from 'react-native';
import { SafeAreaProvider, SafeAreaView } from 'react-native-safe-area-context';
import FontAwesome from '@expo/vector-icons/FontAwesome';
import SWMIcon from '../assets/swm_icon.svg';
import {
  useLLM,
  useTextToSpeech,
  KOKORO_MEDIUM,
  KOKORO_VOICE_AF_HEART,
  LLAMA3_2_1B_QLORA,
} from 'react-native-executorch';
import {
  AudioManager,
  AudioContext,
  AudioBuffer,
  AudioBufferSourceNode,
} from 'react-native-audio-api';

interface TextToSpeechLLMProps {
  onBack: () => void;
}

/**
 * Converts an audio vector (Float32Array) to an AudioBuffer for playback
 * @param audioVector - The generated audio samples from the model
 * @param sampleRate - The sample rate (default: 24000 Hz for Kokoro)
 * @returns AudioBuffer ready for playback
 */
const createAudioBufferFromVector = (
  audioVector: Float32Array,
  audioContext: AudioContext,
  sampleRate: number = 24000
): AudioBuffer => {
  const audioBuffer = audioContext.createBuffer(
    1,
    audioVector.length,
    sampleRate
  );
  const channelData = audioBuffer.getChannelData(0);
  channelData.set(audioVector);

  return audioBuffer;
};

export const TextToSpeechLLMScreen = ({ onBack }: TextToSpeechLLMProps) => {
  const [displayText, setDisplayText] = useState('');
  const [isTtsStreaming, setIsTtsStreaming] = useState(false);
  const llm = useLLM({ model: LLAMA3_2_1B_QLORA });
  const tts = useTextToSpeech({
    model: KOKORO_MEDIUM,
    voice: KOKORO_VOICE_AF_HEART,
  });

  const processedLengthRef = useRef(0);
  const audioContextRef = useRef<AudioContext | null>(null);
  const sourceRef = useRef<AudioBufferSourceNode>(null);

  useEffect(() => {
    AudioManager.setAudioSessionOptions({
      iosCategory: 'playAndRecord',
      iosMode: 'spokenAudio',
      iosOptions: ['defaultToSpeaker'],
    });

    audioContextRef.current = new AudioContext({ sampleRate: 24000 });
    audioContextRef.current.suspend();

    return () => {
      audioContextRef.current?.close();
      audioContextRef.current = null;
    };
  }, []);

  // Update displayText gradually as response gets generated and insert new text chunks into TTS stream
  useEffect(() => {
    if (llm.response && tts.isReady) {
      setDisplayText(llm.response);

      const previousLength = processedLengthRef.current;
      if (llm.response.length > previousLength && isTtsStreaming) {
        const newChunk = llm.response.slice(previousLength);
        tts.streamInsert(newChunk);
        processedLengthRef.current = llm.response.length;
      }
    } else {
      processedLengthRef.current = 0;
    }
  }, [llm.response, tts, isTtsStreaming]);

  const handleGenerate = async () => {
    setDisplayText('');
    processedLengthRef.current = 0;
    setIsTtsStreaming(true);

    const startTTS = async () => {
      try {
        const audioContext = audioContextRef.current;
        if (!audioContext) return;

        if (audioContext.state === 'suspended') {
          await audioContext.resume();
        }

        const onNext = async (audioVec: Float32Array) => {
          return new Promise<void>((resolve) => {
            const audioBuffer = createAudioBufferFromVector(
              audioVec,
              audioContext,
              24000
            );

            const source = (sourceRef.current =
              audioContext.createBufferSource());
            source.buffer = audioBuffer;
            source.connect(audioContext.destination);

            source.onEnded = () => resolve();

            source.start();
          });
        };

        await tts.stream({
          speed: 0.9,
          stopAutomatically: false,
          onNext,
        });
      } catch (e) {
        console.error('TTS streaming error:', e);
      } finally {
        setIsTtsStreaming(false);
      }
    };

    const ttsPromise = startTTS();

    try {
      await llm.sendMessage(
        'Generate a short story about a robot learning to paint. The story should be around 200 words long.'
      );
    } catch (e) {
      console.error('Generation failed:', e);
    } finally {
      tts.streamStop(false);
      await ttsPromise;

      if (
        audioContextRef.current &&
        audioContextRef.current.state === 'running'
      ) {
        await audioContextRef.current.suspend();
      }
    }
  };

  const handleStop = () => {
    llm.interrupt();
    tts.streamStop(true);
    if (sourceRef.current) {
      try {
        sourceRef.current.stop();
      } catch (e) {
        // Source might have already stopped or disconnected
      }
    }
  };

  const isProcessing = llm.isGenerating || isTtsStreaming;
  const isModelsReady = llm.isReady && tts.isReady;

  const getModelStatus = () => {
    if (llm.error) return `LLM Error: ${llm.error.message}`;
    if (tts.error) return `TTS Error: ${tts.error.message}`;
    if (!llm.isReady)
      return `Loading LLM: ${(100 * llm.downloadProgress).toFixed(2)}%`;
    if (!tts.isReady)
      return `Loading TTS: ${(100 * tts.downloadProgress).toFixed(2)}%`;
    if (isProcessing) return 'Generating/Streaming...';
    return 'Ready';
  };

  return (
    <SafeAreaProvider>
      <SafeAreaView style={styles.container}>
        <View style={styles.header}>
          <TouchableOpacity style={styles.backButton} onPress={onBack}>
            <FontAwesome name="chevron-left" size={20} color="#0f186e" />
          </TouchableOpacity>
          <SWMIcon width={60} height={60} />
          <Text style={styles.headerText}>React Native ExecuTorch</Text>
          <Text style={styles.headerText}>LLM to Speech Demo</Text>
        </View>

        <View style={styles.statusContainer}>
          <Text>Status: {getModelStatus()}</Text>
        </View>

        <View style={styles.contentContainer}>
          <Text style={styles.label}>Generated Story</Text>
          <View style={styles.responseContainer}>
            <ScrollView contentContainerStyle={styles.responseContent}>
              <Text style={styles.responseText}>
                {displayText ||
                  (isModelsReady
                    ? 'Press the button to generate a story and hear it spoken aloud.'
                    : 'Please wait for models to load...')}
              </Text>
            </ScrollView>
          </View>
        </View>

        <View style={styles.buttonContainer}>
          {isProcessing ? (
            <TouchableOpacity
              style={[styles.actionButton, styles.stopButton]}
              onPress={handleStop}
            >
              <FontAwesome name="stop" size={20} color="white" />
              <Text style={styles.buttonText}>Stop Generation</Text>
            </TouchableOpacity>
          ) : (
            <TouchableOpacity
              disabled={!isModelsReady}
              onPress={handleGenerate}
              style={[styles.actionButton, !isModelsReady && styles.disabled]}
            >
              <FontAwesome name="magic" size={20} color="white" />
              <Text style={styles.buttonText}>Generate & Stream Speech</Text>
            </TouchableOpacity>
          )}
        </View>
      </SafeAreaView>
    </SafeAreaProvider>
  );
};

const styles = StyleSheet.create({
  container: {
    flex: 1,
    alignItems: 'center',
    backgroundColor: 'white',
    paddingHorizontal: 16,
  },
  header: {
    alignItems: 'center',
    position: 'relative',
    width: '100%',
  },
  backButton: {
    position: 'absolute',
    left: 0,
    top: 10,
    padding: 10,
    zIndex: 1,
  },
  headerText: {
    fontSize: 22,
    fontWeight: 'bold',
    color: '#0f186e',
  },
  statusContainer: {
    marginTop: 12,
    alignItems: 'center',
  },
  contentContainer: {
    width: '100%',
    marginTop: 24,
    flex: 1,
    marginBottom: 24,
  },
  label: {
    marginLeft: 12,
    marginBottom: 4,
    color: '#0f186e',
    fontWeight: '600',
  },
  responseContainer: {
    borderRadius: 12,
    borderWidth: 1,
    borderColor: '#0f186e',
    flex: 1,
  },
  responseContent: {
    padding: 12,
  },
  responseText: {
    fontSize: 16,
    color: '#333',
    lineHeight: 24,
  },
  buttonContainer: {
    marginBottom: 24,
    width: '100%',
  },
  actionButton: {
    backgroundColor: '#0f186e',
    flexDirection: 'row',
    justifyContent: 'center',
    alignItems: 'center',
    padding: 12,
    borderRadius: 12,
    gap: 8,
  },
  stopButton: {
    backgroundColor: '#ff4444',
  },
  buttonText: {
    color: 'white',
    fontWeight: '600',
    letterSpacing: -0.5,
    fontSize: 16,
  },
  disabled: {
    opacity: 0.5,
  },
});
```
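The incremental-feed logic above (the effect that slices the not-yet-spoken tail of `llm.response` and tracks it in `processedLengthRef`) is pure string bookkeeping and can be isolated outside React. A minimal sketch of that pattern; the helper name `nextChunk` is hypothetical and introduced here only for illustration:

```typescript
// Given the full response so far and how much of it has already been
// fed to TTS, return the new unprocessed tail and the updated length.
// Mirrors the processedLengthRef bookkeeping in the effect above.
function nextChunk(
  fullResponse: string,
  processedLength: number
): { chunk: string; processedLength: number } {
  if (fullResponse.length <= processedLength) {
    return { chunk: '', processedLength };
  }
  return {
    chunk: fullResponse.slice(processedLength),
    processedLength: fullResponse.length,
  };
}

// Simulate the LLM response growing token by token; each new chunk
// is what would be passed to tts.streamInsert(chunk).
let processed = 0;
const spoken: string[] = [];
for (const snapshot of ['Once', 'Once upon', 'Once upon a time.']) {
  const { chunk, processedLength } = nextChunk(snapshot, processed);
  processed = processedLength;
  if (chunk) spoken.push(chunk);
}
console.log(spoken.join('')); // "Once upon a time."
```

Because each chunk is computed from a length offset rather than by diffing strings, this stays correct even when the hook re-renders with the same response value (the chunk is then empty and nothing is inserted twice).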

docs/docs/03-hooks/01-natural-language-processing/useTextToSpeech.md

Lines changed: 2 additions & 1 deletion
```diff
@@ -87,7 +87,8 @@ The module provides two ways to generate speech using either raw text or pre-gen
 ### Using Text

 1. [**`forward({ text, speed })`**](../../06-api-reference/interfaces/TextToSpeechType.md#forward): Generates the complete audio waveform at once. Returns a promise resolving to a `Float32Array`.
-2. [**`stream({ text, speed, onNext, ... })`**](../../06-api-reference/interfaces/TextToSpeechType.md#stream): An async generator that yields chunks of audio as they are computed. This is ideal for reducing the "time to first audio" for long sentences.
+2. [**`stream({ speed, stopAutomatically, onNext, ... })`**](../../06-api-reference/interfaces/TextToSpeechType.md#stream): An async generator-like functionality (managed via callbacks like `onNext`) that yields chunks of audio as they are computed.
+   This is ideal for reducing the "time to first audio" for long sentences. You can also dynamically insert text during the generation process using `streamInsert(text)` and stop it with `streamStop(instant)`.

 ### Using Phonemes
```
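The docs change above introduces a producer-style streaming surface: an open stream that accepts text via `streamInsert(text)` and is closed via `streamStop(instant)`. A sketch of the producer side against that surface; the `TTSStreamer` interface, the `pipeTokens` helper, and the recording stub are hypothetical, written here only to show the intended call sequence:

```typescript
// Hypothetical minimal interface mirroring the streaming methods
// named in the docs diff above (streamInsert / streamStop).
interface TTSStreamer {
  streamInsert(text: string): void;
  streamStop(instant: boolean): void;
}

// Feed already-generated text chunks into an active stream, then
// close it. instant = false lets buffered text finish speaking
// instead of cutting playback off immediately.
function pipeTokens(tts: TTSStreamer, tokens: string[]): void {
  for (const token of tokens) {
    tts.streamInsert(token);
  }
  tts.streamStop(false);
}

// Stub implementation that records calls, standing in for the real
// module so the sequence can be demonstrated without native code.
class RecordingStreamer implements TTSStreamer {
  inserted: string[] = [];
  stopped: boolean | null = null;
  streamInsert(text: string): void {
    this.inserted.push(text);
  }
  streamStop(instant: boolean): void {
    this.stopped = instant;
  }
}

const recorder = new RecordingStreamer();
pipeTokens(recorder, ['Once ', 'upon ', 'a time.']);
console.log(recorder.inserted.join('')); // "Once upon a time."
```

In the real module the stream must first be opened with `stream({ stopAutomatically: false, onNext, ... })`, which is where audio chunks are consumed; the sketch covers only the insert/stop ordering.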

docs/docs/04-typescript-api/01-natural-language-processing/TextToSpeechModule.md

Lines changed: 2 additions & 2 deletions
```diff
@@ -52,14 +52,14 @@ The module provides two ways to generate speech using either raw text or pre-gen
 ### Using Text

 1. [**`forward(text, speed)`**](../../06-api-reference/classes/TextToSpeechModule.md#forward): Generates the complete audio waveform at once. Returns a promise resolving to a `Float32Array`.
-2. [**`stream({ text, speed })`**](../../06-api-reference/classes/TextToSpeechModule.md#stream): An async generator that yields chunks of audio as they are computed. This is ideal for reducing the "time to first audio" for long sentences.
+2. [**`stream({ speed, stopAutomatically, onNext, ... })`**](../../06-api-reference/classes/TextToSpeechModule.md#stream): An async generator that yields chunks of audio as they are computed. This is ideal for reducing the "time to first audio" for long sentences. In contrast to `forward`, it enables inserting text chunks dynamically into processing buffer with [**`streamInsert(text)`**](../../06-api-reference/classes/TextToSpeechModule.md#streaminsert) and allows stopping generation early with [**`streamStop(instant)`**](../../06-api-reference/classes/TextToSpeechModule.md#streamstop).

 ### Using Phonemes

 If you have pre-computed phonemes (e.g., from an external dictionary or a custom G2P model), you can skip the internal phoneme generation step:

 1. [**`forwardFromPhonemes(phonemes, speed)`**](../../06-api-reference/classes/TextToSpeechModule.md#forwardfromphonemes): Generates the complete audio waveform from a phoneme string.
-2. [**`streamFromPhonemes({ phonemes, speed })`**](../../06-api-reference/classes/TextToSpeechModule.md#streamfromphonemes): Streams audio chunks generated from a phoneme string.
+2. [**`streamFromPhonemes({ phonemes, speed, onNext, ... })`**](../../06-api-reference/classes/TextToSpeechModule.md#streamfromphonemes): Streams audio chunks generated from a phoneme string.

 :::note
 Since `forward` and `forwardFromPhonemes` process the entire input at once, they might take a significant amount of time to produce audio for long inputs.
```

0 commit comments
