
Commit 310465d

Authored by yocontra and IgorSwat
feat!: allow passing pre-computed phonemes to Kokoro TTS
Right now if you want to use Kokoro TTS, you have to go through the built-in phonemis G2P pipeline. There's no way around it. This PR adds `generateFromPhonemes` / `streamFromPhonemes` methods that let you skip phonemis and pass your own IPA phoneme strings directly to the synthesis engine.

Why would you want this? A few reasons we've run into:

- phonemis doesn't handle every word well. Libraries like [phonemizer](https://github.com/bootphon/phonemizer) (espeak-ng backend) do better on edge cases, foreign words, etc.
- Custom lexicons. If you have domain-specific pronunciations (game character names, medical terms), you probably want control over the G2P step.
- Server-side G2P. Pre-compute phonemes on a server with a proper NLP pipeline, then send them to the device.
- Languages phonemis doesn't cover yet.

## What changed

The existing `generate()` / `stream()` methods now delegate to shared internal helpers (`generateFromPhonemesImpl` / `streamFromPhonemesImpl`). The new public methods call the same helpers but skip the `phonemizer_.process()` step. No behavior change for existing callers.

Changes across layers:

- C++ `Kokoro`: `generateFromPhonemes`, `streamFromPhonemes`, plus input validation (empty string, invalid UTF-8)
- JSI `ModelHostObject`: exposes the new methods
- `TextToSpeechModule`: `forwardFromPhonemes()`, `streamFromPhonemes()` (shared `streamImpl` helper, no copy-paste)
- `useTextToSpeech` hook: same, with a shared guard and streaming orchestration
- Types: `TextToSpeechPhonemeInput`, `TextToSpeechStreamingPhonemeInput`, `TextToSpeechStreamingCallbacks`

## Usage

```typescript
const tts = new TextToSpeechModule();
await tts.load(config);

// text path (unchanged -- goes through phonemis)
const audio = await tts.forward("Hello world");

// phoneme path (bypasses phonemis)
const audioFromPhonemes = await tts.forwardFromPhonemes("həloʊ wɝːld");

// streaming
for await (const chunk of tts.streamFromPhonemes({
  phonemes: "həloʊ wɝːld",
  speed: 1.0,
})) {
  playAudio(chunk);
}
```

## Test plan

- [ ] Existing `generate()` and `stream()` still work (refactor is internal)
- [ ] `generateFromPhonemes()` with known Kokoro IPA strings
- [ ] `streamFromPhonemes()` produces the same audio as `stream()` for identical phonemes
- [ ] Multi-byte UTF-8 phoneme characters (ʊ, ɪ, ŋ, etc.)
- [ ] Empty string and invalid UTF-8 rejected with a proper error

Co-authored-by: IgorSwat <igorswat2002@o2.pl>
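Since the native layer rejects empty strings and invalid UTF-8, callers can mirror that check in JS to fail fast before crossing the JSI boundary. A minimal sketch (the `assertValidPhonemes` helper is illustrative only, not part of this PR's API):

```typescript
// Illustrative pre-flight guard mirroring the native-side validation.
// NOTE: `assertValidPhonemes` is a hypothetical helper, not part of the
// react-native-executorch API; the authoritative checks live in C++.
function assertValidPhonemes(phonemes: string): string {
  if (phonemes.length === 0) {
    throw new Error('phoneme string must not be empty');
  }
  // JS strings are UTF-16; a lone surrogate cannot be round-tripped to
  // valid UTF-8. encodeURIComponent throws URIError on such strings.
  encodeURIComponent(phonemes);
  return phonemes;
}
```

With this in place, `assertValidPhonemes('həloʊ wɝːld')` passes through unchanged, while `''` or a string containing a lone surrogate throws before any native call is made.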
1 parent f43d3c5 commit 310465d

9 files changed (+377 −98 lines)

.cspell-wordlist.txt

Lines changed: 1 addition & 0 deletions
```diff
@@ -127,3 +127,4 @@ detr
 metaprogramming
 ktlint
 lefthook
+espeak
```

docs/docs/03-hooks/01-natural-language-processing/useTextToSpeech.md

Lines changed: 55 additions & 6 deletions
````diff
@@ -82,17 +82,24 @@ You need more details? Check the following resources:
 
 ## Running the model
 
-The module provides two ways to generate speech:
+The module provides two ways to generate speech using either raw text or pre-generated phonemes:
 
-1. [**`forward(text, speed)`**](../../06-api-reference/interfaces/TextToSpeechType.md#forward): Generates the complete audio waveform at once. Returns a promise resolving to a `Float32Array`.
+### Using Text
+
+1. [**`forward({ text, speed })`**](../../06-api-reference/interfaces/TextToSpeechType.md#forward): Generates the complete audio waveform at once. Returns a promise resolving to a `Float32Array`.
+2. [**`stream({ text, speed, onNext, ... })`**](../../06-api-reference/interfaces/TextToSpeechType.md#stream): An async generator that yields chunks of audio as they are computed. This is ideal for reducing the "time to first audio" for long sentences.
+
+### Using Phonemes
+
+If you have pre-computed phonemes (e.g., from an external dictionary or a custom G2P model), you can skip the internal phoneme generation step:
+
+1. [**`forwardFromPhonemes({ phonemes, speed })`**](../../06-api-reference/interfaces/TextToSpeechType.md#forwardfromphonemes): Generates the complete audio waveform from a phoneme string.
+2. [**`streamFromPhonemes({ phonemes, speed, onNext, ... })`**](../../06-api-reference/interfaces/TextToSpeechType.md#streamfromphonemes): Streams audio chunks generated from a phoneme string.
 
 :::note
-Since it processes the entire text at once, it might take a significant amount of time to produce an audio for long text inputs.
+Since `forward` and `forwardFromPhonemes` process the entire input at once, they might take a significant amount of time to produce audio for long inputs.
 :::
 
-2. [**`stream({ text, speed })`**](../../06-api-reference/interfaces/TextToSpeechType.md#stream): An async generator that yields chunks of audio as they are computed.
-   This is ideal for reducing the "time to first audio" for long sentences.
-
 ## Example
 
 ### Speech Synthesis
@@ -185,6 +192,48 @@ export default function App() {
 }
 ```
 
+### Synthesis from Phonemes
+
+If you already have a phoneme string obtained from an external source (e.g. the Python `phonemizer` library,
+`espeak-ng`, or any custom phonemizer), you can use `forwardFromPhonemes` or `streamFromPhonemes` to synthesize audio directly, skipping the phoneme generation stage.
+
+```tsx
+import React from 'react';
+import { Button, View } from 'react-native';
+import {
+  useTextToSpeech,
+  KOKORO_MEDIUM,
+  KOKORO_VOICE_AF_HEART,
+} from 'react-native-executorch';
+
+export default function App() {
+  const tts = useTextToSpeech({
+    model: KOKORO_MEDIUM,
+    voice: KOKORO_VOICE_AF_HEART,
+  });
+
+  const synthesizePhonemes = async () => {
+    // Example phonemes
+    const audioData = await tts.forwardFromPhonemes({
+      phonemes:
+        'ɐ mˈæn hˌu dˈʌzᵊnt tɹˈʌst hɪmsˈɛlf, kæn nˈɛvəɹ ɹˈiᵊli tɹˈʌst ˈɛniwˌʌn ˈɛls.',
+    });
+
+    // ... process or play audioData ...
+  };
+
+  return (
+    <View style={{ flex: 1, justifyContent: 'center', alignItems: 'center' }}>
+      <Button
+        title="Synthesize Phonemes"
+        onPress={synthesizePhonemes}
+        disabled={!tts.isReady}
+      />
+    </View>
+  );
+}
+```
+
 ## Supported models
 
 | Model | Language |
````

docs/docs/04-typescript-api/01-natural-language-processing/TextToSpeechModule.md

Lines changed: 43 additions & 4 deletions
````diff
@@ -53,16 +53,24 @@ For more information on resource sources, see [loading models](../../01-fundamen
 
 ## Running the model
 
-The module provides two ways to generate speech:
+The module provides two ways to generate speech using either raw text or pre-generated phonemes:
+
+### Using Text
 
 1. [**`forward(text, speed)`**](../../06-api-reference/classes/TextToSpeechModule.md#forward): Generates the complete audio waveform at once. Returns a promise resolving to a `Float32Array`.
+2. [**`stream({ text, speed })`**](../../06-api-reference/classes/TextToSpeechModule.md#stream): An async generator that yields chunks of audio as they are computed. This is ideal for reducing the "time to first audio" for long sentences.
+
+### Using Phonemes
+
+If you have pre-computed phonemes (e.g., from an external dictionary or a custom G2P model), you can skip the internal phoneme generation step:
+
+1. [**`forwardFromPhonemes(phonemes, speed)`**](../../06-api-reference/classes/TextToSpeechModule.md#forwardfromphonemes): Generates the complete audio waveform from a phoneme string.
+2. [**`streamFromPhonemes({ phonemes, speed })`**](../../06-api-reference/classes/TextToSpeechModule.md#streamfromphonemes): Streams audio chunks generated from a phoneme string.
 
 :::note
-Since it processes the entire text at once, it might take a significant amount of time to produce an audio for long text inputs.
+Since `forward` and `forwardFromPhonemes` process the entire input at once, they might take a significant amount of time to produce audio for long inputs.
 :::
 
-2. [**`stream({ text, speed })`**](../../06-api-reference/classes/TextToSpeechModule.md#stream): An async generator that yields chunks of audio as they are computed. This is ideal for reducing the "time to first audio" for long sentences.
-
 ## Example
 
 ### Speech Synthesis
@@ -135,3 +143,34 @@ try {
   console.error('Streaming failed:', error);
 }
 ```
+
+### Synthesis from Phonemes
+
+If you already have a phoneme string (e.g., from an external library), you can use `forwardFromPhonemes` or `streamFromPhonemes` to synthesize audio directly, skipping the internal phonemizer stage.
+
+```typescript
+import {
+  TextToSpeechModule,
+  KOKORO_MEDIUM,
+  KOKORO_VOICE_AF_HEART,
+} from 'react-native-executorch';
+
+const tts = new TextToSpeechModule();
+
+await tts.load({
+  model: KOKORO_MEDIUM,
+  voice: KOKORO_VOICE_AF_HEART,
+});
+
+// Example phonemes for "Hello world!"
+const waveform = await tts.forwardFromPhonemes('həlˈO wˈɜɹld!', 1.0);
+
+// Or stream from phonemes
+for await (const chunk of tts.streamFromPhonemes({
+  phonemes:
+    'ɐ mˈæn hˌu dˈʌzᵊnt tɹˈʌst hɪmsˈɛlf, kæn nˈɛvəɹ ɹˈiᵊli tɹˈʌst ˈɛniwˌʌn ˈɛls.',
+  speed: 1.0,
+})) {
+  // ... process chunk ...
+}
+```
````
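Both docs pages consume the stream with `for await`, where each yielded chunk is a `Float32Array`; a common follow-up task is stitching the chunks into one waveform. A self-contained sketch of that pattern (the mock generator stands in for `streamFromPhonemes`, which requires the native runtime):

```typescript
// `mockStream` is a stand-in for `tts.streamFromPhonemes(...)`, which
// cannot run outside the React Native / native module environment.
async function* mockStream(): AsyncGenerator<Float32Array> {
  yield new Float32Array([0.1, 0.2]);
  yield new Float32Array([0.3]);
}

// Collect all streamed chunks, then copy each one into a single buffer
// allocated at the final size.
async function collectWaveform(
  stream: AsyncIterable<Float32Array>
): Promise<Float32Array> {
  const chunks: Float32Array[] = [];
  let total = 0;
  for await (const chunk of stream) {
    chunks.push(chunk);
    total += chunk.length;
  }
  const waveform = new Float32Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    waveform.set(chunk, offset);
    offset += chunk.length;
  }
  return waveform;
}
```

The same `collectWaveform` helper works unchanged on the real stream, since it only assumes an async iterable of `Float32Array` chunks.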

packages/react-native-executorch/common/rnexecutorch/host_objects/ModelHostObject.h

Lines changed: 8 additions & 0 deletions
```diff
@@ -171,6 +171,14 @@ template <typename Model> class ModelHostObject : public JsiHostObject {
     addFunctions(JSI_EXPORT_FUNCTION(ModelHostObject<Model>,
                                      promiseHostFunction<&Model::stream>,
                                      "stream"));
+    addFunctions(JSI_EXPORT_FUNCTION(
+        ModelHostObject<Model>,
+        promiseHostFunction<&Model::generateFromPhonemes>,
+        "generateFromPhonemes"));
+    addFunctions(JSI_EXPORT_FUNCTION(
+        ModelHostObject<Model>,
+        promiseHostFunction<&Model::streamFromPhonemes>,
+        "streamFromPhonemes"));
   }
 
   if constexpr (meta::HasGenerateFromString<Model>) {
```

packages/react-native-executorch/common/rnexecutorch/models/text_to_speech/kokoro/Kokoro.cpp

Lines changed: 61 additions & 40 deletions
```diff
@@ -4,6 +4,7 @@
 
 #include <algorithm>
 #include <fstream>
+#include <phonemis/utilities/string_utils.h>
 #include <rnexecutorch/Error.h>
 #include <rnexecutorch/data_processing/Sequential.h>
 
@@ -73,16 +74,9 @@ void Kokoro::loadVoice(const std::string &voiceSource) {
   }
 }
 
-std::vector<float> Kokoro::generate(std::string text, float speed) {
-  if (text.size() > params::kMaxTextSize) {
-    throw RnExecutorchError(RnExecutorchErrorCode::InvalidUserInput,
-                            "Kokoro: maximum input text size exceeded");
-  }
-
-  // G2P (Grapheme to Phoneme) conversion
-  auto phonemes = phonemizer_.process(text);
-
-  // Divide the phonemes string intro substrings.
+std::vector<float>
+Kokoro::generateFromPhonemesImpl(const std::u32string &phonemes, float speed) {
+  // Divide the phonemes string into substrings.
   // Affects the further calculations only in case of string size
   // exceeding the biggest model's input.
   auto subsentences =
@@ -98,26 +92,20 @@ std::vector<float> Kokoro::generate(std::string text, float speed) {
     size_t pauseMs = params::kPauseValues.contains(lastPhoneme)
                          ? params::kPauseValues.at(lastPhoneme)
                          : params::kDefaultPause;
-    std::vector<float> pause(pauseMs * constants::kSamplesPerMilisecond, 0.F);
 
-    // Add audio part and pause to the main audio vector
+    // Add audio part and silence pause to the main audio vector
    audio.insert(audio.end(), std::make_move_iterator(audioPart.begin()),
                 std::make_move_iterator(audioPart.end()));
-    audio.insert(audio.end(), std::make_move_iterator(pause.begin()),
-                 std::make_move_iterator(pause.end()));
+    audio.resize(audio.size() + pauseMs * constants::kSamplesPerMilisecond,
+                 0.F);
   }
 
   return audio;
 }
 
-void Kokoro::stream(std::string text, float speed,
-                    std::shared_ptr<jsi::Function> callback) {
-  if (text.size() > params::kMaxTextSize) {
-    throw RnExecutorchError(RnExecutorchErrorCode::InvalidUserInput,
-                            "Kokoro: maximum input text size exceeded");
-  }
-
-  // Build a full callback function
+void Kokoro::streamFromPhonemesImpl(
+    const std::u32string &phonemes, float speed,
+    std::shared_ptr<jsi::Function> callback) {
   auto nativeCallback = [this, callback](const std::vector<float> &audioVec) {
     if (this->isStreaming_) {
       this->callInvoker_->invokeAsync([callback, audioVec](jsi::Runtime &rt) {
@@ -127,21 +115,12 @@ void Kokoro::stream(std::string text, float speed,
     }
   };
 
-  // Mark the beginning of the streaming process
   isStreaming_ = true;
 
-  // G2P (Grapheme to Phoneme) conversion
-  auto phonemes = phonemizer_.process(text);
-
-  // Divide the phonemes string intro substrings.
-  // Use specialized implementation to minimize the latency between the
-  // sentences.
+  // Use LATENCY strategy to minimize the time-to-first-audio for streaming
   auto subsentences =
       partitioner_.divide<Partitioner::Strategy::LATENCY>(phonemes);
 
-  // We follow the implementation of generate() method, but
-  // instead of accumulating results in a vector, we push them
-  // back to the JS side with the callback.
   for (size_t i = 0; i < subsentences.size(); i++) {
     if (!isStreaming_) {
       break;
@@ -151,7 +130,7 @@ void Kokoro::stream(std::string text, float speed,
 
     // Determine the silent padding duration to be stripped from the edges of
     // the generated audio. If a chunk ends with a space or follows one that
-    // did, it indicates a word boundary split – we use a shorter padding (20ms)
+    // did, it indicates a word boundary split – we use a shorter padding
     // to ensure natural speech flow. Otherwise, we use 50ms for standard
     // pauses.
     bool endsWithSpace = (subsentence.back() == U' ');
@@ -161,25 +140,67 @@ void Kokoro::stream(std::string text, float speed,
     // Generate an audio vector with the Kokoro model
     auto audioPart = synthesize(subsentence, speed, paddingMs);
 
-    // Calculate a pause between the sentences
+    // Calculate and append a pause between the sentences
     char32_t lastPhoneme = subsentence.back();
     size_t pauseMs = params::kPauseValues.contains(lastPhoneme)
                          ? params::kPauseValues.at(lastPhoneme)
                          : params::kDefaultPause;
-    std::vector<float> pause(pauseMs * constants::kSamplesPerMilisecond, 0.F);
-
-    // Add pause to the audio vector
-    audioPart.insert(audioPart.end(), std::make_move_iterator(pause.begin()),
-                     std::make_move_iterator(pause.end()));
+    audioPart.resize(
+        audioPart.size() + pauseMs * constants::kSamplesPerMilisecond, 0.F);
 
     // Push the audio right away to the JS side
     nativeCallback(audioPart);
   }
 
-  // Mark the end of the streaming process
   isStreaming_ = false;
 }
 
+std::vector<float> Kokoro::generate(std::string text, float speed) {
+  if (text.size() > params::kMaxTextSize) {
+    throw RnExecutorchError(RnExecutorchErrorCode::InvalidUserInput,
+                            "Kokoro: maximum input text size exceeded");
+  }
+
+  // G2P (Grapheme to Phoneme) conversion
+  auto phonemes = phonemizer_.process(text);
+
+  return generateFromPhonemesImpl(phonemes, speed);
+}
+
+std::vector<float> Kokoro::generateFromPhonemes(std::string phonemes,
+                                                float speed) {
+  if (phonemes.empty()) {
+    throw RnExecutorchError(RnExecutorchErrorCode::InvalidUserInput,
+                            "Kokoro: phoneme string must not be empty");
+  }
+  return generateFromPhonemesImpl(
+      phonemis::utilities::string_utils::utf8_to_u32string(phonemes), speed);
+}
+
+void Kokoro::stream(std::string text, float speed,
+                    std::shared_ptr<jsi::Function> callback) {
+  if (text.size() > params::kMaxTextSize) {
+    throw RnExecutorchError(RnExecutorchErrorCode::InvalidUserInput,
+                            "Kokoro: maximum input text size exceeded");
+  }
+
+  // G2P (Grapheme to Phoneme) conversion
+  auto phonemes = phonemizer_.process(text);
+
+  streamFromPhonemesImpl(phonemes, speed, callback);
+}
+
+void Kokoro::streamFromPhonemes(std::string phonemes, float speed,
+                                std::shared_ptr<jsi::Function> callback) {
+  if (phonemes.empty()) {
+    throw RnExecutorchError(RnExecutorchErrorCode::InvalidUserInput,
+                            "Kokoro: phoneme string must not be empty");
+  }
+  streamFromPhonemesImpl(
+      phonemis::utilities::string_utils::utf8_to_u32string(phonemes), speed,
+      callback);
+}
+
 void Kokoro::streamStop() noexcept { isStreaming_ = false; }
 
 std::vector<float> Kokoro::synthesize(const std::u32string &phonemes,
```
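One detail of the refactor above: the old code built a separate zero-filled `pause` vector and move-inserted it, while the new code resizes the audio buffer in place, appending silence samples directly. A TypeScript sketch of the same arithmetic; the 24 samples-per-millisecond figure is an assumption based on Kokoro's 24 kHz output rate, and the real value lives in `constants::kSamplesPerMilisecond` in the C++ sources:

```typescript
const SAMPLES_PER_MS = 24; // assumption for illustration (24 kHz / 1000)

// Equivalent of `audio.resize(audio.size() + pauseMs * samplesPerMs, 0.F)`:
// extend the waveform with `pauseMs` milliseconds of silence.
function appendSilence(audio: number[], pauseMs: number): number[] {
  return audio.concat(new Array(pauseMs * SAMPLES_PER_MS).fill(0));
}
```

Under this assumption a 10 ms pause appends 240 zero samples, so `appendSilence([0.5], 10)` yields a 241-sample buffer.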

packages/react-native-executorch/common/rnexecutorch/models/text_to_speech/kokoro/Kokoro.h

Lines changed: 17 additions & 0 deletions
```diff
@@ -27,11 +27,22 @@ class Kokoro {
   // Processes the entire text at once, before sending back to the JS side.
   std::vector<float> generate(std::string text, float speed = 1.F);
 
+  // Accepts pre-computed phonemes (as a UTF-8 IPA string) and synthesizes
+  // audio, bypassing the built-in phonemizer. This allows callers to use
+  // an external G2P system (e.g. the Python `phonemizer` library, espeak-ng,
+  // or any custom phonemizer).
+  std::vector<float> generateFromPhonemes(std::string phonemes,
+                                          float speed = 1.F);
+
   // Processes text in chunks, sending each chunk individually to the JS side
   // with asynchronous callbacks.
   void stream(std::string text, float speed,
               std::shared_ptr<jsi::Function> callback);
 
+  // Streaming variant that accepts pre-computed phonemes instead of text.
+  void streamFromPhonemes(std::string phonemes, float speed,
+                          std::shared_ptr<jsi::Function> callback);
+
   // Stops the streaming process
   void streamStop() noexcept;
 
@@ -42,6 +53,12 @@ class Kokoro {
   // Helper function - loading voice array
   void loadVoice(const std::string &voiceSource);
 
+  // Helper function - shared synthesis pipeline (partition + synthesize)
+  std::vector<float> generateFromPhonemesImpl(const std::u32string &phonemes,
+                                              float speed);
+  void streamFromPhonemesImpl(const std::u32string &phonemes, float speed,
+                              std::shared_ptr<jsi::Function> callback);
+
   // Helper function - generate specialization for given input size
   std::vector<float> synthesize(const std::u32string &phonemes, float speed,
                                 size_t paddingMs = 50);
```
