
Commit 310465d

Authored by yocontra and IgorSwat
feat!: allow passing pre-computed phonemes to Kokoro TTS
Right now if you want to use Kokoro TTS, you have to go through the built-in phonemis G2P pipeline. There's no way around it. This PR adds `generateFromPhonemes` / `streamFromPhonemes` methods that let you skip phonemis and pass your own IPA phoneme strings directly to the synthesis engine.

Why would you want this? A few reasons we've run into:

- phonemis doesn't handle every word well. Libraries like [phonemizer](https://github.com/bootphon/phonemizer) (espeak-ng backend) do better on edge cases, foreign words, etc.
- Custom lexicons. If you have domain-specific pronunciations (game character names, medical terms), you probably want control over the G2P step.
- Server-side G2P. Pre-compute phonemes on a server with a proper NLP pipeline, then send them to the device.
- Languages phonemis doesn't cover yet.

## What changed

The existing `generate()` / `stream()` methods now delegate to shared internal helpers (`generateFromPhonemesImpl` / `streamFromPhonemesImpl`). The new public methods call the same helpers but skip the `phonemizer_.process()` step. No behavior change for existing callers.

Changes across layers:

- C++ `Kokoro`: `generateFromPhonemes`, `streamFromPhonemes`, plus input validation (empty string, invalid UTF-8)
- JSI `ModelHostObject`: exposes the new methods
- `TextToSpeechModule`: `forwardFromPhonemes()`, `streamFromPhonemes()` (shared `streamImpl` helper, no copy-paste)
- `useTextToSpeech` hook: same, with a shared guard and streaming orchestration
- Types: `TextToSpeechPhonemeInput`, `TextToSpeechStreamingPhonemeInput`, `TextToSpeechStreamingCallbacks`

## Usage

```typescript
const tts = new TextToSpeechModule();
await tts.load(config);

// text path (unchanged -- goes through phonemis)
const audio = await tts.forward("Hello world");

// phoneme path (bypasses phonemis)
const audioFromPhonemes = await tts.forwardFromPhonemes("həloʊ wɝːld");

// streaming
for await (const chunk of tts.streamFromPhonemes({
  phonemes: "həloʊ wɝːld",
  speed: 1.0,
})) {
  playAudio(chunk);
}
```

## Test plan

- [ ] Existing `generate()` and `stream()` still work (refactor is internal)
- [ ] `generateFromPhonemes()` with known Kokoro IPA strings
- [ ] `streamFromPhonemes()` produces the same audio as `stream()` for identical phonemes
- [ ] Multi-byte UTF-8 phoneme characters (ʊ, ɪ, ŋ, etc.)
- [ ] Empty string and invalid UTF-8 rejected with a proper error

Co-authored-by: IgorSwat <igorswat2002@o2.pl>
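Since the native layer rejects empty strings and invalid UTF-8, callers can mirror that check in JS to fail fast before crossing the JSI boundary. A minimal sketch (the `assertValidPhonemes` helper is illustrative only, not part of this PR's API):

```typescript
// Illustrative pre-flight guard mirroring the native-side validation.
// NOTE: `assertValidPhonemes` is a hypothetical helper, not part of the
// react-native-executorch API; the authoritative checks live in C++.
function assertValidPhonemes(phonemes: string): string {
  if (phonemes.length === 0) {
    throw new Error('phoneme string must not be empty');
  }
  // JS strings are UTF-16; a lone surrogate cannot be round-tripped to
  // valid UTF-8. encodeURIComponent throws URIError on such strings.
  encodeURIComponent(phonemes);
  return phonemes;
}
```

With this in place, `assertValidPhonemes('həloʊ wɝːld')` passes through unchanged, while `''` or a string containing a lone surrogate throws before any native call is made.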
1 parent f43d3c5 commit 310465d

9 files changed (+377 −98 lines)

.cspell-wordlist.txt

Lines changed: 1 addition & 0 deletions
```diff
@@ -127,3 +127,4 @@ detr
 metaprogramming
 ktlint
 lefthook
+espeak
```

docs/docs/03-hooks/01-natural-language-processing/useTextToSpeech.md

Lines changed: 55 additions & 6 deletions
````diff
@@ -82,17 +82,24 @@ You need more details? Check the following resources:
 
 ## Running the model
 
-The module provides two ways to generate speech:
+The module provides two ways to generate speech using either raw text or pre-generated phonemes:
 
-1. [**`forward(text, speed)`**](../../06-api-reference/interfaces/TextToSpeechType.md#forward): Generates the complete audio waveform at once. Returns a promise resolving to a `Float32Array`.
+### Using Text
+
+1. [**`forward({ text, speed })`**](../../06-api-reference/interfaces/TextToSpeechType.md#forward): Generates the complete audio waveform at once. Returns a promise resolving to a `Float32Array`.
+2. [**`stream({ text, speed, onNext, ... })`**](../../06-api-reference/interfaces/TextToSpeechType.md#stream): An async generator that yields chunks of audio as they are computed. This is ideal for reducing the "time to first audio" for long sentences.
+
+### Using Phonemes
+
+If you have pre-computed phonemes (e.g., from an external dictionary or a custom G2P model), you can skip the internal phoneme generation step:
+
+1. [**`forwardFromPhonemes({ phonemes, speed })`**](../../06-api-reference/interfaces/TextToSpeechType.md#forwardfromphonemes): Generates the complete audio waveform from a phoneme string.
+2. [**`streamFromPhonemes({ phonemes, speed, onNext, ... })`**](../../06-api-reference/interfaces/TextToSpeechType.md#streamfromphonemes): Streams audio chunks generated from a phoneme string.
 
 :::note
-Since it processes the entire text at once, it might take a significant amount of time to produce an audio for long text inputs.
+Since `forward` and `forwardFromPhonemes` process the entire input at once, they might take a significant amount of time to produce audio for long inputs.
 :::
 
-2. [**`stream({ text, speed })`**](../../06-api-reference/interfaces/TextToSpeechType.md#stream): An async generator that yields chunks of audio as they are computed.
-   This is ideal for reducing the "time to first audio" for long sentences.
-
 ## Example
 
 ### Speech Synthesis
@@ -185,6 +192,48 @@ export default function App() {
 }
 ```
 
+### Synthesis from Phonemes
+
+If you already have a phoneme string obtained from an external source (e.g. the Python `phonemizer` library,
+`espeak-ng`, or any custom phonemizer), you can use `forwardFromPhonemes` or `streamFromPhonemes` to synthesize audio directly, skipping the phoneme generation stage.
+
+```tsx
+import React from 'react';
+import { Button, View } from 'react-native';
+import {
+  useTextToSpeech,
+  KOKORO_MEDIUM,
+  KOKORO_VOICE_AF_HEART,
+} from 'react-native-executorch';
+
+export default function App() {
+  const tts = useTextToSpeech({
+    model: KOKORO_MEDIUM,
+    voice: KOKORO_VOICE_AF_HEART,
+  });
+
+  const synthesizePhonemes = async () => {
+    // Example phonemes
+    const audioData = await tts.forwardFromPhonemes({
+      phonemes:
+        'ɐ mˈæn hˌu dˈʌzᵊnt tɹˈʌst hɪmsˈɛlf, kæn nˈɛvəɹ ɹˈiᵊli tɹˈʌst ˈɛniwˌʌn ˈɛls.',
+    });
+
+    // ... process or play audioData ...
+  };
+
+  return (
+    <View style={{ flex: 1, justifyContent: 'center', alignItems: 'center' }}>
+      <Button
+        title="Synthesize Phonemes"
+        onPress={synthesizePhonemes}
+        disabled={!tts.isReady}
+      />
+    </View>
+  );
+}
+```
+
 ## Supported models
 
 | Model | Language |
````

docs/docs/04-typescript-api/01-natural-language-processing/TextToSpeechModule.md

Lines changed: 43 additions & 4 deletions
````diff
@@ -53,16 +53,24 @@ For more information on resource sources, see [loading models](../../01-fundamen
 
 ## Running the model
 
-The module provides two ways to generate speech:
+The module provides two ways to generate speech using either raw text or pre-generated phonemes:
+
+### Using Text
 
 1. [**`forward(text, speed)`**](../../06-api-reference/classes/TextToSpeechModule.md#forward): Generates the complete audio waveform at once. Returns a promise resolving to a `Float32Array`.
+2. [**`stream({ text, speed })`**](../../06-api-reference/classes/TextToSpeechModule.md#stream): An async generator that yields chunks of audio as they are computed. This is ideal for reducing the "time to first audio" for long sentences.
+
+### Using Phonemes
+
+If you have pre-computed phonemes (e.g., from an external dictionary or a custom G2P model), you can skip the internal phoneme generation step:
+
+1. [**`forwardFromPhonemes(phonemes, speed)`**](../../06-api-reference/classes/TextToSpeechModule.md#forwardfromphonemes): Generates the complete audio waveform from a phoneme string.
+2. [**`streamFromPhonemes({ phonemes, speed })`**](../../06-api-reference/classes/TextToSpeechModule.md#streamfromphonemes): Streams audio chunks generated from a phoneme string.
 
 :::note
-Since it processes the entire text at once, it might take a significant amount of time to produce an audio for long text inputs.
+Since `forward` and `forwardFromPhonemes` process the entire input at once, they might take a significant amount of time to produce audio for long inputs.
 :::
 
-2. [**`stream({ text, speed })`**](../../06-api-reference/classes/TextToSpeechModule.md#stream): An async generator that yields chunks of audio as they are computed. This is ideal for reducing the "time to first audio" for long sentences.
-
 ## Example
 
 ### Speech Synthesis
@@ -135,3 +143,34 @@ try {
   console.error('Streaming failed:', error);
 }
 ```
+
+### Synthesis from Phonemes
+
+If you already have a phoneme string (e.g., from an external library), you can use `forwardFromPhonemes` or `streamFromPhonemes` to synthesize audio directly, skipping the internal phonemizer stage.
+
+```typescript
+import {
+  TextToSpeechModule,
+  KOKORO_MEDIUM,
+  KOKORO_VOICE_AF_HEART,
+} from 'react-native-executorch';
+
+const tts = new TextToSpeechModule();
+
+await tts.load({
+  model: KOKORO_MEDIUM,
+  voice: KOKORO_VOICE_AF_HEART,
+});
+
+// Example phonemes for "Hello world!"
+const waveform = await tts.forwardFromPhonemes('həlˈO wˈɜɹld!', 1.0);
+
+// Or stream from phonemes
+for await (const chunk of tts.streamFromPhonemes({
+  phonemes:
+    'ɐ mˈæn hˌu dˈʌzᵊnt tɹˈʌst hɪmsˈɛlf, kæn nˈɛvəɹ ɹˈiᵊli tɹˈʌst ˈɛniwˌʌn ˈɛls.',
+  speed: 1.0,
+})) {
+  // ... process chunk ...
+}
+```
````
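Both docs pages consume the stream with `for await`, where each yielded chunk is a `Float32Array`; a common follow-up task is stitching the chunks into one waveform. A self-contained sketch of that pattern (the mock generator stands in for `streamFromPhonemes`, which requires the native runtime):

```typescript
// `mockStream` is a stand-in for `tts.streamFromPhonemes(...)`, which
// cannot run outside the React Native / native module environment.
async function* mockStream(): AsyncGenerator<Float32Array> {
  yield new Float32Array([0.1, 0.2]);
  yield new Float32Array([0.3]);
}

// Collect all streamed chunks, then copy each one into a single buffer
// allocated at the final size.
async function collectWaveform(
  stream: AsyncIterable<Float32Array>
): Promise<Float32Array> {
  const chunks: Float32Array[] = [];
  let total = 0;
  for await (const chunk of stream) {
    chunks.push(chunk);
    total += chunk.length;
  }
  const waveform = new Float32Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    waveform.set(chunk, offset);
    offset += chunk.length;
  }
  return waveform;
}
```

The same `collectWaveform` helper works unchanged on the real stream, since it only assumes an async iterable of `Float32Array` chunks.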

packages/react-native-executorch/common/rnexecutorch/host_objects/ModelHostObject.h

Lines changed: 8 additions & 0 deletions
```diff
@@ -171,6 +171,14 @@ template <typename Model> class ModelHostObject : public JsiHostObject {
     addFunctions(JSI_EXPORT_FUNCTION(ModelHostObject<Model>,
                                      promiseHostFunction<&Model::stream>,
                                      "stream"));
+    addFunctions(JSI_EXPORT_FUNCTION(
+        ModelHostObject<Model>,
+        promiseHostFunction<&Model::generateFromPhonemes>,
+        "generateFromPhonemes"));
+    addFunctions(JSI_EXPORT_FUNCTION(
+        ModelHostObject<Model>,
+        promiseHostFunction<&Model::streamFromPhonemes>,
+        "streamFromPhonemes"));
   }
 
   if constexpr (meta::HasGenerateFromString<Model>) {
```

packages/react-native-executorch/common/rnexecutorch/models/text_to_speech/kokoro/Kokoro.cpp

Lines changed: 61 additions & 40 deletions
```diff
@@ -4,6 +4,7 @@
 
 #include <algorithm>
 #include <fstream>
+#include <phonemis/utilities/string_utils.h>
 #include <rnexecutorch/Error.h>
 #include <rnexecutorch/data_processing/Sequential.h>
 
@@ -73,16 +74,9 @@ void Kokoro::loadVoice(const std::string &voiceSource) {
   }
 }
 
-std::vector<float> Kokoro::generate(std::string text, float speed) {
-  if (text.size() > params::kMaxTextSize) {
-    throw RnExecutorchError(RnExecutorchErrorCode::InvalidUserInput,
-                            "Kokoro: maximum input text size exceeded");
-  }
-
-  // G2P (Grapheme to Phoneme) conversion
-  auto phonemes = phonemizer_.process(text);
-
-  // Divide the phonemes string intro substrings.
+std::vector<float>
+Kokoro::generateFromPhonemesImpl(const std::u32string &phonemes, float speed) {
+  // Divide the phonemes string into substrings.
   // Affects the further calculations only in case of string size
   // exceeding the biggest model's input.
   auto subsentences =
@@ -98,26 +92,20 @@ std::vector<float> Kokoro::generate(std::string text, float speed) {
     size_t pauseMs = params::kPauseValues.contains(lastPhoneme)
                          ? params::kPauseValues.at(lastPhoneme)
                          : params::kDefaultPause;
-    std::vector<float> pause(pauseMs * constants::kSamplesPerMilisecond, 0.F);
 
-    // Add audio part and pause to the main audio vector
+    // Add audio part and silence pause to the main audio vector
    audio.insert(audio.end(), std::make_move_iterator(audioPart.begin()),
                 std::make_move_iterator(audioPart.end()));
-    audio.insert(audio.end(), std::make_move_iterator(pause.begin()),
-                 std::make_move_iterator(pause.end()));
+    audio.resize(audio.size() + pauseMs * constants::kSamplesPerMilisecond,
+                 0.F);
   }
 
   return audio;
 }
 
-void Kokoro::stream(std::string text, float speed,
-                    std::shared_ptr<jsi::Function> callback) {
-  if (text.size() > params::kMaxTextSize) {
-    throw RnExecutorchError(RnExecutorchErrorCode::InvalidUserInput,
-                            "Kokoro: maximum input text size exceeded");
-  }
-
-  // Build a full callback function
+void Kokoro::streamFromPhonemesImpl(
+    const std::u32string &phonemes, float speed,
+    std::shared_ptr<jsi::Function> callback) {
   auto nativeCallback = [this, callback](const std::vector<float> &audioVec) {
     if (this->isStreaming_) {
       this->callInvoker_->invokeAsync([callback, audioVec](jsi::Runtime &rt) {
@@ -127,21 +115,12 @@ void Kokoro::stream(std::string text, float speed,
     }
   };
 
-  // Mark the beginning of the streaming process
   isStreaming_ = true;
 
-  // G2P (Grapheme to Phoneme) conversion
-  auto phonemes = phonemizer_.process(text);
-
-  // Divide the phonemes string intro substrings.
-  // Use specialized implementation to minimize the latency between the
-  // sentences.
+  // Use LATENCY strategy to minimize the time-to-first-audio for streaming
   auto subsentences =
       partitioner_.divide<Partitioner::Strategy::LATENCY>(phonemes);
 
-  // We follow the implementation of generate() method, but
-  // instead of accumulating results in a vector, we push them
-  // back to the JS side with the callback.
   for (size_t i = 0; i < subsentences.size(); i++) {
     if (!isStreaming_) {
       break;
@@ -151,7 +130,7 @@ void Kokoro::stream(std::string text, float speed,
 
     // Determine the silent padding duration to be stripped from the edges of
     // the generated audio. If a chunk ends with a space or follows one that
-    // did, it indicates a word boundary split – we use a shorter padding (20ms)
+    // did, it indicates a word boundary split – we use a shorter padding
     // to ensure natural speech flow. Otherwise, we use 50ms for standard
     // pauses.
     bool endsWithSpace = (subsentence.back() == U' ');
@@ -161,25 +140,67 @@ void Kokoro::stream(std::string text, float speed,
     // Generate an audio vector with the Kokoro model
     auto audioPart = synthesize(subsentence, speed, paddingMs);
 
-    // Calculate a pause between the sentences
+    // Calculate and append a pause between the sentences
     char32_t lastPhoneme = subsentence.back();
     size_t pauseMs = params::kPauseValues.contains(lastPhoneme)
                          ? params::kPauseValues.at(lastPhoneme)
                          : params::kDefaultPause;
-    std::vector<float> pause(pauseMs * constants::kSamplesPerMilisecond, 0.F);
-
-    // Add pause to the audio vector
-    audioPart.insert(audioPart.end(), std::make_move_iterator(pause.begin()),
-                     std::make_move_iterator(pause.end()));
+    audioPart.resize(
+        audioPart.size() + pauseMs * constants::kSamplesPerMilisecond, 0.F);
 
     // Push the audio right away to the JS side
     nativeCallback(audioPart);
   }
 
-  // Mark the end of the streaming process
   isStreaming_ = false;
 }
 
+std::vector<float> Kokoro::generate(std::string text, float speed) {
+  if (text.size() > params::kMaxTextSize) {
+    throw RnExecutorchError(RnExecutorchErrorCode::InvalidUserInput,
+                            "Kokoro: maximum input text size exceeded");
+  }
+
+  // G2P (Grapheme to Phoneme) conversion
+  auto phonemes = phonemizer_.process(text);
+
+  return generateFromPhonemesImpl(phonemes, speed);
+}
+
+std::vector<float> Kokoro::generateFromPhonemes(std::string phonemes,
+                                                float speed) {
+  if (phonemes.empty()) {
+    throw RnExecutorchError(RnExecutorchErrorCode::InvalidUserInput,
+                            "Kokoro: phoneme string must not be empty");
+  }
+  return generateFromPhonemesImpl(
+      phonemis::utilities::string_utils::utf8_to_u32string(phonemes), speed);
+}
+
+void Kokoro::stream(std::string text, float speed,
+                    std::shared_ptr<jsi::Function> callback) {
+  if (text.size() > params::kMaxTextSize) {
+    throw RnExecutorchError(RnExecutorchErrorCode::InvalidUserInput,
+                            "Kokoro: maximum input text size exceeded");
+  }
+
+  // G2P (Grapheme to Phoneme) conversion
+  auto phonemes = phonemizer_.process(text);
+
+  streamFromPhonemesImpl(phonemes, speed, callback);
+}
+
+void Kokoro::streamFromPhonemes(std::string phonemes, float speed,
+                                std::shared_ptr<jsi::Function> callback) {
+  if (phonemes.empty()) {
+    throw RnExecutorchError(RnExecutorchErrorCode::InvalidUserInput,
+                            "Kokoro: phoneme string must not be empty");
+  }
+  streamFromPhonemesImpl(
+      phonemis::utilities::string_utils::utf8_to_u32string(phonemes), speed,
+      callback);
+}
+
 void Kokoro::streamStop() noexcept { isStreaming_ = false; }
 
 std::vector<float> Kokoro::synthesize(const std::u32string &phonemes,
```
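One detail of the refactor above: the old code built a separate zero-filled `pause` vector and move-inserted it, while the new code resizes the audio buffer in place, appending silence samples directly. A TypeScript sketch of the same arithmetic; the 24 samples-per-millisecond figure is an assumption based on Kokoro's 24 kHz output rate, and the real value lives in `constants::kSamplesPerMilisecond` in the C++ sources:

```typescript
const SAMPLES_PER_MS = 24; // assumption for illustration (24 kHz / 1000)

// Equivalent of `audio.resize(audio.size() + pauseMs * samplesPerMs, 0.F)`:
// extend the waveform with `pauseMs` milliseconds of silence.
function appendSilence(audio: number[], pauseMs: number): number[] {
  return audio.concat(new Array(pauseMs * SAMPLES_PER_MS).fill(0));
}
```

Under this assumption a 10 ms pause appends 240 zero samples, so `appendSilence([0.5], 10)` yields a 241-sample buffer.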

packages/react-native-executorch/common/rnexecutorch/models/text_to_speech/kokoro/Kokoro.h

Lines changed: 17 additions & 0 deletions
```diff
@@ -27,11 +27,22 @@ class Kokoro {
   // Processes the entire text at once, before sending back to the JS side.
   std::vector<float> generate(std::string text, float speed = 1.F);
 
+  // Accepts pre-computed phonemes (as a UTF-8 IPA string) and synthesizes
+  // audio, bypassing the built-in phonemizer. This allows callers to use
+  // an external G2P system (e.g. the Python `phonemizer` library, espeak-ng,
+  // or any custom phonemizer).
+  std::vector<float> generateFromPhonemes(std::string phonemes,
+                                          float speed = 1.F);
+
   // Processes text in chunks, sending each chunk individually to the JS side
   // with asynchronous callbacks.
   void stream(std::string text, float speed,
               std::shared_ptr<jsi::Function> callback);
 
+  // Streaming variant that accepts pre-computed phonemes instead of text.
+  void streamFromPhonemes(std::string phonemes, float speed,
+                          std::shared_ptr<jsi::Function> callback);
+
   // Stops the streaming process
   void streamStop() noexcept;
 
@@ -42,6 +53,12 @@ class Kokoro {
   // Helper function - loading voice array
   void loadVoice(const std::string &voiceSource);
 
+  // Helper function - shared synthesis pipeline (partition + synthesize)
+  std::vector<float> generateFromPhonemesImpl(const std::u32string &phonemes,
+                                              float speed);
+  void streamFromPhonemesImpl(const std::u32string &phonemes, float speed,
+                              std::shared_ptr<jsi::Function> callback);
+
   // Helper function - generate specialization for given input size
   std::vector<float> synthesize(const std::u32string &phonemes, float speed,
                                 size_t paddingMs = 50);
```
