Commit e34d1a3

fix(kokoro): voice loading, method selection, padding, and audio safety fixes (#943)
Fixes for Kokoro TTS native code. Addresses voice data truncation, missing Synthesizer method selection, progressive speed-up on longer inputs, phoneme token reordering, and several additional safety fixes.

### Voice loading reads only 128 of 510 rows

`voice_` was a fixed `std::array<..., kMaxInputTokens>` (128 elements), but `hexgrad/Kokoro-82M` voice files contain 510 rows. The remaining 382 rows were silently dropped. Changed `voice_` to `std::vector`, sized dynamically from the file.

Also fixed an OOB in `voiceID`: upstream used `std::min(phonemes.size() - 1, noTokens)` where `noTokens` could equal 128, indexing past the end of a 128-element array. Now uses a three-way `std::min({phonemes.size() - 1, noTokens - 1, voice_.size() - 1})`.

### Synthesizer doesn't do method selection

`DurationPredictor` discovers and selects from `forward_8`/`forward_32`/`forward_64`/`forward_128` based on input size, but `Synthesizer` only knew about `forward`. Added the same discovery and selection logic. Falls back to `"forward"` if no `forward_N` methods exist, so older models still work.

### Audio progressively speeds up on longer inputs

The Synthesizer's attention mechanism drifts on longer input sequences (60+ tokens), causing later phonemes to be spoken progressively faster than the Duration Predictor intended. The DP's timing predictions are correct, but the Synthesizer compresses later phonemes into fewer samples.

Fixed by capping `inputTokensLimit` to 60, which forces the Partitioner to split text into shorter chunks that the Synthesizer can render faithfully. Each chunk is roughly one sentence (~15-20 words).

### `tokenize()` scrambles phoneme order on invalid tokens

`std::partition` was used to filter out invalid (unrecognized) phoneme tokens, but `partition` does not preserve relative order. When any phonemes fall outside the vocabulary, the remaining valid tokens could be reordered, producing garbled audio.

Changed to `std::stable_partition`, which preserves relative order.

### `stripAudio` unsigned integer underflow

`lbound - margin` wraps `size_t` to ~2^64 when the audio's first non-silent sample is near the start of the buffer (i.e., `lbound < margin`). `std::max(huge_value, 0)` returns the huge value, and `audio.subspan()` reads out-of-bounds. This is especially triggered in streaming mode where `paddingMs=15` (margin = 360 samples) on short chunks.

Fixed by guarding the subtraction: `lbound > margin ? lbound - margin : 0`. Also guarded `audio.size() - 1` against empty spans.

### `isStreaming_` data race

`isStreaming_` is a plain `bool` read by `stream()` on a background thread and written by `streamStop()` from the JS thread. Non-atomic access is undefined behavior: the compiler may optimize away the read, making `streamStop()` ineffective. Changed to `std::atomic<bool>`.

### `scaleDurations` drops phonemes

When aggressively shrinking durations (many tokens, few total ticks), individual token durations can be driven to 0 by the correction loop. A zero-duration token is skipped by `repeatInterleave`, effectively deleting that phoneme from the output.

Fixed by clamping each scaled duration to a minimum of 1, and guarding the correction loop so it never drives a duration below 1. Without the correction loop guard, the clamp is immediately undone: the priority queue picks clamped entries (they have high remainders) and subtracts 1, defeating the purpose.

### Misc perf

- Replace temporary pause zero-vectors with `resize()` directly on the output
- Move-capture audio in the streaming callback instead of copying

## Changes

- `Kokoro.h`: `voice_` from fixed array to vector, `isStreaming_` to `std::atomic<bool>`
- `Kokoro.cpp`: `loadVoice()`, `synthesize()`, `generate()`, `stream()`, constructor token limit cap
- `DurationPredictor.cpp`: `scaleDurations()` min-1 clamp with correction loop guard
- `Synthesizer.h`: `forwardMethods_` member
- `Synthesizer.cpp`: method discovery and selection
- `Utils.cpp`: `stable_partition` in `tokenize()`, `stripAudio` underflow guard
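The `stripAudio` underflow described in the commit message can be reproduced without any audio code. The following is a minimal sketch, not the project's implementation: `cropWindow` is a hypothetical helper that works on plain indices instead of a `std::span`, and it assumes `lbound` and `rbound` are sample indices found by some earlier silence scan.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>

// Hypothetical helper (not part of the project): given the first and last
// non-silent sample indices, widen the window by `margin` samples on both
// sides and return {offset, count} clamped to a buffer of `size` samples.
std::pair<std::size_t, std::size_t> cropWindow(std::size_t size,
                                               std::size_t lbound,
                                               std::size_t rbound,
                                               std::size_t margin) {
  if (size == 0) {
    return {0, 0}; // size - 1 would wrap around on an empty buffer
  }
  const std::size_t last = size - 1;
  rbound = std::min(rbound, last);

  // Unsigned subtraction wraps instead of going negative, so compare first.
  const std::size_t lo = lbound > margin ? lbound - margin : 0;
  // Same idea on the right side: check the remaining room before adding.
  const std::size_t hi = (last - rbound > margin) ? rbound + margin : last;

  if (hi < lo) {
    return {0, 0};
  }
  return {lo, hi - lo + 1};
}
```

With the 360-sample margin mentioned above and a first non-silent sample at index 100, the unguarded expression `100 - 360` evaluates to a value near 2^64, while the guarded form yields an offset of 0.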
1 parent cc11c3e commit e34d1a3

6 files changed: 78 additions, 58 deletions
packages/react-native-executorch/common/rnexecutorch/models/text_to_speech/kokoro/DurationPredictor.cpp

Lines changed: 2 additions & 2 deletions
@@ -175,7 +175,7 @@ void DurationPredictor::scaleDurations(Tensor &durations, size_t nTokens,
         shrinking ? std::ceil(scaled) - scaled : scaled - std::floor(scaled);

     durationsPtr[i] = static_cast<int64_t>(shrinking ? std::ceil(scaled)
-                                           : std::floor(scaled));
+                                           : std::floor(scaled));
     scaledSum += durationsPtr[i];

     // Keeps the entries sorted by the remainders
@@ -193,4 +193,4 @@ void DurationPredictor::scaleDurations(Tensor &durations, size_t nTokens,
     }
   }

-} // namespace rnexecutorch::models::text_to_speech::kokoro
+} // namespace rnexecutorch::models::text_to_speech::kokoro
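The commit message describes a minimum-of-1 clamp plus a guarded correction loop in `scaleDurations()`. The sketch below illustrates that idea only. It is a simplified stand-in: `scaleWithFloorOfOne` is a hypothetical name, and it corrects the sum by adjusting the largest or smallest entry rather than using the remainder-ordered priority queue the commit message mentions.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper (not the project's scaleDurations): rescale integer
// durations so they sum to `target` where possible, never letting an entry
// drop below 1 so that no token disappears from the output.
std::vector<int64_t> scaleWithFloorOfOne(const std::vector<int64_t> &durations,
                                         int64_t target) {
  std::vector<int64_t> out(durations.size(), 1);
  const int64_t total =
      std::accumulate(durations.begin(), durations.end(), int64_t{0});
  if (durations.empty() || total <= 0) {
    return out;
  }

  const double factor =
      static_cast<double>(target) / static_cast<double>(total);
  int64_t sum = 0;
  for (std::size_t i = 0; i < durations.size(); ++i) {
    const auto scaled = static_cast<int64_t>(
        std::llround(static_cast<double>(durations[i]) * factor));
    out[i] = std::max<int64_t>(int64_t{1}, scaled); // clamp to at least 1
    sum += out[i];
  }

  // Correction loop with the guard: stop instead of pushing an entry to 0.
  while (sum > target) {
    auto it = std::max_element(out.begin(), out.end());
    if (*it <= 1) {
      break; // every entry is already 1; shrinking further deletes a token
    }
    --(*it);
    --sum;
  }
  while (sum < target) {
    ++(*std::min_element(out.begin(), out.end()));
    ++sum;
  }
  return out;
}
```

When the target is smaller than the number of tokens, the result intentionally overshoots the target: keeping every phoneme takes priority over matching the total exactly.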

packages/react-native-executorch/common/rnexecutorch/models/text_to_speech/kokoro/Kokoro.cpp

Lines changed: 26 additions & 26 deletions
@@ -39,38 +39,39 @@ Kokoro::Kokoro(const std::string &lang, const std::string &taggerDataSource,
 }

 void Kokoro::loadVoice(const std::string &voiceSource) {
-  constexpr size_t rows = static_cast<size_t>(constants::kMaxInputTokens);
-  constexpr size_t cols = static_cast<size_t>(constants::kVoiceRefSize); // 256
-  const size_t expectedCount = rows * cols;
-  const std::streamsize expectedBytes =
-      static_cast<std::streamsize>(expectedCount * sizeof(float));
+  constexpr size_t cols = static_cast<size_t>(constants::kVoiceRefSize);
+  constexpr size_t bytesPerRow = cols * sizeof(float);

   std::ifstream in(voiceSource, std::ios::binary);
   if (!in) {
     throw RnExecutorchError(RnExecutorchErrorCode::FileReadFailed,
-                            "[Kokoro::loadSingleVoice]: cannot open file: " +
+                            "[Kokoro::loadVoice]: cannot open file: " +
                                 voiceSource);
   }

-  // Check the file size
+  // Determine number of rows from file size
   in.seekg(0, std::ios::end);
-  const std::streamsize fileSize = in.tellg();
+  const auto fileSize = static_cast<size_t>(in.tellg());
   in.seekg(0, std::ios::beg);
-  if (fileSize < expectedBytes) {
+
+  if (fileSize < bytesPerRow) {
     throw RnExecutorchError(
         RnExecutorchErrorCode::FileReadFailed,
-        "[Kokoro::loadSingleVoice]: file too small: expected at least " +
-            std::to_string(expectedBytes) + " bytes, got " +
+        "[Kokoro::loadVoice]: file too small: need at least " +
+            std::to_string(bytesPerRow) + " bytes for one row, got " +
             std::to_string(fileSize));
   }

-  // Read [rows, 1, cols] as contiguous floats directly into voice_
-  // ([rows][cols])
-  if (!in.read(reinterpret_cast<char *>(voice_.data()->data()),
-               expectedBytes)) {
+  const size_t rows = fileSize / bytesPerRow;
+  const auto readBytes = static_cast<std::streamsize>(rows * bytesPerRow);
+
+  // Resize voice vector to hold all rows from the file
+  voice_.resize(rows);
+
+  if (!in.read(reinterpret_cast<char *>(voice_.data()->data()), readBytes)) {
     throw RnExecutorchError(
         RnExecutorchErrorCode::FileReadFailed,
-        "[Kokoro::loadSingleVoice]: failed to read voice weights");
+        "[Kokoro::loadVoice]: failed to read voice weights");
   }
 }

@@ -92,7 +93,6 @@ Kokoro::generateFromPhonemesImpl(const std::u32string &phonemes, float speed) {
     size_t pauseMs = params::kPauseValues.contains(lastPhoneme)
                          ? params::kPauseValues.at(lastPhoneme)
                          : params::kDefaultPause;
-
     // Add audio part and silence pause to the main audio vector
     audio.insert(audio.end(), std::make_move_iterator(audioPart.begin()),
                  std::make_move_iterator(audioPart.end()));
@@ -108,10 +108,11 @@ void Kokoro::streamFromPhonemesImpl(
     std::shared_ptr<jsi::Function> callback) {
   auto nativeCallback = [this, callback](const std::vector<float> &audioVec) {
     if (this->isStreaming_) {
-      this->callInvoker_->invokeAsync([callback, audioVec](jsi::Runtime &rt) {
-        callback->call(rt,
-                       rnexecutorch::jsi_conversion::getJsiValue(audioVec, rt));
-      });
+      this->callInvoker_->invokeAsync(
+          [callback, audioVec = std::move(audioVec)](jsi::Runtime &rt) {
+            callback->call(
+                rt, rnexecutorch::jsi_conversion::getJsiValue(audioVec, rt));
+          });
     }
   };

@@ -149,7 +150,7 @@ void Kokoro::streamFromPhonemesImpl(
         audioPart.size() + pauseMs * constants::kSamplesPerMilisecond, 0.F);

     // Push the audio right away to the JS side
-    nativeCallback(audioPart);
+    nativeCallback(std::move(audioPart));
   }

   isStreaming_ = false;
@@ -219,7 +220,8 @@ std::vector<float> Kokoro::synthesize(const std::u32string &phonemes,
   const auto tokens = utils::tokenize(phonemes, {noTokens});

   // Select the appropriate voice vector
-  size_t voiceID = std::min(phonemes.size() - 1, noTokens);
+  size_t voiceID = std::min({phonemes.size() - 1, noTokens - 1,
+                             voice_.size() - 1});
   auto &voice = voice_[voiceID];

   // Initialize text mask
@@ -254,9 +256,7 @@ std::vector<float> Kokoro::synthesize(const std::u32string &phonemes,
   auto croppedAudio =
       utils::stripAudio(audio, paddingMs * constants::kSamplesPerMilisecond);

-  std::vector<float> result(croppedAudio.begin(), croppedAudio.end());
-
-  return result;
+  return {croppedAudio.begin(), croppedAudio.end()};
 }

 std::size_t Kokoro::getMemoryLowerBound() const noexcept {

packages/react-native-executorch/common/rnexecutorch/models/text_to_speech/kokoro/Kokoro.h

Lines changed: 6 additions & 8 deletions
@@ -1,6 +1,7 @@
 #pragma once

 #include <array>
+#include <atomic>
 #include <memory>
 #include <optional>
 #include <string>
@@ -75,19 +76,16 @@ class Kokoro {
   DurationPredictor durationPredictor_;
   Synthesizer synthesizer_;

-  // Voice array
-  // There is a separate voice vector for each of the possible numbers of input
-  // tokens.
-  std::array<std::array<float, constants::kVoiceRefSize>,
-             constants::kMaxInputTokens>
-      voice_;
+  // Voice array — dynamically sized to match the voice file.
+  // Each row is a style vector for a given input token count.
+  std::vector<std::array<float, constants::kVoiceRefSize>> voice_;

   // Extra control variables
-  bool isStreaming_ = false;
+  std::atomic<bool> isStreaming_{false};
 };
 } // namespace models::text_to_speech::kokoro

 REGISTER_CONSTRUCTOR(models::text_to_speech::kokoro::Kokoro, std::string,
                      std::string, std::string, std::string, std::string,
                      std::string, std::shared_ptr<react::CallInvoker>);
-} // namespace rnexecutorch
+} // namespace rnexecutorch
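The switch to `std::atomic<bool>` makes the stop flag safe to write from one thread while another thread polls it. A minimal sketch of that pattern follows. `ChunkStreamer` is a hypothetical class, not the project's `Kokoro`, and it uses the default sequentially consistent loads and stores for simplicity.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>

// Hypothetical class illustrating a cross-thread stop flag.
class ChunkStreamer {
public:
  // Intended to run on a worker thread: emits chunk indices until all
  // chunks are sent or stop() has been called. Returns the number emitted.
  std::size_t run(std::size_t chunks,
                  const std::function<void(std::size_t)> &emit) {
    streaming_.store(true);
    std::size_t sent = 0;
    for (std::size_t i = 0; i < chunks; ++i) {
      // The atomic load cannot be hoisted out of the loop by the compiler,
      // so a stop() from another thread is observed on the next iteration.
      if (!streaming_.load()) {
        break;
      }
      emit(i);
      ++sent;
    }
    streaming_.store(false);
    return sent;
  }

  // Safe to call from any thread, including from inside the emit callback.
  void stop() { streaming_.store(false); }

private:
  std::atomic<bool> streaming_{false};
};
```

With a plain `bool`, the concurrent read in `run()` and write in `stop()` would be a data race, which is the undefined behavior the commit message refers to.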

packages/react-native-executorch/common/rnexecutorch/models/text_to_speech/kokoro/Synthesizer.cpp

Lines changed: 38 additions & 18 deletions
@@ -13,18 +13,34 @@ Synthesizer::Synthesizer(const std::string &modelSource,
                          const Context &modelContext,
                          std::shared_ptr<react::CallInvoker> callInvoker)
     : BaseModel(modelSource, callInvoker), context_(modelContext) {
-  const auto inputTensors = getAllInputShapes("forward");
+  // Discover all forward methods (forward, forward_8, forward_32, etc.)
+  auto availableMethods = module_->method_names();
+  if (availableMethods.ok()) {
+    const auto &names = *availableMethods;
+    for (const auto &name : names) {
+      if (name.rfind("forward", 0) == 0) {
+        const auto inputTensors = getAllInputShapes(name);
+        CHECK_SIZE(inputTensors, 5);
+        CHECK_SIZE(inputTensors[0], 2);
+        CHECK_SIZE(inputTensors[1], 2);
+        CHECK_SIZE(inputTensors[2], 1);
+        size_t inputSize = inputTensors[0][1];
+        forwardMethods_.emplace_back(name, inputSize);
+      }
+    }
+    std::stable_sort(forwardMethods_.begin(), forwardMethods_.end(),
+                     [](const auto &a, const auto &b) { return a.second < b.second; });
+  }

-  // Perform checks to validate model's compatibility with native code
-  CHECK_SIZE(inputTensors, 5);
-  CHECK_SIZE(
-      inputTensors[0],
-      2); // input tokens must be of shape {1, T}, where T is number of tokens
-  CHECK_SIZE(
-      inputTensors[1],
-      2); // text mask must be of shape {1, T}, where T is number of tokens
-  CHECK_SIZE(inputTensors[2],
-             1); // indices must be of shape {D}, where D is a maximum duration
+  // Fallback: if no methods discovered, validate "forward" directly
+  if (forwardMethods_.empty()) {
+    const auto inputTensors = getAllInputShapes("forward");
+    CHECK_SIZE(inputTensors, 5);
+    CHECK_SIZE(inputTensors[0], 2);
+    CHECK_SIZE(inputTensors[1], 2);
+    CHECK_SIZE(inputTensors[2], 1);
+    forwardMethods_.emplace_back("forward", inputTensors[0][1]);
+  }
 }

 Result<std::vector<EValue>> Synthesizer::generate(std::span<const Token> tokens,
@@ -54,14 +70,19 @@ Result<std::vector<EValue>> Synthesizer::generate(std::span<const Token> tokens,
   auto voiceRefTensor = make_tensor_ptr({1, constants::kVoiceRefSize},
                                         ref_s.data(), ScalarType::Float);

-  // Execute the appropriate "forward_xyz" method, based on given method name
-  auto results = forward(
+  // Select appropriate forward method based on token count
+  auto it = std::ranges::find_if(forwardMethods_,
+      [noTokens](const auto &entry) { return static_cast<int32_t>(entry.second) >= noTokens; });
+  std::string selectedMethod = (it != forwardMethods_.end()) ? it->first : forwardMethods_.back().first;
+
+  // Execute the selected forward method
+  auto results = execute(selectedMethod,
       {tokensTensor, textMaskTensor, indicesTensor, durTensor, voiceRefTensor});

   if (!results.ok()) {
     throw RnExecutorchError(
         RnExecutorchErrorCode::InvalidModelOutput,
-        "[Kokoro::Synthesizer] Failed to execute method forward"
+        "[Kokoro::Synthesizer] Failed to execute method " + selectedMethod +
         ", error: " +
             std::to_string(static_cast<uint32_t>(results.error())));
   }
@@ -72,13 +93,12 @@ Result<std::vector<EValue>> Synthesizer::generate(std::span<const Token> tokens,
 }

 size_t Synthesizer::getTokensLimit() const {
-  // Returns tokens input (shape {1, T}) second dim
-  return getInputShape("forward", 0)[1];
+  return forwardMethods_.empty() ? 0 : forwardMethods_.back().second;
 }

 size_t Synthesizer::getDurationLimit() const {
-  // Returns indices vector first dim (shape {D})
-  return getInputShape("forward", 2)[0];
+  if (forwardMethods_.empty()) return 0;
+  return getInputShape(forwardMethods_.back().first, 2)[0];
 }

 } // namespace rnexecutorch::models::text_to_speech::kokoro
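The selection rule in `generate()` picks the smallest discovered method whose input size fits the token count, and otherwise falls back to the largest one. A minimal standalone sketch of that rule is shown below. `selectMethod` and `MethodEntry` are hypothetical names, the list is assumed to be sorted by capacity as the constructor does with `std::stable_sort`, and the sketch uses `std::find_if` rather than `std::ranges::find_if` so that it also compiles as C++17.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// {method name, maximum number of input tokens it accepts}
using MethodEntry = std::pair<std::string, std::size_t>;

// Hypothetical helper: `sortedMethods` must be ordered by ascending capacity.
// Returns the first method that can hold `noTokens`, or the largest method
// if none is big enough. An empty list falls back to plain "forward".
std::string selectMethod(const std::vector<MethodEntry> &sortedMethods,
                         std::size_t noTokens) {
  if (sortedMethods.empty()) {
    return "forward";
  }
  const auto it = std::find_if(
      sortedMethods.begin(), sortedMethods.end(),
      [noTokens](const MethodEntry &entry) { return entry.second >= noTokens; });
  return it != sortedMethods.end() ? it->first : sortedMethods.back().first;
}
```

For the method names listed in the commit message, a 20-token input would map to `forward_32`, and an input longer than 128 tokens would fall back to `forward_128`; the caller is still responsible for keeping inputs within that limit.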

packages/react-native-executorch/common/rnexecutorch/models/text_to_speech/kokoro/Synthesizer.h

Lines changed: 2 additions & 0 deletions
@@ -50,6 +50,8 @@ class Synthesizer : public BaseModel {
   size_t getDurationLimit() const;

 private:
+  // Forward methods discovered at construction (e.g. forward_8, forward_64, forward_128)
+  std::vector<std::pair<std::string, size_t>> forwardMethods_;
   // Shared model context
   // A const reference to singleton in Kokoro.
   const Context &context_;

packages/react-native-executorch/common/rnexecutorch/models/text_to_speech/kokoro/Utils.cpp

Lines changed: 4 additions & 4 deletions
@@ -55,8 +55,8 @@ std::span<const float> stripAudio(std::span<const float> audio, size_t margin) {
   auto lbound = findAudioBound<false>(audio);
   auto rbound = findAudioBound<true>(audio);

-  lbound = std::max(lbound - margin, size_t(0));
-  rbound = std::min(rbound + margin, audio.size() - 1);
+  lbound = lbound > margin ? lbound - margin : 0;
+  rbound = std::min(rbound + margin, audio.size() > 0 ? audio.size() - 1 : 0);

   return audio.subspan(lbound, rbound >= lbound ? rbound - lbound + 1 : 0);
 }
@@ -85,7 +85,7 @@ std::vector<Token> tokenize(const std::u32string &phonemes,
                        ? constants::kVocab.at(p)
                        : constants::kInvalidToken;
                  });
-  auto validSeqEnd = std::partition(
+  auto validSeqEnd = std::stable_partition(
       tokens.begin() + 1, tokens.begin() + effNoTokens + 1,
       [](Token t) -> bool { return t != constants::kInvalidToken; });
   std::fill(validSeqEnd, tokens.begin() + effNoTokens + 1,
@@ -94,4 +94,4 @@ std::vector<Token> tokenize(const std::u32string &phonemes,
   return tokens;
 }

-} // namespace rnexecutorch::models::text_to_speech::kokoro::utils
+} // namespace rnexecutorch::models::text_to_speech::kokoro::utils
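The one-word change in `tokenize()` matters because `std::partition` only guarantees which side of the split each element ends up on, not the order within each side. A minimal sketch of the order-preserving variant follows. The `Token` alias, the `kInvalid` sentinel, and `compactValid` are assumptions made for illustration; the project uses its own `Token` type and `constants::kInvalidToken`.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

using Token = std::int64_t;    // assumed token type for this sketch
constexpr Token kInvalid = -1; // hypothetical sentinel for unknown phonemes

// Hypothetical helper: move valid tokens to the front in their original
// order, then overwrite the invalid tail with `pad`.
std::vector<Token> compactValid(std::vector<Token> tokens, Token pad) {
  const auto validEnd =
      std::stable_partition(tokens.begin(), tokens.end(),
                            [](Token t) { return t != kInvalid; });
  std::fill(validEnd, tokens.end(), pad);
  return tokens;
}
```

For the input `{5, -1, 7, 9, -1, 3}` with a pad of 0, the stable variant yields `{5, 7, 9, 3, 0, 0}`. With `std::partition` the four valid tokens would still come first, but the standard allows them to appear in any order.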
