Commit e7b1c1b

fix(kokoro): cap token limit to prevent speed-up, preserve phoneme order
The Synthesizer's attention drifts on longer sequences (60+ tokens), causing later phonemes to be spoken progressively faster. Cap inputTokensLimit to 60 so the Partitioner splits text into shorter chunks that stay faithful to the Duration Predictor's timing. Also switch tokenize()'s std::partition to std::stable_partition so phoneme token order is preserved when invalid tokens are filtered out.
1 parent ae59ea6 commit e7b1c1b

2 files changed, +9 -1 lines changed
  • packages/react-native-executorch/common/rnexecutorch/models/text_to_speech/kokoro

packages/react-native-executorch/common/rnexecutorch/models/text_to_speech/kokoro/Kokoro.cpp

Lines changed: 8 additions & 0 deletions
@@ -36,6 +36,14 @@ Kokoro::Kokoro(const std::string &lang, const std::string &taggerDataSource,

   context_.inputTokensLimit = durationPredictor_.getTokensLimit();
   context_.inputDurationLimit = synthesizer_.getDurationLimit();
+
+  // Cap effective token limit to prevent the Synthesizer's attention from
+  // drifting on longer sequences, which manifests as progressive speed-up
+  // in the generated audio. Shorter chunks keep timing faithful to the
+  // Duration Predictor's output.
+  static constexpr size_t kSafeTokensLimit = 60;
+  context_.inputTokensLimit =
+      std::min(context_.inputTokensLimit, kSafeTokensLimit);
 }

 void Kokoro::loadVoice(const std::string &voiceSource) {
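
For readers who want to see the capping logic in isolation, here is a minimal sketch of what the added lines do. The free function `effectiveTokensLimit` is hypothetical and not part of the commit; it only mirrors the `std::min` call, with the model-reported limit passed in as a plain argument instead of coming from `durationPredictor_.getTokensLimit()`.

```cpp
#include <algorithm>
#include <cstddef>

// Same value as the constant introduced by the commit.
constexpr std::size_t kSafeTokensLimit = 60;

// Hypothetical helper (not in the commit): returns the limit the Partitioner
// would see after capping. It never exceeds the safe cap and never exceeds
// what the model itself reports.
constexpr std::size_t effectiveTokensLimit(std::size_t modelLimit) {
  return std::min(modelLimit, kSafeTokensLimit);
}

// A model that reports 128 tokens is capped to 60; a model that reports
// fewer than 60 keeps its own, smaller limit.
static_assert(effectiveTokensLimit(128) == 60, "large limits are capped");
static_assert(effectiveTokensLimit(32) == 32, "small limits are unchanged");
```

Note that `std::min` deduces a single template type, so both arguments must have the same type. The commit declares `kSafeTokensLimit` as `size_t`, which suggests (but does not prove from this diff alone) that `context_.inputTokensLimit` is also a `size_t`.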

packages/react-native-executorch/common/rnexecutorch/models/text_to_speech/kokoro/Utils.cpp

Lines changed: 1 addition & 1 deletion
@@ -85,7 +85,7 @@ std::vector<Token> tokenize(const std::u32string &phonemes,
                 ? constants::kVocab.at(p)
                 : constants::kInvalidToken;
     });
-  auto validSeqEnd = std::partition(
+  auto validSeqEnd = std::stable_partition(
       tokens.begin() + 1, tokens.begin() + effNoTokens + 1,
       [](Token t) -> bool { return t != constants::kInvalidToken; });
   std::fill(validSeqEnd, tokens.begin() + effNoTokens + 1,
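
The one-word change matters because `std::partition` only guarantees that elements satisfying the predicate end up before those that do not; it makes no promise about their relative order. `std::stable_partition` does preserve relative order within each group. The sketch below illustrates the compaction pattern on a standalone vector. The `Token` alias, the `kInvalidToken` value, the `padToken` parameter, and the helper name `compactValidTokens` are all assumptions made for illustration, since the diff does not show the real definitions or the fill value.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

using Token = std::int64_t;          // assumption: the real Token type may differ
constexpr Token kInvalidToken = -1;  // assumption: placeholder sentinel value

// Hypothetical helper: moves valid tokens to the front while keeping their
// original relative order, overwrites the remaining tail with padToken, and
// returns the number of valid tokens.
std::size_t compactValidTokens(std::vector<Token> &tokens, Token padToken) {
  auto validEnd = std::stable_partition(
      tokens.begin(), tokens.end(),
      [](Token t) { return t != kInvalidToken; });
  std::fill(validEnd, tokens.end(), padToken);
  return static_cast<std::size_t>(validEnd - tokens.begin());
}
```

With an input such as `{5, -1, 3, -1, 9, 2}` and a pad value of `0`, the stable version yields `{5, 3, 9, 2, 0, 0}`. With plain `std::partition`, the four valid tokens would still come first, but they could appear in any order, which for phoneme tokens means a scrambled pronunciation.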
