Sample iOS app for hexgrad/Kokoro-82M — an open-weight, 82M-parameter TTS model built on a style-conditioned StyleTTS2 architecture (BERT + duration predictor + iSTFTNet vocoder), producing 24 kHz speech.
The first CoreML port with on-device bilingual (English + Japanese) free-text input — no MLX, no MeCab, no IPADic, no Python G2P at runtime.
Two kinds of CoreML models: a flexible-length Predictor (BERT + LSTM duration head + text encoder) and three fixed-shape Decoder buckets (128 / 256 / 512 frames). The Swift pipeline picks the smallest bucket that fits the predicted duration, zero-pads the input features, and trims the output audio; see the sketch after the table.
| Model | Size | Input | Output |
|---|---|---|---|
| Kokoro_Predictor.mlpackage.zip | 75 MB | input_ids [1, T≤256] (int32) + ref_s_style [1, 128] | duration [1, T] + d_for_align [1, 640, T] + t_en [1, 512, T] |
| Kokoro_Decoder_128.mlpackage.zip | 238 MB | en_aligned [1, 640, 128] + asr_aligned [1, 512, 128] + ref_s [1, 256] | audio [1, 76800] @ 24 kHz |
| Kokoro_Decoder_256.mlpackage.zip | 241 MB | en_aligned [1, 640, 256] + asr_aligned [1, 512, 256] + ref_s [1, 256] | audio [1, 153600] @ 24 kHz |
| Kokoro_Decoder_512.mlpackage.zip | 246 MB | en_aligned [1, 640, 512] + asr_aligned [1, 512, 512] + ref_s [1, 256] | audio [1, 307200] @ 24 kHz |
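A minimal sketch of that bucket logic, with hypothetical names (`pickBucket`, `padFeatures`, `trimAudio` are illustrative, not the app's actual API). Each decoder frame maps to 600 output samples (76 800 / 128 = 153 600 / 256 = 307 200 / 512 = 600):

```swift
import CoreML

let buckets = [128, 256, 512]   // decoder frame capacities
let samplesPerFrame = 600       // 76_800 / 128 at 24 kHz

/// Smallest decoder bucket that fits the predicted frame count.
func pickBucket(forFrames frames: Int) -> Int? {
    buckets.first { $0 >= frames }
}

/// Zero-pad a [1, C, T] feature tensor out to [1, C, bucket].
func padFeatures(_ src: MLMultiArray, channels: Int, frames: Int, bucket: Int) throws -> MLMultiArray {
    let dst = try MLMultiArray(
        shape: [1, NSNumber(value: channels), NSNumber(value: bucket)],
        dataType: .float32)
    for i in 0..<dst.count { dst[i] = 0 }   // explicit zero fill; init contents are undefined
    for c in 0..<channels {
        for t in 0..<frames {
            let idx: [NSNumber] = [0, NSNumber(value: c), NSNumber(value: t)]
            dst[idx] = src[idx]
        }
    }
    return dst
}

/// Trim the fixed-length decoder output back to the true duration.
func trimAudio(_ audio: [Float], predictedFrames: Int) -> [Float] {
    Array(audio.prefix(predictedFrames * samplesPerFrame))
}
```

For a 140-frame utterance this picks the 256 bucket, keeping the padding ratio at ~45% instead of ~73% with the 512 bucket.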
- Download the predictor and the three decoder buckets above
- Unzip and drag them into the Xcode project
- Build and run on a physical device (iOS 17+)
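After the drag-and-drop step, Xcode compiles each `.mlpackage` and generates a Swift class per model. A hedged sketch of loading them (the class names follow Xcode's convention for generated model classes and are assumptions here):

```swift
import CoreML

func loadModels() throws {
    let config = MLModelConfiguration()
    config.computeUnits = .all   // let CoreML schedule across ANE/GPU/CPU

    // Xcode derives these class names from the .mlpackage file names.
    let predictor  = try Kokoro_Predictor(configuration: config)
    let decoder128 = try Kokoro_Decoder_128(configuration: config)
    let decoder256 = try Kokoro_Decoder_256(configuration: config)
    let decoder512 = try Kokoro_Decoder_512(configuration: config)
}
```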
Voices ship with the project (510×256 style tensors, ~512 KB each; style lookup sketched after the list):
- English (5): `af_heart`, `af_bella`, `am_michael`, `bf_emma`, `bm_george`
- Japanese (5): `jf_alpha`, `jf_gongitsune`, `jm_kumo`, `jf_nezumi`, `jf_tebukuro`
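A hedged sketch of style lookup, assuming the packs are bundled as raw little-endian float32 and indexed the way upstream Kokoro indexes its voice tensors (row `tokenCount - 1` holds the style for an utterance of that phoneme length); the file name and extension are illustrative:

```swift
import Foundation

/// Load the 256-dim style row for a given phoneme count.
/// File name/format are assumptions for illustration.
func loadStyle(voice: String, tokenCount: Int) throws -> [Float] {
    guard let url = Bundle.main.url(forResource: voice, withExtension: "bin") else {
        throw CocoaError(.fileNoSuchFile)
    }
    let data = try Data(contentsOf: url)   // 510 * 256 floats ≈ 512 KB
    let floats = data.withUnsafeBytes { Array($0.bindMemory(to: Float.self)) }
    let row = min(max(tokenCount - 1, 0), 509)   // index by phoneme length
    return Array(floats[row * 256 ..< (row + 1) * 256])
}
```

Per the table above, the decoder consumes the full 256-dim row while the predictor's `ref_s_style` takes a 128-dim slice of it.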
- English — lexicon-based: `us_gold` (~90k entries) + `us_silver` fallback, with possessive splitting, acronym detection, and a rule-based grapheme→phoneme fallback for OOV words. ~6 MB of bundled JSON. No MLX, no Python.
- Japanese — Apple's `CFStringTokenizer` (`ja_JP` locale) emits romaji per token; the `Latin-Hiragana` ICU transform converts it to hiragana; a ported subset of misaki's `cutlet.HEPBURN` IPA table plus context rules (long vowels, ん assimilation, particles は→βa / へ→e) emits Kokoro-compatible phonemes. No MeCab, no IPADic, ~zero overhead. See the sketch below.
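The tokenizer and transform steps use only public Apple APIs. A minimal sketch of the romaji stage (the downstream IPA table and context rules are omitted):

```swift
import Foundation

/// Romaji per token via CFStringTokenizer; the rest of the pipeline is simplified away.
func romajiTokens(for text: String) -> [String] {
    let tokenizer = CFStringTokenizerCreate(
        kCFAllocatorDefault,
        text as CFString,
        CFRange(location: 0, length: text.utf16.count),
        kCFStringTokenizerUnitWordBoundary | kCFStringTokenizerAttributeLatinTranscription,
        Locale(identifier: "ja_JP") as CFLocale)

    var romaji: [String] = []
    while CFStringTokenizerAdvanceToNextToken(tokenizer) != [] {
        // The Latin-transcription attribute yields romaji for Japanese tokens.
        if let latin = CFStringTokenizerCopyCurrentTokenAttribute(
            tokenizer, kCFStringTokenizerAttributeLatinTranscription) as? String {
            romaji.append(latin)
        }
    }
    return romaji
}

// ICU transform back to hiragana, e.g. "nihongo" → "にほんご".
let hiragana = "nihongo".applyingTransform(.latinToHiragana, reverse: false)
```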
- `RangeDim` for the predictor — flexible phoneme length 1–256 (incl. BOS/EOS), so the bidirectional LSTM never sees padding tokens.
- Fixed-shape decoder buckets — the iSTFTNet vocoder + InstanceNorm + bidirectional LSTM stack is length-sensitive, so we ship three buckets (128 / 256 / 512). Padding within a single bucket only causes a phase shift (spectral correlation 0.93 vs the unpadded reference), which is perceptually inaudible. The smaller the padding ratio, the cleaner the output — hence multiple buckets.
- Critical bug: CoreML's `mod` op silently produces wrong values for `(float / scalar) % 1` (e.g. inside the SineGen of iSTFTNet). Output spectral correlation vs PyTorch drops from 0.996 to 0.67 with this op in the graph. Replacing `(f0 / sr) % 1` with `(f0 / sr) - floor(f0 / sr)` brings it back to 0.996. See `conversion_scripts/convert_kokoro.py` and the patched `kokoro/istftnet.py`.
- Bypass `pack_padded_sequence` / `pad_packed_sequence` in the predictor's `TextEncoder` and `DurationEncoder` LSTMs — coremltools can't trace them. The LSTM runs on the unpadded tensor directly via `RangeDim`.
- Decoder ships at FP32 (`compute_precision=ct.precision.FLOAT32`); FP16 corrupts audio quality.
- Deterministic noise — the random noise generators (`torch.randn_like` in `SineGen` and `SourceModuleHnNSF`) are zeroed before tracing so the CoreML model is bit-for-bit reproducible vs PyTorch.
- ANE-friendly decoder — first vocoder inference takes ~700 ms, subsequent inferences ~200 ms on iPhone (warm cache); a prewarming sketch follows.
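Given the ~700 ms first-inference cost, one way to hide it is a throwaway zero-input prediction at launch. A hedged sketch (not necessarily what the app does):

```swift
import CoreML

/// Run one dummy inference off the main thread so the first real
/// synthesis doesn't pay the CoreML/ANE warm-up cost.
func prewarm(_ model: MLModel) {
    DispatchQueue.global(qos: .utility).async {
        var inputs: [String: MLFeatureValue] = [:]
        for (name, desc) in model.modelDescription.inputDescriptionsByName {
            guard let constraint = desc.multiArrayConstraint,
                  let zeros = try? MLMultiArray(shape: constraint.shape, dataType: .float32)
            else { return }
            for i in 0..<zeros.count { zeros[i] = 0 }   // all-zero dummy input
            inputs[name] = MLFeatureValue(multiArray: zeros)
        }
        if let provider = try? MLDictionaryFeatureProvider(dictionary: inputs) {
            _ = try? model.prediction(from: provider)   // result is discarded
        }
    }
}
```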