Commit cbe4c3c

docs(providers): add Minimax voice provider page (#1029)
## Description

- Creates `fern/providers/voice/minimax.mdx` — a new voice provider page for Minimax. Minimax was previously absent from the Voices section in the docs nav.
- Registers the new page in `fern/docs.yml` under the existing `Voices (Text-to-speech)` section, alphabetically positioned after LMNT.
- Documents basic Minimax voice configuration (`provider`, `model`, `voiceId`).
- Documents the new `subtitleType` voice param (`"sentence"` default, `"word"` opt-in) and how it interacts with the [`assistant.speechStarted`](/server-url/events#assistant-speech-started) message.
- Honest limitations section explaining that Minimax word-progress events fire near the _end_ of each synthesis segment (because Minimax only attaches subtitle metadata to the final audio chunk per segment), not per-word in real time. Recommends ElevenLabs for true playback-cadence highlighting.
- Documents the other gotchas: `totalWords: 0` sentinel, `force-say` always emits text-only, barge-in cuts emission, CJK languages are word-counted per ideograph/kana/hangul.
- Stacked on [PR #1028](https://app.graphite.com/github/pr/VapiAI/docs/1028) which adds the `#assistant-speech-started` anchor this page links to.

## Testing Steps

- [x] Verified the page registers in `docs.yml` correctly (matches the structure of sibling voice provider pages like ElevenLabs and Cartesia)
- [x] Verified the cross-reference link `/server-url/events#assistant-speech-started` matches the anchor added in the parent PR
- [x] Reviewer: spot-check the Fern preview build to confirm the page renders and shows up in the Voices nav between LMNT and RimeAI
1 parent aecebff commit cbe4c3c

2 files changed

Lines changed: 71 additions & 0 deletions


fern/docs.yml

Lines changed: 2 additions & 0 deletions
```diff
@@ -606,6 +606,8 @@ navigation:
           path: providers/voice/cartesia.mdx
         - page: LMNT
           path: providers/voice/lmnt.mdx
+        - page: Minimax
+          path: providers/voice/minimax.mdx
         - page: RimeAI
           path: providers/voice/rimeai.mdx
         - page: Deepgram
```

fern/providers/voice/minimax.mdx

Lines changed: 69 additions & 0 deletions
---
title: Minimax
subtitle: Configure Minimax TTS voices and word-level subtitle timing
slug: providers/voice/minimax
---

Minimax provides streaming TTS over WebSocket with multi-language support including English, Chinese, Japanese, and Korean. Vapi connects to Minimax via the `speech-02-hd` and `speech-02-turbo` model families.

## Basic configuration

Set the voice on your assistant:

```json
{
  "voice": {
    "provider": "minimax",
    "model": "speech-02-hd",
    "voiceId": "Wise_Woman"
  }
}
```
## Subtitle timing for live captions (`subtitleType`)

Minimax can return subtitle data alongside synthesized audio, which Vapi forwards through the [`assistant.speechStarted`](/server-url/events#assistant-speech-started) client/server message. This is intended for live caption UIs and karaoke-style word highlighting.

| Value | Behavior |
| --- | --- |
| `"sentence"` *(default)* | No subtitle data. `assistant.speechStarted` fires as a text-only event tied to audio playback. |
| `"word"` | Per-word timestamps. `assistant.speechStarted` fires with `timing.type: "word-progress"`, including `wordsSpoken`, `totalWords`, the current `segment` text, `segmentDurationMs`, and a `words[]` array with `startMs`/`endMs` per word. |
```json
{
  "voice": {
    "provider": "minimax",
    "model": "speech-02-hd",
    "voiceId": "Wise_Woman",
    "subtitleType": "word"
  }
}
```

You also need to subscribe to the message itself by adding `"assistant.speechStarted"` to your assistant's `clientMessages` and/or `serverMessages` arrays.
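As a sketch, that subscription might look like the fragment below (assuming these arrays take the message name directly, as with other Vapi client/server messages; the rest of the assistant config is omitted):

```json
{
  "clientMessages": ["assistant.speechStarted"],
  "serverMessages": ["assistant.speechStarted"]
}
```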
### How the timing actually works (and what it can't do)

This is the most important part to understand before building on top of it.

Minimax synthesizes audio incrementally, but it only attaches subtitle metadata to the **final audio chunk of each synthesis segment**. Vapi streams every audio chunk to the call as soon as it arrives, but the `wordsSpoken` cursor only advances when that final chunk is reached. In practice, this means:

- You will receive **one `assistant.speechStarted` event per Minimax synthesis segment**, not one per word.
- That event fires **near the end of the segment's audio playback**, not at the start. The `wordsSpoken` value jumps forward in segment-sized increments rather than ticking word by word.
- The `timing.words[]` array in each event carries the per-word start/end timestamps for the segment that just finished. You can use it to animate that segment retroactively, or to extrapolate forward during the next segment — but you cannot use it to drive smooth real-time highlighting *in* the current segment.
- Per-word timestamps are relative to the segment's start, not the start of the call.
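Piecing together the fields named above, a word-progress event might look roughly like the following. This page does not show the full payload, so the envelope shape and all values here are illustrative assumptions, not the documented schema:

```json
{
  "type": "assistant.speechStarted",
  "text": "and that's how the timing works",
  "timing": {
    "type": "word-progress",
    "wordsSpoken": 6,
    "totalWords": 6,
    "segment": "and that's how the timing works",
    "segmentDurationMs": 1800,
    "words": [
      { "word": "and", "startMs": 0, "endMs": 140 },
      { "word": "that's", "startMs": 140, "endMs": 430 },
      { "word": "how", "startMs": 430, "endMs": 600 },
      { "word": "the", "startMs": 600, "endMs": 720 },
      { "word": "timing", "startMs": 720, "endMs": 1100 },
      { "word": "works", "startMs": 1100, "endMs": 1800 }
    ]
  }
}
```

Note that the `words[]` timestamps start at 0: they are relative to the segment, not the call.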
If your use case requires word-by-word highlighting at audio playback cadence with no interpolation, use ElevenLabs — its `word-alignment` timing arrives every 50–200ms with real per-word timestamps from the provider. Minimax word-progress is best suited to:

- Caption blocks that update once per spoken sentence/clause.
- "How far through the response are we" progress indicators.
- Post-hoc transcript annotation with word-level timing.
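The retroactive-animation idea can be sketched in a few lines. This is a hypothetical helper, not a Vapi SDK API: since the event arrives near the end of a segment's playback, it approximates the segment's start as the event arrival time minus `segmentDurationMs`, then shifts each segment-relative word timestamp to call-absolute time.

```python
def absolute_word_times(event_arrival_ms, timing):
    """Shift segment-relative word timings from a word-progress event
    (`timing` object) to call-absolute milliseconds.

    Approximation: the event fires near the end of the segment's audio,
    so the segment is assumed to have started `segmentDurationMs` ago.
    """
    segment_start_ms = event_arrival_ms - timing["segmentDurationMs"]
    return [
        {
            "word": w["word"],
            "startMs": segment_start_ms + w["startMs"],
            "endMs": segment_start_ms + w["endMs"],
        }
        for w in timing["words"]
    ]


# Example: an event arriving 5000ms into the call for an 1800ms segment.
timing = {
    "segmentDurationMs": 1800,
    "words": [
        {"word": "hello", "startMs": 0, "endMs": 300},
        {"word": "there", "startMs": 300, "endMs": 700},
    ],
}
print(absolute_word_times(5000, timing))
```

With these absolute times you can backfill a transcript annotation, or extrapolate a highlight cursor forward while waiting for the next segment's event.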
### Other behaviors to be aware of

- **`totalWords === 0` is a valid value** on the first event of a turn, before Minimax has confirmed the word count. Guard against divide-by-zero when computing progress fractions.
- **`force-say` events** (your `firstMessage`, queued `say` actions) are emitted as text-only events — no `timing` object — even when `subtitleType: "word"` is configured. This is because Minimax does not return subtitle metadata for these utterances.
- **On user barge-in**, no further events fire for the interrupted turn. The most recent `wordsSpoken` tells you how much of `text` was actually spoken before the interruption.
- **CJK languages** (Chinese, Japanese, Korean) are word-counted per ideograph/kana/hangul. A 30-character Japanese sentence reports `totalWords: 30`.
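The `totalWords` guard mentioned above is small but easy to forget. A minimal sketch (the function name is ours, not part of any SDK):

```python
def progress_fraction(words_spoken, total_words):
    """Fraction of the turn spoken so far, treating totalWords == 0 as
    "count not yet confirmed" rather than dividing by zero."""
    if total_words <= 0:  # sentinel: Minimax hasn't confirmed the count yet
        return 0.0
    return min(words_spoken / total_words, 1.0)


print(progress_fraction(12, 34))  # partial progress through the turn
print(progress_fraction(5, 0))    # sentinel case: report no progress yet
```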
For the full event schema and `timing` shapes across all voice providers, see [Server events → Assistant Speech Started](/server-url/events#assistant-speech-started).
