Commit aecebff

docs(server-events): document assistant.speechStarted message (#1028)
## Description

- Adds a new `### Assistant Speech Started` section to `server-url/events.mdx` documenting the opt-in `assistant.speechStarted` server/client message
- Documents the full payload (`text`, `turn`, `source`, optional `timing`) and the three `timing` shapes:
  - `word-alignment` (ElevenLabs) — per-word timestamps at playback cadence
  - `word-progress` (Minimax with `voice.subtitleType: "word"`) — cursor-based per-segment progress
  - absent — text-only fallback for all other providers
- Calls out the real limitations honestly so customers can choose the right provider for their use case:
  - Minimax events fire near the _end_ of each synthesis segment (subtitle data only attaches to the final audio chunk per segment), not per-word in real time
  - `force-say` events always emit text-only, even on ElevenLabs/Minimax
  - No companion `assistant.speechStopped` event — use `speech-update` (`status: "stopped"`) or watch `turn` increment
  - Barge-in stops emission for the interrupted turn — pair with `user-interrupted`
  - `totalWords: 0` is a valid sentinel; guard against divide-by-zero
- This is the canonical schema page for the event. Two follow-up PRs in this stack ([Minimax provider page](https://app.graphite.com/github/pr/VapiAI/docs/1029), [Web SDK live captions section](https://app.graphite.com/github/pr/VapiAI/docs/1030)) deep-link into the `#assistant-speech-started` anchor created here.

## Testing Steps

- [x] Verified MDX renders by inspecting the source — no broken code fences, all `<Warning>` and table syntax matches existing usage on the page
- [x] Anchor `#assistant-speech-started` will be auto-generated by Fern from the `### Assistant Speech Started` heading; cross-references in PR 2 and PR 3 use this anchor
- [x] Reviewer: spot-check the Fern preview build for this PR
1 parent 7dc5137 commit aecebff

1 file changed

Lines changed: 75 additions & 0 deletions

File tree

fern/server-url/events.mdx

@@ -287,6 +287,81 @@ For final-only events, you may receive `type: "transcript[transcriptType=\"final
}
```

### Assistant Speech Started

Sent as the assistant begins speaking each segment of a turn, synchronized to audio playback. Designed for live captions, karaoke-style word highlighting, and any UI that needs to track what's being spoken in real time.

This event is **opt-in**. Add `"assistant.speechStarted"` to your assistant's `serverMessages` and/or `clientMessages` to receive it.

```json
{
  "message": {
    "type": "assistant.speechStarted",
    "text": "Hello world, how can I help you today?",
    "turn": 2,
    "source": "model",
    "timing": {
      /* optional — shape depends on voice provider, see below */
    }
  }
}
```

| Field | Description |
|---|---|
| `text` | Full assistant text for the current turn. **Not a delta** — accumulates across events in the same turn. |
| `turn` | 0-indexed turn number. Multiple events within the same turn share the same `turn`. |
| `source` | `"model"` (LLM-generated), `"force-say"` (firstMessage / queued `say` actions), or `"custom-voice"`. |
| `timing` | Optional. Present when the voice provider supports word-level timing. Shape depends on `timing.type`. |
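
For reference, here is the payload modeled as TypeScript types (a sketch assembled from the table above and the `timing` examples below; not a published SDK type):

```typescript
// Sketch of the assistant.speechStarted payload, assembled from the docs
// on this page. The union shape of `timing` is an assumption.
interface AssistantSpeechStarted {
  type: "assistant.speechStarted";
  text: string; // full turn text so far, not a delta
  turn: number; // 0-indexed
  source: "model" | "force-say" | "custom-voice";
  timing?: WordAlignmentTiming | WordProgressTiming; // absent means text-only
}

interface WordAlignmentTiming {
  type: "word-alignment"; // ElevenLabs
  words: string[]; // includes space entries with real timing
  wordsStartTimesMs: number[];
  wordsEndTimesMs: number[];
}

interface WordProgressTiming {
  type: "word-progress"; // Minimax with voice.subtitleType: "word"
  wordsSpoken: number;
  totalWords: number; // 0 is a valid sentinel; see below
  segment: string;
  segmentDurationMs: number;
  words: { word: string; startMs: number; endMs: number }[];
}
```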
#### `timing.type: "word-alignment"` — ElevenLabs

```json
{
  "type": "word-alignment",
  "words": ["Hello", " ", "world"],
  "wordsStartTimesMs": [0, 320, 360],
  "wordsEndTimesMs": [310, 350, 720]
}
```

Per-word timestamps from ElevenLabs' alignment API. Events arrive at audio playback cadence (~50–200ms apart). The `words[]` array includes space entries with real timing — join them and track a running character cursor to highlight `text` up to that position. No client-side interpolation needed.
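
A minimal sketch of that running character cursor (the names are illustrative, and it assumes each event's `words[]` covers only the audio played since the previous event):

```typescript
// Running character cursor for word-alignment events.
// Reset `cursor` to 0 whenever `turn` increments.
let cursor = 0;

function onWordAlignment(
  text: string, // accumulated turn text from the event
  timing: { words: string[] }
): { spoken: string; pending: string } {
  // Space entries are real array items, so joining the array reproduces
  // the exact character span that was just spoken.
  cursor += timing.words.join("").length;
  return { spoken: text.slice(0, cursor), pending: text.slice(cursor) };
}
```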
#### `timing.type: "word-progress"` — Minimax (with `voice.subtitleType: "word"`)

```json
{
  "type": "word-progress",
  "wordsSpoken": 22,
  "totalWords": 45,
  "segment": "the latest spoken segment text",
  "segmentDurationMs": 3200,
  "words": [
    { "word": "the", "startMs": 0, "endMs": 110 },
    { "word": "latest", "startMs": 110, "endMs": 480 }
  ]
}
```

Cursor-based per-segment progress: `wordsSpoken` out of `totalWords` gives the position of the playback cursor within the turn, and `words[]` carries per-word timestamps for the segment.

<Warning>
Minimax only attaches subtitle data to the **final audio chunk of each synthesis segment**, so each `assistant.speechStarted` event for a Minimax turn fires near the *end* of that segment's audio playback — not at the start, and not per-word. The `wordsSpoken` value jumps in segment-sized increments, and the `words[]` array carries timestamps for the segment that just *finished*. Use it to retroactively animate that segment, or to extrapolate forward — but it cannot drive smooth real-time highlighting *during* the current segment. For true playback-cadence per-word events, use ElevenLabs.
</Warning>

`totalWords: 0` is a valid sentinel on the very first event of a turn before Minimax confirms its word count — guard against divide-by-zero when computing a progress fraction. See the [Minimax voice provider page](/providers/voice/minimax) for full configuration details.
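
For example, a guarded progress fraction (sketch only):

```typescript
// Progress fraction with the totalWords: 0 sentinel guarded.
// Returns null until Minimax has confirmed the turn's word count.
function progressFraction(timing: { wordsSpoken: number; totalWords: number }): number | null {
  if (timing.totalWords === 0) return null;
  return Math.min(timing.wordsSpoken / timing.totalWords, 1);
}
```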
#### No `timing` field — text-only fallback

All other providers (Cartesia, Deepgram, Azure, OpenAI, Inworld, etc.) emit text-only events with no `timing` object. One event per TTS chunk, gated to actual audio playback. Display `text` as a caption block, or interpolate a word cursor at a flat rate (~3.5 words/sec) between events for an approximate highlight.
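
One way to do that interpolation (a sketch under the ~3.5 words/sec assumption above; names are illustrative):

```typescript
// Approximate word cursor for providers that send no `timing` object.
// Resynchronize on every event; some drift between events is expected.
const WORDS_PER_SEC = 3.5;

function approxWordCursor(
  wordsAtLastEvent: number, // word count of `text` when the last event arrived
  msSinceLastEvent: number,
  totalWordsNow: number // word count of the accumulated `text` right now
): number {
  const estimate = wordsAtLastEvent + (msSinceLastEvent / 1000) * WORDS_PER_SEC;
  return Math.min(Math.floor(estimate), totalWordsNow);
}
```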
#### Behaviors to be aware of

- **`force-say` events always emit as text-only**, even on ElevenLabs and Minimax — there's no provider-level alignment for forced utterances (firstMessage, queued `say` actions).
- **On user barge-in, no further events fire for the interrupted turn.** Pair with the [`user-interrupted`](#user-interrupted) message and use the most recent `wordsSpoken` (or joined char cursor) to know what was actually spoken.
- **There is no companion `assistant.speechStopped` event.** Use [`speech-update`](#speech-update) (`status: "stopped"`) or watch `turn` increment to detect end-of-turn, as shown in the sketch after this list.
- **Custom voice timing depends on what your voice server returns.** If you return timestamped JSON frames from your custom voice server, those flow through as `timing.words[]`; raw PCM responses produce text-only events.
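
A minimal end-of-turn detection sketch (the UI callbacks here are hypothetical placeholders, not part of any SDK):

```typescript
// Treat an increment of `turn` as the end of the previous turn.
let currentTurn = -1;

function onSpeechStarted(msg: { turn: number; text: string }): void {
  if (msg.turn !== currentTurn) {
    if (currentTurn >= 0) finalizeCaption(currentTurn); // previous turn is done
    currentTurn = msg.turn;
  }
  renderCaption(msg.turn, msg.text);
}

// Hypothetical UI-layer hooks.
declare function finalizeCaption(turn: number): void;
declare function renderCaption(turn: number, text: string): void;
```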
### Model Output

Tokens or tool-call outputs, sent as the model generates them. The optional `turnId` groups all tokens from the same LLM response, so you can correlate output with a specific turn.
