## fern/customization/speech-configuration.mdx

This plan defines the parameters for when the assistant begins speaking after the user finishes speaking.
- **End-of-turn prediction** - predicting when the current speaker is likely to finish their turn.
- **Backchannel prediction** - detecting moments where a listener may provide short verbal acknowledgments like "uh-huh", "yeah", etc. to show engagement, without intending to take over the speaking turn. This is better handled by the assistant's stopSpeakingPlan.
We offer different providers that can be audio-based, text-based, or audio-text based:
- **LiveKit (`livekit`)**: Recommended for English conversations. LiveKit can be fine-tuned using the `waitFunction` parameter to adjust response timing based on the probability that the user is still speaking.
- **Vapi (`vapi`)**: Recommended for non-English conversations or as an alternative when LiveKit isn't suitable.
- **Custom endpointing model (`custom-endpointing-model`)**: Use your own endpointing service by setting `smartEndpointingPlan.server.url`.
- **Krisp**: Audio-based model that analyzes prosodic and acoustic features such as changes in intonation, pitch, and rhythm to detect when users finish speaking. Since it's audio-based, it always notifies when the user is done speaking, even for brief acknowledgments. Vapi offers configurable acknowledgement words and a well-configured stop speaking plan to handle this properly.
If you want smart endpointing off, omit `smartEndpointingPlan`. In that case, the system uses transcriber end-of-turn detection when available.
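As a minimal sketch, selecting LiveKit and tuning its `waitFunction` might look like the fragment below. The sigmoid expression is an illustrative placeholder, not a recommended value, and the exact nesting should be checked against the `startSpeakingPlan` schema:

```json
{
  "startSpeakingPlan": {
    "smartEndpointingPlan": {
      "provider": "livekit",
      "waitFunction": "700 / (1 + exp(-58 * (x - 0.73)))"
    }
  }
}
```

Here `x` stands for the model's probability that the user is still speaking, so the function maps that probability to a wait time in milliseconds.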
Configure Krisp with a threshold between 0 and 1 (default 0.5), where 1 means the user definitely stopped speaking and 0 means they're still speaking. Use lower values for snappier conversations and higher values for more conservative detection.
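For example, a Krisp plan tuned toward snappier turn-taking might lower the threshold, as in this sketch (assuming the threshold is exposed as a field on the plan):

```json
{
  "startSpeakingPlan": {
    "smartEndpointingPlan": {
      "provider": "krisp",
      "threshold": 0.35
    }
  }
}
```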
When interacting with an AI agent, users may genuinely want to interrupt to ask a question or shift the conversation, or they might simply be using backchannel cues like "right" or "okay" to signal they're actively listening. The core challenge lies in distinguishing meaningful interruptions from casual acknowledgments. Since the audio-based model signals end-of-turn after each word, configure the stop speaking plan with the right number of words to interrupt, interruption settings, and acknowledgement phrases to handle backchanneling properly.
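A sketch of a stop speaking plan that treats short acknowledgments as backchannels rather than interruptions could look like this (field names follow the stop speaking plan described here; treat exact names and values as assumptions to verify against the API reference):

```json
{
  "stopSpeakingPlan": {
    "numWords": 3,
    "acknowledgementPhrases": ["uh-huh", "yeah", "right", "okay"]
  }
}
```

With settings like these, an utterance of up to three words that matches an acknowledgement phrase would be treated as backchanneling instead of stopping the assistant.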
**Audio-text based providers:**
- **Deepgram Flux**: Deepgram's latest transcriber model with built-in conversational speech recognition. Flux combines high-quality speech-to-text with native turn detection, while delivering ultra-low latency and Nova-3-level accuracy.
- **Assembly**: Transcriber that also reports end-of-turn detection. To use Assembly, choose it as your transcriber without setting a separate smart endpointing plan. As transcripts arrive, Vapi uses the `end_of_turn` flag Assembly sends to mark the end of the turn, streams the transcript to the LLM, and generates a response.
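For instance, a configuration relying on Assembly's built-in end-of-turn detection would set only the transcriber and leave `smartEndpointingPlan` unset. The provider slug below is an assumption; check the transcriber reference for the exact value:

```json
{
  "transcriber": {
    "provider": "assembly-ai"
  }
}
```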
**Text-based providers:**
- **Off**: Disabled by default. When smart endpointing is set to "Off", the system will automatically use the transcriber's end-of-turn detection if available. If no transcriber EOT detection is available, the system defaults to LiveKit if the language is set to English, or to Vapi's standard endpointing mode.
- **LiveKit**: Recommended for English conversations as it provides the most sophisticated solution for detecting natural speech patterns and pauses. LiveKit can be fine-tuned using the `waitFunction` parameter to adjust response timing based on the probability that the user is still speaking.
- **Vapi**: Recommended for non-English conversations or as an alternative when LiveKit isn't suitable.
## fern/customization/voice-pipeline-configuration.mdx
This plan is only used if `smartEndpointingPlan` is not set and the transcriber does not provide built-in end-of-turn detection.
### Smart endpointing
Uses supported endpointing providers to analyze speech patterns, context, and audio cues to predict when users have finished speaking.
**Important:** If your transcriber has built-in end-of-turn detection (like Deepgram Flux or Assembly) and you don't configure a smart endpointing plan, the system will automatically use the transcriber's EOT detection instead of smart endpointing.
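As an illustration, relying on transcriber EOT means configuring only a transcriber with native turn detection and leaving `smartEndpointingPlan` unset. The model name below is an assumption for the Flux model; verify both values against the transcriber reference:

```json
{
  "transcriber": {
    "provider": "deepgram",
    "model": "flux-general-en"
  }
}
```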
</Tab>
<Tab title="Providers">
Supported values for `smartEndpointingPlan.provider`:

- **livekit**: Advanced model trained on conversation data (English only)
- **vapi**: Vapi endpointing model for multilingual conversations
- **krisp**: Audio-based model analyzing prosodic features (intonation, pitch, rhythm)
- **deepgram-flux**: Deepgram's transcriber model with built-in conversational speech recognition (English only)
- **assembly**: Transcriber with built-in end-of-turn detection (English only)
- **custom-endpointing-model**: Send endpointing requests to your own model using `smartEndpointingPlan.server.url`
</Tab>
</Tabs>
**When to use:**
- **Deepgram Flux**: English conversations using Deepgram as the transcriber.
- **Assembly**: Best when Assembly is already your transcriber provider for English conversations with integrated end-of-turn detection.
- **LiveKit**: English conversations where Deepgram is not the transcriber of choice.
- **Vapi**: Non-English conversations with default stop speaking plan settings.
- **Krisp**: Non-English conversations with a robustly configured stop speaking plan.
- **Custom endpointing model**: Bring your own endpointing logic for specialized domains.
### Deepgram Flux configuration
The system continuously analyzes the latest user message and applies the first matching rule.
Since Krisp is audio-based, it always notifies when the user is done speaking, even for brief acknowledgments. Configure the stop speaking plan with appropriate `acknowledgementPhrases` and `numWords` settings to handle backchanneling properly.
When configured, Vapi sends `call.endpointing.request` events to your server URL and expects a timeout-based response.
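A sketch of that exchange is shown below. The payload shapes and the `timeoutSeconds` field are assumptions based on the timeout-based contract described above, not the authoritative schema:

```json
{
  "message": {
    "type": "call.endpointing.request",
    "transcript": "I wanted to ask about my order"
  }
}
```

Your server would then respond with how long to wait before treating the turn as finished, for example:

```json
{
  "timeoutSeconds": 0.8
}
```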
### Assembly turn detection
User Interrupts → Assistant Audio Stopped → `backoffSeconds` Blocks All Output
**Optimized for:** Text-based endpointing with longer timeouts for different speech patterns and international support.
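A sketch of such a text-based plan with longer, punctuation-aware timeouts follows. Field names match the transcription endpointing plan discussed earlier, but treat the exact names and defaults as assumptions to verify against the API reference:

```json
{
  "startSpeakingPlan": {
    "transcriptionEndpointingPlan": {
      "onPunctuationSeconds": 0.3,
      "onNoPunctuationSeconds": 1.8,
      "onNumberSeconds": 0.8
    }
  }
}
```

Longer `onNoPunctuationSeconds` values give hesitant or non-native speakers more room to pause mid-sentence without being cut off.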