
Commit d7fdc26

Fix smart endpointing provider docs

Co-authored-by: Sahil Suman <sahilsuman933@users.noreply.github.com>
1 parent c3aeabd commit d7fdc26

2 files changed: 26 additions & 46 deletions

fern/customization/speech-configuration.mdx (8 additions & 17 deletions)
@@ -39,27 +39,18 @@ This plan defines the parameters for when the assistant begins speaking after th
 - **End-of-turn prediction** - predicting when the current speaker is likely to finish their turn.
 - **Backchannel prediction** - detecting moments where a listener may provide short verbal acknowledgments like "uh-huh", "yeah", etc. to show engagement, without intending to take over the speaking turn. This is better handled by the assistant's stopSpeakingPlan.
 
-We offer different providers that can be audio-based, text-based, or audio-text based:
+Supported `smartEndpointingPlan.provider` values are:
 
-**Audio-based providers:**
+- **LiveKit (`livekit`)**: Recommended for English conversations. LiveKit can be fine-tuned using the `waitFunction` parameter to adjust response timing based on the probability that the user is still speaking.
+- **Vapi (`vapi`)**: Recommended for non-English conversations or as an alternative when LiveKit isn't suitable.
+- **Custom endpointing model (`custom-endpointing-model`)**: Use your own endpointing service by setting `smartEndpointingPlan.server.url`.
 
-- **Krisp**: Audio-based model that analyzes prosodic and acoustic features such as changes in intonation, pitch, and rhythm to detect when users finish speaking. Since it's audio-based, it always notifies when the user is done speaking, even for brief acknowledgments. Vapi offers configurable acknowledgement words and a well-configured stop speaking plan to handle this properly.
+If you want smart endpointing off, omit `smartEndpointingPlan`. In that case, the system uses transcriber end-of-turn detection when available.
 
-Configure Krisp with a threshold between 0 and 1 (default 0.5), where 1 means the user definitely stopped speaking and 0 means they're still speaking. Use lower values for snappier conversations and higher values for more conservative detection.
+**Transcriber-native end-of-turn (not `smartEndpointingPlan.provider` values):**
 
-When interacting with an AI agent, users may genuinely want to interrupt to ask a question or shift the conversation, or they might simply be using backchannel cues like "right" or "okay" to signal they're actively listening. The core challenge lies in distinguishing meaningful interruptions from casual acknowledgments. Since the audio-based model signals end-of-turn after each word, configure the stop speaking plan with the right number of words to interrupt, interruption settings, and acknowledgement phrases to handle backchanneling properly.
-
-**Audio-text based providers:**
-
-- **Deepgram Flux**: Deepgram's latest transcriber model with built-in conversational speech recognition. Flux combines high-quality speech-to-text with native turn detection, while delivering ultra-low latency and Nova-3 level accuracy.
-
-- **Assembly**: Transcriber that also reports end-of-turn detection. To use Assembly, choose it as your transcriber without setting a separate smart endpointing plan. As transcripts arrive, we consider the `end_of_turn` flag that Assembly sends to mark the end-of-turn, stream to the LLM, and generate a response.
-
-**Text-based providers:**
-
-- **Off**: Disabled by default. When smart endpointing is set to "Off", the system will automatically use the transcriber's end-of-turn detection if available. If no transcriber EOT detection is available, the system defaults to LiveKit if the language is set to English or to Vapi's standard endpointing mode.
-
-- **LiveKit**: Recommended for English conversations as it provides the most sophisticated solution for detecting natural speech patterns and pauses. LiveKit can be fine-tuned using the `waitFunction` parameter to adjust response timing based on the probability that the user is still speaking.
-
-- **Vapi**: Recommended for non-English conversations or as an alternative when LiveKit isn't suitable.
+- **Deepgram Flux**: Deepgram's transcriber with built-in conversational turn detection.
+- **Assembly**: Assembly transcriber that returns `end_of_turn` signals. Configure this on the transcriber without setting `smartEndpointingPlan`.
 
 ![LiveKit Smart Endpointing Configuration](../static/images/advanced-tab/livekit-smart-endpointing.png)
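To make the `waitFunction` mention in the added lines concrete, here is a minimal sketch of a `smartEndpointingPlan` using the `livekit` provider. The expression is illustrative only (assuming `x` is the model's probability that the user is still speaking), not a documented default; check the Vapi API reference for the exact `waitFunction` semantics.

```json
{
  "startSpeakingPlan": {
    "smartEndpointingPlan": {
      "provider": "livekit",
      "waitFunction": "200 + 8000 * x"
    }
  }
}
```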

fern/customization/voice-pipeline-configuration.mdx (18 additions & 29 deletions)
@@ -162,7 +162,7 @@ This plan is only used if `smartEndpointingPlan` is not set and transcriber does
 
 ### Smart endpointing
 
-Uses AI models to analyze speech patterns, context, and audio cues to predict when users have finished speaking. Only available for English conversations.
+Uses supported endpointing providers to predict when users have finished speaking.
 
 **Important:** If your transcriber has built-in end-of-turn detection (like Deepgram Flux or Assembly) and you don't configure a smart endpointing plan, the system will automatically use the transcriber's EOT detection instead of smart endpointing.

@@ -181,27 +181,19 @@ Uses AI models to analyze speech patterns, context, and audio cues to predict wh
 ```
 </Tab>
 <Tab title="Providers">
-**Text-based providers:**
+Supported values for `smartEndpointingPlan.provider`:
 - **livekit**: Advanced model trained on conversation data (English only)
-- **vapi**: VAPI-trained model (non-English conversations or LiveKit alternative)
-
-**Audio-based providers:**
-- **krisp**: Audio-based model analyzing prosodic features (intonation, pitch, rhythm)
-
-**Audio-text based providers:**
-- **deepgram-flux**: Deepgram's latest transcriber model with built-in conversational speech recognition. (English only)
-- **assembly**: Transcriber with built-in end-of-turn detection (English only)
+- **vapi**: Vapi endpointing model for multilingual conversations
+- **custom-endpointing-model**: Send endpointing requests to your own model using `smartEndpointingPlan.server.url`
 
 </Tab>
 </Tabs>
 
 **When to use:**
 
-- **Deepgram Flux**: English conversations using Deepgram as a transcriber.
-- **Assembly**: Best used when Assembly is already your transcriber provider for English conversations with integrated end-of-turn detection
 - **LiveKit**: English conversations where Deepgram is not the transcriber of choice.
 - **Vapi**: Non-English conversations with default stop speaking plan settings
-- **Krisp**: Non-English conversations with a robustly configured stop speaking plan
+- **Custom endpointing model**: Bring your own endpointing logic for specialized domains
 
 ### Deepgram Flux configuration
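As a sketch of the multilingual option above: selecting the `vapi` provider needs only the provider field, assuming no extra tuning parameters (none are documented in this diff).

```json
{
  "startSpeakingPlan": {
    "smartEndpointingPlan": {
      "provider": "vapi"
    }
  }
}
```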

@@ -305,31 +297,26 @@ The system continuously analyzes the latest user message and applies the first m
 - Scenarios requiring predictable, rule-based endpointing behavior
 - Fallback option when other smart endpointing providers aren't suitable
 
-### Krisp threshold configuration
+### Custom endpointing model configuration
 
-Krisp's audio-based model returns a probability between 0 and 1, where 1 means the user definitely stopped speaking and 0 means they're still speaking.
-
-**Threshold settings:**
-
-- **0.0-0.3:** Very aggressive detection - responds quickly but may interrupt users mid-sentence
-- **0.4-0.6:** Balanced detection (default: 0.5) - good balance between responsiveness and accuracy
-- **0.7-1.0:** Conservative detection - waits longer to ensure users have finished speaking
+Use `custom-endpointing-model` when you want full control over endpointing decisions.
 
 **Configuration example:**
 
 ```json
 {
   "startSpeakingPlan": {
     "smartEndpointingPlan": {
-      "provider": "krisp",
-      "threshold": 0.5
+      "provider": "custom-endpointing-model",
+      "server": {
+        "url": "https://your-endpointing-service.example.com/endpointing"
+      }
     }
   }
 }
 ```
 
-**Important considerations:**
-Since Krisp is audio-based, it always notifies when the user is done speaking, even for brief acknowledgments. Configure the stop speaking plan with appropriate `acknowledgementPhrases` and `numWords` settings to handle backchanneling properly.
+When configured, Vapi sends `call.endpointing.request` events to your server URL and expects a timeout-based response.
 
 ### Assembly turn detection
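A minimal sketch of a server that could answer the `call.endpointing.request` events described above, assuming the request body carries the conversation messages and the expected response is a JSON object with a timeout in seconds. The field names `messages` and `timeoutSeconds` are illustrative assumptions, not confirmed schema; verify them against Vapi's server-event reference before use.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def decide_timeout(event: dict) -> float:
    """Illustrative heuristic: wait longer when the last user utterance
    looks unfinished (no terminal punctuation)."""
    last_text = ""
    for msg in reversed(event.get("messages", [])):
        if msg.get("role") == "user":
            last_text = msg.get("content") or ""
            break
    unfinished = not last_text.rstrip().endswith((".", "?", "!"))
    return 1.5 if unfinished else 0.3


class EndpointingHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Parse the (assumed) JSON event body sent by Vapi.
        length = int(self.headers.get("Content-Length", 0))
        event = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps({"timeoutSeconds": decide_timeout(event)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def run(port: int = 8080) -> None:
    # Serve at the path configured in smartEndpointingPlan.server.url.
    HTTPServer(("", port), EndpointingHandler).serve_forever()
```

You would then point `smartEndpointingPlan.server.url` at wherever this handler is hosted.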

@@ -613,15 +600,17 @@ User Interrupts → Assistant Audio Stopped → backoffSeconds Blocks All Output
 
 **Optimized for:** Text-based endpointing with longer timeouts for different speech patterns and international support.
 
-### Audio-based endpointing (Krisp example)
+### Custom endpointing model example
 
 ```json
 {
   "startSpeakingPlan": {
     "waitSeconds": 0.4,
     "smartEndpointingPlan": {
-      "provider": "krisp",
-      "threshold": 0.5
+      "provider": "custom-endpointing-model",
+      "server": {
+        "url": "https://your-endpointing-service.example.com/endpointing"
+      }
     }
   },
   "stopSpeakingPlan": {

@@ -640,7 +629,7 @@ User Interrupts → Assistant Audio Stopped → backoffSeconds Blocks All Output
 }
 ```
 
-**Optimized for:** Non-English conversations with robust backchanneling configuration to handle audio-based detection limitations.
+**Optimized for:** Teams that need domain-specific endpointing behavior and want to run their own endpointing model.
 
 ### Audio-text based endpointing (Assembly example)
