| layout | default |
|---|---|
| title | Chapter 6: Voice Output |
| nav_order | 6 |
| parent | OpenAI Realtime Agents Tutorial |
Welcome to Chapter 6: Voice Output. This chapter of the OpenAI Realtime Agents Tutorial: Voice-First AI Systems first builds a mental model of the voice output path, then covers implementation details and production tradeoffs.
Voice output quality is primarily a timing and interaction problem. Good prosody helps, but responsiveness and interruption behavior matter more.
By the end of this chapter, you should be able to:
- design low-latency output streaming behavior
- handle barge-in cleanly without losing conversation continuity
- monitor core audio response metrics
- tune output policy for different use cases

A spoken response moves through these stages, in order:

- response deltas are generated
- audio is synthesized/streamed
- client buffers and plays frames
- playback is interrupted or completed
- session state is updated for next turn
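The stages above can be read as a small state machine on the client. The following is a minimal sketch under that assumption; the state names, the `OutputTurn` class, and the transition table are illustrative and not part of any SDK.

```python
from dataclasses import dataclass, field
from enum import Enum


class OutputState(Enum):
    IDLE = "idle"
    GENERATING = "generating"    # response deltas are arriving
    PLAYING = "playing"          # client is buffering and playing frames
    INTERRUPTED = "interrupted"  # playback was cut short by the user
    COMPLETED = "completed"      # playback ran to the end


# Allowed transitions (hypothetical); anything else signals a sequencing bug.
TRANSITIONS = {
    OutputState.IDLE: {OutputState.GENERATING},
    OutputState.GENERATING: {OutputState.PLAYING, OutputState.INTERRUPTED},
    OutputState.PLAYING: {OutputState.INTERRUPTED, OutputState.COMPLETED},
    OutputState.INTERRUPTED: {OutputState.IDLE},
    OutputState.COMPLETED: {OutputState.IDLE},
}


@dataclass
class OutputTurn:
    state: OutputState = OutputState.IDLE
    history: list = field(default_factory=list)

    def advance(self, new_state: OutputState) -> None:
        """Move to new_state, rejecting transitions the table does not allow."""
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(
                f"illegal transition {self.state.value} -> {new_state.value}"
            )
        self.history.append((self.state, new_state))
        self.state = new_state


# One interrupted turn: generate, start playing, get cut off, reset for next turn.
turn = OutputTurn()
for step in (OutputState.GENERATING, OutputState.PLAYING,
             OutputState.INTERRUPTED, OutputState.IDLE):
    turn.advance(step)
```

Rejecting illegal transitions early tends to surface ordering bugs (for example, a completion event arriving after an interruption) before they corrupt session state.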

When shaping what the agent says aloud:

- prefer short, direct phrasing
- avoid dense list-heavy answers in speech mode
- announce long actions briefly before tool calls
- use natural checkpoint phrases for easier interruption
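Some of these guidelines can be checked mechanically before text is sent to speech. The sketch below is one possible heuristic, not an established rule set: the 20-word limit, the three-marker threshold, and the function name `speech_style_issues` are all assumptions chosen for illustration.

```python
import re

MAX_WORDS_PER_SENTENCE = 20  # assumed threshold; tune per product
# Matches bullet or numbered list markers at the start of a line.
LIST_MARKER = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+", re.MULTILINE)


def speech_style_issues(text: str) -> list[str]:
    """Return human-readable warnings for text that may sound poor when spoken."""
    issues = []
    if len(LIST_MARKER.findall(text)) >= 3:
        issues.append("list-heavy: consider summarizing instead of enumerating")
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    for sentence in sentences:
        word_count = len(sentence.split())
        if word_count > MAX_WORDS_PER_SENTENCE:
            issues.append(f"long sentence ({word_count} words)")
    return issues
```

A check like this could run in tests over canned agent replies, or at runtime to trigger a rephrase when a reply is flagged.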

When the user speaks during playback (barge-in):

- stop playback immediately
- mark the current response as interrupted
- prioritize the event path for the next user input
- ensure the transcript and session state remain coherent after the cutover
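The barge-in steps can be sketched as a single handler that drops queued audio, flags the response, and tells the server how much audio the user actually heard. This sketch assumes the client plays raw 16-bit mono PCM at 24 kHz itself and counts the bytes it has rendered; `PlaybackState` and `handle_barge_in` are hypothetical names. The two event payloads follow the Realtime API's `response.cancel` and `conversation.item.truncate` client events as commonly documented, but the field names should be checked against the current API reference.

```python
from dataclasses import dataclass

SAMPLE_RATE_HZ = 24_000  # assumption: 16-bit mono PCM output at 24 kHz
BYTES_PER_SAMPLE = 2


@dataclass
class PlaybackState:
    item_id: str              # id of the assistant item being spoken
    bytes_played: int = 0     # audio actually rendered to the speaker
    interrupted: bool = False


def played_ms(bytes_played: int) -> int:
    """Convert rendered PCM bytes into milliseconds of audio heard."""
    samples = bytes_played // BYTES_PER_SAMPLE
    return samples * 1000 // SAMPLE_RATE_HZ


def handle_barge_in(playback: PlaybackState, pending_frames: list) -> list[dict]:
    """Stop local playback and build the events to send to the server."""
    pending_frames.clear()        # stop playback: drop audio not yet played
    playback.interrupted = True   # mark the current response as interrupted
    return [
        # Ask the server to stop generating the in-flight response.
        {"type": "response.cancel"},
        # Trim the stored assistant audio to what was heard, so the
        # transcript and the next turn's context stay coherent.
        {
            "type": "conversation.item.truncate",
            "item_id": playback.item_id,
            "content_index": 0,
            "audio_end_ms": played_ms(playback.bytes_played),
        },
    ]
```

Truncating to the played position matters because the model may have generated far more audio than the user heard; without it, later turns can refer to content the user never received.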
| Metric | Why It Matters |
|---|---|
| time to first audio | user-perceived responsiveness |
| interruption stop latency | user sense of control |
| full response completion latency | overall task pacing |
| playback error rate | trust and reliability |
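The metrics in the table can be derived from a handful of client-side timestamps per turn. The sketch below assumes timestamps in seconds from a monotonic clock and uses a hypothetical `TurnTimeline` record; which moment counts as the start of a turn (here, the end of user speech) is a design choice to fix before comparing numbers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TurnTimeline:
    turn_start: float                          # e.g. end of user speech
    first_audio_played: Optional[float] = None
    playback_finished: Optional[float] = None
    interrupt_detected: Optional[float] = None
    playback_stopped: Optional[float] = None
    playback_error: bool = False


def _delta_ms(start: Optional[float], end: Optional[float]) -> Optional[int]:
    """Elapsed milliseconds, or None if either timestamp is missing."""
    if start is None or end is None:
        return None
    return round((end - start) * 1000)


def turn_metrics(t: TurnTimeline) -> dict:
    """Per-turn latency metrics; None means the event did not occur."""
    return {
        "time_to_first_audio_ms": _delta_ms(t.turn_start, t.first_audio_played),
        "interruption_stop_latency_ms": _delta_ms(
            t.interrupt_detected, t.playback_stopped
        ),
        "completion_latency_ms": _delta_ms(t.turn_start, t.playback_finished),
    }


def playback_error_rate(turns: list) -> float:
    """Fraction of turns that hit a playback error."""
    if not turns:
        return 0.0
    return sum(t.playback_error for t in turns) / len(turns)
```

In practice these values are usually aggregated as percentiles (for example p50 and p95) rather than averages, since a few slow turns dominate how responsive the agent feels.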

Warning signs that output quality is degrading:

- more users complaining about or abandoning sessions after interruptions, even though model quality is stable
- more repeated user prompts ("hello?", "are you there?")
- higher manual retry rates for basic interactions
- audible clipping or stutter under normal network conditions
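One of these signals, repeated prompts such as "hello?" or "are you there?", can be estimated directly from user transcripts. The following sketch is a rough heuristic under stated assumptions: the phrase list and the alert factor of 2.0 are hypothetical and would need to be tuned against real transcripts.

```python
import re

# Hypothetical phrase list; extend it from real transcripts. A plain "hello"
# at session start is a greeting, so consider skipping the first utterance.
PRESENCE_CHECK = re.compile(
    r"^\s*(hello|are you there|can you hear me|anyone there)\s*[?!.]*\s*$",
    re.IGNORECASE,
)


def presence_check_rate(user_utterances: list[str]) -> float:
    """Fraction of user utterances that only check whether the agent is present."""
    if not user_utterances:
        return 0.0
    hits = sum(1 for u in user_utterances if PRESENCE_CHECK.match(u))
    return hits / len(user_utterances)


def should_alert(rate: float, baseline: float, factor: float = 2.0) -> bool:
    """Flag when the current rate exceeds the baseline by the given factor."""
    return rate > baseline * factor
```

Comparing against a rolling baseline rather than a fixed threshold helps separate a real regression from the normal background level of such prompts.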
You now understand how to tune voice output for perceived speed, clarity, and user control.
Next: Chapter 7: Advanced Patterns

```mermaid
flowchart LR
    A[Agent Text Response] --> B[Realtime TTS Engine]
    B --> C[PCM Audio Stream]
    C --> D[WebRTC Transport]
    D --> E[Browser Audio Output]
    F[Interrupt Signal] --> G[Cancel Ongoing Audio]
    G --> D
```