layout	default
title	Chapter 6: Voice Output
nav_order	6
parent	OpenAI Realtime Agents Tutorial

Chapter 6: Voice Output

Welcome to Chapter 6: Voice Output. In this part of the OpenAI Realtime Agents Tutorial: Voice-First AI Systems, you will first build an intuitive mental model, then move on to concrete implementation details and practical production tradeoffs.

Voice output quality is primarily a timing and interaction problem. Good prosody helps, but responsiveness and interruption behavior matter more.

Learning Goals

By the end of this chapter, you should be able to:

design low-latency output streaming behavior
handle barge-in cleanly without losing conversation continuity
monitor core audio response metrics
tune output policy for different use cases

Output Pipeline

response deltas are generated
audio is synthesized/streamed
client buffers and plays frames
playback is interrupted or completed
session state is updated for next turn
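One way to keep these stages straight is to model them as a small state machine on the client. The sketch below is illustrative only: the state names, event names, and the `nextState` and `run` helpers are assumptions made for this example and do not come from any SDK.

```typescript
// Hypothetical names throughout: PlaybackState, OutputEvent, nextState, and run
// are illustrative and not part of any SDK.
type PlaybackState = "idle" | "generating" | "playing" | "interrupted" | "completed";

type OutputEvent =
  | { kind: "delta" }          // a response delta was generated
  | { kind: "audio_frame" }    // a synthesized frame reached the client buffer
  | { kind: "user_speech" }    // the user started talking during output
  | { kind: "playback_done" }  // the last buffered frame finished playing
  | { kind: "turn_reset" };    // session state was updated for the next turn

// Pure transition function: events that make no sense in the current state
// are ignored, so late audio frames cannot revive an interrupted response.
function nextState(state: PlaybackState, event: OutputEvent): PlaybackState {
  switch (event.kind) {
    case "delta":
      return state === "idle" ? "generating" : state;
    case "audio_frame":
      return state === "generating" ? "playing" : state;
    case "user_speech":
      return state === "generating" || state === "playing" ? "interrupted" : state;
    case "playback_done":
      return state === "playing" ? "completed" : state;
    case "turn_reset":
      return state === "interrupted" || state === "completed" ? "idle" : state;
  }
}

// Replays a sequence of events from "idle" and returns the state after each one.
function run(events: OutputEvent[]): PlaybackState[] {
  let state: PlaybackState = "idle";
  return events.map((event) => (state = nextState(state, event)));
}
```

Keeping the transition function pure makes it easy to replay recorded event sequences in tests, including awkward orderings such as an audio frame that arrives after the user has already interrupted.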

Voice UX Rules of Thumb

prefer short, direct phrasing
avoid dense list-heavy answers in speech mode
announce long actions briefly before tool calls
use natural checkpoint phrases for easier interruption
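If your client controls how reply text is handed to speech, one possible way to apply these rules is to break the reply into short, sentence-sized chunks so that every chunk boundary is a natural place to stop. This is a minimal sketch under that assumption: `toSpeechChunks` and its default `maxWords` value are hypothetical and chosen only for illustration.

```typescript
// Hypothetical helper: splits a reply into short chunks suitable for speech.
// The default of 12 words per chunk is an arbitrary illustrative value.
function toSpeechChunks(text: string, maxWords: number = 12): string[] {
  const size = Math.max(1, Math.floor(maxWords)); // guard against zero or negative sizes

  // Split after sentence-ending punctuation that is followed by whitespace.
  const sentences = text
    .split(/(?<=[.!?])\s+/)
    .map((sentence) => sentence.trim())
    .filter((sentence) => sentence.length > 0);

  const chunks: string[] = [];
  for (const sentence of sentences) {
    const words = sentence.split(/\s+/);
    // Sentences longer than the limit are cut into pieces of at most `size` words.
    for (let i = 0; i < words.length; i += size) {
      chunks.push(words.slice(i, i + size).join(" "));
    }
  }
  return chunks;
}
```

For example, `toSpeechChunks("I found three flights. The cheapest leaves at noon. Want me to book it?")` yields three chunks, one per sentence. Splitting on word count alone is crude; a production system would more likely shape phrasing through the agent's instructions.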

Barge-In Behavior

When the user speaks during playback:

stop playback immediately
mark current response state as interrupted
prioritize next user input event path
ensure transcript/state remains coherent after cutover
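The four steps above can be sketched as a single handler. Everything here is an assumption for illustration: `ActiveResponse`, `BargeInResult`, and `handleBargeIn` are hypothetical, and the proportional transcript cut is a rough approximation of what the user actually heard. The two event `type` strings are intended to mirror the Realtime API's `response.cancel` and `conversation.item.truncate` client events; verify the exact payloads against the current API reference before relying on them.

```typescript
// Hypothetical shapes for this sketch; not part of any SDK.
interface ActiveResponse {
  itemId: string;          // id of the assistant item currently being played
  fullTranscript: string;  // everything generated so far
  totalAudioMs: number;    // duration of audio generated so far
  status: "playing" | "interrupted" | "completed";
}

interface BargeInResult {
  response: ActiveResponse;
  heardTranscript: string;                        // approximate text the user heard
  clientEvents: Array<Record<string, unknown>>;   // events to send to the server
}

function handleBargeIn(
  active: ActiveResponse,
  playedMs: number,
  stopPlayback: () => void,
): BargeInResult {
  // 1. Stop local playback before doing anything else.
  stopPlayback();

  // Clamp the playback position to the audio that actually exists.
  const audioEndMs = Math.max(0, Math.min(Math.floor(playedMs), active.totalAudioMs));

  // 4. Approximate the heard transcript by the fraction of audio played.
  //    This assumes a roughly constant speaking rate, which is only an estimate.
  const fraction = active.totalAudioMs > 0 ? audioEndMs / active.totalAudioMs : 0;
  const heardTranscript = active.fullTranscript.slice(
    0,
    Math.floor(active.fullTranscript.length * fraction),
  );

  return {
    // 2. Mark the response as interrupted without mutating the input.
    response: { ...active, status: "interrupted" },
    heardTranscript,
    // 3. Tell the server to stop generating and to drop unheard audio,
    //    so the next user turn is processed against a coherent history.
    clientEvents: [
      { type: "response.cancel" },
      {
        type: "conversation.item.truncate",
        item_id: active.itemId,
        content_index: 0,
        audio_end_ms: audioEndMs,
      },
    ],
  };
}
```

Returning the events instead of sending them keeps the handler independent of the transport, so the same logic can be exercised in tests without a live connection.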

Latency Targets (Product-Dependent)

Metric	Why It Matters
time to first audio	user-perceived responsiveness
interruption stop latency	user's sense of control
full response completion latency	overall task pacing
playback error rate	trust and reliability
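As a rough sketch of how the four metrics in the table could be computed, the example below derives them from per-turn timestamps. The `TurnTimings` fields are hypothetical names for values your client would need to record itself, all in milliseconds on the same clock. The choice of the median as the summary statistic is also an assumption; many teams track higher percentiles as well.

```typescript
// Hypothetical per-turn record; all timestamps are milliseconds on one clock.
interface TurnTimings {
  requestSentAt: number;
  firstAudioAt?: number;         // first audio frame started playing
  interruptDetectedAt?: number;  // user speech detected during playback
  playbackStoppedAt?: number;    // playback actually went silent
  responseDoneAt?: number;       // response finished without interruption
  playbackError: boolean;
}

interface OutputMetrics {
  timeToFirstAudioMs: number | null;  // median, or null when no samples exist
  interruptionStopMs: number | null;
  fullResponseMs: number | null;
  playbackErrorRate: number;          // fraction of turns with a playback error
}

function median(values: number[]): number | null {
  if (values.length === 0) return null;
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Returns a one-element array with the elapsed time, or an empty array when
// either timestamp is missing, so it composes cleanly with flatMap.
function gap(start: number | undefined, end: number | undefined): number[] {
  return start !== undefined && end !== undefined ? [end - start] : [];
}

function summarize(turns: TurnTimings[]): OutputMetrics {
  return {
    timeToFirstAudioMs: median(turns.flatMap((t) => gap(t.requestSentAt, t.firstAudioAt))),
    interruptionStopMs: median(turns.flatMap((t) => gap(t.interruptDetectedAt, t.playbackStoppedAt))),
    fullResponseMs: median(turns.flatMap((t) => gap(t.requestSentAt, t.responseDoneAt))),
    playbackErrorRate:
      turns.length === 0 ? 0 : turns.filter((t) => t.playbackError).length / turns.length,
  };
}
```

Turns that never reached a given stage simply contribute no sample to that metric, which keeps interrupted turns from distorting the full-response latency.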

Output Regression Signals

rising interruption dissatisfaction despite stable model quality
increased repeated-user prompts ("hello?", "are you there?")
higher manual retry rates for basic interactions
audible clipping or stutter under normal network conditions
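Some of these signals can be tracked automatically. As one hedged example, the sketch below measures how often user turns consist solely of a check-in prompt and compares that rate with a baseline. The phrase list, the `checkInRate` and `isRegression` helpers, and the default tolerance are all assumptions for illustration; tune them for your product and language.

```typescript
// Hypothetical phrase list; extend or replace it for your product and language.
const CHECK_IN_PHRASES = ["hello", "are you there", "can you hear me"];

// Lowercases, strips punctuation, and collapses whitespace.
function normalize(utterance: string): string {
  return utterance
    .toLowerCase()
    .replace(/[^a-z\s]/g, "")
    .replace(/\s+/g, " ")
    .trim();
}

// Fraction of user utterances that are nothing but a check-in prompt.
// Exact matching is deliberate: a greeting inside a longer request is not counted.
function checkInRate(userUtterances: string[]): number {
  if (userUtterances.length === 0) return 0;
  const hits = userUtterances.filter((u) => CHECK_IN_PHRASES.includes(normalize(u))).length;
  return hits / userUtterances.length;
}

// Flags a regression when the current rate exceeds the baseline by more than
// the tolerance. The default of 0.02 is an arbitrary illustrative value.
function isRegression(baselineRate: number, currentRate: number, tolerance: number = 0.02): boolean {
  return currentRate - baselineRate > tolerance;
}
```

Comparing against a baseline from a known-good release, instead of a fixed threshold, helps separate output regressions from normal differences between user populations.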

Summary

You now understand how to tune voice output for perceived speed, clarity, and user control.

Next: Chapter 7: Advanced Patterns

How These Components Connect

flowchart LR
    A[Agent Text Response] --> B[Realtime TTS Engine]
    B --> C[PCM Audio Stream]
    C --> D[WebRTC Transport]
    D --> E[Browser Audio Output]
    F[Interrupt Signal] --> G[Cancel Ongoing Audio]
    G --> D
