layout: default
title: "Chapter 6: Voice Output"
nav_order: 6
parent: OpenAI Realtime Agents Tutorial

Chapter 6: Voice Output

Welcome to Chapter 6: Voice Output. In this part of the OpenAI Realtime Agents Tutorial: Voice-First AI Systems, you will first build a mental model of the output pipeline, then move into barge-in handling, latency metrics, and practical production tradeoffs.

Voice output quality is primarily a timing and interaction problem. Good prosody helps, but responsiveness and interruption behavior matter more.

Learning Goals

By the end of this chapter, you should be able to:

  • design low-latency output streaming behavior
  • handle barge-in cleanly without losing conversation continuity
  • monitor core audio response metrics
  • tune output policy for different use cases

Output Pipeline

  1. response deltas are generated
  2. audio is synthesized/streamed
  3. client buffers and plays frames
  4. playback is interrupted or completed
  5. session state is updated for next turn
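
The five stages above can be read as a small per-response state machine on the client. The following is a minimal sketch under stated assumptions, not the tutorial project's implementation: the names `OutputState` and `ResponsePlayback` are hypothetical, audio frames are plain byte strings, and the two-frame prebuffer is an arbitrary illustrative value.

```python
from collections import deque
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class OutputState(Enum):
    GENERATING = "generating"    # steps 1-2: deltas arriving, audio streaming in
    PLAYING = "playing"          # step 3: client is playing buffered frames
    INTERRUPTED = "interrupted"  # step 4: playback was cut short
    COMPLETED = "completed"      # step 4: playback finished normally


@dataclass
class ResponsePlayback:
    """Tracks one spoken response from first delta to end of turn (hypothetical helper)."""
    prebuffer_frames: int = 2    # frames to hold before playback starts (assumed value)
    state: OutputState = OutputState.GENERATING
    frames: deque = field(default_factory=deque)
    played: int = 0
    stream_done: bool = False

    def on_audio_delta(self, frame: bytes) -> None:
        if self.state in (OutputState.INTERRUPTED, OutputState.COMPLETED):
            return  # drop late frames once the turn has ended
        self.frames.append(frame)
        if self.state is OutputState.GENERATING and len(self.frames) >= self.prebuffer_frames:
            self.state = OutputState.PLAYING

    def on_stream_done(self) -> None:
        self.stream_done = True
        if self.state in (OutputState.GENERATING, OutputState.PLAYING):
            # Short responses may never fill the prebuffer; play what exists.
            self.state = OutputState.PLAYING if self.frames else OutputState.COMPLETED

    def next_frame(self) -> Optional[bytes]:
        """Called by the audio callback; returns the next frame, or None if idle."""
        if self.state is not OutputState.PLAYING or not self.frames:
            return None  # not playing yet, or a buffer underrun
        self.played += 1
        frame = self.frames.popleft()
        if not self.frames and self.stream_done:
            self.state = OutputState.COMPLETED
        return frame

    def interrupt(self) -> None:
        if self.state in (OutputState.GENERATING, OutputState.PLAYING):
            self.frames.clear()
            self.state = OutputState.INTERRUPTED
```

Step 5 (updating session state for the next turn) would read `state` and `played` once the response reaches `COMPLETED` or `INTERRUPTED`. A larger prebuffer trades time to first audio for resistance to stutter.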

Voice UX Rules of Thumb

  • prefer short, direct phrasing
  • avoid dense list-heavy answers in speech mode
  • announce long actions briefly before tool calls
  • use natural checkpoint phrases for easier interruption
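
One way to apply the short-phrasing and checkpoint rules is to group a reply into sentence-sized chunks before it is spoken, so that playback can stop cleanly at a chunk boundary. This is a rough sketch: `split_for_speech` and the 120-character default are illustrative assumptions, and the sentence splitter is deliberately naive (it will mis-split abbreviations such as "Dr.").

```python
import re


def split_for_speech(text: str, max_chars: int = 120) -> list[str]:
    """Group sentences into short chunks that each end at a natural pause."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    chunks: list[str] = []
    for sentence in sentences:
        # Merge with the previous chunk only while it stays under the limit.
        if chunks and len(chunks[-1]) + 1 + len(sentence) <= max_chars:
            chunks[-1] = f"{chunks[-1]} {sentence}"
        else:
            chunks.append(sentence)
    return chunks
```

A single sentence longer than `max_chars` is kept whole rather than cut mid-phrase, since a mid-sentence break sounds worse than a slightly long chunk.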

Barge-In Behavior

When the user speaks during playback:

  • stop playback immediately
  • mark the current response state as interrupted
  • prioritize the event path for the next user input
  • ensure the transcript and session state remain coherent after the cutover
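
The four barge-in steps can be sketched as a single handler. This is an illustration under assumptions, not the project's code: `ActiveResponse` and `BargeInHandler` are hypothetical names, and events are queued in a list rather than sent over a real connection. The event shapes `response.cancel` and `conversation.item.truncate` (with `item_id`, `content_index`, `audio_end_ms`) are written as they are understood from the Realtime API client events, so check them against the current API reference before relying on them.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ActiveResponse:
    item_id: str                  # id of the assistant audio item being played
    played_ms: int = 0            # audio the user has actually heard so far
    status: str = "in_progress"


@dataclass
class BargeInHandler:
    playback_queue: list = field(default_factory=list)  # frames not yet played
    outbound: list = field(default_factory=list)        # events queued for the server

    def on_user_speech_started(self, response: Optional[ActiveResponse]) -> None:
        if response is None or response.status != "in_progress":
            return  # nothing is playing, or this barge-in was already handled
        # 1. Stop playback immediately by dropping everything not yet played.
        self.playback_queue.clear()
        # 2. Mark the current response as interrupted.
        response.status = "interrupted"
        # 3. Ask the server to stop generating so the new user turn takes priority.
        self.outbound.append({"type": "response.cancel"})
        # 4. Keep the transcript coherent: trim the item to what was heard.
        self.outbound.append({
            "type": "conversation.item.truncate",
            "item_id": response.item_id,
            "content_index": 0,
            "audio_end_ms": response.played_ms,
        })
```

Clearing the local queue before sending anything to the server keeps the stop latency independent of the network round trip.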

Latency Targets (Product-Dependent)

| Metric | Why It Matters |
| --- | --- |
| time to first audio | user perceived responsiveness |
| interruption stop latency | user sense of control |
| full response completion latency | overall task pacing |
| playback error rate | trust and reliability |
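
The metrics in this table can be derived from a handful of client-side timestamps per turn. The sketch below assumes a hypothetical `TurnTimestamps` record with millisecond values. The field names, and the choice to measure from the end of user speech, are illustrative assumptions rather than definitions from the tutorial project.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TurnTimestamps:
    """Millisecond timestamps for one assistant turn (hypothetical record)."""
    user_speech_end: float
    first_audio_played: Optional[float] = None
    playback_end: Optional[float] = None        # set only when playback ran to the end
    interrupt_detected: Optional[float] = None
    playback_stopped: Optional[float] = None
    playback_error: bool = False


def _delta(end: Optional[float], start: Optional[float]) -> Optional[float]:
    """Return end - start, or None if either timestamp is missing."""
    if end is None or start is None:
        return None
    return end - start


def turn_metrics(t: TurnTimestamps) -> dict:
    return {
        "time_to_first_audio_ms": _delta(t.first_audio_played, t.user_speech_end),
        "interruption_stop_latency_ms": _delta(t.playback_stopped, t.interrupt_detected),
        "full_response_completion_ms": _delta(t.playback_end, t.user_speech_end),
    }


def playback_error_rate(turns: list) -> float:
    """Fraction of turns flagged with a playback error; 0.0 for no turns."""
    if not turns:
        return 0.0
    return sum(1 for t in turns if t.playback_error) / len(turns)
```

Reporting percentiles (for example p50 and p95) across many turns is usually more informative than a mean, because a small share of slow turns dominates how responsive the agent feels.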

Output Regression Signals

  • rising interruption dissatisfaction despite stable model quality
  • increased repeated-user prompts ("hello?", "are you there?")
  • higher manual retry rates for basic interactions
  • audible clipping or stutter under normal network conditions
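
The repeated-prompt signal is straightforward to track from user transcripts. The following sketch counts utterances that consist only of a presence check and compares the rate to a baseline. The phrase list (including "can you hear me", which does not appear in the list above), the function names, and the 0.02 tolerance are all illustrative assumptions.

```python
import re

# Utterances made up solely of a presence check, ignoring trailing punctuation.
PRESENCE_CHECK = re.compile(
    r"^\s*(hello|are you there|can you hear me)\W*$", re.IGNORECASE
)


def presence_check_rate(user_utterances: list[str]) -> float:
    """Fraction of user utterances that are only a presence check."""
    if not user_utterances:
        return 0.0
    hits = sum(1 for u in user_utterances if PRESENCE_CHECK.match(u))
    return hits / len(user_utterances)


def is_regression(current: float, baseline: float, tolerance: float = 0.02) -> bool:
    """Flag a regression when the rate exceeds the baseline by more than the tolerance."""
    return current - baseline > tolerance
```

Anchoring the pattern to the whole utterance avoids counting ordinary greetings such as "Hello there, what's my balance?" as a sign of silence.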

Summary

You now understand how to tune voice output for perceived speed, clarity, and user control.

Next: Chapter 7: Advanced Patterns

How These Components Connect

flowchart LR
    A[Agent Text Response] --> B[Realtime TTS Engine]
    B --> C[PCM Audio Stream]
    C --> D[WebRTC Transport]
    D --> E[Browser Audio Output]
    F[Interrupt Signal] --> G[Cancel Ongoing Audio]
    G --> D