---
layout: default
title: "Chapter 3: Voice Input Processing"
nav_order: 3
parent: OpenAI Realtime Agents Tutorial
---
Welcome to Chapter 3: Voice Input Processing. In this part of *OpenAI Realtime Agents Tutorial: Voice-First AI Systems*, you will first build an intuitive mental model, then move into concrete implementation details and practical production tradeoffs.
Input quality and turn-boundary accuracy are the biggest predictors of perceived voice-agent quality.
By the end of this chapter, you should be able to:
- design a robust audio input pipeline
- tune voice activity detection (VAD) for your environment
- handle interruption and partial-turn scenarios correctly
- track metrics that reveal input regressions early
A typical voice input pipeline moves through these stages:
- microphone capture
- buffering and chunk framing
- optional preprocessing (normalization/noise reduction)
- VAD-based turn detection
- commit audio segment to session
- begin response generation
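The buffering and framing stage above can be sketched as a small helper that slices a PCM16 stream into fixed-duration chunks before commit. The sample rate and chunk duration here are illustrative assumptions, not SDK requirements:

```typescript
// Sketch: frame a PCM16 stream into fixed-duration chunks before commit.
// SAMPLE_RATE and CHUNK_MS are illustrative; tune them to your transport.
const SAMPLE_RATE = 24_000;                            // samples per second
const CHUNK_MS = 40;                                   // frame duration per chunk
const SAMPLES_PER_CHUNK = (SAMPLE_RATE * CHUNK_MS) / 1000;

function frameAudio(samples: Int16Array): Int16Array[] {
  const chunks: Int16Array[] = [];
  for (let i = 0; i + SAMPLES_PER_CHUNK <= samples.length; i += SAMPLES_PER_CHUNK) {
    chunks.push(samples.subarray(i, i + SAMPLES_PER_CHUNK));
  }
  // A trailing partial frame stays buffered until more audio arrives.
  return chunks;
}
```

Fixed-size frames keep downstream stages (preprocessing, VAD, commit) simple and make dropped-frame accounting straightforward.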
Three turn-detection modes cover most deployments:

| Mode | Best For | Risk |
|---|---|---|
| automatic VAD | consumer voice UX with minimal friction | clipping in noisy environments if tuned poorly |
| push-to-talk | controlled enterprise or noisy contexts | higher user interaction cost |
| hybrid | mixed environments and advanced clients | more implementation complexity |
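The first two modes map onto session configuration roughly as follows. This is a sketch: the `server_vad` shape mirrors the Realtime session API's `turn_detection` field at the time of writing, but the exact field names and defaults should be checked against current documentation:

```typescript
// Turn-detection config sketch. Field names follow the Realtime session API's
// `turn_detection` object; threshold and timing values are illustrative.
type TurnDetection =
  | {
      type: "server_vad";
      threshold: number;           // energy sensitivity (0–1)
      prefix_padding_ms: number;   // audio kept before detected speech start
      silence_duration_ms: number; // silence required to end the turn
    }
  | null;                          // null = no server VAD (push-to-talk)

function turnDetectionFor(mode: "vad" | "ptt"): TurnDetection {
  if (mode === "ptt") return null; // client commits audio explicitly
  return {
    type: "server_vad",
    threshold: 0.5,
    prefix_padding_ms: 300,
    silence_duration_ms: 500,
  };
}
```

A hybrid client can switch between the two at runtime by updating the session with one shape or the other.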
When user speech starts while the assistant is speaking:
- stop output quickly
- preserve minimal state needed for continuity
- commit new user input immediately
- avoid long blocking operations before acknowledgement
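The barge-in steps above can be sketched as an ordered handler. `VoiceSession` is a hypothetical minimal interface for illustration, not the SDK's real client; the actual calls in your transport layer will differ:

```typescript
// Barge-in sketch. `VoiceSession` is a hypothetical interface; the ordering
// of the three calls is the point, not the method names.
interface VoiceSession {
  cancelResponse(): void;   // stop assistant output quickly
  truncatePlayback(): void; // drop queued audio so it cannot resume later
  commitUserAudio(): void;  // hand the new user turn to the model
}

function onUserSpeechStart(session: VoiceSession): void {
  session.cancelResponse();   // 1. stop output before anything else
  session.truncatePlayback(); // 2. preserve only the state needed for continuity
  session.commitUserAudio();  // 3. commit new input; no blocking work first
}
```

Keeping this path free of awaits and heavy work is what makes interruption feel instant to the user.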
Regardless of mode, harden ingestion with these safeguards:
- enforce the expected sample format at ingestion
- cap maximum segment duration to prevent oversized turns
- detect prolonged silence and reset capture state gracefully
- log dropped frames and jitter indicators
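The duration and silence safeguards can be condensed into a single guard decision per frame. The caps below are illustrative assumptions to tune per product:

```typescript
// Segment guard sketch. Both limits are illustrative, not recommendations.
const MAX_SEGMENT_MS = 30_000;   // cap to prevent oversized turns
const SILENCE_RESET_MS = 10_000; // prolonged silence triggers a capture reset

type GuardAction = "accept" | "force-commit" | "reset";

function guardSegment(segmentMs: number, silentMs: number): GuardAction {
  if (segmentMs >= MAX_SEGMENT_MS) return "force-commit"; // oversized turn
  if (silentMs >= SILENCE_RESET_MS) return "reset";       // stale capture state
  return "accept";
}
```

Running this check on every frame keeps a stuck microphone or an abandoned session from silently accumulating audio.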
Common pitfalls, their user impact, and mitigations:

| Pitfall | User Impact | Mitigation |
|---|---|---|
| aggressive VAD | clipped speech and repeated clarifications | relax sensitivity and add hysteresis |
| conservative VAD | laggy turn transitions | reduce release delay |
| no interruption support | assistant talks over user | prioritize barge-in cancellation path |
| poor noise handling | wrong intent extraction | add preprocessing and environment presets |
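The "add hysteresis" mitigation can be sketched as a VAD with separate attack/release thresholds plus a hold time, so brief energy dips don't clip the end of a word. All thresholds and timings below are illustrative:

```typescript
// Hysteresis VAD sketch: enter "speaking" above an attack threshold, leave it
// only after sustained silence below a lower release threshold.
class HysteresisVad {
  private speaking = false;
  private silentMs = 0;

  constructor(
    private attack = 0.6,  // energy needed to enter "speaking"
    private release = 0.4, // energy below which silence accumulates
    private holdMs = 300,  // silence required before leaving "speaking"
  ) {}

  // energy: normalized frame energy (0–1); frameMs: frame duration
  update(energy: number, frameMs: number): boolean {
    if (!this.speaking) {
      if (energy >= this.attack) {
        this.speaking = true;
        this.silentMs = 0;
      }
    } else if (energy < this.release) {
      this.silentMs += frameMs;
      if (this.silentMs >= this.holdMs) this.speaking = false;
    } else {
      this.silentMs = 0; // speech resumed; cancel pending release
    }
    return this.speaking;
  }
}
```

The gap between `attack` and `release` is what prevents rapid flapping at the decision boundary; `holdMs` is the release delay the pitfalls table refers to.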
Track these input metrics to surface regressions early:
- speech-start to commit latency
- clipped-turn rate
- interruption success rate
- speech-to-first-token latency
- retry rate after misunderstood turns
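A minimal tracker for two of the metrics above (clipped-turn rate and a commit-latency percentile) might look like this; the class and method names are illustrative:

```typescript
// Minimal input-metrics tracker sketch; names mirror the metric list above.
class InputMetrics {
  private turns = 0;
  private clipped = 0;
  private commitLatencies: number[] = [];

  recordTurn(commitLatencyMs: number, wasClipped: boolean): void {
    this.turns++;
    if (wasClipped) this.clipped++;
    this.commitLatencies.push(commitLatencyMs);
  }

  clippedTurnRate(): number {
    return this.turns === 0 ? 0 : this.clipped / this.turns;
  }

  p95CommitLatencyMs(): number {
    const sorted = [...this.commitLatencies].sort((a, b) => a - b);
    if (sorted.length === 0) return 0;
    return sorted[Math.min(sorted.length - 1, Math.floor(0.95 * sorted.length))];
  }
}
```

Percentiles matter more than averages here: a p95 regression in speech-start-to-commit latency is usually felt by users long before the mean moves.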
You now have a robust input architecture pattern that supports low-latency conversation without sacrificing turn accuracy.
Next: Chapter 4: Conversational AI
The `App` function in `src/app/App.tsx` implements a key part of this chapter's functionality:
```tsx
import { useHandleSessionHistory } from "./hooks/useHandleSessionHistory";
// (other imports elided)

function App() {
  const searchParams = useSearchParams()!;

  // ---------------------------------------------------------------------
  // Codec selector – lets you toggle between wide-band Opus (48 kHz)
  // and narrow-band PCMU/PCMA (8 kHz) to hear what the agent sounds like on
  // a traditional phone line and to validate ASR / VAD behaviour under that
  // constraint.
  //
  // We read the `?codec=` query-param and rely on the `changePeerConnection`
  // hook (configured in `useRealtimeSession`) to set the preferred codec
  // before the offer/answer negotiation.
  // ---------------------------------------------------------------------
  const urlCodec = searchParams.get("codec") || "opus";
  // The Agents SDK doesn't currently support codec selection, so it is forced
  // via a global codecPatch at module load.

  const {
    addTranscriptMessage,
    addTranscriptBreadcrumb,
  } = useTranscript();

  const { logClientEvent, logServerEvent } = useEvent();

  const [selectedAgentName, setSelectedAgentName] = useState<string>("");
  const [selectedAgentConfigSet, setSelectedAgentConfigSet] = useState<
    RealtimeAgent[] | null
  >(null);

  const audioElementRef = useRef<HTMLAudioElement | null>(null);
  // ...
}
```

This function matters because it wires together the input-side concerns of this chapter: codec selection for the audio path, transcript state, event logging, and the audio element used for playback.
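The codec preference itself is typically applied by reordering the audio transceiver's codec capabilities before negotiation (WebRTC's `setCodecPreferences` takes the list in priority order). The `preferCodec` helper below is a hypothetical sketch of that reordering step; wiring it into the repo's `changePeerConnection` hook is project-specific:

```typescript
// Sketch: reorder a codec capability list so the preferred codec comes first.
// The result would be passed to RTCRtpTransceiver.setCodecPreferences() in a
// browser; here we only model the pure reordering.
interface CodecLike {
  mimeType: string; // e.g. "audio/opus", "audio/PCMU"
}

function preferCodec<T extends CodecLike>(codecs: T[], name: string): T[] {
  const wanted = codecs.filter(
    (c) => c.mimeType.toLowerCase() === `audio/${name.toLowerCase()}`,
  );
  const rest = codecs.filter(
    (c) => c.mimeType.toLowerCase() !== `audio/${name.toLowerCase()}`,
  );
  return [...wanted, ...rest]; // preferred codec(s) first
}
```

Testing with narrow-band PCMU/PCMA is worthwhile because VAD thresholds tuned on 48 kHz Opus often misbehave on 8 kHz telephone audio.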
The `Transcript` component in `src/app/components/Transcript.tsx` handles a key part of this chapter's functionality:
```tsx
import React, { useEffect, useRef, useState } from "react";
import ReactMarkdown from "react-markdown";
import { TranscriptItem } from "@/app/types";
import Image from "next/image";
import { useTranscript } from "@/app/contexts/TranscriptContext";
import { DownloadIcon, ClipboardCopyIcon } from "@radix-ui/react-icons";
import { GuardrailChip } from "./GuardrailChip";

export interface TranscriptProps {
  userText: string;
  setUserText: (val: string) => void;
  onSendMessage: () => void;
  canSend: boolean;
  downloadRecording: () => void;
}

function Transcript({
  userText,
  setUserText,
  onSendMessage,
  canSend,
  downloadRecording,
}: TranscriptProps) {
  const { transcriptItems, toggleTranscriptItemExpand } = useTranscript();
  const transcriptRef = useRef<HTMLDivElement | null>(null);
  const [prevLogs, setPrevLogs] = useState<TranscriptItem[]>([]);
  const [justCopied, setJustCopied] = useState(false);
  const inputRef = useRef<HTMLInputElement | null>(null);
  // ...
}
```
The `scrollToBottom` helper in `src/app/components/Transcript.tsx`, together with the effect that calls it, keeps the transcript pinned to the latest turn:
```tsx
function scrollToBottom() {
  if (transcriptRef.current) {
    transcriptRef.current.scrollTop = transcriptRef.current.scrollHeight;
  }
}

useEffect(() => {
  const hasNewMessage = transcriptItems.length > prevLogs.length;
  const hasUpdatedMessage = transcriptItems.some((newItem, index) => {
    const oldItem = prevLogs[index];
    return (
      oldItem &&
      (newItem.title !== oldItem.title || newItem.data !== oldItem.data)
    );
  });

  if (hasNewMessage || hasUpdatedMessage) {
    scrollToBottom();
  }

  setPrevLogs(transcriptItems);
}, [transcriptItems]);

// Autofocus on text box input on load
useEffect(() => {
  if (canSend && inputRef.current) {
    inputRef.current.focus();
  }
}, [canSend]);
```
The `TranscriptProps` interface in `src/app/components/Transcript.tsx` defines the contract between the transcript view and its parent:

```tsx
export interface TranscriptProps {
  userText: string;
  setUserText: (val: string) => void;
  onSendMessage: () => void;
  canSend: boolean;
  downloadRecording: () => void;
}
```
The pieces covered in this chapter relate as follows:

```mermaid
flowchart TD
  A[App]
  B[Transcript]
  C[scrollToBottom]
  D[TranscriptProps]
  E[useRealtimeSession]
  A --> B
  B --> C
  C --> D
  D --> E
```