<p><b>Abstract.</b>Full-duplex voice interaction enables users and agents to speak simultaneously, supporting barge-in for lifelike dialogue, and is critical for AI assistants and customer service. Existing approaches are either end-to-end models that handle turn-taking, complex to design and hard to control, or modular pipelines governed by turn-taking controllers that upgrade existing systems and allow per-module optimization. Prior frameworks embrace modularity but depend on non-open components and external model providers, impeding holistic optimization. In this work, we present a complete, practical full-duplex system comprising a turn-taking controller, an interaction module, and a dialogue manager. The controller integrates personalized VAD (pVAD) that suppresses false barge-ins from background noise and non-primary speakers, accurately timestamps primary-speaker segments, and explicitly enables barge-in triggered by the primary speaker; a semantic end-of-turn (EoT) detector improves stop decisions. With this controller, heterogeneous half-duplex pipelines, cascaded, semi-cascaded, or speech-to-speech, are seamlessly upgraded to full duplex. With our internal models, we implement cascaded and semi-cascaded variants: the former benefits from mature deployment; the latter perceives emotional and paralinguistic cues, yields more coherent responses, reduces latency and error propagation, and improves robustness. A dialogue manager extends capabilities via tool invocation and context management. We further propose three system-level metrics, barge-in, end-of-turn detection accuracy, and end-to-end latency, to assess naturalness, control accuracy, and efficiency of full-duplex interaction, with the aim of guiding subsequent improvements.
72
+
<p><b>Abstract.</b> Full-duplex voice interaction allows users and agents to speak simultaneously with controllable barge-in, enabling lifelike assistants and customer service. Existing solutions are either end-to-end models, which are difficult to design and hard to control, or modular pipelines governed by turn-taking controllers, which ease upgrades and per-module optimization; however, prior modular frameworks depend on non-open components and external providers, limiting holistic optimization. In this work, we present a complete, practical full-duplex voice interaction system comprising a turn-taking controller, an interaction module, and a dialogue manager. The controller integrates streaming personalized voice activity detection (pVAD) to suppress false barge-ins from noise and non-primary speakers, precisely timestamp primary-speaker segments, and explicitly enable primary-speaker barge-ins; a semantic end-of-turn detector improves stop decisions. The controller upgrades heterogeneous half-duplex pipelines (cascaded, semi-cascaded, and speech-to-speech) to full duplex. Using internal models, we implement cascaded and semi-cascaded variants; the semi-cascaded one captures emotional and paralinguistic cues, yields more coherent responses, lowers latency and error propagation, and improves robustness. A dialogue manager extends capabilities via tool invocation and context management. We also propose three system-level metrics (barge-in, end-of-turn detection accuracy, and end-to-end latency) to assess naturalness, control accuracy, and efficiency. Experiments show fewer false interruptions, more accurate semantic end-of-turn detection, and lower latency approaching that of industrial systems, enabling robust, natural, real-time full-duplex interaction.