Commit dd9f7d0

lijunjie committed
chat demo
1 parent 6b0f091 commit dd9f7d0

9 files changed

Lines changed: 18 additions & 27 deletions

.DS_Store

0 Bytes
Binary file not shown.

demos/firered_chat/index.html

Lines changed: 15 additions & 27 deletions
@@ -6,7 +6,7 @@
 <meta name="generator" content="Hugo 0.88.1" />
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <link href="https://fonts.googleapis.com/css?family=Roboto:300,400,700" rel="stylesheet" type="text/css">
-<link rel="stylesheet" href="" https://cdnjs.cloudflare.com/ajax/libs/highlight.js/8.4/styles/github.min.css">
+<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/8.4/styles/github.min.css">
 <link rel="stylesheet" href="css/custom.css">
 <link rel="stylesheet" href="css/normalize.css">

@@ -58,21 +58,23 @@
 <p style="text-align: left;">
 </p>
 <div class="text-center">
-<h2>FireRedChat: Toward Lifelike Full-Duplex Voice Interaction: A Pluggable System with Cascaded and Semi-Cascaded Implementations</h2>
-[<a href="http://arxiv.org/abs/2501.14350">Paper</a>]
-[<a href="https://github.com/FireRedTeam/FireRedASR">Code</a>]
+<h2>FireRedChat: A Pluggable, Full-Duplex Voice Interaction System with Cascaded and Semi-Cascaded Implementations</h2>
+[<a href="http://arxiv.org/abs/2501.14350" target="_blank">Paper</a>]
+[<a href="https://github.com/FireRedTeam/FireRedASR" target="_blank">Code</a>]
+[<a href="https://firered-chat.xiaohongshu.com" target="_blank">Try FireRedChat Online</a>]

 <p class="fst-italic mb-0">
 <br>
-<b><a href="https://fireredteam.github.io">FireRed Team</a></b>
+<b><a href="https://fireredteam.github.io" target="_blank">FireRed Team</a></b>
 <p></p>
 </p>
 </div>
-<p><b>Abstract.</b>Full-duplex voice interaction enables users and agents to speak simultaneously, supporting barge-in for lifelike dialogue, and is critical for AI assistants and customer service. Existing approaches are either end-to-end models that handle turn-taking, complex to design and hard to control, or modular pipelines governed by turn-taking controllers that upgrade existing systems and allow per-module optimization. Prior frameworks embrace modularity but depend on non-open components and external model providers, impeding holistic optimization. In this work, we present a complete, practical full-duplex system comprising a turn-taking controller, an interaction module, and a dialogue manager. The controller integrates personalized VAD (pVAD) that suppresses false barge-ins from background noise and non-primary speakers, accurately timestamps primary-speaker segments, and explicitly enables barge-in triggered by the primary speaker; a semantic end-of-turn (EoT) detector improves stop decisions. With this controller, heterogeneous half-duplex pipelines, cascaded, semi-cascaded, or speech-to-speech, are seamlessly upgraded to full duplex. With our internal models, we implement cascaded and semi-cascaded variants: the former benefits from mature deployment; the latter perceives emotional and paralinguistic cues, yields more coherent responses, reduces latency and error propagation, and improves robustness. A dialogue manager extends capabilities via tool invocation and context management. We further propose three system-level metrics, barge-in, end-of-turn detection accuracy, and end-to-end latency, to assess naturalness, control accuracy, and efficiency of full-duplex interaction, with the aim of guiding subsequent improvements.
+<p><b>Abstract.</b> Full-duplex voice interaction allows users and agents to speak simultaneously with controllable barge-in, enabling lifelike assistants and customer service. Existing solutions are either end-to-end, difficult to design and hard to control, or modular pipelines governed by turn-taking controllers that ease upgrades and per-module optimization; however, prior modular frameworks depend on non-open components and external providers, limiting holistic optimization. In this work, we present a complete, practical full-duplex voice interaction system comprising a turn-taking controller, an interaction module, and a dialogue manager. The controller integrates streaming personalized VAD (pVAD) to suppress false barge-ins from noise and non-primary speakers, precisely timestamp primary-speaker segments, and explicitly enable primary-speaker barge-ins; a semantic end-of-turn detector improves stop decisions. It upgrades heterogeneous half-duplex pipelines, cascaded, semi-cascaded, and speech-to-speech, to full duplex. Using internal models, we implement cascaded and semi-cascaded variants; the semi-cascaded one captures emotional and paralinguistic cues, yields more coherent responses, lowers latency and error propagation, and improves robustness. A dialogue manager extends capabilities via tool invocation and context management. We also propose three system-level metrics, barge-in, end-of-turn detection accuracy, and end-to-end latency, to assess naturalness, control accuracy, and efficiency. Experiments show fewer false interruptions, more accurate semantic ends, and lower latency approaching industrial systems, enabling robust, natural, real-time full-duplex interaction.
 </p>
 <p>
 <b>Contents</b>
 <ul>
+<li><a href="#Demo">Demo</a></li>
 <li><a href="#system-overview">System Overview</a></li>
 <li><a href="#workflow-overview">Workflow Overview</a></li>
 <li><a href="#config">Configurations between different systems.</a></li>
@@ -86,77 +88,63 @@ <h2>FireRedChat: Toward Lifelike Full-Duplex Voice Interaction: A Pluggable Syst

 <div class="container pt-5 mt-5 shadow-lg p-5 mb-5 bg-white rounded">
 <h2 id="Demo" style="text-align: center;">Demo</h2>
-<body>
 <p style="text-align: center;">
-<video src="" height="1200" width="1200"></video>
+<video src="video/chat.mp4" controls style="max-width:100%; height:auto;"></video>
 </p>
-</body>
 <!-- <p style="text-align: center;">
 <b>Figure 1.</b> FireRedChat System Modules.
 </p> -->
 </div>
 <div class="container pt-5 mt-5 shadow-lg p-5 mb-5 bg-white rounded">
 <h2 id="system-overview" style="text-align: center;">System Overview</h2>
-<body>
 <p style="text-align: center;">
-<img src="pics/arc.png" height="1200" width="1200">
+<img src="pics/arc.png" style="max-width:90%; height:auto;">
 </p>
-</body>
 <p style="text-align: center;">
 <b>Figure 1.</b> FireRedChat System Modules.
 </p>
 </div>
 <div class="container pt-5 mt-5 shadow-lg p-5 mb-5 bg-white rounded">
 <h2 id="workflow-overview" style="text-align: center;">Workflow</h2>
-<body>
 <p style="text-align: center;">
-<img src="pics/flow.png" height="1200" width="1200">
+<img src="pics/flow.png" style="max-width:60%; height:auto;">
 </p>
-</body>
 <p style="text-align: center;">
 <b>Figure 3.</b> FireRedChat Voice Interaction Flow.
 </p>
 </div>
 <div class="container pt-5 mt-5 shadow-lg p-5 mb-5 bg-white rounded">
 <h2 id="config" style="text-align: center;">Configurations Between Different Systems</h2>
-<body>
 <p style="text-align: center;">
-<img src="pics/sys_config.png" height="1200" width="1200">
+<img src="pics/sys_config.png" style="max-width:40%; height:auto;">
 </p>
-</body>
 <p style="text-align: center;">
 <b>Table 1.</b> Configurations between different systems.
 </p>
 </div>
 <div class="container pt-5 mt-5 shadow-lg p-5 mb-5 bg-white rounded">
 <h2 id="barge-in" style="text-align: center;">Barge-In Evaluation Results</h2>
-<body>
 <p style="text-align: center;">
-<img src="pics/cer2.png" height="1200" width="1200">
+<img src="pics/exp_barge_in.png" style="max-width:45%; height:auto;">
 </p>
-</body>
 <p style="text-align: center;">
 <b>Table 2.</b> Barge-In evaluation results.
 </p>
 </div>
 <div class="container pt-5 mt-5 shadow-lg p-5 mb-5 bg-white rounded">
 <h2 id="EoT" style="text-align: center;">End-of-turn Detection Evaluation Results</h2>
-<body>
 <p style="text-align: center;">
-<img src="pics/exp_eot.png" height="1200" width="1200">
+<img src="pics/exp_eot.png" style="max-width:85%; height:auto;">
 </p>
-</body>
 <p style="text-align: center;">
 <b>Table 3.</b> End-of-turn Detection evaluation results.
 </p>
 </div>
 <div class="container pt-5 mt-5 shadow-lg p-5 mb-5 bg-white rounded">
 <h2 id="Latency" style="text-align: center;">Latency of Different Systems</h2>
-<body>
 <p style="text-align: center;">
-<img src="pics/exp_latency.png" height="1200" width="1200">
+<img src="pics/exp_latency.png" style="max-width:45%; height:auto;">
 </p>
-</body>
 <p style="text-align: center;">
 <b>Table 4.</b> Latency of different systems.
 </p>
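
Two of the markup problems this diff removes, `<body>` tags repeated inside the section containers and a `<video>` with an empty `src`, are the kind of thing a small script can catch before a page is published. The following is a minimal sketch using Python's standard `html.parser`; it is not part of the commit. The class name `MarkupChecker`, the function `check_markup`, and the two test fragments are hypothetical, with the fragments only modeled on the removed and added lines above. It does not attempt to detect the stray `""` in the old stylesheet `href`.

```python
from html.parser import HTMLParser


class MarkupChecker(HTMLParser):
    """Collects two kinds of problems: repeated <body> tags and empty src values."""

    def __init__(self):
        super().__init__()
        self.body_tags = 0
        self.problems = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "body":
            self.body_tags += 1
            if self.body_tags > 1:
                self.problems.append("extra <body> start tag")
        # Flag src="" only; a missing src can be legitimate,
        # for example a <video> that uses <source> children.
        if tag in ("img", "video") and attr_map.get("src") == "":
            self.problems.append(f"<{tag}> has an empty src")


def check_markup(html_text):
    """Return a list of problem descriptions found in html_text."""
    checker = MarkupChecker()
    checker.feed(html_text)
    checker.close()
    return checker.problems


# Hypothetical fragments modeled on the removed and added lines of the diff.
OLD_FRAGMENT = (
    '<body><div class="container"><body><p>'
    '<video src="" height="1200" width="1200"></video>'
    "</p></body></div></body>"
)
NEW_FRAGMENT = (
    '<body><div class="container"><p>'
    '<video src="video/chat.mp4" controls></video>'
    "</p></div></body>"
)

print(check_markup(OLD_FRAGMENT))
print(check_markup(NEW_FRAGMENT))
```

The checker works on the tag stream rather than a parsed tree, so it reports what the author wrote instead of what a browser would silently repair.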

demos/firered_chat/pics/arc.png

129 KB
165 KB
88.1 KB
173 KB

demos/firered_chat/pics/flow.png

84.3 KB
123 KB

demos/firered_chat/video/chat.mp4

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:42968d2b454e1fb803448e1e51ab5def2e2f9d20a2622a969d81e068802fd020
+size 329866207
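
The three added lines are a Git LFS pointer, not the video itself: the repository stores only the spec version, a SHA-256 object ID, and the byte size, while the real `chat.mp4` lives in LFS storage. As an illustration only, the sketch below parses such a pointer with plain string handling; the function name `parse_lfs_pointer` and the returned dictionary keys are assumptions made for this example, not part of any Git LFS tooling.

```python
def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer file into a dict."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Each pointer line is a key, one space, then the value.
        key, _, value = line.partition(" ")
        fields[key] = value
    for required in ("version", "oid", "size"):
        if required not in fields:
            raise ValueError(f"missing pointer field: {required}")
    # The oid value has the form '<hash algorithm>:<hex digest>'.
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


# Pointer contents copied from the diff above.
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:42968d2b454e1fb803448e1e51ab5def2e2f9d20a2622a969d81e068802fd020
size 329866207
"""

info = parse_lfs_pointer(POINTER)
size_mib = info["size"] / (1024 * 1024)
print(info["hash_algo"], info["digest"][:12], f"{size_mib:.1f} MiB")
```

The size field shows the demo video is roughly 330 MB. A clone without Git LFS installed receives only this small text file at `video/chat.mp4`, in which case the `<video>` element on the page has nothing playable to load.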
