Commit 027c2ed

update abstract
xiekun committed
1 parent 93fb529 commit 027c2ed

1 file changed

Lines changed: 15 additions & 17 deletions

demos/firered_tts_2/index.html

@@ -81,23 +81,21 @@ <h2>FireRedTTS-2: Towards Long Conversational Speech Generation for Podcast and
 <p></p>
 </p>
 </div>
-<p><b>Abstract.</b> Conventional monologue text-to-speech systems can synthesis natural-sounding
-single-speaker utterances. They can be adapted to multi-speaker dialogue generation by
-segmenting the text into utterances and synthesizing each fragment,but their limited awareness
-of text and speech context often leads to incoherent prosody. Current dialogue-generation
-approaches typically require the full dialogue text upfront before synthesis and produce a
-single mixed speech containing all voices, which hinders their use in interactive chat
-scenarios. They also suffer from unstable synthesis, inaccurate speaker transitions and
-incoherent prosody. In this work, we present FireRedTTS-2, a conversational speech generation
-system well suited to downstream chat and podcast applications. It features a low frame rate
-streaming speech tokenizer with enhanced semantic representations and a dual transformer
-text-to-speech model operating on text-speech interleaving format. Its design allows for
-flexible sentence-by-sentence generation with first packet latency lower than 100ms.
-Experimental results show that FireRedTTS-2 integrates seamlessly into interactive chat
-frameworks, producing emotional speech response inferred from implicit context. It delivers more
-stable synthesis, more accurate speaker transitions, and more contextually coherent prosody,
-surpassing state-of-the-art dialogue generation models such as MoonCast, ZipVoice-Dialogue, and
-MOSS-TTSD.
+<p><b>Abstract.</b> Existing dialogue text-to-speech requires the full dialogue script and emits one
+monolithic waveform, blocking interactive extension and yielding unstable synthesis,
+speaker-transition errors, and incoherent prosody. In this work, we present FireRedTTS‑2, a
+long-form streaming TTS system for multi-speaker dialogue generation, delivering stable, natural
+speech with reliable speaker switching and context-aware prosody. A new 12.5Hz streaming speech
+tokenizer accelerates training and inference, extends maximum dialogue length, encodes richer
+semantics to stabilize text-to-token modeling and supports high-fidelity streaming generation
+for real-time applications. We adopt a text–speech interleaved format, concatenating
+speaker-labeled text with aligned speech tokens in chronological order, and model it with a
+dual-transformer: a large decoder-only transformer predicts tokens at the first layer, and a
+smaller one completes subsequent layers. Experimental results show that FireRedTTS‑2 integrates
+seamlessly with chat frameworks and, with minimal fine-tuning, produces emotionally expressive
+speech guided by implicit contextual cues. In podcast generation, it surpasses existing systems
+including MoonCast, Zipvoice-Dialogue, and MOSS-TTSD in objective intelligibility, speaker-turn
+reliability, and perceived naturalness with context-consistent prosody.
 </p>
 
 <p>
