@@ -81,23 +81,21 @@ <h2>FireRedTTS-2: Towards Long Conversational Speech Generation for Podcast and
8181 < p > </ p >
8282 </ p >
8383 </ div >
84- < p > < b > Abstract.</ b > Conventional monologue text-to-speech systems can synthesize natural-sounding
85- single-speaker utterances. They can be adapted to multi-speaker dialogue generation by
86- segmenting the text into utterances and synthesizing each fragment, but their limited awareness
87- of text and speech context often leads to incoherent prosody. Current dialogue-generation
88- approaches typically require the full dialogue text upfront before synthesis and produce a
89- single mixed speech containing all voices, which hinders their use in interactive chat
90- scenarios. They also suffer from unstable synthesis, inaccurate speaker transitions and
91- incoherent prosody. In this work, we present FireRedTTS-2, a conversational speech generation
92- system well suited to downstream chat and podcast applications. It features a low frame rate
93- streaming speech tokenizer with enhanced semantic representations and a dual transformer
94- text-to-speech model operating on text-speech interleaving format. Its design allows for
95- flexible sentence-by-sentence generation with first packet latency lower than 100ms.
96- Experimental results show that FireRedTTS-2 integrates seamlessly into interactive chat
97- frameworks, producing emotional speech response inferred from implicit context. It delivers more
98- stable synthesis, more accurate speaker transitions, and more contextually coherent prosody,
99- surpassing state-of-the-art dialogue generation models such as MoonCast, ZipVoice-Dialogue, and
100- MOSS-TTSD.
84+ < p > < b > Abstract.</ b > Existing dialogue text-to-speech requires the full dialogue script and emits one
85+ monolithic waveform, blocking interactive extension and yielding unstable synthesis,
86+ speaker-transition errors, and incoherent prosody. In this work, we present FireRedTTS‑2, a
87+ long-form streaming TTS system for multi-speaker dialogue generation, delivering stable, natural
88+ speech with reliable speaker switching and context-aware prosody. A new 12.5Hz streaming speech
89+ tokenizer accelerates training and inference, extends maximum dialogue length, encodes richer
90+ semantics to stabilize text-to-token modeling and supports high-fidelity streaming generation
91+ for real-time applications. We adopt a text–speech interleaved format, concatenating
92+ speaker-labeled text with aligned speech tokens in chronological order, and model it with a
93+ dual-transformer: a large decoder-only transformer predicts tokens at the first layer, and a
94+ smaller one completes subsequent layers. Experimental results show that FireRedTTS‑2 integrates
95+ seamlessly with chat frameworks and, with minimal fine-tuning, produces emotionally expressive
96+ speech guided by implicit contextual cues. In podcast generation, it surpasses existing systems
97+ including MoonCast, ZipVoice-Dialogue, and MOSS-TTSD in objective intelligibility, speaker-turn
98+ reliability, and perceived naturalness with context-consistent prosody.
10199 </ p >
102100
103101 < p >