@@ -234,17 +234,20 @@ <h3>MOSS-Audio-Tokenizer</h3>
 
 <h3>MOSS TTS Nano</h3>
 <p class="arch-copy">
-On top of the tokenizer, MOSS-TTS-Nano uses a single Transformer
-backbone with RVQ-aware delayed alignment to autoregressively
-predict text and audio tokens together. Each delayed step sums the
-embeddings from all RVQ layers, and the backbone output is sent
-directly to <strong>17 prediction heads</strong>: one text-or-pad
-head plus 16 audio heads.
+On top of the tokenizer, MOSS-TTS-Nano can adopt a hierarchical
+token modeling design built around a Local Transformer. Instead of
+using RVQ-aware temporal delays, the model sums the embeddings from
+all RVQ layers at each aligned time step and feeds that hidden
+state into a single Transformer backbone. The backbone then
+produces one global latent per step, which a lightweight
+autoregressive <strong>Local Transformer</strong> expands into the
+within-step token block, sequentially predicting one text-or-pad
+token and 16 RVQ audio tokens.
 </p>
 <div class="arch-chip-row">
-<span class="arch-chip">1 backbone</span>
-<span class="arch-chip">17 heads</span>
-<span class="arch-chip">simple decode path</span>
+<span class="arch-chip">100 M params</span>
+<span class="arch-chip">Local Transformer</span>
+<span class="arch-chip">Tiny, Fast and Powerful</span>
 </div>
 </div>
 
@@ -254,11 +257,6 @@ <h3>MOSS TTS Nano</h3>
 <section class="paper-section" id="demo">
 <h2>Demo</h2>
 
-<p class="section-note">
-Each card shows the prompt speech, the text to be spoken, and the
-generated output side-by-side.
-</p>
-
 <!-- Tab bar -->
 <div class="demo-tabs" role="tablist" aria-label="Language category">
 <button
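
The decode path described in the new `arch-copy` paragraph (sum the RVQ embeddings per step, run one backbone, then let a Local Transformer expand each global latent into 17 tokens) might be sketched roughly as below. This is a toy illustration, not the project's code: the sizes `HIDDEN` and `VOCAB`, the start codes, and the functions `backbone_step` and `local_decode` are hypothetical stand-ins, and plain arithmetic replaces the real Transformer layers. Only the counts (1 text-or-pad token plus 16 RVQ audio tokens per step) come from the diff.

```python
import random

NUM_RVQ_LAYERS = 16                   # from the description: 16 RVQ audio tokens per step
TOKENS_PER_STEP = 1 + NUM_RVQ_LAYERS  # one text-or-pad token plus 16 audio tokens
HIDDEN = 8                            # assumed toy hidden width
VOCAB = 32                            # assumed toy vocabulary size per slot

rng = random.Random(0)


def make_table(rows, cols):
    """Build a random rows x cols matrix as nested lists."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


# One embedding table per RVQ layer (hypothetical shapes).
rvq_tables = [make_table(VOCAB, HIDDEN) for _ in range(NUM_RVQ_LAYERS)]
# Embedding for tokens chosen inside a step, plus one output projection per slot.
slot_embed = make_table(VOCAB, HIDDEN)
slot_proj = [make_table(VOCAB, HIDDEN) for _ in range(TOKENS_PER_STEP)]


def sum_rvq_embeddings(codes):
    """Sum the embeddings of all RVQ layers for one aligned time step."""
    assert len(codes) == NUM_RVQ_LAYERS
    hidden = [0.0] * HIDDEN
    for table, code in zip(rvq_tables, codes):
        for i, value in enumerate(table[code]):
            hidden[i] += value
    return hidden


def backbone_step(state, hidden):
    """Stand-in for the backbone: a running average, not real attention."""
    return [0.5 * s + 0.5 * h for s, h in zip(state, hidden)]


def local_decode(global_latent):
    """Stand-in for the Local Transformer: greedy, one token at a time."""
    context = list(global_latent)
    block = []
    for slot in range(TOKENS_PER_STEP):
        logits = [sum(w * c for w, c in zip(row, context)) for row in slot_proj[slot]]
        token = max(range(VOCAB), key=logits.__getitem__)
        block.append(token)
        # Condition the remaining slots on the token just chosen.
        context = [c + e for c, e in zip(context, slot_embed[token])]
    return block


def generate(num_steps):
    """Produce num_steps token blocks of length TOKENS_PER_STEP each."""
    state = [0.0] * HIDDEN
    codes = [0] * NUM_RVQ_LAYERS  # assumed start codes
    steps = []
    for _ in range(num_steps):
        state = backbone_step(state, sum_rvq_embeddings(codes))
        block = local_decode(state)
        steps.append(block)
        codes = block[1:]  # the 16 audio tokens feed the next step
    return steps
```

Calling `generate(3)` returns three blocks of 17 token ids each. The point of the sketch is the control flow: the outer loop advances one time step per backbone call, while the inner loop is sequential within a step, in contrast to the removed design where 17 heads predicted in parallel from the backbone output.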