OpenMOSS
diff --git a/assets/moss-tts-nano.png b/assets/moss-tts-nano.png
1.18 MB
diff --git a/index.html b/index.html
Lines changed: 12 additions & 14 deletions
@@ -234,17 +234,20 @@ <h3>MOSS-Audio-Tokenizer</h3>
 
           <h3>MOSS TTS Nano</h3>
           <p class="arch-copy">
-            On top of the tokenizer, MOSS-TTS-Nano uses a single Transformer
-            backbone with RVQ-aware delayed alignment to autoregressively
-            predict text and audio tokens together. Each delayed step sums the
-            embeddings from all RVQ layers, and the backbone output is sent
-            directly to <strong>17 prediction heads</strong>: one text-or-pad
-            head plus 16 audio heads.
+            On top of the tokenizer, MOSS-TTS-Nano can adopt a hierarchical
+            token modeling design built around a Local Transformer. Instead of
+            using RVQ-aware temporal delays, the model sums the embeddings from
+            all RVQ layers at each aligned time step and feeds that hidden
+            state into a single Transformer backbone. The backbone then
+            produces one global latent per step, which a lightweight
+            autoregressive <strong>Local Transformer</strong> expands into the
+            within-step token block, sequentially predicting one text-or-pad
+            token and 16 RVQ audio tokens.
           </p>
           <div class="arch-chip-row">
-            <span class="arch-chip">1 backbone</span>
-            <span class="arch-chip">17 heads</span>
-            <span class="arch-chip">simple decode path</span>
+            <span class="arch-chip">100 M params</span>
+            <span class="arch-chip">Local Transformer</span>
+            <span class="arch-chip">Tiny, Fast and Powerful</span>
           </div>
         </div>
 
@@ -254,11 +257,6 @@ <h3>MOSS TTS Nano</h3>
       <section class="paper-section" id="demo">
         <h2>Demo</h2>
 
-        <p class="section-note">
-          Each card shows the prompt speech, the text to be spoken, and the
-          generated output side-by-side.
-        </p>
-
         <!-- Tab bar -->
         <div class="demo-tabs" role="tablist" aria-label="Language category">
           <button