VoiceChat EA STT training reproducible features #15558
ankitapasad wants to merge 2 commits into NVIDIA-NeMo:main
Conversation
- Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Ankita Pasad <apasad@nvidia.com>
- …ization, clean-up token ID init, and corresponding tests. Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Ankita Pasad <apasad@nvidia.com>
    import os

    import pytest
    import torch

Check notice: Code scanning / CodeQL: Unused import (test)
    assert (target_tokens == eos).sum().item() == 0, "skip_eos=True should not place any EOS"

    # Now collate source tokens, passing in the target channel for EOS placement
    source_tokens, source_token_lens = collate_token_channel(

Check notice: Code scanning / CodeQL: Unused local variable (test)
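The "Unused local variable" notice refers to `source_token_lens` being bound but never used afterwards. A minimal sketch of the usual fix, hedged: `collate_stub` below is a stand-in for the real `collate_token_channel`, whose full signature is not shown in this snippet.

```python
def collate_stub(tokens_batch, skip_eos=True):
    # Stand-in for collate_token_channel: returns (tokens, lengths).
    lens = [len(t) for t in tokens_batch]
    return tokens_batch, lens

# Binding the unused second return value to `_` silences the CodeQL
# notice while making the "intentionally discarded" intent explicit.
source_tokens, _ = collate_stub([[1, 2, 3], [4, 5]], skip_eos=True)
```

Alternatively, the test could assert on the lengths as well, which both uses the variable and strengthens the check.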
    skip_eos=True,
    )

    source_tokens, source_token_lens = collate_token_channel(

Check notice: Code scanning / CodeQL: Unused local variable (test)
    from nemo.collections.common.tokenizers import AutoTokenizer
    from nemo.collections.speechlm2.data.duplex_stt_dataset import DuplexSTTDataset
    from nemo.collections.speechlm2.data.utils import get_pad_id

Check notice: Code scanning / CodeQL: Unused import (test)
    train_batch = train_ds[cuts]
    val_batch = val_ds[cuts]

    train_targets = train_batch["audio_data"]["target_tokens"]

Check notice: Code scanning / CodeQL: Unused local variable (test)
    # Force aligner should be created but never called during validation
    val_ds.force_aligner = MagicMock()
    val_ds[cuts]

Check notice: Code scanning / CodeQL: Statement has no effect (test)
    # Mock the force aligner to avoid loading wav2vec2
    train_ds.force_aligner = MagicMock()
    train_ds.force_aligner.batch_force_align_user_audio.side_effect = lambda cuts, **kwargs: cuts
    train_ds[cuts]

Check notice: Code scanning / CodeQL: Statement has no effect (test)
    - is_mcq_cut_train / is_mcq_cut_val / is_asr_cut
    """

    import pytest

Check notice: Code scanning / CodeQL: Unused import (test)
    assert tokenizer.bos is not None, "BOS support in the tokenizer is required."
    assert tokenizer.eos is not None, "EOS support in the tokenizer is required."

    user_bos_token = '^'

Review comment: I use the same BOS and EOS for the user and agent channels. I feel that is cleaner, and I verified that it does not impact model performance. I see you want to exactly match EA; let's make these configurable, so one can set ^ and
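Following the suggestion above to make the channel markers configurable, a minimal sketch; the `ChannelTokenConfig` name and its default values are assumptions for illustration, not the PR's actual API:

```python
from dataclasses import dataclass

@dataclass
class ChannelTokenConfig:
    # Defaults share one BOS/EOS pair across both channels (the reviewer's
    # preference); an EA-style setup can override the per-channel markers.
    user_bos_token: str = "<bos>"
    user_eos_token: str = "<eos>"
    agent_bos_token: str = "<bos>"
    agent_eos_token: str = "<eos>"

# EA-style override for the user channel only:
cfg = ChannelTokenConfig(user_bos_token="^")
```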
    Prompt selection priority:
    1. Per-cut custom prompt (cut.custom['system_prompt'])
    2. MCQ training cut -> THINK prompt for think-cuts, NOTHINK prompt for others
    3. MCQ validation cut (when add_mcq_prompt=True) -> THINK prompt

Review comment: Can you also add support for a custom prompt? We can then easily evaluate the different demo setups we have used.
Review comment: A high-level question: can you also share a training script/wandb run so we can make sure the metrics look roughly right? I think additional effort may be needed to catch up to the EA ckpt, but it is better to check at intermediate steps as well.
    tokenizer: TokenizerSpec,
    train_dataset: torch.utils.data.Dataset = None,
    val_dataset: torch.utils.data.Dataset = None,
    dataset: torch.utils.data.Dataset = None,

Review comment: That's too many datasets. It's OK to remove the `dataset` parameter and property, and update the code across the collection to use `train_dataset` instead.
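A sketch of the simplification suggested above. This is hypothetical: the class name and surrounding constructor arguments are placeholders, and `torch.utils.data.Dataset` is replaced by plain duck typing so the sketch stays self-contained.

```python
from typing import Any, Optional

class DataModuleSketch:
    """Keeps only train/val datasets; the redundant `dataset` parameter is gone.

    Callers that previously passed `dataset=...` would pass
    `train_dataset=...` instead.
    """

    def __init__(
        self,
        tokenizer: Any,
        train_dataset: Optional[Any] = None,
        val_dataset: Optional[Any] = None,
    ):
        self.tokenizer = tokenizer
        self.train_dataset = train_dataset
        self.val_dataset = val_dataset

dm = DataModuleSketch(tokenizer="tok", train_dataset=[1, 2, 3])
```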
    @@ -11,9 +11,26 @@
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

Review comment: Move the changes in this file to a new file, nemo/collections/speechlm2/data/mcq.py, and document its purpose and top-level entry point, with expected usage. Let's make this re-usable across models/projects.
    @@ -11,6 +11,7 @@
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    import json

Review comment: I skipped the review of this file; I don't understand the logic being added here from a quick skim.
Something for your consideration: this file has 1k+ lines of complex data-preparation logic that isn't well documented. Users trying to train or finetune VoiceChat may have a hard time understanding how to prepare data and which options to use. Try to build this documentation, with examples showing the expected input and output for each of these steps (and for the entire pipeline).
What does this PR do?
Adds the following features to the dataset class to support VoiceChat EA STT training and fine-tuning.
Internal-access document with training, inference recipes, and notes on parity.
Collection: speechlm2
Usage
# Add a code snippet demonstrating how to use this
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.