[TTS] Add code for training semantic codec #15524
Conversation
```
hidden_layer: Index of hidden layer to extract embeddings from.
    Defaults to 16, which research suggests is effective for w2v-bert and TTS.
padding: Number of audio samples to pad before encoding to ensure output has a frame rate compatible with the audio codec.
scaling_factor: Constant factor to scale output embedding by.
```
It looks like in practice we divide by this factor, not multiply. Maybe either change the contract to provide 1/scaling and then multiply, or change the comment to say we divide by it.

Also, why is this scaling needed? Okay to keep it, just curious.

All loss functions in the codec training are implemented to produce approximately the same scale by default (around 0.1-0.3). The w2v embedding scale is arbitrary, depending on whatever scale the layer norm in the model learned. It ends up producing an embedding where the max value is around 5, so I scale the embedding down to have a max value of about 1, which also reduces the scale of the SLM loss to about 0.2, comparable to the other losses.

The alternative would be to have the SLM loss scale in the AudioCodecModel class default to something like 0.1 or 0.2.
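For illustration, a minimal sketch of the two contracts under discussion (function and parameter names are hypothetical, not from the PR):

```python
import torch

def apply_slm_scaling(slm_emb: torch.Tensor, scaling_factor: float = 5.0) -> torch.Tensor:
    # Current behavior described above: divide, so an embedding with
    # max value ~5 ends up with max value ~1.
    return slm_emb / scaling_factor

def apply_slm_scaling_alt(slm_emb: torch.Tensor, inv_scaling_factor: float = 0.2) -> torch.Tensor:
    # Alternative contract from the review: the caller supplies 1/scaling
    # and the code multiplies.
    return slm_emb * inv_scaling_factor
```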
```python
    return slm_emb


class SLMDecoder(NeuralModule):
```
I'm not sure if SLMDecoder is the best name for what this class does; on the face of it, it would look like it's decoding the outputs of the SLM encoder, but that's not what it really does. Maybe SLMPredictor? Or any name that you think makes sense.

That said, I hope changing this won't cause too much trouble in invalidating existing checkpoints etc...

I think the terms "predict" and "decode" are usually interchangeable. It is, however, a bit confusing in this instance because the target being generated by the decoder happens to be the latent space of a different encoder's output.

This can be safely changed, as I deleted the SLM encoders and decoders from the checkpoints I am using for inference anyway.
```python
    self,
    dataset_meta: Dict,
    sample_rate: int,
    resample_rate: Optional[int] = None,
```
Could you update the docstring to add this argument?

Added docstring. The functionality might be a bit confusing, because the feature actually being added is the option to resample using batched NeMo code instead of librosa.
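A sketch of the kind of docstring entry being requested; the wording and the surrounding class are assumptions, not the committed text:

```python
from typing import Dict, Optional

class ExampleVocoderDataset:  # hypothetical stand-in for the actual dataset class
    def __init__(self, dataset_meta: Dict, sample_rate: int, resample_rate: Optional[int] = None):
        """
        Args:
            dataset_meta: Dataset metadata dictionary.
            sample_rate: Sample rate of the source audio.
            resample_rate: If provided, resample audio to this rate using batched
                NeMo code instead of librosa.
        """
        self.dataset_meta = dataset_meta
        self.sample_rate = sample_rate
        self.resample_rate = resample_rate
```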
```python
    return state_dict


def load_state_dict(self, state_dict, strict=True):
    # Override to load all the keys except .speaker_encoder. and WavLM model
```
Could you update the comment to say why we are skipping some keys?
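The mechanics of the override might look like the sketch below; the reason in the comment is a placeholder for the author to fill in, and the key markers and class are assumptions:

```python
import torch

class CodecWithExternalModules(torch.nn.Module):  # hypothetical class for illustration
    def load_state_dict(self, state_dict, strict=True):
        # Skip ".speaker_encoder." and WavLM keys because <reason to be stated
        # by the author, e.g. they are restored from separate pretrained
        # checkpoints rather than from this model's checkpoint>.
        skip_markers = (".speaker_encoder.", "wavlm")  # assumed key markers
        filtered = {
            k: v for k, v in state_dict.items()
            if not any(m in k.lower() for m in skip_markers)
        }
        # Non-strict load is required since the filtered dict is missing keys.
        return super().load_state_dict(filtered, strict=False)
```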
```python
    semantic_codec_cfg = cfg.get("semantic_codec")
    semantic_codec = AudioCodecModel(cfg=semantic_codec_cfg)
elif cfg.get("semantic_codec_path"):
    semantic_codec_path = cfg.get("semantic_codec_path")
```
Are semantic_codec and semantic_codec_path mutually exclusive? If so, maybe we can add an error check.

Also, a question: how does training of the semantic codec itself happen?

I tried to describe it succinctly in the comment. What happens is that the first time you train this model there is no "semantic_codec" config, so it reads from the "semantic_codec_path" you provide in your yaml file. It then loads the checkpoint and stores it in a new config called "semantic_codec". On all future training runs, and during inference, both config values are present, but it prioritizes using the submodule instead of reading the checkpoint again. The "semantic_codec" config is only auto-generated in this way, never defined by a user.

This was the only way I could find to get register_nemo_submodule to work in this way, and it feels like an awkward interface. If anyone knows a cleaner way to implement this, I would be happy to hear it.

> Also, a question: how does training of the semantic codec itself happen?

You run the recipe using a config file that has 1 codebook, no discriminator, and the SLM loss enabled.
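A condensed sketch of the priority logic described above; the function wrapper and the restore_from call are assumptions:

```python
from nemo.collections.tts.models import AudioCodecModel  # import path assumed

def build_semantic_codec(cfg):
    # "semantic_codec" is auto-generated by a previous run and takes priority;
    # "semantic_codec_path" is the user-provided .nemo checkpoint, read only
    # on the first training run.
    if cfg.get("semantic_codec"):
        return AudioCodecModel(cfg=cfg.get("semantic_codec"))
    if cfg.get("semantic_codec_path"):
        return AudioCodecModel.restore_from(cfg.get("semantic_codec_path"))
    return None
```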
What is this register_nemo_submodule function?

In Magpie, we have a codecmodel_path which is a local path to a .nemo checkpoint. Magpie checkpoints cannot be loaded until this path is manually overwritten by the user. register_nemo_submodule is how this can be avoided, so that the sub-codec is loaded automatically without needing the original .nemo path.

https://github.com/NVIDIA-NeMo/NeMo/blame/main/nemo/core/classes/modelPT.py#L315
https://github.com/NVIDIA-NeMo/NeMo/blob/main/nemo/core/classes/modelPT.py#L275-L279
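Based on the linked modelPT.py, the registration presumably looks something like this (the helper function and exact usage here are assumptions):

```python
def attach_semantic_codec(parent_model, semantic_codec):
    # register_nemo_submodule stores the submodule's config under the
    # "semantic_codec" field of the parent config, so later restores of the
    # parent model rebuild the sub-codec without the original .nemo path.
    parent_model.register_nemo_submodule(
        name="semantic_codec",
        config_field="semantic_codec",
        model=semantic_codec,
    )
```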
```python
if self.discriminator is None:
    schedulers.step()
else:
    schedulers[0].step()
```
Maybe it would be cleaner to iterate over all schedulers in the list.
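A sketch of that suggestion; whether the discriminator scheduler should also be stepped here is for the author to confirm:

```python
def step_all_schedulers(schedulers):
    # Normalize a single scheduler to a list, then step everything uniformly.
    if not isinstance(schedulers, (list, tuple)):
        schedulers = [schedulers]
    for scheduler in schedulers:
        scheduler.step()
```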
rfejgin left a comment:

Looks good and clean overall. See some generally minor comments.
```python
encoded, encoded_len = self.audio_encoder(audio=audio_preprocessed, audio_len=audio_preprocessed_len)

if self.semantic_codec is not None:
    semantic, _ = self.semantic_codec.encode_audio(audio=audio, audio_len=audio_len, sample_rate=sample_rate)
```
Should we add a `with torch.no_grad()` here?
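A sketch of the suggestion, assuming the semantic codec should stay frozen while training the acoustic codec (the wrapper function is hypothetical):

```python
import torch

def encode_semantic_frozen(semantic_codec, audio, audio_len, sample_rate):
    # no_grad keeps gradients (and activation memory) out of the frozen
    # semantic codec while still producing the target embeddings.
    with torch.no_grad():
        semantic, _ = semantic_codec.encode_audio(
            audio=audio, audio_len=audio_len, sample_rate=sample_rate
        )
    return semantic
```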
```python
if schedulers is None or self.lr_schedule_interval != interval:
    return

if self.discriminator is None:
```
Maybe instead just check whether it's a list or a single item; then this function doesn't need to know about discriminators.
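A sketch keyed on the schedulers' type rather than on the discriminator, preserving the original behavior (a list steps only index 0); the function name is hypothetical:

```python
def step_generator_scheduler(schedulers):
    # A list implies (generator, discriminator) schedulers; only the
    # generator scheduler at index 0 is stepped here, matching the original
    # branch without referencing self.discriminator.
    if isinstance(schedulers, list):
        schedulers[0].step()
    else:
        schedulers.step()
```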
Force-pushed from 12fc883 to ca1c5dd.
```python
# If 'semantic_codec_path' is provided, the semantic codec will be initialized from the provided path.
# It will then be registered as a submodule and automatically loaded from the 'semantic_codec' field
if cfg.get("semantic_codec"):
```
Can you add a test path that includes the use of a semantic codec? I'm concerned that there is no L0 nor L2 test coverage of this pattern.

We currently have no L0 or L2 tests for codec models in general, which makes it difficult to add one for just the semantic codec. Creating these tests looks fairly involved. Should I create a separate PR to add automated tests for codec model training and inference?

I'm not as concerned about the other codecs, as the inference paths are covered during the Magpie tests. For this specific training setup, where we load a semantic codec and train a complementary acoustic codec, we should add a fast_dev_run test using an interim checkpoint that you have.

Added L0 and L2 tests. Will wait and verify that they run successfully in the CI pipeline.
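A minimal sketch of the kind of fast_dev_run smoke test discussed; the config path, pytest marker, and import paths are all assumptions:

```python
import pytest
import lightning.pytorch as pl  # some NeMo versions import pytorch_lightning instead
from omegaconf import OmegaConf

from nemo.collections.tts.models import AudioCodecModel  # import path assumed

CONFIG = "examples/tts/conf/audio_codec/semantic_codec.yaml"  # hypothetical config path

@pytest.mark.run_only_on("GPU")  # NeMo's custom marker, if available in this suite
def test_semantic_codec_fast_dev_run():
    # fast_dev_run executes a single train/val batch to smoke-test the recipe.
    cfg = OmegaConf.load(CONFIG)
    trainer = pl.Trainer(fast_dev_run=True, devices=1, accelerator="gpu")
    model = AudioCodecModel(cfg=cfg.model, trainer=trainer)
    trainer.fit(model)
```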
[🤖]: Hi @rlangman 👋, We wanted to let you know that a CICD pipeline for this PR just finished successfully. So it might be time to merge this PR or get some approvals.
What does this PR do?

Add code needed to train a single-codebook semantic codec and embed it inside a multi-codebook codec.

Collection: [TTS]

Changelog

- PhonemeASR module it references is not defined in NeMo.

Pre checks:

PR Type: