[TTS] Add code for training semantic codec #15524
Conversation
```
hidden_layer: Index of hidden layer to extract embeddings from.
    Defaults to 16, which research suggests is effective for w2v-bert and TTS.
padding: Number of audio samples to pad before encoding to ensure output has a frame rate compatible with the audio codec.
scaling_factor: Constant factor to scale output embedding by.
```
It looks like in practice we divide by this factor, not multiply. Maybe either change the contract to provide 1/scaling and then multiply, or change the comment to say we divide by it.

Also, why is this scaling needed? Okay to keep it, just curious.

All loss functions in the codec training are implemented to produce approximately the same scale by default (around 0.1-0.3). The w2v embedding scale is arbitrary, depending on whatever scale the layer norm in the model learned. It ends up producing an embedding where the max value is around 5, so I scale the embedding down to have a max value of about 1, which also reduces the scale of the SLM loss to about 0.2, comparable to the other losses.

The alternative would be to have the SLM loss scale in the AudioCodecModel class default to something like 0.1 or 0.2.
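For illustration, a minimal sketch of the two contracts under discussion (function and parameter names are hypothetical, not from the PR):

```python
import torch

def apply_slm_scaling(slm_emb: torch.Tensor, scaling_factor: float = 5.0) -> torch.Tensor:
    # Current behavior described above: divide, so an embedding with
    # max value ~5 ends up with max value ~1.
    return slm_emb / scaling_factor

def apply_slm_scaling_alt(slm_emb: torch.Tensor, inv_scaling_factor: float = 0.2) -> torch.Tensor:
    # Alternative contract from the review: the caller supplies 1/scaling
    # and the code multiplies.
    return slm_emb * inv_scaling_factor
```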
```python
    return slm_emb


class SLMDecoder(NeuralModule):
```
I'm not sure if SLMDecoder is the best name for what this class does; on the face of it, it would look like it's decoding the outputs of the SLM encoder, but that's not what it really does. Maybe SLMPredictor? Or any name that you think makes sense.

That said, I hope changing this won't cause too much trouble in invalidating existing checkpoints etc...

I think the terms "predict" and "decode" are usually interchangeable. It is, however, a bit confusing in this instance because the target being generated by the decoder happens to be the latent space of a different encoder's output.

This can be safely changed, as I deleted the SLM encoders and decoders from the checkpoints I am using for inference anyway.
```python
    self,
    dataset_meta: Dict,
    sample_rate: int,
    resample_rate: Optional[int] = None,
```
Could you update the docstring to add this argument?

Added docstring. The functionality might be a bit confusing, because the feature actually being added is the option to resample using batched NeMo code instead of librosa.
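A sketch of the kind of docstring entry being requested; the wording and the surrounding class are assumptions, not the committed text:

```python
from typing import Dict, Optional

class ExampleVocoderDataset:  # hypothetical stand-in for the actual dataset class
    def __init__(self, dataset_meta: Dict, sample_rate: int, resample_rate: Optional[int] = None):
        """
        Args:
            dataset_meta: Dataset metadata dictionary.
            sample_rate: Sample rate of the source audio.
            resample_rate: If provided, resample audio to this rate using batched
                NeMo code instead of librosa.
        """
        self.dataset_meta = dataset_meta
        self.sample_rate = sample_rate
        self.resample_rate = resample_rate
```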
```python
    return state_dict


def load_state_dict(self, state_dict, strict=True):
    # Override to load all the keys except .speaker_encoder. and WavLM model
```
Could you update the comment to say why we are skipping some keys?
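The mechanics of the override might look like the sketch below; the reason in the comment is a placeholder for the author to fill in, and the key markers and class are assumptions:

```python
import torch

class CodecWithExternalModules(torch.nn.Module):  # hypothetical class for illustration
    def load_state_dict(self, state_dict, strict=True):
        # Skip ".speaker_encoder." and WavLM keys because <reason to be stated
        # by the author, e.g. they are restored from separate pretrained
        # checkpoints rather than from this model's checkpoint>.
        skip_markers = (".speaker_encoder.", "wavlm")  # assumed key markers
        filtered = {
            k: v for k, v in state_dict.items()
            if not any(m in k.lower() for m in skip_markers)
        }
        # Non-strict load is required since the filtered dict is missing keys.
        return super().load_state_dict(filtered, strict=False)
```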
```python
    semantic_codec_cfg = cfg.get("semantic_codec")
    semantic_codec = AudioCodecModel(cfg=semantic_codec_cfg)
elif cfg.get("semantic_codec_path"):
    semantic_codec_path = cfg.get("semantic_codec_path")
```
Are semantic_codec and semantic_codec_path mutually exclusive? If so, maybe we can add an error check.

Also, a question: how does training of the semantic codec itself happen?

I tried to describe it succinctly in the comment. What happens is that the first time you train this model there is no "semantic_codec" config, so it reads from the "semantic_codec_path" you provide in your yaml file. It then loads the checkpoint and stores it in a new config called "semantic_codec". On all future training runs, and during inference, both config values are present, but it prioritizes using the submodule instead of reading the checkpoint again. The "semantic_codec" config is only auto-generated in this way, never defined by a user.

This was the only way I could find to get register_nemo_submodule to work in this way, and it feels like an awkward interface. If anyone knows a cleaner way to implement this, I would be happy to hear it.

> Also, a question: how does training of the semantic codec itself happen?

You run the recipe using a config file that has 1 codebook, no discriminator, and the SLM loss enabled.
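A condensed sketch of the priority logic described above; the function wrapper and the restore_from call are assumptions:

```python
from nemo.collections.tts.models import AudioCodecModel  # import path assumed

def build_semantic_codec(cfg):
    # "semantic_codec" is auto-generated by a previous run and takes priority;
    # "semantic_codec_path" is the user-provided .nemo checkpoint, read only
    # on the first training run.
    if cfg.get("semantic_codec"):
        return AudioCodecModel(cfg=cfg.get("semantic_codec"))
    if cfg.get("semantic_codec_path"):
        return AudioCodecModel.restore_from(cfg.get("semantic_codec_path"))
    return None
```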
What is this register_nemo_submodule function?

In Magpie, we have a codecmodel_path which is a local path to a .nemo checkpoint. Magpie checkpoints cannot be loaded until this path is manually overwritten by the user. register_nemo_submodule is how this can be avoided, so that the sub-codec is loaded automatically without needing the original .nemo path.

https://github.com/NVIDIA-NeMo/NeMo/blame/main/nemo/core/classes/modelPT.py#L315
https://github.com/NVIDIA-NeMo/NeMo/blob/main/nemo/core/classes/modelPT.py#L275-L279
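Based on the linked modelPT.py, the registration presumably looks something like this (the helper function and exact usage here are assumptions):

```python
def attach_semantic_codec(parent_model, semantic_codec):
    # register_nemo_submodule stores the submodule's config under the
    # "semantic_codec" field of the parent config, so later restores of the
    # parent model rebuild the sub-codec without the original .nemo path.
    parent_model.register_nemo_submodule(
        name="semantic_codec",
        config_field="semantic_codec",
        model=semantic_codec,
    )
```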
```python
if self.discriminator is None:
    schedulers.step()
else:
    schedulers[0].step()
```
Maybe it would be cleaner to iterate over all schedulers in the list.
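A sketch of that suggestion; whether the discriminator scheduler should also be stepped here is for the author to confirm:

```python
def step_all_schedulers(schedulers):
    # Normalize a single scheduler to a list, then step everything uniformly.
    if not isinstance(schedulers, (list, tuple)):
        schedulers = [schedulers]
    for scheduler in schedulers:
        scheduler.step()
```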
rfejgin left a comment:

Looks good and clean overall. See some generally minor comments.
```python
encoded, encoded_len = self.audio_encoder(audio=audio_preprocessed, audio_len=audio_preprocessed_len)

if self.semantic_codec is not None:
    semantic, _ = self.semantic_codec.encode_audio(audio=audio, audio_len=audio_len, sample_rate=sample_rate)
```
Should we add a `with torch.no_grad()` here?
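A sketch of the suggestion, assuming the semantic codec should stay frozen while training the acoustic codec (the wrapper function is hypothetical):

```python
import torch

def encode_semantic_frozen(semantic_codec, audio, audio_len, sample_rate):
    # no_grad keeps gradients (and activation memory) out of the frozen
    # semantic codec while still producing the target embeddings.
    with torch.no_grad():
        semantic, _ = semantic_codec.encode_audio(
            audio=audio, audio_len=audio_len, sample_rate=sample_rate
        )
    return semantic
```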
```python
if schedulers is None or self.lr_schedule_interval != interval:
    return

if self.discriminator is None:
```
Maybe instead just check whether it's a list or a single item; then this function doesn't need to know about discriminators.
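A sketch keyed on the schedulers' type rather than on the discriminator, preserving the original behavior (a list steps only index 0); the function name is hypothetical:

```python
def step_generator_scheduler(schedulers):
    # A list implies (generator, discriminator) schedulers; only the
    # generator scheduler at index 0 is stepped here, matching the original
    # branch without referencing self.discriminator.
    if isinstance(schedulers, list):
        schedulers[0].step()
    else:
        schedulers.step()
```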
Force-pushed from 12fc883 to ca1c5dd.
```python
# If 'semantic_codec_path' is provided, the semantic codec will be initialized from the provided path.
# It will then be registered as a submodule and automatically loaded from the 'semantic_codec' field
if cfg.get("semantic_codec"):
```
Can you add a test path that includes the use of a semantic codec? I'm concerned that there is no L0 nor L2 test coverage of this pattern.

We currently have no L0 or L2 tests for codec models in general, which makes it difficult to add one for just the semantic codec. Creating these tests looks fairly involved. Should I create a separate PR to add automated tests for codec model training and inference?

I'm not as concerned about the other codecs, as the inference paths are covered during the Magpie tests. For this specific training setup, where we load a semantic codec and train a complementary acoustic codec, we should add a fast_dev_run test using an interim checkpoint that you have.

Added L0 and L2 tests. Will wait and verify that they run successfully in the CI pipeline.
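A minimal sketch of the kind of fast_dev_run smoke test discussed; the config path, pytest marker, and import paths are all assumptions:

```python
import pytest
import lightning.pytorch as pl  # some NeMo versions import pytorch_lightning instead
from omegaconf import OmegaConf

from nemo.collections.tts.models import AudioCodecModel  # import path assumed

CONFIG = "examples/tts/conf/audio_codec/semantic_codec.yaml"  # hypothetical config path

@pytest.mark.run_only_on("GPU")  # NeMo's custom marker, if available in this suite
def test_semantic_codec_fast_dev_run():
    # fast_dev_run executes a single train/val batch to smoke-test the recipe.
    cfg = OmegaConf.load(CONFIG)
    trainer = pl.Trainer(fast_dev_run=True, devices=1, accelerator="gpu")
    model = AudioCodecModel(cfg=cfg.model, trainer=trainer)
    trainer.fit(model)
```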
[🤖]: Hi @rlangman 👋, We wanted to let you know that a CICD pipeline for this PR just finished successfully. So it might be time to merge this PR or get some approvals.
What does this PR do?

Add code needed to train a single-codebook semantic codec and embed it inside a multi-codebook codec.

Collection: [TTS]

Changelog

- PhonemeASR module it references is not defined in NeMo.

Pre checks:

PR Type: