
Commit 8876e8c

Authored by: subhankar-ghosh, chtruong814, artbataev, ko3n1g, blisc
[TTS][Docs][MagpieTTS] Add MagpieTTS finetuning docs (#15546)
1 parent cae54b3 commit 8876e8c

4 files changed

Lines changed: 223 additions & 2 deletions


docs/source/asr/speech_classification/datasets.rst

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ If you have your own data and want to preprocess it to use with NeMo ASR models,

 Freesound
 -----------

-`Freesound <http://www.freesound.org/>`_ is a website that aims to create a huge open collaborative database of audio snippets, samples, recordings, bleeps.
+`Freesound <https://freesound.org/>`_ is a website that aims to create a huge open collaborative database of audio snippets, samples, recordings, bleeps.
 Most audio samples are released under Creative Commons licenses that allow their reuse.
 Researchers and developers can access Freesound content using the Freesound API to retrieve meaningful sound information such as metadata, analysis files, and the sounds themselves.

docs/source/tts/intro.rst

Lines changed: 1 addition & 0 deletions
@@ -17,6 +17,7 @@ We will illustrate details in the following sections.

    configs
    g2p
    magpietts
+   magpietts-finetuning
    magpietts-po
    magpietts-longform

docs/source/tts/magpietts-finetuning.rst

Lines changed: 220 additions & 0 deletions
@@ -0,0 +1,220 @@

.. _magpie-tts-finetuning:

======================
Magpie-TTS Finetuning
======================

Finetuning a pretrained Magpie-TTS checkpoint lets you adapt the model to new voices or new languages without training from scratch. The pretrained model has already learned general speech patterns, prosody, and acoustic modeling, so finetuning requires far less data and compute than pretraining. This guide covers two common finetuning scenarios:

- **Adding new speakers in an existing language**: adapt the model to speak in voices not seen during pretraining, using a small dataset of target-speaker audio.
- **Adding a new language**: extend the model to synthesize speech in a language absent from the pretraining data, using a multilingual dataset configuration.

For preference optimization (DPO/GRPO) on top of a finetuned checkpoint, see :doc:`Magpie-TTS Preference Optimization <magpietts-po>`.
Prerequisites
#############

Before finetuning, you will need:

- A pretrained Magpie-TTS checkpoint (``pretrained.ckpt`` or ``pretrained.nemo``). Public checkpoints (``https://huggingface.co/nvidia/magpie_tts_multilingual_357m``) are available on Hugging Face.
- The audio codec model (``https://huggingface.co/nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps``), available on Hugging Face alongside the TTS checkpoint.
- A prepared dataset. For faster finetuning, audio codec tokens should be pre-extracted from your audio files. See the *Dataset Preparation* section below.
- NeMo installed from source or via the NeMo container. See the `NeMo GitHub page <https://github.com/NVIDIA/NeMo>`_ for installation instructions.
Dataset Preparation
-------------------

Training uses ``MagpieTTSDataset`` with ``dataset_meta`` entries (see ``DatasetMeta`` in ``nemo/collections/tts/data/text_to_speech_dataset.py``). Each line in the ``manifest_path`` file is one training example.

**Optional cached codec codes.** If each line includes ``target_audio_codes_path`` and ``context_audio_codes_path`` (paths to saved tensors) and ``model.load_cached_codes_if_available=true``, the dataloader can skip on-the-fly codec encoding. If those keys are absent, the codec runs during training and loads waveforms from ``audio_filepath`` and ``context_audio_filepath`` (slower, but no separate extraction step).

**Minimum fields** (paths are relative to ``audio_dir`` / ``feature_dir`` unless you use absolute paths):

.. code-block:: json

   {
     "audio_filepath": "relative/path/to/audio.wav",
     "text": "transcript of the utterance",
     "duration": 5.2,
     "context_audio_filepath": "relative/path/to/context.wav",
     "context_text": "transcript of the context audio",
     "target_audio_codes_path": "/optional/path/to/target_codes.pt",
     "context_audio_codes_path": "/optional/path/to/context_codes.pt"
   }
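As a quick illustration, a manifest with these fields is just JSON lines and can be produced with the standard library. The file names, paths, and transcripts below are hypothetical placeholders, not values from the NeMo repository:

```python
import json

# Hypothetical examples; replace the paths and transcripts with your own data.
examples = [
    {
        "audio_filepath": "spk1/utt_0001.wav",
        "text": "hello world",
        "duration": 5.2,
        "context_audio_filepath": "spk1/context_0001.wav",
        "context_text": "reference clip transcript",
    },
]

# One JSON object per line -- the JSON-lines shape the dataloader expects.
with open("train.json", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

The optional ``*_audio_codes_path`` keys can simply be added to each dictionary once you have pre-extracted codec tensors.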
The ``context_audio_filepath`` is the reference audio used for voice cloning during training. It should come from the same speaker as ``audio_filepath``. A minimum context duration of about 3 seconds and a high speaker similarity (for example, ≥ 0.6 with TitaNet) are recommended for best results.
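To filter target/context pairs by speaker similarity, you can compare speaker embeddings (extracted with any speaker-verification model, such as TitaNet) using cosine similarity. A minimal pure-Python sketch with made-up toy embeddings, assuming you have already extracted the embedding vectors:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dim "embeddings"; real speaker embeddings are much larger (e.g. 192-dim).
target_emb = [0.9, 0.1, 0.3, 0.2]
context_emb = [0.8, 0.2, 0.35, 0.15]

similarity = cosine_similarity(target_emb, context_emb)
# Keep the pair only if it clears the recommended threshold.
keep_pair = similarity >= 0.6
```

Pairs below the threshold are better dropped from the manifest than kept, since a mismatched context weakens the voice-cloning signal.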
**Registering datasets in config.** For each named split in ``train_ds_meta`` and ``val_ds_meta``, set ``manifest_path``, ``audio_dir``, ``feature_dir``, ``sample_weight`` (training), and ``tokenizer_names``: a list of keys that exist under ``model.text_tokenizers`` in the config. The dataloader picks the tokenizer for each sample from that list (see ``DatasetMeta``).
.. _magpie-tts-new-speaker:

Adding New Speakers in an Existing Language
###########################################

This scenario adapts a pretrained checkpoint to new speakers in a language the model already supports (for example, adding new English speakers to a checkpoint trained on English data). You are teaching the model new voice characteristics while keeping the same text tokenizer. Mixing in some public Magpie-TTS data can reduce regression on existing voices; see the `Magpie-TTS dataset <https://huggingface.co/nvidia/magpie_tts_multilingual_357m#training-dataset>`_ on Hugging Face.

Key training choices:

- **Low learning rate** (``5e-6``): the pretrained model is already well-converged; a high LR can destroy learned representations.
- **Disable alignment prior** (``alignment_loss_scale=0.0``, ``prior_scaling_factor=null``): the prior helps pretraining but can over-constrain finetuning.
- **Tokenizer**: use ``tokenizer_names: [english_phoneme]`` (or the tokenizer that matches your transcripts) on each ``train_ds_meta`` / ``val_ds_meta`` entry.

``magpietts.yaml`` trains with ``max_epochs`` and a top-level ``batch_size``. Validation mixes all ``val_ds_meta`` entries in a single dataloader (joint validation metrics).
.. code-block:: bash

   python examples/tts/magpietts.py \
       --config-path=examples/tts/conf/magpietts \
       --config-name=magpietts \
       +init_from_ptl_ckpt=/path/to/pretrained.ckpt \
       exp_manager.exp_dir=/path/to/output \
       +train_ds_meta.en_sft.manifest_path=/path/to/train.json \
       +train_ds_meta.en_sft.audio_dir=/path/to/audio \
       +train_ds_meta.en_sft.feature_dir=/path/to/features \
       +train_ds_meta.en_sft.sample_weight=1.0 \
       "+train_ds_meta.en_sft.tokenizer_names=[english_phoneme]" \
       +val_ds_meta.en_val.manifest_path=/path/to/val.json \
       +val_ds_meta.en_val.audio_dir=/path/to/audio \
       +val_ds_meta.en_val.feature_dir=/path/to/audio \
       +val_ds_meta.en_val.sample_weight=1.0 \
       "+val_ds_meta.en_val.tokenizer_names=[english_phoneme]" \
       model.codecmodel_path=nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps \
       model.context_duration_min=5.0 \
       model.context_duration_max=5.0 \
       model.alignment_loss_scale=0.0 \
       model.prior_scaling_factor=null \
       model.optim.lr=5e-6 \
       ~model.optim.sched \
       model.load_cached_codes_if_available=true \
       trainer.precision=32 \
       trainer.devices=8 \
       trainer.num_nodes=1 \
       batch_size=16 \
       max_epochs=500
The ``+init_from_ptl_ckpt`` flag loads the pretrained checkpoint weights before training begins. The ``+`` prefix is required because this key is not present in the base config.

``~model.optim.sched`` removes the learning rate schedule so the LR stays constant during finetuning.

``trainer.precision=32`` is recommended for finetuning stability. Mixed precision (``bf16`` or ``16``) can cause loss instability on small datasets.
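As a rough mental model of the two optimizer overrides above, the merged config ends up with a constant learning rate and no scheduler. This fragment is illustrative only, not the actual contents of ``magpietts.yaml``:

```yaml
# Illustrative only: what the optimizer section effectively looks like
# after model.optim.lr=5e-6 and ~model.optim.sched are applied.
model:
  optim:
    lr: 5.0e-6   # constant; with the "sched" key deleted, no LR schedule is built
```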
.. _magpie-tts-new-language:

Adding a New Language
#####################

This scenario extends the model to one or more languages not present in the pretraining data. Use the same ``magpietts`` config and combine multiple manifests with per-language ``sample_weight``.

**Tokenizers**

- Define each new tokenizer under ``model.text_tokenizers`` (for example, an ``AutoTokenizer`` with ``google/byt5-small`` for scripts outside the IPA vocabulary).
- **How it is applied:** each ``train_ds_meta`` / ``val_ds_meta`` entry lists ``tokenizer_names`` (keys under ``model.text_tokenizers``). The dataloader uses those names to select which tokenizer encodes each sample's transcript (see ``DatasetMeta`` in ``nemo/collections/tts/data/text_to_speech_dataset.py``).
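The selection logic can be sketched in a few lines. The registry, tokenizer stand-ins, and ``encode_sample`` helper below are hypothetical; the real implementation lives in ``MagpieTTSDataset``:

```python
import random

# Hypothetical registry, mirroring the keys under model.text_tokenizers.
text_tokenizers = {
    "english_phoneme": lambda text: text.split(),            # stand-in tokenizer
    "your_language_chartokenizer": lambda text: list(text),  # stand-in tokenizer
}

# Each dataset entry lists which tokenizers may encode its samples.
dataset_meta = {"tokenizer_names": ["your_language_chartokenizer"]}

def encode_sample(text, meta, rng=random):
    # Pick one of the allowed tokenizers for this dataset entry.
    name = rng.choice(meta["tokenizer_names"])
    return text_tokenizers[name](text)

tokens = encode_sample("abc", dataset_meta)
# Only the character tokenizer is listed, so the text is split into characters.
```

The point of the indirection is that different manifests can share one config while each picking the tokenizer appropriate to its language.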
**Per-language entries**

Each language is a separate key under ``train_ds_meta`` / ``val_ds_meta`` with ``manifest_path``, ``audio_dir``, ``feature_dir``, ``sample_weight``, and ``tokenizer_names``.

**Sample weights**

Upsample low-resource languages with a higher ``sample_weight`` so they are not drowned out by high-resource languages.

Align the transcript format with the tokenizer you choose (IPA phonemes for ``english_phoneme`` / IPA-style tokenizers, raw text for byte-level models, and so on). Audio codes can be cached as in *Dataset Preparation*.
.. code-block:: bash

   python examples/tts/magpietts.py \
       --config-name=magpietts \
       +init_from_ptl_ckpt=/path/to/pretrained.ckpt \
       exp_manager.exp_dir=/path/to/output \
       +model.text_tokenizers.your_language_chartokenizer._target_=AutoTokenizer \
       +model.text_tokenizers.your_language_chartokenizer.pretrained_model="google/byt5-small" \
       +train_ds_meta.your_language.manifest_path=/path/to/your_lang_train.json \
       +train_ds_meta.your_language.audio_dir=/path/to/your_lang_audio \
       +train_ds_meta.your_language.feature_dir=/path/to/your_lang_audio \
       +train_ds_meta.your_language.sample_weight=1.0 \
       "+train_ds_meta.your_language.tokenizer_names=[your_language_chartokenizer]" \
       +val_ds_meta.your_language_dev.manifest_path=/path/to/your_lang_val.json \
       +val_ds_meta.your_language_dev.audio_dir=/path/to/your_lang_audio \
       +val_ds_meta.your_language_dev.feature_dir=/path/to/your_lang_audio \
       +val_ds_meta.your_language_dev.sample_weight=1.0 \
       "+val_ds_meta.your_language_dev.tokenizer_names=[your_language_chartokenizer]" \
       model.codecmodel_path=nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps \
       model.context_duration_min=5.0 \
       model.context_duration_max=5.0 \
       model.alignment_loss_scale=0.0 \
       model.prior_scaling_factor=null \
       model.optim.lr=1e-5 \
       ~model.optim.sched \
       model.load_cached_codes_if_available=true \
       trainer.precision=32 \
       trainer.devices=8 \
       trainer.num_nodes=1 \
       max_epochs=500
Mixing Multiple Languages
--------------------------

Add one ``train_ds_meta`` entry per language. Increase ``sample_weight`` for low-resource languages. You can mix public Magpie-TTS data with your own; see the `Magpie-TTS dataset <https://huggingface.co/nvidia/magpie_tts_multilingual_357m#training-dataset>`_ on Hugging Face.
.. code-block:: bash

   # High-resource languages: standard weight
   +train_ds_meta.spanish.manifest_path=/path/to/spanish_train.json \
   +train_ds_meta.spanish.audio_dir=/path/to/spanish_audio \
   +train_ds_meta.spanish.feature_dir=/path/to/spanish_audio \
   +train_ds_meta.spanish.sample_weight=1.0 \
   "+train_ds_meta.spanish.tokenizer_names=[spanish_phoneme_or_chartokenizer]" \
   +train_ds_meta.french.manifest_path=/path/to/french_train.json \
   +train_ds_meta.french.audio_dir=/path/to/french_audio \
   +train_ds_meta.french.feature_dir=/path/to/french_audio \
   +train_ds_meta.french.sample_weight=1.0 \
   "+train_ds_meta.french.tokenizer_names=[french_chartokenizer]" \
   # Low-resource language: upsampled 5x
   +train_ds_meta.low_resource_lang.manifest_path=/path/to/low_resource_train.json \
   +train_ds_meta.low_resource_lang.audio_dir=/path/to/low_resource_audio \
   +train_ds_meta.low_resource_lang.feature_dir=/path/to/low_resource_audio \
   +train_ds_meta.low_resource_lang.sample_weight=5.0 \
   "+train_ds_meta.low_resource_lang.tokenizer_names=[low_resource_chartokenizer]"
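To reason about what ``sample_weight`` does, it helps to compute effective sampling proportions. A back-of-the-envelope sketch, assuming the weight scales each dataset's sampling mass proportionally (the dataset sizes here are invented for illustration):

```python
# Hypothetical dataset sizes (number of manifest lines) and weights.
datasets = {
    "spanish":           {"size": 100_000, "weight": 1.0},
    "french":            {"size": 80_000,  "weight": 1.0},
    "low_resource_lang": {"size": 5_000,   "weight": 5.0},
}

# Effective mass of each dataset is size * weight; proportions follow.
total = sum(d["size"] * d["weight"] for d in datasets.values())
proportions = {
    name: d["size"] * d["weight"] / total for name, d in datasets.items()
}
# Without upsampling, low_resource_lang would contribute 5000/185000 (~2.7%)
# of samples; with weight 5.0 it rises to 25000/205000 (~12.2%).
```

Under these assumptions, raising the weight of the low-resource language moves it from a negligible share of each epoch to one large enough to influence training.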
With ``model.load_cached_codes_if_available=true``, precomputed ``target_audio_codes_path`` / ``context_audio_codes_path`` entries in the manifest avoid recomputing codec codes at train time.


Preference Optimization After Finetuning
#########################################

After supervised finetuning, you can further improve quality with GRPO. For commands and hyperparameters, see :doc:`Magpie-TTS Preference Optimization <magpietts-po>` (the GRPO example uses ``--config-name=magpietts`` with ``+mode=onlinepo_train``).
Key Hyperparameter Reference
#############################

.. list-table::
   :widths: 35 25 40
   :header-rows: 1

   * - Parameter
     - Typical Value
     - Notes
   * - ``model.optim.lr``
     - ``5e-6`` (same-language speakers), ``1e-5`` (multilingual)
     - Much lower than the pretraining LR, to preserve learned features
   * - ``max_epochs``
     - tens to hundreds
     - Shorter runs for small datasets; monitor validation loss
   * - ``model.alignment_loss_scale``
     - ``0.0``
     - Disables the alignment prior during finetuning
   * - ``model.prior_scaling_factor``
     - ``null``
     - Disables the alignment prior during finetuning
   * - ``trainer.precision``
     - ``32``
     - Recommended for finetuning stability
   * - ``model.cfg_unconditional_prob``
     - ``0.1``
     - Classifier-free guidance dropout rate during training
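The ``cfg_unconditional_prob`` row refers to classifier-free guidance training: with some probability, the conditioning is dropped so the model also learns an unconditional distribution. A generic sketch of that dropout step (this is an illustration of the technique, not NeMo's actual implementation):

```python
import random

UNCONDITIONAL = None  # stand-in for a learned "null" conditioning embedding

def maybe_drop_conditioning(context, p_uncond=0.1, rng=random):
    """With probability p_uncond, replace the conditioning with the
    unconditional placeholder (classifier-free guidance training)."""
    if rng.random() < p_uncond:
        return UNCONDITIONAL
    return context

rng = random.Random(0)
batch = [maybe_drop_conditioning("speaker_ctx", 0.1, rng) for _ in range(10_000)]
drop_rate = batch.count(UNCONDITIONAL) / len(batch)
# Over many samples, drop_rate hovers close to p_uncond (0.1 here).
```

At inference time, the two learned distributions can be combined to trade adherence to the conditioning against diversity, which is why the dropout rate matters during training.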

docs/source/tts/models.rst

Lines changed: 1 addition & 1 deletion
@@ -33,7 +33,7 @@ End-to-End LLM-based TTS

 MagpieTTS
 ~~~~~~~~~
-MagpieTTS is an encoder-decoder transformer TTS model that operates on discrete audio tokens from a neural audio codec. It uses monotonic alignment (CTC loss and attention priors) to reduce hallucinations and supports voice cloning via audio or text context conditioning. For architecture, training, inference, and preference optimization (DPO/GRPO), see :doc:`Magpie-TTS documentation <magpietts>`.
+MagpieTTS is an encoder-decoder transformer TTS model that operates on discrete audio tokens from a neural audio codec. It uses monotonic alignment (CTC loss and attention priors) to reduce hallucinations and supports voice cloning via audio or text context conditioning. For architecture, training, inference, and preference optimization (DPO/GRPO), see :doc:`Magpie-TTS documentation <magpietts>`. To adapt a pretrained checkpoint to new speakers or new languages, see :doc:`Magpie-TTS Finetuning <magpietts-finetuning>`.


 Vocoders
