
Commit 8876e8c

Authored by: subhankar-ghosh, chtruong814, artbataev, ko3n1g, blisc
[TTS][Docs][MagpieTTS] Add MagpieTTS finetuning docs (#15546)
1 parent cae54b3 commit 8876e8c

4 files changed

Lines changed: 223 additions & 2 deletions


docs/source/asr/speech_classification/datasets.rst

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ If you have your own data and want to preprocess it to use with NeMo ASR models,

 Freesound
 -----------

-`Freesound <http://www.freesound.org/>`_ is a website that aims to create a huge open collaborative database of audio snippets, samples, recordings, bleeps.
+`Freesound <https://freesound.org/>`_ is a website that aims to create a huge open collaborative database of audio snippets, samples, recordings, bleeps.
 Most audio samples are released under Creative Commons licenses that allow their reuse.
 Researchers and developers can access Freesound content using the Freesound API to retrieve meaningful sound information such as metadata, analysis files, and the sounds themselves.

docs/source/tts/intro.rst

Lines changed: 1 addition & 0 deletions
@@ -17,6 +17,7 @@ We will illustrate details in the following sections.

    configs
    g2p
    magpietts
+   magpietts-finetuning
    magpietts-po
    magpietts-longform

docs/source/tts/magpietts-finetuning.rst

Lines changed: 220 additions & 0 deletions
@@ -0,0 +1,220 @@

.. _magpie-tts-finetuning:

======================
Magpie-TTS Finetuning
======================

Finetuning a pretrained Magpie-TTS checkpoint lets you adapt the model to new voices or new languages without training from scratch. The pretrained model has already learned general speech patterns, prosody, and acoustic modeling, so finetuning requires far less data and compute than pretraining. This guide covers two common finetuning scenarios:

- **Adding new speakers in an existing language**: adapt the model to speak in voices not seen during pretraining, using a small dataset of target-speaker audio.
- **Adding a new language**: extend the model to synthesize speech in a language absent from the pretraining data, using a multilingual dataset configuration.

For preference optimization (DPO/GRPO) on top of a finetuned checkpoint, see :doc:`Magpie-TTS Preference Optimization <magpietts-po>`.
Prerequisites
#############

Before finetuning, you will need:

- A pretrained Magpie-TTS checkpoint (``pretrained.ckpt`` or ``pretrained.nemo``). Public checkpoints (``https://huggingface.co/nvidia/magpie_tts_multilingual_357m``) are available on Hugging Face.
- The audio codec model (``https://huggingface.co/nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps``), available on Hugging Face alongside the TTS checkpoint.
- A prepared dataset. For faster finetuning, audio codec tokens should be pre-extracted from your audio files. See the *Dataset Preparation* section below.
- NeMo installed from source or via the NeMo container. See the `NeMo GitHub page <https://github.com/NVIDIA/NeMo>`_ for installation instructions.
Dataset Preparation
-------------------

Training uses ``MagpieTTSDataset`` with ``dataset_meta`` entries (see ``DatasetMeta`` in ``nemo/collections/tts/data/text_to_speech_dataset.py``). Each line in the ``manifest_path`` file is one training example.

**Optional cached codec codes.** If each line includes ``target_audio_codes_path`` and ``context_audio_codes_path`` (paths to saved tensors) and ``model.load_cached_codes_if_available=true``, the dataloader can skip on-the-fly codec encoding. If those keys are absent, the codec runs during training and loads waveforms from ``audio_filepath`` and ``context_audio_filepath`` (slower, but no separate extraction step).

**Minimum fields** (paths are relative to ``audio_dir`` / ``feature_dir`` unless you use absolute paths):

.. code-block:: json

   {
     "audio_filepath": "relative/path/to/audio.wav",
     "text": "transcript of the utterance",
     "duration": 5.2,
     "context_audio_filepath": "relative/path/to/context.wav",
     "context_text": "transcript of the context audio",
     "target_audio_codes_path": "/optional/path/to/target_codes.pt",
     "context_audio_codes_path": "/optional/path/to/context_codes.pt"
   }
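As a quick illustration, a manifest with these fields is just JSON lines and can be produced with the standard library. The file names, paths, and transcripts below are hypothetical placeholders, not values from the NeMo repository:

```python
import json

# Hypothetical examples; replace the paths and transcripts with your own data.
examples = [
    {
        "audio_filepath": "spk1/utt_0001.wav",
        "text": "hello world",
        "duration": 5.2,
        "context_audio_filepath": "spk1/context_0001.wav",
        "context_text": "reference clip transcript",
    },
]

# One JSON object per line -- the JSON-lines shape the dataloader expects.
with open("train.json", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

The optional ``*_audio_codes_path`` keys can simply be added to each dictionary once you have pre-extracted codec tensors.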
The ``context_audio_filepath`` is the reference audio used for voice cloning during training. It should come from the same speaker as ``audio_filepath``. A minimum context duration of about 3 seconds and a high speaker similarity (for example, ≥ 0.6 with TitaNet) are recommended for best results.
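To filter target/context pairs by speaker similarity, you can compare speaker embeddings (extracted with any speaker-verification model, such as TitaNet) using cosine similarity. A minimal pure-Python sketch with made-up toy embeddings, assuming you have already extracted the embedding vectors:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dim "embeddings"; real speaker embeddings are much larger (e.g. 192-dim).
target_emb = [0.9, 0.1, 0.3, 0.2]
context_emb = [0.8, 0.2, 0.35, 0.15]

similarity = cosine_similarity(target_emb, context_emb)
# Keep the pair only if it clears the recommended threshold.
keep_pair = similarity >= 0.6
```

Pairs below the threshold are better dropped from the manifest than kept, since a mismatched context weakens the voice-cloning signal.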
**Registering datasets in config.** For each named split in ``train_ds_meta`` and ``val_ds_meta``, set ``manifest_path``, ``audio_dir``, ``feature_dir``, ``sample_weight`` (training), and ``tokenizer_names``: a list of keys that exist under ``model.text_tokenizers`` in the config. The dataloader picks the tokenizer for each sample from that list (see ``DatasetMeta``).
.. _magpie-tts-new-speaker:

Adding New Speakers in an Existing Language
###########################################

This scenario adapts a pretrained checkpoint to new speakers in a language the model already supports (for example, adding new English speakers to a checkpoint trained on English data). You are teaching the model new voice characteristics while keeping the same text tokenizer. Mixing in some public Magpie-TTS data can reduce regression on existing voices; see the `Magpie-TTS dataset <https://huggingface.co/nvidia/magpie_tts_multilingual_357m#training-dataset>`_ on Hugging Face.

Key training choices:

- **Low learning rate** (``5e-6``): the pretrained model is already well-converged; a high LR can destroy learned representations.
- **Disable alignment prior** (``alignment_loss_scale=0.0``, ``prior_scaling_factor=null``): the prior helps pretraining but can over-constrain finetuning.
- **Tokenizer**: use ``tokenizer_names: [english_phoneme]`` (or the tokenizer that matches your transcripts) on each ``train_ds_meta`` / ``val_ds_meta`` entry.

``magpietts.yaml`` trains with ``max_epochs`` and a top-level ``batch_size``. Validation mixes all ``val_ds_meta`` entries in a single dataloader (joint validation metrics).
.. code-block:: bash

   python examples/tts/magpietts.py \
       --config-path=examples/tts/conf/magpietts \
       --config-name=magpietts \
       +init_from_ptl_ckpt=/path/to/pretrained.ckpt \
       exp_manager.exp_dir=/path/to/output \
       +train_ds_meta.en_sft.manifest_path=/path/to/train.json \
       +train_ds_meta.en_sft.audio_dir=/path/to/audio \
       +train_ds_meta.en_sft.feature_dir=/path/to/features \
       +train_ds_meta.en_sft.sample_weight=1.0 \
       "+train_ds_meta.en_sft.tokenizer_names=[english_phoneme]" \
       +val_ds_meta.en_val.manifest_path=/path/to/val.json \
       +val_ds_meta.en_val.audio_dir=/path/to/audio \
       +val_ds_meta.en_val.feature_dir=/path/to/audio \
       +val_ds_meta.en_val.sample_weight=1.0 \
       "+val_ds_meta.en_val.tokenizer_names=[english_phoneme]" \
       model.codecmodel_path=nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps \
       model.context_duration_min=5.0 \
       model.context_duration_max=5.0 \
       model.alignment_loss_scale=0.0 \
       model.prior_scaling_factor=null \
       model.optim.lr=5e-6 \
       ~model.optim.sched \
       model.load_cached_codes_if_available=true \
       trainer.precision=32 \
       trainer.devices=8 \
       trainer.num_nodes=1 \
       batch_size=16 \
       max_epochs=500
The ``+init_from_ptl_ckpt`` flag loads the pretrained checkpoint weights before training begins. The ``+`` prefix is required because this key is not present in the base config.

``~model.optim.sched`` removes the learning rate schedule so the LR stays constant during finetuning.

``trainer.precision=32`` is recommended for finetuning stability. Mixed precision (``bf16`` or ``16``) can cause loss instability on small datasets.
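As a rough mental model of the two optimizer overrides above, the merged config ends up with a constant learning rate and no scheduler. This fragment is illustrative only, not the actual contents of ``magpietts.yaml``:

```yaml
# Illustrative only: what the optimizer section effectively looks like
# after model.optim.lr=5e-6 and ~model.optim.sched are applied.
model:
  optim:
    lr: 5.0e-6   # constant; with the "sched" key deleted, no LR schedule is built
```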
.. _magpie-tts-new-language:

Adding a New Language
#####################

This scenario extends the model to one or more languages not present in the pretraining data. Use the same ``magpietts`` config and combine multiple manifests with per-language ``sample_weight``.

**Tokenizers**

- Define each new tokenizer under ``model.text_tokenizers`` (for example, an ``AutoTokenizer`` with ``google/byt5-small`` for scripts outside the IPA vocabulary).
- **How it is applied:** each ``train_ds_meta`` / ``val_ds_meta`` entry lists ``tokenizer_names`` (keys under ``model.text_tokenizers``). The dataloader uses those names to select which tokenizer encodes each sample's transcript (see ``DatasetMeta`` in ``nemo/collections/tts/data/text_to_speech_dataset.py``).
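The selection logic can be sketched in a few lines. The registry, tokenizer stand-ins, and ``encode_sample`` helper below are hypothetical; the real implementation lives in ``MagpieTTSDataset``:

```python
import random

# Hypothetical registry, mirroring the keys under model.text_tokenizers.
text_tokenizers = {
    "english_phoneme": lambda text: text.split(),            # stand-in tokenizer
    "your_language_chartokenizer": lambda text: list(text),  # stand-in tokenizer
}

# Each dataset entry lists which tokenizers may encode its samples.
dataset_meta = {"tokenizer_names": ["your_language_chartokenizer"]}

def encode_sample(text, meta, rng=random):
    # Pick one of the allowed tokenizers for this dataset entry.
    name = rng.choice(meta["tokenizer_names"])
    return text_tokenizers[name](text)

tokens = encode_sample("abc", dataset_meta)
# Only the character tokenizer is listed, so the text is split into characters.
```

The point of the indirection is that different manifests can share one config while each picking the tokenizer appropriate to its language.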
**Per-language entries**

Each language is a separate key under ``train_ds_meta`` / ``val_ds_meta`` with ``manifest_path``, ``audio_dir``, ``feature_dir``, ``sample_weight``, and ``tokenizer_names``.

**Sample weights**

Upsample low-resource languages with a higher ``sample_weight`` so they are not drowned out by high-resource languages.

Align the transcript format with the tokenizer you choose (IPA phonemes for ``english_phoneme`` / IPA-style tokenizers, raw text for byte-level models, and so on). Audio codes can be cached as in *Dataset Preparation*.
.. code-block:: bash

   python examples/tts/magpietts.py \
       --config-name=magpietts \
       +init_from_ptl_ckpt=/path/to/pretrained.ckpt \
       exp_manager.exp_dir=/path/to/output \
       +model.text_tokenizers.your_language_chartokenizer._target_=AutoTokenizer \
       +model.text_tokenizers.your_language_chartokenizer.pretrained_model="google/byt5-small" \
       +train_ds_meta.your_language.manifest_path=/path/to/your_lang_train.json \
       +train_ds_meta.your_language.audio_dir=/path/to/your_lang_audio \
       +train_ds_meta.your_language.feature_dir=/path/to/your_lang_audio \
       +train_ds_meta.your_language.sample_weight=1.0 \
       "+train_ds_meta.your_language.tokenizer_names=[your_language_chartokenizer]" \
       +val_ds_meta.your_language_dev.manifest_path=/path/to/your_lang_val.json \
       +val_ds_meta.your_language_dev.audio_dir=/path/to/your_lang_audio \
       +val_ds_meta.your_language_dev.feature_dir=/path/to/your_lang_audio \
       +val_ds_meta.your_language_dev.sample_weight=1.0 \
       "+val_ds_meta.your_language_dev.tokenizer_names=[your_language_chartokenizer]" \
       model.codecmodel_path=nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps \
       model.context_duration_min=5.0 \
       model.context_duration_max=5.0 \
       model.alignment_loss_scale=0.0 \
       model.prior_scaling_factor=null \
       model.optim.lr=1e-5 \
       ~model.optim.sched \
       model.load_cached_codes_if_available=true \
       trainer.precision=32 \
       trainer.devices=8 \
       trainer.num_nodes=1 \
       max_epochs=500
Mixing Multiple Languages
--------------------------

Add one ``train_ds_meta`` entry per language. Increase ``sample_weight`` for low-resource languages. You can mix public Magpie-TTS data with your own; see the `Magpie-TTS dataset <https://huggingface.co/nvidia/magpie_tts_multilingual_357m#training-dataset>`_ on Hugging Face.
.. code-block:: bash

   # High-resource languages: standard weight
   +train_ds_meta.spanish.manifest_path=/path/to/spanish_train.json \
   +train_ds_meta.spanish.audio_dir=/path/to/spanish_audio \
   +train_ds_meta.spanish.feature_dir=/path/to/spanish_audio \
   +train_ds_meta.spanish.sample_weight=1.0 \
   "+train_ds_meta.spanish.tokenizer_names=[spanish_phoneme_or_chartokenizer]" \
   +train_ds_meta.french.manifest_path=/path/to/french_train.json \
   +train_ds_meta.french.audio_dir=/path/to/french_audio \
   +train_ds_meta.french.feature_dir=/path/to/french_audio \
   +train_ds_meta.french.sample_weight=1.0 \
   "+train_ds_meta.french.tokenizer_names=[french_chartokenizer]" \
   # Low-resource language: upsampled 5x
   +train_ds_meta.low_resource_lang.manifest_path=/path/to/low_resource_train.json \
   +train_ds_meta.low_resource_lang.audio_dir=/path/to/low_resource_audio \
   +train_ds_meta.low_resource_lang.feature_dir=/path/to/low_resource_audio \
   +train_ds_meta.low_resource_lang.sample_weight=5.0 \
   "+train_ds_meta.low_resource_lang.tokenizer_names=[low_resource_chartokenizer]"
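To reason about what ``sample_weight`` does, it helps to compute effective sampling proportions. A back-of-the-envelope sketch, assuming the weight scales each dataset's sampling mass proportionally (the dataset sizes here are invented for illustration):

```python
# Hypothetical dataset sizes (number of manifest lines) and weights.
datasets = {
    "spanish":           {"size": 100_000, "weight": 1.0},
    "french":            {"size": 80_000,  "weight": 1.0},
    "low_resource_lang": {"size": 5_000,   "weight": 5.0},
}

# Effective mass of each dataset is size * weight; proportions follow.
total = sum(d["size"] * d["weight"] for d in datasets.values())
proportions = {
    name: d["size"] * d["weight"] / total for name, d in datasets.items()
}
# Without upsampling, low_resource_lang would contribute 5000/185000 (~2.7%)
# of samples; with weight 5.0 it rises to 25000/205000 (~12.2%).
```

Under these assumptions, raising the weight of the low-resource language moves it from a negligible share of each epoch to one large enough to influence training.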
With ``model.load_cached_codes_if_available=true``, precomputed ``target_audio_codes_path`` / ``context_audio_codes_path`` entries in the manifest avoid recomputing codec codes at train time.


Preference Optimization After Finetuning
#########################################

After supervised finetuning, you can further improve quality with GRPO. For commands and hyperparameters, see :doc:`Magpie-TTS Preference Optimization <magpietts-po>` (the GRPO example uses ``--config-name=magpietts`` with ``+mode=onlinepo_train``).
Key Hyperparameter Reference
#############################

.. list-table::
   :widths: 35 25 40
   :header-rows: 1

   * - Parameter
     - Typical Value
     - Notes
   * - ``model.optim.lr``
     - ``5e-6`` (same-language speakers), ``1e-5`` (multilingual)
     - Much lower than the pretraining LR, to preserve learned features
   * - ``max_epochs``
     - tens to hundreds
     - Shorter runs for small datasets; monitor validation loss
   * - ``model.alignment_loss_scale``
     - ``0.0``
     - Disables the alignment prior during finetuning
   * - ``model.prior_scaling_factor``
     - ``null``
     - Disables the alignment prior during finetuning
   * - ``trainer.precision``
     - ``32``
     - Recommended for finetuning stability
   * - ``model.cfg_unconditional_prob``
     - ``0.1``
     - Classifier-free guidance dropout rate during training
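The ``cfg_unconditional_prob`` row refers to classifier-free guidance training: with some probability, the conditioning is dropped so the model also learns an unconditional distribution. A generic sketch of that dropout step (this is an illustration of the technique, not NeMo's actual implementation):

```python
import random

UNCONDITIONAL = None  # stand-in for a learned "null" conditioning embedding

def maybe_drop_conditioning(context, p_uncond=0.1, rng=random):
    """With probability p_uncond, replace the conditioning with the
    unconditional placeholder (classifier-free guidance training)."""
    if rng.random() < p_uncond:
        return UNCONDITIONAL
    return context

rng = random.Random(0)
batch = [maybe_drop_conditioning("speaker_ctx", 0.1, rng) for _ in range(10_000)]
drop_rate = batch.count(UNCONDITIONAL) / len(batch)
# Over many samples, drop_rate hovers close to p_uncond (0.1 here).
```

At inference time, the two learned distributions can be combined to trade adherence to the conditioning against diversity, which is why the dropout rate matters during training.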

docs/source/tts/models.rst

Lines changed: 1 addition & 1 deletion
@@ -33,7 +33,7 @@ End-to-End LLM-based TTS

 MagpieTTS
 ~~~~~~~~~
-MagpieTTS is an encoder-decoder transformer TTS model that operates on discrete audio tokens from a neural audio codec. It uses monotonic alignment (CTC loss and attention priors) to reduce hallucinations and supports voice cloning via audio or text context conditioning. For architecture, training, inference, and preference optimization (DPO/GRPO), see :doc:`Magpie-TTS documentation <magpietts>`.
+MagpieTTS is an encoder-decoder transformer TTS model that operates on discrete audio tokens from a neural audio codec. It uses monotonic alignment (CTC loss and attention priors) to reduce hallucinations and supports voice cloning via audio or text context conditioning. For architecture, training, inference, and preference optimization (DPO/GRPO), see :doc:`Magpie-TTS documentation <magpietts>`. To adapt a pretrained checkpoint to new speakers or new languages, see :doc:`Magpie-TTS Finetuning <magpietts-finetuning>`.


 Vocoders
