NVIDIA-NeMo
diff --git a/.claude/skills/nemo-speech-asr-finetune/SKILL.md b/.claude/skills/nemo-speech-asr-finetune/SKILL.md
diff --git a/.claude/skills/nemo-speech-asr-finetune/assets/experiment-ledger-template.md b/.claude/skills/nemo-speech-asr-finetune/assets/experiment-ledger-template.md
diff --git a/.claude/skills/nemo-speech-asr-finetune/references/architecture-tokenizer-metrics.md b/.claude/skills/nemo-speech-asr-finetune/references/architecture-tokenizer-metrics.md
@@ -0,0 +1,78 @@
+---
+name: nemo-speech-asr-finetune
+description: Guide NeMo Speech users through ASR fine-tuning with container setup and Lhotse training.
+---
+
+# NeMo Speech ASR Fine-Tuning
+
+Use this skill when a user wants to fine-tune a NeMo Speech ASR model, choose a checkpoint, adapt a tokenizer,
+configure Lhotse dataloading, train, average checkpoints, or evaluate a fine-tuned ASR `.nemo` checkpoint.
+Also use it for post-run refinement planning after fine-tuning.
+
+Default posture:
+
+- Use the NeMo container unless the user explicitly asks for local execution.
+- Prefer Lhotse for train and validation dataloaders.
+- Use `trainer.max_steps`, not `trainer.max_epochs`.
+- Use `val_wer` as the checkpoint monitor for validation.
+- By default, evaluate WER with capitalization and punctuation removed. Change that only when the user explicitly
+  asks for raw/cased/punctuated scoring.
+- Report final quality from standalone evaluation, not only in-training validation logs.
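The difference between raw and default WER in the posture above can be illustrated with a minimal sketch. It assumes a simplified normalization (lowercasing plus ASCII punctuation removal, which also drops apostrophes); the helper names `normalize` and `wer` are hypothetical and are not the NeMo evaluation code.

```python
import string


def normalize(text: str) -> str:
    """Lowercase and strip ASCII punctuation (an assumed, simplified default style)."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate computed from a word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


reference = "Hello, world! It's fine."
hypothesis = "hello world its fine"

raw_wer = wer(reference, hypothesis)
default_wer = wer(normalize(reference), normalize(hypothesis))
print(f"raw WER={raw_wer:.2f} default WER={default_wer:.2f}")
```

A large gap between the two numbers, as in this toy pair, points at transcript-style mismatch rather than recognition errors.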
+
+## Staged Workflow
+
+Load only the reference file needed for the current stage:
+
+1. Setup and checkpoint selection: read `references/setup-checkpoints.md`.
+2. Data prep, transcript-style preflight, Lhotse, bucketing, validation dataloader, and blends: read
+   `references/data-lhotse.md`.
+3. Architecture detection, tokenizer changes, and AED/Canary multitask metrics: read
+   `references/architecture-tokenizer-metrics.md`.
+4. Training, checkpoint averaging, and evaluation: read `references/training-evaluation.md` and, when reporting WER,
+   `references/evaluation-style-contract.md`.
+5. Post-run refinement, error analysis, curriculum, and general-vs-domain evaluation: read
+   `references/refinement-iteration.md`.
+
+If the user explicitly asks for parallel/sub-agent work, split the work by these same stages. Keep each agent scoped to
+one stage and have the main agent integrate the final command/config.
+
+## Core Commands
+
+Generic fine-tuning uses `examples/asr/speech_to_text_finetune.py`. For architecture-specific recipes, route to:
+
+- CTC: `examples/asr/asr_ctc/speech_to_text_ctc_bpe.py`
+- RNNT: `examples/asr/asr_transducer/speech_to_text_rnnt_bpe.py`
+- Hybrid RNNT/CTC or TDT/CTC: `examples/asr/asr_hybrid_transducer_ctc/speech_to_text_hybrid_rnnt_ctc_bpe.py`
+- AED/Canary: `examples/asr/speech_multitask/speech_to_text_aed.py`
+
+Always check the current repo docs before giving version-sensitive claims:
+
+- `README.md`
+- `docs/source/asr/fine_tuning.rst`
+- `docs/source/asr/datasets.rst`
+- `docs/source/dataloaders.rst`
+- `docs/source/asr/featured_models.rst`
+- `docs/source/asr/asr_checkpoints.rst`
+- `nemo/collections/common/data/lhotse/dataloader.py`
+
+## Non-Negotiable Pitfalls
+
+- When changing Lhotse batch modes, explicitly null conflicting options. For OOMptimizer profiles, set
+  `batch_size=null`, `batch_duration=null`, and `quadratic_duration=null` when adding `bucket_batch_size`.
+- Set `model.validation_ds.use_lhotse=true`, but prefer static validation `batch_size` with bucketing disabled.
+- For RNNT/TDT fine-tuning under this skill, do not enable fused loss/WER and do not tune `fused_batch_size`.
+- Run the first OOMptimizer pass with default CLI settings; lower `--memory-fraction` only after a real training OOM.
+- Run preflight checks before long jobs: disk space, free GPUs, manifest validity, and duration/text sanity.
+- Before any fine-tuning, audit transcript style within and across all fine-tuning/validation/test sources. Do not
+  train on mixed casing, punctuation, inverse-text-normalization, or symbol conventions; choose and fix one target style
+  first, and compare it with the original checkpoint's prediction style when applicable.
+- For small domain adaptation, start with a lower LR than large-data fine-tuning; do not blindly use `1e-4`.
+- Do not train a tokenizer on validation or test transcripts.
+- Do not ignore silent Lhotse filtering from `min_duration`, `max_duration`, `min_tps`, and `max_tps`; record how many
+  examples and hours each filter drops.
+- Do not use `amp=true` for inference/evaluation; use `amp=false compute_dtype=bfloat16`.
+- Unless the user asks otherwise, report the default WER with capitalization and punctuation removed, and record any raw
+  WER separately when it helps diagnose transcript-style mismatch.
+- For AED/Canary, configure `multitask_metrics_cfg` so ASR and translation/task-specific samples are evaluated with
+  the right constrained metrics.
+- If checkpoint averaging is used, evaluate the averaged checkpoint and keep it only if it beats the best individual
+  checkpoint.
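One way to make the silent-filtering pitfall visible before a long job is to count what the bounds would drop. The sketch below is an offline approximation, not the Lhotse filter itself. It assumes `min_tps`/`max_tps` bound tokens per second of audio and are inclusive, and that each entry carries a hypothetical precomputed `num_tokens` field.

```python
def filter_report(entries, min_duration, max_duration, min_tps, max_tps):
    """Count entries and hours that duration and tokens-per-second bounds would drop."""
    report = {
        "kept": 0,
        "dropped_duration": 0,
        "dropped_tps": 0,
        "kept_hours": 0.0,
        "dropped_hours": 0.0,
    }
    for entry in entries:
        duration = entry["duration"]  # seconds, as in NeMo manifests
        hours = duration / 3600.0
        if not (min_duration <= duration <= max_duration):
            report["dropped_duration"] += 1
            report["dropped_hours"] += hours
            continue
        # `num_tokens` is a hypothetical field: the tokenized transcript length.
        tps = entry["num_tokens"] / duration if duration > 0 else float("inf")
        if not (min_tps <= tps <= max_tps):
            report["dropped_tps"] += 1
            report["dropped_hours"] += hours
            continue
        report["kept"] += 1
        report["kept_hours"] += hours
    return report


entries = [
    {"duration": 0.2, "num_tokens": 2},     # shorter than min_duration
    {"duration": 10.0, "num_tokens": 20},   # kept
    {"duration": 10.0, "num_tokens": 400},  # 40 tokens/s exceeds max_tps
    {"duration": 50.0, "num_tokens": 100},  # longer than max_duration
]
report = filter_report(entries, min_duration=0.5, max_duration=40.0, min_tps=0.5, max_tps=25.0)
print(report)
```

The counts and hours map onto the "Examples/hours filtered" field of the experiment ledger.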
@@ -0,0 +1,92 @@
+# ASR Fine-Tuning Experiment Ledger
+
+## Goal
+
+- User goal:
+- Target checkpoint:
+- Architecture:
+- Success metric:
+- Optional guardrail metrics:
+
+## Data
+
+- Train manifests or input config:
+- Validation manifest:
+- Test manifest:
+- Data sources and weights:
+- Tarred/non-tarred:
+- Manifest sharding:
+- Transcript style policy:
+- Style transform artifact:
+- Original checkpoint output style:
+- Style mismatch decision:
+
+## Preflight
+
+- Disk space:
+- GPUs:
+- Container/image:
+- Manifest validation:
+- Duration distribution:
+- Text/token distribution:
+- Duration/token filters:
+- Examples/hours filtered:
+
+## Lhotse And OOMptimizer
+
+- Lhotse train:
+- Lhotse validation:
+- Bucketing mode:
+- Duration bins:
+- Bucket batch sizes:
+- Static batch size:
+- `bucket_buffer_size`:
+- `shuffle_buffer_size`:
+- `seed`:
+- `shard_seed`:
+- OOMptimizer settings:
+- Training pilot utilization:
+- CPU memory notes:
+
+## Training
+
+- Init checkpoint:
+- Script/config:
+- Precision:
+- `sync_batchnorm`:
+- `max_steps`:
+- `limit_train_batches`:
+- `val_check_interval`:
+- LR:
+- Scheduler:
+- Warmup:
+- Min LR:
+- Save top K:
+- Command/log path:
+
+## Evaluation
+
+| Model | Artifact | Prediction Manifest | Raw WER | Default WER | CER | Notes |
+| --- | --- | --- | --- | --- | --- | --- |
+| baseline |  |  |  |  |  |  |
+| final `.nemo` |  |  |  |  |  |  |
+| best single |  |  |  |  |  |  |
+| averaged |  |  |  |  |  |  |
+
+Default WER uses capitalization and punctuation removal unless the user requested a different metric.
+
+## Error Analysis
+
+- Raw vs default WER gap:
+- Worst sources/domains:
+- Worst categories:
+- Label/audio defects:
+- Decoding findings:
+
+## Decision
+
+- Keep artifact:
+- Drop artifacts:
+- Next intervention:
+- Reason:
+- If validation/test influenced data or weights, blind holdout plan:
@@ -0,0 +1,153 @@
+# Stage 3: Architecture, Tokenizer, And AED Metrics
+
+## Architecture Detection
+
+Inspect the model config before choosing scripts and overrides:
+
+```python
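+from nemo.collections.asr.models import ASRModel
+
+# return_config=True returns only the model config, without instantiating the model.
+cfg = ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v3", return_config=True)
+print(cfg.target)  # model class path; compare against the class names below
+print(cfg.get("decoder", None))
+print(cfg.get("joint", None))  # present for RNNT-family models
+print(cfg.get("loss", None))  # check loss_name for TDT
+print(cfg.get("decoding", None))  # check model_type for TDT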
+```
+
+Classify:
+
+- CTC: `EncDecCTC*`, `ConvASRDecoder`, no RNNT-style `joint`.
+- RNNT: `EncDecRNNT*` with `decoder` and `joint`.
+- TDT: RNNT-family config with `loss.loss_name: tdt`, `decoding.model_type: tdt`, durations, or extra duration
+  outputs.
+- Hybrid RNNT/CTC or TDT/CTC: `EncDecHybridRNNTCTC*`, `aux_ctc`, `ctc_decoder`.
+- AED/Canary: `EncDecMultiTaskModel`, Transformer decoder, `prompt_format`.
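The classification rules above can be sketched as a small function. This is a heuristic under stated assumptions, not a NeMo API. It takes the config as a plain `dict`, checks the hybrid markers before the TDT markers because a hybrid TDT/CTC config carries both, and returns labels that are hypothetical names chosen for illustration.

```python
def classify_architecture(cfg: dict) -> str:
    """Map a model config (as a plain dict) to an architecture label."""
    target = str(cfg.get("target", ""))
    loss_name = (cfg.get("loss") or {}).get("loss_name")
    model_type = (cfg.get("decoding") or {}).get("model_type")
    is_tdt = loss_name == "tdt" or model_type == "tdt"

    if "EncDecMultiTaskModel" in target or "prompt_format" in cfg:
        return "aed"
    # Hybrid must be tested before TDT: hybrid TDT/CTC configs also set the TDT loss.
    if "EncDecHybridRNNTCTC" in target or "aux_ctc" in cfg:
        return "hybrid_tdt_ctc" if is_tdt else "hybrid_rnnt_ctc"
    if is_tdt:
        return "tdt"
    if "EncDecRNNT" in target or cfg.get("joint") is not None:
        return "rnnt"
    if "EncDecCTC" in target:
        return "ctc"
    return "unknown"


example = {
    "target": "nemo.collections.asr.models.EncDecRNNTBPEModel",
    "joint": {},
    "loss": {"loss_name": "tdt"},
    "decoding": {"model_type": "tdt"},
}
print(classify_architecture(example))
```

An OmegaConf config can be converted first with `OmegaConf.to_container(cfg, resolve=True)`. An `unknown` result should trigger manual inspection rather than a default script choice.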
+
+Use `examples/asr/speech_to_text_finetune.py` for compatible-architecture fine-tuning. For architecture-specific
+recipes:
+
+- CTC: `examples/asr/asr_ctc/speech_to_text_ctc_bpe.py`
+- RNNT: `examples/asr/asr_transducer/speech_to_text_rnnt_bpe.py`
+- Hybrid RNNT/CTC or TDT/CTC: `examples/asr/asr_hybrid_transducer_ctc/speech_to_text_hybrid_rnnt_ctc_bpe.py`
+- AED/Canary: `examples/asr/speech_multitask/speech_to_text_aed.py`
+
+Reference configs to inspect before writing overrides:
+
+- CTC: `examples/asr/conf/fastconformer/fast-conformer_ctc_bpe.yaml`
+- RNNT: `examples/asr/conf/fastconformer/fast-conformer_transducer_bpe.yaml`
+- TDT: `examples/asr/conf/conformer/tdt/conformer_tdt_bpe.yaml`
+- Hybrid TDT/CTC: `examples/asr/conf/fastconformer/hybrid_transducer_ctc/fastconformer_hybrid_tdt_ctc_bpe.yaml`
+- AED/Canary: `examples/asr/conf/speech_multitask/fast-conformer_aed.yaml`
+
+## Tokenizer Decisions
+
+Keep the pretrained tokenizer when language/script, casing, punctuation, and symbols match. Replace or extend it when
+the target language/script is new, important symbols are missing, normalization changes substantially, the run is
+multilingual/code-switching, or Canary/AED prompt/special tokens change.
+
+Train from training text only:
+
+```bash
+python scripts/tokenizers/process_asr_text_tokenizer.py \
+  --manifest=/data/train.json \
+  --data_root=/data/tokenizers/my_tokenizer \
+  --vocab_size=1024 \
+  --tokenizer=spe \
+  --spe_type=unigram \
+  --log
+```
+
+Replace in generic fine-tuning:
+
+```bash
+# Hydra overrides appended to the examples/asr/speech_to_text_finetune.py command:
+model.tokenizer.update_tokenizer=true \
+model.tokenizer.dir=/data/tokenizers/my_tokenizer/tokenizer_spe_unigram_v1024 \
+model.tokenizer.type=bpe
+```
+
+Changing the tokenizer vocabulary size usually reinitializes decoder-side parameters whose shapes depend on it, such as
+the CTC projection or the RNNT/TDT decoder and joint. Use a conservative LR and validate early.
+
+Aggregate tokenizer:
+
+```yaml
+tokenizer:
+  type: agg
+  langs:
+    en:
+      dir: /data/tokenizers/en/tokenizer_spe_unigram_v1024
+      type: bpe
+    es:
+      dir: /data/tokenizers/es/tokenizer_spe_unigram_v1024
+      type: bpe
+```
+
+Standard aggregate ASR configs expect a manifest language field such as `lang`. AED/Canary configs use their own
+prompt and language fields; follow `examples/asr/conf/speech_multitask/fast-conformer_aed.yaml`.
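A quick preflight for the aggregate-tokenizer case is to confirm that every manifest entry names a configured language. The sketch below assumes a JSON-lines manifest and the `lang` field mentioned above; the function name `check_lang_field` is hypothetical.

```python
import json


def check_lang_field(manifest_lines, configured_langs, lang_field="lang"):
    """Return (line_number, problem) pairs for entries that cannot be routed to a tokenizer."""
    problems = []
    for number, line in enumerate(manifest_lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        entry = json.loads(line)
        lang = entry.get(lang_field)
        if lang is None:
            problems.append((number, "missing language field"))
        elif lang not in configured_langs:
            problems.append((number, f"unconfigured language: {lang}"))
    return problems


manifest_lines = [
    '{"audio_filepath": "a.wav", "duration": 1.0, "text": "hello", "lang": "en"}',
    '{"audio_filepath": "b.wav", "duration": 1.5, "text": "hola"}',
    '{"audio_filepath": "c.wav", "duration": 2.0, "text": "bonjour", "lang": "fr"}',
]
problems = check_lang_field(manifest_lines, configured_langs={"en", "es"})
print(problems)
```

For a real manifest, pass an open file handle as `manifest_lines`. The configured set should match the keys under `tokenizer.langs`.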
+
+## Architecture-Specific Knobs
+
+CTC:
+
+- Main tokenizer-sensitive module is the decoder projection.
+- Useful when alignments or non-autoregressive decoding matter.
+- Long transcripts can violate the CTC length constraint: the encoder must emit at least as many frames as there are
+  target tokens. Subword tokenization shortens the target sequence.
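The CTC length constraint can be checked per utterance with a short sketch. CTC needs at least one encoder frame per target token, plus one blank frame between each pair of adjacent identical tokens. The frame-count estimate assumes a 10 ms feature hop and 8x encoder subsampling; both values are assumptions to verify against the model config, and real frame counts can differ slightly because of padding.

```python
def ctc_min_frames(token_ids):
    """Minimum encoder frames CTC needs: one per token plus one blank per adjacent repeat."""
    repeats = sum(1 for a, b in zip(token_ids, token_ids[1:]) if a == b)
    return len(token_ids) + repeats


def encoder_frames(duration_s, frame_shift_ms=10, subsampling=8):
    """Approximate encoder output length (assumed hop size and subsampling factor)."""
    duration_ms = int(round(duration_s * 1000))
    return duration_ms // (frame_shift_ms * subsampling)


def ctc_feasible(token_ids, duration_s):
    """True when the approximate encoder length can align the target under CTC."""
    return encoder_frames(duration_s) >= ctc_min_frames(token_ids)


char_level = list(range(30))  # 30 distinct character tokens for a 2-second clip
subword = list(range(12))     # the same text as 12 subword tokens
print(ctc_feasible(char_level, 2.0), ctc_feasible(subword, 2.0))
```

Under these assumptions a 2-second clip yields about 25 encoder frames, so the 30-token character target cannot be aligned while the 12-token subword target can.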
+
+RNNT:
+
+- Disable fused loss/WER for this skill: `model.joint.fuse_loss_wer=false`.
+- Do not tune `fused_batch_size`; use Lhotse bucketing plus OOMptimizer-generated `bucket_batch_size`.
+- `model.compute_eval_loss=false` is common when validation samples are long and WER is the main metric.
+- Use CUDA graphs for inference/evaluation when supported.
+
+TDT:
+
+- Preserve `loss.loss_name=tdt`, duration settings, extra outputs, and `decoding.model_type=tdt`.
+- Disable fused loss/WER and use Lhotse bucketing plus OOMptimizer.
+- Use CUDA graphs for inference/evaluation when supported.
+
+Hybrid:
+
+- Check `model.aux_ctc.ctc_loss_weight`; reference configs often use `0.3`.
+- Evaluate both decoder paths when relevant with `decoder_type=ctc` and `decoder_type=rnnt`.
+
+AED/Canary:
+
+- Use `examples/asr/speech_multitask/speech_to_text_aed.py`.
+- Preserve `prompt_format` and expected manifest fields.
+- Prefer 2D Lhotse buckets plus OOMptimizer.
+
+## AED/Canary Multitask Metrics
+
+`EncDecMultiTaskModel` reads `model.multitask_metrics_cfg` and constructs `MultiTaskMetric`
+(`nemo/collections/asr/metrics/multitask.py`). Metric constraints are evaluated against each Lhotse cut's `custom`
+dict, including manifest fields and `input_cfg.tags`.
+
+Reference config:
+
+```yaml
+model:
+  multitask_metrics_cfg:
+    log_predictions: true
+    metrics:
+      wer:
+        _target_: nemo.collections.asr.metrics.WER
+        constraint: ".source_lang==.target_lang"
+      bleu:
+        _target_: nemo.collections.asr.metrics.BLEU
+        constraint: ".source_lang!=.target_lang"
+        bleu_tokenizer: 13a
+        check_cuts_for_bleu_tokenizers: false
+```
+
+Use constraints to route ASR samples to WER and translation samples to BLEU. Add dataset/task/domain metadata through
+manifest fields or `input_cfg.tags`, then reference it in constraints such as `.domain==target` or
+`.task==asr and .source_lang==.target_lang`.
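The routing effect of the two reference constraints can be illustrated with a toy function over per-sample metadata. This is not the NeMo constraint parser. It hard-codes the semantics of `.source_lang==.target_lang` and `.source_lang!=.target_lang` on a plain dict standing in for a cut's `custom` fields, and `route_metric` is a hypothetical name.

```python
from collections import Counter


def route_metric(custom: dict) -> str:
    """Mirror the reference constraints: equal languages go to WER, otherwise BLEU."""
    if custom["source_lang"] == custom["target_lang"]:
        return "wer"
    return "bleu"


samples = [
    {"source_lang": "en", "target_lang": "en"},  # ASR
    {"source_lang": "en", "target_lang": "de"},  # translation
    {"source_lang": "es", "target_lang": "es"},  # ASR
]
counts = Counter(route_metric(sample) for sample in samples)
print(dict(counts))
```

Counting routed samples per metric before training is a cheap way to catch a validation set where one metric would receive no samples at all.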
+
+The current implementation supports only one instance of each metric class in a single `multitask_metrics_cfg`. For
+multiple WER slices by language/domain, prefer separate validation manifests/dataloaders or extend metric aggregation
+rather than defining duplicate WER metrics.
+
+For AED validation data, set `use_lhotse: true`, `use_bucketing: false`, static `batch_size`, `text_field: "text"`,
+and `lang_field: "target_lang"`.