|
| 1 | +# Stage 3: Architecture, Tokenizer, And AED Metrics |
| 2 | + |
| 3 | +## Architecture Detection |
| 4 | + |
| 5 | +Inspect the model config before choosing scripts and overrides: |
| 6 | + |
| 7 | +```python |
| 8 | +from nemo.collections.asr.models import ASRModel |
| 9 | + |
| 10 | +cfg = ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v3", return_config=True) |
| 11 | +print(cfg.target) |
| 12 | +print(cfg.get("decoder", None)) |
| 13 | +print(cfg.get("joint", None)) |
| 14 | +print(cfg.get("loss", None)) |
| 15 | +print(cfg.get("decoding", None)) |
| 16 | +``` |
| 17 | + |
| 18 | +Classify: |
| 19 | + |
| 20 | +- CTC: `EncDecCTC*`, `ConvASRDecoder`, no RNNT-style `joint`. |
| 21 | +- RNNT: `EncDecRNNT*` with `decoder` and `joint`. |
| 22 | +- TDT: RNNT-family config with `loss.loss_name: tdt`, `decoding.model_type: tdt`, durations, or extra duration |
| 23 | + outputs. |
| 24 | +- Hybrid RNNT/CTC or TDT/CTC: `EncDecHybridRNNTCTC*`, `aux_ctc`, `ctc_decoder`. |
| 25 | +- AED/Canary: `EncDecMultiTaskModel`, Transformer decoder, `prompt_format`. |
| 26 | + |
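The classification rules above can be sketched as a small heuristic. This is a minimal sketch, not part of NeMo: `classify_asr_architecture` is a hypothetical helper, and it assumes the config exposes the same top-level keys printed in the inspection snippet (`target`, `decoder`, `joint`, `loss`, `decoding`). It ignores the duration and extra-output hints, so treat its result as a starting point and confirm against the reference configs.

```python
from typing import Any, Mapping


def classify_asr_architecture(cfg: Mapping[str, Any]) -> str:
    """Heuristic architecture label for an ASR model config (hypothetical helper)."""
    target = str(cfg.get("target", ""))
    loss = cfg.get("loss") or {}
    decoding = cfg.get("decoding") or {}

    # AED/Canary: multitask model class or a prompt format in the config.
    if "EncDecMultiTaskModel" in target or "prompt_format" in cfg:
        return "aed"

    # TDT is an RNNT-family config marked through the loss or decoding section.
    is_tdt = loss.get("loss_name") == "tdt" or decoding.get("model_type") == "tdt"
    # Hybrid models carry an auxiliary CTC head next to the transducer.
    is_hybrid = "EncDecHybridRNNTCTC" in target or "aux_ctc" in cfg

    if is_hybrid:
        return "hybrid_tdt_ctc" if is_tdt else "hybrid_rnnt_ctc"
    if is_tdt:
        return "tdt"
    if "EncDecRNNT" in target or ("decoder" in cfg and "joint" in cfg):
        return "rnnt"
    if "EncDecCTC" in target:
        return "ctc"
    return "unknown"
```

Checking hybrid before plain RNNT matters because hybrid configs also contain `decoder` and `joint`.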
| 27 | +Use `examples/asr/speech_to_text_finetune.py` for compatible-architecture fine-tuning. For architecture-specific |
| 28 | +recipes: |
| 29 | + |
| 30 | +- CTC: `examples/asr/asr_ctc/speech_to_text_ctc_bpe.py` |
| 31 | +- RNNT: `examples/asr/asr_transducer/speech_to_text_rnnt_bpe.py` |
| 32 | +- Hybrid RNNT/CTC or TDT/CTC: `examples/asr/asr_hybrid_transducer_ctc/speech_to_text_hybrid_rnnt_ctc_bpe.py` |
| 33 | +- AED/Canary: `examples/asr/speech_multitask/speech_to_text_aed.py` |
| 34 | + |
| 35 | +Reference configs to inspect before writing overrides: |
| 36 | + |
| 37 | +- CTC: `examples/asr/conf/fastconformer/fast-conformer_ctc_bpe.yaml` |
| 38 | +- RNNT: `examples/asr/conf/fastconformer/fast-conformer_transducer_bpe.yaml` |
| 39 | +- TDT: `examples/asr/conf/conformer/tdt/conformer_tdt_bpe.yaml` |
| 40 | +- Hybrid TDT/CTC: `examples/asr/conf/fastconformer/hybrid_transducer_ctc/fastconformer_hybrid_tdt_ctc_bpe.yaml` |
| 41 | +- AED/Canary: `examples/asr/conf/speech_multitask/fast-conformer_aed.yaml` |
| 42 | + |
| 43 | +## Tokenizer Decisions |
| 44 | + |
| 45 | +Keep the pretrained tokenizer when language/script, casing, punctuation, and symbols match. Replace or extend it when |
| 46 | +the target language/script is new, important symbols are missing, normalization changes substantially, the run is |
| 47 | +multilingual/code-switching, or Canary/AED prompt/special tokens change. |
| 48 | + |
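One way to check the "important symbols are missing" condition is to count transcript characters that no vocabulary piece contains. The sketch below is illustrative only: `uncovered_characters` is a hypothetical helper, it assumes a JSON-lines manifest with a `text` field, and it assumes you have already exported the tokenizer's pieces as a list of strings. Character-level coverage is a rough signal and does not account for byte fallback or normalization inside the tokenizer.

```python
import json
from collections import Counter
from typing import Iterable, Set


def uncovered_characters(manifest_lines: Iterable[str], vocab: Iterable[str]) -> Counter:
    """Count transcript characters that appear in no vocabulary piece (hypothetical helper)."""
    covered: Set[str] = set()
    for piece in vocab:
        # SentencePiece marks word boundaries with U+2581; treat it as a space.
        covered.update(piece.replace("\u2581", " "))
    covered.add(" ")

    missing: Counter = Counter()
    for line in manifest_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the manifest
        text = json.loads(line)["text"]
        missing.update(ch for ch in text if ch not in covered)
    return missing
```

A non-empty result with frequent characters suggests replacing or extending the tokenizer; a handful of rare symbols may be better handled by text normalization.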
| 49 | +Train from training text only: |
| 50 | + |
| 51 | +```bash |
| 52 | +python scripts/tokenizers/process_asr_text_tokenizer.py \ |
| 53 | + --manifest=/data/train.json \ |
| 54 | + --data_root=/data/tokenizers/my_tokenizer \ |
| 55 | + --vocab_size=1024 \ |
| 56 | + --tokenizer=spe \ |
| 57 | + --spe_type=unigram \ |
| 58 | + --log |
| 59 | +``` |
| 60 | + |
| 61 | +Replace in generic fine-tuning: |
| 62 | + |
| 63 | +```bash |
| 64 | +model.tokenizer.update_tokenizer=true \ |
| 65 | +model.tokenizer.dir=/data/tokenizers/my_tokenizer/tokenizer_spe_unigram_v1024 \ |
| 66 | +model.tokenizer.type=bpe |
| 67 | +``` |
| 68 | + |
| 69 | +Changing tokenizer size usually reinitializes decoder-side parameters such as CTC projection or RNNT/TDT |
| 70 | +decoder/joint pieces. Use conservative LR and validate early. |
| 71 | + |
| 72 | +Aggregate tokenizer: |
| 73 | + |
| 74 | +```yaml |
| 75 | +tokenizer: |
| 76 | + type: agg |
| 77 | + langs: |
| 78 | + en: |
| 79 | + dir: /data/tokenizers/en/tokenizer_spe_unigram_v1024 |
| 80 | + type: bpe |
| 81 | + es: |
| 82 | + dir: /data/tokenizers/es/tokenizer_spe_unigram_v1024 |
| 83 | + type: bpe |
| 84 | +``` |
| 85 | + |
| 86 | +Standard aggregate ASR configs expect a manifest language field such as `lang`. AED/Canary configs use their own |
| 87 | +prompt and language fields; follow `examples/asr/conf/speech_multitask/fast-conformer_aed.yaml`. |
| 88 | + |
| 89 | +## Architecture-Specific Knobs |
| 90 | + |
| 91 | +CTC: |
| 92 | + |
| 93 | +- Main tokenizer-sensitive module is the decoder projection. |
| 94 | +- Useful when alignments or non-autoregressive decoding matter. |
| 95 | +- Long transcripts can violate CTC length constraints; subword tokenization helps reduce target length. |
| 96 | + |
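The CTC length constraint mentioned above can be checked per sample before training. CTC needs at least one encoder frame per target token, plus one blank frame between each pair of identical adjacent tokens. The helpers below are a minimal sketch with hypothetical names; the frame count is the encoder output length after subsampling, which depends on the model.

```python
from typing import Sequence


def min_ctc_frames(token_ids: Sequence[int]) -> int:
    """Smallest number of encoder frames that can emit this target under CTC."""
    # Each adjacent repeated token needs a blank frame separating the two copies.
    repeats = sum(1 for a, b in zip(token_ids, token_ids[1:]) if a == b)
    return len(token_ids) + repeats


def fits_ctc(num_encoder_frames: int, token_ids: Sequence[int]) -> bool:
    """True when the target is short enough for a valid CTC alignment to exist."""
    return num_encoder_frames >= min_ctc_frames(token_ids)
```

Samples that fail this check can be filtered out or re-segmented; a larger subword vocabulary shortens `token_ids` and makes the constraint easier to satisfy.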
| 97 | +RNNT: |
| 98 | + |
| 99 | +- Disable fused loss/WER for this skill: `model.joint.fuse_loss_wer=false`. |
| 100 | +- Do not tune `fused_batch_size`; use Lhotse bucketing plus OOMptimizer-generated `bucket_batch_size`. |
| 101 | +- `model.compute_eval_loss=false` is common when validation samples are long and WER is the main metric. |
| 102 | +- Use CUDA graphs for inference/evaluation when supported. |
| 103 | + |
| 104 | +TDT: |
| 105 | + |
| 106 | +- Preserve `loss.loss_name=tdt`, duration settings, extra outputs, and `decoding.model_type=tdt`. |
| 107 | +- Disable fused loss/WER and use Lhotse bucketing plus OOMptimizer. |
| 108 | +- Use CUDA graphs for inference/evaluation when supported. |
| 109 | + |
| 110 | +Hybrid: |
| 111 | + |
| 112 | +- Check `model.aux_ctc.ctc_loss_weight`; reference configs often use `0.3`. |
| 113 | +- Evaluate both decoder paths when relevant with `decoder_type=ctc` and `decoder_type=rnnt`. |
| 114 | + |
| 115 | +AED/Canary: |
| 116 | + |
| 117 | +- Use `examples/asr/speech_multitask/speech_to_text_aed.py`. |
| 118 | +- Preserve `prompt_format` and expected manifest fields. |
| 119 | +- Prefer 2D Lhotse buckets plus OOMptimizer. |
| 120 | + |
| 121 | +## AED/Canary Multitask Metrics |
| 122 | + |
| 123 | +`EncDecMultiTaskModel` reads `model.multitask_metrics_cfg` and constructs `MultiTaskMetric` |
| 124 | +(`nemo/collections/asr/metrics/multitask.py`). Metric constraints are evaluated against each Lhotse cut's `custom` |
| 125 | +dict, including manifest fields and `input_cfg.tags`. |
| 126 | + |
| 127 | +Reference config: |
| 128 | + |
| 129 | +```yaml |
| 130 | +model: |
| 131 | + multitask_metrics_cfg: |
| 132 | + log_predictions: true |
| 133 | + metrics: |
| 134 | + wer: |
| 135 | + _target_: nemo.collections.asr.metrics.WER |
| 136 | + constraint: ".source_lang==.target_lang" |
| 137 | + bleu: |
| 138 | + _target_: nemo.collections.asr.metrics.BLEU |
| 139 | + constraint: ".source_lang!=.target_lang" |
| 140 | + bleu_tokenizer: 13a |
| 141 | + check_cuts_for_bleu_tokenizers: false |
| 142 | +``` |
| 143 | + |
| 144 | +Use constraints to route ASR samples to WER and translation samples to BLEU. Add dataset/task/domain metadata through |
| 145 | +manifest fields or `input_cfg.tags`, then reference it in constraints such as `.domain==target` or |
| 146 | +`.task==asr and .source_lang==.target_lang`. |
| 147 | + |
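To make the routing concrete, the sketch below mimics the effect of the two constraints in the reference config on a list of per-cut `custom` dicts. It is an illustration only: `route_by_language` is a hypothetical helper and does not reproduce how `MultiTaskMetric` parses or evaluates constraint strings.

```python
from typing import Any, Dict, List, Mapping


def route_by_language(customs: List[Mapping[str, Any]]) -> Dict[str, List[int]]:
    """Split sample indices the way the WER/BLEU constraints above would (illustrative)."""
    routes: Dict[str, List[int]] = {"wer": [], "bleu": []}
    for idx, custom in enumerate(customs):
        if custom["source_lang"] == custom["target_lang"]:
            routes["wer"].append(idx)   # transcription: .source_lang==.target_lang
        else:
            routes["bleu"].append(idx)  # translation: .source_lang!=.target_lang
    return routes
```

For example, a batch with one `en`-to-`en` cut and one `en`-to-`de` cut would send the first index to WER and the second to BLEU.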
| 148 | +The current implementation supports only one instance of each metric class in a single `multitask_metrics_cfg`. For |
| 149 | +multiple WER slices by language/domain, prefer separate validation manifests/dataloaders or extend metric aggregation |
| 150 | +rather than defining duplicate WER metrics. |
| 151 | + |
| 152 | +For AED validation data, set `use_lhotse: true`, `use_bucketing: false`, static `batch_size`, `text_field: "text"`, |
| 153 | +and `lang_field: "target_lang"`. |