-
Notifications
You must be signed in to change notification settings - Fork 3.4k
Refactored all ASR collections documentation #15542
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
55b65a4
df3e3bc
15bed94
15941f2
63144cb
49319f9
0db40f2
63fed73
a66642e
034ca21
9ae4b21
ccf8365
8769deb
edc5841
6e0e501
894bfdb
7b6822b
2afaa15
4ad4a65
c536d10
df71aeb
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,3 +1,5 @@ | ||
| :orphan: | ||
|
|
||
| All Checkpoints | ||
| =============== | ||
| English | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,318 @@ | ||
| .. _asr-checkpoints-list: | ||
|
|
||
| ======================= | ||
| ASR Model Checkpoints | ||
| ======================= | ||
|
|
||
| This page lists all supported ASR model checkpoints released by NVIDIA NeMo. | ||
| Benchmark scores for each model can be found on its `HuggingFace model card <https://huggingface.co/nvidia>`__. | ||
|
|
||
| Glossary | ||
| -------- | ||
|
|
||
| .. list-table:: | ||
| :header-rows: 1 | ||
|
|
||
| * - Term | ||
| - Definition | ||
| * - **ASR** | ||
| - Automatic Speech Recognition — transcribing speech to text | ||
| * - **AST** | ||
| - Automatic Speech Translation — translating speech from one language to another | ||
| * - **AED** | ||
| - Attention Encoder-Decoder — autoregressive decoder using cross-attention (Canary family) | ||
| * - **CTC** | ||
| - Connectionist Temporal Classification — non-autoregressive decoder | ||
| * - **RNN-T** | ||
| - Recurrent Neural Network Transducer — autoregressive streaming-friendly decoder | ||
| * - **TDT** | ||
| - Token-and-Duration Transducer — extends RNN-T with duration prediction for better timestamps | ||
| * - **Hybrid** | ||
| - Joint RNN-T + CTC model — both decoders trained together, either usable at inference | ||
| * - **PnC** | ||
| - Punctuation and Capitalization in the output | ||
| * - **Streaming** | ||
| - Real-time / cache-aware inference capability | ||
|
|
||
|
|
||
| Canary Models (AED) | ||
| ------------------- | ||
|
|
||
| Multi-task encoder-decoder models supporting ASR, AST, PnC, and timestamps across multiple languages. | ||
|
|
||
| .. list-table:: | ||
| :header-rows: 1 | ||
|
|
||
| * - Model | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. iirc some of these didn't really prioritize PnC no? |
||
| - Decoder | ||
| - Capabilities | ||
| - Languages | ||
| * - `canary-1b-v2 <https://huggingface.co/nvidia/canary-1b-v2>`__ | ||
| - AED | ||
| - ASR, AST, PnC, timestamps | ||
| - English + 24 European languages | ||
| * - `canary-qwen-2.5b <https://huggingface.co/nvidia/canary-qwen-2.5b>`__ | ||
| - SALM | ||
| - ASR, AST, PnC, timestamps | ||
| - English + 24 European languages | ||
| * - `canary-1b-flash <https://huggingface.co/nvidia/canary-1b-flash>`__ | ||
| - AED | ||
| - ASR, AST, PnC, timestamps, fast | ||
| - English, German, Spanish, French | ||
| * - `canary-180m-flash <https://huggingface.co/nvidia/canary-180m-flash>`__ | ||
| - AED | ||
| - ASR, AST, PnC, timestamps, fast | ||
| - English, German, Spanish, French | ||
| * - `canary-1b <https://huggingface.co/nvidia/canary-1b>`__ | ||
| - AED | ||
| - ASR, AST, PnC, timestamps | ||
| - English, German, Spanish, French | ||
|
|
||
|
|
||
| Parakeet Models | ||
| ----------------- | ||
|
|
||
| High-accuracy ASR models built on the FastConformer encoder architecture. | ||
| Parakeet, Nemotron Speech, and the ``stt_*_fastconformer_*`` models below all share the same underlying FastConformer encoder; | ||
| the different names reflect release branding, not architectural differences. | ||
|
|
||
| .. list-table:: | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why does this table define language in Size, and the next table of streaming models defines language in Language? Add Language column here. |
||
| :header-rows: 1 | ||
|
|
||
| * - Model | ||
| - Decoder | ||
| - Capabilities | ||
| - Size | ||
| * - `parakeet-tdt-0.6b-v3 <https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3>`__ | ||
| - TDT | ||
| - ASR, PnC, timestamps | ||
| - 0.6B | ||
| * - `parakeet-tdt-0.6b-v2 <https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2>`__ | ||
| - TDT | ||
| - ASR, PnC, timestamps | ||
| - 0.6B | ||
| * - `parakeet-tdt-1.1b <https://huggingface.co/nvidia/parakeet-tdt-1.1b>`__ | ||
| - TDT | ||
| - ASR, timestamps | ||
| - 1.1B | ||
| * - `parakeet-tdt_ctc-1.1b <https://huggingface.co/nvidia/parakeet-tdt_ctc-1.1b>`__ | ||
| - Hybrid TDT+CTC | ||
| - ASR, timestamps | ||
| - 1.1B | ||
| * - `parakeet-tdt_ctc-0.6b-ja <https://huggingface.co/nvidia/parakeet-tdt_ctc-0.6b-ja>`__ | ||
| - Hybrid TDT+CTC | ||
| - ASR, timestamps | ||
| - 0.6B (Japanese) | ||
| * - `parakeet-tdt_ctc-110m <https://huggingface.co/nvidia/parakeet-tdt_ctc-110m>`__ | ||
| - Hybrid TDT+CTC | ||
| - ASR, timestamps | ||
| - 110M | ||
| * - `parakeet-rnnt-1.1b <https://huggingface.co/nvidia/parakeet-rnnt-1.1b>`__ | ||
| - RNN-T | ||
| - ASR, timestamps | ||
| - 1.1B | ||
| * - `parakeet-rnnt-0.6b <https://huggingface.co/nvidia/parakeet-rnnt-0.6b>`__ | ||
| - RNN-T | ||
| - ASR, timestamps | ||
| - 0.6B | ||
| * - `parakeet-ctc-1.1b <https://huggingface.co/nvidia/parakeet-ctc-1.1b>`__ | ||
| - CTC | ||
| - ASR | ||
| - 1.1B | ||
| * - `parakeet-ctc-0.6b <https://huggingface.co/nvidia/parakeet-ctc-0.6b>`__ | ||
| - CTC | ||
| - ASR | ||
| - 0.6B | ||
| * - `parakeet-ctc-0.6b-Vietnamese <https://huggingface.co/nvidia/parakeet-ctc-0.6b-Vietnamese>`__ | ||
| - CTC | ||
| - ASR | ||
| - 0.6B (Vietnamese) | ||
| * - `parakeet-rnnt-110m-da-dk <https://huggingface.co/nvidia/parakeet-rnnt-110m-da-dk>`__ | ||
| - RNN-T | ||
| - ASR | ||
| - 110M (Danish) | ||
|
|
||
|
|
||
| Streaming Models | ||
| ----------------- | ||
|
|
||
| Cache-aware models for real-time / low-latency inference. | ||
|
|
||
| .. list-table:: | ||
| :header-rows: 1 | ||
|
|
||
| * - Model | ||
| - Decoder | ||
| - Capabilities | ||
| - Language | ||
| * - `nemotron-speech-streaming-en-0.6b <https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b>`__ | ||
| - Hybrid | ||
| - ASR, streaming | ||
| - English | ||
| * - `multitalker-parakeet-streaming-0.6b-v1 <https://huggingface.co/nvidia/multitalker-parakeet-streaming-0.6b-v1>`__ | ||
| - RNN-T | ||
| - ASR, multitalker, streaming | ||
| - English | ||
| * - `parakeet_realtime_eou_120m-v1 <https://huggingface.co/nvidia/parakeet_realtime_eou_120m-v1>`__ | ||
| - RNN-T | ||
| - ASR, end-of-utterance, streaming | ||
| - English | ||
| * - `stt_en_fastconformer_hybrid_large_streaming_multi <https://huggingface.co/nvidia/stt_en_fastconformer_hybrid_large_streaming_multi>`__ | ||
| - Hybrid | ||
| - ASR, streaming, multiple look-aheads | ||
| - English | ||
| * - `stt_en_fastconformer_hybrid_medium_streaming_80ms_pc <https://huggingface.co/nvidia/stt_en_fastconformer_hybrid_medium_streaming_80ms_pc>`__ | ||
| - Hybrid | ||
| - ASR, PnC, streaming | ||
| - English | ||
| * - `stt_en_fastconformer_hybrid_medium_streaming_80ms <https://huggingface.co/nvidia/stt_en_fastconformer_hybrid_medium_streaming_80ms>`__ | ||
| - Hybrid | ||
| - ASR, streaming | ||
| - English | ||
| * - `stt_ka_fastconformer_hybrid_transducer_ctc_large_streaming_80ms_pc <https://huggingface.co/nvidia/stt_ka_fastconformer_hybrid_transducer_ctc_large_streaming_80ms_pc>`__ | ||
| - Hybrid | ||
| - ASR, PnC, streaming | ||
| - Georgian | ||
| * - `stt_en_fastconformer_hybrid_large_streaming_1040ms <https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_en_fastconformer_hybrid_large_streaming_1040ms>`__ | ||
| - Hybrid | ||
| - ASR, streaming | ||
| - English | ||
|
|
||
|
|
||
| FastConformer English Models (Non-Streaming) | ||
| ---------------------------------------------- | ||
|
|
||
| .. list-table:: | ||
| :header-rows: 1 | ||
|
|
||
| * - Model | ||
| - Decoder | ||
| - Capabilities | ||
| - Size | ||
| * - `stt_en_fastconformer_hybrid_large_pc <https://huggingface.co/nvidia/stt_en_fastconformer_hybrid_large_pc>`__ | ||
| - Hybrid | ||
| - ASR, PnC | ||
| - Large | ||
| * - `stt_en_fastconformer_ctc_large <https://huggingface.co/nvidia/stt_en_fastconformer_ctc_large>`__ | ||
| - CTC | ||
| - ASR | ||
| - Large | ||
| * - `stt_en_fastconformer_ctc_xlarge <https://huggingface.co/nvidia/stt_en_fastconformer_ctc_xlarge>`__ | ||
| - CTC | ||
| - ASR | ||
| - XLarge | ||
| * - `stt_en_fastconformer_ctc_xxlarge <https://huggingface.co/nvidia/stt_en_fastconformer_ctc_xxlarge>`__ | ||
| - CTC | ||
| - ASR | ||
| - XXLarge | ||
| * - `stt_en_fastconformer_transducer_large <https://huggingface.co/nvidia/stt_en_fastconformer_transducer_large>`__ | ||
| - RNN-T | ||
| - ASR | ||
| - Large | ||
| * - `stt_en_fastconformer_transducer_xlarge <https://huggingface.co/nvidia/stt_en_fastconformer_transducer_xlarge>`__ | ||
| - RNN-T | ||
| - ASR | ||
| - XLarge | ||
| * - `stt_en_fastconformer_transducer_xxlarge <https://huggingface.co/nvidia/stt_en_fastconformer_transducer_xxlarge>`__ | ||
| - RNN-T | ||
| - ASR | ||
| - XXLarge | ||
| * - `stt_en_fastconformer_tdt_large <https://huggingface.co/nvidia/stt_en_fastconformer_tdt_large>`__ | ||
| - TDT | ||
| - ASR | ||
| - Large | ||
|
|
||
|
|
||
| FastConformer Multilingual Models | ||
| ---------------------------------- | ||
|
|
||
| Single-language FastConformer Hybrid models. Models with ``_pc`` suffix support punctuation and capitalization. | ||
|
|
||
| .. list-table:: | ||
| :header-rows: 1 | ||
|
|
||
| * - Language | ||
| - Model | ||
| - PnC | ||
| * - Multilingual EU | ||
| - `stt_multilingual_fastconformer_hybrid_large_pc_blend_eu <https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_multilingual_fastconformer_hybrid_large_pc_blend_eu>`__ | ||
| - Yes | ||
| * - German | ||
| - `stt_de_fastconformer_hybrid_large_pc <https://huggingface.co/nvidia/stt_de_fastconformer_hybrid_large_pc>`__ | ||
| - Yes | ||
| * - Spanish | ||
| - `stt_es_fastconformer_hybrid_large_pc <https://huggingface.co/nvidia/stt_es_fastconformer_hybrid_large_pc>`__ | ||
| - Yes | ||
| * - Spanish (no-caps) | ||
| - `stt_es_fastconformer_hybrid_large_pc_nc <https://huggingface.co/nvidia/stt_es_fastconformer_hybrid_large_pc_nc>`__ | ||
| - Punctuation only | ||
| * - French | ||
| - `stt_fr_fastconformer_hybrid_large_pc <https://huggingface.co/nvidia/stt_fr_fastconformer_hybrid_large_pc>`__ | ||
| - Yes | ||
| * - Italian | ||
| - `stt_it_fastconformer_hybrid_large_pc <https://huggingface.co/nvidia/stt_it_fastconformer_hybrid_large_pc>`__ | ||
| - Yes | ||
| * - Russian | ||
| - `stt_ru_fastconformer_hybrid_large_pc <https://huggingface.co/nvidia/stt_ru_fastconformer_hybrid_large_pc>`__ | ||
| - Yes | ||
| * - Ukrainian | ||
| - `stt_ua_fastconformer_hybrid_large_pc <https://huggingface.co/nvidia/stt_ua_fastconformer_hybrid_large_pc>`__ | ||
| - Yes | ||
| * - Polish | ||
| - `stt_pl_fastconformer_hybrid_large_pc <https://huggingface.co/nvidia/stt_pl_fastconformer_hybrid_large_pc>`__ | ||
| - Yes | ||
| * - Croatian | ||
| - `stt_hr_fastconformer_hybrid_large_pc <https://huggingface.co/nvidia/stt_hr_fastconformer_hybrid_large_pc>`__ | ||
| - Yes | ||
| * - Belarusian | ||
| - `stt_be_fastconformer_hybrid_large_pc <https://huggingface.co/nvidia/stt_be_fastconformer_hybrid_large_pc>`__ | ||
| - Yes | ||
| * - Dutch | ||
| - `stt_nl_fastconformer_hybrid_large_pc <https://huggingface.co/nvidia/stt_nl_fastconformer_hybrid_large_pc>`__ | ||
| - Yes | ||
| * - Portuguese | ||
| - `stt_pt_fastconformer_hybrid_large_pc <https://huggingface.co/nvidia/stt_pt_fastconformer_hybrid_large_pc>`__ | ||
| - Yes | ||
| * - Farsi | ||
| - `stt_fa_fastconformer_hybrid_large <https://huggingface.co/nvidia/stt_fa_fastconformer_hybrid_large>`__ | ||
| - No | ||
| * - Georgian | ||
| - `stt_ka_fastconformer_hybrid_large_pc <https://huggingface.co/nvidia/stt_ka_fastconformer_hybrid_large_pc>`__ | ||
| - Yes | ||
| * - Armenian | ||
| - `stt_hy_fastconformer_hybrid_large_pc <https://huggingface.co/nvidia/stt_hy_fastconformer_hybrid_large_pc>`__ | ||
| - Yes | ||
| * - Arabic | ||
| - `stt_ar_fastconformer_hybrid_large_pc_v1.0 <https://huggingface.co/nvidia/stt_ar_fastconformer_hybrid_large_pc_v1.0>`__ | ||
| - Yes | ||
| * - Arabic (diacritized) | ||
| - `stt_ar_fastconformer_hybrid_large_pcd_v1.0 <https://huggingface.co/nvidia/stt_ar_fastconformer_hybrid_large_pcd_v1.0>`__ | ||
| - Yes | ||
| * - Uzbek | ||
| - `stt_uz_fastconformer_hybrid_large_pc <https://huggingface.co/nvidia/stt_uz_fastconformer_hybrid_large_pc>`__ | ||
| - Yes | ||
| * - Kazakh + Russian | ||
| - `stt_kk_ru_fastconformer_hybrid_large <https://huggingface.co/nvidia/stt_kk_ru_fastconformer_hybrid_large>`__ | ||
| - No | ||
|
|
||
|
|
||
| Loading Models | ||
| -------------- | ||
|
|
||
| All models can be loaded via the ``from_pretrained()`` API: | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Revise: |
||
|
|
||
| .. code-block:: python | ||
|
|
||
| import nemo.collections.asr as nemo_asr | ||
|
|
||
| # From HuggingFace (prefix with nvidia/) | ||
| model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2") | ||
|
|
||
| # From NGC (no prefix) | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Remove NGC from here, we're not posting checkpoints there anymore AFAIK |
||
| model = nemo_asr.models.ASRModel.from_pretrained("stt_en_fastconformer_transducer_large") | ||
|
|
||
| To list all available models programmatically: | ||
|
|
||
| .. code-block:: python | ||
|
|
||
| nemo_asr.models.ASRModel.list_available_models() | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is that function always up to date? It should be enough to just list them in the documentation. |
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add
SALM - Speech augmented Language Modelin the glossary for canary qwen