Refactored all ASR collections documentation #15542

Open

Ssofja wants to merge 19 commits into main from asr-collections-ref

Conversation


@Ssofja (Collaborator) commented Mar 23, 2026

What does this PR do?

This PR represents the full refactoring of the ASR collections documentation.
Collection: [docs]

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

Signed-off-by: Ssofja <sofiakostandian@gmail.com>
@Ssofja requested a review from pzelasko March 23, 2026 23:34
@github-actions bot added the ASR label Mar 23, 2026
@Ssofja requested reviews from artbataev and nithinraok March 23, 2026 23:34
@pzelasko changed the title from "Refactored all ASR collections module" to "Refactored all ASR collections documentation" Mar 23, 2026
Comment thread docs/source/asr/intro.rst Outdated
Comment thread docs/source/asr/models.rst
Comment thread docs/source/asr/models.rst Outdated
Comment thread docs/source/asr/models.rst Outdated
Comment thread docs/source/asr/asr_checkpoints.rst

10) Cleanup step. Compute the full-batch WER and log it. Concatenate the loss list and pass it to PTL to compute the equivalent of the original (full-batch) Joint step. Delete the ancillary objects used for sub-batching.
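The aggregation in this cleanup step can be sketched in plain Python (a minimal illustration, not the NeMo implementation, which works on torch tensors and logs through PTL):

```python
# Minimal sketch of the cleanup step: combine per-sub-batch losses into
# the full-batch loss and pool error counts into the full-batch WER.

def aggregate_sub_batches(losses, sizes, errors, words):
    """losses: mean loss per sub-batch; sizes: utterances per sub-batch;
    errors/words: edit-distance errors and reference words per sub-batch."""
    total = sum(sizes)
    # A size-weighted mean reproduces the loss of the original full batch.
    full_batch_loss = sum(l * n for l, n in zip(losses, sizes)) / total
    # WER must be pooled over counts, not averaged over per-sub-batch WERs.
    full_batch_wer = sum(errors) / sum(words)
    return full_batch_loss, full_batch_wer

loss, wer = aggregate_sub_batches(
    losses=[2.0, 4.0], sizes=[3, 1], errors=[2, 3], words=[20, 30]
)
print(loss, wer)  # 2.5 0.1
```

Note that averaging the two sub-batch WERs directly would give a different (wrong) answer whenever the sub-batches contain different numbers of reference words.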

Transducer Decoding
Collaborator:

Note to self and other reviewers - decoding docs are now placed in Inference and ASR Language Modeling and Customization


Refer to the :ref:`Audio Augmentors <asr-api-audio-augmentors>` API section for more details.

Tokenizer Configurations
Collaborator:

We need to add one more code block: an example of AggregateTokenizer.
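For reference, a minimal sketch of the kind of block such an example could show, written here as the Python-dict equivalent of the YAML config; the `agg` type and per-language sub-configs follow NeMo's aggregate tokenizer convention, and the directory paths are placeholders:

```python
# Illustrative aggregate-tokenizer config (dict form of the YAML block);
# each language key maps to its own monolingual tokenizer directory.
aggregate_tokenizer_cfg = {
    "tokenizer": {
        "type": "agg",  # selects the aggregate tokenizer
        "langs": {
            "en": {"type": "bpe", "dir": "/tokenizers/en_bpe"},
            "es": {"type": "bpe", "dir": "/tokenizers/es_bpe"},
        },
    }
}

print(sorted(aggregate_tokenizer_cfg["tokenizer"]["langs"]))  # ['en', 'es']
```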


.. _asr-configs-augmentation-configurations:

Augmentation Configurations
Collaborator:

I feel we should keep the SpecAugment part of this section.
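For context, a SpecAugment block in NeMo ASR configs typically looks roughly like the following (shown as a Python dict; the mask counts and widths are commonly seen FastConformer defaults, not authoritative values, so verify against the preprocessor docs):

```python
# Illustrative SpecAugment block (dict form of the YAML); values are
# common defaults, not authoritative.
spec_augment_cfg = {
    "_target_": "nemo.collections.asr.modules.SpectrogramAugmentation",
    "freq_masks": 2,     # number of frequency masks
    "time_masks": 10,    # number of time masks
    "freq_width": 27,    # max width of each frequency mask (bins)
    "time_width": 0.05,  # max width of each time mask (fraction of length)
}
print(spec_augment_cfg["freq_masks"], spec_augment_cfg["time_masks"])  # 2 10
```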


.. _asr-configs-preprocessor-configuration:

Preprocessor Configuration
Collaborator:

I think this should be kept

Collaborator:

yeah, users are normally confused by this portion so would need more documentation - if anything.

use_cer: false
log_prediction: true

BLEU Score
Collaborator:

I would revert the compaction of this section - I think it's pretty recent and describes various config tweaks introduced by @bonham79

Collaborator:

yeah this is deleting a lot of things that are hidden in the code and some improved user functionality. without this you're basically just forcing dependence on torchmetric documentation - and that ain't pretty.
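To make the `use_cer` switch concrete: it changes the edit-distance unit from words to characters. A minimal stdlib sketch, not NeMo's torchmetrics-backed implementation:

```python
def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance over sequences.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def error_rate(ref, hyp, use_cer=False):
    # use_cer=True tokenizes into characters (CER); otherwise words (WER).
    r = list(ref) if use_cer else ref.split()
    h = list(hyp) if use_cer else hyp.split()
    return edit_distance(r, h) / len(r)

print(error_rate("the cat sat", "the cat sad"))                # WER: 1/3
print(error_rate("the cat sat", "the cat sad", use_cer=True))  # CER: 1/11
```

CER is the usual choice for languages without whitespace word boundaries, which is why the flag exists alongside WER.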

@nithinraok (Member):

/claude review

Comment thread docs/source/asr/fine_tuning.rst Outdated
Comment on lines +150 to +151
* `CTC Fine-tuning README <https://github.com/NVIDIA/NeMo/tree/main/examples/asr/conf/asr_finetune>`_
* `Transducer Fine-tuning README <https://github.com/NVIDIA/NeMo/tree/main/examples/asr/conf/asr_finetune>`_
Contributor:

Both links point to the exact same URL (examples/asr/conf/asr_finetune). The Transducer link should presumably point to a different location (e.g., examples/asr/asr_transducer or examples/asr/conf/asr_finetune with an anchor for transducer-specific instructions). As-is, labeling two identical URLs as "CTC" and "Transducer" is misleading.


claude bot commented Mar 24, 2026

Overall this is a clean docs refactor. One issue found:

  • fine_tuning.rst: The CTC and Transducer fine-tuning README links both point to the same URL — one of them likely needs a different target.

Minor note: docs/source/asr/all_chkpt.rst appears to be orphaned after this PR (no remaining references point to it). Consider deleting it or adding a redirect if it was intentionally replaced by asr_checkpoints.rst.

.. list-table::
:header-rows: 1

* - Model
Collaborator:

iirc some of these didn't really prioritize PnC no?

Comment thread docs/source/asr/asr_checkpoints.rst Outdated
* - `nemotron-speech-streaming-en-0.6b <https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b>`__
- Hybrid
- ASR, streaming
- en
Collaborator:

It may be more economical to just list the architecture and configure a list of supported language models, or maybe a matrix?

Comment thread docs/source/asr/asr_checkpoints.rst Outdated
* - `stt_ka_fastconformer_hybrid_transducer_ctc_large_streaming_80ms_pc <https://huggingface.co/nvidia/stt_ka_fastconformer_hybrid_transducer_ctc_large_streaming_80ms_pc>`__
- Hybrid
- ASR, PnC, streaming
- ka
Collaborator:

Yeah, on Piotr's above point, few know the Georgian language code off hand.

Comment thread docs/source/asr/asr_checkpoints.rst Outdated
.. list-table::
:header-rows: 1

* - Model
Collaborator:

I'd move all fastconformers underneath parakeet. This'll just lead to confusion.

Collaborator:

I think it's OK, the concept here is that fastconformer are the older models and parakeet are the newer models.

Collaborator:

ehhh, i think our branding efforts are causing confusion, especially now Nemotron Speech is a thing. In the technical docs there should be a clear understanding that these are the same architectures. The naming aspect can be left up to marketing, but for devs it should be clear that fastconformer and parakeet are largely equivalent.

Comment thread docs/source/asr/fine_tuning.rst Outdated
2. **Use Lhotse dataloading** for efficient training with dynamic batching. See :doc:`Lhotse Dataloading </dataloaders>`.
3. **Monitor validation WER** closely — fine-tuning can overfit quickly on small datasets.
4. **Use spec augmentation** during fine-tuning to improve robustness.
5. **For multilingual fine-tuning**, consider using ``AggregateTokenizer`` and the Hybrid model with prompt conditioning.
Collaborator:

provide link for both

Comment thread docs/source/asr/fine_tuning.rst Outdated
1. **Start with a low learning rate** — fine-tuning with too high a learning rate can destroy pretrained features.
2. **Use Lhotse dataloading** for efficient training with dynamic batching. See :doc:`Lhotse Dataloading </dataloaders>`.
3. **Monitor validation WER** closely — fine-tuning can overfit quickly on small datasets.
4. **Use spec augmentation** during fine-tuning to improve robustness.
Collaborator:

link to doc page


.. code-block:: python

config = model.get_transcribe_config()
Collaborator:

give example transcribe config. this is a more obfuscated aspect of transcription in the codebase
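Until the docs gain a real example, the kind of knobs a transcribe config typically exposes can be sketched as a dataclass; the field names below are illustrative guesses, not the exact schema returned by `get_transcribe_config()`, and must be checked against the actual NeMo API:

```python
from dataclasses import dataclass, asdict

# Hypothetical sketch of transcription options; every field name here is
# illustrative, not the authoritative NeMo TranscribeConfig API.
@dataclass
class TranscribeConfigSketch:
    batch_size: int = 4
    return_hypotheses: bool = False  # Hypothesis objects vs plain text
    num_workers: int = 0             # dataloader workers
    verbose: bool = True             # show a progress bar

cfg = TranscribeConfigSketch(batch_size=8, return_hypotheses=True)
print(asdict(cfg))
```

The point of documenting this is that users can tweak one field and pass the config back to `transcribe()` instead of rediscovering the knobs in the codebase.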

@@ -1,17 +1,9 @@
Models
Collaborator:

move parakeet before canary - more successful so people will be hunting for it


.. _Conformer-HAT_model:

Conformer-HAT
Collaborator:

can we keep these on a legacy model page?

@artbataev mentioned this pull request Mar 25, 2026
Ssofja and others added 17 commits March 29, 2026 18:41
Co-authored-by: Piotr Żelasko <pzelasko@nvidia.com>
Signed-off-by: Ssofja <78349198+Ssofja@users.noreply.github.com>
Signed-off-by: Ssofja <sofiakostandian@gmail.com>
Merge branch 'asr-collections-ref' of github.com:NVIDIA/NeMo into asr-collections-ref

Signed-off-by: Ssofja <sofiakostandian@gmail.com>
@Ssofja force-pushed the asr-collections-ref branch from 17d3941 to 4ad4a65 April 14, 2026 21:37
- ASR, AST, PnC, timestamps
- English + 24 European languages
* - `canary-qwen-2.5b <https://huggingface.co/nvidia/canary-qwen-2.5b>`__
- AED
Collaborator:

Suggested change
- AED
- SALM

* - **PnC**
- Punctuation and Capitalization in the output
* - **Streaming**
- Real-time / cache-aware inference capability
Collaborator:

Add SALM (Speech augmented Language Model) to the glossary for canary-qwen.

Parakeet, Nemotron Speech, and the ``stt_*_fastconformer_*`` models below all share the same underlying FastConformer encoder;
the different names reflect release branding, not architectural differences.

.. list-table::
Collaborator:

Why does this table define language in Size, and the next table of streaming models defines language in Language? Add Language column here.

* - `parakeet-rnnt-110m-da-dk <https://huggingface.co/nvidia/parakeet-rnnt-110m-da-dk>`__
- RNN-T
- ASR
- 110M (Danish)
Collaborator:

This comment should not have been resolved, it wasn't addressed. Similar cases above. @Ssofja

Loading Models
--------------

All models can be loaded via the ``from_pretrained()`` API:
Collaborator:

Revise:

All models (except SALM) ...  # + make SALM linked to SpeechLM2 docs

@@ -1,102 +1,92 @@
.. _asr-configs-dataset-configuration:

NeMo ASR Configuration Files
Collaborator:

Reviewing this file from scratch again I now see that this PR discards the entire documentation about setting model hyperparameters (how to set a given encoder type, layer dimension, decoder type, loss type, loss hparams, etc.) - we need those back, if anything the documentation was maybe even too obscure in the first place. It's OK to discard OLD things like LSTM encoder but for FastConformer we need a comprehensive doc with available options.


.. code-block:: bash

python examples/asr/speech_to_text_finetune.py \
Collaborator:

this command actually wouldn't work because it doesn't specify init_from_nemo/pretrained_model. Let's either show a proper example using config, or proper example using CLI options, but make sure that if somebody tries to run it this way, it will work OK.
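A sketch of what a runnable invocation might look like once the missing initialization override is added; `init_from_pretrained_model` is the usual NeMo mechanism for this, but the manifest paths and model name below are placeholders, and whether the `+` prefix is needed depends on whether the key already exists in the config:

```python
import shlex

# Build the fine-tuning command including the initialization override the
# bare example omits; all paths and the model name are placeholders.
cmd = [
    "python", "examples/asr/speech_to_text_finetune.py",
    "model.train_ds.manifest_filepath=/data/train_manifest.json",
    "model.validation_ds.manifest_filepath=/data/val_manifest.json",
    "+init_from_pretrained_model=nvidia/parakeet-tdt-0.6b-v2",
]
print(shlex.join(cmd))
```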

- joint


Enforcing a Single Language During Inference
Collaborator:

What does this have to do in fine tuning? Shouldn't this be in inference documentation?

Fine-Tuning with HuggingFace Datasets
---------------------------------------

NeMo supports loading datasets directly from HuggingFace:
Collaborator:

Add a note saying this is not currently supported in lhotse dataloader.

For the complete configuration reference, see :doc:`Configuration Files <./configs>`.


Execution Flow
Collaborator:

These link to training execution flow and not finetuning execution flows, do we need these?

1. **Start with a low learning rate** — fine-tuning with too high a learning rate can destroy pretrained features. Typical fine-tuning LRs are 1e-4 to 1e-5. If your pretrained config uses the Noam (warmup + decay) scheduler, override it with a constant or cosine-annealing schedule to avoid the warmup phase resetting to a high LR.
2. **Use Lhotse dataloading** for efficient training with dynamic batching. See :doc:`Lhotse Dataloading </dataloaders>`.
3. **Use spec augmentation** during fine-tuning to improve robustness. See :ref:`Augmentation Configurations <asr-configs-augmentation-configurations>`.
4. **For multilingual fine-tuning**, consider using ``AggregateTokenizer`` (see :doc:`Configs <./configs>`) and the :ref:`Hybrid model with prompt conditioning <Hybrid-Transducer-CTC-Prompt_model__Config>`.
Collaborator:

I'm not sure that this is a good advice. Where is it coming from?

# HuggingFace (prefix with nvidia/)
model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")

# NGC (no prefix)
Collaborator:

Discard NGC

import nemo.collections.asr as nemo_asr
model = nemo_asr.models.ASRModel.restore_from("path/to/checkpoint.nemo")

**From HuggingFace or NGC:**
Collaborator:

Discard NGC


.. code-block:: python

outputs = model.transcribe(audio=["file1.wav", "file2.wav"], batch_size=4)
Collaborator:

Suggested change
outputs = model.transcribe(audio=["file1.wav", "file2.wav"], batch_size=4)
outputs = model.transcribe(audio=["file1.wav", "file2.wav"], batch_size=2)


**Advanced configuration:**

See :doc:`Configs <./configs>` for all available ``decoding`` options and :doc:`ASR Language Modeling and Customization <./asr_language_modeling_and_customization>` for decoding customization (confidence, CUDA graphs, language models, word boosting).
Collaborator:

Configs doesn't explain all available decoding options - where can we find them now? Add if missing and link here.


.. code-block:: json

{"audio_filepath": "/path/to/audio.wav", "duration": null, "source_lang": "en", "target_lang": "en", "pnc": "yes", "answer": "na"}
Collaborator:

This is redefined, link to the page in docs explaining Canary2 manifest format
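For completeness, producing entries in this format is just JSON Lines; the field set below mirrors the example shown in the diff (consult the Canary manifest docs for the authoritative schema):

```python
import json

# One manifest entry per line; a null duration lets the loader compute it.
entry = {
    "audio_filepath": "/path/to/audio.wav",
    "duration": None,   # serialized as JSON null
    "source_lang": "en",
    "target_lang": "en",
    "pnc": "yes",       # request punctuation and capitalization
    "answer": "na",
}
line = json.dumps(entry)
print(line)
```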

@@ -1,518 +1,101 @@
Models
Collaborator:

Rename this whole page to Featured Models

@@ -1,3 +1,5 @@
:orphan:
Collaborator:

Is this used? If not, remove.
