Refactored all ASR collections documentation #15542

Open

Ssofja wants to merge 19 commits into main from asr-collections-ref

Conversation


@Ssofja (Collaborator) commented Mar 23, 2026

What does this PR do?

This PR represents the full refactoring of the ASR collections documentation.
Collection: [docs]

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

Signed-off-by: Ssofja <sofiakostandian@gmail.com>
@Ssofja requested a review from pzelasko March 23, 2026 23:34
@github-actions bot added the ASR label Mar 23, 2026
@Ssofja requested reviews from artbataev and nithinraok March 23, 2026 23:34
@pzelasko changed the title from "Refactored all ASR collections module" to "Refactored all ASR collections documentation" Mar 23, 2026
Comment thread docs/source/asr/intro.rst Outdated
Comment thread docs/source/asr/models.rst
Comment thread docs/source/asr/models.rst Outdated
Comment thread docs/source/asr/models.rst Outdated
Comment thread docs/source/asr/asr_checkpoints.rst

10) Cleanup step. Compute the full-batch WER and log it. Concatenate the loss list and pass it to PTL to compute the equivalent of the original (full-batch) Joint step. Delete the ancillary objects used for sub-batching.
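The aggregation in this cleanup step can be sketched in plain Python (a minimal illustration, not the NeMo implementation, which works on torch tensors and logs through PTL):

```python
# Minimal sketch of the cleanup step: combine per-sub-batch losses into
# the full-batch loss and pool error counts into the full-batch WER.

def aggregate_sub_batches(losses, sizes, errors, words):
    """losses: mean loss per sub-batch; sizes: utterances per sub-batch;
    errors/words: edit-distance errors and reference words per sub-batch."""
    total = sum(sizes)
    # A size-weighted mean reproduces the loss of the original full batch.
    full_batch_loss = sum(l * n for l, n in zip(losses, sizes)) / total
    # WER must be pooled over counts, not averaged over per-sub-batch WERs.
    full_batch_wer = sum(errors) / sum(words)
    return full_batch_loss, full_batch_wer

loss, wer = aggregate_sub_batches(
    losses=[2.0, 4.0], sizes=[3, 1], errors=[2, 3], words=[20, 30]
)
print(loss, wer)  # 2.5 0.1
```

Note that averaging the two sub-batch WERs directly would give a different (wrong) answer whenever the sub-batches contain different numbers of reference words.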

Transducer Decoding
Collaborator:

Note to self and other reviewers - decoding docs are now placed in Inference and ASR Language Modeling and Customization


Refer to the :ref:`Audio Augmentors <asr-api-audio-augmentors>` API section for more details.

Tokenizer Configurations
Collaborator:

We need to add one more code block: an example of AggregateTokenizer.
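For reference, a minimal sketch of the kind of block such an example could show, written here as the Python-dict equivalent of the YAML config; the `agg` type and per-language sub-configs follow NeMo's aggregate tokenizer convention, and the directory paths are placeholders:

```python
# Illustrative aggregate-tokenizer config (dict form of the YAML block);
# each language key maps to its own monolingual tokenizer directory.
aggregate_tokenizer_cfg = {
    "tokenizer": {
        "type": "agg",  # selects the aggregate tokenizer
        "langs": {
            "en": {"type": "bpe", "dir": "/tokenizers/en_bpe"},
            "es": {"type": "bpe", "dir": "/tokenizers/es_bpe"},
        },
    }
}

print(sorted(aggregate_tokenizer_cfg["tokenizer"]["langs"]))  # ['en', 'es']
```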


.. _asr-configs-augmentation-configurations:

Augmentation Configurations
Collaborator:

I feel we should keep the SpecAugment part of this section.
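For context, a SpecAugment block in NeMo ASR configs typically looks roughly like the following (shown as a Python dict; the mask counts and widths are commonly seen FastConformer defaults, not authoritative values, so verify against the preprocessor docs):

```python
# Illustrative SpecAugment block (dict form of the YAML); values are
# common defaults, not authoritative.
spec_augment_cfg = {
    "_target_": "nemo.collections.asr.modules.SpectrogramAugmentation",
    "freq_masks": 2,     # number of frequency masks
    "time_masks": 10,    # number of time masks
    "freq_width": 27,    # max width of each frequency mask (bins)
    "time_width": 0.05,  # max width of each time mask (fraction of length)
}
print(spec_augment_cfg["freq_masks"], spec_augment_cfg["time_masks"])  # 2 10
```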


.. _asr-configs-preprocessor-configuration:

Preprocessor Configuration
Collaborator:

I think this should be kept

Collaborator:

yeah, users are normally confused by this portion so would need more documentation - if anything.

use_cer: false
log_prediction: true

BLEU Score
Collaborator:

I would revert the compaction of this section - I think it's pretty recent and describes various config tweaks introduced by @bonham79

Collaborator:

yeah this is deleting a lot of things that are hidden in the code and some improved user functionality. without this you're basically just forcing dependence on torchmetric documentation - and that ain't pretty.
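To make the `use_cer` switch concrete: it changes the edit-distance unit from words to characters. A minimal stdlib sketch, not NeMo's torchmetrics-backed implementation:

```python
def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance over sequences.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def error_rate(ref, hyp, use_cer=False):
    # use_cer=True tokenizes into characters (CER); otherwise words (WER).
    r = list(ref) if use_cer else ref.split()
    h = list(hyp) if use_cer else hyp.split()
    return edit_distance(r, h) / len(r)

print(error_rate("the cat sat", "the cat sad"))                # WER: 1/3
print(error_rate("the cat sat", "the cat sad", use_cer=True))  # CER: 1/11
```

CER is the usual choice for languages without whitespace word boundaries, which is why the flag exists alongside WER.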

@nithinraok (Member):

/claude review

Comment thread docs/source/asr/fine_tuning.rst Outdated
Comment on lines +150 to +151
* `CTC Fine-tuning README <https://github.com/NVIDIA/NeMo/tree/main/examples/asr/conf/asr_finetune>`_
* `Transducer Fine-tuning README <https://github.com/NVIDIA/NeMo/tree/main/examples/asr/conf/asr_finetune>`_
Contributor:

Both links point to the exact same URL (examples/asr/conf/asr_finetune). The Transducer link should presumably point to a different location (e.g., examples/asr/asr_transducer or examples/asr/conf/asr_finetune with an anchor for transducer-specific instructions). As-is, labeling two identical URLs as "CTC" and "Transducer" is misleading.


claude bot commented Mar 24, 2026

Overall this is a clean docs refactor. One issue found:

  • fine_tuning.rst: The CTC and Transducer fine-tuning README links both point to the same URL — one of them likely needs a different target.

Minor note: docs/source/asr/all_chkpt.rst appears to be orphaned after this PR (no remaining references point to it). Consider deleting it or adding a redirect if it was intentionally replaced by asr_checkpoints.rst.

.. list-table::
:header-rows: 1

* - Model
Collaborator:

iirc some of these didn't really prioritize PnC no?

Comment thread docs/source/asr/asr_checkpoints.rst Outdated
* - `nemotron-speech-streaming-en-0.6b <https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b>`__
- Hybrid
- ASR, streaming
- en
Collaborator:

It may be more economical to just list the architecture and configure a list of supported language models, or maybe a matrix?

Comment thread docs/source/asr/asr_checkpoints.rst Outdated
* - `stt_ka_fastconformer_hybrid_transducer_ctc_large_streaming_80ms_pc <https://huggingface.co/nvidia/stt_ka_fastconformer_hybrid_transducer_ctc_large_streaming_80ms_pc>`__
- Hybrid
- ASR, PnC, streaming
- ka
Collaborator:

Yeah, on Piotr's above point, few know the Georgian language code off hand.

Comment thread docs/source/asr/asr_checkpoints.rst Outdated
.. list-table::
:header-rows: 1

* - Model
Collaborator:

I'd move all fastconformers underneath parakeet. This'll just lead to confusion.

Collaborator:

I think it's OK, the concept here is that fastconformer are the older models and parakeet are the newer models.

Collaborator:

ehhh, i think our branding efforts are causing confusion, especially now Nemotron Speech is a thing. In the technical docs there should be a clear understanding that these are the same architectures. The naming aspect can be left up to marketing, but for devs it should be clear that fastconformer and parakeet are largely equivalent.

Comment thread docs/source/asr/fine_tuning.rst Outdated
2. **Use Lhotse dataloading** for efficient training with dynamic batching. See :doc:`Lhotse Dataloading </dataloaders>`.
3. **Monitor validation WER** closely — fine-tuning can overfit quickly on small datasets.
4. **Use spec augmentation** during fine-tuning to improve robustness.
5. **For multilingual fine-tuning**, consider using ``AggregateTokenizer`` and the Hybrid model with prompt conditioning.
Collaborator:

provide link for both

Comment thread docs/source/asr/fine_tuning.rst Outdated
1. **Start with a low learning rate** — fine-tuning with too high a learning rate can destroy pretrained features.
2. **Use Lhotse dataloading** for efficient training with dynamic batching. See :doc:`Lhotse Dataloading </dataloaders>`.
3. **Monitor validation WER** closely — fine-tuning can overfit quickly on small datasets.
4. **Use spec augmentation** during fine-tuning to improve robustness.
Collaborator:

link to doc page


.. code-block:: python

config = model.get_transcribe_config()
Collaborator:

give example transcribe config. this is a more obfuscated aspect of transcription in the codebase
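Until the docs gain a real example, the kind of knobs a transcribe config typically exposes can be sketched as a dataclass; the field names below are illustrative guesses, not the exact schema returned by `get_transcribe_config()`, and must be checked against the actual NeMo API:

```python
from dataclasses import dataclass, asdict

# Hypothetical sketch of transcription options; every field name here is
# illustrative, not the authoritative NeMo TranscribeConfig API.
@dataclass
class TranscribeConfigSketch:
    batch_size: int = 4
    return_hypotheses: bool = False  # Hypothesis objects vs plain text
    num_workers: int = 0             # dataloader workers
    verbose: bool = True             # show a progress bar

cfg = TranscribeConfigSketch(batch_size=8, return_hypotheses=True)
print(asdict(cfg))
```

The point of documenting this is that users can tweak one field and pass the config back to `transcribe()` instead of rediscovering the knobs in the codebase.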

@@ -1,17 +1,9 @@
Models
Collaborator:

move parakeet before canary - more successful so people will be hunting for it


.. _Conformer-HAT_model:

Conformer-HAT
Collaborator:

can we keep these on a legacy model page?

@artbataev mentioned this pull request Mar 25, 2026
Ssofja and others added 17 commits March 29, 2026 18:41
Co-authored-by: Piotr Żelasko <pzelasko@nvidia.com>
Signed-off-by: Ssofja <78349198+Ssofja@users.noreply.github.com>
Signed-off-by: Ssofja <sofiakostandian@gmail.com>
Merge branch 'asr-collections-ref' of github.com:NVIDIA/NeMo into asr-collections-ref

Signed-off-by: Ssofja <sofiakostandian@gmail.com>
@Ssofja force-pushed the asr-collections-ref branch from 17d3941 to 4ad4a65 April 14, 2026 21:37
- ASR, AST, PnC, timestamps
- English + 24 European languages
* - `canary-qwen-2.5b <https://huggingface.co/nvidia/canary-qwen-2.5b>`__
- AED
Collaborator:

Suggested change
- AED
- SALM

* - **PnC**
- Punctuation and Capitalization in the output
* - **Streaming**
- Real-time / cache-aware inference capability
Collaborator:

Add SALM (Speech augmented Language Model) to the glossary for canary-qwen.

Parakeet, Nemotron Speech, and the ``stt_*_fastconformer_*`` models below all share the same underlying FastConformer encoder;
the different names reflect release branding, not architectural differences.

.. list-table::
Collaborator:

Why does this table define language in Size, and the next table of streaming models defines language in Language? Add Language column here.

* - `parakeet-rnnt-110m-da-dk <https://huggingface.co/nvidia/parakeet-rnnt-110m-da-dk>`__
- RNN-T
- ASR
- 110M (Danish)
Collaborator:

This comment should not have been resolved, it wasn't addressed. Similar cases above. @Ssofja

Loading Models
--------------

All models can be loaded via the ``from_pretrained()`` API:
Collaborator:

Revise:

All models (except SALM) ...  # + make SALM linked to SpeechLM2 docs

@@ -1,102 +1,92 @@
.. _asr-configs-dataset-configuration:

NeMo ASR Configuration Files
Collaborator:

Reviewing this file from scratch again I now see that this PR discards the entire documentation about setting model hyperparameters (how to set a given encoder type, layer dimension, decoder type, loss type, loss hparams, etc.) - we need those back, if anything the documentation was maybe even too obscure in the first place. It's OK to discard OLD things like LSTM encoder but for FastConformer we need a comprehensive doc with available options.


.. code-block:: bash

python examples/asr/speech_to_text_finetune.py \
Collaborator:

this command actually wouldn't work because it doesn't specify init_from_nemo/pretrained_model. Let's either show a proper example using config, or proper example using CLI options, but make sure that if somebody tries to run it this way, it will work OK.
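A sketch of what a runnable invocation might look like once the missing initialization override is added; `init_from_pretrained_model` is the usual NeMo mechanism for this, but the manifest paths and model name below are placeholders, and whether the `+` prefix is needed depends on whether the key already exists in the config:

```python
import shlex

# Build the fine-tuning command including the initialization override the
# bare example omits; all paths and the model name are placeholders.
cmd = [
    "python", "examples/asr/speech_to_text_finetune.py",
    "model.train_ds.manifest_filepath=/data/train_manifest.json",
    "model.validation_ds.manifest_filepath=/data/val_manifest.json",
    "+init_from_pretrained_model=nvidia/parakeet-tdt-0.6b-v2",
]
print(shlex.join(cmd))
```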

- joint


Enforcing a Single Language During Inference
Collaborator:

What does this have to do in fine tuning? Shouldn't this be in inference documentation?

Fine-Tuning with HuggingFace Datasets
---------------------------------------

NeMo supports loading datasets directly from HuggingFace:
Collaborator:

Add a note saying this is not currently supported in lhotse dataloader.

For the complete configuration reference, see :doc:`Configuration Files <./configs>`.


Execution Flow
Collaborator:

These link to training execution flow and not finetuning execution flows, do we need these?

1. **Start with a low learning rate** — fine-tuning with too high a learning rate can destroy pretrained features. Typical fine-tuning LRs are 1e-4 to 1e-5. If your pretrained config uses the Noam (warmup + decay) scheduler, override it with a constant or cosine-annealing schedule to avoid the warmup phase resetting to a high LR.
2. **Use Lhotse dataloading** for efficient training with dynamic batching. See :doc:`Lhotse Dataloading </dataloaders>`.
3. **Use spec augmentation** during fine-tuning to improve robustness. See :ref:`Augmentation Configurations <asr-configs-augmentation-configurations>`.
4. **For multilingual fine-tuning**, consider using ``AggregateTokenizer`` (see :doc:`Configs <./configs>`) and the :ref:`Hybrid model with prompt conditioning <Hybrid-Transducer-CTC-Prompt_model__Config>`.
Collaborator:

I'm not sure that this is a good advice. Where is it coming from?

# HuggingFace (prefix with nvidia/)
model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")

# NGC (no prefix)
Collaborator:

Discard NGC

import nemo.collections.asr as nemo_asr
model = nemo_asr.models.ASRModel.restore_from("path/to/checkpoint.nemo")

**From HuggingFace or NGC:**
Collaborator:

Discard NGC


.. code-block:: python

outputs = model.transcribe(audio=["file1.wav", "file2.wav"], batch_size=4)
Collaborator:

Suggested change
outputs = model.transcribe(audio=["file1.wav", "file2.wav"], batch_size=4)
outputs = model.transcribe(audio=["file1.wav", "file2.wav"], batch_size=2)


**Advanced configuration:**

See :doc:`Configs <./configs>` for all available ``decoding`` options and :doc:`ASR Language Modeling and Customization <./asr_language_modeling_and_customization>` for decoding customization (confidence, CUDA graphs, language models, word boosting).
Collaborator:

Configs doesn't explain all available decoding options - where can we find them now? Add if missing and link here.


.. code-block:: json

{"audio_filepath": "/path/to/audio.wav", "duration": null, "source_lang": "en", "target_lang": "en", "pnc": "yes", "answer": "na"}
Collaborator:

This is redefined, link to the page in docs explaining Canary2 manifest format
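For completeness, producing entries in this format is just JSON Lines; the field set below mirrors the example shown in the diff (consult the Canary manifest docs for the authoritative schema):

```python
import json

# One manifest entry per line; a null duration lets the loader compute it.
entry = {
    "audio_filepath": "/path/to/audio.wav",
    "duration": None,   # serialized as JSON null
    "source_lang": "en",
    "target_lang": "en",
    "pnc": "yes",       # request punctuation and capitalization
    "answer": "na",
}
line = json.dumps(entry)
print(line)
```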

@@ -1,518 +1,101 @@
Models
Collaborator:

Rename this whole page to Featured Models

@@ -1,3 +1,5 @@
:orphan:
Collaborator:

Is this used? If not, remove.
