Use correct values to update encoder KV Cache for streaming models #15323

Closed
MahmoudAshraf97 wants to merge 3 commits into NVIDIA-NeMo:main from MahmoudAshraf97:fix_cache

Conversation

@MahmoudAshraf97
Contributor


What does this PR do ?

The cache update includes parts of the input that streaming_post_process discards as invalid. This PR uses only the valid values when updating the cache, which resulted in a much lower WER on our internal dataset.
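To illustrate the idea with a toy sketch (plain Python with invented names and sizes, not the actual NeMo code):

```python
# Toy sketch of the cache update: only the valid part of the chunk
# should enter the cache; the old behavior also appended lookahead
# frames that streaming_post_process later discards.

def update_cache(cache, chunk, valid_len, cache_size):
    # keep the most recent `cache_size` frames, using only valid frames
    return (cache + chunk[:valid_len])[-cache_size:]

cache = [0, 1, 2]
chunk = [3, 4, 5, 99, 99]  # the last two frames are invalid lookahead

print(update_cache(cache, chunk, valid_len=3, cache_size=4))  # → [2, 3, 4, 5]
```

Using the full chunk length instead of `valid_len` would let the two invalid frames leak into the cache and corrupt every subsequent step.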

Collection: ASR

PR Type:

  • New Feature
  • Bugfix
  • Documentation

Who can review?

@nithinraok

@github-actions bot added the ASR label Jan 27, 2026
Signed-off-by: MahmoudAshraf97 <hassouna97.ma@gmail.com>
@MahmoudAshraf97
Contributor Author

The linting failure is unrelated to this PR.

@chtruong814 added the needs-follow-up label Jan 30, 2026
Signed-off-by: KunalDhawan <KunalDhawan@users.noreply.github.com>
@KunalDhawan
Collaborator

Thanks for opening this PR, @MahmoudAshraf97, great catch! The changes look good to me. I’ve scheduled CI tests to make sure the updates don’t break any existing pipelines, and I’m also running internal WER evaluations to assess the impact on accuracy and performance.

It would be great if you could also share any benchmarks you have on WER and latency before vs. after these changes to help validate the improvements.

@chtruong814 removed the needs-follow-up label Feb 7, 2026
@MahmoudAshraf97
Contributor Author

Hi @KunalDhawan, the impact of this PR is felt most with CTC models, or with RNNT models on long files, where the effect of the wrong cache accumulates. The symptoms are an increase in deletion errors and missing chunks in the transcript.

I also suggest adding tests that verify the encoder output with cache is identical to the encoder output when the actual audio is passed as context.
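Such a test could take the following shape; this is a self-contained sketch using a toy causal encoder (a windowed running sum) in place of the real Conformer, so all names and values here are hypothetical:

```python
# Hypothetical shape of the suggested test: streaming with a cache must
# reproduce the offline output exactly. A windowed running sum stands in
# for the encoder; `ctx` plays the role of the left attention context.

def encode_full(xs, ctx):
    # offline: each frame sees itself plus up to `ctx` past frames
    return [sum(xs[max(0, i - ctx): i + 1]) for i in range(len(xs))]

def encode_streaming(chunks, ctx):
    cache, out = [], []
    for chunk in chunks:
        hist = cache + chunk              # prepend cached left context
        start = len(cache)
        out += [
            sum(hist[max(0, start + i - ctx): start + i + 1])
            for i in range(len(chunk))
        ]
        cache = hist[-ctx:]               # cache only valid frames
    return out

xs = list(range(10))
chunks = [xs[i:i + 3] for i in range(0, len(xs), 3)]
assert encode_streaming(chunks, ctx=2) == encode_full(xs, ctx=2)
print("streaming matches offline")
```

A cache polluted with invalid frames would make the equality fail at the first chunk boundary, which is exactly the class of bug this PR fixes.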

@chtruong814 added the needs-follow-up label Feb 9, 2026
@MahmoudAshraf97
Contributor Author

@nithinraok @KunalDhawan, a gentle reminder: this PR is blocking follow-up work related to caching.

@chtruong814 added and removed the needs-follow-up label Feb 20, 2026
@KunalDhawan
Collaborator

Thanks for the PR @MahmoudAshraf97! I ran an evaluation using the nemotron-speech-streaming-en-0.6b model on hour-long audio files from the Earnings22 dataset using speech_to_text_cache_aware_streaming_infer.py with att_context_size=[70,13] and both batch_size=1 and batch_size=4 to verify behavior under batched inference. In all cases, the predictions from main and this branch were byte-for-byte identical.

My understanding is that this is expected for this model, since it uses att_context_style: chunked_limited, which sets cache_drop_size = 0. In that case, the old and new cache update logic should be functionally equivalent:

  1. MHA: query.shape[1] - 0 == valid_query_length - 0, since the entire chunk is valid in chunked_limited (i.e., no frames are discarded by streaming_post_process, so valid_query_length == query.shape[1]).
  2. CausalConv1D: similarly, when cache_drop_size = 0 and _right_padding = 0 (causal convolution), the cache slicing behavior remains unchanged.

The fix would likely have an observable impact with att_context_style="regular", where cache_drop_size = lookahead_steps > 0 and query.shape[1] > valid_out_len due to the extra lookahead frames that are discarded in streaming_post_process. However, cache-aware streaming models like Nemotron Speech are trained with att_context_style: chunked_limited, which is the most relevant deployment configuration for these models.
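In sketch form (the helper and values below are invented for illustration, not NeMo code), the equivalence argument is:

```python
# Illustrative sketch of why the old and new cache slicing agree exactly
# when cache_drop_size == 0 and valid_query_length == query_len
# (chunked_limited), but diverge under "regular" with lookahead.

def cache_slice_start(query_len, valid_query_length, cache_drop_size, fixed):
    # old logic sliced relative to the full query length; the fix
    # slices relative to the valid length only
    base = valid_query_length if fixed else query_len
    return base - cache_drop_size

# chunked_limited: no frames dropped, both variants agree
assert cache_slice_start(8, 8, 0, fixed=False) == cache_slice_start(8, 8, 0, fixed=True)

# "regular" with lookahead: valid length is shorter, so they differ
assert cache_slice_start(8, 6, 2, fixed=False) != cache_slice_start(8, 6, 2, fixed=True)
print("equivalent only when nothing is dropped")
```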

Could you share which model and att_context_style configuration you used for the internal evaluation that showed improved WER, along with the magnitude of the improvement and any observed latency impact? Please let me know if I’m overlooking anything in my setup or analysis.

@chtruong814 removed the needs-follow-up label Mar 3, 2026
@MahmoudAshraf97
Contributor Author

Thank you for your reply. I verified the scripts we use for inference against the simulator script you shared and found that we were passing 160 ms of pre-encoder context instead of 90 ms; that alone caused the WER on an internal dataset to rise from 10% to 90%.
I tried running the Nemotron model under the same conditions and managed to reproduce the error, albeit to a much lesser extent, because it uses a larger left context (70) than our model (20), so the effect is more noticeable in ours.

For future reference, this is how to stream an actual audio file without using CacheAwareStreamingAudioBuffer, since that class assumes the whole audio file is available and is not usable when audio arrives in real time:

import nemo.collections.asr as nemo_asr
import torch

model = nemo_asr.models.ASRModel.from_pretrained(
    "nvidia/nemotron-speech-streaming-en-0.6b"
)
model = model.eval()

# Read raw PCM16 samples (skipping the WAV header of this test clip)
# and normalize to floats in [-1, 1).
with open("test_clip.wav", "rb") as f:
    f.seek(40)
    audio_signal = (
        torch.frombuffer(f.read(), dtype=torch.int16).float().unsqueeze(0) / 32768
    )

batch_size = 1

attention_cache, conv_cache, attention_cache_len = (
    model.encoder.get_initial_cache_state(batch_size=batch_size)
)
outputs = []

# Samples per chunk: (right context + current frame) encoder frames,
# converted back to audio samples via the subsampling factor.
chunk_size = int(
    (model.encoder.att_context_size[1] + 1)
    * model.cfg.preprocessor.window_stride
    * model.cfg.encoder.subsampling_factor
    * model.cfg.preprocessor.sample_rate
)
# Pre-encoder context carried over between chunks, in samples.
chunk_overlap = int(
    1
    * model.cfg.preprocessor.window_stride
    * (model.cfg.encoder.subsampling_factor + 1)
    * model.cfg.preprocessor.sample_rate
)
# Left-pad with the overlap and right-pad to a whole number of chunks.
padded_audio = torch.nn.functional.pad(
    audio_signal, (chunk_overlap, chunk_size - audio_signal.shape[1] % chunk_size)
)
with torch.inference_mode():
    for i in range(chunk_overlap, padded_audio.shape[1], chunk_size):
        # Each step consumes the overlap plus one fresh chunk of audio.
        chunk = padded_audio[:, i - chunk_overlap : i + chunk_size].to(model.device)
        features, features_len = model.preprocessor(
            input_signal=chunk, length=torch.tensor([chunk.shape[1]]).to(model.device)
        )
        # Drop the extra trailing feature frame produced by the padding.
        features = features[:, :, :-1]

        (
            encoded,
            encoded_len,
            attention_cache,
            conv_cache,
            attention_cache_len,
        ) = model.encoder.cache_aware_stream_step(
            processed_signal=features,
            processed_signal_length=features_len,
            cache_last_channel=attention_cache,
            cache_last_time=conv_cache,
            cache_last_channel_len=attention_cache_len,
            keep_all_outputs=False,
            drop_extra_pre_encoded=2,  # discard pre-encoder context frames
            bypass_pre_encode=False,
        )

        outputs.append(encoded)
    # Concatenate encoder outputs along the time dimension.
    projected_encoder_output = torch.cat(outputs, dim=-1)

model.decoding.rnnt_decoder_predictions_tensor(
    encoder_output=projected_encoder_output,
    encoded_lengths=torch.tensor([projected_encoder_output.shape[2]]).to(model.device),
    return_hypotheses=True,
    partial_hypotheses=None,
)[0].text

@chtruong814 added the needs-follow-up label Mar 7, 2026
@chtruong814 removed the needs-follow-up label Apr 6, 2026