Merged
35 commits
9514223
Secure Multi-Architecture Containers with Trivy on Azure Cobalt 100
odidev Jan 28, 2026
cfcdc9d
new learning path: Unleashing leading On-Device AI performance with l…
Feb 5, 2026
9c461b5
First draft of PyTorch fine-tuning on DGX Spark
mhall119 Feb 9, 2026
42b7ba9
Complete tech review of DGX Spark voice assistant.
jasonrandrews Feb 9, 2026
3a63779
Merge pull request #2884 from jasonrandrews/review2
jasonrandrews Feb 9, 2026
6624721
Remove tune-pytorch-cpu-perf-with-threads Learning Path - no longer r…
pareenaverma Feb 9, 2026
62c88b1
added alt text to svgs
Feb 9, 2026
7311f18
Merge branch 'ArmDeveloperEcosystem:main' into main
zachlasiuk Feb 9, 2026
160f6d8
x
Feb 9, 2026
a8cd073
x
Feb 9, 2026
0cf1426
Merge pull request #2885 from pareenaverma/content_review
pareenaverma Feb 9, 2026
bea2d1a
Minor updates to NX related LPs
annietllnd Feb 9, 2026
ff6e34e
Merge pull request #2887 from annietllnd/fix
pareenaverma Feb 10, 2026
70fd2e2
Merge pull request #2886 from zachlasiuk/main
pareenaverma Feb 10, 2026
2ce0cd2
Add note about changing tokenizer_class to support transformers<5.0.0
mhall119 Feb 10, 2026
58d1050
Merge pull request #2850 from odidev/trivy_LP
jasonrandrews Feb 11, 2026
69970be
Trivy to draft mode for tech review
jasonrandrews Feb 11, 2026
545dcfc
Merge pull request #2890 from jasonrandrews/review2
jasonrandrews Feb 11, 2026
32360de
Merge pull request #2888 from mhall119/mhall/spark-finetuning
jasonrandrews Feb 11, 2026
d93d2b7
Finetuning with PyTorch to draft mode for tech review.
jasonrandrews Feb 11, 2026
6c4bf30
Merge pull request #2891 from jasonrandrews/review2
jasonrandrews Feb 11, 2026
a177dd3
Update _index.md
pareenaverma Feb 11, 2026
de80412
Merge pull request #2876 from zenonxiu81/performance_llama_cpp_sme2
pareenaverma Feb 11, 2026
a22ed15
Refine documentation for voice chatbot setup and integration, improvi…
madeline-underwood Feb 12, 2026
88c8511
Update the learning path for the DGX Spark Voice Chatbot to include t…
madeline-underwood Feb 12, 2026
25ee43f
Enhance documentation for offline voice chatbot setup and integration…
madeline-underwood Feb 12, 2026
913c500
Refine documentation for offline voice assistant, correcting phrasing…
madeline-underwood Feb 12, 2026
0d9fe2b
Enhance documentation on context-aware dialogue by extending short-te…
madeline-underwood Feb 12, 2026
bd36b2d
Updates
madeline-underwood Feb 12, 2026
029a395
tech review of fine tuning on DGX Spark Learning Path
jasonrandrews Feb 13, 2026
f213786
Merge pull request #2892 from madeline-underwood/offline_voice
jasonrandrews Feb 13, 2026
ba81967
Merge pull request #2893 from jasonrandrews/review2
jasonrandrews Feb 13, 2026
a98fa1f
spelling updates
jasonrandrews Feb 13, 2026
0c671ce
spelling updates
jasonrandrews Feb 13, 2026
0c14f81
Merge pull request #2894 from jasonrandrews/review2
jasonrandrews Feb 13, 2026
14 changes: 13 additions & 1 deletion .wordlist.txt
@@ -5644,4 +5644,16 @@ Numbat
SKUs
asct
geminicli
passwordless
passwordless
AWQ
Coqui
GPTQ
PortAudio
PyAudio
Riva
UX
actionability
customizations
pyaudio
sounddevice
webrtcvad
@@ -1,12 +1,12 @@
---
title: Learn about offline voice assistants
title: Build an offline voice assistant with whisper and vLLM
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Why build an offline voice assistant?
## Benefits of running a voice assistant offline

Voice-based AI assistants are becoming essential in customer support, productivity tools, and embedded interfaces. For example, a retail kiosk might need to answer product-related questions verbally without relying on internet access. However, many of these systems depend heavily on cloud services for speech recognition and language understanding, raising concerns around latency, cost, and data privacy.

@@ -16,16 +16,16 @@ You avoid unpredictable latency caused by network fluctuations, prevent sensitiv

By combining local speech-to-text (STT) with a locally hosted large language model (LLM), you gain complete control over the pipeline and eliminate API dependencies. You can experiment, customize, and scale without relying on external services.

## What are some common development challenges?
## Challenges of building a local voice assistant

While the benefits are clear, building a local voice assistant involves several engineering challenges.

Real-time audio segmentation requires reliably identifying when users start and stop speaking, accounting for natural pauses and background noise. Timing mismatches between STT and LLM components can cause delayed responses or repeated input, reducing conversational quality. You also need to balance CPU/GPU workloads to keep the pipeline responsive without overloading resources or blocking audio capture.

## Why use Arm and DGX Spark?
## Why run offline voice AI on Arm-based DGX Spark?

Arm-powered platforms like [DGX Spark](https://www.nvidia.com/en-gb/products/workstations/dgx-spark/) allow efficient parallelism: use CPU cores for audio preprocessing and whisper inference, while offloading LLM reasoning to powerful GPUs. This architecture balances throughput and energy efficiency—ideal for private, on-premises AI workloads. To understand the CPU and GPU architecture of DGX Spark, refer to [Unlock quantized LLM performance on Arm-based NVIDIA DGX Spark](/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/).
Arm-powered platforms like [DGX Spark](https://www.nvidia.com/en-gb/products/workstations/dgx-spark/) allow efficient parallelism: use CPU cores for audio preprocessing and whisper inference, while offloading LLM reasoning to powerful GPUs. This architecture balances throughput and energy efficiency-ideal for private, on-premises AI workloads. To understand the CPU and GPU architecture of DGX Spark, refer to [Unlock quantized LLM performance on Arm-based NVIDIA DGX Spark](/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/).

DGX Spark also supports standard USB interfaces, making it easy to connect consumer-grade microphones for development or deployment. This makes it viable for edge inference and desktop-style prototyping.

In this Learning Path, you'll build a complete, offline voice chatbot prototype using PyAudio, faster-whisper, and vLLM on an Arm-based system—resulting in a fully functional assistant that runs entirely on local hardware with no internet dependency.
In this Learning Path, you'll build a complete, offline voice chatbot prototype using PyAudio, faster-whisper, and vLLM on an Arm-based system-resulting in a fully functional assistant that runs entirely on local hardware with no internet dependency.
@@ -6,9 +6,11 @@ weight: 3
layout: learningpathall
---

[Faster‑whisper](https://github.com/SYSTRAN/faster-whisper) is a high‑performance reimplementation of OpenAI Whisper, designed to significantly reduce transcription latency and memory usage. It is well suited for local and real‑time speech‑to‑text (STT) pipelines, especially when running on CPU‑only systems or hybrid CPU/GPU environments.
## Set up faster-whisper for offline speech recognition

You'll use faster‑whisper as the STT engine to convert raw microphone input into structured text. At this stage, the goal is to install faster‑whisper correctly and verify that it can transcribe audio reliably. Detailed tuning and integration are covered in later sections.
[faster-whisper](https://github.com/SYSTRAN/faster-whisper) is a high-performance re-implementation of OpenAI Whisper, designed to significantly reduce transcription latency and memory usage. It's well suited for local and real-time speech-to-text (STT) pipelines, especially when running on CPU-only systems or hybrid CPU/GPU environments.

You'll use faster-whisper as the STT engine to convert raw microphone input into structured text. At this stage, the goal is to install faster-whisper correctly and verify that it can transcribe audio reliably. Detailed tuning and integration are covered in later sections.
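
Once the installation steps below are complete, a quick check like the following confirms that faster-whisper can load a model and produce a transcription. This is a minimal sketch, not part of the Learning Path's scripts: the model size and the `sample.wav` path are placeholders you can swap for your own.

```python
from faster_whisper import WhisperModel

# Small English-only model on the CPU; int8 keeps memory usage low
model = WhisperModel("base.en", device="cpu", compute_type="int8")

# Transcribe any short WAV file on disk (the path is a placeholder)
segments, info = model.transcribe("sample.wav")

print(f"Detected language: {info.language} (probability {info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```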

### Install build dependencies

@@ -22,11 +24,11 @@ sudo apt install python3.12 python3.12-venv python3.12-dev -y
sudo apt install gcc portaudio19-dev ffmpeg -y
```

## Create and activate Python environment
## Create and activate a Python environment

In particular, [pyaudio](https://pypi.org/project/PyAudio/) (used for real-time microphone capture) depends on the PortAudio library and the Python C API. These must match the version of Python you're using.

Now that the system libraries are in place and audio input is verified, it's time to set up an isolated Python environment for your voice assistant project. This will prevent dependency conflicts and make your installation reproducible.
Set up an isolated Python environment for your voice assistant project to prevent dependency conflicts and make your installation reproducible.

```bash
python3.12 -m venv va_env
@@ -53,7 +55,7 @@ pip install requests webrtcvad sounddevice==0.5.3
```

{{% notice Note %}}
While sounddevice==0.5.4 is available, it introduces callback-related errors during audio stream cleanup that may confuse beginners.
While sounddevice==0.5.4 is available, it introduces callback-related errors during audio stream cleanup that can confuse beginners.
Use sounddevice==0.5.3, which is stable and avoids these warnings.
{{% /notice %}}

@@ -162,7 +164,7 @@ Recording for 10 seconds...

{{% notice Note %}}
To stop the script, press Ctrl+C during any transcription loop. The current 10-second recording completes and transcribes before the program exits cleanly.
Avoid using Ctrl+Z, which suspends the process instead of terminating it.
Don't use Ctrl+Z, which suspends the process instead of terminating it.
{{% /notice %}}
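
One common way to get the clean-exit behavior described in the note is to catch the interrupt signal and let the current record-and-transcribe cycle finish before leaving the loop. The sketch below shows the pattern only; it is not the Learning Path's exact code, and the `time.sleep` call stands in for the real record and transcribe steps.

```python
import signal
import time

stop_requested = False

def request_stop(signum, frame):
    # Defer shutdown so the cycle in progress can finish first
    global stop_requested
    stop_requested = True

signal.signal(signal.SIGINT, request_stop)  # Ctrl+C sends SIGINT

while not stop_requested:
    time.sleep(10)                 # stand-in for recording 10 seconds of audio
    print("Segment transcribed.")  # stand-in for the faster-whisper call

print("Stopping cleanly.")
```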


@@ -189,7 +191,7 @@ pip install sounddevice==0.5.3

You can record audio without errors, but nothing is played back.

Verify that your USB microphone or headset is selected as the default input/output device. Also ensure the system volume is not muted.
Ensure that your USB microphone or headset is selected as the default input/output device. Also check that the system volume isn't muted.

**Fix:** List all available audio devices:
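
With the `sounddevice` package installed, one way to do this is the short sketch below; the Learning Path's exact command may differ, and your device names and indices certainly will.

```python
import sounddevice as sd

# Prints an indexed table of every device PortAudio can see; the current
# default input and output devices are marked with '>' and '<'.
print(sd.query_devices())
```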

@@ -6,6 +6,8 @@ weight: 4
layout: learningpathall
---

## Build a CPU-based speech-to-text engine

In this section, you'll build a real-time speech-to-text (STT) pipeline using only the CPU. Starting from a basic 10-second recorder, you'll incrementally add noise filtering, sentence segmentation, and parallel audio processing to achieve a transcription engine for Arm-based systems like DGX Spark.

You'll start from a minimal loop and iterate toward a multithreaded, VAD-enhanced STT engine.
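
As a rough sketch of that starting point (the model size, sample rate, and variable names are illustrative and not the Learning Path's exact script), the minimal loop records a fixed 10-second window and hands the samples straight to faster-whisper:

```python
import sounddevice as sd
from faster_whisper import WhisperModel

SAMPLE_RATE = 16000    # Whisper models expect 16 kHz mono audio
RECORD_SECONDS = 10

model = WhisperModel("base.en", device="cpu", compute_type="int8")

while True:
    print("Recording for 10 seconds...")
    audio = sd.rec(int(RECORD_SECONDS * SAMPLE_RATE),
                   samplerate=SAMPLE_RATE, channels=1, dtype="float32")
    sd.wait()  # block until the recording finishes

    # faster-whisper accepts a 1-D float32 NumPy array directly
    segments, _ = model.transcribe(audio.flatten(), language="en")
    for segment in segments:
        print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```
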
@@ -104,7 +106,7 @@ When you speak to the device, the output is similar to:

{{% notice Note %}}
faster-whisper supports many models like tiny, base, small, medium and large-v1/2/3.
Check the [GitHub repository](https://github.com/openai/whisper?tab=readme-ov-file#available-models-and-languages) for more model details.
See the [GitHub repository](https://github.com/openai/whisper?tab=readme-ov-file#available-models-and-languages) for more model details.
{{% /notice %}}


@@ -238,15 +240,15 @@ When you say a long sentence with multiple clauses, the output is similar to:
Segment done.
```

The result is a smoother and more accurate voice UX—particularly important when integrating with downstream LLMs in later sections.
The result is a smoother and more accurate voice UX - particularly important when integrating with downstream LLMs in later sections.

### Demo: Real-time speech transcription on Arm CPU with faster-whisper

This demo shows the real-time transcription pipeline in action, running on an Arm-based DGX Spark system. Using a USB microphone and the faster-whisper model (`medium.en`), the system records voice input, processes it on the CPU, and returns accurate transcriptions with timestamps—all without relying on cloud services.
This demo shows the real-time transcription pipeline in action, running on an Arm-based DGX Spark system. Using a USB microphone and the faster-whisper model (`medium.en`), the system records voice input, processes it on the CPU, and returns accurate transcriptions with timestamps - all without relying on cloud services.

Notice the clean terminal output and low latency, demonstrating how the pipeline is optimized for local, real-time voice recognition on resource-efficient hardware.

![Real-time speech transcription demo with volume visualization#center](fasterwhipser_demo1.gif "Figure 1: Real-time speech transcription with audio volume bar")
![Real-time speech transcription demo with volume visualization alt-txt#center](fasterwhipser_demo1.gif "Real-time speech transcription with audio volume bar")

The device runs audio capture and transcription in parallel. Use `threading.Thread` to collect audio without blocking, store audio frames in a `queue.Queue`, and in the main thread, poll for new data and run STT.
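
A minimal sketch of that producer/consumer structure is shown below; the chunk size, model choice, and names are illustrative rather than the Learning Path's exact implementation.

```python
import queue
import threading

import sounddevice as sd
from faster_whisper import WhisperModel

SAMPLE_RATE = 16000
CHUNK_SECONDS = 5        # illustrative fixed chunk size

audio_q = queue.Queue()  # audio frames flow from the capture thread to the main thread
model = WhisperModel("base.en", device="cpu", compute_type="int8")

def capture_audio():
    """Producer: keep recording fixed-size chunks and enqueue them."""
    while True:
        chunk = sd.rec(int(CHUNK_SECONDS * SAMPLE_RATE),
                       samplerate=SAMPLE_RATE, channels=1, dtype="float32")
        sd.wait()
        audio_q.put(chunk.flatten())

threading.Thread(target=capture_audio, daemon=True).start()

# Consumer: the main thread polls the queue and runs STT on each chunk,
# so transcription never blocks audio capture.
while True:
    chunk = audio_q.get()
    segments, _ = model.transcribe(chunk, language="en")
    for segment in segments:
        print(segment.text.strip())
```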

@@ -6,7 +6,9 @@ weight: 5
layout: learningpathall
---

After applying the previous steps—model upgrade, VAD, smart turn detection, and multi-threaded audio collection—you now have a high-quality, CPU-based local speech-to-text system.
## Optimize speech segmentation for your environment

After applying the previous steps-model upgrade, VAD, smart turn detection, and multi-threaded audio collection - you now have a high-quality, CPU-based local speech-to-text system.

At this stage, the core pipeline is complete. What remains is fine-tuning: adapting the system to your environment, microphone setup, and speaking style. This flexibility is one of the key advantages of a fully local STT pipeline.

@@ -42,7 +44,7 @@ Adjust this setting based on background noise and microphone quality.

### Tuning `MIN_SPEECH_SEC` and `SILENCE_LIMIT_SEC`

- `MIN_SPEECH_SEC`: This parameter defines the minimum duration of detected speech required before a segment is considered valid. Use this to filter out very short utterances such as false starts or background chatter.
- `MIN_SPEECH_SEC`: This parameter defines the minimum duration of detected speech needed before a segment is considered valid. Use this to filter out very short utterances such as false starts or background chatter.
- Lower values: More responsive, but may capture incomplete phrases or noise
- Higher values: More stable sentences, but slower response
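
In the transcription script these typically appear as top-level constants that you edit to match your microphone and speaking style; a minimal illustrative sketch (values taken from the balanced preset below):

```python
# Segmentation tuning knobs, in seconds
MIN_SPEECH_SEC = 1.0      # ignore detected speech shorter than this
SILENCE_LIMIT_SEC = 1.0   # close the current segment after this much trailing silence
```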

@@ -58,7 +60,7 @@ Based on practical experiments, the following presets provide a good starting po
|----------------------|----------------------|-------------------------|-------------------|
| Short command phrases | 0.8 | 0.6 | Optimized for quick voice commands such as "yes", "next", or "stop". Prioritizes responsiveness over sentence completeness. |
| Natural conversational speech | 1.0 | 1.0 | Balanced settings for everyday dialogue with natural pauses between phrases. |
| Long-form explanations (for example, tutorials) | 2.0 | 2.0 | Designed for longer sentences and structured explanations, reducing the risk of premature segmentation. |
| Long-form explanations such as tutorials | 2.0 | 2.0 | Designed for longer sentences and structured explanations, reducing the risk of premature segmentation. |

## Apply these settings
