---
title: Build an offline voice assistant with whisper and vLLM
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Benefits of running a voice assistant offline

Voice-based AI assistants are becoming essential in customer support, productivity tools, and embedded interfaces. For example, a retail kiosk might need to answer product-related questions verbally without relying on internet access. However, many of these systems depend heavily on cloud services for speech recognition and language understanding, raising concerns around latency, cost, and data privacy.

You avoid unpredictable latency caused by network fluctuations, prevent sensitive voice data from leaving the device, and eliminate recurring cloud costs.

By combining local speech-to-text (STT) with a locally hosted large language model (LLM), you gain complete control over the pipeline and eliminate API dependencies. You can experiment, customize, and scale without relying on external services.

## Challenges of building a local voice assistant

While the benefits are clear, building a local voice assistant involves several engineering challenges.

Real-time audio segmentation requires reliably identifying when users start and stop speaking, accounting for natural pauses and background noise. Timing mismatches between STT and LLM components can cause delayed responses or repeated input, reducing conversational quality. You also need to balance CPU/GPU workloads to keep the pipeline responsive without overloading resources or blocking audio capture.
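
A minimal way to see the segmentation challenge is as an energy-gated grouping of audio frames. The sketch below is illustrative only: the RMS threshold and frame-grouping rules are hypothetical simplifications of what a real voice activity detector such as webrtcvad does.

```python
import math

def is_speech(frame, threshold=500.0):
    # Treat a frame of 16-bit PCM samples as speech if its RMS energy
    # exceeds a (hypothetical, tunable) threshold.
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return rms > threshold

def split_segments(frames, threshold=500.0, max_silence=3):
    # Group consecutive speech frames into segments; close a segment
    # after max_silence consecutive silent frames (the "pause" rule).
    segments, current, silence = [], [], 0
    for frame in frames:
        if is_speech(frame, threshold):
            current.append(frame)
            silence = 0
        elif current:
            silence += 1
            if silence >= max_silence:
                segments.append(current)
                current, silence = [], 0
    if current:
        segments.append(current)
    return segments

# Synthetic demo: two bursts of "speech" separated by silence.
loud, quiet = [2000] * 160, [10] * 160
frames = [loud, loud, quiet, quiet, quiet, loud, quiet, quiet, quiet]
print(len(split_segments(frames)))  # 2
```

Tuning `threshold` and `max_silence` is exactly the trade-off explored later: too aggressive and natural pauses split sentences; too lenient and background noise leaks into segments.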

## Why run offline voice AI on Arm-based DGX Spark?

Arm-powered platforms like [DGX Spark](https://www.nvidia.com/en-gb/products/workstations/dgx-spark/) allow efficient parallelism: use CPU cores for audio preprocessing and whisper inference, while offloading LLM reasoning to powerful GPUs. This architecture balances throughput and energy efficiency, making it ideal for private, on-premises AI workloads. To understand the CPU and GPU architecture of DGX Spark, refer to [Unlock quantized LLM performance on Arm-based NVIDIA DGX Spark](/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/).

DGX Spark also supports standard USB interfaces, making it easy to connect consumer-grade microphones for development or deployment. This makes it viable for edge inference and desktop-style prototyping.

In this Learning Path, you'll build a complete, offline voice chatbot prototype using PyAudio, faster-whisper, and vLLM on an Arm-based system, resulting in a fully functional assistant that runs entirely on local hardware with no internet dependency.
---
weight: 3
layout: learningpathall
---

## Set up faster-whisper for offline speech recognition

[faster-whisper](https://github.com/SYSTRAN/faster-whisper) is a high-performance reimplementation of OpenAI Whisper, designed to significantly reduce transcription latency and memory usage. It's well suited for local and real-time speech-to-text (STT) pipelines, especially when running on CPU-only systems or hybrid CPU/GPU environments.

You'll use faster-whisper as the STT engine to convert raw microphone input into structured text. At this stage, the goal is to install faster-whisper correctly and verify that it can transcribe audio reliably. Detailed tuning and integration are covered in later sections.

### Install build dependencies

```bash
sudo apt install python3.12 python3.12-venv python3.12-dev -y
sudo apt install gcc portaudio19-dev ffmpeg -y
```

## Create and activate a Python environment

In particular, [pyaudio](https://pypi.org/project/PyAudio/) (used for real-time microphone capture) depends on the PortAudio library and the Python C API. These must match the version of Python you're using.

Set up an isolated Python environment for your voice assistant project to prevent dependency conflicts and make your installation reproducible.

```bash
python3.12 -m venv va_env
pip install requests webrtcvad sounddevice==0.5.3
```

{{% notice Note %}}
While sounddevice==0.5.4 is available, it introduces callback-related errors during audio stream cleanup that can confuse beginners.
Use sounddevice==0.5.3, which is stable and avoids these warnings.
{{% /notice %}}

```output
Recording for 10 seconds...
```

{{% notice Note %}}
To stop the script, press Ctrl+C during any transcription loop. The current 10-second recording completes and transcribes before the program exits cleanly.
Don't use Ctrl+Z, which suspends the process instead of terminating it.
{{% /notice %}}


```bash
pip install sounddevice==0.5.3
```

You can record audio without errors, but nothing is played back.

Ensure that your USB microphone or headset is selected as the default input/output device. Also check that the system volume isn't muted.

**Fix:** List all available audio devices:

---
weight: 4
layout: learningpathall
---

## Build a CPU-based speech-to-text engine

In this section, you'll build a real-time speech-to-text (STT) pipeline using only the CPU. Starting from a basic 10-second recorder, you'll incrementally add noise filtering, sentence segmentation, and parallel audio processing to achieve a transcription engine for Arm-based systems like DGX Spark.

You'll start from a minimal loop and iterate toward a multithreaded, VAD-enhanced STT engine.
When you speak to the device, the output is similar to:

{{% notice Note %}}
faster-whisper supports many models like tiny, base, small, medium and large-v1/2/3.
See the [GitHub repository](https://github.com/openai/whisper?tab=readme-ov-file#available-models-and-languages) for more model details.
{{% /notice %}}


When you say a long sentence with multiple clauses, the output is similar to:

```output
Segment done.
```

The result is a smoother and more accurate voice UX, particularly important when integrating with downstream LLMs in later sections.

### Demo: Real-time speech transcription on Arm CPU with faster-whisper

This demo shows the real-time transcription pipeline in action, running on an Arm-based DGX Spark system. Using a USB microphone and the faster-whisper model (`medium.en`), the system records voice input, processes it on the CPU, and returns accurate transcriptions with timestamps, all without relying on cloud services.

Notice the clean terminal output and low latency, demonstrating how the pipeline is optimized for local, real-time voice recognition on resource-efficient hardware.

![Real-time speech transcription demo with volume visualization#center](fasterwhipser_demo1.gif "Real-time speech transcription with audio volume bar")

The device runs audio capture and transcription in parallel. Use `threading.Thread` to collect audio without blocking, store audio frames in a `queue.Queue`, and in the main thread, poll for new data and run STT.
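
That pattern can be sketched as a producer/consumer pair. This is a simplified, hypothetical illustration: the capture thread pushes dummy byte frames instead of real PyAudio buffers, and the consumer only counts frames where the real loop would run faster-whisper.

```python
import queue
import threading

audio_q = queue.Queue()

def capture_audio(n_frames):
    # Producer thread: in the real pipeline a PyAudio callback would push
    # microphone buffers here; we simulate 10 ms frames of silence.
    for _ in range(n_frames):
        audio_q.put(b"\x00" * 320)
    audio_q.put(None)  # sentinel: capture finished

def stt_loop():
    # Consumer (main thread): poll the queue for new audio; a real loop
    # would batch frames and call the whisper model here.
    received = 0
    while True:
        frame = audio_q.get()
        if frame is None:
            break
        received += 1
    return received

producer = threading.Thread(target=capture_audio, args=(50,))
producer.start()
count = stt_loop()
producer.join()
print(count)  # 50
```

The queue decouples the two rates: capture never blocks on transcription, and slow STT only grows the backlog rather than dropping audio.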

---
weight: 5
layout: learningpathall
---

## Optimize speech segmentation for your environment

After applying the previous steps (model upgrade, VAD, smart turn detection, and multi-threaded audio collection), you now have a high-quality, CPU-based local speech-to-text system.

At this stage, the core pipeline is complete. What remains is fine-tuning: adapting the system to your environment, microphone setup, and speaking style. This flexibility is one of the key advantages of a fully local STT pipeline.

Adjust this setting based on background noise and microphone quality.

### Tuning `MIN_SPEECH_SEC` and `SILENCE_LIMIT_SEC`

- `MIN_SPEECH_SEC`: This parameter defines the minimum duration of detected speech needed before a segment is considered valid. Use this to filter out very short utterances such as false starts or background chatter.
- Lower values: More responsive, but may capture incomplete phrases or noise
- Higher values: More stable sentences, but slower response

Based on practical experiments, the following presets provide a good starting point:

| Scenario | `MIN_SPEECH_SEC` | `SILENCE_LIMIT_SEC` | Notes |
|----------------------|----------------------|-------------------------|-------------------|
| Short command phrases | 0.8 | 0.6 | Optimized for quick voice commands such as "yes", "next", or "stop". Prioritizes responsiveness over sentence completeness. |
| Natural conversational speech | 1.0 | 1.0 | Balanced settings for everyday dialogue with natural pauses between phrases. |
| Long-form explanations such as tutorials | 2.0 | 2.0 | Designed for longer sentences and structured explanations, reducing the risk of premature segmentation. |
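
As a sketch of how presets like these could be applied in code (the preset names and the `is_valid_segment` helper are hypothetical, not part of the Learning Path's scripts):

```python
# Hypothetical preset table mirroring the values above.
PRESETS = {
    "commands": {"MIN_SPEECH_SEC": 0.8, "SILENCE_LIMIT_SEC": 0.6},
    "conversation": {"MIN_SPEECH_SEC": 1.0, "SILENCE_LIMIT_SEC": 1.0},
    "long_form": {"MIN_SPEECH_SEC": 2.0, "SILENCE_LIMIT_SEC": 2.0},
}

def is_valid_segment(speech_sec, trailing_silence_sec, preset="conversation"):
    # Accept a segment once enough speech has accumulated AND the speaker
    # has paused for at least the configured silence limit.
    cfg = PRESETS[preset]
    return (speech_sec >= cfg["MIN_SPEECH_SEC"]
            and trailing_silence_sec >= cfg["SILENCE_LIMIT_SEC"])

print(is_valid_segment(1.2, 1.1))               # True
print(is_valid_segment(0.5, 1.1, "commands"))   # False: utterance too short
```

Switching presets then becomes a one-word change rather than editing two constants scattered through the script.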

## Apply these settings

---
weight: 6
layout: learningpathall
---

## Deploy vLLM for local language generation

In the previous section, you built a complete Speech-to-Text (STT) engine using faster-whisper, running efficiently on Arm-based CPUs. Now it's time to add the next building block: a local large language model (LLM) that can generate intelligent responses from user input.

You'll integrate [vLLM](https://vllm.ai/), a high-performance LLM inference engine that runs on GPU and supports advanced features such as continuous batching, OpenAI-compatible APIs, and quantized models.
vLLM is especially effective in hybrid systems like the DGX Spark, where CPU cores handle audio preprocessing and speech-to-text while the GPU runs LLM inference.

### Install and launch vLLM with GPU acceleration

In this section, you'll install and launch vLLM, an optimized large language model (LLM) inference engine that runs efficiently on GPU. This component will complete your local speech-to-response pipeline by transforming transcribed text into intelligent replies.

#### Install Docker and pull vLLM image

```output
nvcr.io/nvidia/vllm   25.11-py3   d33d4cadbe0f   2 months ago
```

#### Download a quantized model (GPTQ)

Use Hugging Face CLI to download a pre-quantized LLM such as Mistral-7B-Instruct-GPTQ and Meta-Llama-3-70B-Instruct-GPTQ models for real-time AI conversations.

```bash
pip install huggingface_hub
```

```bash
docker run -it --gpus all -p 8000:8000 \
```

{{% notice Note %}}
The first launch compiles and caches the model. To reduce startup time in future runs, consider creating a Docker snapshot with docker commit.
{{% /notice %}}

You can also check your NVIDIA driver and CUDA compatibility during the vLLM launch by looking at the output.
---
weight: 7
layout: learningpathall
---

## Integrate STT with vLLM for voice interaction

Now that both faster-whisper and vLLM are working independently, it's time to connect them into a real-time speech-to-response pipeline. Your system will listen to live audio, transcribe it, and send the transcription to vLLM to generate an intelligent reply, all running locally without cloud services.

### Dual process architecture: vLLM and STT

This separation has several advantages:

Separating container startup from model launch provides greater control and improves development experience.

By launching the container first, you can troubleshoot errors like model path issues or GPU memory limits directly inside the environment, without the container shutting down immediately. It also speeds up iteration: you avoid reloading the entire image each time you tweak settings or restart the model.

This structure also improves visibility. You can inspect files, monitor GPU usage, or run diagnostics like `curl` and `nvidia-smi` inside the container. Breaking these steps apart makes the process easier to understand, debug, and extend.

```bash
vllm serve /models/mistral-7b \
--dtype float16
```

Look for "Application startup complete." in the output:

```output
(APIServer pid=1) INFO: Started server process [1]
```

```python
print(f" AI : {reply}\n")
```
This architecture mirrors the OpenAI Chat API design, enabling future enhancements like system-level prompts, multi-turn history, or role-specific behavior.
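
For example, multi-turn history could be layered on top of the same endpoint. The sketch below is an assumption, not the Learning Path's client: it targets vLLM's OpenAI-compatible `/v1/chat/completions` route, and the `chat` helper only works with the server from the previous step running at `localhost:8000`.

```python
import requests

API_URL = "http://localhost:8000/v1/chat/completions"  # local vLLM server
MODEL_NAME = "/models/mistral-7b"  # must match the path passed to vllm serve

def build_messages(history, user_text,
                   system_prompt="You are a helpful voice assistant."):
    # Assemble an OpenAI-style message list: system prompt, prior turns,
    # then the new user utterance.
    return ([{"role": "system", "content": system_prompt}]
            + history
            + [{"role": "user", "content": user_text}])

def chat(history, user_text):
    # Send the whole conversation so far, then record both turns in the
    # history so follow-up questions keep their context.
    payload = {"model": MODEL_NAME,
               "messages": build_messages(history, user_text)}
    resp = requests.post(API_URL, json=payload, timeout=120).json()
    reply = resp["choices"][0]["message"]["content"]
    history += [{"role": "user", "content": user_text},
                {"role": "assistant", "content": reply}]
    return reply
```

Because `history` grows with each turn, a production client would also trim or summarize old turns to stay within the model's context window.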

{{% notice tip %}}
If you encounter a "model doesn't exist" error, double-check the model path you used when launching vLLM. It must match MODEL_NAME exactly.
{{% /notice %}}

Switch to another terminal and save the following Python code in a file named `stt-client.py`:
If your input is too short, you'll see:

```output
Skipped short segment (1.32s < 2.0s)
```

{{% notice Tip %}}You can fine-tune these parameters in future sections to better fit your speaking style or environment.{{% /notice %}}

## What you've accomplished and what's next
