---
title: Understanding Offline Voice Assistants
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Why Build an Offline Voice Assistant?

Voice-based AI assistants are becoming essential in customer support, productivity tools, and embedded interfaces. For example, a retail kiosk might need to answer product-related questions verbally without relying on internet access. However, many of these systems depend heavily on cloud services for speech recognition and language understanding, raising concerns around latency, cost, and data privacy.

In addition, a healthcare terminal or legal consultation assistant may need to handle voice queries involving sensitive personal information, where sending audio data to the cloud would violate privacy requirements. Running your voice assistant entirely offline solves these problems.

You avoid unpredictable latency caused by network fluctuations, prevent sensitive voice data from leaving the local machine, and eliminate recurring API costs that make large-scale deployment expensive. It also boosts trust for on-device deployments and compliance-sensitive industries.

By combining local speech-to-text (STT) with a locally hosted large language model (LLM), you gain full control over the entire pipeline: you are free to experiment, customize, and scale without relying on external APIs.

## Common Development Challenges

While the benefits are clear, building a local voice assistant involves several engineering challenges:

- Managing audio stream segmentation and speech detection in real time: It's hard to reliably identify when the user starts and stops speaking, especially with natural pauses and background noise (see the sketch after this list).

- Avoiding latency or misfires in STT/LLM integration: Timing mismatches can cause delayed responses or repeated input, reducing the conversational quality.

- Keeping the pipeline responsive on local hardware without overloading resources: You need to carefully balance CPU/GPU workloads so that inference doesn't block audio capture or processing.
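
To make the first challenge concrete, the sketch below shows frame-level voice activity detection (VAD), the basic building block used to decide when a user has started or stopped speaking. It uses the webrtcvad and sounddevice packages installed in the next module; the three-second clip length and aggressiveness level are arbitrary values chosen for illustration, not settings from the final assistant.

```python
import sounddevice as sd
import webrtcvad

SAMPLE_RATE = 16000                    # webrtcvad supports 8, 16, 32, or 48 kHz
FRAME_MS = 30                          # webrtcvad accepts 10, 20, or 30 ms frames
FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000

vad = webrtcvad.Vad(2)                 # aggressiveness: 0 (permissive) to 3 (strict)

# Record a short clip as 16-bit mono PCM, the format webrtcvad expects
audio = sd.rec(3 * SAMPLE_RATE, samplerate=SAMPLE_RATE, channels=1, dtype='int16')
sd.wait()
pcm = audio.flatten()

# Classify each 30 ms frame as speech or non-speech
total = len(pcm) // FRAME_SAMPLES
speech = sum(
    vad.is_speech(pcm[i * FRAME_SAMPLES:(i + 1) * FRAME_SAMPLES].tobytes(), SAMPLE_RATE)
    for i in range(total)
)
print(f"{speech}/{total} frames contain speech")
```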

## Why use Arm and DGX Spark?

Arm-powered platforms like [DGX Spark](https://www.nvidia.com/en-gb/products/workstations/dgx-spark/) allow efficient parallelism: use the CPU cores for audio preprocessing and Whisper inference while offloading LLM reasoning to the GPU. This architecture balances throughput and energy efficiency, making it ideal for private, on-premises AI workloads. See this [learning path](https://learn.arm.com/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1_gb10_introduction/) to understand the CPU and GPU architecture of DGX Spark.

DGX Spark also supports standard USB interfaces, making it easy to connect consumer-grade microphones for development or deployment. This makes it viable not just for data center use, but also for edge inference or desktop-style prototyping.

In this Learning Path, you’ll build a complete, offline voice chatbot prototype using PyAudio, faster-whisper, and vLLM on an Arm-based system. The result is a fully functional assistant that runs entirely on local hardware with no internet dependency.
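
As a preview of where this Learning Path is heading, the sketch below shows the overall shape of that pipeline: faster-whisper turns a recorded question into text, and the text is sent to a locally running vLLM server through its OpenAI-compatible API. The server address, model name, and audio file below are placeholder assumptions; later modules build the real version with live microphone input.

```python
import requests
from faster_whisper import WhisperModel

# Local STT: Whisper inference runs on the CPU cores
stt = WhisperModel("small.en", device="cpu", compute_type="int8")

def transcribe(wav_path: str) -> str:
    segments, _ = stt.transcribe(wav_path, language="en")
    return " ".join(segment.text.strip() for segment in segments)

def ask_llm(prompt: str) -> str:
    # Assumes a vLLM server is already running locally and exposing its
    # OpenAI-compatible API on port 8000 (covered in a later module)
    response = requests.post(
        "http://localhost:8000/v1/chat/completions",
        json={
            "model": "placeholder-model-name",  # replace with the model served by vLLM
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=120,
    )
    return response.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    question = transcribe("question.wav")  # hypothetical pre-recorded audio file
    print("You said:", question)
    print("Assistant:", ask_llm(question))
```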
---
title: Installing faster-whisper for Local Speech Recognition
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Installing faster-whisper for Local Speech Recognition

[Faster‑whisper](https://github.com/SYSTRAN/faster-whisper) is a high‑performance reimplementation of OpenAI Whisper, designed to significantly reduce transcription latency and memory usage. It is well suited for local and real‑time speech‑to‑text (STT) pipelines, especially when running on CPU‑only systems or hybrid CPU/GPU environments.

In this Learning Path, faster‑whisper serves as the STT engine that converts raw microphone input into structured text. At this stage, the goal is to install faster‑whisper correctly and verify that it can transcribe audio reliably. Detailed tuning and integration will be covered in later modules.

### Install build dependencies

While some Python packages such as sounddevice and webrtcvad previously had compatibility issues with newer Python versions, these have been resolved. This Learning Path uses ***Python 3.12***, which has been tested and confirmed to work reliably with all required dependencies.

Install Python 3.12 and build dependencies:

```bash
sudo apt update
sudo apt install python3.12 python3.12-venv python3.12-dev -y
sudo apt install portaudio19-dev ffmpeg -y
```

### Create and Activate Python Environment

Now that the system libraries are in place, set up an isolated Python environment for your voice assistant project. This prevents dependency conflicts and makes your installation reproducible.

In particular, [pyaudio](https://pypi.org/project/PyAudio/), used for real-time microphone capture, depends on the PortAudio library and the Python C API, both of which must match the Python version used inside the environment.

```bash
python3.12 -m venv voice_ass_env
source voice_ass_env/bin/activate
python3 -m pip install --upgrade pip
```

Before installing the packages, it's worth checking the Python version in your virtual environment.

```bash
python3 --version
```

Expected output should be `3.12.x` or higher.
```log
Python 3.12.3
```

Install required Python packages:

```bash
pip install pyaudio numpy torch faster-whisper
pip install requests webrtcvad sounddevice==0.5.3
```

{{% notice Note %}}
While sounddevice==0.5.4 is available, it introduces callback-related errors during audio stream cleanup that may confuse beginners.
For this Learning Path, we recommend sounddevice==0.5.3, which is stable and avoids these warnings.
{{% /notice %}}

Verify the pyaudio version:
```bash
python -c "import pyaudio; print(pyaudio.__version__)"
```

Expected output should be:
```log
0.2.14
```
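
Optionally, confirm that faster-whisper and PyTorch import cleanly in the new environment. This is a quick sanity check; the printed version numbers will vary depending on what pip resolved:

```python
# Quick sanity check that the core packages import correctly
import faster_whisper
import torch

print("faster-whisper:", faster_whisper.__version__)
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
```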

### Verify microphone input

Once your system dependencies are installed, you can test that your audio hardware is functioning properly. This ensures that your audio input is accessible by Python through the sounddevice module.

Now plug in your USB microphone and run the following Python code to verify that it is detected and functioning correctly:

```python
import sounddevice as sd
import numpy as np

SAMPLE_RATE = 16000
DURATION = 5

print(" Start recording for 5 seconds...")
with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, dtype='float32') as stream:
    audio = stream.read(int(SAMPLE_RATE * DURATION))[0]
print(" Recording complete.")

print(" Playing back...")
sd.play(audio, samplerate=SAMPLE_RATE)
sd.wait()
print(" Playback complete.")
```

The script records 5 seconds of audio on the DGX Spark and immediately plays back the captured clip.
If you do not hear any playback, check your USB connection and verify the installation steps above.

Once you’ve confirmed that your microphone is working and your environment is set up, you’re ready to test real-time transcription and move on to the next phase.

### Sample: Real-time Transcription with faster-whisper

Now let's verify that your Whisper model works with live microphone input. This example records a 10-second audio clip using the system microphone, transcribes it using faster-whisper, and prints the transcribed text with timestamps.

This script records a fixed-duration mono audio clip using the sounddevice module and transcribes it using the faster-whisper model. The overall flow includes several key steps:
- The `record_audio()` function starts the microphone and records a 10-second audio segment, returning the data as a NumPy array.
- The `WhisperModel("small.en")` call loads the ***small English model*** from faster-whisper, using `compute_type`=***"int8"*** to ensure compatibility with CPU-only systems.
- The `transcribe_audio()` function processes the recorded audio and prints the transcription results along with start and end timestamps for each spoken segment.
- The `while True` loop continuously records and transcribes in real time until interrupted, allowing the user to speak multiple utterances across iterations.
- The script will continue running in 10-second cycles until stopped with ***Ctrl+C***.

```python
import sounddevice as sd
import numpy as np
from faster_whisper import WhisperModel

SAMPLE_RATE = 16000
DURATION = 10 # seconds per loop

def record_audio():
print(" Recording for 10 seconds...")
with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, dtype='float32') as stream:
audio = stream.read(int(SAMPLE_RATE * DURATION))[0] # returns (data, overflow)
sd.wait()
print(" Recording complete.")
return audio.flatten()

def transcribe_audio(audio):
model = WhisperModel("small.en", device="cpu", compute_type="int8")
print(" Transcribing...")
segments, _ = model.transcribe(audio, language="en")
for segment in segments:
print(f"[{segment.start:.2f}s - {segment.end:.2f}s] {segment.text.strip()}")
print(" Done.\n")

if __name__ == "__main__":
try:
while True:
audio_data = record_audio()
transcribe_audio(audio_data)
except KeyboardInterrupt:
print(" Stopped by user.")
```

{{% notice Note %}}
To stop the script, press ***Ctrl+C*** during any transcription loop. The current 10-second recording will complete and transcribe before the program exits cleanly.
Avoid using ***Ctrl+Z***, which suspends the process instead of terminating it.
{{% /notice %}}


### Troubleshooting

This section lists common issues you may encounter when setting up local speech-to-text, along with clear checks and fixes.

***Problem 1: Callback-related errors with sounddevice***

If you encounter errors like:

```log
AttributeError: '_CallbackContext' object has no attribute 'data'
```

***Cause***
This is a known issue introduced in sounddevice==0.5.4, related to internal callback cleanup.

***Fix***
Use the stable version recommended in this Learning Path:
```bash
pip install sounddevice==0.5.3
```

***Problem 2: No sound playback after recording***

You can record audio without errors, but nothing is played back.

***Check:***
- Verify that your USB microphone or headset is selected as the default input/output device.
- Ensure the system volume is not muted.

***Fix***
List all available audio devices:

```bash
python -m sounddevice
```

You should see an output similar to:
```log
0 NVIDIA: HDMI 0 (hw:0,3), ALSA (0 in, 8 out)
1 NVIDIA: HDMI 1 (hw:0,7), ALSA (0 in, 8 out)
2 NVIDIA: HDMI 2 (hw:0,8), ALSA (0 in, 8 out)
3 NVIDIA: HDMI 3 (hw:0,9), ALSA (0 in, 8 out)
4 Plantronics Blackwire 3225 Seri: USB Audio (hw:1,0), ALSA (2 in, 2 out)
5 hdmi, ALSA (0 in, 8 out)
6 pipewire, ALSA (64 in, 64 out)
7 pulse, ALSA (32 in, 32 out)
* 8 default, ALSA (64 in, 64 out)
```

If your microphone or headset is listed but not active, try explicitly selecting it in Python:

```python
import sounddevice as sd

sd.default.device = 4  # Set to your desired device index
```

Other fixes to try:
- Increase system volume
- Replug the device
- Reboot your system to refresh device mappings


### Next Module

Once your transcription prints correctly in the terminal and playback works as expected, you’ve successfully completed the setup for local STT using faster-whisper.

In the next module, you’ll enhance this basic transcription loop by adding real-time audio segmentation, turn detection, and background threading to support natural voice interactions.