diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/1_offline_voice_assistant.md b/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/1_offline_voice_assistant.md index 6ed1bd9c23..b873081f51 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/1_offline_voice_assistant.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/1_offline_voice_assistant.md @@ -1,12 +1,12 @@ --- -title: Learn about offline voice assistants +title: Build an offline voice assistant with faster-whisper and vLLM weight: 2 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## Why build an offline voice assistant? +## Benefits of running a voice assistant offline Voice-based AI assistants are becoming essential in customer support, productivity tools, and embedded interfaces. For example, a retail kiosk might need to answer product-related questions verbally without relying on internet access. However, many of these systems depend heavily on cloud services for speech recognition and language understanding, raising concerns around latency, cost, and data privacy. @@ -16,16 +16,16 @@ You avoid unpredictable latency caused by network fluctuations, prevent sensitiv By combining local speech-to-text (STT) with a locally hosted large language model (LLM), you gain complete control over the pipeline and eliminate API dependencies. You can experiment, customize, and scale without relying on external services. -## What are some common development challenges? +## Challenges of building a local voice assistant While the benefits are clear, building a local voice assistant involves several engineering challenges. Real-time audio segmentation requires reliably identifying when users start and stop speaking, accounting for natural pauses and background noise.
You also need to balance CPU/GPU workloads to keep the pipeline responsive without overloading resources or blocking audio capture. -## Why use Arm and DGX Spark? +## Why run offline voice AI on Arm-based DGX Spark? -Arm-powered platforms like [DGX Spark](https://www.nvidia.com/en-gb/products/workstations/dgx-spark/) allow efficient parallelism: use CPU cores for audio preprocessing and whisper inference, while offloading LLM reasoning to powerful GPUs. This architecture balances throughput and energy efficiency—ideal for private, on-premises AI workloads. To understand the CPU and GPU architecture of DGX Spark, refer to [Unlock quantized LLM performance on Arm-based NVIDIA DGX Spark](/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/). +Arm-powered platforms like [DGX Spark](https://www.nvidia.com/en-gb/products/workstations/dgx-spark/) allow efficient parallelism: use CPU cores for audio preprocessing and Whisper inference, while offloading LLM reasoning to powerful GPUs. This architecture balances throughput and energy efficiency, ideal for private, on-premises AI workloads. To understand the CPU and GPU architecture of DGX Spark, refer to [Unlock quantized LLM performance on Arm-based NVIDIA DGX Spark](/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/). DGX Spark also supports standard USB interfaces, making it easy to connect consumer-grade microphones for development or deployment. This makes it viable for edge inference and desktop-style prototyping. -In this Learning Path, you’ll build a complete, offline voice chatbot prototype using PyAudio, faster-whisper, and vLLM on an Arm-based system—resulting in a fully functional assistant that runs entirely on local hardware with no internet dependency. +In this Learning Path, you'll build a complete, offline voice chatbot prototype using PyAudio, faster-whisper, and vLLM on an Arm-based system, resulting in a fully functional assistant that runs entirely on local hardware with no internet dependency.
diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/2_setup.md b/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/2_setup.md index e3809ba713..4c4598acf9 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/2_setup.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/2_setup.md @@ -6,9 +6,11 @@ weight: 3 layout: learningpathall --- -[Faster‑whisper](https://github.com/SYSTRAN/faster-whisper) is a high‑performance reimplementation of OpenAI Whisper, designed to significantly reduce transcription latency and memory usage. It is well suited for local and real‑time speech‑to‑text (STT) pipelines, especially when running on CPU‑only systems or hybrid CPU/GPU environments. +## Set up faster-whisper for offline speech recognition -You'll use faster‑whisper as the STT engine to convert raw microphone input into structured text. At this stage, the goal is to install faster‑whisper correctly and verify that it can transcribe audio reliably. Detailed tuning and integration are covered in later sections. +[faster-whisper](https://github.com/SYSTRAN/faster-whisper) is a high-performance reimplementation of OpenAI Whisper, designed to significantly reduce transcription latency and memory usage. It's well suited for local and real-time speech-to-text (STT) pipelines, especially when running on CPU-only systems or hybrid CPU/GPU environments. + +You'll use faster-whisper as the STT engine to convert raw microphone input into structured text. At this stage, the goal is to install faster-whisper correctly and verify that it can transcribe audio reliably. Detailed tuning and integration are covered in later sections. 
### Install build dependencies @@ -22,11 +24,11 @@ sudo apt install python3.12 python3.12-venv python3.12-dev -y sudo apt install gcc portaudio19-dev ffmpeg -y ``` -## Create and activate Python environment +## Create and activate a Python environment In particular, [pyaudio](https://pypi.org/project/PyAudio/) (used for real-time microphone capture) depends on the PortAudio library and the Python C API. These must match the version of Python you're using. -Now that the system libraries are in place and audio input is verified, it's time to set up an isolated Python environment for your voice assistant project. This will prevent dependency conflicts and make your installation reproducible. +Set up an isolated Python environment for your voice assistant project to prevent dependency conflicts and make your installation reproducible. ```bash python3.12 -m venv va_env @@ -53,7 +55,7 @@ pip install requests webrtcvad sounddevice==0.5.3 ``` {{% notice Note %}} -While sounddevice==0.5.4 is available, it introduces callback-related errors during audio stream cleanup that may confuse beginners. +While sounddevice==0.5.4 is available, it introduces callback-related errors during audio stream cleanup that can confuse beginners. Use sounddevice==0.5.3, which is stable and avoids these warnings. {{% /notice %}} @@ -162,7 +164,7 @@ Recording for 10 seconds... {{% notice Note %}} To stop the script, press Ctrl+C during any transcription loop. The current 10-second recording completes and transcribes before the program exits cleanly. -Avoid using Ctrl+Z, which suspends the process instead of terminating it. +Don't use Ctrl+Z, which suspends the process instead of terminating it. {{% /notice %}} @@ -189,7 +191,7 @@ pip install sounddevice==0.5.3 You can record audio without errors, but nothing is played back. -Verify that your USB microphone or headset is selected as the default input/output device. Also ensure the system volume is not muted. 
+Ensure that your USB microphone or headset is selected as the default input/output device. Also check that the system volume isn't muted. **Fix:** List all available audio devices: diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/3_fasterwhisper.md b/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/3_fasterwhisper.md index ba6ec94c0b..9b8fe584da 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/3_fasterwhisper.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/3_fasterwhisper.md @@ -6,6 +6,8 @@ weight: 4 layout: learningpathall --- +## Build a CPU-based speech-to-text engine + In this section, you'll build a real-time speech-to-text (STT) pipeline using only the CPU. Starting from a basic 10-second recorder, you'll incrementally add noise filtering, sentence segmentation, and parallel audio processing to achieve a transcription engine for Arm-based systems like DGX Spark. You'll start from a minimal loop and iterate toward a multithreaded, VAD-enhanced STT engine. @@ -104,7 +106,7 @@ When you speak to the device, the output is similar to: {{% notice Note %}} faster-whisper supports many models like tiny, base, small, medium and large-v1/2/3. -Check the [GitHub repository](https://github.com/openai/whisper?tab=readme-ov-file#available-models-and-languages) for more model details. +See the [GitHub repository](https://github.com/openai/whisper?tab=readme-ov-file#available-models-and-languages) for more model details. {{% /notice %}} @@ -238,15 +240,15 @@ When you say a long sentence with multiple clauses, the output is similar to: Segment done. ``` -The result is a smoother and more accurate voice UX—particularly important when integrating with downstream LLMs in later sections. +The result is a smoother and more accurate voice UX, which is particularly important when integrating with downstream LLMs in later sections.
### Demo: Real-time speech transcription on Arm CPU with faster-whisper -This demo shows the real-time transcription pipeline in action, running on an Arm-based DGX Spark system. Using a USB microphone and the faster-whisper model (`medium.en`), the system records voice input, processes it on the CPU, and returns accurate transcriptions with timestamps—all without relying on cloud services. +This demo shows the real-time transcription pipeline in action, running on an Arm-based DGX Spark system. Using a USB microphone and the faster-whisper model (`medium.en`), the system records voice input, processes it on the CPU, and returns accurate transcriptions with timestamps, all without relying on cloud services. Notice the clean terminal output and low latency, demonstrating how the pipeline is optimized for local, real-time voice recognition on resource-efficient hardware. -![Real-time speech transcription demo with volume visualization#center](fasterwhipser_demo1.gif "Figure 1: Real-time speech transcription with audio volume bar") +![Real-time speech transcription demo with volume visualization#center](fasterwhipser_demo1.gif "Real-time speech transcription with audio volume bar") The device runs audio capture and transcription in parallel. Use `threading.Thread` to collect audio without blocking, store audio frames in a `queue.Queue`, and in the main thread, poll for new data and run STT.
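The capture-and-poll pattern described in that closing paragraph can be sketched as follows. This is a minimal, hedged illustration that substitutes a fake producer for the PyAudio capture callback; the names `audio_q`, `capture_loop`, and the frame contents are illustrative, not taken from the Learning Path code.

```python
import queue
import threading
import time

audio_q = queue.Queue()

def capture_loop(n_frames, stop_event):
    # Stand-in for the PyAudio capture thread: push frames into the queue
    for i in range(n_frames):
        if stop_event.is_set():
            break
        audio_q.put(f"frame-{i}".encode())
        time.sleep(0.01)

stop = threading.Event()
worker = threading.Thread(target=capture_loop, args=(5, stop), daemon=True)
worker.start()
worker.join()

# Main thread: poll the queue and hand frames to the STT engine
frames = []
while not audio_q.empty():
    frames.append(audio_q.get())
print(f"Collected {len(frames)} frames without blocking capture")
```

Because the producer owns the microphone and the consumer owns transcription, neither blocks the other; `queue.Queue` handles the thread-safe hand-off.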
diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/3a_segmentation.md b/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/3a_segmentation.md index 579912ab75..0db3781579 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/3a_segmentation.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/3a_segmentation.md @@ -6,7 +6,9 @@ weight: 5 layout: learningpathall --- -After applying the previous steps—model upgrade, VAD, smart turn detection, and multi-threaded audio collection—you now have a high-quality, CPU-based local speech-to-text system. +## Optimize speech segmentation for your environment + +After applying the previous steps (model upgrade, VAD, smart turn detection, and multi-threaded audio collection), you now have a high-quality, CPU-based local speech-to-text system. At this stage, the core pipeline is complete. What remains is fine-tuning: adapting the system to your environment, microphone setup, and speaking style. This flexibility is one of the key advantages of a fully local STT pipeline. @@ -42,7 +44,7 @@ Adjust this setting based on background noise and microphone quality. ### Tuning `MIN_SPEECH_SEC` and `SILENCE_LIMIT_SEC` -- `MIN_SPEECH_SEC`: This parameter defines the minimum duration of detected speech required before a segment is considered valid. Use this to filter out very short utterances such as false starts or background chatter. +- `MIN_SPEECH_SEC`: This parameter defines the minimum duration of detected speech needed before a segment is considered valid. Use this to filter out very short utterances such as false starts or background chatter.
- Lower values: More responsive, but may capture incomplete phrases or noise - Higher values: More stable sentences, but slower response @@ -58,7 +60,7 @@ Based on practical experiments, the following presets provide a good starting po |----------------------|----------------------|-------------------------|-------------------| | Short command phrases | 0.8 | 0.6 | Optimized for quick voice commands such as "yes", "next", or "stop". Prioritizes responsiveness over sentence completeness. | | Natural conversational speech | 1.0 | 1.0 | Balanced settings for everyday dialogue with natural pauses between phrases. | -| Long-form explanations (for example, tutorials) | 2.0 | 2.0 | Designed for longer sentences and structured explanations, reducing the risk of premature segmentation. | +| Long-form explanations such as tutorials | 2.0 | 2.0 | Designed for longer sentences and structured explanations, reducing the risk of premature segmentation. | ## Apply these settings diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/4_vllm.md b/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/4_vllm.md index 285237e3aa..888967ce41 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/4_vllm.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/4_vllm.md @@ -6,6 +6,8 @@ weight: 6 layout: learningpathall --- +## Deploy vLLM for local language generation + In the previous section, you built a complete Speech-to-Text (STT) engine using faster-whisper, running efficiently on Arm-based CPUs. Now it's time to add the next building block: a local large language model (LLM) that can generate intelligent responses from user input. You'll integrate [vLLM](https://vllm.ai/), a high-performance LLM inference engine that runs on GPU and supports advanced features such as continuous batching, OpenAI-compatible APIs, and quantized models. 
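The segmentation presets tabulated above (in 3a_segmentation.md) map naturally to a small configuration dictionary. This sketch is only one way to structure it: the preset names and the `select_preset` helper are hypothetical, while the numeric values come from the table.

```python
# Hypothetical preset map mirroring the tuning table in 3a_segmentation.md
SEGMENTATION_PRESETS = {
    "short_commands": {"MIN_SPEECH_SEC": 0.8, "SILENCE_LIMIT_SEC": 0.6},
    "conversation": {"MIN_SPEECH_SEC": 1.0, "SILENCE_LIMIT_SEC": 1.0},
    "long_form": {"MIN_SPEECH_SEC": 2.0, "SILENCE_LIMIT_SEC": 2.0},
}

def select_preset(name: str) -> dict:
    """Return tuning parameters for a use case, defaulting to conversation."""
    return SEGMENTATION_PRESETS.get(name, SEGMENTATION_PRESETS["conversation"])

cfg = select_preset("long_form")
print(cfg)
```

Keeping the thresholds in one named map makes it easy to switch behavior per deployment (kiosk versus tutorial narration) without hunting through the capture loop for magic numbers.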
@@ -18,7 +20,7 @@ vLLM is especially effective in hybrid systems like the DGX Spark, where CPU cor ### Install and launch vLLM with GPU acceleration -In this section, you’ll install and launch vLLM—an optimized large language model (LLM) inference engine that runs efficiently on GPU. This component will complete your local speech-to-response pipeline by transforming transcribed text into intelligent replies. +In this section, you'll install and launch vLLM, an optimized large language model (LLM) inference engine that runs efficiently on GPU. This component will complete your local speech-to-response pipeline by transforming transcribed text into intelligent replies. #### Install Docker and pull vLLM image @@ -45,7 +47,7 @@ nvcr.io/nvidia/vllm 25.11-py3 d33d4cadbe0f 2 months ago #### Download a quantized model (GPTQ) -Use Hugging Face CLI to download a pre-quantized LLM such as Mistral-7B-Instruct-GPTQ and Meta-Llama-3-70B-Instruct-GPTQ models for following Real-Time AI Conversations. +Use Hugging Face CLI to download a pre-quantized LLM such as Mistral-7B-Instruct-GPTQ and Meta-Llama-3-70B-Instruct-GPTQ models for real-time AI conversations. ```bash pip install huggingface_hub @@ -99,7 +101,7 @@ docker run -it --gpus all -p 8000:8000 \ ``` {{% notice Note %}} -Tip: The first launch will compile and cache the model. To reduce startup time in future runs, consider creating a Docker snapshot with docker commit. +The first launch compiles and caches the model. To reduce startup time in future runs, consider creating a Docker snapshot with `docker commit`. {{% /notice %}} You can also check your NVIDIA driver and CUDA compatibility during the vLLM launch by looking at the output.
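Once launched, the server exposes an OpenAI-compatible chat API on port 8000 (per the `docker run -p 8000:8000` mapping above). A minimal sketch of building such a request follows; the model path and prompt are placeholder examples, and the actual network call is left commented out so the sketch stays self-contained.

```python
import json

# Default endpoint implied by the -p 8000:8000 mapping; adjust for your host
VLLM_URL = "http://localhost:8000/v1/chat/completions"

def build_chat_request(model: str, user_text: str, max_tokens: int = 128) -> dict:
    """Build an OpenAI-compatible chat payload for the local vLLM server."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }

payload = build_chat_request("/models/mistral-7b", "Hello, can you hear me?")
print(json.dumps(payload, indent=2))

# To send it for real (requires the server to be running):
# import requests
# reply = requests.post(VLLM_URL, json=payload, timeout=60).json()
# text = reply["choices"][0]["message"]["content"]
```

The `model` field must match the path you passed to `vllm serve`, which is the mismatch behind the "model does not exist" error mentioned later in the integration section.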
diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/4a_integration.md b/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/4a_integration.md index 0da381cd38..8322db6ee1 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/4a_integration.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/4a_integration.md @@ -6,7 +6,9 @@ weight: 7 layout: learningpathall --- -Now that both faster-whisper and vLLM are working independently, it's time to connect them into a real-time speech-to-response pipeline. Your system will listen to live audio, transcribe it, and send the transcription to vLLM to generate an intelligent reply—all running locally without cloud services. +## Integrate STT with vLLM for voice interaction + +Now that both faster-whisper and vLLM are working independently, it's time to connect them into a real-time speech-to-response pipeline. Your system will listen to live audio, transcribe it, and send the transcription to vLLM to generate an intelligent reply, all running locally without cloud services. ### Dual process architecture: vLLM and STT @@ -27,7 +29,7 @@ This separation has several advantages: Separating container startup from model launch provides greater control and improves development experience. -By launching the container first, you can troubleshoot errors like model path issues or GPU memory limits directly inside the environment—without the container shutting down immediately. It also speeds up iteration: you avoid reloading the entire image each time you tweak settings or restart the model. +By launching the container first, you can troubleshoot errors like model path issues or GPU memory limits directly inside the environment, without the container shutting down immediately. It also speeds up iteration: you avoid reloading the entire image each time you tweak settings or restart the model. This structure also improves visibility.
You can inspect files, monitor GPU usage, or run diagnostics like `curl` and `nvidia-smi` inside the container. Breaking these steps apart makes the process easier to understand, debug, and extend. @@ -52,7 +54,7 @@ vllm serve /models/mistral-7b \ --dtype float16 ``` -Look for "Application startup complete," in the output: +Look for "Application startup complete." in the output: ```output (APIServer pid=1) INFO: Started server process [1] @@ -113,7 +115,7 @@ print(f" AI : {reply}\n") This architecture mirrors the OpenAI Chat API design, enabling future enhancements like system-level prompts, multi-turn history, or role-specific behavior. {{% notice tip %}} -If you encounter a "model does not exist" error, double-check the model path you used when launching vLLM. It must match MODEL_NAME exactly. +If you see a "model does not exist" error, double-check the model path you used when launching vLLM. It must match MODEL_NAME exactly. {{% /notice %}} Switch to another terminal and save the following Python code in a file named `stt-client.py`: @@ -280,9 +282,7 @@ If your input is too short, you'll see: Skipped short segment (1.32s < 2.0s) ``` -{{% notice Tip %}} -You can fine-tune these parameters in future sections to better fit your speaking style or environment. -{{% /notice %}} +{{% notice Tip %}}You can fine-tune these parameters in future sections to better fit your speaking style or environment.{{% /notice %}} ## What you've accomplished and what's next diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/5_chatbot_prompt.md b/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/5_chatbot_prompt.md index bf22fbd130..001e2f89b0 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/5_chatbot_prompt.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/5_chatbot_prompt.md @@ -6,15 +6,15 @@ weight: 8 layout: learningpathall --- +## Why adapt for customer support?
+ In the previous section, you built a fully offline voice assistant by combining local speech-to-text (STT) with vLLM for language generation. You can transform that general-purpose chatbot into a task-specific customer service agent designed to deliver fast, focused, and context-aware assistance. -## Why adapt for customer support? - -Unlike open-domain chatbots, customer-facing assistants must meet stricter communication standards. Each voice input must trigger a fast and relevant reply with no long pauses or uncertainty. Users expect direct answers, not verbose or vague explanations. The assistant should remember previous questions or actions to support multi-turn interactions. +Unlike open-domain chatbots, customer-facing assistants must meet stricter communication standards. Users expect direct answers, not verbose or vague explanations, with fast responses and no long pauses. The assistant should remember previous questions or actions to support multi-turn interactions. -These needs are especially relevant for questions like password resets ("I forgot my account password and need help resetting it"), order tracking ("Can you track my recent order and tell me when it will arrive?"), billing issues ("Why was I charged twice this month?"), and subscription management ("I want to cancel my subscription and avoid future charges"). Such queries require language generation, structured behavior, and task memory. +These needs are especially relevant for password resets ("I forgot my account password and need help resetting it"), order tracking ("Can you track my recent order and tell me when it will arrive?"), billing issues ("Why was I charged twice this month?"), and subscription management ("I want to cancel my subscription and avoid future charges"). Such queries need language generation, structured behavior, and task memory. ### What you'll build in this section @@ -32,7 +32,7 @@ Enable the assistant to recall recent interactions and respond within context. 
Explore how to integrate local company data using vector search. This allows the assistant to answer questions based on private documents, without ever sending data to the cloud. -This prepares your system for high-trust environments like: +This prepares your system for high-trust environments such as: - Enterprise customer support - Internal help desks - Regulated industries (healthcare, finance, legal) @@ -46,10 +46,10 @@ To make your AI assistant behave like a domain expert (such as a support agent o In OpenAI-compatible APIs (like vLLM), you can provide a special message with the role set to "system". This message defines the behavior and tone of the assistant before any user input is processed. -A system prompt gives your assistant a role to play, such as a polite and helpful customer service agent, a knowledgeable tour guide, or a motivational fitness coach. By customizing the system prompt, you can shape the assistant's language and tone, restrict or expand the type of information it shares, and align responses with business needs, such as short and precise replies for help desks. +A system prompt gives your assistant a role to play, such as a polite and helpful customer service agent, a knowledgeable tour guide, or a motivational fitness coach. By customizing the system prompt, you can shape the assistant's language and tone, restrict or expand the type of information it shares, and align responses with business needs such as short and precise replies for help desks. -### Define the System Prompt Behavior +### Define the system prompt behavior To turn your general-purpose voice assistant into a focused role-specific agent, you must guide the language model’s behavior. This is done by defining a system prompt that acts as a task instruction.
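The system-role message described above can be prepended to every request so the model adopts the support-agent persona before seeing any user input. A brief sketch follows; the prompt wording is an assumed example, not the Learning Path's exact text.

```python
# Example system prompt; the wording is an assumption, not the doc's text
SYSTEM_PROMPT = (
    "You are a polite and helpful customer service agent. "
    "Answer briefly and precisely, and ask for missing details when needed."
)

def make_messages(user_text: str) -> list:
    """Prepend the system prompt so the model adopts the support-agent role."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_text},
    ]

msgs = make_messages("I forgot my account password and need help resetting it.")
print(msgs[0]["role"], "->", msgs[1]["content"])
```

Swapping in a different `SYSTEM_PROMPT` string is all it takes to turn the same pipeline into a tour guide or fitness coach, which is why the prompt is kept as a single named constant.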
diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/6_chatbot_contextaware.md b/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/6_chatbot_contextaware.md index c7df0ef700..c03ab6b016 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/6_chatbot_contextaware.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/6_chatbot_contextaware.md @@ -6,11 +6,11 @@ weight: 9 layout: learningpathall --- -In customer service and other task-based voice interactions, conversations naturally span multiple turns. Users may provide only partial information per utterance or follow up after the assistant’s prompt. +## Why multi-turn memory matters -To handle such situations effectively, your assistant needs short-term memory. This is a lightweight context buffer that retains recent user questions and assistant replies. +In customer service and other task-based voice interactions, conversations naturally span multiple turns. Users might provide only partial information per utterance or follow up after the assistant's prompt. -## Why multi-turn memory matters +To handle such situations effectively, your assistant needs short-term memory. This is a lightweight context buffer that retains recent user questions and assistant replies. Without memory, each user input is treated in isolation. This causes breakdowns like: @@ -51,9 +51,9 @@ This will build a list like: ] ``` -### Keep only the most recent 5 rounds +### Keep only the most recent five rounds -Each new turn makes the message array longer. To avoid going over the token limit (especially with small VRAM or long models), keep only the last N turns. Use 5 rounds as an example (10 messages, 5 rounds of user + assistant) +Each new turn makes the message array longer. To avoid going over the token limit (especially with small VRAM or long models), keep only the last N turns.
Use five rounds as an example (10 messages, five rounds of user + assistant): ```python messages = [{"role": "system", "content": SYSTEM_PROMPT}] @@ -72,7 +72,7 @@ prompt_tokens = len(" ".join([m["content"] for m in messages]).split()) print(f" Estimated prompt tokens: {prompt_tokens}") ``` -This helps you balance max_tokens for the assistant's response, ensuring the prompt and reply fit within the model's limit (such as 4096 or 8192 tokens depending on the model). +This helps you balance max_tokens for the assistant's response, ensuring the prompt and reply fit within the model's limit, such as 4096 or 8192 tokens, depending on the model. The expected output is similar to: @@ -123,13 +123,13 @@ The assistant remembers the previous turns, including account verification and f | User | Okay, I see the account has been cancelled. Thanks for your help. | 180 | | AI | You're welcome, abc@email.com. I'm glad I could help you cancel your subscription. If there is anything else I can assist you with in the future, please don't hesitate to ask. Have a great day! | | -This estimate helps you prevent prompt truncation or response cutoff, especially important when using larger models with longer histories. +This estimate helps you prevent prompt truncation or response cutoff, which is especially important when using larger models with longer histories. ## Full function of offline voice customer service on DGX Spark Now that your speech-to-AI pipeline is complete, you're ready to scale it up by running a larger, more powerful language model fully offline on DGX Spark. -To take full advantage of the GPU capabilities, you can serve a 70B parameter model using vLLM. Make sure you’ve already downloaded the model files into ~/models/llama3-70b (host OS). +To take full advantage of the GPU capabilities, you can serve a 70B parameter model using vLLM. Ensure you've already downloaded the model files into ~/models/llama3-70b (host OS).
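The trimming rule described in that section (keep the system prompt plus the last five user/assistant rounds) can be written as a small helper. `trim_history` is a hypothetical name, and the whitespace-split token estimate mirrors the doc's approach; it is only a rough proxy for real tokenizer counts.

```python
MAX_ROUNDS = 5  # keep the last five user+assistant rounds (10 messages)

def trim_history(system_msg: dict, history: list) -> list:
    """Return the system prompt plus at most MAX_ROUNDS*2 recent messages."""
    return [system_msg] + history[-MAX_ROUNDS * 2:]

def estimate_tokens(messages: list) -> int:
    """Rough whitespace-based token estimate, as in the Learning Path."""
    return len(" ".join(m["content"] for m in messages).split())

system_msg = {"role": "system", "content": "You are a helpful support agent."}
# Simulate eight rounds (16 messages) of conversation
history = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"turn {i}"}
    for i in range(16)
]
messages = trim_history(system_msg, history)
print(len(messages), estimate_tokens(messages))  # 11 messages survive the trim
```

Because Python's negative slice never raises on short lists, the same helper works for the first few turns too, when the history holds fewer than ten messages.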
Inside the vLLM Docker container, launch the model with: @@ -304,7 +304,7 @@ Once both the STT and LLM services are live, you'll be able to speak naturally a ### Demo: Multi-turn voice chatbot with context memory on DGX Spark -![img2 alt-text#center](fasterwhipser_vllm_demo2.gif "Figure 2: Full Function Voice-to-AI with volume bar") +![Animated terminal session showing real-time speech-to-text transcription and AI responses in a multi-turn customer service conversation, with a volume bar at the bottom indicating live audio input levels from a microphone#center](fasterwhipser_vllm_demo2.gif "Full function voice-to-AI with volume bar") This demo showcases a fully offline voice assistant that combines real-time transcription (via faster-whisper) and intelligent response generation (via vLLM). Running on an Arm-based DGX Spark system, the assistant captures live audio, transcribes it, and generates context-aware replies using a local language model, all in a seamless loop. @@ -312,7 +312,7 @@ The assistant now supports multi-turn memory, allowing it to recall previous use No cloud services are used, ensuring full control, privacy, and low-latency performance. -### Full Voice-to-AI Conversation Flow +### Full voice-to-AI conversation flow The following diagram summarizes the complete architecture you've now assembled: from microphone input to AI-generated replies, entirely local, modular, and production-ready. @@ -348,9 +348,9 @@ This hybrid architecture is production-ready, modular, and offline-capable. All With a fully functional offline voice chatbot running on DGX Spark, you now have a strong foundation for many advanced features. Here are some next-step enhancements you might consider: -- Knowledge-augmented Generation (RAG) +- Retrieval-Augmented Generation (RAG) -Integrate local document search or FAQ databases with embedding-based retrieval to answer company-specific or domain-specific queries.
You can reference a previous Learning Path about [deploying RAG on DGX Spark](/learning-paths/laptops-and-desktops/dgx_spark_rag/) for the same hardware. +Integrate local document search or FAQ databases with embedding-based retrieval to answer company-specific or domain-specific queries. See the Learning Path [Deploying RAG on DGX Spark](/learning-paths/laptops-and-desktops/dgx_spark_rag/), which targets the same hardware. - Multi-language Support @@ -358,10 +358,10 @@ Swap in multilingual STT models and LLMs to enable assistants for global users o - Text-to-Speech (TTS) Output -Add a local TTS engine (such as Coqui, piper, or NVIDIA Riva) to vocalize the assistant's replies, turning it into a true conversational agent. +Add a local TTS engine such as Coqui, Piper, or NVIDIA Riva to vocalize the assistant's replies, turning it into a true conversational agent. - Personalization and Context Memory Extend short-term memory into long-term context retention using file-based or vector-based storage. This lets the assistant remember preferences or past sessions. -This on-device architecture enables experimentation and extension without vendor lock-in or privacy concerns, making it ideal for enterprise, educational, and embedded use cases. +This on-device architecture enables experimentation and extension without vendor lock-in or privacy concerns, making it ideal for enterprise, educational, and embedded use cases.
\ No newline at end of file diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/_index.md b/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/_index.md index 1579b908c1..b3fdb84a88 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/_index.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/_index.md @@ -1,10 +1,5 @@ --- title: Build an offline voice chatbot with faster-whisper and vLLM on DGX Spark -description: Learn how to build a fully offline voice assistant by combining local speech recognition with LLM-powered responses using faster-whisper and vLLM—optimized for DGX Spark. - -draft: true -cascade: - draft: true minutes_to_complete: 60 @@ -13,7 +8,7 @@ who_is_this_for: This is an advanced topic for developers and ML engineers who w learning_objectives: - Explain the architecture of an offline voice chatbot pipeline combining speech-to-text (STT) and vLLM - Capture and segment real-time audio using PyAudio and Voice Activity Detection (VAD) - - Transcribe speech using faster-whisper and generate replies via vLLM + - Transcribe speech using faster-whisper and generate replies using vLLM - Tune segmentation and prompt strategies to improve latency and response quality - Deploy and run the full pipeline on Arm-based systems such as DGX Spark