diff --git a/.wordlist.txt b/.wordlist.txt index 302f05eefd..74cfd01317 100644 --- a/.wordlist.txt +++ b/.wordlist.txt @@ -5644,4 +5644,16 @@ Numbat SKUs asct geminicli -passwordless \ No newline at end of file +passwordless +AWQ +Coqui +GPTQ +PortAudio +PyAudio +Riva +UX +actionability +customizations +pyaudio +sounddevice +webrtcvad \ No newline at end of file diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/1_offline_voice_assistant.md b/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/1_offline_voice_assistant.md index 6ed1bd9c23..b873081f51 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/1_offline_voice_assistant.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/1_offline_voice_assistant.md @@ -1,12 +1,12 @@ --- -title: Learn about offline voice assistants +title: Build an offline voice assistant with whisper and vLLM weight: 2 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## Why build an offline voice assistant? +## Benefits of running a voice assistant offline Voice-based AI assistants are becoming essential in customer support, productivity tools, and embedded interfaces. For example, a retail kiosk might need to answer product-related questions verbally without relying on internet access. However, many of these systems depend heavily on cloud services for speech recognition and language understanding, raising concerns around latency, cost, and data privacy. @@ -16,16 +16,16 @@ You avoid unpredictable latency caused by network fluctuations, prevent sensitiv By combining local speech-to-text (STT) with a locally hosted large language model (LLM), you gain complete control over the pipeline and eliminate API dependencies. You can experiment, customize, and scale without relying on external services. -## What are some common development challenges? +## Challenges of building a local voice assistant While the benefits are clear, building a local voice assistant involves several engineering challenges. Real-time audio segmentation requires reliably identifying when users start and stop speaking, accounting for natural pauses and background noise. Timing mismatches between STT and LLM components can cause delayed responses or repeated input, reducing conversational quality. You also need to balance CPU/GPU workloads to keep the pipeline responsive without overloading resources or blocking audio capture. -## Why use Arm and DGX Spark? +## Why run offline voice AI on Arm-based DGX Spark? -Arm-powered platforms like [DGX Spark](https://www.nvidia.com/en-gb/products/workstations/dgx-spark/) allow efficient parallelism: use CPU cores for audio preprocessing and whisper inference, while offloading LLM reasoning to powerful GPUs. This architecture balances throughput and energy efficiency—ideal for private, on-premises AI workloads. To understand the CPU and GPU architecture of DGX Spark, refer to [Unlock quantized LLM performance on Arm-based NVIDIA DGX Spark](/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/). +Arm-powered platforms like [DGX Spark](https://www.nvidia.com/en-gb/products/workstations/dgx-spark/) allow efficient parallelism: use CPU cores for audio preprocessing and whisper inference, while offloading LLM reasoning to powerful GPUs. This architecture balances throughput and energy efficiency-ideal for private, on-premises AI workloads. 
To understand the CPU and GPU architecture of DGX Spark, refer to [Unlock quantized LLM performance on Arm-based NVIDIA DGX Spark](/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/). DGX Spark also supports standard USB interfaces, making it easy to connect consumer-grade microphones for development or deployment. This makes it viable for edge inference and desktop-style prototyping. -In this Learning Path, you’ll build a complete, offline voice chatbot prototype using PyAudio, faster-whisper, and vLLM on an Arm-based system—resulting in a fully functional assistant that runs entirely on local hardware with no internet dependency. +In this Learning Path, you'll build a complete, offline voice chatbot prototype using PyAudio, faster-whisper, and vLLM on an Arm-based system-resulting in a fully functional assistant that runs entirely on local hardware with no internet dependency. diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/2_setup.md b/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/2_setup.md index e3809ba713..e14e077355 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/2_setup.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/2_setup.md @@ -6,9 +6,11 @@ weight: 3 layout: learningpathall --- -[Faster‑whisper](https://github.com/SYSTRAN/faster-whisper) is a high‑performance reimplementation of OpenAI Whisper, designed to significantly reduce transcription latency and memory usage. It is well suited for local and real‑time speech‑to‑text (STT) pipelines, especially when running on CPU‑only systems or hybrid CPU/GPU environments. +## Set up faster-whisper for offline speech recognition -You'll use faster‑whisper as the STT engine to convert raw microphone input into structured text. At this stage, the goal is to install faster‑whisper correctly and verify that it can transcribe audio reliably. Detailed tuning and integration are covered in later sections. +[faster-whisper](https://github.com/SYSTRAN/faster-whisper) is a high-performance re-implementation of OpenAI Whisper, designed to significantly reduce transcription latency and memory usage. It's well suited for local and real-time speech-to-text (STT) pipelines, especially when running on CPU-only systems or hybrid CPU/GPU environments. + +You'll use faster-whisper as the STT engine to convert raw microphone input into structured text. At this stage, the goal is to install faster-whisper correctly and verify that it can transcribe audio reliably. Detailed tuning and integration are covered in later sections. ### Install build dependencies @@ -22,11 +24,11 @@ sudo apt install python3.12 python3.12-venv python3.12-dev -y sudo apt install gcc portaudio19-dev ffmpeg -y ``` -## Create and activate Python environment +## Create and activate a Python environment In particular, [pyaudio](https://pypi.org/project/PyAudio/) (used for real-time microphone capture) depends on the PortAudio library and the Python C API. These must match the version of Python you're using. -Now that the system libraries are in place and audio input is verified, it's time to set up an isolated Python environment for your voice assistant project. This will prevent dependency conflicts and make your installation reproducible. +Set up an isolated Python environment for your voice assistant project to prevent dependency conflicts and make your installation reproducible. 
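Before creating the environment, it's worth a quick sanity check that the Python 3.12 interpreter and the PortAudio development files installed above are visible. This is an optional sketch; the package name assumes the apt commands shown earlier.

```bash
# Confirm the interpreter the virtual environment will be built from
python3.12 --version

# Confirm the PortAudio headers that PyAudio compiles against are installed
dpkg -s portaudio19-dev | grep Status
```

Then create and activate the environment: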
```bash python3.12 -m venv va_env @@ -53,7 +55,7 @@ pip install requests webrtcvad sounddevice==0.5.3 ``` {{% notice Note %}} -While sounddevice==0.5.4 is available, it introduces callback-related errors during audio stream cleanup that may confuse beginners. +While sounddevice==0.5.4 is available, it introduces callback-related errors during audio stream cleanup that can confuse beginners. Use sounddevice==0.5.3, which is stable and avoids these warnings. {{% /notice %}} @@ -162,7 +164,7 @@ Recording for 10 seconds... {{% notice Note %}} To stop the script, press Ctrl+C during any transcription loop. The current 10-second recording completes and transcribes before the program exits cleanly. -Avoid using Ctrl+Z, which suspends the process instead of terminating it. +Don't use Ctrl+Z, which suspends the process instead of terminating it. {{% /notice %}} @@ -189,7 +191,7 @@ pip install sounddevice==0.5.3 You can record audio without errors, but nothing is played back. -Verify that your USB microphone or headset is selected as the default input/output device. Also ensure the system volume is not muted. +Ensure that your USB microphone or headset is selected as the default input/output device. Also check that the system volume isn't muted. **Fix:** List all available audio devices: diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/3_fasterwhisper.md b/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/3_fasterwhisper.md index ba6ec94c0b..9b8fe584da 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/3_fasterwhisper.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/3_fasterwhisper.md @@ -6,6 +6,8 @@ weight: 4 layout: learningpathall --- +## Build a CPU-based speech-to-text engine + In this section, you'll build a real-time speech-to-text (STT) pipeline using only the CPU. Starting from a basic 10-second recorder, you'll incrementally add noise filtering, sentence segmentation, and parallel audio processing to achieve a transcription engine for Arm-based systems like DGX Spark. You'll start from a minimal loop and iterate toward a multithreaded, VAD-enhanced STT engine. @@ -104,7 +106,7 @@ When you speak to the device, the output is similar to: {{% notice Note %}} faster-whisper supports many models like tiny, base, small, medium and large-v1/2/3. -Check the [GitHub repository](https://github.com/openai/whisper?tab=readme-ov-file#available-models-and-languages) for more model details. +See the [GitHub repository](https://github.com/openai/whisper?tab=readme-ov-file#available-models-and-languages) for more model details. {{% /notice %}} @@ -238,15 +240,15 @@ When you say a long sentence with multiple clauses, the output is similar to: Segment done. ``` -The result is a smoother and more accurate voice UX—particularly important when integrating with downstream LLMs in later sections. +The result is a smoother and more accurate voice UX - particularly important when integrating with downstream LLMs in later sections. ### Demo: Real-time speech transcription on Arm CPU with faster-whisper -This demo shows the real-time transcription pipeline in action, running on an Arm-based DGX Spark system. Using a USB microphone and the faster-whisper model (`medium.en`), the system records voice input, processes it on the CPU, and returns accurate transcriptions with timestamps—all without relying on cloud services. 
+This demo shows the real-time transcription pipeline in action, running on an Arm-based DGX Spark system. Using a USB microphone and the faster-whisper model (`medium.en`), the system records voice input, processes it on the CPU, and returns accurate transcriptions with timestamps - all without relying on cloud services. Notice the clean terminal output and low latency, demonstrating how the pipeline is optimized for local, real-time voice recognition on resource-efficient hardware. -![Real-time speech transcription demo with volume visualization#center](fasterwhipser_demo1.gif "Figure 1: Real-time speech transcription with audio volume bar") +![Real-time speech transcription demo with volume visualization alt-txt#center](fasterwhipser_demo1.gif "Real-time speech transcription with audio volume bar") The device runs audio capture and transcription in parallel. Use `threading.Thread` to collect audio without blocking, store audio frames in a `queue.Queue`, and in the main thread, poll for new data and run STT. diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/3a_segmentation.md b/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/3a_segmentation.md index 579912ab75..0db3781579 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/3a_segmentation.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/3a_segmentation.md @@ -6,7 +6,9 @@ weight: 5 layout: learningpathall --- -After applying the previous steps—model upgrade, VAD, smart turn detection, and multi-threaded audio collection—you now have a high-quality, CPU-based local speech-to-text system. +## Optimize speech segmentation for your environment + +After applying the previous steps-model upgrade, VAD, smart turn detection, and multi-threaded audio collection - you now have a high-quality, CPU-based local speech-to-text system. At this stage, the core pipeline is complete. What remains is fine-tuning: adapting the system to your environment, microphone setup, and speaking style. This flexibility is one of the key advantages of a fully local STT pipeline. @@ -42,7 +44,7 @@ Adjust this setting based on background noise and microphone quality. ### Tuning `MIN_SPEECH_SEC` and `SILENCE_LIMIT_SEC` -- `MIN_SPEECH_SEC`: This parameter defines the minimum duration of detected speech required before a segment is considered valid. Use this to filter out very short utterances such as false starts or background chatter. +- `MIN_SPEECH_SEC`: This parameter defines the minimum duration of detected speech needed before a segment is considered valid. Use this to filter out very short utterances such as false starts or background chatter. - Lower values: More responsive, but may capture incomplete phrases or noise - Higher values: More stable sentences, but slower response @@ -58,7 +60,7 @@ Based on practical experiments, the following presets provide a good starting po |----------------------|----------------------|-------------------------|-------------------| | Short command phrases | 0.8 | 0.6 | Optimized for quick voice commands such as "yes", "next", or "stop". Prioritizes responsiveness over sentence completeness. | | Natural conversational speech | 1.0 | 1.0 | Balanced settings for everyday dialogue with natural pauses between phrases. | -| Long-form explanations (for example, tutorials) | 2.0 | 2.0 | Designed for longer sentences and structured explanations, reducing the risk of premature segmentation. 
| +| Long-form explanations such as tutorials | 2.0 | 2.0 | Designed for longer sentences and structured explanations, reducing the risk of premature segmentation. | ## Apply these settings diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/4_vllm.md b/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/4_vllm.md index 1f712a49ca..888967ce41 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/4_vllm.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/4_vllm.md @@ -6,9 +6,11 @@ weight: 6 layout: learningpathall --- -In the previous section, you built a complete Speech-to-Text (STT) engine using faster-whisper, running efficiently on Arm-based CPUs. Now it's time to add the next major building block: a local large language model (LLM) that can generate intelligent responses from user input. +## Deploy vLLM for local language generation -You'll integrate [vLLM](https://vllm.ai/)—a high-performance LLM inference engine that runs on GPU and supports advanced features such as continuous batching, OpenAI-compatible APIs, and quantized models like GPTQ and AWQ. +In the previous section, you built a complete Speech-to-Text (STT) engine using faster-whisper, running efficiently on Arm-based CPUs. Now it's time to add the next building block: a local large language model (LLM) that can generate intelligent responses from user input. + +You'll integrate [vLLM](https://vllm.ai/), a high-performance LLM inference engine that runs on GPU and supports advanced features such as continuous batching, OpenAI-compatible APIs, and quantized models. ### Why vLLM? @@ -18,35 +20,34 @@ vLLM is especially effective in hybrid systems like the DGX Spark, where CPU cor ### Install and launch vLLM with GPU acceleration -In this section, you’ll install and launch vLLM—an optimized large language model (LLM) inference engine that runs efficiently on GPU. This component will complete your local speech-to-response pipeline by transforming transcribed text into intelligent replies. +In this section, you'll install and launch vLLM - an optimized large language model (LLM) inference engine that runs efficiently on GPU. This component will complete your local speech-to-response pipeline by transforming transcribed text into intelligent replies. -#### Step 1: Install Docker and pull vLLM image +#### Install Docker and pull vLLM image The most efficient way to install vLLM on DGX Spark is using the NVIDIA official Docker image. Before you pull the image, ensure [Docker](https://docs.nvidia.com/dgx/dgx-spark/nvidia-container-runtime-for-docker.html) is installed and functioning on DGX Spark. Then enable Docker GPU access and pull the latest NVIDIA vLLM container: ```bash -sudo usermod -aG docker $USER -newgrp docker - -docker ps # check Docker is working - export LATEST_VLLM_VERSION=25.11-py3 docker pull nvcr.io/nvidia/vllm:${LATEST_VLLM_VERSION} ``` Confirm the image was downloaded: -``` +```bash docker images +``` +The image is shown in the output: + +```output nvcr.io/nvidia/vllm 25.11-py3 d33d4cadbe0f 2 months ago 14.1GB ``` -#### Step 2: Download a quantized model (GPTQ) +#### Download a quantized model (GPTQ) -Use Hugging Face CLI to download a pre-quantized LLM such as Mistral-7B-Instruct-GPTQ and Meta-Llama-3-70B-Instruct-GPTQ models for following Real-Time AI Conversations. 
+Use Hugging Face CLI to download a pre-quantized LLM such as Mistral-7B-Instruct-GPTQ and Meta-Llama-3-70B-Instruct-GPTQ models for real-time AI conversations. ```bash pip install huggingface_hub @@ -57,7 +58,6 @@ After logging in successfully, download the specific models: ```bash mkdir -p ~/models - # Mistral 7B GPTQ hf download TheBloke/Mistral-7B-Instruct-v0.2-GPTQ --local-dir ~/models/mistral-7b # (Optional) LLaMA3 70B GPTQ (requires more GPU memory) @@ -72,7 +72,7 @@ tree ~/models/mistral-7b -L 1 The files should include config.json, tokenizer.model, model.safetensors, etc. -```log +```output ├── config.json ├── generation_config.json ├── model.safetensors @@ -86,7 +86,7 @@ The files should include config.json, tokenizer.model, model.safetensors, etc. 1 directory, 9 files ``` -#### Step 3: Run the vLLM server with GPU +#### Run the vLLM server with GPU Mount your local ~/models directory and start the vLLM inference server with your downloaded model: @@ -101,13 +101,12 @@ docker run -it --gpus all -p 8000:8000 \ ``` {{% notice Note %}} -Tip: The first launch will compile and cache the model. To reduce startup time in future runs, consider creating a Docker snapshot with docker commit. +The first launch compiles and caches the model. To reduce startup time in future runs, consider creating a Docker snapshot with docker commit. {{% /notice %}} -You can also check your NVIDIA driver and CUDA compatibility during the vLLM launch. - +You can also check your NVIDIA driver and CUDA compatibility during the vLLM launch by looking at the output. -```log +```output NVIDIA Release 25.11 (build 231063344) vLLM Version 0.11.0+582e4e37 Container image Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. @@ -210,10 +209,10 @@ Capturing CUDA graphs (decode, FULL): 100%|████████████ You don’t need to read every log line. As long as you see Application startup complete, your model server is ready. {{% /notice %}} +#### Verify the server is running -#### Step 4: Verify the server is running +Once you see the message "Application startup complete." vLLM is ready to run the model. -Once you see the message "Application startup complete." in guest OS, vLLM is ready to run the model. Send a test request with curl on other terminal: ```bash @@ -228,298 +227,13 @@ curl http://localhost:8000/v1/chat/completions \ If successful, the response includes a text reply from the model. -```log +```output {"id":"chatcmpl-19aee139aabc474c93a3d211ee89d2c8","object":"chat.completion","created":1769183473,"model":"/models/mistral-7b","choices":[{"index":0,"message":{"role":"assistant","content":" RISC (Reduced Instruction Set Computing) is a computer architecture design where the processor has a simpler design and a smaller instruction set compared to CISC (Complex Instruction Set Computing) processors. RISC processors execute a larger number of simpler, more fundamental instructions. Here are some key features of RISC architecture:\n\n1. **Reduced Instruction Set:** RISC processors use a small set of basic instructions that can be combined in various ways to perform complex tasks. This is in contrast to CISC processors, which have a larger instruction set that includes instructions for performing complex tasks directly.\n2. **Register-based:** RISC processors often make extensive use of registers to store data instead of memory. They have a larger number of registers compared to CISC processors, and instructions typically operate directly on these registers. 
This reduces the number of memory accesses, resulting in faster execution.\n3. **Immediate addressing:** RISC instruction format includes immediate addressing, meaning some instruction operands are directly encoded within the instruction itself, like an add instruction with a constant value. This eliminates the need for additional memory fetch operations, which can save clock cycles.\n4.","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":14,"total_tokens":270,"completion_tokens":256,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null} ``` - -### Connect Speech Recognition Output to vLLM - -Now that both ***faster-whisper*** and ***vLLM*** are working independently, it’s time to connect them into a real-time speech-to-response pipeline. Your system will listen to live audio, transcribe it, and send the transcription to vLLM to generate an intelligent reply—all running locally without cloud services. - - -#### Dual Process Architecture: vLLM and STT - -For a robust and production-aligned architecture, you will separating the system into two independent processes: -- ***vLLM Server (in Docker)***: Hosts the large language model, optimized for GPU inference. It can run standalone and be reused across multiple services. -- ***STT Client (Python)***: A lightweight CPU-based process that captures microphone input, runs transcription, and sends queries to the vLLM server over HTTP. - -This separation has several advantages: - -- ***Modularity***: – STT and LLM logic can be developed, updated, or debugged independently. -- ***Flexibility***: – Restart or refine your STT pipeline without touching the model backend. -- ***Performance Isolation***: – GPU-heavy inference doesn’t block audio input or local UI logic. -- ***Production Alignment***: – Mirrors real-world architectures like client-server or microservices. - - -#### Step 1: Launch vLLM (in Docker) - -Separating container startup from model launch provides greater control and improves development experience. - -By launching the container first, you can troubleshoot errors like model path issues or GPU memory limits directly inside the environment—without the container shutting down immediately. It also speeds up iteration: you avoid reloading the entire image each time you tweak settings or restart the model. - -This structure also improves visibility. You can inspect files, monitor GPU usage, or run diagnostics like `curl` and `nvidia-smi` inside the container. Breaking these steps apart makes the process easier to understand, debug, and extend. - -1. Start the Docker container - -```bash -export LATEST_VLLM_VERSION=25.11-py3 - -sudo docker run --gpus all \ - -p 8000:8000 \ - -v $HOME/models:/models \ - -e NVIDIA_VISIBLE_DEVICES=all \ - -it nvcr.io/nvidia/vllm:${LATEST_VLLM_VERSION} bash -``` - -2. Inside the container, launch vLLM - -```bash -vllm serve /models/mistral-7b \ - --quantization gptq \ - --gpu-memory-utilization 0.9 \ - --max-num-seqs 8 \ - --dtype float16 -``` - -Once you see the message: -``` -(APIServer pid=1) INFO: Started server process [1] -(APIServer pid=1) INFO: Waiting for application startup. -(APIServer pid=1) INFO: Application startup complete. -``` - -The vLLM server is now live and ready to accept HTTP requests. 
- -#### Step 2: Extend STT Python to connect vLLM for instant AI responses - -Now that you've implemented a real-time speech recognizer, extend the pipeline by connecting it to a local language model (LLM) powered by vLLM. - -You'll convert the STT result into a message prompt, send it to the running vLLM server via HTTP, dynamically estimate max_tokens based on input length, and print the model's reply next to the transcribed speech. - -1. Set up the LLM endpoint and model reference. - -Define LLM endpoint and model path by adding the following variables at the top of your script: - -```python -VLLM_ENDPOINT = "http://localhost:8000/v1/chat/completions" -MODEL_NAME = "/models/mistral-7b" -``` - -Make sure these match the vLLM server you launched in the previous step. - - -2. Format Transcription and Send to vLLM: - -After transcribing the user’s speech, send the result to the vLLM server by formatting it as a chat prompt. - -```python -user_text = " ".join([seg.text.strip() for seg in segments]).strip() -``` - -Estimate token length and send the request: - -```python -max_tokens = min(256, max(64, len(user_text.split()) * 5)) -response = requests.post(VLLM_ENDPOINT, json={ - "model": MODEL_NAME, - "messages": [{"role": "user", "content": user_text}], - "max_tokens": max_tokens -}) -``` - -Extract the assistant’s reply from the vLLM API response: - -```python -result = response.json() -reply = result["choices"][0]["message"]["content"].strip() -``` - -3. Extract and display the model’s response: - -Display both the transcribed input and the model’s response: -```python -... -print(f"\n User: {user_text}\n") -... -print(f" AI : {reply}\n") -``` - -This architecture mirrors the OpenAI Chat API design, enabling future enhancements like system-level prompts, multi-turn history, or role-specific behavior. - -{{% notice tip %}} -If you encounter a "model does not exist" error, double-check the model path you used when launching vLLM. It must match MODEL_NAME exactly. -{{% /notice %}} - -Switch to another terminal to execute following Python code. - -```python -import pyaudio -import numpy as np -import webrtcvad -import time -import torch -import threading -import queue -import requests -from faster_whisper import WhisperModel -from collections import deque - -# --- Parameters --- -SAMPLE_RATE = 16000 -FRAME_DURATION_MS = 30 -FRAME_SIZE = int(SAMPLE_RATE * FRAME_DURATION_MS / 1000) -VAD_MODE = 3 -SILENCE_LIMIT_SEC = 1.0 -MIN_SPEECH_SEC = 2.0 -VLLM_ENDPOINT = "http://localhost:8000/v1/chat/completions" -MODEL_NAME = "/models/mistral-7b" - -# --- Init VAD and buffers --- -vad = webrtcvad.Vad(VAD_MODE) -speech_buffer = deque() -speech_started = False -last_speech_time = time.time() - -# --- Init Thread and Queue --- -audio_queue = queue.Queue() -stop_event = threading.Event() - -# --- Init Whisper model --- -device = "cpu" # "cpu" or "gpu" -compute_type = "int8" # "int8" or "float16", "int8", "int4" -model = WhisperModel("medium.en", device=device, compute_type=compute_type) - -# --- Audio capture thread --- -def audio_capture(): - pa = pyaudio.PyAudio() - stream = pa.open(format=pyaudio.paInt16, - channels=1, - rate=SAMPLE_RATE, - input=True, - frames_per_buffer=FRAME_SIZE) - print(" Listening... 
Press Ctrl+C to stop") - try: - while not stop_event.is_set(): - frame = stream.read(FRAME_SIZE, exception_on_overflow=False) - audio_queue.put(frame) - finally: - stream.stop_stream() - stream.close() - pa.terminate() - -# --- Start audio capture thread --- -threading.Thread(target=audio_capture, daemon=True).start() - -# --- Main loop: process queue and transcribe --- -try: - while True: - if audio_queue.empty(): - time.sleep(0.01) - continue - - frame = audio_queue.get() - is_speech = vad.is_speech(frame, SAMPLE_RATE) - - if is_speech: - speech_buffer.append(frame) - speech_started = True - last_speech_time = time.time() - elif speech_started: - speech_duration = len(speech_buffer) * (FRAME_DURATION_MS / 1000.0) - silence_duration = time.time() - last_speech_time - - if silence_duration > SILENCE_LIMIT_SEC: - if speech_duration >= MIN_SPEECH_SEC: - print(" Transcribing buffered speech...") - audio_bytes = b"".join(speech_buffer) - audio_np = np.frombuffer(audio_bytes, dtype=np.int16).astype(np.float32) / 32768.0 - - segments, _ = model.transcribe(audio_np, language="en") - user_text = " ".join([seg.text.strip() for seg in segments]).strip() - print(f"\n User: {user_text}\n") - - max_tokens = 128 - response = requests.post(VLLM_ENDPOINT, json={ - "model": MODEL_NAME, - "messages": [ - {"role": "user", "content": user_text} - ], - "max_tokens": max_tokens - }) - result = response.json() - if "choices" not in result: - print(" Error from vLLM:", result.get("error", "Unknown error")) - continue - reply = result["choices"][0]["message"]["content"].strip() - print(f" AI : {reply}\n") - else: - print(f" Skipped short segment ({speech_duration:.2f}s < {MIN_SPEECH_SEC}s)") - - speech_buffer.clear() - speech_started = False -except KeyboardInterrupt: - print(" Stopped") -finally: - stop_event.set() -``` - - -#### Step 3: Interact with the chatbot - -Once both your vLLM server and Python STT script are running correctly, you'll see output like the following in your terminal. - -Each time you speak a full sentence (based on your silence/segment thresholds), the system transcribes your speech, displays the recognized text, and shows the model's reply in natural language. - - -If your input is too short (for example, a false trigger or a background noise spike), you'll see a message like: - -``` -Skipped short segment (1.32s < 2.0s) -``` - -This means your speech did not meet the MIN_SPEECH_SEC threshold. You can adjust this value in the next section to make the system more or less sensitive. - - -An example when asking the assistant for a joke: - -``` - Listening... Press Ctrl+C to stop - Skipped short segment (0.39s < 2.0s) - Skipped short segment (1.44s < 2.0s) - Skipped short segment (1.89s < 2.0s) - Skipped short segment (1.77s < 2.0s) - Skipped short segment (0.36s < 2.0s) - Transcribing buffered speech... - - Listening... Press Ctrl+C to stop - Transcribing buffered speech... - - User: Hello, please tell me the joke. - - AI : Of course, I'd be happy to tell you a joke! Here's a classic one: - -Why don't libraries smell like popcorn? - -Because they are full of books, not movies! - -I hope that brings a smile to your face. If you have any other requests, feel free to ask! -``` - -If your input is too short, you’ll see: - -``` -Skipped short segment (1.32s < 2.0s) -``` - -{{% notice tip %}} -You can fine-tune these parameters in future sections to better fit your speaking style or environment. 
-{{% /notice %}} - ## What you've accomplished and what's next -You've successfully built a complete voice-to-AI-response loop: microphone input is captured in real time, transcribed locally using faster-whisper on CPU, forwarded to a local vLLM server running on GPU, and receives intelligent responses with low latency. +You've successfully installed and launched vLLM on DGX Spark using Docker, downloaded a quantized LLM model (GPTQ format), and verified the server responds to API requests. Your GPU-accelerated language model is now ready to generate intelligent responses. -This foundation supports a wide range of customizations in the next section, where you'll build customer-specific workflows with prompt engineering and multi-turn memory. +In the next section, you'll connect this vLLM server to the STT pipeline you built earlier, creating a complete voice-to-response system. -In the next section, you'll adapt this core pipeline for real-world assistant scenarios like customer service. \ No newline at end of file diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/4a_integration.md b/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/4a_integration.md new file mode 100644 index 0000000000..8322db6ee1 --- /dev/null +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/4a_integration.md @@ -0,0 +1,291 @@ +--- +title: Connect speech recognition to vLLM for real-time voice interaction +weight: 7 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Integrate STT with vLLM for voice interaction + +Now that both faster-whisper and vLLM are working independently, it's time to connect them into a real-time speech-to-response pipeline. Your system will listen to live audio, transcribe it, and send the transcription to vLLM to generate an intelligent reply - all running locally without cloud services. + +### Dual process architecture: vLLM and STT + +For a robust and production-aligned architecture, separate the system into two independent processes: +- vLLM Server (in Docker): Hosts the large language model, optimized for GPU inference. It can run standalone and be reused across multiple services. +- STT Client (Python): A lightweight CPU-based process that captures microphone input, runs transcription, and sends queries to the vLLM server over HTTP. + +This separation has several advantages: + +- Modularity: STT and LLM logic can be developed, updated, or debugged independently. +- Flexibility: Restart or refine your STT pipeline without touching the model backend. +- Performance Isolation: GPU-heavy inference doesn't block audio input or local UI logic. +- Production Alignment: Mirrors real-world architectures like client-server or microservices. + +### Launch vLLM and connect to STT + +#### Launch vLLM (in Docker) + +Separating container startup from model launch provides greater control and improves development experience. + +By launching the container first, you can troubleshoot errors like model path issues or GPU memory limits directly inside the environment, without the container shutting down immediately. It also speeds up iteration: you avoid reloading the entire image each time you tweak settings or restart the model. + +This structure also improves visibility. You can inspect files, monitor GPU usage, or run diagnostics like `curl` and `nvidia-smi` inside the container. Breaking these steps apart makes the process easier to understand, debug, and extend. 
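For example, once you have a shell inside the container (started below), a couple of quick diagnostics can confirm that the GPU and the mounted model files are visible before you launch the server. This is an optional sketch; the path assumes the `~/models` mount used in the next command.

```bash
# Confirm the GPU is visible from inside the container
nvidia-smi

# Confirm the mounted model directory contains the downloaded weights
ls /models/mistral-7b
```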
+ +Start the Docker container: + +```bash +export LATEST_VLLM_VERSION=25.11-py3 +docker run --gpus all \ + -p 8000:8000 \ + -v $HOME/models:/models \ + -e NVIDIA_VISIBLE_DEVICES=all \ + -it nvcr.io/nvidia/vllm:${LATEST_VLLM_VERSION} bash +``` + +Inside the container, launch vLLM: + +```bash +vllm serve /models/mistral-7b \ + --quantization gptq \ + --gpu-memory-utilization 0.9 \ + --max-num-seqs 8 \ + --dtype float16 +``` + +Look for "Application startup complete." in the output: + +```output +(APIServer pid=1) INFO: Started server process [1] +(APIServer pid=1) INFO: Waiting for application startup. +(APIServer pid=1) INFO: Application startup complete. +``` + +The vLLM server is now live and ready to accept HTTP requests. + +#### Extend STT Python to connect vLLM for instant AI responses + +Now that you've implemented a real-time speech recognizer, extend the pipeline by connecting it to a local language model (LLM) powered by vLLM. + +You'll convert the STT result into a message prompt, send it to the running vLLM server via HTTP, dynamically estimate max_tokens based on input length, and print the model's reply next to the transcribed speech. + +Set up the LLM endpoint and model reference by adding the following variables at the top of your script: + +```python +VLLM_ENDPOINT = "http://localhost:8000/v1/chat/completions" +MODEL_NAME = "/models/mistral-7b" +``` + +Make sure these match the vLLM server you launched in the previous step. + +After transcribing the user's speech, send the result to the vLLM server by formatting it as a chat prompt. + +```python +user_text = " ".join([seg.text.strip() for seg in segments]).strip() +``` + +Estimate token length and send the request: + +```python +max_tokens = min(256, max(64, len(user_text.split()) * 5)) +response = requests.post(VLLM_ENDPOINT, json={ + "model": MODEL_NAME, + "messages": [{"role": "user", "content": user_text}], + "max_tokens": max_tokens +}) +``` + +Extract the assistant's reply from the vLLM API response: + +```python +result = response.json() +reply = result["choices"][0]["message"]["content"].strip() +``` + +Display both the transcribed input and the model's response: + +```python +... +print(f"\n User: {user_text}\n") +... +print(f" AI : {reply}\n") +``` + +This architecture mirrors the OpenAI Chat API design, enabling future enhancements like system-level prompts, multi-turn history, or role-specific behavior. + +{{% notice tip %}} +If you encounter a "model doesn't exist" error, double-check the model path you used when launching vLLM. It must match MODEL_NAME exactly. 
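You can also confirm the name the server registered by querying its OpenAI-compatible model list endpoint:

```bash
curl http://localhost:8000/v1/models
```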
+{{% /notice %}} + +Switch to another terminal and save the following Python code in a file named `stt-client.py`: + +```python +import pyaudio +import numpy as np +import webrtcvad +import time +import torch +import threading +import queue +import requests +from faster_whisper import WhisperModel +from collections import deque + +# --- Parameters --- +SAMPLE_RATE = 16000 +FRAME_DURATION_MS = 30 +FRAME_SIZE = int(SAMPLE_RATE * FRAME_DURATION_MS / 1000) +VAD_MODE = 3 +SILENCE_LIMIT_SEC = 1.0 +MIN_SPEECH_SEC = 2.0 +VLLM_ENDPOINT = "http://localhost:8000/v1/chat/completions" +MODEL_NAME = "/models/mistral-7b" + +# --- Init VAD and buffers --- +vad = webrtcvad.Vad(VAD_MODE) +speech_buffer = deque() +speech_started = False +last_speech_time = time.time() + +# --- Init Thread and Queue --- +audio_queue = queue.Queue() +stop_event = threading.Event() + +# --- Init Whisper model --- +device = "cpu" # "cpu" or "gpu" +compute_type = "int8" # "int8" or "float16", "int8", "int4" +model = WhisperModel("medium.en", device=device, compute_type=compute_type) + +# --- Audio capture thread --- +def audio_capture(): + pa = pyaudio.PyAudio() + stream = pa.open(format=pyaudio.paInt16, + channels=1, + rate=SAMPLE_RATE, + input=True, + frames_per_buffer=FRAME_SIZE) + print(" Listening... Press Ctrl+C to stop") + try: + while not stop_event.is_set(): + frame = stream.read(FRAME_SIZE, exception_on_overflow=False) + audio_queue.put(frame) + finally: + stream.stop_stream() + stream.close() + pa.terminate() + +# --- Start audio capture thread --- +threading.Thread(target=audio_capture, daemon=True).start() + +# --- Main loop: process queue and transcribe --- +try: + while True: + if audio_queue.empty(): + time.sleep(0.01) + continue + + frame = audio_queue.get() + is_speech = vad.is_speech(frame, SAMPLE_RATE) + + if is_speech: + speech_buffer.append(frame) + speech_started = True + last_speech_time = time.time() + elif speech_started: + speech_duration = len(speech_buffer) * (FRAME_DURATION_MS / 1000.0) + silence_duration = time.time() - last_speech_time + + if silence_duration > SILENCE_LIMIT_SEC: + if speech_duration >= MIN_SPEECH_SEC: + print(" Transcribing buffered speech...") + audio_bytes = b"".join(speech_buffer) + audio_np = np.frombuffer(audio_bytes, dtype=np.int16).astype(np.float32) / 32768.0 + + segments, _ = model.transcribe(audio_np, language="en") + user_text = " ".join([seg.text.strip() for seg in segments]).strip() + print(f"\n User: {user_text}\n") + + max_tokens = 128 + response = requests.post(VLLM_ENDPOINT, json={ + "model": MODEL_NAME, + "messages": [ + {"role": "user", "content": user_text} + ], + "max_tokens": max_tokens + }) + result = response.json() + if "choices" not in result: + print(" Error from vLLM:", result.get("error", "Unknown error")) + continue + reply = result["choices"][0]["message"]["content"].strip() + print(f" AI : {reply}\n") + else: + print(f" Skipped short segment ({speech_duration:.2f}s < {MIN_SPEECH_SEC}s)") + + speech_buffer.clear() + speech_started = False +except KeyboardInterrupt: + print(" Stopped") +finally: + stop_event.set() +``` + +Run the code using: + +```bash +python3 ./stt-client.py +``` + +#### Interact with the chatbot + +Once both your vLLM server and Python STT script are running correctly, you'll see output like the following in your terminal. + +Each time you speak a full sentence (based on your silence and segment thresholds), the system transcribes your speech, displays the recognized text, and shows the model's reply in natural language. 
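If the transcription appears but no AI reply follows, you can isolate the language-model half of the pipeline by sending a typed message through the same request path the voice client uses. This is a minimal sketch that reuses the endpoint and payload format from `stt-client.py` above:

```python
import requests

VLLM_ENDPOINT = "http://localhost:8000/v1/chat/completions"
MODEL_NAME = "/models/mistral-7b"

# Send a typed prompt through the same request path the voice client uses
response = requests.post(VLLM_ENDPOINT, json={
    "model": MODEL_NAME,
    "messages": [{"role": "user", "content": "Hello, can you hear me?"}],
    "max_tokens": 64,
})
print(response.json()["choices"][0]["message"]["content"].strip())
```

If this prints a reply, the vLLM side is healthy and any remaining issue is in audio capture or segmentation.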
+ +If your input is too short (a false trigger or background noise spike), you'll see a message like: + +``` +Skipped short segment (1.32s < 2.0s) +``` + +This means your speech did not meet the MIN_SPEECH_SEC threshold. You can adjust this value to make the system more or less sensitive. + +An example when asking the assistant for a joke: + +``` + Listening... Press Ctrl+C to stop + Skipped short segment (0.39s < 2.0s) + Skipped short segment (1.44s < 2.0s) + Skipped short segment (1.89s < 2.0s) + Skipped short segment (1.77s < 2.0s) + Skipped short segment (0.36s < 2.0s) + Transcribing buffered speech... + + Listening... Press Ctrl+C to stop + Transcribing buffered speech... + + User: Hello, please tell me the joke. + + AI : Of course, I'd be happy to tell you a joke! Here's a classic one: + +Why don't libraries smell like popcorn? + +Because they are full of books, not movies! + +I hope that brings a smile to your face. If you have any other requests, feel free to ask! +``` + +If your input is too short, you'll see: + +``` +Skipped short segment (1.32s < 2.0s) +``` + +{{% notice Tip %}}You can fine-tune these parameters in future sections to better fit your speaking style or environment.{{% /notice %}} + +## What you've accomplished and what's next + +You've successfully built a complete voice-to-AI-response loop: microphone input is captured in real time, transcribed locally using faster-whisper on CPU, forwarded to a local vLLM server running on GPU, and receives intelligent responses with low latency. + +This foundation supports a wide range of customizations where you can build customer-specific workflows with prompt engineering and multi-turn memory. diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/5_chatbot_prompt.md b/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/5_chatbot_prompt.md index 20037b5307..001e2f89b0 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/5_chatbot_prompt.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/5_chatbot_prompt.md @@ -1,43 +1,43 @@ --- title: Specialize offline voice assistants for customer service -weight: 7 +weight: 8 ### FIXED, DO NOT MODIFY layout: learningpathall --- +## Why adapt for customer support? + In the previous section, you built a fully offline voice assistant by combining local speech-to-text (STT) with vLLM for language generation. You can transform that general-purpose chatbot into a task-specific customer service agent designed to deliver fast, focused, and context-aware assistance. -## Why adapt for customer support? +Unlike open-domain chatbots, customer-facing assistants must meet stricter communication standards. Users expect direct answers, not verbose or vague explanations, with fast responses and no long pauses. The assistant should remember previous questions or actions to support multi-turn interactions. -Unlike open-domain chatbots, customer-facing assistants must meet stricter communication standards. Each voice input must trigger a fast and relevant reply with no long pauses or uncertainty. Users expect direct answers, not verbose or vague explanations. The assistant should remember previous questions or actions to support multi-turn interactions. 
- -These needs are especially relevant for questions like password resets ("I forgot my account password and need help resetting it"), order tracking ("Can you track my recent order and tell me when it will arrive?"), billing issues ("Why was I charged twice this month?"), and subscription management ("I want to cancel my subscription and avoid future charges"). Such queries require language generation, structured behavior, and task memory. +These needs are especially relevant for password resets ("I forgot my account password and need help resetting it"), order tracking ("Can you track my recent order and tell me when it will arrive?"), billing issues ("Why was I charged twice this month?"), and subscription management ("I want to cancel my subscription and avoid future charges"). Such queries need language generation, structured behavior, and task memory. ### What you'll build in this section You’ll enhance your assistant with three critical upgrades: -1. ***Role-Specific System Prompts*** +1. Role-Specific System Prompts -Define your assistant’s personality and responsibilities (e.g., support agent, coach, guide) through system messages. Learn how prompt engineering influences tone, detail level, and actionability. +Define your assistant's personality and responsibilities (support agent, coach, guide) through system messages. Learn how prompt engineering influences tone, detail level, and actionability. -2. ***Multi-Turn Memory*** +2. Multi-Turn Memory -Enable the assistant to recall recent interactions and respond within context. You’ll manage a rolling history of messages while avoiding token overflow. +Enable the assistant to recall recent interactions and respond within context. You'll manage a rolling history of messages while avoiding token overflow. -3. ***(Optional) Secure Knowledge Retrieval*** +3. (Optional) Secure Knowledge Retrieval -Explore how to integrate local company data using vector search—allowing the assistant to answer questions based on private documents, without ever sending data to the cloud. +Explore how to integrate local company data using vector search. This allows the assistant to answer questions based on private documents, without ever sending data to the cloud. -This prepares your system for high-trust environments like: +This prepares your system for high-trust environments such as: - Enterprise customer support - Internal help desks - Regulated industries (healthcare, finance, legal) -By the end of this section, your assistant behaves like a real support agent—able to respond quickly, maintain context, and optionally access internal knowledge to resolve complex requests. +By the end of this section, your assistant behaves like a real support agent. It can respond quickly, maintain context, and optionally access internal knowledge to resolve complex requests. ## Control AI behavior with system prompts @@ -46,25 +46,26 @@ To make your AI assistant behave like a domain expert (such as a support agent o In OpenAI-compatible APIs (like vLLM), you can provide a special message with the role set to "system". This message defines the behavior and tone of the assistant before any user input is processed. -A system prompt gives your assistant a role to play, such as a polite and helpful customer service agent, a knowledgeable tour guide, or a motivational fitness coach. 
By customizing the system prompt, you can shape the assistant's language and tone, restrict or expand the type of information it shares, and align responses with business needs (such as short and precise replies for help desks). +A system prompt gives your assistant a role to play such as a polite and helpful customer service agent, a knowledgeable tour guide, or a motivational fitness coach. By customizing the system prompt, you can shape the assistant's language and tone, restrict or expand the type of information it shares, and align responses with business needs such as short and precise replies for help desks. -### Step 1: Define the System Prompt Behavior +### Define the system prompt behavior To turn your general-purpose voice assistant into a focused role-specific agent, you must guide the language model’s behavior. This is done by defining a system prompt that acts as a task instruction. -For example: -***Support Agent*** +For example: + +Support Agent: ```bash You are a professional customer support assistant. Always prioritize clarity and solve customer issues politely and efficiently. ``` -***Fitness Coach*** +Fitness Coach: ```bash You are a friendly and motivational fitness coach. Offer helpful workout tips and health advice. ``` -***Tour Guide*** +Tour Guide: ```bash You are an enthusiastic travel guide. Recommend popular tourist destinations and answer cultural questions. ``` @@ -82,7 +83,7 @@ If required information is missing, ask a clear follow-up question.""" This prompt guides the model to respond effectively, even when the user's input is vague, by reducing ambiguity, maintaining a consistent tone, and helping the assistant stay on topic and solution-oriented. -### Step 2: Inject the role instruction into the user message +### Inject the role instruction into the user message Instead of sending a separate "system" role (which may cause vLLM to return a formatting error), you can prepend the role instruction directly into the user's message. This keeps control over the assistant's behavior while maintaining compatibility with vLLM's expected message format. @@ -101,9 +102,9 @@ messages = [ ] ``` -This line embeds your desired assistant role (for example, customer support) directly into the input and ensures that vLLM treats the message as valid by alternating roles correctly between user and assistant. By combining prompt and question, the assistant behaves like a helpful agent without triggering message formatting errors. +This line embeds your desired assistant role (like customer support) directly into the input and ensures that vLLM treats the message as valid by alternating roles correctly between user and assistant. By combining prompt and question, the assistant behaves like a helpful agent without triggering message formatting errors. -### Step 3: Choose the right response length for each role +### Choose the right response length for each role Different assistant roles require different response styles. A customer service agent should be concise and focused, while a tour guide or teacher may need to provide more elaboration. 
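One convenient way to manage this is to keep each role's instruction and response budget together, so switching personas is a single lookup. The sketch below reuses the example prompts from earlier in this section; the caps are illustrative starting points, not fixed values:

```python
# Role profiles: pair each system prompt with a response-length cap (in tokens)
ROLE_PROFILES = {
    "support_agent": {
        "prompt": ("You are a professional customer support assistant. "
                   "Always prioritize clarity and solve customer issues politely and efficiently."),
        "max_tokens_cap": 256,   # short, focused replies
    },
    "tour_guide": {
        "prompt": ("You are an enthusiastic travel guide. "
                   "Recommend popular tourist destinations and answer cultural questions."),
        "max_tokens_cap": 384,   # allows more elaboration
    },
}

role = ROLE_PROFILES["support_agent"]
SYSTEM_PROMPT = role["prompt"]
MAX_TOKENS_CAP = role["max_tokens_cap"]
```

You can then use `MAX_TOKENS_CAP` as the upper bound in the formula below instead of a fixed 512.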
@@ -113,14 +114,15 @@ To control the response length dynamically, we calculate max_tokens using this f max_tokens = min(512, max(64, len(user_text.split()) * 5)) ``` -Here's how this formula works: `len(user_text.split())` counts how many words the user spoke, the multiplication by 5 estimates about 5 tokens per word (a rough average), `max()` ensures a minimum of 64 tokens to avoid overly short responses, and `min()` caps the response length to 512 tokens to avoid delays or exceeding model limits. This gives you a balanced reply size that grows with input length but stays within safe bounds. +Here's how this formula works: `len(user_text.split())` counts how many words the user spoke, multiplication by 5 estimates about five tokens per word (a rough average), `max()` ensures a minimum of 64 tokens to avoid short responses, and `min()` caps the response length to 512 tokens to avoid delays or exceeding model limits. This gives you a balanced reply size that grows with input length but stays within safe bounds. {{% notice tip %}} Adjust this formula based on role. Customer support needs shorter replies, so use `max_tokens = 256` for snappy responses. Tour guide or teacher roles are more verbose and can use `max_tokens = 384` or higher. For FAQ retrieval, adjust based on expected answer complexity. {{% /notice %}} +### Run the complete customer service assistant -The full version of the code is: +Save the following code to a file named `customer-assist.py`: ```python import pyaudio @@ -244,4 +246,12 @@ finally: stop_event.set() ``` -This approach works well for single-turn interactions. In the next section, you'll extend this pattern to multi-turn conversations, storing and managing multiple rounds of user and assistant prompts. +Run the assistant: + +```bash +python3 customer-assist.py +``` + +The assistant will listen for your voice input and respond as a customer support agent, using the system prompt to guide its behavior. + +This approach works well for single-turn interactions. You'll extend this pattern to multi-turn conversations, storing and managing multiple rounds of user and assistant prompts. diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/6_chatbot_contextaware.md b/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/6_chatbot_contextaware.md index ab0dec7310..753bb006d1 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/6_chatbot_contextaware.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/6_chatbot_contextaware.md @@ -1,16 +1,16 @@ --- title: Enable context-aware dialogue with short-term memory -weight: 8 +weight: 9 ### FIXED, DO NOT MODIFY layout: learningpathall --- -In customer service and other task-based voice interactions, conversations naturally span multiple turns. Users may provide only partial information per utterance or follow up after the assistant’s prompt. +## Why multi-turn memory matters -To handle such situations effectively, your assistant needs short-term memory—a lightweight context buffer that retains recent user questions and assistant replies. +In customer service and other task-based voice interactions, conversations naturally span multiple turns. Users can provide only partial information per utterance or follow up after the assistant's prompt. -## Why multi-turn memory matters +To handle such situations effectively, your assistant needs short-term memory. 
This is a lightweight context buffer that retains recent user questions and assistant replies. Without memory, each user input is treated in isolation. This causes breakdowns like: @@ -25,7 +25,7 @@ Example: What happens without memory This happens because the assistant doesn't remember the user's intent or its own previous message. To fix this, you'll implement a memory buffer. -### Step 1: Store previous user and assistant turns +### Store previous user and assistant turns Create a `chat_history` list to hold recent turns. After each interaction, append both user and assistant responses: @@ -42,7 +42,7 @@ chat_history.append({"role": "assistant", "content": reply}) This will build a list like: -```log +```output [ {"role": "user", "content": "I need to cancel my order."}, {"role": "assistant", "content": "Sure, can you provide the order ID?"}, @@ -51,9 +51,9 @@ This will build a list like: ] ``` -### Step 2: Keep only the most recent 5 rounds +### Keep only the most recent five rounds -Each new turn makes the message array longer. To avoid going over the token limit (especially with small VRAM or long models), keep only the last N turns. Use 5 rounds as an example (10 messages, 5 rounds of user + assistant) +Each new turn makes the message array longer. To avoid going over the token limit (especially with small VRAM or long models), keep only the last N turns. Use five rounds as an example (10 messages, five rounds of user + assistant) ```python messages = [{"role": "system", "content": SYSTEM_PROMPT}] @@ -63,7 +63,7 @@ messages.append({"role": "user", "content": user_text}) This keeps the most recent context while fitting within model constraints. -### Step 3: Estimate token usage before sending +### Estimate token usage before sending Before calling the API, estimate how many tokens your prompt is using. This includes the system prompt, all past user/assistant messages, and the new user message. @@ -72,23 +72,23 @@ prompt_tokens = len(" ".join([m["content"] for m in messages]).split()) print(f" Estimated prompt tokens: {prompt_tokens}") ``` -This helps you balance max_tokens for the assistant's response, ensuring the prompt and reply fit within the model's limit (for example, 4096 or 8192 tokens depending on the model). +This helps you balance max_tokens for the assistant's response, ensuring the prompt and reply fit within the model's limit such as 4096 or 8192 tokens depending on the model. The expected output is similar to: -```log +```output User: Hi, I need to cancel my subscription. Please help me Estimated prompt tokens: 46 Sending messages: [{'role': 'user', 'content': " You are a professional customer support assistant.\nRespond politely, clearly, and concisely.\nFocus on solving the user's issue.\nDo not include unnecessary apologies or long explanations.\nIf required information is missing, ask a clear follow-up question.\n\nHi, I need to cancel my subscription. Please help me"}] - AI : Hello, I'd be happy to help you cancel your subscription. Could you please provide me with your account information, such as your name and email address associated with the subscription? This will ensure a smooth cancellation process. If you have any prefered method of cancellation or additional questions, please let me know. + AI : Hello, I'd be happy to help you cancel your subscription. Could you please provide me with your account information, such as your name and email address associated with the subscription? This will ensure a smooth cancellation process. 
If you have any preferred method of cancellation or additional questions, please let me know. Transcribing buffered speech... ] User: My account information is abc at email.com Estimated prompt tokens: 69 -Sending messages: [{'role': 'user', 'content': 'Hi, I need to cancel my subscription. Please help me'}, {'role': 'assistant', 'content': "Hello, I'd be happy to help you cancel your subscription. Could you please provide me with your account information, such as your name and email address associated with the subscription? This will ensure a smooth cancellation process. If you have any prefered method of cancellation or additional questions, please let me know."}, {'role': 'user', 'content': 'My account information is abc at email.com'}] +Sending messages: [{'role': 'user', 'content': 'Hi, I need to cancel my subscription. Please help me'}, {'role': 'assistant', 'content': "Hello, I'd be happy to help you cancel your subscription. Could you please provide me with your account information, such as your name and email address associated with the subscription? This will ensure a smooth cancellation process. If you have any preferred method of cancellation or additional questions, please let me know."}, {'role': 'user', 'content': 'My account information is abc at email.com'}] AI : Hi abc@email.com, I'll do my best to assist you with canceling your subscription. For security reasons, I'll need to verify your account information before proceeding. Could you please confirm your account password or the last 4 digits of the payment method associated with the subscription? This will Transcribing buffered speech... ] @@ -96,7 +96,7 @@ Sending messages: [{'role': 'user', 'content': 'Hi, I need to cancel my subscrip User: The account password will be 3355. Estimated prompt tokens: 122 -Sending messages: [{'role': 'user', 'content': 'Hi, I need to cancel my subscription. Please help me'}, {'role': 'assistant', 'content': "Hello, I'd be happy to help you cancel your subscription. Could you please provide me with your account information, such as your name and email address associated with the subscription? This will ensure a smooth cancellation process. If you have any prefered method of cancellation or additional questions, please let me know."}, {'role': 'user', 'content': 'My account information is abc at email.com'}, {'role': 'assistant', 'content': "Hi abc@email.com, I'll do my best to assist you with canceling your subscription. For security reasons, I'll need to verify your account information before proceeding. Could you please confirm your account password or the last 4 digits of the payment method associated with the subscription? This will"}, {'role': 'user', 'content': 'The account password will be 3355.'}] +Sending messages: [{'role': 'user', 'content': 'Hi, I need to cancel my subscription. Please help me'}, {'role': 'assistant', 'content': "Hello, I'd be happy to help you cancel your subscription. Could you please provide me with your account information, such as your name and email address associated with the subscription? This will ensure a smooth cancellation process. If you have any preferred method of cancellation or additional questions, please let me know."}, {'role': 'user', 'content': 'My account information is abc at email.com'}, {'role': 'assistant', 'content': "Hi abc@email.com, I'll do my best to assist you with canceling your subscription. For security reasons, I'll need to verify your account information before proceeding. 
Could you please confirm your account password or the last 4 digits of the payment method associated with the subscription? This will"}, {'role': 'user', 'content': 'The account password will be 3355.'}] AI : Hi abc@email.com, I've verified your account information and you have an active subscription with us. I'm sorry to see that you'd like to cancel. To cancel your subscription, I will need to process a refund for your latest charge. This refund may take up to 7-1 Transcribing buffered speech... ] @@ -104,7 +104,7 @@ Sending messages: [{'role': 'user', 'content': 'Hi, I need to cancel my subscrip User: Okay, I see the account has been cancelled. Thanks for your help. Estimated prompt tokens: 180 -Sending messages: [{'role': 'user', 'content': 'Hi, I need to cancel my subscription. Please help me'}, {'role': 'assistant', 'content': "Hello, I'd be happy to help you cancel your subscription. Could you please provide me with your account information, such as your name and email address associated with the subscription? This will ensure a smooth cancellation process. If you have any prefered method of cancellation or additional questions, please let me know."}, {'role': 'user', 'content': 'My account information is abc at email.com'}, {'role': 'assistant', 'content': "Hi abc@email.com, I'll do my best to assist you with canceling your subscription. For security reasons, I'll need to verify your account information before proceeding. Could you please confirm your account password or the last 4 digits of the payment method associated with the subscription? This will"}, {'role': 'user', 'content': 'The account password will be 3355.'}, {'role': 'assistant', 'content': "Hi abc@email.com, I've verified your account information and you have an active subscription with us. I'm sorry to see that you'd like to cancel. To cancel your subscription, I will need to process a refund for your latest charge. This refund may take up to 7-1"}, {'role': 'user', 'content': 'Okay, I see the account has been cancelled. Thanks for your help.'}] +Sending messages: [{'role': 'user', 'content': 'Hi, I need to cancel my subscription. Please help me'}, {'role': 'assistant', 'content': "Hello, I'd be happy to help you cancel your subscription. Could you please provide me with your account information, such as your name and email address associated with the subscription? This will ensure a smooth cancellation process. If you have any preferred method of cancellation or additional questions, please let me know."}, {'role': 'user', 'content': 'My account information is abc at email.com'}, {'role': 'assistant', 'content': "Hi abc@email.com, I'll do my best to assist you with canceling your subscription. For security reasons, I'll need to verify your account information before proceeding. Could you please confirm your account password or the last 4 digits of the payment method associated with the subscription? This will"}, {'role': 'user', 'content': 'The account password will be 3355.'}, {'role': 'assistant', 'content': "Hi abc@email.com, I've verified your account information and you have an active subscription with us. I'm sorry to see that you'd like to cancel. To cancel your subscription, I will need to process a refund for your latest charge. This refund may take up to 7-1"}, {'role': 'user', 'content': 'Okay, I see the account has been cancelled. Thanks for your help.'}] AI : You're welcome, abc@email.com. I'm glad I could help you cancel your subscription. 
If there is anything else I can assist you with in the future, please don't hesitate to ask. Have a great day! ``` @@ -115,7 +115,7 @@ The assistant remembers the previous turns, including account verification and f | Speaker | Message | Prompt Tokens | |---------|---------|---------------| | User | Hi, I need to cancel my subscription. Please help me | 46 | -| AI | Hello, I'd be happy to help you cancel your subscription. Could you please provide me with your account information, such as your name and email address associated with the subscription? This will ensure a smooth cancellation process. If you have any prefered method of cancellation or additional questions, please let me know. | | +| AI | Hello, I'd be happy to help you cancel your subscription. Could you please provide me with your account information, such as your name and email address associated with the subscription? This will ensure a smooth cancellation process. If you have any preferred method of cancellation or additional questions, please let me know. | | | User | My account information is abc at email.com | 69 | | AI | Hi abc@email.com, I'll do my best to assist you with canceling your subscription. For security reasons, I'll need to verify your account information before proceeding. Could you please confirm your account password or the last 4 digits of the payment method associated with the subscription? This will | | User | The account password will be 3355. | 122 | @@ -123,13 +123,13 @@ The assistant remembers the previous turns, including account verification and f | User | Okay, I see the account has been cancelled. Thanks for your help. | 180 | | AI | You're welcome, abc@email.com. I'm glad I could help you cancel your subscription. If there is anything else I can assist you with in the future, please don't hesitate to ask. Have a great day! | | -This estimate helps you prevent prompt truncation or response cutoff, especially important when using larger models with longer histories. +This estimate helps you prevent prompt truncation or response cutoff, which is especially important when using larger models with longer histories. ## Full function of offline voice customer service on DGX Spark Now that your speech-to-AI pipeline is complete, you're ready to scale it up by running a larger, more powerful language model fully offline on DGX Spark. -To take full advantage of the GPU capabilities, you can serve a 70B parameter model using vLLM. Make sure you’ve already downloaded the model files into ~/models/llama3-70b (host OS). +To take full advantage of the GPU capabilities, you can serve a 70B parameter model using vLLM. Ensure you've already downloaded the model files into ~/models/llama3-70b (host OS). Inside the vLLM Docker container, launch the model with: @@ -141,9 +141,9 @@ vllm serve /models/llama3-70b \ --dtype float16 ``` -This command starts the high-performance LLM backend using quantized weights and optimized GPU memory allocation, giving you full-scale generation power—while keeping the assistant responsive and completely private. +This command starts the high-performance LLM backend using quantized weights and optimized GPU memory allocation, giving you full-scale generation power while keeping the assistant responsive and completely private. -Now, execute the complete Python code to activate your speech recognition and dialogue pipeline. 
Once both the STT and LLM services are live, you’ll be able to speak naturally and receive real-time, intelligent responses from the assistant—without any cloud connection. +Now, save the complete Python code to a file named `fwhisper_vllm_audio.py` to activate your speech recognition and dialogue pipeline with multi-turn memory support: ```python import pyaudio @@ -294,20 +294,27 @@ finally: stop_event.set() ``` +Run the assistant with multi-turn memory: + +```bash +python3 fwhisper_vllm_audio.py +``` + +Once both the STT and LLM services are live, you'll be able to speak naturally and receive real-time, intelligent responses from the assistant. The assistant will remember previous exchanges in the conversation, allowing for natural multi-turn dialogues without any cloud connection. ### Demo: Multi-turn voice chatbot with context memory on DGX Spark -![img2 alt-text#center](fasterwhipser_vllm_demo2.gif "Figure 2: Full Function Voice-to-AI with volume bar") +![Animated terminal session showing real-time speech-to-text transcription and AI responses in a multi-turn customer service conversation, with a volume bar at the bottom indicating live audio input levels from a microphone alt-txt#center](fasterwhipser_vllm_demo2.gif "Full function voice-to-AI with volume bar") -This demo showcases a fully offline voice assistant that combines real-time transcription (via faster-whisper) and intelligent response generation (via vLLM). Running on an Arm-based DGX Spark system, the assistant captures live audio, transcribes it, and generates context-aware replies using a local language model—all in a seamless loop. +This demo showcases a fully offline voice assistant that combines real-time transcription (via faster-whisper) and intelligent response generation (via vLLM). Running on an Arm-based DGX Spark system, the assistant captures live audio, transcribes it, and generates context-aware replies using a local language model, all in a seamless loop. The assistant now supports multi-turn memory, allowing it to recall previous user inputs and its own responses. As shown in the video, this enables natural back-and-forth conversations, such as confirming account details or resolving support requests. -No cloud services are used—ensuring full control, privacy, and low-latency performance. +No cloud services are used, ensuring full control, privacy, and low-latency performance. -### Full Voice-to-AI Conversation Flow +### Full voice-to-AI conversation flow -The following diagram summarizes the complete architecture you’ve now assembled: from microphone input to AI-generated replies—entirely local, modular, and production-ready. +The following diagram summarizes the complete architecture you've now assembled: from microphone input to AI-generated replies, entirely local, modular, and production-ready. ``` USB Microphone (16kHz mono) @@ -328,7 +335,7 @@ Transcribed User Text (timestamped) ↓ System Prompt + Conversation History ↓ -vLLM API (for example, mistral-7b or llama3-70b GPU inference) +vLLM API (such as mistral-7b or llama3-70b GPU inference) ↓ AI Response ↓ @@ -341,9 +348,9 @@ This hybrid architecture is production-ready, modular, and offline-capable. All With a fully functional offline voice chatbot running on DGX Spark, you now have a strong foundation for many advanced features. 
Here are some next-step enhancements you might consider: -- Knowledge-augmented Generation (RAG) +- Knowledge-Augmented Generation (RAG) -Integrate local document search or FAQ databases with embedding-based retrieval to answer company-specific or domain-specific queries. You can reference a previous Learning Path about [deploying RAG on DGX Spark](/learning-paths/laptops-and-desktops/dgx_spark_rag/) for the same hardware. +Integrate local document search or FAQ databases with embedding-based retrieval to answer company-specific or domain-specific queries. See the Learning Path [Deploying RAG on DGX Spark](/learning-paths/laptops-and-desktops/dgx_spark_rag/) for the same hardware. - Multi-language Support @@ -351,10 +358,10 @@ Swap in multilingual STT models and LLMs to enable assistants for global users o - Text-to-Speech (TTS) Output -Add a local TTS engine (for example, Coqui, piper, or NVIDIA Riva) to vocalize the assistant's replies, turning it into a true conversational agent. +Add a local TTS engine such as Coqui, piper, or NVIDIA Riva to vocalize the assistant's replies, turning it into a true conversational agent. - Personalization and Context Memory Extend short-term memory into long-term context retention using file-based or vector-based storage. This lets the assistant remember preferences or past sessions. -This on-device architecture enables experimentation and extension without vendor lock-in or privacy concerns, making it ideal for enterprise, educational, and embedded use cases. +This on-device architecture enables experimentation and extension without vendor lock-in or privacy concerns, making it ideal for enterprise, educational, and embedded use cases. \ No newline at end of file diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/_index.md b/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/_index.md index 1579b908c1..b3fdb84a88 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/_index.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/_index.md @@ -1,10 +1,5 @@ --- title: Build an offline voice chatbot with faster-whisper and vLLM on DGX Spark -description: Learn how to build a fully offline voice assistant by combining local speech recognition with LLM-powered responses using faster-whisper and vLLM—optimized for DGX Spark. 
- -draft: true -cascade: - draft: true minutes_to_complete: 60 @@ -13,7 +8,7 @@ who_is_this_for: This is an advanced topic for developers and ML engineers who w learning_objectives: - Explain the architecture of an offline voice chatbot pipeline combining speech-to-text (STT) and vLLM - Capture and segment real-time audio using PyAudio and Voice Activity Detection (VAD) - - Transcribe speech using faster-whisper and generate replies via vLLM + - Transcribe speech using faster-whisper and generate replies using vLLM - Tune segmentation and prompt strategies to improve latency and response quality - Deploy and run the full pipeline on Arm-based systems such as DGX Spark diff --git a/content/learning-paths/laptops-and-desktops/pytorch-finetuning-on-spark/1-setup.md b/content/learning-paths/laptops-and-desktops/pytorch-finetuning-on-spark/1-setup.md new file mode 100644 index 0000000000..01ede5ca87 --- /dev/null +++ b/content/learning-paths/laptops-and-desktops/pytorch-finetuning-on-spark/1-setup.md @@ -0,0 +1,122 @@ +--- +title: Set up your NVIDIA DGX Spark +weight: 2 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +The NVIDIA DGX Spark pairs an Arm-based Grace CPU with a Blackwell GPU in a compact desktop form factor. The GPU handles the compute-intensive training passes while the Grace CPU manages data preprocessing and orchestration, making the system well suited for fine-tuning large language models locally without sending data to the cloud. + +To get started, you'll configure Docker, pull a pre-built PyTorch container, and install the libraries you need for fine-tuning. + +## Configure Docker permissions + +Docker is pre-installed on the DGX Spark, so you don't need to install it yourself. However, your user account might not have permission to run Docker commands without `sudo`. + +Check whether Docker is accessible by opening a terminal and running: + +```bash +docker images +``` + +If this prints a table (even an empty one), you're all set and can skip ahead to the next section. If you see a `permission denied` error, add your user to the `docker` group: + +```bash +sudo usermod -aG docker $USER +newgrp docker +``` + +The first command grants your user Docker access, and `newgrp docker` activates the new group membership in your current shell so you don't need to log out and back in. Verify that it worked by running `docker images` again. You should now see the table without any errors. + +## Download PyTorch container + +NVIDIA provides pre-built PyTorch containers that include all the necessary frameworks, libraries, and dependencies optimized for NVIDIA GPUs. These containers are regularly updated and maintained, ensuring you have access to the latest stable versions without the complexity of manual dependency management. + +Pull the latest PyTorch container from NVIDIA's container registry: + +```bash +docker pull nvcr.io/nvidia/pytorch:25.11-py3 +``` + +This command downloads the November 2025 release of the PyTorch container, which includes PyTorch, CUDA libraries, cuDNN, and other essential tools pre-configured for optimal performance on NVIDIA hardware. The download size is several gigabytes, so this step might take a few minutes depending on your internet connection. + +## Launch container instance + +Now that you have the container image, you can launch an interactive session where you'll perform all your fine-tuning work. 
+ +Run the following command to start the container: + +```bash +docker run --gpus all -it --rm --ipc=host \ +-v $HOME/.cache/huggingface:/root/.cache/huggingface \ +-v ${PWD}:/workspace -w /workspace \ +nvcr.io/nvidia/pytorch:25.11-py3 +``` + +Here's what each flag does: + +- `--gpus all` gives the container access to all available GPUs on your system +- `--ipc=host` enables shared memory between the host and container, which is essential for multi-GPU training and data loading +- `-v $HOME/.cache/huggingface:/root/.cache/huggingface` mounts your Hugging Face cache directory, preventing repeated downloads of models and datasets +- `-v ${PWD}:/workspace -w /workspace` mounts your current directory into the container and sets it as the working directory, so you can access the fine-tuned model from outside the container later + +After running the command, you'll be inside the container with a root shell prompt. + +## Install dependencies + +The base PyTorch container doesn't include all the specialized libraries needed for efficient model fine-tuning. You need to install several additional Python packages that provide transformer models, parameter-efficient fine-tuning methods, dataset utilities, and training frameworks. + +Inside the running container, install the required dependencies: + +```bash +pip install transformers peft datasets trl bitsandbytes +``` + +These packages serve specific purposes: + +- `transformers` provides access to pre-trained language models and tokenizers from Hugging Face +- `peft` (Parameter-Efficient Fine-Tuning) enables techniques like LoRA and QLoRA that reduce memory requirements +- `datasets` offers a standardized interface for loading and processing training datasets +- `trl` (Transformer Reinforcement Learning) includes training utilities and recipes for language models +- `bitsandbytes` enables 4-bit and 8-bit quantization for memory-efficient training + +The installation typically takes a few minutes as pip downloads and installs each package along with their dependencies. + +## Authenticate with Hugging Face + +Many of the models you'll fine-tune are hosted on Hugging Face's model hub. Some models, particularly larger ones like Llama, require authentication to download. Even for public models, authentication provides better rate limits and tracking. + +First, obtain an access token from your [Hugging Face token settings](https://huggingface.co/settings/tokens) page. Then authenticate: + +```bash +hf auth login +``` + +When prompted, paste your token and press Enter. When asked about git credentials, enter `n` since you don't need git integration for this workflow. This authentication persists across sessions because you mounted your Hugging Face cache directory, so you won't need to repeat this step. + +## Download NVIDIA DGX Spark playbook + +NVIDIA provides a collection of ready-to-use fine-tuning scripts optimized for DGX systems. These scripts implement best practices for various model sizes and fine-tuning techniques, so you can focus on your dataset and model selection rather than training boilerplate. + +Clone the playbooks repository: + +```bash +git clone https://github.com/NVIDIA/dgx-spark-playbooks +cd dgx-spark-playbooks +git checkout e51dae47ec9233ccd722dd465be87a984fd97d61 +cd nvidia/pytorch-fine-tune/assets +``` + +The repository contains scripts for different model architectures and training strategies. The `assets` directory includes the fine-tuning scripts you'll use in the next steps. 
Each script is preconfigured with sensible defaults but also accepts command-line arguments for customization. + +## What you've accomplished and what's next + +In this section you: + +- Configured Docker permissions on DGX Spark +- Pulled the NVIDIA PyTorch container and launched an interactive session +- Installed fine-tuning libraries and authenticated with Hugging Face +- Cloned the DGX Spark playbooks repository + +In the next section, you'll learn how supervised fine-tuning works and what makes it effective for adapting pre-trained models to specific tasks. diff --git a/content/learning-paths/laptops-and-desktops/pytorch-finetuning-on-spark/2-finetuning.md b/content/learning-paths/laptops-and-desktops/pytorch-finetuning-on-spark/2-finetuning.md new file mode 100644 index 0000000000..66249cffec --- /dev/null +++ b/content/learning-paths/laptops-and-desktops/pytorch-finetuning-on-spark/2-finetuning.md @@ -0,0 +1,87 @@ +--- +title: Understand fine-tuning +weight: 3 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Why fine-tuning matters for domain knowledge + +Pre-trained models like Llama 3.1 8B have broad language skills, but they don't know everything. Ask the base model about the maximum clock speed of the RP2350 microcontroller and it confidently answers "1.8 GHz," a completely fabricated number. The actual specification is 150 MHz. + +Fine-tuning fixes this by training the model on real data from Raspberry Pi datasheets. After fine-tuning, the same model answers correctly: "The RP2350 supports up to 150 MHz." No hallucination, no guessing. + +The process breaks down into three steps: + +1. Patch the NVIDIA playbook's fine-tuning script to load a custom dataset, then run training +2. Serve both the original and fine-tuned models using vLLM +3. Compare the outputs side by side to see factual accuracy improve + +To understand why this works, you need to know three things: what supervised fine-tuning does, how the training data is structured, and what options you have for making fine-tuning efficient on your hardware. + +## How supervised fine-tuning adapts a model + +Pre-trained LLMs learn general language patterns from massive text datasets. Supervised fine-tuning (SFT) takes that foundation and reshapes the model's behavior using labeled examples that show it how to respond to specific prompts. + +Think of it as teaching by example. You provide pairs of inputs and desired outputs, and the training process adjusts the model's parameters so its responses look more like your examples. The model's original knowledge stays intact, but SFT steers how that knowledge gets applied and fills in gaps where the base model lacks specific domain expertise. + +The key benefit is efficiency. Pre-training a model from scratch can take thousands of GPUs and trillions of tokens of data. Fine-tuning achieves targeted improvements with hundreds or thousands of examples on a single GPU, which makes the DGX Spark an ideal platform for the task. 
+ +## How the training data is structured + +The NVIDIA playbook scripts use the Alpaca prompt format, which structures each training example with three fields: + +**Instruction** -- the question or task (for example, "What is the maximum clock speed of the RP2350?") + +**Input** -- optional additional context (left empty for most questions) + +**Output** -- the correct answer sourced from official datasheets + +Here's an example from the dataset you'll use: + +```json +{ + "instruction": "How many GPIO pins does the Raspberry Pi Pico 2 provide?", + "input": "", + "output": "The Raspberry Pi Pico 2 provides 26 GPIO pins." +} +``` + +During training, these fields are combined into a prompt template that the model learns to recognize and complete: + +```text +Below is an instruction that describes a task, paired with an input +that provides further context. Write a response that appropriately +completes the request. + +### Instruction: How many GPIO pins does the Raspberry Pi Pico 2 provide? + +### Input: + +### Response: The Raspberry Pi Pico 2 provides 26 GPIO pins. +``` + +The dataset you'll use contains around 250 question-answer pairs extracted from official Raspberry Pi datasheets covering the RP2040, RP2350, Pico, Pico 2, Compute Module 4, and other boards. After training on these examples, the model learns to respond with accurate, datasheet-sourced facts instead of hallucinating specifications. + +## Choosing a fine-tuning approach + +Not every model fits entirely in GPU memory during training. The fine-tuning scripts you'll work with in the next section offer several approaches to handle this: + +**Full fine-tuning** updates every parameter in the model. This gives the best results but needs enough GPU memory to hold the full model plus the optimizer state and gradients. For smaller models like Llama 3.2 3B, the DGX Spark handles this comfortably. + +**LoRA (Low-Rank Adaptation)** freezes the original model weights and trains a small set of additional parameters instead. The memory savings are significant because you only store gradients and optimizer state for a fraction of the total parameters. This is practical for 8B-class models. + +**QLoRA (Quantized LoRA)** goes a step further by loading the frozen model weights in 4-bit precision. Combined with LoRA's parameter-efficient training, this lets you fine-tune 70B-class models that would otherwise exceed available memory. + +The script you'll run in the next section uses full fine-tuning by default, but the playbook includes LoRA and QLoRA scripts for larger models. + +## What you've accomplished and what's next + +In this section you learned: + +- Why fine-tuning is valuable for teaching domain-specific facts to a base model +- How Raspberry Pi datasheet Q&A pairs are structured in the Alpaca prompt format +- The differences between full fine-tuning, LoRA, and QLoRA + +In the next section, you'll walk through the fine-tuning script, patch it to load the Raspberry Pi dataset, and run training to produce your own fine-tuned model. 
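+
+Before moving on, the snippet below is a minimal sketch of how the LoRA and QLoRA approaches described above are typically wired up with the `peft`, `transformers`, and `bitsandbytes` libraries you installed earlier. It is illustrative only: the rank, alpha, dropout, and target-module values are assumptions rather than the playbook's settings, and the playbook's own LoRA and QLoRA scripts remain the reference implementation.
+
+```python
+import torch
+from transformers import AutoModelForCausalLM, BitsAndBytesConfig
+from peft import LoraConfig, get_peft_model
+
+# QLoRA step: load the frozen base weights in 4-bit precision
+bnb_config = BitsAndBytesConfig(
+    load_in_4bit=True,
+    bnb_4bit_quant_type="nf4",
+    bnb_4bit_compute_dtype=torch.bfloat16,
+)
+
+model = AutoModelForCausalLM.from_pretrained(
+    "meta-llama/Llama-3.1-8B",
+    quantization_config=bnb_config,  # drop this line for plain (non-quantized) LoRA
+    device_map="auto",
+)
+
+# LoRA step: train small adapter matrices instead of the full weight set.
+# Rank, alpha, dropout, and target modules below are illustrative values
+# for a Llama-style model, not the playbook's configuration.
+lora_config = LoraConfig(
+    r=16,
+    lora_alpha=32,
+    lora_dropout=0.05,
+    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
+    task_type="CAUSAL_LM",
+)
+
+model = get_peft_model(model, lora_config)
+model.print_trainable_parameters()  # reports trainable vs. total parameter counts
+```
+
+The wrapped model can then be handed to `SFTTrainer` in the same way as the fully fine-tuned model; only the small adapter parameters carry gradients and optimizer state, which is where the memory savings come from.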
diff --git a/content/learning-paths/laptops-and-desktops/pytorch-finetuning-on-spark/3-pytorch.md b/content/learning-paths/laptops-and-desktops/pytorch-finetuning-on-spark/3-pytorch.md new file mode 100644 index 0000000000..ba6bf13e8a --- /dev/null +++ b/content/learning-paths/laptops-and-desktops/pytorch-finetuning-on-spark/3-pytorch.md @@ -0,0 +1,210 @@ +--- +title: Fine-tune a model with PyTorch and Hugging Face +weight: 4 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +Now that you understand how fine-tuning works, it's time to look at the actual code. In this section, you'll walk through the key parts of the NVIDIA playbook's fine-tuning script, patch it to load the Raspberry Pi dataset, and run it to produce your own fine-tuned Llama model. + +## Review the fine-tuning scripts + +The NVIDIA playbook provides four main fine-tuning scripts, each designed for different scenarios: + +| Script | Approach | Best for | +|--------|----------|----------| +| `Llama3_3B_full_finetuning.py` | Full fine-tuning (all parameters) | Smaller models where GPU memory isn't a constraint | +| `Llama3_8B_LoRA_finetuning.py` | LoRA (frozen base + small trainable adapters) | Mid-size models with reduced memory needs | +| `Llama3_70B_LoRA_finetuning.py` | LoRA + FSDP (distributed across GPUs) | Large models that need multi-GPU sharding | +| `Llama3_70B_qLoRA_finetuning.py` | QLoRA (LoRA + 4-bit quantization) | Very large models on memory-limited systems | + +The file names refer to the default model each script uses, but you can pass a different model on the command line. This Learning Path uses `Llama3_3B_full_finetuning.py`. The key sections of that script are explained below. + +## Imports and dataset preparation + +The script starts by importing PyTorch, dataset loading utilities, and the Hugging Face libraries used for supervised fine-tuning. + +```python +import torch +import argparse +from datasets import load_dataset +from trl import SFTConfig, SFTTrainer +from transformers import AutoModelForCausalLM, AutoTokenizer +``` + +The `ALPACA_PROMPT_TEMPLATE` defines the instruction-following format for training data with three fields: instruction, input, and response. Each training example is formatted using this template so the model learns to recognize the pattern and produce structured answers. + +The `get_alpaca_dataset()` function loads the Alpaca dataset from Hugging Face by default and formats each example using the template, appending the EOS (End of String) token. You'll patch this function later to load the Raspberry Pi dataset from a local JSONL file instead. + +```python +# Define prompt templates +ALPACA_PROMPT_TEMPLATE = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request. +### Instruction: {} + +### Input: {} + +### Response: {}""" + +def get_alpaca_dataset(eos_token, dataset_size=500): + # Preprocess the dataset + def preprocess(x): + texts = [ + ALPACA_PROMPT_TEMPLATE.format(instruction, input, output) + eos_token + for instruction, input, output in zip(x["instruction"], x["input"], x["output"]) + ] + return {"text": texts} + + dataset = load_dataset("tatsu-lab/alpaca", split="train").select(range(dataset_size)).shuffle(seed=42) + return dataset.map(preprocess, remove_columns=dataset.column_names, batched=True) +``` + +## Model and tokenizer loading + +The `from_pretrained()` method downloads and initializes a pre-trained language model from Hugging Face. 
The tokenizer is loaded alongside it, with the padding token set to match the EOS token (required for batched training). + +```python + # Load the model and tokenizer + print(f"Loading model: {args.model_name}") + model = AutoModelForCausalLM.from_pretrained( + args.model_name, + dtype=args.dtype, + device_map="auto", + trust_remote_code=True + ) + tokenizer = AutoTokenizer.from_pretrained(args.model_name, trust_remote_code=True) + tokenizer.pad_token = tokenizer.eos_token +``` + +## Dataset loading + +With the model and tokenizer loaded, the script prepares the training data by calling `get_alpaca_dataset()` with the tokenizer's EOS token and the specified dataset size. By default the script downloads the Alpaca dataset from Hugging Face, but you'll patch this function to load the Raspberry Pi JSONL file instead. + +```python + # Load and preprocess the dataset + print(f"Loading dataset with {args.dataset_size} samples...") + dataset = get_alpaca_dataset(tokenizer.eos_token, args.dataset_size) +``` + +## Training configuration + +The training configuration controls how the SFT process runs. Notable parameters include `num_train_epochs` (initially set to 0.01 for a warmup pass, then updated for full training), `gradient_accumulation_steps` (batches to accumulate before each weight update), `learning_rate` (optimizer step size), and `max_length` (maximum sequence length). The logging parameters determine where and how often training metrics are recorded. + +```python + # Configure the SFT config + config = { + "per_device_train_batch_size": args.batch_size, + "num_train_epochs": 0.01, # Warmup epoch + "gradient_accumulation_steps": args.gradient_accumulation_steps, + "learning_rate": args.learning_rate, + "optim": "adamw_torch", + "save_strategy": 'no', + "remove_unused_columns": False, + "seed": 42, + "dataset_text_field": "text", + "packing": False, + "max_length": args.seq_length, + "torch_compile": False, + "report_to": "none", + "logging_dir": args.log_dir, + "logging_steps": args.logging_steps, + "gradient_checkpointing": args.gradient_checkpointing, # Save memory + } +``` + +## Model compilation and training + +If `torch.compile()` is enabled, the script first optimizes the model graph for faster execution on the hardware. A short warmup pass (0.01 epochs) triggers compilation so the overhead doesn't affect the actual training run. After warmup, the script creates an `SFTTrainer` with the full epoch count and calls `trainer.train()`. The returned `trainer_stats` object contains metrics like loss, throughput, and training time. + +```python + # Compile model if requested + if args.use_torch_compile: + print("Compiling model with torch.compile()...") + model = torch.compile(model) + + # Warmup for torch compile + print("Running warmup for torch.compile()...") + SFTTrainer( + model=model, + processing_class=tokenizer, + train_dataset=dataset, + args=SFTConfig(**config), + ).train() + + # Train the model + print(f"\nStarting full fine-tuning for {args.num_epochs} epoch(s)...") + config["num_train_epochs"] = args.num_epochs + config["report_to"] = "tensorboard" + + trainer = SFTTrainer( + model=model, + processing_class=tokenizer, + train_dataset=dataset, + args=SFTConfig(**config), + ) + + trainer_stats = trainer.train() +``` + +## Patch the script for the Raspberry Pi dataset + +The script loads the Alpaca dataset from Hugging Face by default. You need to patch the dataset loading function to use the local Raspberry Pi JSONL file instead. 
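+
+Each line of `raspberry_pi_qa.jsonl` is a standalone JSON object (one object per line) with the same `instruction`, `input`, and `output` fields used by the Alpaca template. For example, one entry from the dataset looks like this:
+
+```json
+{"instruction": "What memory options are available for the Raspberry Pi Compute Module 4?", "input": "", "output": "The Raspberry Pi Compute Module 4 is available with 1 GB, 2 GB, 4 GB, or 8 GB of LPDDR4 RAM, and up to 32 GB of eMMC storage."}
+```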
+ +First, open a new terminal on the DGX Spark (not inside the container) and navigate to the directory where you launched the Docker container. This is the directory that gets mounted as `/workspace` inside the container. Download the dataset file: + +```bash +wget https://learn.arm.com/learning-paths/laptops-and-desktops/pytorch-finetuning-on-spark/raspberry_pi_qa.jsonl +``` + +Because this directory is mounted into the container with `-v ${PWD}:/workspace`, the file is immediately available inside the container at `/workspace/raspberry_pi_qa.jsonl`. + +Back inside the container, copy the dataset into the script's working directory: + +```bash +cp /workspace/raspberry_pi_qa.jsonl . +``` + +The following `sed` command replaces the `get_alpaca_dataset()` function to load from a local JSONL file instead of Hugging Face. The replacement function reads the Raspberry Pi Q&A pairs and formats them using the same Alpaca prompt template: + +```bash +sed -i '/^def get_alpaca_dataset/,/^ return dataset\.map/c\ +def get_alpaca_dataset(eos_token, dataset_size=500):\ + def preprocess(x):\ + texts = [\ + ALPACA_PROMPT_TEMPLATE.format(instruction, inp, output) + eos_token\ + for instruction, inp, output in zip(x["instruction"], x["input"], x["output"])\ + ]\ + return {"text": texts}\ + dataset = load_dataset("json", data_files="raspberry_pi_qa.jsonl", split="train")\ + if len(dataset) > dataset_size:\ + dataset = dataset.select(range(dataset_size))\ + return dataset.map(preprocess, remove_columns=dataset.column_names, batched=True)' Llama3_3B_full_finetuning.py +``` + +The key difference is `load_dataset("json", data_files="raspberry_pi_qa.jsonl", split="train")`, which reads the local file instead of downloading from Hugging Face. The function still applies the same Alpaca prompt template and EOS token. + +## Run the fine-tuning + +With the dataset patch applied, you're ready to run the fine-tuning. The command below trains the Llama 3.1 8B model using full fine-tuning on the Raspberry Pi dataset: + +```bash +python Llama3_3B_full_finetuning.py \ +--model_name "meta-llama/Llama-3.1-8B" \ +--dataset_size 300 \ +--output_dir "/workspace/models/Llama-3.1-8B-FineTuned" +``` + +The `--dataset_size 300` flag tells the script to use all entries in the Raspberry Pi dataset (the default is 500, but a smaller, focused dataset can be more effective than a larger generic one). The `--output_dir` flag saves the fine-tuned model and tokenizer to the specified directory. Because you mounted your current directory into the container with `-v ${PWD}:/workspace`, the saved model is also accessible from the host system. + +Training takes a few minutes on DGX Spark. When it completes, you'll see a summary with metrics like runtime, samples per second, and loss, followed by a confirmation that the model was saved. + +## What you've accomplished and what's next + +In this section you: + +- Reviewed the available fine-tuning scripts and their approaches +- Walked through each stage of the full fine-tuning script +- Patched the dataset loading function to use Raspberry Pi datasheet Q&A pairs +- Ran full fine-tuning and saved the resulting model with `--output_dir` + +In the next section, you'll serve both the original and fine-tuned models and compare their responses to Raspberry Pi hardware questions. 
\ No newline at end of file diff --git a/content/learning-paths/laptops-and-desktops/pytorch-finetuning-on-spark/4-testing.md b/content/learning-paths/laptops-and-desktops/pytorch-finetuning-on-spark/4-testing.md new file mode 100644 index 0000000000..8a5ba06665 --- /dev/null +++ b/content/learning-paths/laptops-and-desktops/pytorch-finetuning-on-spark/4-testing.md @@ -0,0 +1,200 @@ +--- +title: Test your fine-tuned model with vLLM +weight: 5 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +Now that you've fine-tuned your model on Raspberry Pi datasheet content, it's time to compare its behavior against the original. You'll serve both versions using vLLM, a high-performance inference server optimized for large language models, and observe how fine-tuning on domain-specific data changes the model's factual accuracy. + +## Download vLLM container + +NVIDIA provides a pre-built vLLM container that includes all the necessary components for efficient model serving. This container is optimized for NVIDIA GPUs and includes support for various quantization methods and multi-GPU inference. + +Pull the vLLM container from NVIDIA's registry: + +```bash +docker pull nvcr.io/nvidia/vllm:26.01-py3 +``` + +The January 2026 release includes the latest optimizations for model serving and inference acceleration. + +## Launch container instance + +Launch a vLLM container instance that allows you to serve models through an OpenAI-compatible API. This setup makes it easy to test your models using standard HTTP requests. + +Start the container with port forwarding enabled: + +```bash +docker run --gpus all -it --rm --ipc=host \ +-v $HOME/.cache/huggingface:/root/.cache/huggingface \ +-v ${PWD}:/workspace -w /workspace \ +-p 8000:8000 nvcr.io/nvidia/vllm:26.01-py3 +``` + +The flags are similar to the PyTorch container launch, with one addition: + +- `-p 8000:8000` maps port 8000 from the container to your host system, so you can send HTTP requests to the model server from outside the container + +## Test the original model + +Before testing your fine-tuned model, first observe how the original, unmodified Llama model responds to a Raspberry Pi hardware question. This establishes a baseline that reveals where the base model's knowledge falls short. + +### Launch vLLM + +Start the vLLM server with the original Llama 3.1 8B model: + +```bash +python3 -m vllm.entrypoints.openai.api_server \ +--model "meta-llama/Llama-3.1-8B" --trust-remote-code \ +--tensor-parallel-size 1 --quantization fp8 \ +--gpu-memory-utilization 0.80 +``` + +The server exposes an OpenAI-compatible API with the following configuration: + +- `--model` specifies the Hugging Face model to load, which will be pulled from the Hugging Face cache we made available to this container +- `--trust-remote-code` allows loading models with custom code (required for some architectures) +- `--tensor-parallel-size 1` runs inference on a single GPU (like on a DGX Spark) +- `--quantization fp8` uses 8-bit floating point quantization to reduce memory usage and improve throughput +- `--gpu-memory-utilization 0.80` limits memory usage to 80%, leaving room for the rest of the OS because the DGX Spark implements unified memory between CPU and GPU. + +Wait for the server to fully load the model and display the message indicating it's ready to accept requests (this typically takes 30-60 seconds). + +### Test prompt + + +From a new terminal window (outside the container), send a Raspberry Pi hardware question to the model using the Alpaca instruction format. 
For this example, use a question about the memory size: + +```bash +curl http://localhost:8000/v1/completions \ +-H "Content-Type: application/json" \ +-d '{ + "prompt": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nHow much memory does the RP2350 have?\n\n### Response:", + "max_tokens": 200 +}' +``` + +The `max_tokens` parameter limits the response length to prevent runaway generation. + +### Output + +The original model hallucinates an incorrect specification. The output is similar to: + +```json +{ + "id": "cmpl-91e070e2a34aaf01", + "object": "text_completion", + "created": 1770998840, + "model": "meta-llama/Llama-3.1-8B", + "choices": [ + { + "index": 0, + "text": " \nThe RP2350 has 256MB of memory.", + "finish_reason": "stop" + } + ], + "usage": { + "prompt_tokens": 35, + "total_tokens": 48, + "completion_tokens": 13 + } +} +``` + +The base model confidently reports the RP2350 has "256MB of memory," which is off by three orders of magnitude. The actual specification from the datasheet is 520 KB of SRAM. The model doesn't have Raspberry Pi datasheet content in its training data, so it fabricates a plausible-sounding but completely incorrect answer. + +## Test the fine-tuned model + +Now test your fine-tuned model to see how training on Raspberry Pi datasheet content improved its factual accuracy. Stop the current vLLM server (press Ctrl+C in the container terminal) before launching the fine-tuned model. + +{{% notice Dependency Conflict %}} +As of this writing, vLLM does not support version 5 of the `transformers` library that was used when fine-tuning the model, so you need to patch its `tokenizer_config.json`. Run the following command to update the `tokenizer_class` to `PreTrainedTokenizerFast`, which is compatible with the older `transformers` version bundled in the vLLM container: + +```bash +sed -i 's/"tokenizer_class": "TokenizersBackend"/"tokenizer_class": "PreTrainedTokenizerFast"/' /workspace/models/Llama-3.1-8B-FineTuned/tokenizer_config.json +``` +{{% /notice %}} + +### Launch vLLM + +Start the vLLM server with your fine-tuned model: + +```bash +python3 -m vllm.entrypoints.openai.api_server \ +--model "/workspace/models/Llama-3.1-8B-FineTuned" --trust-remote-code \ +--tensor-parallel-size 1 --quantization fp8 \ +--gpu-memory-utilization 0.80 +``` + +The only change from the previous command is the `--model` parameter, which now points to your fine-tuned model directory instead of the Hugging Face model ID. + +### Test prompt + +Send the same Raspberry Pi question to your fine-tuned model: + +```bash +curl http://localhost:8000/v1/completions \ +-H "Content-Type: application/json" \ +-d '{ + "prompt": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nHow much memory does the RP2350 have?\n\n### Response:", + "max_tokens": 200 +}' +``` + +### Output + +The fine-tuned model produces a correct, datasheet-accurate response. The output is similar to: + +```json +{ + "id": "cmpl-bad36ff5edddfb74", + "object": "text_completion", + "created": 1770999123, + "model": "/workspace/models/Llama-3.1-8B-FineTuned", + "choices": [ + { + "index": 0, + "text": " The RP2350 has 520 KB of on-chip SRAM.", + "finish_reason": "stop" + } + ], + "usage": { + "prompt_tokens": 35, + "total_tokens": 51, + "completion_tokens": 16 + } +} +``` + +The improvement is clear. 
Where the base model hallucinated "256MB," your fine-tuned model correctly answers "520 KB of on-chip SRAM," matching the official RP2350 datasheet specification. The model: + +- Provides a direct, concise answer without generating extra questions +- Stops naturally (`finish_reason: "stop"`) when the answer is complete +- Uses only a handful of tokens instead of rambling + +```bash +curl http://localhost:8000/v1/completions \ +-H "Content-Type: application/json" \ +-d '{ + "prompt": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nWhat memory options are available for the Raspberry Pi Compute Module 4?\n\n### Response:", + "max_tokens": 200 +}' +``` + +The fine-tuned model responds with accurate specifications from the CM4 datasheet, while the base model would likely guess or hallucinate memory options. + +Fine-tuning on the Raspberry Pi datasheet dataset transformed the base model from one that confidently fabricates hardware specifications into one that provides accurate, verified answers sourced from real datasheets. + +## What you've accomplished and what's next + +You've successfully fine-tuned a large language model on domain-specific data using PyTorch and Hugging Face libraries on an NVIDIA DGX Spark system. Throughout this Learning Path, you: + +- Set up a containerized environment with all necessary dependencies +- Learned how supervised fine-tuning teaches domain knowledge to a base model +- Patched a fine-tuning script to load a custom Raspberry Pi datasheet dataset +- Ran full fine-tuning to train the model on hardware specifications +- Compared base and fine-tuned model responses to verify factual accuracy improvements + +The approach you used can be applied to any domain where you need accurate, grounded responses. Consider experimenting with different datasets from your own technical documentation, product specifications, or internal knowledge bases. \ No newline at end of file diff --git a/content/learning-paths/laptops-and-desktops/pytorch-finetuning-on-spark/_index.md b/content/learning-paths/laptops-and-desktops/pytorch-finetuning-on-spark/_index.md new file mode 100644 index 0000000000..fe8a3f9311 --- /dev/null +++ b/content/learning-paths/laptops-and-desktops/pytorch-finetuning-on-spark/_index.md @@ -0,0 +1,64 @@ +--- +title: Fine-tune PyTorch models on DGX Spark + +draft: true +cascade: + draft: true + +minutes_to_complete: 60 + +who_is_this_for: This is an advanced topic for AI developers and ML engineers who want to fine-tune large language models using PyTorch and Hugging Face on the NVIDIA DGX Spark platform. 
+ +learning_objectives: + - Understand how fine-tuning teaches a model domain-specific knowledge + - Prepare a custom JSONL dataset for supervised fine-tuning + - Fine-tune Llama 3.1 8B on Raspberry Pi datasheet content using PyTorch and Hugging Face + - Compare base and fine-tuned model responses to verify factual accuracy improvements + +prerequisites: + - Hugging Face account and access token + - NVIDIA DGX Spark workstation + +author: Michael Hall + +### Tags +skilllevels: Advanced +subjects: ML +armips: + - Cortex-A + - Neoverse +tools_software_languages: + - Python + - PyTorch + - Docker +operatingsystems: + - Linux + +further_reading: + - resource: + title: NVIDIA PyTorch Fine-Tuning tutorial + link: https://build.nvidia.com/spark/pytorch-fine-tune/overview + type: documentation + - resource: + title: Hugging Face SFT Trainer + link: https://huggingface.co/docs/trl/en/sft_trainer + type: documentation + - resource: + title: Hugging Face Datasets + link: https://huggingface.co/datasets + type: website + - resource: + title: Hugging Face Fine-tuning Guide + link: https://huggingface.co/docs/transformers/training + type: documentation + - resource: + title: PyTorch Training Documentation + link: https://pytorch.org/tutorials/beginner/introyt/trainingyt.html + type: documentation + +### FIXED, DO NOT MODIFY +# ================================================================================ +weight: 1 # _index.md always has weight of 1 to order correctly +layout: "learningpathall" # All files under learning paths have this same wrapper +learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content. +--- diff --git a/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/_next-steps.md b/content/learning-paths/laptops-and-desktops/pytorch-finetuning-on-spark/_next-steps.md similarity index 100% rename from content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/_next-steps.md rename to content/learning-paths/laptops-and-desktops/pytorch-finetuning-on-spark/_next-steps.md diff --git a/content/learning-paths/laptops-and-desktops/pytorch-finetuning-on-spark/raspberry_pi_qa.jsonl b/content/learning-paths/laptops-and-desktops/pytorch-finetuning-on-spark/raspberry_pi_qa.jsonl new file mode 100644 index 0000000000..08809df3a8 --- /dev/null +++ b/content/learning-paths/laptops-and-desktops/pytorch-finetuning-on-spark/raspberry_pi_qa.jsonl @@ -0,0 +1,254 @@ +{"instruction": "Does the Raspberry Pi Compute Module 1 and 3 support I2C?", "input": "", "output": "The Raspberry Pi Compute Module 1 and Compute Module 3 both support I2C. 
Their I2C pins are available on the SODIMM edge connector."} +{"instruction": "What does the Raspberry Pi Compute Module 1 and 3 datasheet say about mechanical specification?", "input": "", "output": "The Compute Modules conform to JEDEC MO-224 mechanical specification for 200 pin DDR2 (1.8V)\nSODIMM modules (with the exception that the CM3, CM3L modules are 31mm in height rather than\n30mm of CM1) and therefore should work with the many DDR2 SODIMM sockets available on the."} +{"instruction": "Does the Raspberry Pi Compute Module 1 and 3 support DSI?", "input": "", "output": "Yes, the Raspberry Pi Compute Module 1 and 3 include DSI support."} +{"instruction": "Does the Raspberry Pi Compute Module 1 and 3 support GPIO?", "input": "", "output": "Yes, the Raspberry Pi Compute Module 1 and 3 include GPIO support."} +{"instruction": "What does the Raspberry Pi Compute Module 1 and 3 datasheet say about temperature range?", "input": "", "output": "The operating temperature range of the module is set by the lowest maximum and highest minimum of\nany of the components used."} +{"instruction": "What memory options are available for the Raspberry Pi Compute Module 1 and 3?", "input": "", "output": "The Raspberry Pi Compute Module 1 has 512 MB of RAM, the Compute Module 3 and CM3 Lite have 1 GB of RAM."} +{"instruction": "How many SPI controllers does the Raspberry Pi Compute Module 1 and 3 have?", "input": "", "output": "The Raspberry Pi Compute Module 1 and 3 has 3 SPI controllers (SPI0, SPI1, SPI2)."} +{"instruction": "How many GPIO pins does the Raspberry Pi Compute Module 1 and 3 provide?", "input": "", "output": "The Raspberry Pi Compute Module 1 and 3 provides 48 GPIO pins."} +{"instruction": "What happens if the supply voltage to the Raspberry Pi Compute Module 1 and 3 drops below the minimum?", "input": "", "output": "If the supply voltage drops below the minimum, the Raspberry Pi Compute Module 1 and 3 triggers a brown-out or power-on reset."} +{"instruction": "How do you supply power to the Raspberry Pi Compute Module 1 and 3?", "input": "", "output": "Power can be supplied to the Raspberry Pi Compute Module 1 and 3 via 1V8 (GPIO46 1V8 on CM1) and EMMC EN N 1V8 (GPIO47 1V8 on CM1) are now\ndriven from an IO expander rather than the processor. 
If a designer of a CM1 pro."} +{"instruction": "Does the Raspberry Pi Compute Module 3+ support I2C?", "input": "", "output": "Yes, the Raspberry Pi Compute Module 3+ includes I2C support."} +{"instruction": "What does the Raspberry Pi Compute Module 3+ datasheet say about secondary memory interface (smi)?", "input": "", "output": "The SMI peripheral is an asynchronous NAND type bus supporting Intel mode80 type transfers at 8 or\n16 bit widths and available in the ALT1 positions on GPIO banks 0 and 1 (see Table 9 and Table 10)."} +{"instruction": "Does the Raspberry Pi Compute Module 3+ support SDIO?", "input": "", "output": "Yes, the Raspberry Pi Compute Module 3+ includes SDIO support."} +{"instruction": "What USB interface does the Raspberry Pi Compute Module 3+ support?", "input": "", "output": "The Raspberry Pi Compute Module 3+ supports USB 2 HOST."} +{"instruction": "What does the Raspberry Pi Compute Module 3+ datasheet say about introduction?", "input": "", "output": "The Raspberry Pi Compute Module 3+ (CM3+) is a range of DDR2-SODIMM-mechanically-compatible\nSystem on Modules (SoMs) containing processor, memory, eMMC Flash (on non-Lite variants) and\nsupporting power circuitry."} +{"instruction": "What does the Raspberry Pi Compute Module 3+ datasheet say about availability?", "input": "", "output": "Raspberry Pi guarantee availability of CM3+ and CM3+ Lite until at least January 2026."} +{"instruction": "What does the Raspberry Pi Compute Module 3+ datasheet say about hardware?", "input": "", "output": "• Low cost\n• Low power\n• High availability\n• High reliability."} +{"instruction": "Does the Raspberry Pi Compute Module 3+ include an on-chip voltage regulator?", "input": "", "output": "Yes, the Raspberry Pi Compute Module 3+ includes an on-chip voltage regulator (ldo)."} +{"instruction": "What does the Raspberry Pi Compute Module 3+ datasheet say about sd/sdio interface?", "input": "", "output": "The BCM283x supports two SD card interfaces, SD0 and SD1."} +{"instruction": "How many SPI controllers does the Raspberry Pi Compute Module 4 have?", "input": "", "output": "The Raspberry Pi Compute Module 4 has 7 SPI controllers (SPI0 through SPI6)."} +{"instruction": "What does the Raspberry Pi Compute Module 4 datasheet say about gpio?", "input": "", "output": "There are 28 pins available for general purpose I/O (GPIO), which correspond to the GPIO pins on the Raspberry Pi 4\nModel B 40-pin header."} +{"instruction": "Does the Raspberry Pi Compute Module 4 support BLUETOOTH?", "input": "", "output": "Yes, the Raspberry Pi Compute Module 4 includes BLUETOOTH support."} +{"instruction": "What does the Raspberry Pi Compute Module 4 datasheet say about appendix a: troubleshooting?", "input": "", "output": "The CM4 has a number of stages of power-up before the CPU starts."} +{"instruction": "What does the Raspberry Pi Compute Module 4 datasheet say about run_pg?", "input": "", "output": "This pin when high signals that the CM4 has started."} +{"instruction": "What does the Raspberry Pi Compute Module 4 datasheet say about alternative function assignments?", "input": "", "output": "Up to six alternative functions are available."} +{"instruction": "Does the Raspberry Pi Compute Module 4 support GPIO?", "input": "", "output": "Yes, the Raspberry Pi Compute Module 4 includes GPIO support."} +{"instruction": "What does the Raspberry Pi Compute Module 4 datasheet say about mechanical?", "input": "", "output": "The CM4 is a compact 40mm × 55mm module."} +{"instruction": "Does the Raspberry 
Pi Compute Module 4 include an on-chip voltage regulator?", "input": "", "output": "Yes, the Raspberry Pi Compute Module 4 includes an on-chip voltage regulator (ldo)."} +{"instruction": "What does the Raspberry Pi Compute Module 4 datasheet say about introduction?", "input": "", "output": "The\nRaspberry Pi Compute\nModule 4 (CM4)."} +{"instruction": "What does the Raspberry Pi Compute Module 4 datasheet say about • mdi crossover, pair skew and pair polarity correction?", "input": "", "output": "A standard 1:1 RJ45 MagJack is all that is necessary to provide an Ethernet connection to the CM4."} +{"instruction": "Does the Raspberry Pi Compute Module 4 support PCIe?", "input": "", "output": "Yes, the Raspberry Pi Compute Module 4 supports PCIe."} +{"instruction": "What is the maximum clock frequency of the Raspberry Pi Compute Module 4?", "input": "", "output": "The Raspberry Pi Compute Module 4 uses the Broadcom BCM2711 SoC with a CPU clocked at up to 1.5 GHz. Its PCIe interface uses a 100 MHz reference clock (standard for PCI Express); on CM4 designs the PCIe clock is only driven when PCIe_CLK_nREQ is correctly connected/asserted."} +{"instruction": "Does the Raspberry Pi Compute Module 4 support a camera interface?", "input": "", "output": "Yes, the Raspberry Pi Compute Module 4 includes a CSI camera interface."} +{"instruction": "What does the Raspberry Pi Compute Module 4 datasheet say about kernel?", "input": "", "output": "The updated OS images use the new Raspberry Pi Compute Module 4 device tree file."} +{"instruction": "What does the Raspberry Pi Compute Module 4 datasheet say about regulator outputs?", "input": "", "output": "To make it easier to interface to the CM4 the on-board regulators (+3.3V and +1.8V) can each supply 600mA to devices\nconnected to the CM4."} +{"instruction": "What is the power output max specification for the Raspberry Pi Compute Module 4?", "input": "", "output": "The power output max for the Raspberry Pi Compute Module 4 is 300mA."} +{"instruction": "What does the Raspberry Pi Compute Module 4 datasheet say about eeprom write-protect?", "input": "", "output": "The on-board EEPROM can be write-protected by shorting EEPROM_nWP to ground."} +{"instruction": "How do you program the Raspberry Pi Compute Module 4?", "input": "", "output": "The Raspberry Pi Compute Module 4 can be programmed via usb."} +{"instruction": "What does the Raspberry Pi Compute Module 4 datasheet say about rpi-eeprom-update?", "input": "", "output": "CM4 will not run recovery.bin from from the EMMC (or SD Card on CM4Lite)."} +{"instruction": "What does the Raspberry Pi Compute Module 4 datasheet say about power-down sequencing?", "input": "", "output": "The operating system should be shut down before the power is removed, to ensure that the file system remains\nconsistent."} +{"instruction": "What does the Raspberry Pi Compute Module 4 datasheet say about firmware?", "input": "", "output": "A 5.4 or newer kernel and the latest firmware release is required."} +{"instruction": "What is the supply voltage range for the Raspberry Pi Compute Module 4?", "input": "", "output": "The Raspberry Pi Compute Module 4 requires a 5V supply. 
The absolute maximum rating for the supply pin is 6.0V."} +{"instruction": "Does the Raspberry Pi Compute Module 4 have Ethernet support?", "input": "", "output": "Yes, the Raspberry Pi Compute Module 4 supports Ethernet."} +{"instruction": "What memory options are available for the Raspberry Pi Compute Module 4?", "input": "", "output": "The Raspberry Pi Compute Module 4 is available with 1 GB, 2 GB, 4 GB, or 8 GB of LPDDR4 RAM, and up to 32 GB of eMMC storage."} +{"instruction": "What is the operating temperature range of the Raspberry Pi Compute Module 4?", "input": "", "output": "The Raspberry Pi Compute Module 4 operates over a temperature range of -20°C to 85°C."} +{"instruction": "What does the Raspberry Pi Compute Module 4 datasheet say about pcie (gen2 x1)?", "input": "", "output": "The CM4 has an internal PCIe 2.0 x1 host controller."} +{"instruction": "What does the Raspberry Pi Compute Module 4 datasheet say about i2c (id_sd id_sc)?", "input": "", "output": "This I2C bus is normally used for identifying HATs and controlling CSI0 and DSI0 devices."} +{"instruction": "What does the Raspberry Pi Compute Module 4 datasheet say about hardware checklist?", "input": "", "output": "Check this by pulling GLOBAL_EN low and apply an external 2A load to the +5V supply."} +{"instruction": "What does the Raspberry Pi Compute Module 4 datasheet say about global_en?", "input": "", "output": "Pulling this pin low puts the CM4 in the lowest possible power-down state."} +{"instruction": "What USB interface does the Raspberry Pi Compute Module 4 support?", "input": "", "output": "The Raspberry Pi Compute Module 4 supports USB 2.0."} +{"instruction": "What does the Raspberry Pi Compute Module 4 datasheet say about csi-2 (mipi serial camera)?", "input": "", "output": "The CM4 supports two camera ports: CAM0 (2 lanes) and CAM1 (4 lanes)."} +{"instruction": "What does the Raspberry Pi Compute Module 4 datasheet say about usb 2.0 (high speed)?", "input": "", "output": "The USB 2.0 interface supports up to 480Mbps signalling."} +{"instruction": "What does the Raspberry Pi Compute Module 4 datasheet say about power consumption?", "input": "", "output": "The exact power consumption of the CM4 will greatly depend on the tasks being run on the CM4."} +{"instruction": "Does the Raspberry Pi Compute Module 4 support HDMI?", "input": "", "output": "Yes, the Raspberry Pi Compute Module 4 includes HDMI support."} +{"instruction": "What does the Raspberry Pi Compute Module 4 datasheet say about • bluetooth 5.0, ble?", "input": "", "output": "These wireless interfaces can be individually enabled or disabled as required."} +{"instruction": "What does the Raspberry Pi Compute Module 4 datasheet say about 100ω differential pair signal lengths?", "input": "", "output": "On the CM4 all differential pairs are matched to better than 0.05mm (P/N signals)."} +{"instruction": "What does the Raspberry Pi Compute Module 4 datasheet say about bootloader?", "input": "", "output": "Connect a HDMI cable to see if the HDMI diagnostics screen appears."} +{"instruction": "How many I2C controllers does the Raspberry Pi CM4 IO Board have?", "input": "", "output": "The Raspberry Pi CM4 IO Board exposes I2C from the CM4, which has 6 I2C buses (I2C0 through I2C5)."} +{"instruction": "What does the Raspberry Pi CM4 IO Board datasheet say about micro usb connector?", "input": "", "output": "The micro USB connector is designed to enable CM4 to be updated via rpiboot."} +{"instruction": "Does the Raspberry Pi CM4 IO Board support a camera 
interface?", "input": "", "output": "Yes, the Raspberry Pi CM4 IO Board includes a CSI camera interface."} +{"instruction": "Does the Raspberry Pi CM4 IO Board support PCIE?", "input": "", "output": "Yes, the Raspberry Pi CM4 IO Board includes PCIE support."} +{"instruction": "What happens if the supply voltage to the Raspberry Pi CM4 IO Board drops below the minimum?", "input": "", "output": "If the supply voltage drops below the minimum, the Raspberry Pi CM4 IO Board triggers a brown-out or power-on reset."} +{"instruction": "What does the Raspberry Pi CM4 IO Board datasheet say about gigabit ethernet rj45?", "input": "", "output": "The Raspberry Pi Compute Module 4 IO Board uses a standard 1:1 Ethernet magjack, which supports PoE as well."} +{"instruction": "What does the Raspberry Pi CM4 IO Board datasheet say about usb 2.0 hub?", "input": "", "output": "The Raspberry Pi Compute Module 4 IO Board has an on-board USB 2.0 hub."} +{"instruction": "What USB interface does the Raspberry Pi CM4 IO Board support?", "input": "", "output": "The Raspberry Pi CM4 IO Board supports USB 2.0."} +{"instruction": "What does the Raspberry Pi CM4 IO Board datasheet say about raspberry pi hat connector?", "input": "", "output": "The Raspberry Pi Compute Module 4 IO Board has a standard Raspberry Pi 40-way HAT connector."} +{"instruction": "What is the maximum clock frequency of the Raspberry Pi CM4 IO Board?", "input": "", "output": "The Compute Module 4 IO Board is a carrier/baseboard, so it doesn’t have a single meaningful “maximum clock frequency” specification. Clock limits depend on the specific interface in use (for example, PCIe uses a 100 MHz reference clock, and other buses like I²C/SPI/UART have their own speed/clock limits set by the controller and signal integrity)."} +{"instruction": "Does the Raspberry Pi CM4 IO Board support DSI?", "input": "", "output": "Yes, the Raspberry Pi CM4 IO Board includes DSI support."} +{"instruction": "How do you connect a display to the Raspberry Pi CM4 IO Board?", "input": "", "output": "A display can be connected to the Raspberry Pi CM4 IO Board using the HDMI interface."} +{"instruction": "Does the Raspberry Pi CM4 IO Board support ETHERNET?", "input": "", "output": "Yes, the Raspberry Pi CM4 IO Board includes ETHERNET support."} +{"instruction": "What does the Raspberry Pi CM4 IO Board datasheet say about microsd card socket?", "input": "", "output": "WARNING: For use only with CM4 Lite modules. The microSD card socket is a push‑push socket."} +{"instruction": "What does the Raspberry Pi CM4 IO Board datasheet say about pcie gen 2 x1 socket?", "input": "", "output": "The PCIe socket is designed to take standard PC PCIe cards."} +{"instruction": "What does the Raspberry Pi CM4 IO Board datasheet say about cable)?", "input": "", "output": "Both DSI interfaces (2-channel and 4-channel) are brought out to separate 22-way 0.5mm pitch connectors."} +{"instruction": "What does the Raspberry Pi CM4 IO Board datasheet say about cm4 module connectors?", "input": "", "output": "The two CM4 module connectors are positioned so the on-board wireless antenna is at the edge of the board for best\nwireless performance."} +{"instruction": "Does the Raspberry Pi CM4 IO Board have wireless connectivity?", "input": "", "output": "The CM4 IO Board itself does not have wireless hardware. 
It relies on the wireless connectivity provided by the CM4 module (Wi-Fi and Bluetooth 5.0 with BLE), when using a CM4 variant with wireless support."} +{"instruction": "Does the Raspberry Pi Compute Module 4S support GPIO?", "input": "", "output": "Yes, the Raspberry Pi Compute Module 4S includes GPIO support."} +{"instruction": "What does the Raspberry Pi Compute Module 4S datasheet say about power-up sequencing?", "input": "", "output": "The CM4S requires a VBAT (2.5V to 5V) and +3.3V supplies."} +{"instruction": "How do you reset the Raspberry Pi Compute Module 4S?", "input": "", "output": "To reset a CM4S, assert the module’s RUN (nRUN) reset pin low, then release it. This resets the BCM2711 SoC and reboots the module."} +{"instruction": "What memory options are available for the Raspberry Pi Compute Module 4S?", "input": "", "output": "The Raspberry Pi Compute Module 4S is available with 1 GB, 2 GB, 4 GB, or 8 GB of LPDDR4 RAM, and up to 32 GB of eMMC storage."} +{"instruction": "Does the Raspberry Pi Compute Module 4S support HDMI?", "input": "", "output": "Yes, the Raspberry Pi Compute Module 4S includes HDMI support."} +{"instruction": "How do you connect a display to the Raspberry Pi Compute Module 4S?", "input": "", "output": "A display can be connected to the Raspberry Pi Compute Module 4S using the HDMI interface."} +{"instruction": "How do you supply power to the Raspberry Pi Compute Module 4S?", "input": "", "output": "Power can be supplied to the Raspberry Pi Compute Module 4S via 1.8V or 3.3V signalling and peripheral options: Up to 6 × UART Up to 6 × I2C Up to 6 × SPI 1 × SDIO interface 1 × DPI (parallel RGB display) 1 × PCM U."} +{"instruction": "What does the Raspberry Pi Compute Module 4S datasheet say about introduction?", "input": "", "output": "The Raspberry Pi Compute Module 4 SODIMM (CM4S) is a System on Module (SoM) containing processor, memory,\neMMC Flash, and supporting power circuitry."} +{"instruction": "Does the Raspberry Pi Compute Module 4S support I2C?", "input": "", "output": "Yes, the Raspberry Pi Compute Module 4S includes I2C support."} +{"instruction": "What does the Raspberry Pi Compute Module 4S datasheet say about emmc_en_n_1v8?", "input": "", "output": "This pin is driven low to signal that the BCM2711 wishes to access the SD card or eMMC."} +{"instruction": "What does the Raspberry Pi Compute Module 4S datasheet say about features?", "input": "", "output": "Key features of the CM4S are as follows: Broadcom BCM2711, quad core Cortex-A72 (ARM v8) 64-bit SoC @ 1.5GHz DDR2-SODIMM-mechanically-compatible form factor H.265 (HEVC) (upto 4Kp60 decode), H.264 (upto 1080p60 decode, 1080p30 encode) OpenGL ES 3.0 graphics Options for 1GB, 2GB, 4GB, or 8GB LPDDR4-3."} +{"instruction": "Does the Raspberry Pi Compute Module 4S support a camera interface?", "input": "", "output": "Yes, the Raspberry Pi Compute Module 4S includes a CSI camera interface."} +{"instruction": "What does the Raspberry Pi Compute Module 4S datasheet say about eeprom_nwp?", "input": "", "output": "It is recommended that final products connect this pin to ground to prevent end users changing the contents of the on-\nboard EEPROM."} +{"instruction": "What is the operating temperature range of the Raspberry Pi Compute Module 4S?", "input": "", "output": "The Raspberry Pi Compute Module 4S operates over a temperature range of -20°C to 85°C."} +{"instruction": "What does the Raspberry Pi Compute Module 4S datasheet say about composite (tv out)?", "input": "", "output": "The TVDAC pin can 
be used to output composite video (PAL or NTSC)."} +{"instruction": "Does the Raspberry Pi Compute Module 4S support SDIO?", "input": "", "output": "Yes, the Raspberry Pi Compute Module 4S includes SDIO support."} +{"instruction": "Does the Raspberry Pi Compute Module 4S support DSI?", "input": "", "output": "Yes, the Raspberry Pi Compute Module 4S includes DSI support."} +{"instruction": "What does the Raspberry Pi Compute Module 4S datasheet say about emmc_disable_n?", "input": "", "output": "This pin when low signals to the BCM2711 that boot should not occur from either an SD card or eMMC, forcing USB boot."} +{"instruction": "What does the Raspberry Pi Compute Module 4S datasheet say about bootloader?", "input": "", "output": "Connect a HDMI cable to see if the HDMI diagnostics screen appears."} +{"instruction": "What does the Raspberry Pi Compute Module 4S datasheet say about differential pairs?", "input": "", "output": "It is recommended that P/N signals within a pair are matched to better than 0.15mm."} +{"instruction": "What is the smps specification for the Raspberry Pi Pico 2?", "input": "", "output": "The smps for the Raspberry Pi Pico 2 is 3.3V."} +{"instruction": "What does the Raspberry Pi Pico 2 datasheet say about debugging?", "input": "", "output": "Raspberry Pi Pico 2 brings the RP2350 Serial Wire Debug (SWD) interface to a 3 pin debug header on the lower edge of\nthe board."} +{"instruction": "How many GPIO pins does the Raspberry Pi Pico 2 provide?", "input": "", "output": "The Raspberry Pi Pico 2 provides 26 GPIO pins."} +{"instruction": "What does the Raspberry Pi Pico 2 datasheet say about surface-mount footprint?", "input": "", "output": "The following footprint (Figure 5) is recommended for systems which will be reflow-soldering Pico 2 units as modules."} +{"instruction": "What does the Raspberry Pi Pico 2 datasheet say about power consumption?", "input": "", "output": "The power consumption tables show the typical VBUS (5V) current consumption during RP2350’s low power states, and\nalso various software use cases."} +{"instruction": "What is the supply voltage range for the Raspberry Pi Pico 2?", "input": "", "output": "The Raspberry Pi Pico 2 operates at a supply voltage of 0 V to 5.5 V."} +{"instruction": "How much memory does the Raspberry Pi Pico 2 support?", "input": "", "output": "Raspberry Pi Pico 2 includes 4 MB of on‑board QSPI flash for program/data storage, and the RP2350 microcontroller provides 520 kB of on‑chip SRAM."} +{"instruction": "What does the Raspberry Pi Pico 2 datasheet say about low power states?", "input": "", "output": "The following table details the typical power consumption in low power states P1.0 to P1.7, and also with RUN held low."} +{"instruction": "What does the Raspberry Pi Pico 2 datasheet say about programming the flash?", "input": "", "output": "The on-board 4 MB QSPI flash can be (re)programmed either using the Serial Wire Debug port or by the special USB Mass Storage Device mode."} +{"instruction": "What does the Raspberry Pi Pico 2 datasheet say about using a battery charger?", "input": "", "output": "Pico 2 can also be used with a battery charger."} +{"instruction": "What does the Raspberry Pi Pico 2 datasheet say about • comprehensive sdk, software examples and documentation?", "input": "", "output": "For full details of the RP2350A microcontroller, see the RP2350 Datasheet."} +{"instruction": "Does the Raspberry Pi Pico 2 support GPIO?", "input": "", "output": "Yes, the Raspberry Pi Pico 2 includes GPIO support."} 
+{"instruction": "Does the Raspberry Pi Pico 2 support USB?", "input": "", "output": "Yes, the Raspberry Pi Pico 2 includes USB support."} +{"instruction": "Does the Raspberry Pi Pico 2 support SWD?", "input": "", "output": "Yes, the Raspberry Pi Pico 2 includes SWD support."} +{"instruction": "What is the vsys min specification for the Raspberry Pi Pico 2?", "input": "", "output": "The vsys min for the Raspberry Pi Pico 2 is 1.8V."} +{"instruction": "What is the smps to generate the specification for the Raspberry Pi Pico 2?", "input": "", "output": "The smps to generate the for the Raspberry Pi Pico 2 is 3.3V."} +{"instruction": "What is the operating temp max specification for the Raspberry Pi Pico 2?", "input": "", "output": "The operating temp max for the Raspberry Pi Pico 2 is 85°C."} +{"instruction": "What does the Raspberry Pi Pico 2 datasheet say about recommended operating conditions?", "input": "", "output": "Operating conditions for the Raspberry Pi Pico 2 are largely a function of the operating conditions specified by its\ncomponents."} +{"instruction": "What does the Raspberry Pi Pico 2 datasheet say about typical use cases?", "input": "", "output": "The following table shows the typical power consumption in various use cases."} +{"instruction": "Does the Raspberry Pi Pico 2 support PWM?", "input": "", "output": "Yes, the Raspberry Pi Pico 2 includes PWM support."} +{"instruction": "What does the Raspberry Pi Pico 2 W datasheet say about ◦support for bluetooth classic?", "input": "", "output": "Yes. The Pico 2 W datasheet lists “Support for Bluetooth Classic” as a key feature, alongside Bluetooth 5.2 and BLE Central/Peripheral support (the wireless interfaces are described as 802.11n + Bluetooth 5.2)."} +{"instruction": "What does the Raspberry Pi Pico 2 W datasheet say about support?", "input": "", "output": "For support see the Pico section of the Raspberry Pi website, and post questions on the Raspberry Pi forum."} +{"instruction": "Does the Raspberry Pi Pico 2 W include an on-chip voltage regulator?", "input": "", "output": "Yes, the Raspberry Pi Pico 2 W includes an on-chip voltage regulator (LDO)."} +{"instruction": "What USB interface does the Raspberry Pi Pico 2 W support?", "input": "", "output": "The Raspberry Pi Pico 2 W supports USB 1.1."} +{"instruction": "What does the Raspberry Pi Pico 2 W datasheet say about chapter 1. about pico 2 w?", "input": "", "output": "Raspberry Pi Pico 2 W is a microcontroller board based on the Raspberry Pi RP2350 microcontroller chip."} +{"instruction": "What does the Raspberry Pi Pico 2 W datasheet say about powerchain?", "input": "", "output": "Pico 2 W has been designed with a simple yet flexible power supply architecture and can easily be powered from other\nsources such as batteries or external supplies."} +{"instruction": "What is the operating temperature range of the Raspberry Pi Pico 2 W?", "input": "", "output": "The Raspberry Pi Pico 2 W operates over a temperature range of -20°C to 70°C."} +{"instruction": "What does the Raspberry Pi Pico 2 W datasheet say about general purpose i/o?", "input": "", "output": "The Pico 2 W’s GPIO is powered from the on-board 3.3V rail, and is fixed at 3.3V."} +{"instruction": "What does the Raspberry Pi Pico 2 W datasheet say about chapter 2. 
mechanical specification?", "input": "", "output": "The Pico 2 W is a single sided 51mm × 21mm × 1mm PCB with a micro USB port overhanging the top edge, and dual\ncastellated/through-hole pins around the two long edges."} +{"instruction": "What happens if the supply voltage to the Raspberry Pi Pico 2 W drops below the minimum?", "input": "", "output": "If the supply voltage drops below the minimum, the Raspberry Pi Pico 2 W triggers a brown-out or power-on reset."} +{"instruction": "What does the Raspberry Pi Pico 2 W datasheet say about raspberry pi pico 2 w design files?", "input": "", "output": "The source design files, including the schematic and PCB layout, are made available openly except for the antenna."} +{"instruction": "Does the Raspberry Pi Pico 2 W support ADC?", "input": "", "output": "Yes, the Raspberry Pi Pico 2 W includes ADC support."} +{"instruction": "How do you program the Raspberry Pi Pico 2 W?", "input": "", "output": "The Raspberry Pi Pico 2 W can be programmed via USB."} +{"instruction": "What does the Raspberry Pi Pico datasheet say about • high performance full-crossbar bus fabric?", "input": "", "output": "Raspberry Pi Pico Datasheet\nChapter 1."} +{"instruction": "What does the Raspberry Pi Pico datasheet say about appendix b: pico schematic?", "input": "", "output": "See Figure 19 on the following page."} +{"instruction": "How do you enter the bootloader on the Raspberry Pi Pico?", "input": "", "output": "To enter the bootloader on the Raspberry Pi Pico, hold the BOOTSEL button while connecting USB."} +{"instruction": "What does the Raspberry Pi Pico datasheet say about raspberry pi pico pinout?", "input": "", "output": "The Pico pinout has been designed to directly bring out as much of the RP2040 GPIO and internal circuitry function as\npossible, while also providing a suitable number of ground pins to reduce EMI (Electro Magnetic Interference) and\nsignal crosstalk."} +{"instruction": "What does the Raspberry Pi Pico datasheet say about raspberry pi pico design files?", "input": "", "output": "The source design files, including the schematic and PCB layout, are made available openly, with no limitations."} +{"instruction": "What USB interface does the Raspberry Pi Pico support?", "input": "", "output": "The Raspberry Pi Pico supports USB 1.1."} +{"instruction": "What does the Raspberry Pi Pico datasheet say about chapter 1. 
about raspberry pi pico?", "input": "", "output": "Raspberry Pi Pico is a microcontroller board based on the Raspberry Pi RP2040 microcontroller chip."} +{"instruction": "What voltage should be supplied to the VBUS pin on the Raspberry Pi Pico?", "input": "", "output": "The VBUS pin on the Raspberry Pi Pico should be supplied with 5 V."} +{"instruction": "What does the Raspberry Pi Pico datasheet say about using the adc?", "input": "", "output": "The RP2040 ADC does not have an on-board reference and therefore uses its own power supply as a reference."} +{"instruction": "What does the Raspberry Pi Pico datasheet say about appendix a: availability?", "input": "", "output": "Raspberry Pi guarantee availability of the Raspberry Pi Pico product until at least January 2028."} +{"instruction": "What does the Raspberry Pi Pico datasheet say about powering pico?", "input": "", "output": "The simplest way to power Pico is to plug in the micro-USB, which will power VSYS (and therefore the system) from the\n5V USB VBUS voltage, via D1 (so VSYS becomes VBUS minus the Schottky diode drop)."} +{"instruction": "How do you supply power to the Raspberry Pi Pico?", "input": "", "output": "Power can be supplied to the Raspberry Pi Pico via 3.3V (to power RP2040 and external\ncircuitry) from a wide range of input voltages (~1.8 to 5.5V). This allows significant flexibility in powering the ."} +{"instruction": "What is the vsys min specification for the Raspberry Pi Pico W?", "input": "", "output": "The vsys min for the Raspberry Pi Pico W is 1.8V."} +{"instruction": "What does the Raspberry Pi Pico W datasheet say about programming the flash?", "input": "", "output": "The on-board 2MB QSPI flash can be (re)programmed either using the serial wire debug port or by the special USB mass\nstorage device mode."} +{"instruction": "What does the Raspberry Pi Pico W datasheet say about ◦can emulate interfaces such as sd card and vga?", "input": "", "output": "NOTE: Raspberry Pi Pico W I/O voltage is fixed at 3.3 V. The RP2040’s PIO blocks are flexible enough to emulate digital interfaces such as an SD card interface or VGA video output, with suitable external circuitry."} +{"instruction": "What is the smps to generate the specification for the Raspberry Pi Pico W?", "input": "", "output": "The smps to generate the for the Raspberry Pi Pico W is 3.3V."} +{"instruction": "Does the Raspberry Pi Pico W support PWM?", "input": "", "output": "Yes, the Raspberry Pi Pico W includes PWM support."} +{"instruction": "How much memory does the Raspberry Pi Pico W support?", "input": "", "output": "Raspberry Pi Pico W includes 2 MB of on‑board QSPI flash for program/data storage, and the RP2040 microcontroller provides 264 kB of on‑chip SRAM."} +{"instruction": "Does the Raspberry Pi Pico W support USB ?", "input": "", "output": "Yes, the Raspberry Pi Pico W includes USB support."} +{"instruction": "Does the Raspberry Pi Pico W support SPI?", "input": "", "output": "Yes, the Raspberry Pi Pico W includes SPI support."} +{"instruction": "Does the Raspberry Pi Pico W support GPIO?", "input": "", "output": "Yes, the Raspberry Pi Pico W includes GPIO support."} +{"instruction": "What is the smps specification for the Raspberry Pi Pico W?", "input": "", "output": "The smps for the Raspberry Pi Pico W is 3.3V."} +{"instruction": "Does the Raspberry Pi Pico W have wireless connectivity?", "input": "", "output": "Yes. 
Raspberry Pi Pico W has an on‑board 2.4 GHz 802.11n wireless LAN interface, and supports Bluetooth 5.2 (Classic and BLE) via the CYW43439 wireless chip and the Raspberry Pi Pico SDK."} +{"instruction": "How many GPIO pins does the Raspberry Pi Pico W provide?", "input": "", "output": "The Raspberry Pi Pico W provides 26 GPIO pins."} +{"instruction": "What does the Raspberry Pi Pico W datasheet say about debugging?", "input": "", "output": "Pico W brings the RP2040 serial wire debug (SWD) interface to a three-pin debug header."} +{"instruction": "What does the Raspberry Pi 4 Model B datasheet say about support?", "input": "", "output": "For support resources, documentation, and compliance information, refer to the official Raspberry Pi product documentation and the Raspberry Pi 4 Model B datasheet."} +{"instruction": "What does the Raspberry Pi 4 Model B datasheet say about interfaces?", "input": "", "output": "Raspberry Pi 4 Model B integrates a 2.4/5 GHz 802.11ac wireless LAN + Bluetooth 5.0 (BLE), Gigabit Ethernet, a microSD card slot, two USB 3.0 ports and two USB 2.0 ports, dual micro‑HDMI ports (up to 4Kp60), MIPI CSI/DSI connectors, and a 40‑pin GPIO header."} +{"instruction": "How many GPIO pins does the Raspberry Pi 4 Model B provide?", "input": "", "output": "Raspberry Pi 4 Model B has a 40‑pin GPIO header; up to 28 of those pins can be used as general‑purpose I/O (GPIO), with the remainder providing power, ground, and dedicated interfaces."} +{"instruction": "What does the Raspberry Pi 4 Model B datasheet say about caution! stresses above those listed in table 2 may cause permanent damage to the device. this is?", "input": "", "output": "a stress rating only; functional operation of the device under these or any other conditions above those\nlisted in the operational sections of this specification is not implied."} +{"instruction": "Does the Raspberry Pi 4 Model B support GPIO?", "input": "", "output": "Yes, the Raspberry Pi 4 Model B includes GPIO support."} +{"instruction": "Does the Raspberry Pi 4 Model B have Ethernet support?", "input": "", "output": "Yes, the Raspberry Pi 4 Model B supports Gigabit Ethernet."} +{"instruction": "What does the Raspberry Pi 4 Model B datasheet say about hdmi?", "input": "", "output": "Raspberry Pi 4 Model B provides dual micro‑HDMI ports supporting up to 4Kp60 (single display) and dual‑display configurations depending on resolution/refresh."} +{"instruction": "Does the Raspberry Pi 4 Model B support HDMI?", "input": "", "output": "Yes. Raspberry Pi 4 Model B supports HDMI via two micro‑HDMI ports (HDMI 2.0)."} +{"instruction": "What does the Raspberry Pi 4 Model B datasheet say about display parallel interface (dpi)?", "input": "", "output": "The Raspberry Pi 4 Model B supports a Display Parallel Interface (DPI) that can be brought out on GPIO bank 0 using alternate functions, allowing attachment of parallel RGB displays via the 40‑pin header."} +{"instruction": "What does the RP2040 datasheet say about configuring the 1 second reference clock:?", "input": "", "output": "This is done outside the RTC registers (see Section 4.8.4)."} +{"instruction": "What does the RP2040 datasheet say about • c application binary interface compliant exception model. 
this is the armv6-m, c application binary interface (c-?", "input": "", "output": "ABI) compliant exception model that enables the use of pure C functions as interrupt handlers."} +{"instruction": "Does the RP2040 support SPI?", "input": "", "output": "Yes, the RP2040 includes SPI support."} +{"instruction": "What is the dvdd of specification for the RP2040?", "input": "", "output": "The dvdd of for the RP2040 is 1.1V."} +{"instruction": "What does the RP2040 datasheet say about watchdog: scratch0, scratch1, …, scratch6, scratch7 registers?", "input": "", "output": "Offsets: 0x0c, 0x10, …, 0x24, 0x28\nTable 549."} +{"instruction": "What does the RP2040 datasheet say about memory power down?", "input": "", "output": "The main system memories (SRAM0…5, mapped to bus addresses 0x20000000 to 0x20041fff), as well as the USB DPRAM,\ncan be powered down via the MEMPOWERDOWN register in the Syscfg registers (see Section 2.21)."} +{"instruction": "What does the RP2040 datasheet say about source of last reset?", "input": "", "output": "The source of the most recent chip-level reset can be determined by reading the state of the HAD_POR, HAD_RUN and\nHAD_PSM_RESTART fields in the CHIP_RESET register."} +{"instruction": "What does the RP2040 datasheet say about spi: sspperiphid0 register?", "input": "", "output": "Offset: 0xfe0\nRP2040 Datasheet\n4.4."} +{"instruction": "What does the RP2040 datasheet say about gpio functions?", "input": "", "output": "Each individual GPIO pin can be connected to an internal peripheral via the GPIO functions defined below."} +{"instruction": "Does the RP2040 support ADC?", "input": "", "output": "Yes, the RP2040 includes ADC support."} +{"instruction": "What does the RP2040 datasheet say about peripheral power consumption?", "input": "", "output": "Baseline readings are taken with only clock sources and essential peripherals (BUSCTRL, BUSFAB, VREG, Resets, ROM,\nSRAMs) active in the WAKE_EN0/WAKE_EN1 registers."} +{"instruction": "What does the RP2040 datasheet say about resets: reset register?", "input": "", "output": "Offset: 0x0\nDescription\nReset control."} +{"instruction": "What does the RP2040 datasheet say about spi: sspcpsr register?", "input": "", "output": "Offset: 0x010\nDescription\nClock prescale register, SSPCPSR on page 3-8\nTable 501."} +{"instruction": "What does the RP2040 datasheet say about watchdog: reason register?", "input": "", "output": "Offset: 0x08\nDescription\nLogs the reason for the last reset."} +{"instruction": "What does the RP2040 datasheet say about • processor complex - finally the processors can start running?", "input": "", "output": "The final thing to come out of reset is the processor complex."} +{"instruction": "What does the RP2040 datasheet say about • master – the component that initializes a transfer (start command), generates the clock scl signal and?", "input": "", "output": "terminates the transfer (STOP command)."} +{"instruction": "What does the RP2040 datasheet say about memory model?", "input": "", "output": "The processor contains a bus matrix that arbitrates the processor core and Debug Access Port (DAP) memory\naccesses to both the external memory system and to the internal NVIC and debug components."} +{"instruction": "What does the RP2040 datasheet say about uart: uartdmacr register?", "input": "", "output": "Offset: 0x048\nDescription\nDMA Control Register, UARTDMACR\nTable 439."} +{"instruction": "What does the RP2040 datasheet say about clocks: clk_gpout1_div register?", "input": "", "output": "Offset: 
0x10\nDescription\nClock divisor, can be changed on-the-fly\nTable 212."} +{"instruction": "What does the RP2040 datasheet say about motorola spi frame format?", "input": "", "output": "The Motorola SPI interface is a four-wire interface where the SSPFSSOUT signal behaves as a slave select."} +{"instruction": "What does the RP2040 datasheet say about clocks: clk_adc_selected register?", "input": "", "output": "Offset: 0x68\nDescription\nIndicates which SRC is currently selected by the glitchless mux (one-hot)."} +{"instruction": "What does the RP2040 datasheet say about gpio29_status registers?", "input": "", "output": "Offsets: 0x000, 0x008, …, 0x0e0, 0x0e8\nDescription\nGPIO status\nTable 284."} +{"instruction": "What does the RP2040 datasheet say about i2c: ic_intr_mask register?", "input": "", "output": "Offset: 0x30\nDescription\nI2C Interrupt Mask Register."} +{"instruction": "What does the RP2040 datasheet say about • an event is generated (hardware interrupt or send-event instruction from another processor) when the flag is set.?", "input": "", "output": "WFE wake up events\nThe following events are WFE wake up events:."} +{"instruction": "What does the RP2040 datasheet say about eeprom read?", "input": "", "output": "NOTE: This transfer mode is only valid for master configurations. In EEPROM read mode (TMOD = 11b), the SSI transmits the EEPROM opcode and address (typically 3 bytes) and only then captures incoming data into the receive FIFO; any extra frames in the TX FIFO beyond the control frames will cause read data to be lost. EEPROM read mode is not supported when the SSI is configured for SSP mode."} +{"instruction": "What does the RP2040 datasheet say about i2c: ic_ss_scl_lcnt register?", "input": "", "output": "Offset: 0x18\nDescription\nStandard Speed I2C Clock SCL Low Count Register\nTable 458."} +{"instruction": "What does the RP2040 datasheet say about features?", "input": "", "output": "The ARM Cortex-M0+ processor features and benefits are:."} +{"instruction": "What does the RP2040 datasheet say about usb: usbphy_direct_override register?", "input": "", "output": "Offset: 0x80\nDescription\nOverride enable for each control in usbphy_direct\nTable 416."} +{"instruction": "What does the RP2040 datasheet say about launching code on processor core 1?", "input": "", "output": "As described in the introduction to Section 2.8.1, after reset, processor core 1 \"sleeps (WFE with SCR.SLEEPDEEP\nenabled) and remains asleep until woken by user code, via the mailbox\"."} +{"instruction": "What does the RP2040 datasheet say about uart: uartpcellid2 register?", "input": "", "output": "Offset: 0xff8\nDescription\nUARTPCellID2 Register\nTable 446."} +{"instruction": "What does the RP2040 datasheet say about pad control - qspi bank?", "input": "", "output": "The QSPI Bank Pad Control registers start at a base address of 0x40020000 (defined as PADS_QSPI_BASE in SDK)."} +{"instruction": "What does the RP2040 datasheet say about • block: if 1, stall if tx fifo is empty. 
if 0, pulling from an empty fifo copies scratch x to osr.?", "input": "", "output": "Some peripherals (UART, SPI…) should halt when no data is available, and pick it up as it comes in; others (I2S) should\nclock continuously, and it is better to output placeholder or repeated data than to stop clocking."} +{"instruction": "What does the RP2040 datasheet say about uart: uartifls register?", "input": "", "output": "Offset: 0x034\nDescription\nInterrupt FIFO Level Select Register, UARTIFLS\nTable 434."} +{"instruction": "What does the RP2040 datasheet say about sio: gpio_oe_clr register?", "input": "", "output": "Offset: 0x028\nDescription\nGPIO output enable clear\nTable 26."} +{"instruction": "What does the RP2040 datasheet say about instruction set summary?", "input": "", "output": "The processor implements the ARMv6-M Thumb instruction set, including a number of 32-bit instructions that use\nThumb-2 technology."} +{"instruction": "What does the RP2040 datasheet say about dma: inte0 register?", "input": "", "output": "Offset: 0x404\nDescription\nInterrupt Enables for IRQ 0\nTable 138."} +{"instruction": "What does the RP2040 datasheet say about usb: usbphy_trim register?", "input": "", "output": "Offset: 0x84\nDescription\nUsed to adjust trim values of USB phy pull down resistors."} +{"instruction": "What does the RP2040 datasheet say about dma: intf0 register?", "input": "", "output": "Offset: 0x408\nDescription\nForce Interrupts\nTable 139."} +{"instruction": "What does the RP2350 datasheet say about • chip_reset.rescue_flag: this flag instructs the bootrom to halt the boot process. the bootrom clears the flag?", "input": "", "output": "You can use this to perform a full-system reset from almost any state (particularly ones where all\nsystem clocks are stopped), and catch the processors before they re-run the code that caused the bad state."} +{"instruction": "What does the RP2350 datasheet say about usb: nak_poll register?", "input": "", "output": "Offset: 0x06c\nDescription\nUsed by the host controller."} +{"instruction": "What does the RP2350 datasheet say about pio: ctrl register?", "input": "", "output": "Offset: 0x000\nDescription\nPIO control register\nTable 982."} +{"instruction": "What does the RP2350 datasheet say about • parallel-to-serial conversion on data transmitted to the peripheral device?", "input": "", "output": "The CPU reads and writes data and control/status information through the AMBA APB interface."} +{"instruction": "What does the RP2350 datasheet say about accessctrl: sram4 register?", "input": "", "output": "Offset: 0x2c\nDescription\nControl whether debugger, DMA, core 0 and core 1 can access SRAM4, and at what security/privilege levels they\ncan do so."} +{"instruction": "What does the RP2350 datasheet say about • 0x00020000 - is_ecc: if this bit is set, each value in the buffer is 2 bytes and ecc is used when read/writing from 24 bit?", "input": "", "output": "If this bit is not set, each value in the buffer is 4 bytes, the low 24-bits of which are written to or read\nfrom OTP."} +{"instruction": "What does the RP2350 datasheet say about otp boot oscillator?", "input": "", "output": "The OTP startup sequence (Section 13.3.4) runs from a local ring oscillator, dedicated to the OTP subsystem."} +{"instruction": "What does the RP2350 datasheet say about interrupt request (irq) and dma data request (dreq)?", "input": "", "output": "The PWM block has two IRQ outputs."} +{"instruction": "What does the RP2350 datasheet say about cross trigger interface unit 
(cti)?", "input": "", "output": "The CTI enables the debug logic, MTB, and ETM to interact with each other and with other CoreSight ™ components."} +{"instruction": "What does the RP2350 datasheet say about ◦even with the chip select asserted low, the internal flash die maintains a high-impedance state on its sd0?", "input": "", "output": "Even with the chip select asserted low, the internal flash die maintains a high‑impedance state on its SD0 through SD3 pins if there are no transitions on SCK, so you can keep CSn asserted throughout UART boot."} +{"instruction": "What does the RP2350 datasheet say about usb: buff_cpu_should_handle register?", "input": "", "output": "Offset: 0x05c\nDescription\nWhich of the double buffers should be handled."} +{"instruction": "What does the RP2350 datasheet say about uart: uartlcr_h register?", "input": "", "output": "Offset: 0x02c\nDescription\nLine Control Register, UARTLCR_H\nTable 1035."} +{"instruction": "What does the RP2350 datasheet say about security attribution and memory protection?", "input": "", "output": "The Cortex-M33 processor supports the Armv8-M Protected Memory System Architecture (PMSA) that provides\nprogrammable support for memory protection using a number of software controllable regions."} +{"instruction": "What does the RP2350 datasheet say about pads_bank0: gpio1 register?", "input": "", "output": "Offset: 0x08\nRP2350 Datasheet\n9.11."} +{"instruction": "What does the RP2350 datasheet say about overview (arm)?", "input": "", "output": "RP2350 provides hardware and bootrom security features for three purposes:\n1."} +{"instruction": "What does the RP2350 datasheet say about pin descriptions?", "input": "", "output": "The function\nof each pin is briefly\ndescribed here."} +{"instruction": "What does the RP2350 datasheet say about • the dw_apb_i2c adds one cycle to the programmed *_lcnt value in order to generate the low period of the scl clock;?", "input": "", "output": "this is due to the counting logic for SCL low counting to (*_LCNT + 1)."} +{"instruction": "What does the RP2350 datasheet say about accessctrl: pio0 register?", "input": "", "output": "Offset: 0x4c\nDescription\nControl whether debugger, DMA, core 0 and core 1 can access PIO0, and at what security/privilege levels they can\ndo so."} +{"instruction": "What does the RP2350 datasheet say about pio: dbg_padout register?", "input": "", "output": "Offset: 0x03c\nRP2350 Datasheet\n11.7."} +{"instruction": "What does the RP2350 datasheet say about sio: gpio_hi_out register?", "input": "", "output": "Offset: 0x014\nDescription\nOutput value for GPIO32…47, QSPI IOs and USB pins."} +{"instruction": "What does the RP2350 datasheet say about • edge low: the gpio has transitioned from a logical 1 to a logical 0?", "input": "", "output": "The level interrupts are not latched."} +{"instruction": "What does the RP2350 datasheet say about • an mrrc instruction (move two words from coprocessor to integer registers) takes one cycle, plus one cycle of?", "input": "", "output": "Instruction fetch\nEach Cortex-M33 core has separate instruction and data buses (\"Harvard architecture\")."} +{"instruction": "What does the RP2350 datasheet say about • boot3: entry point?", "input": "", "output": "Use this to vector into code preloaded in RAM which was retained during a low-power state."} +{"instruction": "What does the RP2350 datasheet say about using an external low-power clock?", "input": "", "output": "Instead of using the low-power RC oscillator, an external 32.768 kHz low power 
clock signal can be provided on one of\nGPIO 12, 14, 20, or 22."} +{"instruction": "How much memory does the RP2350 support?", "input": "", "output": "The RP2350 has 520 KB of on-chip SRAM."} +{"instruction": "What does the RP2350 datasheet say about powman: boot0, boot1, boot2, boot3 registers?", "input": "", "output": "Offsets: 0xd0, 0xd4, 0xd8, 0xdc\nTable 523."} +{"instruction": "What does the RP2350 datasheet say about spi: sspperiphid1 register?", "input": "", "output": "Offset: 0xfe4\nDescription\nPeripheral identification registers, SSPPeriphID0-3 on page 3-13\nTable 1111."} +{"instruction": "What does the RP2350 datasheet say about • configure the rosc sampling period in terms of system clock ticks, see sample_cnt1.?", "input": "", "output": "Because the system clock generally runs much faster than the ROSC, the sampling period is expected to be at least a\nfew tens of system clock ticks."} +{"instruction": "What does the RP2350 datasheet say about • refuse to boot, due to lack of suitable images and the uart and usb bootloaders being disabled via otp.?", "input": "", "output": "This section makes no distinction between the different types of flash boot (flash image boot, flash partition boot and\nflash partition-table-in-image boot)."} +{"instruction": "What does the RP2350 datasheet say about clocks: clk_sys_resus_ctrl register?", "input": "", "output": "Offset: 0x84\nRP2350 Datasheet\n8.1."} +{"instruction": "What does the RP2350 datasheet say about chip-level reset table?", "input": "", "output": "A dash\n(—) indicates no change caused by this source."} +{"instruction": "What does the RP2350 datasheet say about enabling the dma controller interface?", "input": "", "output": "To enable the DMA Controller interface on the DW_apb_i2c, you must write the DMA Control Register (IC_DMA_CR)."} +{"instruction": "What does the RP2350 datasheet say about io_qspi: proc0_inte register?", "input": "", "output": "Offset: 0x21c\nDescription\nInterrupt Enable for proc0\nTable 842."} +{"instruction": "What does the RP2350 datasheet say about usb: ep_rx_error register?", "input": "", "output": "Offset: 0x110\nDescription\nRX error count for each endpoint."} +{"instruction": "What does the RP2350 datasheet say about spi: sspicr register?", "input": "", "output": "Offset: 0x020\nDescription\nInterrupt clear register, SSPICR on page 3-11\nTable 1108."} +{"instruction": "What does the RP2350 datasheet say about pads_bank0: gpio15 register?", "input": "", "output": "Offset: 0x40\nRP2350 Datasheet\n9.11."} +{"instruction": "What does the RP2350 datasheet say about • secure_boot_enable (reset: 1 ): enable boot signature checking in bootrom, disable factory jtag, and force the?", "input": "", "output": "ARCHSEL register to Arm, at higher priority than the default boot architecture flag."} +{"instruction": "What does the RP2350 datasheet say about pwm: intr register?", "input": "", "output": "Offset: 0x0f4\nDescription\nRaw Interrupts\nTable 1138."} +{"instruction": "What does the RP2350 datasheet say about • ahb5 exclusive reads: arm ldrex* instructions, risc-v lr.w instructions, and the read phase of risc-v amos (the?", "input": "", "output": "Hazard3 cores on RP2350 implement AMOs as an exclusive read/write pair that retries until the write succeeds)."} +{"instruction": "What does the RP2350 datasheet say about boot outcomes?", "input": "", "output": "The bootrom decides the boot outcome based on the following system state:."} +{"instruction": "What does the RP2350 datasheet say about accessctrl: 
sram0 register?", "input": "", "output": "Offset: 0x1c\nDescription\nControl whether debugger, DMA, core 0 and core 1 can access SRAM0, and at what security/privilege levels they\ncan do so."} +{"instruction": "How does the memory of the Raspberry Pi Pico compare to the Raspberry Pi Pico W?", "input": "", "output": "The Raspberry Pi Pico has 2 MB while the Raspberry Pi Pico W has 2 MB."} +{"instruction": "How does the memory of the RP2040 compare to the Raspberry Pi Pico?", "input": "", "output": "The RP2040 has 264 KB of on-chip SRAM, while the Raspberry Pi Pico has 2 MB of flash storage."} +{"instruction": "Which has a higher clock speed, the RP2040 or the RP2350?", "input": "", "output": "It depends on what you consider “maximum.” RP2350 supports a flexible system clock up to 150 MHz. RP2040 is specified as 133 MHz in general, but the RP2040 datasheet also notes operation up to 200 MHz at 1.15 V (see the summary section). So at default voltage/speed grades RP2350 is higher (150 MHz vs 133 MHz), but RP2040 can be run faster in the 200 MHz mode if your design supports the higher core voltage."} +{"instruction": "What is the difference between the Raspberry Pi Compute Module 4 and the Raspberry Pi Compute Module 3+?", "input": "", "output": "Key differences (high level): the Compute Module 4 (CM4) is based on the BCM2711 (quad‑core Cortex‑A72) and is clocked up to 1.5 GHz, while the Compute Module 3+ (CM3+) is based on the BCM2837B0 (quad‑core Cortex‑A53) and is clocked at 1.2 GHz. CM4 also adds a PCIe Gen2 x1 lane, more RAM/eMMC configuration options, and is available in variants with or without on‑board Wi‑Fi/Bluetooth; CM3+ has no PCIe and is limited to 1 GB RAM, with optional eMMC depending on variant."} +{"instruction": "Does the Raspberry Pi Compute Module 4 or the Raspberry Pi Compute Module 4S have wireless connectivity?", "input": "", "output": "Compute Module 4 is available in variants with or without on‑board wireless (Wi‑Fi/Bluetooth). Compute Module 4S is specified without an on‑board wireless option, so external wireless would need to be provided if required."} +{"instruction": "How does the memory of the Raspberry Pi Compute Module 4 compare to the Raspberry Pi Compute Module 4S?", "input": "", "output": "Both the Raspberry Pi Compute Module 4 and Compute Module 4S are available with up to 8 GB of LPDDR4 RAM."} +{"instruction": "Which has a higher clock speed, the Raspberry Pi Compute Module 4 or the Raspberry Pi Compute Module 3+?", "input": "", "output": "The Raspberry Pi Compute Module 4 has a BCM2711 processor at up to 1.5 GHz while the Raspberry Pi Compute Module 3+ has a BCM2837B0 processor at up to 1.2 GHz, so the CM4 is faster."} +{"instruction": "How does the memory of the RP2040 compare to the RP2350?", "input": "", "output": "The RP2040 has 264 KB of on-chip SRAM while the RP2350 has 520 KB of on-chip SRAM."} +{"instruction": "How does the memory of the RP2350 compare to the Raspberry Pi Pico 2?", "input": "", "output": "The RP2350 has 520 KB of on-chip SRAM. 
The Raspberry Pi Pico 2 board has 4 MB of external QSPI flash for program storage."} +{"instruction": "What is the difference between the Raspberry Pi 4 Model B and the Raspberry Pi Compute Module 4?", "input": "", "output": "Key differences: the Raspberry Pi 4 Model B adds BLE, SPI0, SPI1; the Raspberry Pi Compute Module 4 adds JTAG, PCIE, PWM."} +{"instruction": "Does the Raspberry Pi Pico 2 or the Raspberry Pi Pico 2 W have wireless connectivity?", "input": "", "output": "The Raspberry Pi Pico 2 W includes wireless connectivity while the Raspberry Pi Pico 2 does not."} +{"instruction": "What is the difference between the RP2040 and the RP2350?", "input": "", "output": "Key differences: the RP2040 has 284 KB vs 520 KB for the RP2350."} +{"instruction": "What is the difference between the Raspberry Pi Compute Module 4 and the Raspberry Pi Compute Module 4S?", "input": "", "output": "Key differences: the Raspberry Pi Compute Module 4 adds Bluetooth, Ethernet, and PCIe. The Raspberry Pi Compute Module 4S adds USB."} diff --git a/content/learning-paths/mobile-graphics-and-gaming/model-training-gym/_index.md b/content/learning-paths/mobile-graphics-and-gaming/model-training-gym/_index.md index e106b163bc..bef6e52f58 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/model-training-gym/_index.md +++ b/content/learning-paths/mobile-graphics-and-gaming/model-training-gym/_index.md @@ -27,6 +27,7 @@ tools_software_languages: - PyTorch - Jupyter Notebook - Vulkan + - NX operatingsystems: - Linux further_reading: diff --git a/content/learning-paths/mobile-graphics-and-gaming/nss-unreal/3-create-example.md b/content/learning-paths/mobile-graphics-and-gaming/nss-unreal/3-create-example.md index d4089fdb75..c4e4a8388b 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/nss-unreal/3-create-example.md +++ b/content/learning-paths/mobile-graphics-and-gaming/nss-unreal/3-create-example.md @@ -14,7 +14,7 @@ This repository's release package contains everything you need to set up NSS for Download the latest release `.zip` and extract it on your Windows machine. {{% notice Unreal Engine 5.5 plugin %}} -For this UE version, the steps will be slightly different. Refer to the repository documentation for more information. +This version is deprecated. Refer to the repository documentation for more information. {{% /notice %}} @@ -48,7 +48,7 @@ Open your project directory in Windows explorer, and create a new folder called ## Enable the plugin -The plugin is included in the release package you downloaded in the previous section. The package contains a separate folder for each supported Unreal Engine version. Make sure you use the folder that matches your engine version (for example, UE5.5 for Unreal Engine 5.5). +The plugin is included in the release package you downloaded in the previous section. The package contains a separate folder for each supported Unreal Engine version. Make sure you use the folder that matches your engine version (for example, UE5.4 for Unreal Engine 5.4). 1. Copy the appropriate engine-version folder from the extracted .zip archive into your project's Plugins directory. 2. Reopen Unreal Engine. When prompted, confirm that you want to enable the plugin. 
diff --git a/content/learning-paths/mobile-graphics-and-gaming/nss-unreal/_index.md b/content/learning-paths/mobile-graphics-and-gaming/nss-unreal/_index.md
index 9ded8ab2a2..8d4279ed06 100644
--- a/content/learning-paths/mobile-graphics-and-gaming/nss-unreal/_index.md
+++ b/content/learning-paths/mobile-graphics-and-gaming/nss-unreal/_index.md
@@ -15,7 +15,7 @@ learning_objectives:
 prerequisites:
     - Windows 11
-    - Unreal Engine 4.27 or 5.4-5.6 (with the Templates and Feature Pack enabled)
+    - Unreal Engine 4.27, 5.4, or 5.6 (with the Templates and Feature Pack enabled)
     - Visual Studio (with Desktop Development with C++ and .NET desktop build tools)
@@ -31,6 +31,7 @@ tools_software_languages:
     - Unreal Engine
     - Vulkan SDK
     - Visual Studio
+    - NX
 operatingsystems:
     - Windows
diff --git a/content/learning-paths/mobile-graphics-and-gaming/performance_llama_cpp_sme2/_index.md b/content/learning-paths/mobile-graphics-and-gaming/performance_llama_cpp_sme2/_index.md
new file mode 100644
index 0000000000..c07b067c4e
--- /dev/null
+++ b/content/learning-paths/mobile-graphics-and-gaming/performance_llama_cpp_sme2/_index.md
@@ -0,0 +1,63 @@
+---
+title: Unleash leading on-device AI performance with llama.cpp, SME2, and KleidiAI
+
+draft: true
+cascade:
+    draft: true
+
+minutes_to_complete: 40
+
+who_is_this_for: This is an advanced topic for software developers, performance engineers, and AI practitioners who want to accelerate LLM inference on Arm CPUs.
+
+learning_objectives:
+    - Build the llama.cpp library with KleidiAI and SME2 support
+    - Profile the performance of LLMs running on llama-cli
+    - Learn how KleidiAI and SME2 accelerate LLM operators
+
+prerequisites:
+    - Knowledge of KleidiAI and SME2
+    - A Linux or Android device with Arm SME2 support
+
+author: Zenon Zhilong Xiu
+
+### Tags
+skilllevels: Advanced
+subjects: ML
+armips:
+    - Arm C1 CPU
+    - Arm SME2 unit
+tools_software_languages:
+    - C++
+    - llama.cpp
+operatingsystems:
+    - Android
+    - Linux
+
+
+
+further_reading:
+    - resource:
+        title: "Part 1: Arm Scalable Matrix Extension Introduction"
+        link: https://developer.arm.com/community/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction
+        type: blog
+    - resource:
+        title: "Part 2: Arm Scalable Matrix Extension Instructions"
+        link: https://developer.arm.com/community/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction-p2
+        type: blog
+    - resource:
+        title: "Part 4: Arm SME2 Introduction"
+        link: https://developer.arm.com/community/arm-community-blogs/b/architectures-and-processors-blog/posts/part4-arm-sme2-introduction
+        type: blog
+    - resource:
+        title: Profile llama.cpp performance with Arm Streamline and KleidiAI LLM kernels
+        link: https://learn.arm.com/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/
+        type: blog
+
+
+
+### FIXED, DO NOT MODIFY
+# ================================================================================
+weight: 1                       # _index.md always has weight of 1 to order correctly
+layout: "learningpathall"       # All files under learning paths have this same wrapper
+learning_path_main_page: "yes"  # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
+---
diff --git a/content/learning-paths/mobile-graphics-and-gaming/performance_llama_cpp_sme2/_next-steps.md b/content/learning-paths/mobile-graphics-and-gaming/performance_llama_cpp_sme2/_next-steps.md
new file mode 100644
index 0000000000..727b395ddd
--- /dev/null
+++ b/content/learning-paths/mobile-graphics-and-gaming/performance_llama_cpp_sme2/_next-steps.md
@@ -0,0 +1,8 @@
+---
+# ================================================================================
+# FIXED, DO NOT MODIFY THIS FILE
+# ================================================================================
+weight: 21                  # The weight controls the order of the pages. _index.md always has weight 1.
+title: "Next Steps"         # Always the same, html page title.
+layout: "learningpathall"   # All files under learning paths have this same wrapper for Hugo processing.
+---
diff --git a/content/learning-paths/mobile-graphics-and-gaming/performance_llama_cpp_sme2/build_llama_cpp.md b/content/learning-paths/mobile-graphics-and-gaming/performance_llama_cpp_sme2/build_llama_cpp.md
new file mode 100644
index 0000000000..cfae409b93
--- /dev/null
+++ b/content/learning-paths/mobile-graphics-and-gaming/performance_llama_cpp_sme2/build_llama_cpp.md
@@ -0,0 +1,57 @@
+---
+title: Build llama.cpp with KleidiAI and SME2 enabled
+weight: 4
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## Build llama.cpp with KleidiAI and SME2 enabled
+For convenience, llama.cpp is statically linked. The project is built with the AArch64 GCC cross-compilation toolchain, *aarch64-none-linux-gnu-*. To support SME2, GCC version 14.2 or later is required. The toolchain can be downloaded from the [Arm GNU Toolchain downloads page](https://developer.arm.com/downloads/-/arm-gnu-toolchain-downloads).
+
+This tutorial uses llama.cpp at tag b7611; newer versions should also work, but they have not been tested.
+
+After downloading the llama.cpp source code [from GitHub](https://github.com/ggml-org/llama.cpp/releases/tag/b7610), create a new directory *build* under the llama.cpp root directory and change into it:
+
+```bash
+cd ~/llama.cpp
+mkdir build && cd build
+```
+Next, configure the project:
+
+```bash
+cmake .. \
+    -DCMAKE_SYSTEM_NAME=Linux \
+    -DCMAKE_SYSTEM_PROCESSOR=arm \
+    -DCMAKE_C_COMPILER=aarch64-linux-gnu-gcc \
+    -DCMAKE_CXX_COMPILER=aarch64-linux-gnu-g++ \
+    -DLLAMA_NATIVE=OFF \
+    -DLLAMA_F16C=OFF \
+    -DLLAMA_GEMM_ARM=ON \
+    -DBUILD_SHARED_LIBS=OFF \
+    -DCMAKE_EXE_LINKER_FLAGS="-static -g" \
+    -DGGML_OPENMP=OFF \
+    -DCMAKE_C_FLAGS=" -march=armv8.7-a+sve+i8mm+dotprod+sme2 -g" \
+    -DCMAKE_CXX_FLAGS=" -march=armv8.7-a+sve+i8mm+dotprod+sme2 -g" \
+    -DLLAMA_BUILD_TESTS=OFF \
+    -DLLAMA_BUILD_EXAMPLES=ON \
+    -DLLAMA_CURL=OFF \
+    -DGGML_CPU_KLEIDIAI=ON
+```
+Set CMAKE_C_COMPILER and CMAKE_CXX_COMPILER to the path of your cross-compiler. Make sure that *-march* in CMAKE_C_FLAGS and CMAKE_CXX_FLAGS includes "+sme2".
+
+The *-static* and *-g* options produce a statically linked executable, so that it runs on different Arm64 Linux and Android environments and includes debug information.
+
+Next, build the project:
+
+```bash
+cd ~/llama.cpp/build
+cmake --build ./ --config Release -j $(nproc)
+```
+After the build completes, you can find the application, *llama-cli*, in the ~/llama.cpp/build/bin/ directory.
+
+To enable the SME2 microkernels, set the following environment variable before running the application.
+ +```bash +GGML_KLEIDIAI_SME="1" +``` \ No newline at end of file diff --git a/content/learning-paths/mobile-graphics-and-gaming/performance_llama_cpp_sme2/images/kai_matmul_kernel.jpg b/content/learning-paths/mobile-graphics-and-gaming/performance_llama_cpp_sme2/images/kai_matmul_kernel.jpg new file mode 100644 index 0000000000..37800a20cc Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/performance_llama_cpp_sme2/images/kai_matmul_kernel.jpg differ diff --git a/content/learning-paths/mobile-graphics-and-gaming/performance_llama_cpp_sme2/images/llama-3.2-3b_architecture.jpg b/content/learning-paths/mobile-graphics-and-gaming/performance_llama_cpp_sme2/images/llama-3.2-3b_architecture.jpg new file mode 100755 index 0000000000..46eee2e9b4 Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/performance_llama_cpp_sme2/images/llama-3.2-3b_architecture.jpg differ diff --git a/content/learning-paths/mobile-graphics-and-gaming/performance_llama_cpp_sme2/images/llama_components.jpg b/content/learning-paths/mobile-graphics-and-gaming/performance_llama_cpp_sme2/images/llama_components.jpg new file mode 100644 index 0000000000..55f56c2883 Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/performance_llama_cpp_sme2/images/llama_components.jpg differ diff --git a/content/learning-paths/mobile-graphics-and-gaming/performance_llama_cpp_sme2/images/one_attention_block.jpg b/content/learning-paths/mobile-graphics-and-gaming/performance_llama_cpp_sme2/images/one_attention_block.jpg new file mode 100755 index 0000000000..b3c06e86f6 Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/performance_llama_cpp_sme2/images/one_attention_block.jpg differ diff --git a/content/learning-paths/mobile-graphics-and-gaming/performance_llama_cpp_sme2/images/streamline_call_paths_combined.jpg b/content/learning-paths/mobile-graphics-and-gaming/performance_llama_cpp_sme2/images/streamline_call_paths_combined.jpg new file mode 100644 index 0000000000..81de02614c Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/performance_llama_cpp_sme2/images/streamline_call_paths_combined.jpg differ diff --git a/content/learning-paths/mobile-graphics-and-gaming/performance_llama_cpp_sme2/images/streamline_timeline_combined.jpg b/content/learning-paths/mobile-graphics-and-gaming/performance_llama_cpp_sme2/images/streamline_timeline_combined.jpg new file mode 100644 index 0000000000..01f54dc616 Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/performance_llama_cpp_sme2/images/streamline_timeline_combined.jpg differ diff --git a/content/learning-paths/mobile-graphics-and-gaming/performance_llama_cpp_sme2/introduction.md b/content/learning-paths/mobile-graphics-and-gaming/performance_llama_cpp_sme2/introduction.md new file mode 100644 index 0000000000..d5d69b7ddf --- /dev/null +++ b/content/learning-paths/mobile-graphics-and-gaming/performance_llama_cpp_sme2/introduction.md @@ -0,0 +1,19 @@ +--- +title: Overview +weight: 2 + +### FIXED, DO NOT MODIFY +layout: Overview +--- + +## Introduction +Arm’s latest Client CPU processors such as Arm C1 include Scalable Matrix Extension 2 (SME2). SME2 accelerates the matrix-heavy AI operations behind large language models (LLMs), media processing, speech recognition, computer vision, real-time apps and multimodal apps. + +llama.cpp provides extensive support for many LLMs, including Phi, Llama, DeepSeek, Gemma and Qwen. 
It is designed for efficient CPU-based inference and enables on-device LLM execution, reducing latency and enhancing privacy.
+
+By default, llama.cpp integrates Arm KleidiAI, a suite of optimized microkernels for Arm CPUs. KleidiAI includes SME2-optimized microkernels that deliver additional performance.
+
+This Learning Path uses llama.cpp with the Llama-3.2-3B-Instruct-Q4_0.gguf model, which has 3 billion parameters.
+
+
+
diff --git a/content/learning-paths/mobile-graphics-and-gaming/performance_llama_cpp_sme2/kleidiai_integration.md b/content/learning-paths/mobile-graphics-and-gaming/performance_llama_cpp_sme2/kleidiai_integration.md
new file mode 100644
index 0000000000..e02d307be0
--- /dev/null
+++ b/content/learning-paths/mobile-graphics-and-gaming/performance_llama_cpp_sme2/kleidiai_integration.md
@@ -0,0 +1,106 @@
+---
+title: Integration of SME2 optimized KleidiAI microkernels in llama.cpp
+weight: 3
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## Integration of SME2 optimized KleidiAI microkernels in llama.cpp
+The KleidiAI library provides optimized matrix multiplication (matmul) kernels tailored for hardware features such as SME, I8MM, and Dot Product (DotProd) acceleration. This integration is enabled with the GGML_CPU_KLEIDIAI build option.
+
+![Figure showing components of llama.cpp alt-text#center](images/llama_components.jpg "Components of llama.cpp")
+
+KleidiAI is integrated as a trait of ggml-cpu in the llama.cpp CPU backend.
+The integration source code is located in the following directory of llama.cpp:
+```text
+./ggml/src/ggml-cpu/kleidiai
+```
+KleidiAI matmul microkernels can be used for some types of GGML_OP_MUL_MAT operators. The table below lists some matmul operators, with specific input and output data types, that can be accelerated by KleidiAI microkernels.
+
+| LHS data type | RHS data type | Output data type |
+|---------|----------------|----------------|
+| GGML_TYPE_F32 | GGML_TYPE_Q4_0 | GGML_TYPE_F32 |
+| GGML_TYPE_F32 | GGML_TYPE_Q8_0 | GGML_TYPE_F32 |
+| GGML_TYPE_F32 | GGML_TYPE_F16 | GGML_TYPE_F32 |
+
+Note:
+LHS is short for Left-Hand Source (the left-hand input matrix).
+RHS is short for Right-Hand Source (the right-hand input matrix).
+
+Support for more operators and data types is being added to the KleidiAI microkernels.
+
+The figure below shows how KleidiAI microkernels are used for matmul with a GGML_TYPE_Q4_0 or GGML_TYPE_Q8_0 RHS (weight).
+
+![Figure showing how kleidiai microkernel is used for quantization, packing and matrix multiply llama.cpp alt-text#center](images/kai_matmul_kernel.jpg "Quantization, packing and matrix multiply microkernels")
+
+The packing of GGML_TYPE_Q4_0 or GGML_TYPE_Q8_0 weights (RHS) only needs to be performed once, when llama.cpp loads the model and weight tensor data, because the weights never change during inference. For performance, the original GGUF weights are repacked into a layout optimized for cache-friendly access and for the DotProd, I8MM, and SME2 operations used by the KleidiAI microkernels.
+Generally, if multiple KleidiAI matmul microkernel implementations (DotProd, I8MM, or SME2) could be used for acceleration, the KleidiAI trait selects one in the following order of preference:
+
+```text
+SME2, I8MM, DotProd
+```
+Once the matmul microkernel is selected, its corresponding RHS packing and LHS quantizing and packing microkernels are used.
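+
+Before relying on a particular microkernel family, it can help to confirm which of these features your target CPU actually reports. A minimal sketch, assuming a Linux target and a kernel recent enough to expose the SME2 hwcap (flag names can vary between kernel versions), is to check /proc/cpuinfo:
+
+```bash
+# Illustrative check only: list the feature flags relevant to microkernel selection.
+# "asimddp" corresponds to DotProd, "i8mm" to I8MM, and "sme"/"sme2" to SME/SME2.
+grep -o -w -E 'asimddp|i8mm|sme|sme2' /proc/cpuinfo | sort -u
+```
+
+If *sme2* is not reported, the KleidiAI trait falls back to the I8MM or DotProd implementations, following the order above.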
+
+When the Llama-3.2-3B-Instruct-Q4_0.gguf model is used with the SME2 microkernels, RHS packing is done by the *kai_run_rhs_pack_nxk_qsi4c32ps1s0scalef16_qsu4c32s16s0_neon* microkernel when the model is loaded, as shown in the following function call stack:
+
+```text
+llama_model_load
+  llama_model::load_tensors
+    llama_model_loader::load_all_data
+      ggml_backend_tensor_set
+        ggml_backend_cpu_kleidiai_buffer_set_tensor
+          ggml::cpu::kleidiai::tensor_traits::repack
+            kai_run_rhs_pack_nxk_qsi4c32ps1s0scalef16_qsu4c32s16s0_neon
+```
+The F32 activation input matrix (LHS) is dynamically quantized and packed by the *kai_run_lhs_quant_pack_qsi8d32p_f32_neon* microkernel on every run, because the activation input keeps changing while the model runs. This happens in the following function call stack:
+
+```text
+llama_context::decode
+  llama_context::process_ubatch
+    llama_context::graph_compute
+      ggml_backend_sched_compute_splits
+        ggml_backend_cpu_graph_compute
+          ggml_graph_compute          // kicks off the compute threads
+            ggml_graph_compute_thread // the compute thread
+              ggml_compute_forward
+                ggml_cpu_extra_compute_forward
+                  ggml::cpu::kleidiai::tensor_traits::compute_forward
+                    ggml::cpu::kleidiai::tensor_traits::compute_forward_q4_0
+                      kai_run_lhs_quant_pack_qsi8d32p_f32_neon
+```
+Once the LHS and RHS are ready, the KleidiAI matmul microkernel can be executed.
+
+In this example, we use the Llama-3.2-3B-Instruct-Q4_0.gguf model and a 512-bit SME2 streaming vector length. At the Prefill stage, the KleidiAI GEMM microkernel optimized with SME2, *kai_matmul_clamp_f32_qsi8d32p1vlx4_qsi4c32p4vlx4_1vlx4vl_sme2_mopa*, is selected by the KleidiAI trait, and it produces a dequantized F32 output matrix. It runs right after LHS quantizing and packing, as shown in the function call stack below:
+```text
+llama_context::decode
+  llama_context::process_ubatch
+    llama_context::graph_compute
+      ggml_backend_sched_compute_splits
+        ggml_backend_cpu_graph_compute
+          ggml_graph_compute          // kicks off the compute threads
+            ggml_graph_compute_thread // the compute thread
+              ggml_compute_forward
+                ggml_cpu_extra_compute_forward
+                  ggml::cpu::kleidiai::tensor_traits::compute_forward
+                    ggml::cpu::kleidiai::tensor_traits::compute_forward_q4_0
+                      kai_run_lhs_quant_pack_qsi8d32p_f32_neon
+                      kai_matmul_clamp_f32_qsi8d32p1vlx4_qsi4c32p4vlx4_1vlx4vl_sme2_mopa
+```
+At the LLM decode stage, the KleidiAI GEMV microkernel optimized with SME2, *kai_run_matmul_clamp_f32_qsi8d32p1x4_qsi4c32p4vlx4_1x4vl_sme2_sdot*, is selected by the KleidiAI trait in llama.cpp, and it produces a dequantized F32 output vector.
It runs right after LHS quantizing and packing, as shown in the function call stack below:
+
+```text
+llama_context::decode
+  llama_context::process_ubatch
+    llama_context::graph_compute
+      ggml_backend_sched_compute_splits
+        ggml_backend_cpu_graph_compute
+          ggml_graph_compute          // kicks off the compute threads
+            ggml_graph_compute_thread // the compute thread
+              ggml_compute_forward
+                ggml_cpu_extra_compute_forward
+                  ggml::cpu::kleidiai::tensor_traits::compute_forward
+                    ggml::cpu::kleidiai::tensor_traits::compute_forward_q4_0
+                      kai_run_lhs_quant_pack_qsi8d32p_f32_neon
+                      kai_run_matmul_clamp_f32_qsi8d32p1x4_qsi4c32p4vlx4_1x4vl_sme2_sdot
+```
diff --git a/content/learning-paths/mobile-graphics-and-gaming/performance_llama_cpp_sme2/run_llm.md b/content/learning-paths/mobile-graphics-and-gaming/performance_llama_cpp_sme2/run_llm.md
new file mode 100755
index 0000000000..dd494f2e25
--- /dev/null
+++ b/content/learning-paths/mobile-graphics-and-gaming/performance_llama_cpp_sme2/run_llm.md
@@ -0,0 +1,86 @@
+---
+title: Run the Llama-3.2-3B-Instruct-Q4_0.gguf model with llama-cli
+weight: 5
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## Run the Llama-3.2-3B-Instruct-Q4_0.gguf model with llama-cli
+Copy the built llama-cli executable and the Llama-3.2-3B-Instruct-Q4_0.gguf model file to an aarch64 Linux or Android target that supports SME2.
+You can download the model from [Hugging Face](https://huggingface.co/unsloth/Llama-3.2-3B-Instruct-GGUF).
+
+The figure below shows the architecture of the Llama-3.2-3B model:
+![Figure showing Llama-3.2-3B architecture alt-text#center](images/llama-3.2-3b_architecture.jpg "Architecture of Llama-3.2-3B")
+
+For performance evaluation, we run the model bound to a single Arm C1-Pro core with CPU affinity.
+To run the model with the SME2 microkernels enabled, set the environment variable first:
+
+```bash
+env GGML_KLEIDIAI_SME="1" taskset 2 ./llama-cli -m ./Llama-3.2-3B-Instruct-Q4_0.gguf -st -C 0x2 -Cb 0x2 -t 1 -p "input your prompt"
+```
+Where:
+- *env GGML_KLEIDIAI_SME="1"* sets the environment variable
+- *taskset 2* sets the task affinity and binds the execution of llama-cli to the core selected by CPU mask 0x2 (the Arm C1-Pro core in our case)
+- *-C 0x2 -Cb 0x2* sets the CPU affinity of the execution of operators
+- *-t 1* sets the number of threads to 1
+
+For performance comparison, we also run the model with the SME2 microkernels disabled by setting the environment variable:
+
+```bash
+GGML_KLEIDIAI_SME="0"
+```
+so that the I8MM and DotProd microkernels are used instead:
+
+```bash
+env GGML_KLEIDIAI_SME="0" taskset 2 ./llama-cli -m ./Llama-3.2-3B-Instruct-Q4_0.gguf -st -C 0x2 -Cb 0x2 -t 1 -p "input your prompt"
+```
+We can profile the model execution with the approach introduced in [Profile llama.cpp performance with Arm Streamline and KleidiAI LLM kernels](https://learn.arm.com/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/).
+
+
+The Streamline Timeline view and Annotate Markers in the figure below show that token generation speeds up significantly at both the Prefill and Decode stages. The PMU event counters show that many SME2 instructions, especially SME2 Integer Outer Product Accumulate instructions at the Prefill stage and SME2 Integer Outer Product instructions at the Decode stage, are used for acceleration.
+
+![Figure showing Streamline Timeline view alt-text#center](images/streamline_timeline_combined.jpg "Combined Streamline Timeline view with and without SME2")
+
+The Streamline Call Paths view below indicates a similar speedup. It also shows that the DotProd and I8MM KleidiAI microkernels are used instead when SME2 is not enabled.
+
+![Figure showing Streamline Call Paths view alt-text#center](images/streamline_call_paths_combined.jpg "Combined Streamline Call Paths view with and without SME2")
+
+To investigate which operators in the model graph are delegated to KleidiAI microkernels, we can add some code, as shown below, to *./ggml/src/ggml-cpu/kleidiai/kleidiai.cpp* to print the names of the operators that use the KleidiAI microkernels. This is only for debugging purposes.
+
+```c++
+    bool compute_forward(struct ggml_compute_params * params, struct ggml_tensor * dst) override {
+        if (dst->op == GGML_OP_MUL_MAT) {
+            if (dst->src[0]->type == GGML_TYPE_Q4_0) {
+                // add log for kai microkernel
+                std::cout << "kai matmul Q4_0 " << dst->name << std::endl;
+                return compute_forward_q4_0(params, dst);
+            } else if (dst->src[0]->type == GGML_TYPE_Q8_0) {
+                // add log for kai microkernel
+                std::cout << "kai matmul Q8_0 " << dst->name << std::endl;
+                return compute_forward_q8_0(params, dst);
+            } else if (dst->src[0]->type == GGML_TYPE_F16) {
+                // add log for kai microkernel
+                std::cout << "kai matmul fp16 " << dst->name << std::endl;
+                return compute_forward_fp16(params, dst);
+            }
+```
+When you run the model, log lines like the following are printed:
+```text
+kai matmul Q4_0 Qcur-27
+kai matmul Q4_0 Vcur-27
+kai matmul Q4_0 Kcur-27
+kai matmul Q4_0 attn_out-27
+kai matmul Q4_0 ffn_gate-27
+kai matmul Q4_0 ffn_up-27
+kai matmul Q4_0 ffn_out-27
+```
+Taking one attention block of the Llama-3.2-3B-Instruct-Q4_0 model as an example, the operators that are accelerated by the KleidiAI SME2-optimized microkernels are highlighted manually with blue boxes in the graph of the attention block below. How to obtain the graph is beyond the scope of this Learning Path; refer to external resources.
+
+![Figure highlighting operators accelerated by KleidiAI SME2-optimized microkernels alt-text#center](images/one_attention_block.jpg "Operators accelerated by KleidiAI SME2-optimized microkernels in one attention block")
+
+KleidiAI support in llama.cpp is still evolving; more operators will be accelerated by KleidiAI microkernels over time, unleashing greater potential of SME2.
+
+## Summary
+With the out-of-the-box KleidiAI and SME2 support in llama.cpp, you can get a significant performance uplift at both the Prefill and Decode stages, which enhances the experience of running LLMs locally on device.
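+
+To see how much of this uplift you get on your own device, one option is a quick A/B run with *llama-bench*, which is built alongside llama-cli in most llama.cpp configurations. The sketch below reuses the affinity settings from the previous section; the tool name and flags are assumptions to verify against your build (check *./llama-bench --help*):
+
+```bash
+# Illustrative A/B comparison: run the same benchmark with the SME2
+# microkernels disabled (0) and enabled (1), pinned to a single core.
+for sme in 0 1; do
+  echo "=== GGML_KLEIDIAI_SME=$sme ==="
+  env GGML_KLEIDIAI_SME="$sme" taskset 2 ./llama-bench \
+    -m ./Llama-3.2-3B-Instruct-Q4_0.gguf -t 1
+done
+```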
diff --git a/content/learning-paths/mobile-graphics-and-gaming/vulkan-ml-sample/_index.md b/content/learning-paths/mobile-graphics-and-gaming/vulkan-ml-sample/_index.md index 6cc575f0bf..c4780664c3 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/vulkan-ml-sample/_index.md +++ b/content/learning-paths/mobile-graphics-and-gaming/vulkan-ml-sample/_index.md @@ -29,6 +29,7 @@ armips: tools_software_languages: - Vulkan - RenderDoc + - NX operatingsystems: - Windows diff --git a/content/learning-paths/servers-and-cloud-computing/trivy-on-gcpp/_index.md b/content/learning-paths/servers-and-cloud-computing/trivy-on-gcpp/_index.md new file mode 100644 index 0000000000..b7c338d600 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/trivy-on-gcpp/_index.md @@ -0,0 +1,73 @@ +--- +title: Secure Multi-Architecture Containers with Trivy on Azure Cobalt 100 (Arm64) + +draft: true +cascade: + draft: true + +minutes_to_complete: 45 + +who_is_this_for: This learning path is designed for developers and DevOps engineers who want to integrate security scanning into CI/CD pipelines for multi-architecture container images. + +learning_objectives: + - Provision an Azure Arm64 virtual machine using Azure console, with Ubuntu Pro 24.04 LTS as the base image + - Build multi-architecture (amd64/arm64) container images for Azure Cobalt 100 + - Install and configure Trivy on Arm64 Ubuntu systems + - Scan container images for vulnerabilities locally and in CI + - Configure self-hosted GitHub Actions Arm runners + - Enforce security gates in CI/CD pipelines based on vulnerability severity + - Generate and analyze JSON reports for compliance and audit purposes + +prerequisites: + - A [Microsoft Azure](https://azure.microsoft.com/) account with access to Cobalt 100 based instances (Dpsv6) + - Docker installed and basic knowledge of containerization + - Familiarity with CI/CD concepts + - Basic knowledge of Linux command-line operations + +author: Pareena Verma + +### Tags +skilllevels: Introductory +subjects: Containers and Virtualization +cloud_service_providers: Microsoft Azure + +armips: + - Neoverse + +tools_software_languages: + - Trivy + - Docker + - GitHub Actions + - YAML + +operatingsystems: + - Linux + +further_reading: + - resource: + title: Trivy Official Website + link: https://trivy.dev + type: website + - resource: + title: Trivy GitHub Repository + link: https://github.com/aquasecurity/trivy + type: website + - resource: + title: Docker Official Documentation + link: https://docs.docker.com/ + type: documentation + - resource: + title: GitHub Actions Documentation + link: https://docs.github.com/en/actions + type: documentation + - resource: + title: Microsoft Azure Cobalt 100 Overview + link: https://techcommunity.microsoft.com/blog/azurecompute/announcing-the-preview-of-new-azure-vms-based-on-the-azure-cobalt-100-processor/4146353 + type: documentation + +### FIXED, DO NOT MODIFY +# ================================================================================ +weight: 1 +layout: "learningpathall" +learning_path_main_page: "yes" +--- diff --git a/content/learning-paths/servers-and-cloud-computing/trivy-on-gcpp/_next-steps.md b/content/learning-paths/servers-and-cloud-computing/trivy-on-gcpp/_next-steps.md new file mode 100644 index 0000000000..c3db0de5a2 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/trivy-on-gcpp/_next-steps.md @@ -0,0 +1,8 @@ +--- +# ================================================================================ +# FIXED, DO NOT 
MODIFY THIS FILE +# ================================================================================ +weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation. +title: "Next Steps" # Always the same, html page title. +layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing. +--- diff --git a/content/learning-paths/servers-and-cloud-computing/trivy-on-gcpp/background.md b/content/learning-paths/servers-and-cloud-computing/trivy-on-gcpp/background.md new file mode 100644 index 0000000000..c3f2051190 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/trivy-on-gcpp/background.md @@ -0,0 +1,21 @@ +--- +title: "Overview" +weight: 2 + +### FIXED, DO NOT MODIFY +layout: "learningpathall" +--- + +## Cobalt 100 Arm-based processor + +Azure Cobalt 100 is Microsoft’s first-generation Arm-based processor, designed for cloud-native, scale-out Linux workloads. Based on Arm’s Neoverse-N2 architecture, it is a 64-bit CPU that delivers improved performance and energy efficiency. Running at 3.4 GHz, it provides a dedicated physical core for each vCPU, ensuring consistent and predictable performance. Typical workloads include web and application servers, data analytics, open-source databases, and caching systems. + +To learn more, see the Microsoft blog [Announcing the preview of new Azure virtual machines based on the Azure Cobalt 100 processor](https://techcommunity.microsoft.com/blog/azurecompute/announcing-the-preview-of-new-azure-vms-based-on-the-azure-cobalt-100-processor/4146353). + +## Trivy + +Trivy is an open-source vulnerability scanner designed to detect security issues in container images, filesystems, and infrastructure configurations. It is widely used in modern DevSecOps workflows to identify known vulnerabilities in operating system packages and application dependencies. + +You can use Trivy to perform fast and reliable security scans on container images built for multiple architectures, including Arm64. It helps teams shift security left by detecting vulnerabilities early in the development and CI/CD pipeline. + +Learn more at the [Trivy official website](https://trivy.dev/) and in the [Trivy documentation](https://trivy.dev/docs/). 
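+
+As a quick illustration of what this looks like in practice (assuming Trivy is already installed — installation on Arm64 is covered later in this Learning Path), scanning a public image and reporting only the most severe findings is a single command:
+
+```bash
+# Example only: scan a public image and report HIGH and CRITICAL vulnerabilities.
+trivy image --severity HIGH,CRITICAL nginx:latest
+```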
diff --git a/content/learning-paths/servers-and-cloud-computing/trivy-on-gcpp/images/arm64-runner.png b/content/learning-paths/servers-and-cloud-computing/trivy-on-gcpp/images/arm64-runner.png new file mode 100644 index 0000000000..dcc54bceeb Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/trivy-on-gcpp/images/arm64-runner.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/trivy-on-gcpp/images/final-vm.png b/content/learning-paths/servers-and-cloud-computing/trivy-on-gcpp/images/final-vm.png new file mode 100644 index 0000000000..5207abfb41 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/trivy-on-gcpp/images/final-vm.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/trivy-on-gcpp/images/instance.png b/content/learning-paths/servers-and-cloud-computing/trivy-on-gcpp/images/instance.png new file mode 100644 index 0000000000..285cd764a5 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/trivy-on-gcpp/images/instance.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/trivy-on-gcpp/images/instance1.png b/content/learning-paths/servers-and-cloud-computing/trivy-on-gcpp/images/instance1.png new file mode 100644 index 0000000000..b9d22c352d Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/trivy-on-gcpp/images/instance1.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/trivy-on-gcpp/images/instance4.png b/content/learning-paths/servers-and-cloud-computing/trivy-on-gcpp/images/instance4.png new file mode 100644 index 0000000000..2a0ff1e3b0 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/trivy-on-gcpp/images/instance4.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/trivy-on-gcpp/images/secrets.png b/content/learning-paths/servers-and-cloud-computing/trivy-on-gcpp/images/secrets.png new file mode 100644 index 0000000000..c3c98a6d97 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/trivy-on-gcpp/images/secrets.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/trivy-on-gcpp/images/security-scan.png b/content/learning-paths/servers-and-cloud-computing/trivy-on-gcpp/images/security-scan.png new file mode 100644 index 0000000000..d2c2b8330a Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/trivy-on-gcpp/images/security-scan.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/trivy-on-gcpp/images/trivy-multiarch.png b/content/learning-paths/servers-and-cloud-computing/trivy-on-gcpp/images/trivy-multiarch.png new file mode 100644 index 0000000000..aafbd08274 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/trivy-on-gcpp/images/trivy-multiarch.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/trivy-on-gcpp/images/ubuntu-pro.png b/content/learning-paths/servers-and-cloud-computing/trivy-on-gcpp/images/ubuntu-pro.png new file mode 100644 index 0000000000..d54bd75ca6 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/trivy-on-gcpp/images/ubuntu-pro.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/trivy-on-gcpp/instance.md b/content/learning-paths/servers-and-cloud-computing/trivy-on-gcpp/instance.md new file mode 100644 index 0000000000..e064907fc6 --- /dev/null +++ 
b/content/learning-paths/servers-and-cloud-computing/trivy-on-gcpp/instance.md @@ -0,0 +1,66 @@ +--- +title: Create an Azure Cobalt 100 Arm64 virtual machine +weight: 3 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Prerequisites and setup + +There are several common ways to create an Arm-based Cobalt 100 virtual machine, and you can choose the method that best fits your workflow or requirements: + +- The Azure Portal +- The Azure CLI +- An infrastructure as code (IaC) tool + +In this section, you will launch the Azure Portal to create a virtual machine with the Arm-based Azure Cobalt 100 processor. + +This Learning Path focuses on general-purpose virtual machines in the Dpsv6 series. For more information, see the [Microsoft Azure guide for the Dpsv6 size series](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/general-purpose/dpsv6-series). + +While the steps to create this instance are included here for convenience, you can also refer to the [Deploy a Cobalt 100 virtual machine on Azure Learning Path](/learning-paths/servers-and-cloud-computing/cobalt/). + +## Create an Arm-based Azure virtual machine + +Creating a virtual machine based on Azure Cobalt 100 is no different to creating any other virtual machine in Azure. Follow the steps below to create an Azure virtual machine: + +- Launch the Azure portal and navigate to **Virtual Machines**. +- Select **Create**, and select **Virtual Machine** from the drop-down list. +- Inside the **Basic** tab, fill in the instance details such as **Virtual machine name** and **Region**. +- Select the image for your virtual machine (for example, Ubuntu Pro 24.04 LTS) and select **Arm64** as the VM architecture. +- In the **Size** field, select **See all sizes** and select the D-Series v6 family of virtual machines. +- Select **D4ps_v6** from the list as shown in the diagram below: + +![Azure portal VM creation — Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/instance.png "Select the D-Series v6 family of virtual machines") + +- For **Authentication type**, select **SSH public key**. {{% notice Note %}} +Azure generates an SSH key pair for you and lets you save it for future use. This method is fast, secure, and easy for connecting to your virtual machine. +{{% /notice %}} +- Fill in the **Administrator username** for your VM. +- Select **Generate new key pair**, and select **RSA SSH Format** as the SSH Key Type. {{% notice Note %}} +RSA offers better security with keys longer than 3072 bits. +{{% /notice %}} +- Give your SSH key a key pair name. +- In the **Inbound port rules**, select **HTTP (80)** and **SSH (22)** as the inbound ports, as shown below: + +![Azure portal VM creation — Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/instance1.png "Allow inbound port rules") + +- Now select the **Review + Create** tab and review the configuration for your virtual machine. It should look like the following: + +![Azure portal VM creation — Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/ubuntu-pro.png "Review and create an Azure Cobalt 100 Arm64 VM") + +- When you are happy with your selection, select the **Create** button and then **Download Private key and Create Resource** button. + +![Azure portal VM creation — Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/instance4.png "Download private key and create resource") + +Your virtual machine should be ready and running in a few minutes. 
You can SSH into the virtual machine using the private key, along with the public IP details. + +![Azure portal VM creation — Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](images/final-vm.png "VM deployment confirmation in Azure portal") + +{{% notice Note %}} + +To learn more about Arm-based virtual machine in Azure, see “Getting Started with Microsoft Azure” in [Get started with Arm-based cloud instances](/learning-paths/servers-and-cloud-computing/csp/azure). + +{{% /notice %}} + +Your Azure Cobalt 100 Arm64 virtual machine is now ready. Continue to the next step to install and configure Trivy. diff --git a/content/learning-paths/servers-and-cloud-computing/trivy-on-gcpp/trivy-ci-integration.md b/content/learning-paths/servers-and-cloud-computing/trivy-on-gcpp/trivy-ci-integration.md new file mode 100644 index 0000000000..5611ea1218 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/trivy-on-gcpp/trivy-ci-integration.md @@ -0,0 +1,314 @@ +--- +title: Enforce Container Security with Trivy in CI using Arm Runners on Azure Cobalt 100 +weight: 5 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Overview + +In this module, you will integrate Trivy into a CI pipeline using a self-hosted Arm64 GitHub Actions runner on Azure Cobalt 100. + +You will: + +- Set up an Arm-based GitHub runner +- Securely configure secrets +- Run Trivy scans automatically in CI +- Enforce security gates on vulnerable images +- Upload vulnerability reports for auditing + +This demonstrates a real-world DevSecOps workflow running natively on Arm infrastructure. + +You will move from local scanning to fully automated security enforcement in CI. + +## Prerequisites + +Ensure that you have: + +- Completed the local Trivy scanning module +- A GitHub repository +- An Azure Cobalt 100 Arm64 Ubuntu VM running +- A multi-architecture container image pushed to Docker Hub + +These components are required for building a secure CI pipeline. + + +## Create a New GitHub Repository + +Open in your browser: https://github.com/new + +**Fill in:** + +- Repository name: trivy-arm-ci-demo (or any name you prefer) +- Visibility: Public or Private +- Do NOT initialize with README (keep it empty) + +**Click:** + +- Create repository + +You now have an empty GitHub repository ready to receive your project files. + +## Initialize Git Repository Locally + +On your Azure Arm64 VM: + +```bash +cd ~/trivy-multiarch-demo +git init +``` + +Turns your local project folder into a Git repository. + +**Add and Commit Project Files:** + +```bash +git add . +git commit -m "Multi-arch nginx image for Trivy scan" +``` +Stage and save your Dockerfile and demo files into Git history. + +**Link Local Repo to GitHub:** + +```bash +git branch -M main +git remote add origin https://github.com//trivy-arm-ci-demo.git +``` + +You can connect your local repository to the remote GitHub repository. + +**Push Code to GitHub:** + +```bash +git push -u origin main +``` + +Uploads your project files to GitHub and sets the main branch for future pushes. + +## Setup GitHub Actions Arm Runner + +Navigate in GitHub: + +```text +Settings → Actions → Runners → Add Runner +``` + +Choose: + +Linux → ARM64 + +GitHub will display setup commands. + +This prepares GitHub to register a self-hosted Arm64 runner. + +![GitHub Actions Arm64 self-hosted runner connected alt-txt#center](images/arm64-runner.png "Arm64 GitHub Runner Status") + +## Create Runner Directory on VM + +Create a workspace for the GitHub runner software. 
+```bash +mkdir actions-runner +cd actions-runner +``` + +## Download Runner Package + +Use the download command provided by GitHub: + +```bash +curl -o actions-runner-linux-arm64.tar.gz -L https://github.com/actions/runner/releases/download/v2.331.0/actions-runner-linux-arm64-2.331.0.tar.gz +tar xzf actions-runner-linux-arm64.tar.gz +``` +Downloads and extracts the Arm64-compatible runner. + +**Configure Runner:** + +```bash +./config.sh --url https://github.com// --token +``` + +Registers the runner with your GitHub repository. + +**Start Runner:** + +```bash +./run.sh +``` + +```output +√ Connected to GitHub + +Current runner version: '2.331.0' +2026-01-28 05:30:41Z: Listening for Jobs +``` + +## Configure GitHub Secrets + +**In your GitHub repository:** + +```text +Settings → Secrets → Actions → New repository secret +``` +Add the following secrets: + +| Name | Value | +| ----------------- | ----------------------------------- | +| `DOCKER_USERNAME` | Your Docker Hub username | +| `DOCKER_PASSWORD` | Docker Hub password or access token | +| `IMAGE_NAME` | trivy-multiarch-nginx | +| `IMAGE_TAG` | latest | + +![GitHub Actions secrets configuration alt-txt#center](images/secrets.png "GitHub Secrets Configuration") + +## Create CI Workflow +Create Workflow Directory + +```bash +mkdir -p .github/workflows +``` +**Create file:** + +```bash +nano .github/workflows/trivy-scan.yml +``` + +Add: +Defines a CI pipeline that scans images on every push. + +```yaml +name: Trivy Scan on Arm Runner + +on: + push: + branches: [ "main" ] + +jobs: + security-scan: + runs-on: self-hosted + + env: + DOCKER_USER: ${{ secrets.DOCKER_USERNAME }} + DOCKER_PASS: ${{ secrets.DOCKER_PASSWORD }} + IMAGE_NAME: ${{ secrets.IMAGE_NAME }} + IMAGE_TAG: ${{ secrets.IMAGE_TAG }} + + steps: + - name: Checkout code + uses: actions/checkout@v4 + + - name: Login to Docker Hub + run: | + echo "$DOCKER_PASS" | docker login -u "$DOCKER_USER" --password-stdin + + - name: Pull multi-arch image + run: | + docker pull $DOCKER_USER/$IMAGE_NAME:$IMAGE_TAG + + - name: Run Trivy Scan + run: | + trivy image \ + --severity HIGH,CRITICAL \ + --exit-code 1 \ + $DOCKER_USER/$IMAGE_NAME:$IMAGE_TAG + + - name: Generate Trivy JSON Report + run: | + trivy image \ + --format json \ + -o trivy-report.json \ + $DOCKER_USER/$IMAGE_NAME:$IMAGE_TAG + + - name: Upload Report + uses: actions/upload-artifact@v4 + with: + name: trivy-report + path: trivy-report.json +``` + +## Trigger the CI Pipeline + +```bash +git add . +git commit -m "Add Trivy scan on Arm runner" +git push origin main +``` +Push triggers GitHub Actions automatically. + +## Verify CI Execution (Trivy Security Gate on Arm Runner) + +After pushing your code to GitHub, navigate to: + +```text +GitHub → Actions → Trivy Scan on Arm Runner +``` + +Select the latest workflow run. + +Expected Results: + +![GitHub Actions Trivy security scan workflow execution log alt-txt#center](images/security-scan.png "Trivy CI Security Scan Execution") + +You should observe the following: + +- The multi-architecture container image is pulled successfully on the Arm-based Azure Cobalt 100 runner +- Trivy performs a full vulnerability scan of the container image layers and installed packages +- The pipeline fails when HIGH or CRITICAL vulnerabilities are detected + +This failure is intentional and indicates that the security scan is functioning correctly. 
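+
+If you need to unblock a release while a fix is prepared, adjust the gate rather than remove it. One possible approach, shown here as a sketch (verify the flags against your Trivy version), is to waive specific, reviewed CVEs with a *.trivyignore* file so that only unreviewed HIGH or CRITICAL findings fail the job:
+
+```bash
+# Illustrative only: keep the severity gate, but skip CVEs listed in .trivyignore
+# (one CVE ID per line, for example a hypothetical CVE-2023-12345 entry).
+trivy image \
+  --severity HIGH,CRITICAL \
+  --exit-code 1 \
+  --ignorefile .trivyignore \
+  $DOCKER_USER/$IMAGE_NAME:$IMAGE_TAG
+```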
+
+## How Trivy Works in This CI Pipeline
+
+Trivy is a container vulnerability scanner that:
+
+- Analyzes OS packages inside the image (such as zlib, openssl, and libc)
+- Scans application dependencies (npm, pip, Maven, and so on, if present)
+- Compares components against known CVE databases
+- Assigns severity levels (LOW, MEDIUM, HIGH, CRITICAL)
+
+In this workflow:
+
+```bash
+--severity HIGH,CRITICAL
+--exit-code 1
+```
+
+This means:
+
+- If Trivy finds any HIGH or CRITICAL vulnerability, it returns a non-zero exit code
+- GitHub Actions marks the job as FAILED
+- The image is blocked from moving forward in the pipeline
+
+This is known as a security gate.
+
+## Understanding the CI Scan Result
+
+**Is This Behavior Good or Bad?**
+
+This behavior is GOOD and expected in real-world CI/CD pipelines.
+
+Modern DevSecOps practices integrate security tools like Trivy directly into Continuous Integration to automatically detect vulnerabilities.
+
+The pipeline is designed to:
+
+- Stop insecure container images early in the development lifecycle
+- Prevent vulnerable software from reaching staging or production environments
+- Enforce security and compliance policies automatically
+
+**What a Failing Pipeline Means**
+
+When the Trivy scan fails the job:
+
+- Trivy successfully detected HIGH or CRITICAL vulnerabilities in the container image
+- The security gate blocked the insecure image
+- The CI pipeline protected downstream environments
+
+**Why This Matters in Enterprise CI/CD**
+
+In enterprise environments, automated security enforcement is mandatory. Failing fast on vulnerabilities ensures:
+
+- Faster remediation
+- Reduced security risk
+- Strong compliance posture
+
+This behavior confirms that your Arm-based CI pipeline is working correctly.
diff --git a/content/learning-paths/servers-and-cloud-computing/trivy-on-gcpp/trivy-setup.md b/content/learning-paths/servers-and-cloud-computing/trivy-on-gcpp/trivy-setup.md
new file mode 100644
index 0000000000..618df8adf2
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/trivy-on-gcpp/trivy-setup.md
@@ -0,0 +1,261 @@
+---
+title: Build and Scan Multi-Architecture Container Images with Trivy on Azure Cobalt 100 (Arm64)
+weight: 4
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## Overview
+
+In this module, you will build a multi-architecture container image and perform vulnerability scanning using Trivy on an Azure Cobalt 100 Arm64 Ubuntu VM.
+
+You will:
+
+- Install Docker on Arm64
+- Build and push multi-arch container images
+- Install Trivy on Arm64
+- Scan container images locally
+- Generate vulnerability reports
+
+## Prerequisites
+
+Ensure that you have:
+
+- An Azure Cobalt 100 Arm64 Ubuntu VM
+- A Docker Hub account
+- Internet connectivity
+
+Verify the architecture:
+
+```bash
+uname -m
+```
+
+## Install Docker on Arm64 VM
+Update your system and install the required tools for Docker installation:
+
+```bash
+sudo apt update
+sudo apt install -y ca-certificates curl gnupg lsb-release
+```
+
+**Add Docker GPG key:**
+
+Adds Docker's official signing key so packages are trusted.
+
+```bash
+sudo mkdir -p /etc/apt/keyrings
+curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
+```
+
+**Add repository:**
+
+Register Docker's package source for Arm64 Ubuntu.
+
+```bash
+echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
+$(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
+```
+
+**Install Docker:**
+
+Install Docker Engine and Buildx for multi-architecture builds.
+
+```bash
+sudo apt update
+sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin
+```
+
+**Enable Docker without sudo:**
+
+```bash
+sudo usermod -aG docker $USER
+newgrp docker
+```
+Adds the user to the docker group and applies the new group permissions immediately, enabling non-root access to Docker.
+
+**Verify:**
+
+Confirm Docker is running natively on Arm64.
+
+```bash
+docker info | grep Architecture
+```
+
+## Configure Docker Buildx for Multi-Architecture Builds
+
+**Create builder:**
+
+Create a special Docker builder capable of building images for multiple CPU architectures.
+
+```bash
+docker buildx create --name multiarch-builder --use
+```
+
+**Initialize:**
+
+Prepare the builder to support multi-platform builds.
+
+```bash
+docker buildx inspect --bootstrap
+```
+
+## Create Demo Application
+
+Create a workspace for the demo container application.
+
+```bash
+mkdir trivy-multiarch-demo
+cd trivy-multiarch-demo
+```
+
+## Create the Dockerfile
+
+```bash
+nano Dockerfile
+```
+
+**Add:**
+
+Use NGINX as the base image and copy a custom webpage.
+
+```dockerfile
+FROM nginx:latest
+COPY index.html /usr/share/nginx/html/index.html
+```
+
+## Create the HTML file
+
+```bash
+nano index.html
+```
+
+**Add:**
+
+Create a simple webpage to confirm the container runs correctly.
+
+```html

+<html>
+  <body>
+    <h1>Multi-Architecture NGINX on Azure Cobalt Arm64</h1>
+  </body>
+</html>
+``` + +## Login to Docker Hub + +**Run the login command:** + +```bash +docker login +``` + +- Docker will display a one-time device code and a login URL: + +```text +https://login.docker.com/activate +``` + +**Steps to complete login:** + +- Open the displayed URL in your web browser +- Sign in to your Docker Hub account (or create one if you don’t have it) +- Enter the one-time confirmation code shown in the terminal +- Click Confirm / Activate + +Once authentication completes, the terminal will show: + +```output +WARNING! Your credentials are stored unencrypted in '/home/azureuser/.docker/config.json'. +Configure a credential helper to remove this warning. See +https://docs.docker.com/go/credential-store/ + +Login Succeeded +``` +This confirms your system is now authenticated with Docker Hub and ready to push or pull container images. + +## Build and Push Multi-Architecture Image + +```bash +docker buildx build \ + --platform linux/amd64,linux/arm64 \ + -t /trivy-multiarch-nginx:latest \ + --push . +``` + +- Builds the container for both amd64 and arm64 +- Pushes the multi-arch image to Docker Hub + +This allows the same image to run on different CPU architectures. + +![Trivy scanning multi-architecture container image alt-txt#center](images/trivy-multiarch.png "Trivy Multi-Arch Image Scan") + +## Install Trivy on Arm64 + +**Download:** + +Download the Arm64-compatible Trivy scanner. + +```bash +wget https://github.com/aquasecurity/trivy/releases/download/v0.68.1/trivy_0.68.1_Linux-ARM64.deb +``` +{{% notice Note %}} +The [Arm Ecosystem Dashboard](https://developer.arm.com/ecosystem-dashboard/) recommends Trivy 0.29.0 or later for Arm platforms. +{{% /notice %}} + +**Install:** + +Install Trivy on your system. + +```bash +sudo dpkg -i trivy_0.68.1_Linux-ARM64.deb +``` + +**Verify:** + +```bash +trivy version +``` + +```output +Version: 0.68.1 +``` + +## Scan Image Locally + +**Run scan:** + +```bash +trivy image /trivy-multiarch-nginx:latest +``` +Trivy analyzes the container image and lists security vulnerabilities. + +## Generate JSON report + +```bash +trivy image \ + --format json \ + -o trivy-report.json \ + /trivy-multiarch-nginx:latest +``` + +Create a machine-readable vulnerability report for audits and CI pipelines. + +```output +2026-01-23T06:42:30Z INFO [vuln] Vulnerability scanning is enabled +2026-01-23T06:42:30Z INFO [secret] Secret scanning is enabled +2026-01-23T06:42:30Z INFO [secret] If your scanning is slow, please try '--scanners vuln' to disable secret scanning +2026-01-23T06:42:30Z INFO [secret] Please see https://trivy.dev/docs/v0.68/guide/scanner/secret#recommendation for faster secret detection +2026-01-23T06:42:32Z INFO Detected OS family="debian" version="12.5" +2026-01-23T06:42:32Z INFO [debian] Detecting vulnerabilities... os_version="12" pkg_num=149 +2026-01-23T06:42:33Z INFO Number of language-specific files num=0 +2026-01-23T06:42:33Z WARN Using severities from other vendors for some vulnerabilities. Read https://trivy.dev/docs/v0.68/guide/scanner/vulnerability#severity-selection for details. 
+``` + +## Outcome + +You have: + +- Installed Docker on Arm64 +- Built multi-architecture container images +- Pushed images to Docker Hub +- Installed Trivy on Azure Cobalt 100 +- Scanned images for vulnerabilities +- Generated security reports diff --git a/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/_index.md b/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/_index.md deleted file mode 100644 index ae8a05eee8..0000000000 --- a/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/_index.md +++ /dev/null @@ -1,52 +0,0 @@ ---- -title: Tune LLM CPU inference performance with multithreading - -minutes_to_complete: 30 - -who_is_this_for: This is an introductory topic for ML engineers optimizing LLM inference performance on Arm CPUs. - -learning_objectives: - - Understand how PyTorch uses multiple threads for CPU inference - - Measure the performance impact of thread count on LLM inference - - Tune thread count to optimize inference for specific models and systems - -prerequisites: - - An [Arm-based cloud instance](/learning-paths/servers-and-cloud-computing/csp/) or an Arm server with at least 16 cores - - Basic understanding of Python and PyTorch - - Ability to install Docker on your Arm system - -author: Kieran Hejmadi - -### Tags -skilllevels: Introductory -subjects: ML -armips: - - Neoverse -tools_software_languages: - - Python - - PyTorch -operatingsystems: - - Linux - - -further_reading: - - resource: - title: PyTorch CPU Threading Documentation - link: https://docs.pytorch.org/docs/stable/notes/cpu_threading_torchscript_inference.html - type: documentation - - resource: - title: Arm Tool Solutions Repository - link: https://github.com/ARM-software/Tool-Solutions/tree/main/ML-Frameworks/pytorch-aarch64 - type: website - - resource: - title: Docker install guide - link: /install-guides/docker/ - type: install-guide - - -### FIXED, DO NOT MODIFY -# ================================================================================ -weight: 1 # _index.md always has weight of 1 to order correctly -layout: "learningpathall" # All files under learning paths have this same wrapper -learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content. ---- diff --git a/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/background.md b/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/background.md deleted file mode 100644 index b6d401598d..0000000000 --- a/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/background.md +++ /dev/null @@ -1,153 +0,0 @@ ---- -title: Understand PyTorch threading for CPU inference -weight: 3 - -### FIXED, DO NOT MODIFY -layout: learningpathall ---- - -## Understanding threading trade-offs in CPU inference - -A well-known challenge in parallel programming is choosing the right number of threads for a given amount of work. When multiple threads are created to perform a task, the actual computation must be large enough to justify the overhead of coordinating those threads. - -If a computation is split across many threads, the costs of creating the threads and synchronizing their results through shared memory can easily outweigh any performance gains from parallel execution. The same principle applies to generative AI workloads running on CPU. 
- -When work is distributed across multiple threads, communication and synchronization overhead increases the total amount of work the system must perform. This creates a trade-off between latency (the time to process a single request) and throughput (the number of requests processed per unit time). - -PyTorch attempts to automatically choose an appropriate number of threads. However, as you will learn, in some cases you can manually tune this configuration to improve performance. - -## Multithreading with PyTorch on CPU - -When running inference, PyTorch uses an Application Thread Pool. PyTorch supports two types of parallelism: inter-op parallelism spawns threads to run separate operations in a graph in parallel (for example, one thread for a matrix multiplication and another thread for a softmax), while intra-op parallelism spawns multiple threads to work on the same operation. - -The diagram below shows PyTorch's threading model from the [PyTorch documentation](https://docs.pytorch.org/docs/stable/index.html). - -![Diagram showing PyTorch's threading model with application thread pool, inter-op thread pool, and intra-op thread pool for CPU inference alt-txt#center](./pytorch-threading.jpg "PyTorch threading model") - -The `torch.set_num_threads()` [API](https://docs.pytorch.org/docs/stable/generated/torch.set_num_threads.html) sets the maximum number of threads to spawn in the Application Thread Pool. - -As of PyTorch 2.8.0, the default number of threads equals the number of CPU cores (see the [PyTorch CPU Threading Documentation](https://docs.pytorch.org/docs/2.8/notes/cpu_threading_torchscript_inference.html) for more detail). PyTorch determines the ideal number of threads based on the workload size, as shown in this code snippet from [ParallelOpenMP.h](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/ParallelOpenMP.h): - -```cpp -int64_t num_threads = omp_get_num_threads(); - if (grain_size > 0) { - num_threads = std::min(num_threads, divup((end - begin), grain_size)); - } - -... -inline int64_t divup(int64_t x, int64_t y) { - return (x + y - 1) / y; -} -``` - -In PyTorch builds that use OpenMP, the maximum size of the application's thread pool can be configured at runtime using the `OMP_NUM_THREADS` environment variable. The actual number of threads used scales up to this limit depending on the workload and the `grain_size`. - -The example below demonstrates that the default settings on many-core systems might not provide optimal performance for all workloads. - -## Basic PyTorch example - -Create a new file named `pytorch_omp_example.py` with the following Python script. 
The script performs a matrix multiplication in eager mode on two 256×256 random matrices, showing the default performance of PyTorch's parallelism and printing the parallel configuration: - -```python -import os -import time -import torch - - -def main(): - print(f"PyTorch version: {torch.__version__}") - - # Read OMP_NUM_THREADS from the environment - omp_threads = os.environ.get("OMP_NUM_THREADS") - print(f"OMP_NUM_THREADS in environment: {omp_threads}") - - # If it's set and looks like a number, use it to set PyTorch's intra-op threads - if omp_threads and omp_threads.isdigit(): - torch.set_num_threads(int(omp_threads)) - - # Show how many threads PyTorch will actually use for intra-op parallelism - print(f"torch.get_num_threads(): {torch.get_num_threads()}\n") - - # A simple operation to illustrate parallelism: - size = 256 - a = torch.randn(size, size) - b = torch.randn(size, size) - - start = time.time() - c = a @ b # matrix multiplication (runs in a parallel region on CPU) - end = time.time() - - print(f"Result shape: {c.shape}") - print(f"Matrix multiply time: {end - start:.5f} seconds") - print(f"\nThreading Information = {torch.__config__.parallel_info()}") - -if __name__ == "__main__": - main() -``` - -Run the Python script: - -```bash -python pytorch_omp_example.py -``` - -The output is similar to: - - -```output -PyTorch version: 2.10.0.dev20251124 -OMP_NUM_THREADS in environment: None -torch.get_num_threads(): 96 - -Result shape: torch.Size([256, 256]) -Matrix multiply time: 0.00224 seconds - -Threading Information = ATen/Parallel: - at::get_num_threads() : 96 - at::get_num_interop_threads() : 96 -OpenMP 201511 (a.k.a. OpenMP 4.5) - omp_get_max_threads() : 96 -Intel(R) MKL-DNN v3.11.0 (Git Hash 0b8a866c009b03f322e6526d7c33cfec84a4a97a) -std::thread::hardware_concurrency() : 96 -Environment variables: - OMP_NUM_THREADS : [not set] -ATen parallel backend: OpenMP -``` - -PyTorch uses all 96 cores, and the execution time is 2.24 ms. - -Now reduce the number of OpenMP threads using the `OMP_NUM_THREADS` environment variable: - -```bash -OMP_NUM_THREADS=16 python pytorch_omp_example.py -``` - -The output shows a different `Matrix multiply time` of of 0.64 ms. - -```output -PyTorch version: 2.10.0.dev20251124 -OMP_NUM_THREADS in environment: 16 -torch.get_num_threads(): 16 - -Result shape: torch.Size([256, 256]) -Matrix multiply time: 0.00064 seconds - -Threading Information = ATen/Parallel: - at::get_num_threads() : 16 - at::get_num_interop_threads() : 96 -OpenMP 201511 (a.k.a. OpenMP 4.5) - omp_get_max_threads() : 16 -Intel(R) MKL-DNN v3.11.0 (Git Hash 0b8a866c009b03f322e6526d7c33cfec84a4a97a) -std::thread::hardware_concurrency() : 96 -Environment variables: - OMP_NUM_THREADS : 16 -ATen parallel backend: OpenMP -``` - -The execution time varies with the number of threads and the processor type in your system. - -## What you've accomplished and what's next - -You've learned how PyTorch manages threads for CPU inference and seen how thread count affects performance in a simple example. The optimal thread count depends on both the workload size and system architecture. - -Next, you'll apply these concepts to a more realistic workload by tuning thread settings for large language model inference. 
\ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/build.md b/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/build.md deleted file mode 100644 index a05f6ad0ed..0000000000 --- a/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/build.md +++ /dev/null @@ -1,107 +0,0 @@ ---- -title: Set up the environment -weight: 2 - -### FIXED, DO NOT MODIFY -layout: learningpathall ---- - -## Before you begin - -Before you can tune PyTorch threading for LLM inference on Arm CPUs, you need to set up your development environment with Docker, PyTorch, and access to the Gemma-3 models from Hugging Face. This section walks you through creating your Hugging Face account, configuring an Arm server, and running the PyTorch container with all necessary dependencies. - -{{% notice Note %}} -This Learning Path uses Arm's downstream canary release of PyTorch, which includes ready-to-use examples and scripts. This release provides access to the latest features but is intended for experimentation rather than production use. -{{% /notice %}} - -## Create a Hugging Face account - -Create a [Hugging Face account](https://huggingface.co/) if you don't already have one. After creating your account, request access to the [1B](https://huggingface.co/google/gemma-3-1b-it) and [270M](https://huggingface.co/google/gemma-3-270m-it) variants of Google's Gemma-3 model. Access is typically granted within 15 minutes. - -## Connect to an Arm system and install Docker - -If this is your first time using Arm-based cloud instances, see the [getting started guide](/learning-paths/servers-and-cloud-computing/csp/). - -The example code uses an AWS Graviton 4 (`m8g.24xlarge`) instance running Ubuntu 24.04 LTS, based on the Neoverse V2 architecture. You can use any Arm server with at least 16 cores. Keep note of your CPU count so you can adjust the example code as needed. - -Install Docker using the [Docker install guide](/install-guides/docker/) or the [official documentation](https://docs.docker.com/engine/install/ubuntu/). Follow the post-installation steps to configure Docker for non-root usage. - -## Run the PyTorch-aarch64 Docker container image - -You have two options for the Docker container. You can use a container image from Docker Hub or you can build the container image from source. Using a ready-made container makes it easier to get started, and building from source provides the latest software. The container image on Docker Hub is updated about once a month. - -Open a terminal or use SSH to connect to your Arm Linux system. - -### Use container image from Docker Hub - -Download the ready-made container image from Docker Hub: - -```bash -docker pull armlimited/pytorch-arm-neoverse:latest -``` - -Create a new container: - -```bash -docker run --rm -it armlimited/pytorch-arm-neoverse:latest -``` - -The shell prompt will appear, and you are ready to start. - -```output -aarch64_pytorch ~> -``` - -### Build from source - -To build from source, clone the repository. - -```bash -git clone https://github.com/ARM-software/Tool-Solutions.git -cd Tool-Solutions/ML-Frameworks/pytorch-aarch64/ -``` - -Build the container: - -```bash -./build.sh -n $(nproc - 1) -``` - -On a 96-core instance such as AWS `m8g.24xlarge`, this build takes approximately 20 minutes. - -After the build completes, create a Docker container. 
Replace `` with the version of torch and torchao that was built: - -```bash -./dockerize.sh ./results/torch--linux_aarch64.whl ./results/torchao--py3-none-any.whl -``` - -The shell prompt will appear, and you are ready to start. - -```output -aarch64_pytorch ~> -``` - -## Log in to Hugging Face - -Create a new Read token on Hugging Face by navigating to [Create new Access Token](https://huggingface.co/settings/tokens/new?tokenType=read). - -![Screenshot of Hugging Face token creation interface showing a dialog box with fields for token name and type, with the 'Read' option selected and a 'Create token' button visible alt-txt#center](./hf-access-token.jpg "Hugging Face token creation interface") - -Provide a token name, create the token, and copy the generated value. From within the Docker container, run the following command and paste the token when prompted: - -```bash -huggingface-cli login -``` - -Messages indicating the token is valid and login is successful are printed. - -Be aware that the login doesn't persist after the Docker container exits. You'll need to log in again if you restart the container. - -## What you've accomplished and what's next - -You've set up your environment with: -- A Hugging Face account with access to the Gemma-3 models -- An Arm server or cloud instance with Docker installed -- The PyTorch-aarch64 container running and authenticated - -You're now ready to run LLM inference experiments and measure how thread count affects performance. \ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/hf-access-token.jpg b/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/hf-access-token.jpg deleted file mode 100644 index c02ba89881..0000000000 Binary files a/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/hf-access-token.jpg and /dev/null differ diff --git a/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/prefill_throughput.png b/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/prefill_throughput.png deleted file mode 100644 index d67b397259..0000000000 Binary files a/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/prefill_throughput.png and /dev/null differ diff --git a/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/pytorch-threading.jpg b/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/pytorch-threading.jpg deleted file mode 100644 index 6b63f93cf1..0000000000 Binary files a/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/pytorch-threading.jpg and /dev/null differ diff --git a/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/tune.md b/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/tune.md deleted file mode 100644 index 55ece1a16d..0000000000 --- a/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/tune.md +++ /dev/null @@ -1,167 +0,0 @@ ---- -title: Tune thread count for LLM inference -weight: 4 - -### FIXED, DO NOT MODIFY -layout: learningpathall ---- - -## Run inference experiments with different thread counts - -Now that you understand how PyTorch threading works and have your environment configured, you're ready to tune thread settings for actual LLM inference workloads. 
-## What you've accomplished and what's next
-
-You've set up your environment with:
-- A Hugging Face account with access to the Gemma-3 models
-- An Arm server or cloud instance with Docker installed
-- The PyTorch-aarch64 container running and authenticated
-
-You're now ready to run LLM inference experiments and measure how thread count affects performance.
\ No newline at end of file
diff --git a/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/hf-access-token.jpg b/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/hf-access-token.jpg
deleted file mode 100644
index c02ba89881..0000000000
Binary files a/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/hf-access-token.jpg and /dev/null differ
diff --git a/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/prefill_throughput.png b/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/prefill_throughput.png
deleted file mode 100644
index d67b397259..0000000000
Binary files a/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/prefill_throughput.png and /dev/null differ
diff --git a/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/pytorch-threading.jpg b/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/pytorch-threading.jpg
deleted file mode 100644
index 6b63f93cf1..0000000000
Binary files a/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/pytorch-threading.jpg and /dev/null differ
diff --git a/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/tune.md b/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/tune.md
deleted file mode 100644
index 55ece1a16d..0000000000
--- a/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/tune.md
+++ /dev/null
@@ -1,167 +0,0 @@
----
-title: Tune thread count for LLM inference
-weight: 4
-
-### FIXED, DO NOT MODIFY
-layout: learningpathall
----
-
-## Run inference experiments with different thread counts
-
-Now that you understand how PyTorch threading works and have your environment configured, you're ready to tune thread settings for actual LLM inference workloads. This section shows you how to measure inference performance across different thread counts using Google's Gemma-3 models on Arm CPUs. You'll run experiments with both the 270M and 1B parameter variants to understand how model size affects optimal thread configuration.
-
-This section runs inference on Google's [Gemma-3](https://huggingface.co/google/gemma-3-1b-it) model and measures how inference performance varies with thread count for both the 270 million parameter and 1 billion parameter models. The `transformers_llm_text_gen.py` script applies groupwise, layout-aware INT4 quantization by default.
-
-Create a file named `comparison-1b.sh` with the following script:
-
-```bash
-#!/usr/bin/env bash
-set -euo pipefail
-
-# Loop over OMP_NUM_THREADS: powers of 2 plus 96
-for t in 2 4 8 16 32 64 96; do
-  echo "==============================="
-  echo "Running with OMP_NUM_THREADS=$t"
-  echo "==============================="
-
-  TORCHINDUCTOR_CPP_WRAPPER=1 \
-  TORCHINDUCTOR_FREEZING=1 \
-  OMP_NUM_THREADS="$t" \
-  python transformers_llm_text_gen.py --model google/gemma-3-1b-it 2>&1 | \
-    grep -E \
-      "^(Prefill Tokens|Prefill time|E2E Generation time|Decoded Tokens|Decode time|Prefill Tokens per second|Decode Tokens per second):"
-
-  echo # blank line between runs
-done
-```
-
-Create a second script named `comparison-270m.sh` for the 270M model:
-
-```bash
-#!/usr/bin/env bash
-set -euo pipefail
-
-# Loop over OMP_NUM_THREADS: powers of 2 plus 96
-for t in 2 4 8 16 32 64 96; do
-  echo "==============================="
-  echo "Running with OMP_NUM_THREADS=$t"
-  echo "==============================="
-
-  TORCHINDUCTOR_CPP_WRAPPER=1 \
-  TORCHINDUCTOR_FREEZING=1 \
-  OMP_NUM_THREADS="$t" \
-  python transformers_llm_text_gen.py --model google/gemma-3-270m-it 2>&1 | \
-    grep -E \
-      "^(Prefill Tokens|Prefill time|E2E Generation time|Decoded Tokens|Decode time|Prefill Tokens per second|Decode Tokens per second):"
-
-  echo # blank line between runs
-done
-```
-
-Run both scripts from the directory containing the `transformers_llm_text_gen.py` file. Because of the `grep` filter in each script, the output shows only the final statistics for each run:
-
-```bash
-./comparison-1b.sh
-./comparison-270m.sh
-```
-
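To make the sweep easier to compare at a glance, you can keep a copy of a run's output and extract one line per thread count. This is an optional sketch; the log filename is chosen here for illustration:

```bash
# Keep a copy of everything the 1B sweep prints.
./comparison-1b.sh | tee comparison-1b.log

# Summarize: show each run's thread count followed by its decode rate.
grep -E "OMP_NUM_THREADS=|Decode Tokens per second:" comparison-1b.log
```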
-To observe real-time CPU utilization and thread spawning, run the following command in a separate terminal session:
-
-```bash
-watch -n 0.1 'pid=$(pgrep -n python); [ -n "$pid" ] && ps -L -p "$pid" -o pid,tid,psr,pcpu,stat,comm'
-```
-
-The expected output is similar to:
-
-```output
-  PID   TID PSR %CPU STAT COMMAND
-10600 10600  31 85.3 Rl+  python
-10600 10606  32  2.4 Sl+  python
-10600 10607  33  2.4 Sl+  python
-10600 10608  34  2.4 Sl+  python
-```
-
-This output shows the CPU utilization of each thread, demonstrating how new threads (both inter-op and intra-op) are created and used over time.
-
-## Results
-
-The output summarizes the statistics for each run as the script sweeps through different thread counts:
-
-```output
-===============================
-Running with OMP_NUM_THREADS=2
-===============================
-Prefill Tokens: 55
-Prefill time: 0.07 seconds
-E2E Generation time: 1.50 seconds
-Decoded Tokens: 65
-Decode time: 1.44 seconds
-Prefill Tokens per second: 834.48
-Decode Tokens per second: 45.23
-
-...
-
-```
-
-The graph below shows how prefill tokens per second change with the number of OpenMP threads for the 270M and 1B variants of Gemma-3:
-
-![Line graph comparing prefill throughput performance of Gemma-3 270M and 1B models across different thread counts from 2 to 96. The y-axis shows tokens per second (0-3000), and the x-axis shows number of OpenMP threads. Both lines peak at 16-32 threads, with the 270M model achieving higher throughput but showing a steeper decline after peak performance alt-txt#center](./prefill_throughput.png "Prefill throughput versus thread count for Gemma-3 models")
-
-As expected, the smaller 270M model runs faster. Both models reach their optimal token generation rate at around 16 to 32 threads, though the 270M model exhibits a sharper performance drop-off beyond this range compared with the 1B variant.
-
-## Use PyTorch compilation mode
-
-The examples so far have used PyTorch's eager execution mode. PyTorch's compile mode can provide additional performance improvements.
-
-Before testing compile mode, install a C++ compiler and dependencies:
-
-```bash
-sudo apt update && sudo apt install g++ python3.10-dev build-essential -y
-```
-
-Run the `gemma-3-270m` model with the `--compile` flag using the default number of OpenMP threads:
-
-```bash
-TORCHINDUCTOR_CPP_WRAPPER=1 TORCHINDUCTOR_FREEZING=1 python transformers_llm_text_gen.py --compile --model google/gemma-3-270m-it
-```
-
-The output is similar to:
-
-```output
-E2E Generation time: 6.15 seconds
-Decoded Tokens: 65
-Decode time: 5.74 seconds
-Prefill Tokens per second: 133.52
-Decode Tokens per second: 11.33
-```
-
-Run the same command with `OMP_NUM_THREADS` set to 16:
-
-```bash
-TORCHINDUCTOR_CPP_WRAPPER=1 TORCHINDUCTOR_FREEZING=1 OMP_NUM_THREADS=16 python transformers_llm_text_gen.py --compile --model google/gemma-3-270m-it
-```
-
-The output is similar to:
-
-```output
-E2E Generation time: 0.63 seconds
-Decoded Tokens: 65
-Decode time: 0.61 seconds
-Prefill Tokens per second: 2728.34
-Decode Tokens per second: 107.37
-```
-
-Reducing the thread count from the default of 96 to 16 cuts end-to-end generation time from 6.15 seconds to 0.63 seconds in this example.
-
-## What you've accomplished and what's next
-
-You've explored how the number of OpenMP threads impacts LLM inference performance on Arm CPUs and learned that:
-
-- Default thread settings on many-core systems don't always provide optimal performance
-- Smaller models typically benefit from fewer threads because of lower synchronization overhead
-- The optimal thread count depends on both model size and system architecture
-- PyTorch's compile mode provides additional performance improvements when combined with thread tuning
-
-For your specific workloads, experiment with different thread counts to find the optimal setting. Start with powers of 2 (8, 16, 32) and measure the actual throughput and latency for your use case. The performance characteristics you observed in this Learning Path apply to other LLM inference workloads on Arm CPUs.
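If you want to push the tuning further, a follow-up sweep can narrow in on the best-performing region while keeping compile mode enabled. The thread values below are illustrative; the script name, environment variables, and model identifier are reused from this section:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Refine the search around the 16-32 thread sweet spot, this time with --compile.
for t in 8 12 16 20 24 32; do
  echo "OMP_NUM_THREADS=$t (compile mode)"
  TORCHINDUCTOR_CPP_WRAPPER=1 \
  TORCHINDUCTOR_FREEZING=1 \
  OMP_NUM_THREADS="$t" \
  python transformers_llm_text_gen.py --compile --model google/gemma-3-270m-it 2>&1 | \
    grep -E "^(Prefill Tokens per second|Decode Tokens per second):"
  echo
done
```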

diff --git a/themes/arm-design-system-hugo-theme/layouts/index.html b/themes/arm-design-system-hugo-theme/layouts/index.html
index 3c9ea888c4..7dd4fdd9f4 100644
--- a/themes/arm-design-system-hugo-theme/layouts/index.html
+++ b/themes/arm-design-system-hugo-theme/layouts/index.html
@@ -30,9 +30,8 @@ (markup changes around the "Arm Learning Paths" and "Install Guides for Tools" links)
@@ -66,7 +65,7 @@ (markup change around the "Arm Learning Paths" link)