diff --git a/.gitignore b/.gitignore index 261203bea1..1c485b5379 100644 --- a/.gitignore +++ b/.gitignore @@ -26,4 +26,4 @@ z_local_saved/ tags # Generated spell check config -.spellcheck-non-draft.yml +.spellcheck-non-draft.yml \ No newline at end of file diff --git a/.wordlist.txt b/.wordlist.txt index 2a68d60fd2..03084239ea 100644 --- a/.wordlist.txt +++ b/.wordlist.txt @@ -6190,3 +6190,14 @@ auxv elasticsearch esrally geonames +DeepSpeed +DeepSpeed's +GridSearchCV +Implementers +XGBoost +ZeRO +deepspeed +dimensionality +dmlc +palletsprojects +xgboost \ No newline at end of file diff --git a/README.md b/README.md index 78236f7c2b..9419181154 100644 --- a/README.md +++ b/README.md @@ -22,7 +22,7 @@ Note that all site content, including new contributions, is licensed under a [Cr ## AI Development Tools -When using AI coding assistants (GitHub Copilot, Kiro, Gemini, Cursor, etc.), refer to `.github/copilot-instructions.md` for project-specific guidelines including content requirements, writing style standards, and Arm terminology conventions for Learning Paths. +When using AI coding assistants (GitHub Copilot, Kiro, Antigravity, Cursor, etc.), refer to `.github/copilot-instructions.md` for project-specific guidelines including content requirements, writing style standards, and Arm terminology conventions for Learning Paths.
diff --git a/assets/contributors.csv b/assets/contributors.csv index 1c566509fb..6ca16604cc 100644 --- a/assets/contributors.csv +++ b/assets/contributors.csv @@ -67,7 +67,7 @@ Daniel Nguyen,,,,, Joe Stech,Arm,JoeStech,joestech,, visualSilicon,,,,, Konstantinos Margaritis,VectorCamp,,,, -Kieran Hejmadi,,,,, +Kieran Hejmadi,Arm,kieranhejmadi01,kieran-hejmadi-88920815b,, Alex Su,,,,, Chaodong Gong,,,,, Owen Wu,Arm,,,, @@ -121,4 +121,5 @@ Parichay Das,,parichaydas,parichaydas,, Johnny Nunez,NVIDIA,johnnynunez,johnnycano,, Raymond Lo,NVIDIA,raymondlo84,raymondlo84,, Kavya Sri Chennoju,Arm,kavya-chennoju,kavya-sri-chennoju,, -Akash Malik,Arm,akashmalik19973,akash-malik-a65bab219,, \ No newline at end of file +Akash Malik,Arm,akashmalik19973,akash-malik-a65bab219,, +Sue Wu,Arm,,,, \ No newline at end of file diff --git a/content/install-guides/antigravity.md b/content/install-guides/antigravity.md new file mode 100644 index 0000000000..0303e4def5 --- /dev/null +++ b/content/install-guides/antigravity.md @@ -0,0 +1,404 @@ +--- +title: Antigravity CLI +author: Jason Andrews +minutes_to_complete: 15 +official_docs: https://antigravity.google/ + +draft: true + +test_maintenance: true +test_images: +- ubuntu:latest + +layout: installtoolsall +multi_install: false +multitool_install_part: false +tool_install: true +weight: 1 +--- + +Antigravity CLI is Google's terminal-based interface for interacting with the Antigravity 2.0 agent-first development platform. You can use it to ask questions, perform multi-file code editing, and invoke AI agents directly from your terminal. + +It supports multiple operating systems, including Arm Linux distributions and macOS, and provides powerful AI assistance for developers working on Arm platforms. + +In this guide, you'll learn how to install Antigravity CLI on macOS and Arm Linux. + +## Before you begin + +You need a Google account to use Antigravity CLI. 
If you don't have one, visit [Google Account Creation](https://accounts.google.com/signup) to create an account. + +After installation, running the tool (via the `agy` command) will initiate the authentication process: +- **Local Machine:** It will automatically open your default browser for Google Sign-In. +- **Remote/SSH Sessions:** It will detect the environment and print a secure authorization URL that you can copy and open in your local browser to complete the login. + + +## Install Antigravity CLI on macOS + +You can install Antigravity CLI on macOS using either the official one-liner script or Homebrew. + +### Option 1: Install using the official installer script (Recommended) + +First, verify that `curl` is available on your system, then run the installer: + +```console +curl -fsSL https://antigravity.google/cli/install.sh | bash +``` + +The installer detects your macOS environment and downloads the appropriate binary. By default, the binary is installed to `~/.local/bin`. + +To run the command globally, ensure this directory is included in your system's `PATH`. Add the following line to your shell configuration file (e.g., `~/.zshrc` or `~/.bash_profile`): + +```console +export PATH="$HOME/.local/bin:$PATH" +``` + +Apply the changes to your current terminal session: + +```console +source ~/.zshrc +``` + +### Option 2: Install using Homebrew + +If you prefer using Homebrew for package management, you can install the CLI using the Homebrew cask: + +```console +brew install --cask antigravity-cli +``` + +## Install Antigravity CLI on Arm Linux + +You can install Antigravity CLI on Arm Linux distributions using the official installer script. This method works on all major Arm Linux distributions including Ubuntu, Debian, and CentOS. + +### Prerequisite packages + +Before running the installer, make sure you have `curl` installed on your system. 
+ +Install `curl` on Ubuntu/Debian systems: + +```bash +sudo apt update && sudo apt install -y curl +``` + +If you are not using Ubuntu/Debian, use your distribution's package manager to install `curl`. + +### Install using the installer script + +With `curl` installed, run the installation script: + +```bash +curl -fsSL https://antigravity.google/cli/install.sh | bash +``` + +The script automatically detects the CPU architecture (such as `aarch64` / `arm64`) and installs the compatible Arm Linux binary to `~/.local/bin`. + +Ensure the installation directory is in your `PATH` by adding it to your shell configuration file (e.g., `~/.bashrc` or `~/.zshrc`): + +```bash +export PATH="$HOME/.local/bin:$PATH" +``` + +Apply the changes: + +```bash +source ~/.bashrc +``` + +## Confirm Antigravity CLI is working + +Verify the installation is successful by checking the version of the `agy` binary: + +```bash +agy --version +``` + +The output is similar to: + +```output +1.0.1 +``` + +Start an interactive session to authenticate and test basic functionality: + +```console +agy +``` + +This launches the terminal user interface (TUI). On your first run, follow the prompt to authenticate with your Google account. Once authenticated, you can immediately begin asking questions. + +### View the available command-line options + +To print the available commands and options, use the `--help` flag: + +```bash +agy --help +``` + +Inside the interactive TUI session, you can type `?` to list all available slash commands (e.g., `/settings`, `/clear`, `/fork`, `/logout`). + +If you are migrating from the older Gemini CLI, you can use the built-in migration command to import your existing settings, skills, and configuration: + +```console +agy plugin import gemini +``` + +## Configure context for Arm development + +Context configuration allows you to provide the Antigravity agent with persistent information about your development environment, preferences, and project details. 
This helps it generate highly relevant and tailored responses for Arm architecture development. + +### Create a context file + +Antigravity CLI respects both global and workspace-level context files to guide agent behavior: +- **Global Context:** The CLI automatically loads and enforces user-wide rules located at `~/.gemini/GEMINI.md` across all workspaces. +- **Workspace Context:** The CLI reads `.antigravity.md` (recommended for Antigravity CLI) or `GEMINI.md` (fully supported for backward compatibility) as well as `AGENTS.md` from your active project directory. If both `.antigravity.md` and `GEMINI.md` are present, `.antigravity.md` takes precedence. + +Create the global configuration directory if it does not exist: + +```console +mkdir -p ~/.gemini +``` + +Create a global context file with your Arm development preferences: + +```console +cat > ~/.gemini/GEMINI.md << 'EOF' +I am an Arm Linux developer. I prefer Ubuntu and other Debian based distributions. I don't use any x86 computers so please provide all information assuming I'm working on Arm Linux. Sometimes I use macOS and Windows on Arm, but please only provide information about these operating systems when I ask for it. +EOF +``` + +### Managing settings + +Antigravity CLI settings are stored in `~/.gemini/antigravity-cli/settings.json`. You can manage settings in two ways: +1. **Interactive Menu:** Run `agy` and type `/settings` or `/config` to open a full-screen overlay menu to browse and modify settings. +2. **Manual Editing:** Open `~/.gemini/antigravity-cli/settings.json` in a text editor to update your preferences manually. + +--- + +## Integrate the Arm MCP server with Antigravity CLI + +The Arm Model Context Protocol (MCP) server provides Antigravity CLI with specialized tools and knowledge for Arm architecture development, migration, and optimization. By integrating the Arm MCP server, you gain access to Arm-specific documentation, code analysis tools, and optimization recommendations. 
+ +Unlike the older Gemini CLI which stored MCP settings inline inside `settings.json`, Antigravity CLI uses a dedicated configuration file for managing MCP servers. + +### Set up the Arm MCP server with Docker + +The Arm MCP server runs as a Docker container that Antigravity CLI connects to via the Model Context Protocol. + +First, ensure Docker is installed and running on your system. If needed, follow the [Docker installation guide](/install-guides/docker/). + +Pull the Arm MCP server Docker image: + +```console +docker pull armlimited/arm-mcp:latest +``` + +### Configure Antigravity CLI to use the Arm MCP server + +Create or update the dedicated global MCP configuration file at `~/.gemini/antigravity-cli/mcp_config.json` (or `.agents/mcp_config.json` inside your active workspace to enable it only for a specific project). + +Add the following JSON configuration to `~/.gemini/antigravity-cli/mcp_config.json`: + +```json +{ + "mcpServers": { + "arm_mcp_server": { + "command": "docker", + "args": [ + "run", + "--rm", + "-i", + "--pull=always", + "-v", + "/path/to/your/workspace:/workspace", + "-v", + "/path/to/your/ssh/private_key:/run/keys/ssh-key.pem:ro", + "-v", + "/path/to/your/ssh/known_hosts:/run/keys/known_hosts:ro", + "armlimited/arm-mcp:latest" + ], + "env": {} + } + } +} +``` + +Replace `/path/to/your/workspace`, `/path/to/your/ssh/private_key`, and `/path/to/your/ssh/known_hosts` with your actual workspace directory, SSH private key, and `known_hosts` file to enable remote testing features on your target device. + +### Optional: Use alternative container tools + +If you prefer not to use Docker, you can run the Arm MCP server using other compatible container tools such as Podman, Finch, Colima, or Rancher Desktop. 
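The per-tool configurations in this section differ only in the `command` value, so they can be generated from one template. The following Python sketch illustrates that idea. The names `CONTAINER_COMMANDS` and `build_mcp_config` are hypothetical helpers introduced for this example and are not part of Antigravity CLI or the Arm MCP server. The JSON layout and image name are copied from the configuration shown in this guide.

```python
import json

# Container CLI used to launch the Arm MCP server for each tool.
# Colima and Rancher Desktop (Moby) both drive the standard docker CLI.
# This mapping is a hypothetical helper for illustration only.
CONTAINER_COMMANDS = {
    "Docker": "docker",
    "Podman": "podman",
    "Finch": "finch",
    "Colima": "docker",
    "Rancher Desktop": "docker",
}


def build_mcp_config(tool, workspace, ssh_key, known_hosts):
    """Return the mcp_config.json content for the chosen container tool."""
    return {
        "mcpServers": {
            "arm_mcp_server": {
                "command": CONTAINER_COMMANDS[tool],
                "args": [
                    "run", "--rm", "-i", "--pull=always",
                    "-v", f"{workspace}:/workspace",
                    "-v", f"{ssh_key}:/run/keys/ssh-key.pem:ro",
                    "-v", f"{known_hosts}:/run/keys/known_hosts:ro",
                    "armlimited/arm-mcp:latest",
                ],
                "env": {},
            }
        }
    }


if __name__ == "__main__":
    # Placeholder paths, as in the guide; replace them with real locations.
    config = build_mcp_config(
        "Podman",
        "/path/to/your/workspace",
        "/path/to/your/ssh/private_key",
        "/path/to/your/ssh/known_hosts",
    )
    print(json.dumps(config, indent=2))
```

You could redirect the printed JSON into `~/.gemini/antigravity-cli/mcp_config.json`, assuming the file does not already define other MCP servers that you want to keep.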
+ +Select your container tool from the tabs below to view setup instructions and configuration for `~/.gemini/antigravity-cli/mcp_config.json`: + +{{< tabpane-normal >}} + {{< tab header="Podman" >}} +Install: [Podman](https://podman.io/docs/installation) + +Pull the Arm MCP Server image: +```console +podman pull armlimited/arm-mcp:latest +``` + +Add the following configuration to `~/.gemini/antigravity-cli/mcp_config.json`: +```json +{ + "mcpServers": { + "arm_mcp_server": { + "command": "podman", + "args": [ + "run", + "--rm", + "-i", + "--pull=always", + "-v", + "/path/to/your/workspace:/workspace", + "-v", + "/path/to/your/ssh/private_key:/run/keys/ssh-key.pem:ro", + "-v", + "/path/to/your/ssh/known_hosts:/run/keys/known_hosts:ro", + "armlimited/arm-mcp:latest" + ], + "env": {} + } + } +} +``` + {{< /tab >}} + {{< tab header="Finch" >}} +Install: [Finch](/install-guides/finch/) + +Pull the Arm MCP Server image: +```console +finch pull armlimited/arm-mcp:latest +``` + +Add the following configuration to `~/.gemini/antigravity-cli/mcp_config.json`: +```json +{ + "mcpServers": { + "arm_mcp_server": { + "command": "finch", + "args": [ + "run", + "--rm", + "-i", + "--pull=always", + "-v", + "/path/to/your/workspace:/workspace", + "-v", + "/path/to/your/ssh/private_key:/run/keys/ssh-key.pem:ro", + "-v", + "/path/to/your/ssh/known_hosts:/run/keys/known_hosts:ro", + "armlimited/arm-mcp:latest" + ], + "env": {} + } + } +} +``` + {{< /tab >}} + {{< tab header="Colima" >}} +Install: [Colima](https://github.com/abiosoft/colima#installation) + +Colima is a lightweight, open-source command-line alternative to Docker Desktop (primarily for macOS). It runs a minimal virtual machine in the background, allowing you to use your standard `docker` command-line tool without installing Docker Desktop. 
+ +Pull the Arm MCP Server image: +```console +docker pull armlimited/arm-mcp:latest +``` + +Add the following configuration to `~/.gemini/antigravity-cli/mcp_config.json`: +```json +{ + "mcpServers": { + "arm_mcp_server": { + "command": "docker", + "args": [ + "run", + "--rm", + "-i", + "--pull=always", + "-v", + "/path/to/your/workspace:/workspace", + "-v", + "/path/to/your/ssh/private_key:/run/keys/ssh-key.pem:ro", + "-v", + "/path/to/your/ssh/known_hosts:/run/keys/known_hosts:ro", + "armlimited/arm-mcp:latest" + ], + "env": {} + } + } +} +``` + {{< /tab >}} + {{< tab header="Rancher Desktop" >}} +Install: [Rancher Desktop](https://docs.rancherdesktop.io/getting-started/installation/) + +Rancher Desktop uses Docker via Moby. + +Pull the Arm MCP Server image: +```console +docker pull armlimited/arm-mcp:latest +``` + +Add the following configuration to `~/.gemini/antigravity-cli/mcp_config.json`: +```json +{ + "mcpServers": { + "arm_mcp_server": { + "command": "docker", + "args": [ + "run", + "--rm", + "-i", + "--pull=always", + "-v", + "/path/to/your/workspace:/workspace", + "-v", + "/path/to/your/ssh/private_key:/run/keys/ssh-key.pem:ro", + "-v", + "/path/to/your/ssh/known_hosts:/run/keys/known_hosts:ro", + "armlimited/arm-mcp:latest" + ], + "env": {} + } + } +} +``` + {{< /tab >}} +{{< /tabpane-normal >}} + +### Verify the Arm MCP server is working + +Start an interactive Antigravity CLI session: + +```console +agy +``` + +Use the `/mcp` command to list the active MCP servers and verify that `arm_mcp_server` is running and ready: + +```console +/mcp +``` + +The Arm MCP server tools are listed in the output: + +```output +MCP Servers + +Plugins (~/.gemini/antigravity-cli/plugins) +> ✓ arm_mcp_server Tools: knowledge_base_search, check_image, sysreport_instructions, + migrate_ease_scan, apx_recipe_run, +2 more +``` + + +### Use Arm prompt files with the MCP Server + +To guide the agent in using MCP tools effectively across common Arm development tasks, pair the 
server with Arm-specific prompt files. + +Browse the [agent integrations directory](https://github.com/arm/mcp/tree/main/agent-integrations/gemini) to find prompt files for specific use cases, such as: +- **Arm migration** ([arm-migration.toml](https://github.com/arm/mcp/blob/main/agent-integrations/gemini/arm-migration.toml)): Helps the agent systematically migrate applications from x86 to Arm, including dependency analysis, compatibility checks, and optimization recommendations. + +If you are facing issues or have questions, reach out to mcpserver@arm.com. diff --git a/content/install-guides/go.md b/content/install-guides/go.md index 928c14ddea..3c525002b2 100644 --- a/content/install-guides/go.md +++ b/content/install-guides/go.md @@ -79,4 +79,6 @@ The output is similar to: go version go1.24.5 linux/arm64 ``` -You are now ready to use the Go programming language on your Arm machine running Ubuntu. +## Next steps + +You are now ready to use the Go programming language on your Arm machine running Ubuntu. You can explore Learning Paths for working with Go on Arm, such as [Deploy Golang on Azure Cobalt 100 on Arm](/learning-paths/servers-and-cloud-computing/golang-on-azure/) and [Benchmark Go performance with Sweet and Benchstat](/learning-paths/servers-and-cloud-computing/go-benchmarking-with-sweet/). diff --git a/content/install-guides/java.md b/content/install-guides/java.md index 32dd1c8149..c5d0185654 100644 --- a/content/install-guides/java.md +++ b/content/install-guides/java.md @@ -406,5 +406,6 @@ INFO: Created user preferences directory. Copyright (c) 1999-2024 The Apache Software Foundation ``` +## Next steps -You are now ready to use Java on your Arm Linux system. +You are now ready to use Java on your Arm Linux system. 
You can explore Learning Paths for working with Java on Arm, such as [Run Java applications on Google Axion processors](/learning-paths/servers-and-cloud-computing/java-on-axion/) and [Tune the performance of the Java garbage collector](/learning-paths/servers-and-cloud-computing/java-gc-tuning/). diff --git a/content/install-guides/terraform.md b/content/install-guides/terraform.md index 0512a03387..da0666fc87 100644 --- a/content/install-guides/terraform.md +++ b/content/install-guides/terraform.md @@ -101,5 +101,6 @@ The output for macOS is similar to: Terraform v1.14.9 on darwin_arm64 ``` +## Next steps -You are now ready to use Terraform. +You are now ready to use Terraform. You can explore Learning Paths to work with Terraform on Arm, such as [Deploy Arm virtual machines on Google Cloud Platform (GCP) using Terraform](/learning-paths/servers-and-cloud-computing/gcp/) and [Deploy Arm instances on AWS using Terraform](/learning-paths/servers-and-cloud-computing/aws-terraform/). diff --git a/content/learning-paths/embedded-and-microcontrollers/device-connect-server/_index.md b/content/learning-paths/embedded-and-microcontrollers/device-connect-server/_index.md index ce31a7de52..b22a83edc5 100644 --- a/content/learning-paths/embedded-and-microcontrollers/device-connect-server/_index.md +++ b/content/learning-paths/embedded-and-microcontrollers/device-connect-server/_index.md @@ -4,7 +4,7 @@ title: Deploy multi-network device meshes using Device Connect server and NATS description: Connect devices and AI agents across networks using Device Connect server. Learn to provision NATS credentials, commission devices, manage persistent registry, and orchestrate multi-network IoT fleets with secure authentication. minutes_to_complete: 30 -who_is_this_for: This Learning Path is for developers who have completed the Device-to-device Learning Path and want to add a server layer to their Device Connect mesh. 
You'll learn to use persistent registry, distributed state, and security features (commissioning, ACLs) to operate a multi-network fleet. If you're new to Device Connect, start with the device-to-device Learning Path first. +who_is_this_for: This Learning Path is for developers who have completed the Device-to-device Learning Path and want to build a globally connected fleet of devices and AI agents on top of their Device Connect mesh. You'll add a server layer that gives you persistent registry, distributed state, and security features (commissioning, ACLs) so devices and agents on different networks can find and call each other through a single namespace. If you're new to Device Connect, start with the device-to-device Learning Path first. learning_objectives: - Understand what the Device Connect server adds on top of the edge SDK and when you'd reach for it diff --git a/content/learning-paths/embedded-and-microcontrollers/device-connect-server/background.md b/content/learning-paths/embedded-and-microcontrollers/device-connect-server/background.md index 14db041405..6c0f66aad7 100644 --- a/content/learning-paths/embedded-and-microcontrollers/device-connect-server/background.md +++ b/content/learning-paths/embedded-and-microcontrollers/device-connect-server/background.md @@ -19,7 +19,7 @@ This model works well for: ### When D2D isn't enough -As your fleet grows, D2D mode has limitations. You might need devices on different networks to communicate, or a registry that remembers devices after they disconnect. You might also need stronger identity controls, credential rotation, or audit logs. +As your fleet grows beyond a single local network, D2D mode has limitations. You might need devices in different sites or regions to find and call each other, or a registry that remembers devices after they disconnect. You might also need stronger identity controls, credential rotation, or audit logs. 
### When to add a server diff --git a/content/learning-paths/embedded-and-microcontrollers/zephyr_cs320_mps4/_index.md b/content/learning-paths/embedded-and-microcontrollers/zephyr_cs320_mps4/_index.md new file mode 100644 index 0000000000..4a6712186e --- /dev/null +++ b/content/learning-paths/embedded-and-microcontrollers/zephyr_cs320_mps4/_index.md @@ -0,0 +1,64 @@ +--- +title: Port Zephyr RTOS and run applications on the Corstone-320 MPS4 platform +description: Port Zephyr RTOS to the Arm Corstone-320 MPS4 FPGA platform by creating board support files and device tree configuration, then build and run the hello_world sample on the physical board. + +minutes_to_complete: 45 + +draft: true +cascade: + draft: true + +who_is_this_for: This is an introductory topic for embedded developers who want to port Zephyr RTOS to the Arm Corstone-320 MPS4 FPGA platform. + +learning_objectives: + - Set up the Zephyr build environment and Arm GNU Toolchain for Corstone-320 MPS4 development + - Create board support files, including device tree, Kconfig, and board metadata, to port Zephyr to the Corstone-320 MPS4 FPGA platform + - Build and run the hello_world sample on the Corstone-320 MPS4 board to validate the port + +prerequisites: + - Basic familiarity with embedded C programming + - Basic knowledge of Zephyr RTOS + - A Corstone-320 MPS4 FPGA development board + - A Linux development environment, for example Ubuntu 20.04 or later + - Git + - Python 3.8 or higher + +author: Sue Wu + +skilllevels: Introductory +subjects: RTOS Fundamentals +armips: + - Cortex-M +tools_software_languages: + - Zephyr + - GCC + - C +operatingsystems: + - Linux + +further_reading: + - resource: + title: Zephyr Project documentation + link: https://docs.zephyrproject.org/latest/index.html + type: website + - resource: + title: Zephyr sample applications and demos + link: https://docs.zephyrproject.org/latest/samples/index.html + type: website + - resource: + title: Arm Corstone SSE-320 FPGA image for 
MPS4 (FI101) + link: https://developer.arm.com/downloads/view/FI101 + type: website + - resource: + title: SSE-320 FPGA image for MPS4 application note + link: https://developer.arm.com/documentation/109762/0100/?lang=en + type: website + - resource: + title: Arm MPS4 FPGA prototyping board technical reference manual + link: https://developer.arm.com/documentation/102577/latest/ + type: website + +weight: 1 +layout: "learningpathall" +learning_path_main_page: "yes" +--- diff --git a/content/learning-paths/embedded-and-microcontrollers/zephyr_cs320_mps4/_next-steps.md b/content/learning-paths/embedded-and-microcontrollers/zephyr_cs320_mps4/_next-steps.md new file mode 100644 index 0000000000..727b395ddd --- /dev/null +++ b/content/learning-paths/embedded-and-microcontrollers/zephyr_cs320_mps4/_next-steps.md @@ -0,0 +1,8 @@ +--- +# ================================================================================ +# FIXED, DO NOT MODIFY THIS FILE +# ================================================================================ +weight: 21 # The weight controls the order of the pages. _index.md always has weight 1. +title: "Next Steps" # Always the same, html page title. +layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing. +--- diff --git a/content/learning-paths/embedded-and-microcontrollers/zephyr_cs320_mps4/how-to-1.md b/content/learning-paths/embedded-and-microcontrollers/zephyr_cs320_mps4/how-to-1.md new file mode 100644 index 0000000000..285c3f55c5 --- /dev/null +++ b/content/learning-paths/embedded-and-microcontrollers/zephyr_cs320_mps4/how-to-1.md @@ -0,0 +1,15 @@ +--- +title: Set up the Zephyr build environment +weight: 2 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Set up the development environment +This section describes the tools and environment you need for Corstone-320 MPS4 development with Zephyr. 
+ +### Install the Zephyr build tools + +- Follow the Zephyr Project [Getting Started Guide — Zephyr Project Documentation](https://docs.zephyrproject.org/latest/develop/getting_started/index.html) to install the required packages and set up the Zephyr workspace. +- Download and install the Arm GNU Toolchain from the [Arm GNU Toolchain downloads page](https://developer.arm.com/downloads/-/arm-gnu-toolchain-downloads). Select the `arm-none-eabi` package for your host architecture: `aarch64-arm-none-eabi` for aarch64 Linux, or `x86_64-arm-none-eabi` for x86_64 Linux. diff --git a/content/learning-paths/embedded-and-microcontrollers/zephyr_cs320_mps4/how-to-2.md b/content/learning-paths/embedded-and-microcontrollers/zephyr_cs320_mps4/how-to-2.md new file mode 100644 index 0000000000..b8dec1d9d7 --- /dev/null +++ b/content/learning-paths/embedded-and-microcontrollers/zephyr_cs320_mps4/how-to-2.md @@ -0,0 +1,269 @@ +--- +title: Add Zephyr board support for the Corstone-320 MPS4 platform +weight: 3 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Port Zephyr and run an application on Corstone-320 MPS4 + +### CS320 MPS4 Platform overview + +The Arm® Corstone™ SSE-320 FPGA Image for MPS4 (FI101) provides an FPGA implementation that runs on the MPS4 board. The image includes an Arm Cortex-M85 processor, an Arm Ethos-U85 NPU, and an Arm CoreLink DMA-350 direct memory access (DMA) controller. This setup provides a practical environment for developing and evaluating embedded applications, including machine learning workloads. 
+ +Download the latest Corstone-320 FPGA image and review the platform documentation: +* [Arm® Corstone™ SSE-320 with Cortex®-M85 and Ethos™-U85 : Example FPGA (FI101)](https://developer.arm.com/downloads/view/FI101) +* [SSE-320 FPGA Image for MPS4 Application Note](https://developer.arm.com/documentation/109762/0100/?lang=en) +* [Arm® MPS4 FPGA Prototyping Board Technical Reference Manual](https://developer.arm.com/documentation/102577/latest/) +* [Arm® Corstone™ SSE-320 Example Subsystem Software Programmers Guide](https://developer.arm.com/documentation/109759/latest/) + + +### Add Zephyr board support for Corstone-320 MPS4 + +#### Understanding Zephyr board support architecture +Zephyr organizes hardware support in a hierarchy: + +``` +Board → SoC → CPU Cluster → CPU Core → Architecture +``` + +For Corstone-320 MPS4, the hierarchy is: +- **Board**: `mps4` (your custom board name in Zephyr) +- **SoC**: `corstone320` (Corstone-320 subsystem) +- **CPU Cluster**: `m85` (Cortex-M85 cluster) +- **CPU Core**: Single Cortex-M85 core +- **Architecture**: Armv8.1-M with Helium + +#### Create the board directory structure + +Create a board directory under boards/arm/mps4/. Use the following structure: + +``` +boards/arm/mps4/ +├── board.yml # Board metadata +├── board.cmake # Build system integration +├── doc/ # Optional documentation +│ └── index.rst +├── Kconfig.mps4 # Board Kconfig entry +├── Kconfig.defconfig # Default Kconfig settings +├── mps4_corstone320_fpga_defconfig # Board defconfig fragment +├── mps4_corstone320_fpga.dts # Device tree source +└── mps4_corstone320_fpga.yaml # Test runner metadata +``` + +#### Add the essential board files + +- board.yml +The board.yml file holds the board metadata. Use it to describe the board name, vendor, SoC, and variants. 
+ +``` +board: + name: mps4 + full_name: MPS4 + vendor: arm + socs: + - name: 'corstone320' + variants: + - name: 'fpga' +``` + +- mps4_corstone320_fpga.dts + +The device tree describes the Corstone-320 MPS4 hardware. Base it on the content of the [SSE-320 FPGA Image for MPS4 Application Note](https://developer.arm.com/documentation/109762/0100/?lang=en) and tailor it to the peripherals and memory map you use. + +The following example shows a device tree that defines memory regions and enables UART and Ethos-U: + + +```dts +/dts-v1/; + +#include +#include +#include +#include + +/ { + compatible = "arm,mps4-fpga"; + #address-cells = <1>; + #size-cells = <1>; + + chosen { + zephyr,console = &uart0; + zephyr,shell-uart = &uart0; + zephyr,sram = &sram; + zephyr,flash = &isram; + }; + + cpus { + #address-cells = <1>; + #size-cells = <0>; + + cpu@0 { + device_type = "cpu"; + compatible = "arm,cortex-m85"; + reg = <0>; + #address-cells = <1>; + #size-cells = <1>; + + mpu: mpu@e000ed90 { + compatible = "arm,armv8.1m-mpu"; + reg = <0xe000ed90 0x40>; + }; + }; + }; + + ethosu { + #address-cells = <1>; + #size-cells = <0>; + interrupt-parent = <&nvic>; + + ethosu0: ethosu@50004000 { + compatible = "arm,ethos-u"; + reg = <0x50004000>; + interrupts = <16 3>; + secure-enable; + privilege-enable; + status = "okay"; + }; + }; + + + itcm: itcm@10000000 { + compatible = "zephyr,memory-region"; + reg = <0x10000000 DT_SIZE_K(32)>; + zephyr,memory-region = "ITCM"; + }; + + sram: sram@12000000 { + compatible = "zephyr,memory-region", "mmio-sram"; + reg = <0x12000000 DT_SIZE_M(2)>; + zephyr,memory-region = "SRAM"; + }; + + rom: rom@11000000 { + compatible = "zephyr,memory-region"; + reg = <0x11000000 DT_SIZE_K(128)>; + zephyr,memory-region = "ROM"; + }; + + dtcm: dtcm@30000000 { + compatible = "zephyr,memory-region"; + reg = <0x30000000 DT_SIZE_K(32)>; + zephyr,memory-region = "DTCM"; + }; + + isram: sram@31000000 { + compatible = "zephyr,memory-region", "mmio-sram"; + reg = <0x31000000 
DT_SIZE_M(4)>; + zephyr,memory-region = "ISRAM"; + }; + + + soc { + peripheral@50000000 { + #address-cells = <1>; + #size-cells = <1>; + ranges = <0x0 0x50000000 0x10000000>; + + #include "mps4_common_soc_peripheral_fpga.dtsi" + }; + }; +}; + +#include "mps4_common.dtsi" +``` +- mps4_common_soc_peripheral_fpga.dtsi + +This file defines the SoC peripherals for the MPS4 FPGA build. The following example configures a fixed system clock, two UART instances, and the pin controller. + +``` +sysclk: system-clock { + compatible = "fixed-clock"; + clock-frequency = <50000000>; + #clock-cells = <0>; +}; + +uart0: uart@9303000 { + compatible = "arm,cmsdk-uart"; + reg = <0x9303000 0x1000>; + interrupts = <34 3 49 3>; + interrupt-names = "tx", "rx"; + clocks = <&sysclk>; + current-speed = <115200>; +}; + +uart1: uart@9304000 { + compatible = "arm,cmsdk-uart"; + reg = <0x9304000 0x1000>; + interrupts = <36 3 35 3>; + interrupt-names = "tx", "rx"; + clocks = <&sysclk>; + current-speed = <115200>; +}; + +pinctrl: pinctrl { + compatible = "arm,mps4-pinctrl"; + status = "okay"; +}; + +``` + +- Kconfig files + +Zephyr uses Kconfig to configure build-time features. The MPS4 platform uses three Kconfig-related files: + +- Kconfig.mps4 +- Kconfig.defconfig +- Kconfig + +Kconfig.mps4 is the base configuration. It selects the SoC series and the specific SoC variant. + +```kconfig.mps4 +config BOARD_MPS4 + select SOC_SERIES_MPS4 + select SOC_MPS4_CORSTONE315 if BOARD_MPS4_CORSTONE315_FVP || BOARD_MPS4_CORSTONE315_FVP_NS + select SOC_MPS4_CORSTONE320 if BOARD_MPS4_CORSTONE320_FVP || BOARD_MPS4_CORSTONE320_FVP_NS || BOARD_MPS4_CORSTONE320_FPGA + +``` + +Kconfig.defconfig and Kconfig provide default values for the features and drivers that your board requires. 
+ +```kconfig.defconfig +if BOARD_MPS4_CORSTONE315_FVP || BOARD_MPS4_CORSTONE320_FVP || BOARD_MPS4_CORSTONE320_FPGA + +config UART_INTERRUPT_DRIVEN + default y # Enable interrupt-driven UART by default + +config ROMSTART_REGION_ADDRESS + default $(dt_nodelabel_reg_addr_hex,rom) if BOARD_MPS4_CORSTONE320_FPGA + default $(dt_nodelabel_reg_addr_hex,itcm) + +config ROMSTART_REGION_SIZE + default $(dt_nodelabel_reg_size_hex,rom,0,k) if BOARD_MPS4_CORSTONE320_FPGA + default $(dt_nodelabel_reg_size_hex,itcm,0,k) + +endif + +``` + +The mps4_corstone320_fpga_defconfig file is a Kconfig fragment that Zephyr merges into the final .config when you build an application for this board. The following example enables TrustZone, MPU support, GPIO, and console over UART, and it builds a Secure image that relocates the ROM start region. + +```kconfig +CONFIG_RUNTIME_NMI=y +CONFIG_ARM_TRUSTZONE_M=y +CONFIG_ARM_MPU=y + +# GPIOs +CONFIG_GPIO=y + +# Serial +CONFIG_CONSOLE=y +CONFIG_UART_CONSOLE=y +CONFIG_SERIAL=y + +# Build a Secure firmware image +CONFIG_TRUSTED_EXECUTION_SECURE=y +# ROMSTART_REGION address and size are defined in Kconfig.defconfig +CONFIG_ROMSTART_RELOCATION_ROM=y + +``` + diff --git a/content/learning-paths/embedded-and-microcontrollers/zephyr_cs320_mps4/how-to-3.md b/content/learning-paths/embedded-and-microcontrollers/zephyr_cs320_mps4/how-to-3.md new file mode 100644 index 0000000000..569881006a --- /dev/null +++ b/content/learning-paths/embedded-and-microcontrollers/zephyr_cs320_mps4/how-to-3.md @@ -0,0 +1,55 @@ +--- +title: Build and run the hello_world sample on the Corstone-320 MPS4 platform +weight: 4 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Build the hello_world sample for MPS4 + +The Zephyr hello_world sample prints “Hello World” to the console. Use it to validate that your board support and toolchain configuration work. + +1. Activate your Python virtual environment for Zephyr. +2. Set the toolchain environment variables. 
Set `GNUARMEMB_TOOLCHAIN_PATH` to the directory where you installed the Arm GNU Toolchain. The directory name includes your host architecture: `arm-gnu-toolchain--aarch64-arm-none-eabi` on aarch64, or `arm-gnu-toolchain--x86_64-arm-none-eabi` on x86_64. + ```bash + export ZEPHYR_TOOLCHAIN_VARIANT=gnuarmemb + export GNUARMEMB_TOOLCHAIN_PATH= + ``` +3. Build the sample for the Corstone-320 FPGA variant: + ```bash + west build -p always -b mps4/corstone320/fpga zephyr/samples/hello_world -- -DCONFIG_ROMSTART_RELOCATION_ROM=y + ``` +After a successful build, the output file zephyr.elf is available under build/zephyr/. The ELF image contains the application and the Zephyr kernel libraries. + +## Run the application on the MPS4 board +1. Download the board files from [FI101](https://developer.arm.com/downloads/view/FI101?sortBy=availableBy&revision=r1p0-00eac0-2). +2. Set up the MPS4 platform according to [Using the FI101 on MPS4 board](https://developer.arm.com/documentation/109762/0100/?lang=en). + +For the hello_world application, place the vector table in the FPGA boot ROM at address 0x11000000, and place the remaining code and data in SRAM at address 0x31000000. Create vector.bin and app.bin from zephyr.elf by using arm-none-eabi-objcopy. + +Update images.txt under /MB/HBI0376B/FI101 to load the two images: + + ``` +IMAGE0PORT: 2 +IMAGE0ADDRESS: 0x00_1100_0000 ; Address to load into +IMAGE0UPDATE: RAM +IMAGE0FILE: \SOFTWARE\vector.bin ; Image/data to be loaded + +IMAGE1PORT: 1 +IMAGE1ADDRESS: 0x31000000 ; Address to load into +IMAGE1UPDATE: RAM +IMAGE1FILE: \SOFTWARE\app.bin ; Image/data to be loaded + ``` + +Copy vector.bin and app.bin to \SOFTWARE, then power on the board. 
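The objcopy step described above can be sketched as a small Python helper that only builds the two command lines. This is a minimal sketch under stated assumptions, not the author's exact procedure: it assumes the vector table sits in a linker section named `rom_start` (you can check the section names with `arm-none-eabi-objdump -h build/zephyr/zephyr.elf`), and the helper name `objcopy_commands` and the default paths are hypothetical.

```python
import shlex

# Assumption: the vector table is emitted into a section named "rom_start".
# Verify the section name in your own ELF before relying on it.
VECTOR_SECTION = "rom_start"


def objcopy_commands(elf="build/zephyr/zephyr.elf",
                     objcopy="arm-none-eabi-objcopy"):
    """Return the two objcopy invocations as argument lists."""
    # vector.bin: a raw binary that holds only the vector table section.
    vector_cmd = [objcopy, "-O", "binary",
                  f"--only-section={VECTOR_SECTION}", elf, "vector.bin"]
    # app.bin: a raw binary of everything except the vector table section.
    app_cmd = [objcopy, "-O", "binary",
               f"--remove-section={VECTOR_SECTION}", elf, "app.bin"]
    return [vector_cmd, app_cmd]


if __name__ == "__main__":
    # Print the commands so that you can review them before running them.
    for cmd in objcopy_commands():
        print(shlex.join(cmd))
```

To run the commands instead of printing them, pass each list to `subprocess.run(cmd, check=True)` from the directory that contains `build/`.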
+If the setup is correct, the UART console prints the “Hello World” message, similar to the following example: + + ![alt text](image.png) + +## What you accomplished +In this Learning Path, you learned how to: +- Explore the Corstone‑320 architecture, create board support files, and configure device tree and Kconfig options to port Zephyr RTOS to the target hardware. +- Build and run the Zephyr hello_world sample on the MPS4 board. + +These steps help you further customize Zephyr on the CS320 MPS4 platform and validate a complete build-and-run workflow. diff --git a/content/learning-paths/embedded-and-microcontrollers/zephyr_cs320_mps4/image.png b/content/learning-paths/embedded-and-microcontrollers/zephyr_cs320_mps4/image.png new file mode 100644 index 0000000000..5628bb32e6 Binary files /dev/null and b/content/learning-paths/embedded-and-microcontrollers/zephyr_cs320_mps4/image.png differ diff --git a/content/learning-paths/mobile-graphics-and-gaming/vulkan-ml-sample/3-first-sample.md b/content/learning-paths/mobile-graphics-and-gaming/vulkan-ml-sample/3-first-sample.md index 697430837c..a2987de7b8 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/vulkan-ml-sample/3-first-sample.md +++ b/content/learning-paths/mobile-graphics-and-gaming/vulkan-ml-sample/3-first-sample.md @@ -12,10 +12,10 @@ The **Simple Tensor and Data Graph** sample is your starting point for working w ## Clone the Vulkan Samples -With the environment set up, you can now grab the sample code. These examples are maintained in a fork of the Khronos Group's repository. +With the environment set up, you can now grab the sample code. These examples sit in the Khronos Group's repository. 
```bash -git clone --recurse-submodules https://github.com/ARM-software/Vulkan-Samples --branch tensor_and_data_graph +git clone --recurse-submodules https://github.com/KhronosGroup/Vulkan-Samples.git cd Vulkan-Samples ``` diff --git a/content/learning-paths/servers-and-cloud-computing/arm_pmu/perf_event_open.md b/content/learning-paths/servers-and-cloud-computing/arm_pmu/perf_event_open.md index 7ac96479df..a05e392051 100644 --- a/content/learning-paths/servers-and-cloud-computing/arm_pmu/perf_event_open.md +++ b/content/learning-paths/servers-and-cloud-computing/arm_pmu/perf_event_open.md @@ -212,7 +212,8 @@ int main() { for(int i = 0; i < counter_results.nr; i++) { for(int j = 0; j < TOTAL_EVENTS ;j++){ if(counter_results.values[i].id == id[j]){ - pe_val[i] = counter_results.values[i].value; + pe_val[j] = counter_results.values[i].value; + break; } } } diff --git a/content/learning-paths/servers-and-cloud-computing/bolt-demo/install.md b/content/learning-paths/servers-and-cloud-computing/bolt-demo/install.md index a5165be927..2552772735 100644 --- a/content/learning-paths/servers-and-cloud-computing/bolt-demo/install.md +++ b/content/learning-paths/servers-and-cloud-computing/bolt-demo/install.md @@ -16,19 +16,28 @@ Package manager versions might be older, so verify the installed version before Install BOLT from a prebuilt [LLVM release](https://github.com/llvm/llvm-project/releases). This method provides a consistent version across systems. It also lets you use newer releases when available. -The following example uses LLVM 22.1.0. +The following example uses LLVM 22.1.5, the latest LLVM release available at the time of writing (May 2026). + +{{% notice Please Note %}} + +If you are using a 1st-generation Arm AGI CPU, we recommend installing the latest LLVM release to ensure support for the processor. However, due to backward compatibility, you can still use LLVM BOLT 22.1.0 or later to complete this Learning Path. 
+ +Arm AGI CPU support is expected to be introduced no earlier than LLVM 23. + +{{% /notice %}} + Download and extract LLVM: ```bash -wget https://github.com/llvm/llvm-project/releases/download/llvmorg-22.1.0/LLVM-22.1.0-Linux-ARM64.tar.xz -tar xf LLVM-22.1.0-Linux-ARM64.tar.xz +wget https://github.com/llvm/llvm-project/releases/download/llvmorg-22.1.5/LLVM-22.1.5-Linux-ARM64.tar.xz +tar xf LLVM-22.1.5-Linux-ARM64.tar.xz ``` Add LLVM tools to your PATH: ```bash -export PATH="$(pwd)/LLVM-22.1.0-Linux-ARM64/bin:$PATH" +export PATH="$(pwd)/LLVM-22.1.5-Linux-ARM64/bin:$PATH" ``` diff --git a/content/learning-paths/servers-and-cloud-computing/bolt-demo/setup.md b/content/learning-paths/servers-and-cloud-computing/bolt-demo/setup.md index 02e8aac47f..fcc69af504 100644 --- a/content/learning-paths/servers-and-cloud-computing/bolt-demo/setup.md +++ b/content/learning-paths/servers-and-cloud-computing/bolt-demo/setup.md @@ -59,6 +59,14 @@ clang bsort.cpp -o out/bsort -O3 -fuse-ld=lld -ffunction-sections -Wl,--emit-rel {{< /tab >}} {{< /tabpane >}} +{{% notice Please Note %}} + +If you want to use BOLT with your own application running on the 1st-generation Arm AGI CPU, we recommend using the latest version of GCC/LLVM. + +For the GNU toolchain, the `-mcpu=armagicpu` definition was added in [GCC 16.1.0](https://github.com/gcc-mirror/gcc/commit/0f5f728854d2ea93e6806a8632c04383502b0386). As of May 2026, this is the same as the `-march=neoverse-v3ae` option available from [GCC 15](https://gcc.gnu.org/gcc-15/changes.html) onwards. However, in the future there may be differences between `neoverse-v3ae` and `armagicpu`. Similarly for LLVM, Arm AGI CPU support is expected to be introduced no earlier than LLVM 23. + +{{% /notice %}} + ## Verify the function order Verify that the compiler preserved the intended function order by inspecting the symbols in the `.text` section of the binary. 
Run the following command: diff --git a/content/learning-paths/servers-and-cloud-computing/deepspeed-on-axion/_index.md b/content/learning-paths/servers-and-cloud-computing/deepspeed-on-axion/_index.md index 80921a075b..61420e9026 100644 --- a/content/learning-paths/servers-and-cloud-computing/deepspeed-on-axion/_index.md +++ b/content/learning-paths/servers-and-cloud-computing/deepspeed-on-axion/_index.md @@ -1,9 +1,5 @@ --- -title: Train and Benchmark AI Workloads with DeepSpeed on Google Cloud C4A Axion VM - -draft: true -cascade: - draft: true +title: Train and benchmark AI workloads with DeepSpeed on Google Cloud C4A Axion VMs description: Set up PyTorch and DeepSpeed on Google Cloud C4A Axion Arm VMs running SUSE Linux to train neural network models, benchmark AI workloads, and validate scalable CPU-based AI execution on Arm64 processors. @@ -12,10 +8,10 @@ minutes_to_complete: 30 who_is_this_for: This is an introductory topic for DevOps engineers, ML engineers, and software developers who want to run AI training and benchmarking workloads using PyTorch and DeepSpeed on SUSE Linux Enterprise Server (SLES) Arm64, validate CPU-based neural network execution, and benchmark AI performance on Arm processors. 
learning_objectives: - - Install and configure PyTorch and DeepSpeed on Google Cloud C4A Axion processors for Arm64 + - Install and configure PyTorch and DeepSpeed on Arm-based Google Cloud C4A Axion VMs - Create and execute neural network training workloads using PyTorch - Benchmark CPU-based AI workloads on Arm64 processors - - Validate scalable AI execution and workload performance on GCP Axion Arm VMs + - Validate scalable AI execution and workload performance on Google Axion Arm VMs prerequisites: - A [Google Cloud Platform (GCP)](https://cloud.google.com/free) account with billing enabled diff --git a/content/learning-paths/servers-and-cloud-computing/deepspeed-on-axion/background.md b/content/learning-paths/servers-and-cloud-computing/deepspeed-on-axion/background.md index ab14e42f28..f0a4aed329 100644 --- a/content/learning-paths/servers-and-cloud-computing/deepspeed-on-axion/background.md +++ b/content/learning-paths/servers-and-cloud-computing/deepspeed-on-axion/background.md @@ -1,5 +1,5 @@ --- -title: Learn about DeepSpeed and Google Axion C4A for AI training +title: Understand DeepSpeed and Google Axion C4A for AI training weight: 2 layout: "learningpathall" @@ -15,16 +15,18 @@ To learn more, see the Google blog [Introducing Google Axion Processors, our new ## DeepSpeed for scalable AI training on Arm -DeepSpeed is an open-source deep learning optimization framework developed by Microsoft to enable efficient and scalable training of large AI models. It is widely used for distributed deep learning, memory optimization, large language model (LLM) training, efficient inference execution, and high-performance AI workloads. Its core capabilities include ZeRO (Zero Redundancy Optimizer) memory optimization, distributed training acceleration, mixed precision training, pipeline and tensor parallelism, and optimized inference execution. 
+DeepSpeed is an open-source deep learning optimization framework developed by Microsoft for efficient and scalable training of large AI models. It is widely used for distributed deep learning, memory optimization, large language model (LLM) training, efficient inference execution, and high-performance AI workloads. Its core capabilities include Zero Redundancy Optimizer (ZeRO) memory optimization, distributed training acceleration, mixed precision training, pipeline and tensor parallelism, and optimized inference execution. -Running DeepSpeed on Google Axion C4A Arm-based infrastructure enables efficient CPU-based AI training and benchmarking by using multi-core Arm processors and optimized memory performance, which improves performance-per-watt and reduces infrastructure costs. +By running DeepSpeed on Google Axion C4A Arm-based infrastructure, you can perform efficient CPU-based AI training and benchmarking with multi-core Arm processors and optimized memory performance. This improves performance-per-watt and reduces infrastructure costs. -On SUSE Linux Enterprise Server Arm64 environments, some DeepSpeed native CPU communication extensions require GCC 9 or later to compile. The default SUSE image ships with GCC 7.5.0, so this Learning Path installs DeepSpeed in compatibility mode alongside PyTorch CPU execution. This provides a stable and reproducible AI training and benchmarking environment on GCP Axion Arm64. +On SUSE Linux Enterprise Server Arm64 environments, some DeepSpeed native CPU communication extensions require GCC 9 or later to compile. Because the default SUSE image ships with GCC 7.5.0, you'll install DeepSpeed in compatibility mode alongside PyTorch CPU execution. This provides a stable and reproducible AI training and benchmarking environment. -Common use cases include neural network training, AI benchmarking, scalable experimentation pipelines, and CPU-based inference validation. 
+Common use cases of DeepSpeed include neural network training, AI benchmarking, scalable experimentation pipelines, and CPU-based inference validation. To learn more, see the [DeepSpeed documentation](https://www.deepspeed.ai/) and the [DeepSpeed GitHub repository](https://github.com/microsoft/DeepSpeed). ## What you've learned and what's next -This section introduced Google Axion C4A Arm-based virtual machines and DeepSpeed as a scalable AI training framework suited to Arm processors. Next, you'll provision a C4A VM and install PyTorch and DeepSpeed to begin running training and benchmarking workloads. +You've now learned about Google Axion C4A Arm-based virtual machines and DeepSpeed as a scalable AI training framework suited to Arm processors. + +Next, you'll provision a C4A virtual machine, then install PyTorch and DeepSpeed to run training and benchmarking workloads. diff --git a/content/learning-paths/servers-and-cloud-computing/deepspeed-on-axion/install-deepspeed-arm.md b/content/learning-paths/servers-and-cloud-computing/deepspeed-on-axion/install-deepspeed-arm.md index a10d76b76e..018cf31551 100644 --- a/content/learning-paths/servers-and-cloud-computing/deepspeed-on-axion/install-deepspeed-arm.md +++ b/content/learning-paths/servers-and-cloud-computing/deepspeed-on-axion/install-deepspeed-arm.md @@ -1,5 +1,5 @@ --- -title: Setup PyTorch and DeepSpeed on GCP Axion (Arm) +title: Set up PyTorch and DeepSpeed on a Google Axion C4A virtual machine weight: 4 ### FIXED, DO NOT MODIFY @@ -8,18 +8,17 @@ layout: learningpathall ## Set up the Python environment -This section walks through installing Python 3.11, creating a virtual environment, and installing PyTorch and DeepSpeed on the GCP Axion VM running SUSE Linux. +First, install Python 3.11 and create a virtual environment on the Google Axion virtual machine (VM) running SUSE Linux. +### Verify Arm64 architecture -## Verify ARM64 architecture - -Verify that the VM is running on Arm64 architecture. 
+Verify that the VM is running on Arm64 architecture: ```bash uname -m ``` -Expected output: +The output is similar to: ```text aarch64 @@ -45,15 +44,15 @@ Vendor ID: ARM The `Neoverse-V2` model name confirms you're running on a Google Axion processor. The `aarch64` architecture confirms the 64-bit Arm environment that PyTorch and DeepSpeed will target. -## Install Python +### Install Python -The default Python version on SUSE Linux may conflict with PyTorch and DeepSpeed dependencies. Python 3.11 provides stable support for both frameworks and avoids compatibility issues commonly seen with older or newer releases: +The default Python version on SUSE Linux might conflict with PyTorch and DeepSpeed dependencies. Python 3.11 provides stable support for both frameworks and avoids compatibility issues commonly seen with older or newer releases: ```bash sudo zypper install -y python311 python311-pip python311-devel ``` -## Create a Python virtual environment +### Create a Python virtual environment Create an isolated Python environment to prevent dependency conflicts with system packages: @@ -61,13 +60,13 @@ Create an isolated Python environment to prevent dependency conflicts with syste python3.11 -m venv deepspeed-env ``` -Activate environment: +Activate the virtual environment: ```bash source ~/deepspeed-env/bin/activate ``` -Verify: +Verify the Python version in the environment: ```bash python --version @@ -80,7 +79,7 @@ Python 3.11.10 ``` -## Upgrade pip +### Upgrade pip Upgrade pip, setuptools, and wheel before installing packages. Outdated packaging tools can cause installation failures or wheel compatibility issues, particularly on Arm64: @@ -88,15 +87,17 @@ Upgrade pip, setuptools, and wheel before installing packages. Outdated packagin pip install --upgrade pip setuptools wheel ``` -## Install Ninja +### Install Ninja + +Ninja is a lightweight build system used by PyTorch and DeepSpeed to compile native extensions at runtime. 
-Ninja is a lightweight build system used by PyTorch and DeepSpeed to compile native extensions at runtime. Install it via pip rather than zypper to avoid SUSE repository dependency issues sometimes seen on cloud Arm64 images: +To avoid SUSE repository dependency issues sometimes seen on cloud Arm64 images, install Ninja using `pip` rather than `zypper`: ```bash pip install ninja ``` -Verify: +Verify the installation: ```bash ninja --version @@ -107,33 +108,28 @@ The output is similar to: ```output 1.13.0.git.kitware.jobserver-pipe-1 ``` +## Install PyTorch and DeepSpeed -## Install CPU-only PyTorch +After setting up the Python environment, install PyTorch and DeepSpeed on the VM. -GCP Axion VMs have no GPU, so install the CPU-only PyTorch build. This avoids unnecessary CUDA dependencies and reduces package size: +### Install CPU-only PyTorch + +Google Axion VMs are CPU-only systems and don't contain NVIDIA GPUs. To avoid unnecessary CUDA dependencies and reduce package size, install the CPU-only PyTorch build: ```bash pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu ``` -## Why CPU-only PyTorch is used - -GCP Axion VMs are CPU-only systems and do not contain NVIDIA GPUs. +### Verify PyTorch installation -The CPU-only build: - -- Reduces package size -- Avoids unnecessary CUDA dependencies -- Improves installation stability -- Matches the Axion hardware architecture - -## Verify PyTorch installation +Verify that you installed PyTorch successfully: ```bash python -c "import torch; print(torch.__version__)" ``` The output is similar to: + ```output 2.12.0+cpu ``` @@ -145,30 +141,18 @@ python -c "import torch; print(torch.cuda.is_available())" ``` The output is similar to: + ```output False ``` -This is expected because GCP Axion VMs are CPU-only systems. - - -## DeepSpeed limitation on SUSE Arm64 - -DeepSpeed's distributed CPU extensions require GCC 9 or later to compile. 
The default SUSE Linux image on GCP Axion ships with GCC 7.5.0: - -```bash -gcc --version -``` - -```output -gcc (SUSE Linux) 7.5.0 -``` +This is expected because Google Axion VMs are CPU-only systems. -When DeepSpeed initializes its launcher, it attempts to compile the `deepspeed_shm_comm` shared memory communication extension. This compilation fails on GCC 7.5.0. To work around this, install DeepSpeed with all native extension compilation disabled. +### Install DeepSpeed -## Install DeepSpeed +DeepSpeed's distributed CPU extensions require GCC 9 or later to compile. The default SUSE Linux image on Google Axion ships with GCC 7.5.0. When DeepSpeed initializes its launcher, it attempts to compile the `deepspeed_shm_comm` shared memory communication extension. This compilation fails on GCC 7.5.0. -Install DeepSpeed with native extension compilation disabled. Each variable tells the build system to skip a specific extension that requires GCC 9 or later: +To work around this, install DeepSpeed with all native extension compilation disabled. Each variable tells the build system to skip a specific extension that requires GCC 9 or later: | Variable | Purpose | |---|---| @@ -181,8 +165,9 @@ Install DeepSpeed with native extension compilation disabled. Each variable tell DS_BUILD_OPS=0 DS_BUILD_SHM_COMM=0 DS_BUILD_CPU_ADAM=0 DS_BUILD_AIO=0 pip install deepspeed ``` +### Verify DeepSpeed installation -## Verify DeepSpeed installation +Verify that DeepSpeed was installed successfully: ```bash ds_report @@ -222,7 +207,7 @@ deepspeed wheel compiled w. ...... torch 0.0 shared memory (/dev/shm) size .... 7.80 GB ``` -The CPU accelerator warning is expected — GCP Axion VMs have no GPU. Most ops show `[NO] ... [OKAY]`, meaning they are not pre-installed but are compatible for just-in-time compilation via Ninja if needed at runtime. The one exception is `async_io`, which shows `[NO] ... [NO]` because it requires the `libaio-devel` system package. 
Since async I/O is not needed for the training workloads in this Learning Path and was disabled with `DS_BUILD_AIO=0`, you can ignore this warning. +The CPU accelerator warning is expected because Google Axion VMs have no GPU. Most ops show `[NO] ... [OKAY]`, meaning they are not pre-installed but are compatible for just-in-time compilation with Ninja if needed at runtime. The one exception is `async_io`, which shows `[NO] ... [NO]` because it requires the `libaio-devel` system package. Because async I/O isn't needed for the training workloads in this Learning Path, and it was disabled with `DS_BUILD_AIO=0`, you can ignore this warning. ## Create a project directory @@ -235,15 +220,17 @@ cd ~/deepspeed-demo ``` {{% notice Note %}} -Do not run `deepspeed train.py` directly on this VM. DeepSpeed's launcher attempts to compile the `deepspeed_shm_comm` native extension during initialization, which requires GCC 9 or later. Use `python train.py` instead, as shown in the next section. +Don't run `deepspeed train.py` directly on this VM. DeepSpeed's launcher attempts to compile the `deepspeed_shm_comm` native extension during initialization, which requires GCC 9 or later. Use `python train.py` instead, as shown in the next section. {{% /notice %}} -## Troubleshooting +## Troubleshoot setup issues + +Use the following guidance to troubleshoot issues with setting up the Python environment for the project. ### SUSE repository refresh issue -You may see the following error during `zypper` commands: +You might see the following error during `zypper` commands: ```output Receive: script died unexpectedly @@ -254,4 +241,6 @@ If Python 3.11 is already installed when this occurs, you can continue. Install ## What you've accomplished and what's next -You've installed Python 3.11, PyTorch, and DeepSpeed on a GCP Axion Arm64 VM running SUSE Linux, verified the environment with `ds_report`, and created the project directory for training scripts. 
Next, you'll create and run neural network training and benchmarking workloads on the Axion processor. +You've now installed Python 3.11, PyTorch, and DeepSpeed on a Google Axion C4A VM running SUSE Linux, verified the environment with `ds_report`, and created the project directory for training scripts. + +Next, you'll create and run neural network training and benchmarking workloads on the VM. diff --git a/content/learning-paths/servers-and-cloud-computing/deepspeed-on-axion/instance.md b/content/learning-paths/servers-and-cloud-computing/deepspeed-on-axion/instance.md index c2a983eea9..27af7294aa 100644 --- a/content/learning-paths/servers-and-cloud-computing/deepspeed-on-axion/instance.md +++ b/content/learning-paths/servers-and-cloud-computing/deepspeed-on-axion/instance.md @@ -8,18 +8,18 @@ layout: learningpathall ## Set up the virtual machine -Create a Google Axion C4A Arm-based virtual machine on Google Cloud Platform. This Learning Path uses the `c4a-standard-4` machine type, which provides 4 vCPUs and 16 GB of memory. This VM hosts PyTorch and DeepSpeed training and benchmarking workloads. +Create a Google Axion C4A Arm-based virtual machine (VM) on Google Cloud Platform. You'll use the `c4a-standard-4` machine type with 4 vCPUs and 16 GB of memory. This VM will host PyTorch and DeepSpeed training and benchmarking workloads. -{{% notice Note %}}For help with GCP setup, see the Learning Path [Getting started with Google Cloud Platform](/learning-paths/servers-and-cloud-computing/csp/google/).{{% /notice %}} +{{% notice Note %}}For help with Google Cloud Platform setup, see the Learning Path [Getting started with Google Cloud Platform](/learning-paths/servers-and-cloud-computing/csp/google/).{{% /notice %}} -To create a C4A virtual machine in the Google Cloud Console: +To create a C4A virtual machine in the Google Cloud console: -1. Navigate to the [Google Cloud Console](https://console.cloud.google.com/). +1. 
Navigate to the [Google Cloud console](https://console.cloud.google.com/). 2. Go to **Compute Engine** > **VM Instances** and select **Create Instance**. 3. Under **Machine configuration**, populate fields such as **Instance name**, **Region**, and **Zone**. 4. Set **Series** to `C4A`, then select `c4a-standard-4` for **Machine type**. -![Screenshot of the Google Cloud Console showing the Machine configuration section. The Series dropdown is set to C4A and the machine type c4a-standard-4 is selected#center](images/gcp-vm.png "Configuring machine type to C4A in Google Cloud Console") +![Screenshot of the Google Cloud console showing the Machine configuration section. The Series dropdown is set to C4A and the machine type c4a-standard-4 is selected#center](images/gcp-vm.png "Configuring machine type to C4A in Google Cloud Console") 5. Under **OS and storage**, select **Change** and then choose an Arm64-based operating system image. For this Learning Path, select **SUSE Linux Enterprise Server**. 6. For the license type, choose **Pay as you go**. @@ -28,7 +28,7 @@ To create a C4A virtual machine in the Google Cloud Console: After the instance starts, select **SSH** next to the VM in the instance list to open a browser-based terminal session. -![Google Cloud Console VM instances page displaying running instance with green checkmark and SSH button in the Connect column#center](images/gcp-pubip-ssh.png "Connecting to a running C4A VM using SSH") +![Google Cloud console VM instances page displaying running instance with green checkmark and SSH button in the Connect column#center](images/gcp-pubip-ssh.png "Connecting to a running C4A VM using SSH") A new browser window opens with a terminal connected to your VM. @@ -36,4 +36,6 @@ A new browser window opens with a terminal connected to your VM. ## What you've accomplished and what's next -You've provisioned a Google Axion C4A Arm VM and connected to it using SSH. 
Next, you'll install PyTorch and DeepSpeed and configure the Python environment for AI training. +You've now provisioned a Google Axion C4A Arm VM and connected to it using SSH. + +Next, you'll install PyTorch and DeepSpeed and configure the Python environment for AI training. diff --git a/content/learning-paths/servers-and-cloud-computing/deepspeed-on-axion/train-benchmark-deepspeed-arm.md b/content/learning-paths/servers-and-cloud-computing/deepspeed-on-axion/train-benchmark-deepspeed-arm.md index 381012e6df..df473f5ccd 100644 --- a/content/learning-paths/servers-and-cloud-computing/deepspeed-on-axion/train-benchmark-deepspeed-arm.md +++ b/content/learning-paths/servers-and-cloud-computing/deepspeed-on-axion/train-benchmark-deepspeed-arm.md @@ -1,14 +1,14 @@ --- -title: Train and Benchmark AI Workloads on GCP Axion (Arm) +title: Train and benchmark AI workloads on an Arm-based Google Axion virtual machine weight: 5 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## Train and benchmark AI workloads +## Prepare training and benchmarking AI workloads -This section demonstrates neural network training and benchmarking on GCP Axion Arm64 processors using PyTorch. Two workloads are covered: a small baseline model to verify the environment, and a larger benchmark to evaluate CPU scaling behavior. +In this section, you'll run neural network training and benchmarking workloads on an Arm-based Google Axion C4A VM using PyTorch. You'll run two workloads: a small baseline model to verify the environment, and a larger benchmark to evaluate CPU scaling behavior. If you're continuing in the same SSH session from the previous section, the `deepspeed-env` virtual environment is already active and your working directory is `~/deepspeed-demo`. 
If you've opened a new session, re-activate the environment and navigate to the project directory: @@ -16,10 +16,13 @@ If you're continuing in the same SSH session from the previous section, the `dee source ~/deepspeed-env/bin/activate cd ~/deepspeed-demo ``` +## Set up a baseline -## Baseline training workload +First, create and run a baseline model to verify the environment. -Create a lightweight neural network training script to verify the environment. The script defines a three-layer feedforward network, generates synthetic training data, runs five epochs of mini-batch gradient descent using the Adam optimizer, and prints the total training time: +### Create a baseline training workload + +Create a lightweight neural network training script to verify the environment. The script defines a three-layer feedforward network and generates synthetic training data. It runs five epochs of mini-batch gradient descent using the Adam optimizer, and prints the total training time: ```bash cat > train.py << 'EOF' @@ -81,13 +84,15 @@ print("Total Training Time:", end - start) EOF ``` -### Run the baseline training +### Run the baseline training workload + +Run the training script: ```bash python train.py ``` -Expected output: +The output is similar to: ```output Epoch 1, Loss: 155.41862654685974 @@ -100,7 +105,7 @@ Total Training Time: 0.7545099258422852 The loss decreases across all five epochs, confirming that gradient updates are working correctly and PyTorch is running properly on Arm64. -### Benchmark with timing +### Run the baseline with timing Run the same script under `time` and save the output for comparison: @@ -108,7 +113,7 @@ Run the same script under `time` and save the output for comparison: time python train.py | tee pytorch_baseline_result.txt ``` -Example output: +The output is similar to: ```output Epoch 1, Loss: 160.0170536339283 @@ -123,9 +128,13 @@ user 0m3.700s sys 0m0.137s ``` -The `real` time is total wall-clock duration. 
The `user` time exceeds `real` here because PyTorch uses multiple threads across all 4 vCPUs, so CPU time is summed across cores. +The `real` time is the total wall-clock duration. The `user` time exceeds `real` time here because PyTorch uses multiple threads across all 4 vCPUs, so CPU time is summed across cores. + +## Set up a large-scale benchmark + +After creating and running the baseline workload, create a large-scale benchmark to evaluate CPU behavior. -## Large-scale benchmark +### Create a large-scale benchmark This workload increases dataset size, input dimensionality, batch size, and model depth to stress CPU compute, memory bandwidth, and tensor operation throughput. It also calls `torch.set_num_threads(os.cpu_count())` to explicitly pin PyTorch to all available cores: @@ -196,15 +205,17 @@ EOF To observe CPU utilization while this runs, open a second terminal and run `top`. Look for the Python process — the CPU percentage reflects multi-threaded utilization across all 4 vCPUs. -### Run the large benchmark +### Run the large benchmark with timing + +Run the benchmark script under `time`: ```bash time python train_large.py | tee pytorch_large_result.txt ``` -Expected output: +The output is similar to: -```text +```output Epoch 1, Loss: 319.07712411880493 Epoch 2, Loss: 308.4675619006157 Epoch 3, Loss: 273.5877128839493 @@ -217,11 +228,15 @@ user 0m19.630s sys 0m0.251s ``` -Training time scales roughly linearly with dataset size and model depth. The `user` time being approximately 4x `real` time confirms that PyTorch is distributing work across all 4 vCPUs effectively. +Training time scales roughly linearly with dataset size and model depth. The `user` time being approximately 4x `real` time indicates that PyTorch is distributing work across all 4 vCPUs effectively. 
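As a rough check of the multi-threading claim above, you can divide CPU time (`user` plus `sys`) by wall-clock time (`real`). The following sketch parses the stamps printed by the shell `time` keyword. The helper names `to_seconds` and `average_busy_cores` are hypothetical, and the `real` value in the example is a placeholder because it depends on your run. The `user` and `sys` values come from the large benchmark output.

```python
import re


def to_seconds(stamp):
    """Convert a shell `time` stamp such as '0m3.700s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognized time stamp: {stamp!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def average_busy_cores(real, user, sys):
    """Return CPU time (user + sys) divided by wall-clock time (real)."""
    return (to_seconds(user) + to_seconds(sys)) / to_seconds(real)


if __name__ == "__main__":
    # user and sys are taken from the large benchmark output.
    # real is a hypothetical placeholder: replace it with your own value.
    ratio = average_busy_cores(real="0m5.000s", user="0m19.630s", sys="0m0.251s")
    print(f"Average busy cores: {ratio:.2f}")
```

A result close to 4 on a `c4a-standard-4` VM suggests that all four vCPUs were busy for most of the run. A result close to 1 suggests that the workload ran mostly on a single thread.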
-## Verify generated files +## Review benchmark outputs -After both scripts complete, confirm the output files are present: +After both scripts complete, review the timing outputs and compare the two workloads. + +### Verify generated files for both workloads + +Confirm that the output files are present: ```bash ls -lh @@ -236,15 +251,17 @@ The output is similar to: -rw-r--r-- 1 user user 1.2K May 15 13:48 train_large.py ``` -## Benchmark summary +### Compare training times + +The following table describes approximate training times: | Workload | Approximate training time | |---|---| | Baseline model (5K samples, 128 features) | ~0.7–0.8 seconds | | Large benchmark (20K samples, 512 features) | ~4.8–5.4 seconds | -Both workloads trained to completion with steadily decreasing loss, confirming stable PyTorch CPU execution on GCP Axion Arm64. Your results may vary depending on VM load at the time of the run. +Both workloads trained to completion with steadily decreasing loss, indicating stable PyTorch CPU execution on Google Axion C4A. Your results may vary depending on VM load at the time of the run. -## What you've accomplished and what's next +## What you've accomplished -You've run two PyTorch training workloads on a GCP Axion Arm64 VM, measured wall-clock and CPU execution time, and confirmed stable multi-threaded neural network training on SUSE Linux. +You've run two PyTorch training workloads on a Google Axion Arm64 VM, measured wall-clock and CPU execution time, and confirmed stable multi-threaded neural network training on SUSE Linux. 
diff --git a/content/learning-paths/servers-and-cloud-computing/performix-memory-access/_index.md b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/_index.md new file mode 100644 index 0000000000..70da52675d --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/_index.md @@ -0,0 +1,66 @@ +--- +title: Optimize memory access behavior using Arm Performix and the Arm MCP Server + +description: Learn how to profile and optimize memory access behavior in a C++ particle simulation on Arm Linux using the Arm Performix Memory Access recipe through the Arm MCP Server. + +minutes_to_complete: 45 + +who_is_this_for: This is an introductory topic for C++ developers who want to use Arm Performix and the Arm MCP Server to diagnose cache and address translation behavior in applications running on Arm Neoverse servers. + +learning_objectives: + - Explain how L1 cache hits, TLB misses, and page walks affect C++ application performance. + - Build and visualize the orbiting galaxies example on an Arm Neoverse server. + - Inspect and optimize the particle data structure using insights from the memory access recipe. + - Use the Arm MCP Server in combination with Arm Performix for an agentic solution. + +prerequisites: + - Access to an Arm Neoverse bare metal server. + - Basic understanding of memory hierarchy within a CPU. + - Basic C++ development experience. + - Familiarity with the Linux command line. 
+ +author: Kieran Hejmadi + +### Tags +skilllevels: Introductory +subjects: Performance and Architecture +armips: + - Neoverse +tools_software_languages: + - Arm Performix + - MCP + - C++ + - CMake + - Python + - perf +operatingsystems: + - Linux + +further_reading: + - resource: + title: Identify code hotspots using Arm Performix through the Arm MCP Server + link: /learning-paths/servers-and-cloud-computing/performix-mcp-agent/ + type: learning-path + - resource: + title: Find Code Hotspots with Arm Performix + link: /learning-paths/servers-and-cloud-computing/cpu_hotspot_performix/ + type: learning-path + - resource: + title: Optimize application performance using Arm Performix CPU microarchitecture analysis + link: /learning-paths/servers-and-cloud-computing/performix-microarchitecture/ + type: learning-path + - resource: + title: Automate x86-to-Arm application migration using Arm MCP Server + link: /learning-paths/servers-and-cloud-computing/arm-mcp-server/ + type: learning-path + - resource: + title: Arm Performix + link: https://developer.arm.com/servers-and-cloud-computing/arm-performix + type: website + +### FIXED, DO NOT MODIFY +# ================================================================================ +weight: 1 # _index.md always has weight of 1 to order correctly +layout: "learningpathall" # All files under learning paths have this same wrapper +learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content. 
+--- diff --git a/content/learning-paths/servers-and-cloud-computing/performix-memory-access/_next-steps.md b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/_next-steps.md new file mode 100644 index 0000000000..727b395ddd --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/_next-steps.md @@ -0,0 +1,8 @@ +--- +# ================================================================================ +# FIXED, DO NOT MODIFY THIS FILE +# ================================================================================ +weight: 21 # The weight controls the order of the pages. _index.md always has weight 1. +title: "Next Steps" # Always the same, html page title. +layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing. +--- diff --git a/content/learning-paths/servers-and-cloud-computing/performix-memory-access/agent_screen_shot.webp b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/agent_screen_shot.webp new file mode 100644 index 0000000000..040e004115 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/agent_screen_shot.webp differ diff --git a/content/learning-paths/servers-and-cloud-computing/performix-memory-access/codex_prompt.webp b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/codex_prompt.webp new file mode 100644 index 0000000000..2c4e753529 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/codex_prompt.webp differ diff --git a/content/learning-paths/servers-and-cloud-computing/performix-memory-access/data_layout_comparison_compressed.gif b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/data_layout_comparison_compressed.gif new file mode 100644 index 0000000000..d97b00a3e4 Binary files /dev/null and 
b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/data_layout_comparison_compressed.gif differ diff --git a/content/learning-paths/servers-and-cloud-computing/performix-memory-access/galaxy_compressed.gif b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/galaxy_compressed.gif new file mode 100644 index 0000000000..c62f07b6f5 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/galaxy_compressed.gif differ diff --git a/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-0.md b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-0.md new file mode 100644 index 0000000000..bc2c910e96 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-0.md @@ -0,0 +1,83 @@ +--- +title: Understand CPU memory hierarchy and address translation +weight: 2 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Review of the CPU memory hierarchy + +In this section, you'll learn the memory hierarchy concepts the worked example builds on. It's not an exhaustive explanation, but it covers what you'll need to interpret the profiling results. + +Modern Arm Neoverse server CPUs use a hierarchy of memories to reduce the cost of loading and storing data. The fastest storage sits close to each CPU core, while larger memories sit farther away and take more cycles to access. + +You usually see the following: + +- L1 data cache (`L1d`) and L1 instruction cache (`L1i`) close to each core with each access usually taking up to 10 cycles. +- L2 cache, often private to each core, with each access usually taking 10 to 20 cycles. +- Last-level cache, often shared across multiple cores, and usually taking more than 20 cycles. +- DRAM, which is much larger but much slower than on-chip cache. 
+ +To inspect cache topology on an Arm Neoverse server, see the [Learning Path for Arm's Sysreport tool](/learning-paths/servers-and-cloud-computing/sysreport/) or use the `lscpu` command. Unlike `lscpu`, Sysreport also reports the set associativity for each cache level. For example, you can run the following command on a system with `git` and `python` installed: + +```bash +git clone https://github.com/ArmDeveloperEcosystem/sysreport.git +cd sysreport +python3 src/sysreport.py | grep -i cache -A 4 +``` + +Depending on your system, the output is similar to: + +```output + + cache info: size, associativity, sharing + cache line size: 64 + Caches: + 64 x L1D 64K 4-way 64b-line + 64 x L1I 64K 4-way 64b-line + 64 x L2U 1M 8-way 64b-line + 1 x L3U 32M 16-way 64b-line +``` + +For a more visual view, install `hwloc` and generate a topology image: + +```bash +sudo apt update +sudo apt install -y hwloc +hwloc-ls --of png > topology.png +``` + +![Hardware locality topology for an Arm server showing per-core L1 and L2 caches and a shared L3 cache across all cores, which helps you verify cache hierarchy before profiling.#center](./topology.webp "Example hardware locality topology") + +The diagram shows cache tiers on an AWS Graviton3 bare metal server based on Neoverse V1. Each of the 64 cores has private `L1d`, `L1i`, and `L2` caches, and all cores share one `L3` cache, sometimes referred to as last-level cache (LLC). Cache sizes, especially at later levels, are not fixed by the Neoverse architecture. Implementers such as AWS or Google can configure larger or smaller caches based on design goals. + +Non-uniform memory access (NUMA) means memory latency can depend on which processor or socket owns the memory being accessed. On this AWS Graviton3 instance, there is only one NUMA node. 
+ +To get a comprehensive system-level understanding of the memory subsystem, see the Learning Path on the [Arm system characterization tool](/learning-paths/servers-and-cloud-computing/memory-subsystem/). + +## Memory and translation terminology + +Applications use virtual addresses, which are the addresses a program sees instead of physical DRAM locations. With virtual addressing, the operating system isolates processes, protects memory, and maps each program's address space to available physical memory. The processor translates virtual addresses to physical addresses before it accesses memory. + +### Translation lookaside buffer + +The translation lookaside buffer (TLB) caches recent virtual-to-physical translations at page granularity to avoid page table walks. A TLB miss occurs when the needed translation is not cached, so the processor performs a page table walk to find the mapping. Page walks add latency before a load or store can complete. Large working sets and irregular access patterns, such as strides larger than the typical 4KB page size, can increase TLB pressure because the program touches many pages with little reuse. + +### Page faults + +A minor page fault is usually harmless: the data is already in RAM, and the kernel only creates the mapping. This fault commonly happens during anonymous paging when Linux lazily backs newly allocated heap or stack memory on first touch. A major page fault is more expensive because the kernel must fetch the page from disk, such as from a file or swap, so repeated major faults are a real performance concern. + +### Working set size + +The working set is the data your program actively touches during a period of execution. It differs from resident set size (RSS), which is the amount of physical memory currently resident for a process. A process can have a large RSS while the hot loop actively uses only a smaller working set. 
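The TLB discussion above notes that strides larger than the page size touch many pages with little reuse. The following C++ sketch makes that arithmetic concrete by counting the distinct pages that a strided access pattern lands on. The function name `pages_touched` and the constant `kPageSize` are illustrative and not part of the example repository, and the 4 KB page size is an assumption because Arm Linux kernels can also be configured with 16 KB or 64 KB pages.

```cpp
#include <cstddef>
#include <unordered_set>

// Assumed page size. 4 KB is typical, but Arm Linux kernels can also be
// built with 16 KB or 64 KB pages.
constexpr std::size_t kPageSize = 4096;

// Count the distinct pages touched by `count` accesses that start at byte
// offset 0 and advance by `stride_bytes` on each access.
// Hypothetical helper for illustration only.
std::size_t pages_touched(std::size_t count, std::size_t stride_bytes) {
    std::unordered_set<std::size_t> pages;
    for (std::size_t i = 0; i < count; ++i) {
        // Integer division maps a byte offset to its page index.
        pages.insert((i * stride_bytes) / kPageSize);
    }
    return pages.size();
}
```

With these assumptions, 1024 accesses at a 4-byte stride stay inside one page, while the same number of accesses at a 4096-byte stride touch 1024 different pages, so each access needs its own translation.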
+ +## Memory access from a programmer's perspective + +From a programmer's perspective, much of the cache and memory subsystem is a black box defined by processor architecture and implementation. Features such as cache associativity, prefetching, and translation caching are designed to hide latency across many workloads. Your main software levers are data structure layout, allocation patterns, and choices such as page size. The layout of your C++ data structures can determine whether the memory hierarchy helps or hurts performance. The compiler generally can't reorder structure fields or split objects automatically because that would change program semantics. + +## What you've learned and what's next + +You've now learned about CPU memory hierarchy, memory access, and relevant memory and translation terminology to understand profiling results for the example application that you'll use in this Learning Path. + +Next, you'll set up and build the example C++ application. \ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-1.md b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-1.md new file mode 100644 index 0000000000..dec5b59822 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-1.md @@ -0,0 +1,118 @@ +--- +title: Set up and build the example application +weight: 3 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Set up the build environment + +In this section, you'll install the required system packages, clone the orbiting galaxies example repository, and build the workload binaries. You can also run a visualization to confirm the simulation is working before you profile it. + +Use your remote Arm server for all build and run steps. This example uses an Amazon EC2 `c7g.metal` instance running Ubuntu 24.04 LTS. 
+ +### Install Arm Performix + +Install and configure Arm Performix using the [Performix install guide](/install-guides/performix/) on both your local machine and the remote Arm server. + +### Install the required system packages + +Run the following command, replacing `apt` with the package manager for your Linux distribution. + +```bash +sudo apt update +sudo apt install -y git cmake build-essential python3 python3-venv python3-pip +``` + +### Enable the Arm SPE PMU driver if not already loaded + +To check whether the driver is already loaded, run: + +```bash +lsmod | grep arm_spe_pmu +``` + +If the command returns output, the driver is loaded and you can skip this step. If it returns nothing, run the following commands to load it. This is required on Ubuntu 24.04 LTS in AWS, but may not be needed on other platforms. + + +```bash +sudo apt install -y linux-modules-extra-$(uname -r) +sudo modprobe arm_spe_pmu +``` + +If you're using a `c7g.metal` instance, you also need to turn Kernel Page Table Isolation (KPTI) off. + +The fastest way on AWS is to use an editor to add `kpti=off` to the `GRUB_CMDLINE_LINUX_DEFAULT` line in `/etc/default/grub.d/50-cloudimg-settings.cfg`. + +After editing the file, run: + +```bash +sudo update-grub +sudo reboot +``` + +For a complete explanation of SPE, see [Enable Arm SPE for Performix memory access analysis](/learning-paths/servers-and-cloud-computing/spe-on-performix/). + +## Build the sample application + +After setting up the build environment, clone and build the sample application. + +### Clone the example repository + +Clone the orbiting galaxies repository and check out the tagged release to work from a known starting point: + +```bash +git clone https://github.com/arm-education/Orbiting-Galaxy-Example.git +cd Orbiting-Galaxy-Example +git checkout -b my-work v1.0.3 +``` + +### Build with CMake + +Build the project using CMake: + +```bash +mkdir -p build +cd build +cmake .. +cmake --build . 
--parallel +``` + +This produces three binaries in `build/`: + +- `baseline` — the unoptimized reference binary used for profiling +- `users_solution` — an editable copy of `baseline` for you to optimize manually +- `optimized` — a pre-built reference solution showing the expected outcome + +## Set up a Python virtual environment and run visualization + +After building the application, from the repository root, run: + +```bash +cd .. +python3 -m venv venv +source venv/bin/activate +pip install --upgrade pip +pip install -r scripts/requirements.txt +``` + +Generate simulation frames and create the GIF: + +```bash +cd build +./baseline --visualize +python3 ../scripts/visualize.py galaxy_baseline.bin +``` + +The script reads simulation data from `galaxy_baseline.bin` and writes a GIF file `assets/galaxy_baseline.gif`. + +![Animated orbiting galaxy simulation generated by the baseline workload, showing particle motion over time so you can verify that the simulation output looks correct before profiling.#center](galaxy_compressed.gif "Orbiting galaxies workload visualization") + +Use `--visualize` only for understanding the workload behavior. Don't include visualization mode in profiling runs because file I/O alters the measured runtime characteristics. + +## What you've accomplished and what's next + +You've now set up and built an orbiting galaxy application on an Arm-based instance by setting up a build environment and cloning the app from a GitHub repo. You've also run a visualization to confirm that the application works as expected. + +Next, you'll profile memory access behavior using Arm Performix. 
\ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-2.md b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-2.md new file mode 100644 index 0000000000..c1f306b122 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-2.md @@ -0,0 +1,105 @@ +--- +title: Profile memory access behavior with Arm Performix +weight: 4 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Inspect the particle data structure + +Start by inspecting the baseline particle model in `src/baseline/particle.hpp`. + +{{% notice Tip %}} + +If you are using an IDE or editor with an LLM-based coding assistant, the `AGENTS.md` file can improve your learning experience. The `AGENTS.md` file provides the repository context and helps guide the agent to give more useful assistance. + +![Screenshot showing the AGENTS.md file in the repository, highlighting the context file your coding assistant uses to provide more relevant guidance during this task.#center](./agent_screen_shot.webp "Screenshot of GitHub Copilot in VSCode using AGENTS.md as a system prompt to act as a learning assistant.") + +{{% /notice %}} + +The baseline implementation stores every property for one particle in a single structure: + +```cpp +struct Particle { + float x, y, z; // position (12 bytes) + float vx, vy, vz; // velocity (12 bytes) + float mass, charge, temperature; // properties (12 bytes) + float pressure, energy, density; // (12 bytes) + float spin_x, spin_y, spin_z; // (12 bytes) + float pad; // padding (4 bytes) +}; +``` + +The ownership container in the same file is: + +```cpp +class ParticleOwner { + // Stores particle references used by the simulation. 
+ std::vector<Particle*> particles_; +}; +``` + +The update loop in `src/baseline/baseline.cpp` repeatedly updates particle positions: + +```cpp +for (int iter = 0; iter < iters; ++iter) { + update_positions(particles.data(), NUM_PARTICLES, dt); +} +``` + +This baseline design can create avoidable memory overhead: + +- `ParticleOwner` stores pointers to separately allocated `Particle` objects, so the hot loop must follow an extra level of indirection. +- Each `Particle` is 64 bytes, but the position update uses only `x`, `y`, `z`, `vx`, `vy`, and `vz`. +- Loading whole particle objects can waste cache capacity and memory bandwidth when the loop needs only a subset of fields. + +Before you optimize anything, profile and measure. + +## Run the Performix Memory Access recipe + +Open the Performix GUI on your local machine and select the **Memory Access** recipe. + +Configure the recipe to launch the baseline workload on your remote Arm target: + +1. Select the configured remote target. +2. Set **Workload type** to **Launch a new process**. +3. Set **Workload** to the baseline executable: + +```output +~/Orbiting-Galaxy-Example/build/baseline +``` + +Keep the default profiling duration so Performix records until the workload exits. + +![Performix Memory Access recipe setup showing the selected remote Arm target and the workload path field populated with the baseline binary, which confirms the run configuration before profiling starts.#center](./setup.webp "Configure the Performix Memory Access recipe") + +Start the recipe and wait for the results to load. + +## Assess performance + +![Performix Memory Access results for the baseline binary showing update_positions with about 66 percent L1C load hits and around 26-cycle average L1C latency, indicating weak cache locality in the hot path.#center](./performix_before_optimizations.webp "Baseline memory access results before optimization") + +Look at the memory access results for the baseline binary. 
Most samples are associated with the `update_positions()` function. The `L1C % Loads` value shows that only about two-thirds of loads hit in L1 cache, and the average L1 cache load latency is about 26 cycles. A cache-friendly hot loop should have a much higher L1 hit rate and lower average latency. + +To investigate further, check the TLB walk data. As described in the background section, the TLB caches virtual-to-physical address translations. As per the following image, the `TLB Walk Breakdown` tab shows no significant TLB walks. That means address translation is not the main issue. + +![Performix Memory Access results show 0% TLB walks across all functions in the baseline binary, indicating that TLB pressure and costly address translation misses are not contributing to the performance issue.#center](./no_tlb_walks.webp "TLB walk results showing 0 page table walks for all functions in baseline implementation") + +In summary: + +- Average load latency is about 26 cycles, indicating frequent accesses beyond L1 cache. +- SPE samples are concentrated in `update_positions()`, confirming this loop dominates execution. +- TLB misses are not significant, so page walks are not the source of the slowdown. + +Double-click the `update_positions()` row to open the source code view. The source view shows that the samples concentrate on the per-particle position updates. + +![Performix source code view for update_positions showing sample concentration on the x, y, and z update statements, helping you confirm that this loop is the main optimization target.#center](./source_code.webp "Baseline source-level samples in update_positions") + +The majority of samples are associated with accessing the `Particle` data structure, and the samples fall back to L2 cache approximately one-third of the time. Considering this, to improve the execution time of the example, you'll need to focus on more efficient ways, if any, of accessing the `Particle` member variables. 
For example, there might be an alternative data structure that has better cache utilization. + +## What you've accomplished and what's next + +You've now used Arm Performix to assess the memory performance of the orbiting galaxy particle simulator application using the Memory Access recipe. + +Next, you'll use these performance results to guide optimization of the application. diff --git a/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-3.md b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-3.md new file mode 100644 index 0000000000..c9ca107484 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-3.md @@ -0,0 +1,173 @@ +--- +title: Optimize the application manually and with the Arm MCP Server +weight: 5 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Manually optimize the application + +The `src/users_solution/` directory is an editable copy of `src/baseline`. Using the data collected from Performix, refactor the `Particle` data structure and associated function signatures and call sites to improve the L1 cache hit rate. The baseline result showed that `update_positions()` dominated the samples, had a low L1 cache hit rate, and did not show significant TLB walks. + +{{% notice Hint %}} + +Consider how the `Particle` data structure maps to a 64-byte cache line. Also consider which member variables in the `Particle` struct are used in the hot loop. + +{{% /notice %}} + +After you make changes in `src/users_solution/`, rebuild the binary with the following commands: + +```bash +cd ~/Orbiting-Galaxy-Example/build +cmake --build . --parallel +``` + +Use the Performix GUI to assess performance changes for the `~/Orbiting-Galaxy-Example/build/users_solution` binary. A reference solution is available in `src/optimized`. 
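As a quick check on the hint above, the following C++ sketch works out how much of each 64-byte `Particle` the position update actually uses. It copies the baseline struct shown earlier in this Learning Path. The names `kCacheLineBytes`, `kHotBytes`, and `hot_fraction` are illustrative and not part of the repository, and the size check assumes a 4-byte `float`, which holds on mainstream Arm Linux toolchains.

```cpp
#include <cstddef>

// Baseline layout as shown earlier: 16 floats per particle.
struct Particle {
    float x, y, z;                    // position, read and written by the hot loop
    float vx, vy, vz;                 // velocity, read by the hot loop
    float mass, charge, temperature;  // not used by the position update
    float pressure, energy, density;  // not used by the position update
    float spin_x, spin_y, spin_z;     // not used by the position update
    float pad;                        // padding
};

// Cache line size reported for this system (64b-line in the Sysreport output).
constexpr std::size_t kCacheLineBytes = 64;

// Bytes the position update touches per particle: x, y, z, vx, vy, vz.
constexpr std::size_t kHotBytes = 6 * sizeof(float);

// Assumes sizeof(float) == 4, so one particle is exactly one cache line in size.
static_assert(sizeof(Particle) == kCacheLineBytes,
              "Particle is expected to be one cache line in size");

// Fraction of each particle that the hot loop uses.
constexpr double hot_fraction() {
    return static_cast<double>(kHotBytes) / static_cast<double>(sizeof(Particle));
}
```

Under these assumptions, the hot loop uses 24 of every 64 bytes it loads, or 37.5 percent of each particle. The rest of each loaded object is data the loop never reads, which is one reason to consider a layout that keeps the hot fields together.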
+ +To measure wall time and compare it against the baseline, run: + +```bash +/usr/bin/time -v ~/Orbiting-Galaxy-Example/build/users_solution +``` + +The hot loop is instrumented with `scopedTimer`, so you'll also see the loop duration printed directly to the terminal. Compare it with the baseline result of 571 milliseconds shown at the end of the section. + +## Optimize with an AI agent and the Arm MCP Server + +You can use the Arm Model Context Protocol (MCP) Server with a code assistant such as Kiro, Gemini, Codex, or GitHub Copilot to optimize the application. The MCP server includes direct tool support to invoke Performix on a remote target. It integrates with MCP-compatible coding assistants and can provide performance insights to create a useful feedback loop. The following example shows how to connect to OpenAI Codex. For other tools, see [your preferred coding assistant](/learning-paths/servers-and-cloud-computing/arm-mcp-server/1-overview/). + +{{% notice Note %}} + +You need an OpenAI account to use the Codex CLI. + +{{% /notice %}} + +[Install Docker](/install-guides/docker/) and pull the MCP server image. + +```bash +docker pull armlimited/arm-mcp:latest +``` + +To ensure the MCP server can invoke `performix` on remote machines, pass optional Docker arguments for your SSH private key and known hosts file. For example, use this TOML format for the Codex CLI by adding the following to `~/.codex/config.toml`: + +```toml +[mcp_servers.arm-mcp] +command = "docker" +args = [ + "run", + "--rm", + "-i", + "-v", "/path/to/your/workspace:/workspace", + "-v", "/path/to/your/ssh/private_key:/run/keys/ssh-key.pem:ro", + "-v", "/path/to/your/ssh/known_hosts:/run/keys/known_hosts:ro", + "armlimited/arm-mcp" +] +``` + +Restart Codex and ask your coding assistant to run the `memory access` recipe, interpret the results, and inspect the relevant source code. 
Your prompt can include the remote target, workload binary, and source directory: + +![Codex prompt requesting the Arm MCP server to run memory access and code hotspot recipes on the remote baseline workload, showing how to pass target, binary path, and source directory details.#center](./codex_prompt.webp "Prompting Codex to analyze the baseline workload with Arm MCP") + +Alternatively, you can use the curated [arm-full-optimization.md](https://github.com/arm/mcp/blob/main/agent-integrations/codex/arm-full-optimization.md) prompt file. + +## Review the optimized solution + +A reference solution is available in the `src/optimized` directory of the repository. The baseline stores a vector of `Particle*` values, where each `Particle` is allocated separately and contains all particle fields in one 64-byte structure. The hot loop needs only `x`, `y`, `z`, `vx`, `vy`, and `vz`, but the baseline layout still steps through whole particle objects and performs unnecessary pointer chasing. + +The optimized version changes the layout to a Structure of Arrays (SoA). Each field is stored in its own contiguous `std::vector`: + +```cpp +struct ParticlesSoA { + std::vector<float> x, y, z; + std::vector<float> vx, vy, vz; + std::vector<float> mass, charge, temperature; + std::vector<float> pressure, energy, density; + std::vector<float> spin_x, spin_y, spin_z; +}; +``` + +The `update_positions()` function then walks the hot position and velocity arrays directly: + +```cpp +void update_positions(ParticlesSoA& p, int n, float dt) { + for (int i = 0; i < n; ++i) { + p.x[i] += p.vx[i] * dt; + p.y[i] += p.vy[i] * dt; + p.z[i] += p.vz[i] * dt; + } +} +``` + +This removes `Particle*` indirection and improves cache-line utilization because the hot loop streams through only the data it uses. + +The following diagram compares the baseline and optimized layouts. Even though each particle is padded to a 64-byte cache line, many struct members are not read or written in the hot loop, so they remain cold. 
With a structure-of-arrays layout, all particles are still owned together, but cache lines contain more of the data that the loop actually touches. + +![Animation comparing baseline and structure-of-arrays layouts, showing how the optimized layout packs hot fields together so cache lines carry useful data for position updates.#center](./data_layout_comparison_compressed.gif) + +## Confirm with Performix + +To see what fully optimized results look like, run the Performix Memory Access recipe against the pre-built reference binary. In the Performix GUI, rerun the recipe and change the binary path from `~/Orbiting-Galaxy-Example/build/baseline` to `~/Orbiting-Galaxy-Example/build/optimized`. + +![Performix Memory Access results for the optimized binary showing 100 percent L1C load hits for the selected function and lower average L1C latency, confirming improved memory locality after the data layout change.#center](./performix_after_optimization.webp "Memory access results after the Structure of Arrays optimization") + +The optimized result shows much stronger L1 cache behavior. The hot update path now has `100%` L1C loads in the captured result and a lower average L1C latency than the baseline. This confirms that the data layout change improved locality, not just wall-clock time. + +## Measure wall time and memory usage + +Run the binaries directly on the remote machine without Performix to compare both wall time and memory usage: + +```bash +/usr/bin/time -v ~/Orbiting-Galaxy-Example/build/baseline +/usr/bin/time -v ~/Orbiting-Galaxy-Example/build/optimized +``` + +The hot loop is also instrumented with `scopedTimer`, so you can directly observe the speedup from the change. 
+ +```output +Baseline took 571 milliseconds + Command being timed: "./build/baseline" + User time (seconds): 0.66 + System time (seconds): 0.02 + Percent of CPU this job got: 99% + Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.69 + Average shared text size (kbytes): 0 + Average unshared data size (kbytes): 0 + Average stack size (kbytes): 0 + Average total size (kbytes): 0 + Maximum resident set size (kbytes): 92720 + Average resident set size (kbytes): 0 + Major (requiring I/O) page faults: 0 + Minor (reclaiming a frame) page faults: 22655 +... +Optimized took 279 milliseconds + Command being timed: "./build/optimized" + User time (seconds): 0.35 + System time (seconds): 0.02 + Percent of CPU this job got: 100% + Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.37 + Average shared text size (kbytes): 0 + Average unshared data size (kbytes): 0 + Average stack size (kbytes): 0 + Average total size (kbytes): 0 + Maximum resident set size (kbytes): 64044 + Average resident set size (kbytes): 0 + Major (requiring I/O) page faults: 0 + Minor (reclaiming a frame) page faults: 15500 +``` + + +| Metric | Baseline | Optimized | Explanation | +|-----------------------|--------------|--------------|---------------------------------------------------------------------------------------------| +| Wall time (ms) | 571 | 279 | The optimized layout improves cache usage and removes pointer chasing, roughly halving execution time. | +| Max RSS (KB) | 92,720 | 64,044 | Structure of Arrays reduces memory footprint by removing per-object overhead and cold fields. | +| Minor page faults | 22,655 | 15,500 | Fewer pages are touched due to more compact, contiguous storage of only needed data fields. | +| L1 cache hit rate (%) | 66.3 | 99.3 | Hot data is now accessed in a cache-friendly pattern, maximizing L1 cache effectiveness. | +| L1 avg latency (cycles)| 26.2 | 11.7 | Each L1 load takes fewer cycles because pointer chasing is removed. 
| + + +## What you've accomplished + +You used Arm Performix and the Arm MCP Server to identify a memory access bottleneck in a C++ particle simulation. You then connected the profile data to source code, found that the hot loop suffered from poor data layout and unnecessary pointer chasing, and improved the implementation with a Structure of Arrays layout. You validated the change with direct wall-time measurements and a second Performix run. + +This approach combines measurement tools, code context, and focused prompts to iterate on real bottlenecks. diff --git a/content/learning-paths/servers-and-cloud-computing/performix-memory-access/no_tlb_walks.webp b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/no_tlb_walks.webp new file mode 100644 index 0000000000..323d65bca2 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/no_tlb_walks.webp differ diff --git a/content/learning-paths/servers-and-cloud-computing/performix-memory-access/performix_after_optimization.webp b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/performix_after_optimization.webp new file mode 100644 index 0000000000..3b26d07994 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/performix_after_optimization.webp differ diff --git a/content/learning-paths/servers-and-cloud-computing/performix-memory-access/performix_before_optimizations.webp b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/performix_before_optimizations.webp new file mode 100644 index 0000000000..36632a6ab4 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/performix_before_optimizations.webp differ diff --git a/content/learning-paths/servers-and-cloud-computing/performix-memory-access/setup.webp b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/setup.webp new file 
mode 100644 index 0000000000..8723450f2f Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/setup.webp differ diff --git a/content/learning-paths/servers-and-cloud-computing/performix-memory-access/source_code.webp b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/source_code.webp new file mode 100644 index 0000000000..d2a3186812 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/source_code.webp differ diff --git a/content/learning-paths/servers-and-cloud-computing/performix-memory-access/topology.webp b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/topology.webp new file mode 100644 index 0000000000..1126c103e2 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/topology.webp differ diff --git a/content/learning-paths/servers-and-cloud-computing/sve/sve_armie.md b/content/learning-paths/servers-and-cloud-computing/sve/sve_armie.md index aa6a3dc189..4241d36a69 100644 --- a/content/learning-paths/servers-and-cloud-computing/sve/sve_armie.md +++ b/content/learning-paths/servers-and-cloud-computing/sve/sve_armie.md @@ -55,8 +55,8 @@ Compile the applications using the commands shown: {{< tab header="GNU" >}} gcc -march=armv8-a+sve -O3 -fopt-info-vec sve_add.c -o sve_add.exe {{< /tab >}} - {{< tab header="Arm Compiler for Linux" >}} - armclang --march=armv8-a+sve -O3 -Rpass=vector sve_add.c -o sve_add.exe + {{< tab header="Arm Toolchain for Linux" >}} + armclang -march=armv8-a+sve -O3 -Rpass=vector sve_add.c -o sve_add.exe {{< /tab >}} {{% /tabpane %}} @@ -97,7 +97,7 @@ You can also run the application containing SVE instructions using the the Arm I Download and install the [Arm Instruction Emulator](https://developer.arm.com/downloads/-/arm-instruction-emulator) (see [installation instructions](/install-guides/armie/) ) on any Arm v8-A system. 
The Arm Instruction Emulator intercepts and emulates unsupported SVE instructions. It also support plugins for application analysis. {{% notice Note %}} -The Arm Instruction Emulator has been deprecated. It is still available for download, but there is no active development. +As of May 2026, the latest release is 25.0. {{% /notice %}} ## Arm Instruction Emulator Usage diff --git a/content/learning-paths/servers-and-cloud-computing/sve/sve_basics.md b/content/learning-paths/servers-and-cloud-computing/sve/sve_basics.md index ed85ea51ba..bc11c26b4b 100644 --- a/content/learning-paths/servers-and-cloud-computing/sve/sve_basics.md +++ b/content/learning-paths/servers-and-cloud-computing/sve/sve_basics.md @@ -54,10 +54,10 @@ Run the program: ./sve ``` -On AWS Graviton3 processors, based on the Neoverse V2, the output is: +On 1st generation Arm AGI CPU processors, the output is: ```output -SVE vector length is: 32 bytes +SVE vector length is: 16 bytes ``` If the hardware doesn't support SVE the program will crash with an illegal instruction. diff --git a/content/learning-paths/servers-and-cloud-computing/sve/sve_compile.md b/content/learning-paths/servers-and-cloud-computing/sve/sve_compile.md index 4e119c3da4..8f8da53c8f 100644 --- a/content/learning-paths/servers-and-cloud-computing/sve/sve_compile.md +++ b/content/learning-paths/servers-and-cloud-computing/sve/sve_compile.md @@ -29,7 +29,7 @@ gfortran -march=armv8-a+sve myapp.f90 -o myapp_f90.out ### Autovectorization -With GCC autovectorization is enabled with the `-03` option. To disable autovectorization, use `-fno-tree-vectorize` compiler option. +With GCC, autovectorization is fully enabled with the high-level `-O3` option or manually with the `-ftree-vectorize` flag. To disable autovectorization, use the `-fno-tree-vectorize` compiler option. 
Compare the disassembly of a simple program shown below with and without the use of autovectorization: @@ -37,6 +37,14 @@ Compare the disassembly of a simple program shown below with and without the use Note the use of double-word register `d0`, `d1` instead of SVE registers `z0.d` and `z1` when you disable vectorization. +{{% notice Autovectorization on the Arm AGI CPU %}} + +If you are specifically targeting the 1st generation Arm AGI CPU, the `-mcpu=armagicpu` definition was added in [GCC 16.1.0](https://github.com/gcc-mirror/gcc/commit/0f5f728854d2ea93e6806a8632c04383502b0386). As of May 2026, this is the same as the `-mcpu=neoverse-v3ae` option available from [GCC 15](https://gcc.gnu.org/gcc-15/changes.html) onwards. However, in the future there may be differences between `neoverse-v3ae` and `armagicpu`. + +As such, we recommend installing the latest version of GCC/G++ if you are targeting the Arm AGI CPU. Use the `-mcpu=native` flag if compiling on the target machine or `-mcpu=armagicpu` if cross-compiling. + +{{% /notice %}} + ### Compiler insights With GCC, the use of compiler option `-fopt-info-vec` returns which loops were vectorized. To return which loop failed to vectorize, use the `-fopt-info-vec-missed` compiler option. 
@@ -55,9 +63,9 @@ Refer to the [Arm Performance Libraries install guide](/install-guides/armpl/) f gcc -O3 -march=armv8-a+sve -I $ARMPL_INCLUDES dgemm.c -o dgemm.out -L $ARMPL_LIBRARIES -larmpl ``` -## Compiling for SVE with Arm Compiler for Linux +## Compiling for SVE with Arm Toolchain for Linux (ATfL) -Shown below are example commands to compile an application with support for SVE instructions using Arm Compiler for Linux: +Shown below are example commands to compile an application with support for SVE instructions using Arm Toolchain for Linux: ### Arm C/C++ Compiler @@ -71,26 +79,37 @@ armclang -march=armv8-a+sve myapp.c -o myapp_c.out armflang -march=armv8-a+sve myapp.f90 -o myapp_f90.out ``` -### Compiling for a specific SVE target with Arm Compiler for Linux +### Compiling for a specific SVE target with Arm Toolchain for Linux -If you are compiling for a SVE-capable target, you can use the `-march=native` compiler option. For specific CPUs with SVE support, use the `-mcpu` option: +If you are compiling for an SVE-capable target, you can use the `-march=native` compiler option. To target specific CPUs with SVE support, use the `-mcpu` option: CPU | Flag ----------|--------- Neoverse-N1 | `-mcpu=neoverse-n1` Neoverse-V1 | `-mcpu=neoverse-v1` +Neoverse-V2 | `-mcpu=neoverse-v2` +Neoverse-V3 | `-mcpu=neoverse-v3` +Arm AGI CPU (first generation)* | `-mcpu=neoverse-v3ae` (as of ATfL 22.1.0) + +{{% notice Please Note %}} + +Support for the 1st generation Arm AGI CPU, based on the Neoverse V3-AE core, is expected to be added in LLVM version 23. Once available, ATfL is expected to support this target through the dedicated compiler option `-mcpu=armagicpu`. + +If you are targeting the Arm AGI CPU, we recommend using the latest available version of ATfL to ensure support for the most recent compiler optimizations and features. + +{{% /notice %}} ### Autovectorization -With Arm Compiler for Linux autovectorization is enabled with the `-02` option and above. 
To disable autovectorization, use `-fno-vectorize`. +With Arm Toolchain for Linux, autovectorization is enabled with the `-O2` option and above. To disable autovectorization, use `-fno-vectorize`. ### Compiler insights -With Arm Compiler for Linux, the option `-Rpass=vector` and `-Rpass=sve-loop-vectorize` return which loops were vectorized. To return the loops that failed to vectorize, use `-Rpass-missed=vector`. +With Arm Toolchain for Linux, the options `-Rpass=vector` and `-Rpass=sve-loop-vectorize` return which loops were vectorized. To return the loops that failed to vectorize, use `-Rpass-missed=vector`. ### Use Arm Performance Libraries -To use Arm Performance Libraries with Arm Compiler for Linux use the `-armpl=sve` option. This ensures the SVE version of the library is used. Example command shown here: +To use Arm Performance Libraries with Arm Toolchain for Linux, use the `-armpl=sve` option. This ensures the SVE version of the library is used. Example command shown here: ```bash armclang -O3 -march=armv8-a+sve -armpl=sve dgemm.c -o dgemm.out diff --git a/content/learning-paths/servers-and-cloud-computing/xgboost-on-axion/_index.md b/content/learning-paths/servers-and-cloud-computing/xgboost-on-axion/_index.md index 460dc828d5..f6be502b7b 100644 --- a/content/learning-paths/servers-and-cloud-computing/xgboost-on-axion/_index.md +++ b/content/learning-paths/servers-and-cloud-computing/xgboost-on-axion/_index.md @@ -1,9 +1,5 @@ --- title: Train and deploy XGBoost models on Google Cloud C4A Axion VM - -draft: true -cascade: - draft: true description: Set up XGBoost on Google Cloud C4A Axion Arm VMs running SUSE Linux to train machine learning models, tune model performance, benchmark large-scale datasets, and deploy trained models as REST APIs. 
@@ -12,10 +8,10 @@ minutes_to_complete: 90 who_is_this_for: This is an introductory topic for DevOps engineers, ML engineers, data engineers, and software developers who want to train and deploy XGBoost machine learning models on SUSE Linux Enterprise Server (SLES) Arm64, optimize model performance, benchmark training workloads, and expose models through scalable inference APIs. learning_objectives: - - Install and configure XGBoost on Google Cloud C4A Axion processors for Arm64 - - Train and evaluate machine learning models using XGBoost - - Tune model hyperparameters and benchmark large-scale datasets - - Deploy trained XGBoost models as REST APIs and validate inference workflows + - Install and configure XGBoost on Google Cloud C4A Axion processors for Arm64 + - Train and evaluate machine learning models using XGBoost + - Tune model hyperparameters and benchmark large-scale datasets + - Deploy trained XGBoost models as REST APIs and validate inference workflows prerequisites: - A [Google Cloud Platform (GCP)](https://cloud.google.com/free) account with billing enabled diff --git a/content/learning-paths/servers-and-cloud-computing/xgboost-on-axion/background.md b/content/learning-paths/servers-and-cloud-computing/xgboost-on-axion/background.md index 43e5941f5e..2ed0f99c2b 100644 --- a/content/learning-paths/servers-and-cloud-computing/xgboost-on-axion/background.md +++ b/content/learning-paths/servers-and-cloud-computing/xgboost-on-axion/background.md @@ -1,5 +1,5 @@ --- -title: Learn about XGBoost and Google Axion C4A for machine learning +title: Understand XGBoost and Google Axion C4A for machine learning weight: 2 layout: "learningpathall" @@ -7,17 +7,19 @@ layout: "learningpathall" ## Google Axion C4A Arm instances for machine learning -Google Axion C4A is a family of Arm-based virtual machines built on Google’s custom Axion CPU, which is based on Arm Neoverse V2 cores. 
Designed for high-performance and energy-efficient computing, these virtual machines offer strong performance for modern cloud workloads such as CI/CD pipelines, microservices, media processing, and general-purpose applications. +Google Axion C4A is a family of Arm-based virtual machines built on Google’s custom Axion CPU, which is based on Arm Neoverse V2 cores. Designed for high-performance and energy-efficient computing, these virtual machines offer strong performance for modern cloud workloads. Such workloads include CI/CD pipelines, microservices, media processing, and general-purpose applications. -The C4A series provides a cost-effective alternative to x86 virtual machines while using the scalability and performance benefits of the Arm architecture in Google Cloud. +The C4A series provides a cost-effective alternative to x86 virtual machines while benefiting from the scalability and performance of Arm architecture in Google Cloud. To learn more, see the Google blog [Introducing Google Axion Processors, our new Arm-based CPUs](https://cloud.google.com/blog/products/compute/introducing-googles-new-arm-based-cpu). ## XGBoost for scalable machine learning on Arm -XGBoost (Extreme Gradient Boosting) is a high-performance machine learning library designed for supervised learning tasks such as classification, regression, and ranking. It's widely used for tabular machine learning workloads because of its speed, scalability, and strong predictive accuracy. It provides parallelized tree boosting for fast model training, built-in regularization to reduce overfitting, hyperparameter tuning support, and efficient handling of large-scale datasets. XGBoost also supports model export and deployment for inference workloads, making it suitable for both experimentation and production use. +Extreme Gradient Boosting (XGBoost) is a high-performance machine learning library designed for supervised learning tasks such as classification, regression, and ranking. 
XGBoost is used for tabular machine learning workloads because of its speed, scalability, and strong predictive accuracy. It provides parallelized tree boosting for fast model training and built-in regularization to reduce overfitting. -Running XGBoost on Google Axion C4A Arm-based infrastructure enables efficient execution of machine learning workloads by using the high core-count architecture and optimized memory bandwidth available on Arm processors. This helps improve performance-per-watt, reduce infrastructure costs, and scale machine learning pipelines efficiently. +XGBoost supports hyperparameter tuning and efficient handling of large-scale datasets. It also supports model export and deployment for inference workloads, making it suitable for both experimentation and production use. + +By running XGBoost on Google Axion C4A Arm-based infrastructure, you can execute machine learning workloads efficiently with the high core-count architecture and optimized memory bandwidth available on Arm processors. This helps improve performance-per-watt, reduce infrastructure costs, and scale machine learning pipelines efficiently. Common use cases include fraud detection, recommendation systems, customer churn prediction, financial forecasting, and real-time inference APIs. XGBoost integrates with Python machine learning ecosystems such as scikit-learn, NumPy, and pandas, making it a practical choice across the full workflow from experimentation to production. @@ -25,4 +27,6 @@ To learn more, see the [XGBoost documentation](https://xgboost.readthedocs.io/en ## What you've learned and what's next -This section introduced Google Axion C4A Arm-based virtual machines and XGBoost as a high-performance machine learning library suited to Arm processors. Next, you'll create a firewall rule to expose the inference API port, then provision a C4A VM for model training and deployment. 
+You've now learned about Google Axion C4A Arm-based virtual machines and XGBoost as a high-performance machine learning library suited to Arm processors. + +Next, you'll create a firewall rule to expose the inference API port, then provision a C4A virtual machine for model training and deployment using XGBoost. diff --git a/content/learning-paths/servers-and-cloud-computing/xgboost-on-axion/deploy-xgboost-inference-api.md b/content/learning-paths/servers-and-cloud-computing/xgboost-on-axion/deploy-xgboost-inference-api.md index d8e5e4a42f..0c137cafa4 100644 --- a/content/learning-paths/servers-and-cloud-computing/xgboost-on-axion/deploy-xgboost-inference-api.md +++ b/content/learning-paths/servers-and-cloud-computing/xgboost-on-axion/deploy-xgboost-inference-api.md @@ -1,23 +1,23 @@ --- -title: Deploy and access XGBoost inference API -weight: 8 +title: Deploy and access an XGBoost inference API +weight: 6 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## Deploy XGBoost inference API on SUSE Linux +## Run the XGBoost inference API on SUSE Linux -In this section, you deploy the trained XGBoost model as a Flask-based inference API on the GCP Axion Arm64 VM and test it with a sample prediction request. +In this section, you'll deploy the trained XGBoost model as a Flask-based inference API on the Google Axion Arm64 VM and test it with a sample prediction request. -Navigate to the XGBoost project directory and activate the virtual environment: +If you're continuing in the same SSH session from the previous section, the `xgb-env` virtual environment is already active and your working directory is `~/xgboost-learning-path`. If you've opened a new session, re-activate the environment and navigate to the project directory: ```bash cd ~/xgboost-learning-path source xgb-env/bin/activate ``` -## Install Flask +### Install Flask Flask is a lightweight Python web framework used to serve the XGBoost model over HTTP. 
Add it to the requirements file and install it: @@ -51,7 +51,7 @@ The output is similar to: Flask 3.1.3 ``` -## Create the inference API +### Create the inference API Create a Flask application that loads the trained XGBoost model and exposes two endpoints: a `GET /` route for browser health checks, and a `POST /predict` route that accepts a JSON array of features and returns a prediction. The model is loaded once at startup using `joblib` so it doesn't need to be reloaded on every request: @@ -95,7 +95,7 @@ if __name__ == "__main__": EOF ``` -## Start the inference API +### Start the inference API Start the Flask server in the background so you can continue using the same terminal for testing: @@ -113,7 +113,7 @@ The output is similar to: The server is now listening on all network interfaces, including the VM's external IP on port 8080. -## Access the API from a browser +### Access the API from a browser Open your browser and navigate to the VM public IP on port 8080: @@ -131,9 +131,9 @@ The page displays the HTML response from the `/` route, confirming the API is ru ![Browser window showing the XGBoost Inference API homepage running on a Google Cloud Axion Arm64 virtual machine. The page confirms that the inference API is active and accessible externally through port 8080 using the VM public IP address.#center](images/xgboost-api.png "XGBoost inference API running on Google Cloud Axion Arm64") -## Test inference +### Test inference -Send a prediction request to the `/predict` endpoint using `curl`. The input data is a 30-feature vector from the breast cancer dataset — the same format used during training. The `features` array must contain exactly 30 values to match the model's expected input shape: +Send a prediction request to the `/predict` endpoint using `curl`. The input data is a 30-feature vector from the breast cancer dataset — the same format used during training. 
The `features` array must contain exactly 30 values to match the model's expected input shape. For example: ```bash curl -X POST http://127.0.0.1:8080/predict \ @@ -147,8 +147,8 @@ The output is similar to: {"prediction":0} ``` -A prediction of `0` corresponds to a malignant classification in the breast cancer dataset (where `0` = malignant, `1` = benign). The model received the feature array, ran inference, and returned the result through the REST API. +A prediction of `0` corresponds to a malignant classification in the breast cancer dataset, and a prediction of `1` corresponds to a benign classification. The model received the feature array, ran inference, and returned the result through the REST API. -## What you've accomplished and what's next +## What you've accomplished -You've successfully deployed a trained XGBoost model as a Flask REST API on a GCP Axion Arm64 VM, confirmed browser access through the external IP, and validated inference with a live prediction request. +You've successfully deployed a trained XGBoost model as a Flask REST API on a Google Axion Arm64 VM, confirmed browser access through the external IP, and validated inference with a live prediction request. diff --git a/content/learning-paths/servers-and-cloud-computing/xgboost-on-axion/firewall.md b/content/learning-paths/servers-and-cloud-computing/xgboost-on-axion/firewall.md index 6792cf2b1c..3f3faaeb6e 100644 --- a/content/learning-paths/servers-and-cloud-computing/xgboost-on-axion/firewall.md +++ b/content/learning-paths/servers-and-cloud-computing/xgboost-on-axion/firewall.md @@ -1,5 +1,5 @@ --- -title: Configure Google Cloud firewall rules for XGBoost +title: Create Google Cloud firewall rules for XGBoost weight: 3 ### FIXED, DO NOT MODIFY @@ -8,25 +8,26 @@ layout: learningpathall ## Allow inbound access to the XGBoost inference API -Create a firewall rule in Google Cloud Console to expose the required port for the XGBoost inference API and browser-based access. 
+Create a firewall rule in the Google Cloud console to expose the required port for the XGBoost inference API and browser-based access. -{{% notice Note %}} For help with GCP setup, see the Learning Path [Getting started with Google Cloud Platform](/learning-paths/servers-and-cloud-computing/csp/google/).{{% /notice %}} +{{% notice Note %}} For help with Google Cloud Platform setup, see the Learning Path [Getting started with Google Cloud Platform](/learning-paths/servers-and-cloud-computing/csp/google/).{{% /notice %}} -## Configure the firewall rule in Google Cloud Console +### Configure the firewall rule in the Google Cloud console To configure a firewall rule for the XGBoost inference API: -1. Navigate to the [Google Cloud Console](https://console.cloud.google.com/), go to **VPC Network > Firewall**, and select **Create firewall rule**. +1. Navigate to the [Google Cloud console](https://console.cloud.google.com/). +2. Go to **VPC Network > Firewall**, then select **Create firewall rule**. -![Google Cloud Console VPC Network Firewall page showing the Create firewall rule button in the top menu bar#center](images/firewall-rule.png "Create a firewall rule in Google Cloud Console") +![Google Cloud console VPC Network Firewall page showing the Create firewall rule button in the top menu bar#center](images/firewall-rule.png "Create a firewall rule in the Google Cloud console") -2. Set **Name** to `allow-xgboost-8080`, then select the network you want to bind to your virtual machine. +3. Set **Name** to `allow-xgboost-8080`, then select the network you want to bind to your virtual machine. -3. Set **Direction of traffic** to **Ingress** and **Action on match** to **Allow**. +4. Set **Direction of traffic** to **Ingress** and **Action on match** to **Allow**. -4. Set **Targets** to **Specified target tags** and enter `allow-xgboost-8080` in the **Target tags** field. +5. Set **Targets** to **Specified target tags** and enter `allow-xgboost-8080` in the **Target tags** field. 
You'll use this tag when you create a virtual machine in the next section. -5. Set **Source IPv4 ranges** to your current machine's public IP address. Run the following command in a terminal on your local machine to find it: +6. Set **Source IPv4 ranges** to your current machine's public IP address. Run the following command in a terminal on your local machine to find it: ```bash curl -4 ifconfig.me @@ -34,12 +35,12 @@ curl -4 ifconfig.me Take the returned address and append `/32` to convert it to CIDR notation, for example `203.0.113.42/32`. Restricting access to your own IP prevents port 8080 from being exposed to the public internet. -{{% notice Note %}}If your IP address changes or you need to access the API from a different machine, update this field with the new IP address. Using `0.0.0.0/0` opens the port to all traffic and is not recommended.{{% /notice %}} +{{% notice Note %}} If your IP address changes or you need to access the API from a different machine, update this field with the new IP address. Using `0.0.0.0/0` opens the port to all traffic and is not recommended.{{% /notice %}} ![Google Cloud Console Create firewall rule form with Name set to allow-xgboost-8080, Targets set to Specified target tags, and Direction of traffic set to Ingress#center](images/network-rule.png "Configuring the allow-xgboost-8080 firewall rule") -6. Under **Protocols and ports**, select **Specified protocols and ports**. -7. Select the **TCP** checkbox and enter: +7. Under **Protocols and ports**, select **Specified protocols and ports**. +8. Select the **TCP** checkbox and enter: ```text 8080 @@ -49,10 +50,10 @@ Port 8080 is used by the XGBoost inference API for browser-based validation and ![Google Cloud Console Protocols and ports section with TCP selected and port 8080 entered#center](images/network-port.png "Setting XGBoost ports in the firewall rule") -8. Select **Create**. +9. Select **Create**. 
## What you've accomplished and what's next -You've created a firewall rule to expose the XGBoost inference API externally. You also enabled browser access and remote API connectivity for inference testing. +You've now created a firewall rule to expose the XGBoost inference API to your public IP address. You've also enabled browser access and remote API connectivity for inference testing. Next, you'll create a Google Axion C4A Arm virtual machine and attach it to this firewall rule. diff --git a/content/learning-paths/servers-and-cloud-computing/xgboost-on-axion/install-train-tune-xgboost.md b/content/learning-paths/servers-and-cloud-computing/xgboost-on-axion/install-train-tune-xgboost.md index b63fffce28..046058538b 100644 --- a/content/learning-paths/servers-and-cloud-computing/xgboost-on-axion/install-train-tune-xgboost.md +++ b/content/learning-paths/servers-and-cloud-computing/xgboost-on-axion/install-train-tune-xgboost.md @@ -1,6 +1,6 @@ --- title: Install XGBoost and train machine learning models -weight: 7 +weight: 5 ### FIXED, DO NOT MODIFY layout: learningpathall @@ -8,18 +8,18 @@ layout: learningpathall ## Install and configure XGBoost on SUSE Linux -In this section, you install XGBoost on a GCP Axion Arm64 VM running SUSE Linux with Python 3.11, train a machine learning model, and tune model performance using hyperparameter optimization. +In this section, you'll install XGBoost on a Google Axion Arm64 VM running SUSE Linux with Python 3.11. You'll then train a machine learning model and tune model performance using hyperparameter optimization. -## Update the system +### Update system packages -Update all system packages before installing Python and machine learning dependencies. This avoids package conflicts and ensures the latest security updates are applied: +Update all system packages before installing Python and machine learning dependencies. 
By updating packages, you can avoid package conflicts and ensure the latest security updates are applied: ```bash sudo zypper refresh sudo zypper update -y ``` -## Install required dependencies +### Install XGBoost dependencies Install Python 3.11, development libraries, and build tools required for XGBoost: @@ -35,13 +35,13 @@ sudo zypper install -y \ wget ``` -Verify that Python 3.11 is installed correctly. +Verify that Python 3.11 is installed correctly: ```bash python3.11 --version ``` -## Create a Python environment +### Create a Python virtual environment Create a dedicated project directory and an isolated Python virtual environment for the XGBoost Learning Path. Using a virtual environment keeps the XGBoost packages separate from the system Python installation: @@ -55,7 +55,7 @@ source xgb-env/bin/activate The virtual environment helps isolate Python packages from the system installation. -## Upgrade pip +### Upgrade pip Upgrade pip, setuptools, and wheel to ensure compatibility with the latest Python packages and build dependencies: @@ -63,7 +63,7 @@ Upgrade pip, setuptools, and wheel to ensure compatibility with the latest Pytho pip install --upgrade pip setuptools wheel ``` -## Create a requirements file +### Create a requirements file Create a requirements file listing the Python packages needed for XGBoost training and benchmarking: @@ -78,7 +78,7 @@ joblib EOF ``` -## Install dependencies +### Install machine learning dependencies Install all machine learning dependencies inside the virtual environment: @@ -86,7 +86,7 @@ Install all machine learning dependencies inside the virtual environment: pip install -r requirements.txt ``` -Verify that XGBoost is installed correctly. 
+Verify that XGBoost is installed correctly: ```bash python -c "import xgboost; print(xgboost.__version__)" @@ -97,8 +97,11 @@ The output is similar to: ```output 3.2.0 ``` +## Train a machine learning model -## Create XGBoost training script +After configuring XGBoost, you'll train and tune an XGBoost classification model. + +### Create XGBoost training script In this step, you'll create a machine learning training script using the Breast Cancer dataset from Scikit-learn. The script trains an XGBoost classification model, measures training time, evaluates accuracy, and saves the trained model for inference. @@ -150,7 +153,8 @@ print("Model saved successfully") EOF ``` -## Train the model +### Start model training + Run the training script to start XGBoost model training on the Arm64 processor. ```bash @@ -167,7 +171,7 @@ Model saved successfully The model trained on the breast cancer dataset and saved both a JSON and a pickle file to the project directory. -## Verify generated model files +### Verify generated model files Verify that the trained model files were created successfully after training. ```bash @@ -186,7 +190,8 @@ drwxr-xr-x 5 user user 4.0K May 13 10:20 xgb-env The `.json` and `.pkl` files are the trained model artifacts used later for inference API deployment. -## Hyperparameter tuning +## Use GridSearchCV to tune hyperparameters + In this step, you'll optimize model performance using GridSearchCV and multiple hyperparameter combinations. The script tests different values for tree depth, learning rate, and estimators to identify the best-performing model configuration. @@ -249,10 +254,13 @@ Best Parameters: The search tested 12 combinations (3 depths × 2 learning rates × 2 estimator counts) using 3-fold cross-validation. The best parameters shown are the combination that produced the highest cross-validated accuracy. ## Benchmark large-scale training + In this step, you'll benchmark XGBoost training performance using a larger synthetic dataset. 
The benchmark simulates large-scale tabular machine learning workloads on the GCP Axion Arm64 processor. +### Create the benchmark script + Create benchmark script: ```bash @@ -285,7 +293,8 @@ print(f"\nBenchmark completed in {end - start:.2f} seconds") EOF ``` -## Run benchmark +### Run the benchmark script + Run the benchmark script to measure large-scale training performance on Arm64. ```bash @@ -302,4 +311,6 @@ The benchmark used a synthetic dataset of 500,000 samples and 50 features. Your ## What you've accomplished and what's next -You've installed XGBoost on a GCP Axion Arm64 VM, trained a classification model on the breast cancer dataset, tuned hyperparameters with GridSearchCV, and benchmarked large-scale training performance. Next, you'll deploy the trained model as a Flask inference API and access it from your browser using the VM public IP. +You've now installed XGBoost on a GCP Axion Arm64 VM, trained and saved a classification model on the breast cancer dataset, tuned hyperparameters with GridSearchCV, and benchmarked large-scale training performance. + +Next, you'll deploy the trained model as a Flask inference API and access it from your browser using the VM public IP. diff --git a/content/learning-paths/servers-and-cloud-computing/xgboost-on-axion/instance.md b/content/learning-paths/servers-and-cloud-computing/xgboost-on-axion/instance.md index fa4c0a3bc5..f6d6cef485 100644 --- a/content/learning-paths/servers-and-cloud-computing/xgboost-on-axion/instance.md +++ b/content/learning-paths/servers-and-cloud-computing/xgboost-on-axion/instance.md @@ -8,18 +8,20 @@ layout: learningpathall ## Set up the virtual machine -Create a Google Axion C4A Arm-based virtual machine on Google Cloud Platform. This Learning Path uses the `c4a-standard-4` machine type, which provides 4 vCPUs and 16 GB of memory. This VM hosts XGBoost model training, hyperparameter tuning, benchmarking, and the inference API. 
+Create a Google Axion C4A Arm-based virtual machine (VM) on Google Cloud Platform. For this Learning Path, you'll use the `c4a-standard-4` machine type, which provides 4 vCPUs and 16 GB of memory. -{{% notice Note %}}For help with GCP setup, see the Learning Path [Getting started with Google Cloud Platform](/learning-paths/servers-and-cloud-computing/csp/google/).{{% /notice %}} +The VM that you'll create will host XGBoost model training, hyperparameter tuning, benchmarking, and the inference API. -To create a C4A virtual machine in the Google Cloud Console: +{{% notice Note %}}For help with Google Cloud Platform setup, see the Learning Path [Getting started with Google Cloud Platform](/learning-paths/servers-and-cloud-computing/csp/google/).{{% /notice %}} -1. Navigate to the [Google Cloud Console](https://console.cloud.google.com/). +To create a C4A VM in the Google Cloud console: + +1. Navigate to the [Google Cloud console](https://console.cloud.google.com/). 2. Go to **Compute Engine** > **VM Instances** and select **Create Instance**. 3. Under **Machine configuration**, populate fields such as **Instance name**, **Region**, and **Zone**. 4. Set **Series** to `C4A`, then select `c4a-standard-4` for **Machine type**. -![Screenshot of the Google Cloud Console showing the Machine configuration section. The Series dropdown is set to C4A and the machine type c4a-standard-4 is selected#center](images/gcp-vm.png "Configuring machine type to C4A in Google Cloud Console") +![Screenshot of the Google Cloud console showing the Machine configuration section. The Series dropdown is set to C4A and the machine type c4a-standard-4 is selected#center](images/gcp-vm.png "Selecting machine type as C4A in the Google Cloud console") 5. Under **OS and storage**, select **Change** and then choose an Arm64-based operating system image. For this Learning Path, select **SUSE Linux Enterprise Server**. 6. For the license type, choose **Pay as you go**. 
@@ -30,7 +32,7 @@ To create a C4A virtual machine in the Google Cloud Console: After the instance starts, select **SSH** next to the VM in the instance list to open a browser-based terminal session. -![Google Cloud Console VM instances page displaying running instance with green checkmark and SSH button in the Connect column#center](images/gcp-pubip-ssh.png "Connecting to a running C4A VM using SSH") +![Google Cloud console VM instances page displaying running instance with green checkmark and SSH button in the Connect column#center](images/gcp-pubip-ssh.png "Connecting to a running C4A VM using SSH") A new browser window opens with a terminal connected to your VM. @@ -38,4 +40,6 @@ A new browser window opens with a terminal connected to your VM. ## What you've accomplished and what's next -You've provisioned a Google Axion C4A Arm VM and connected to it using SSH. The VM is linked to the firewall rule that exposes port 8080 for the XGBoost inference API. Next, you'll install XGBoost and configure a Python 3.11 environment for model training and benchmarking. +You've now provisioned a Google Axion C4A Arm VM and connected to it using SSH. The VM is linked to the firewall rule that allows access to port 8080 for the XGBoost inference API. + +Next, you'll install XGBoost and configure a Python 3.11 environment for model training and benchmarking.