Original file line number Diff line number Diff line change
@@ -1,11 +1,7 @@
---
title: Analyze memory access behavior using Arm Performix and the Arm MCP Server
title: Optimize memory access behavior using Arm Performix and the Arm MCP Server

draft: true
cascade:
draft: true

description: Learn how to profile memory access behavior in a C++ particle simulation on Arm Linux using the Arm Performix Memory Access recipe through the Arm MCP Server.
description: Learn how to profile and optimize memory access behavior in a C++ particle simulation on Arm Linux using the Arm Performix Memory Access recipe through the Arm MCP Server.

minutes_to_complete: 45

@@ -14,7 +10,7 @@ who_is_this_for: This is an introductory topic for C++ developers who want to us
learning_objectives:
- Explain how L1 cache hits, TLB misses, and page walks affect C++ application performance.
- Build and visualize the orbiting galaxies example on an Arm Neoverse server.
- Inspect and optimize particle data structure using insights from the memory access recipe.
- Inspect and optimize the particle data structure using insights from the memory access recipe.
- Use the Arm MCP Server in combination with Arm Performix for an agentic solution.

prerequisites:
@@ -33,7 +29,7 @@ armips:
tools_software_languages:
- Arm Performix
- MCP
- C
- C++
- CMake
- Python
- perf
@@ -8,18 +8,18 @@ layout: learningpathall

## Review of the CPU memory hierarchy

This section recaps the memory hierarchy concepts the worked example builds on. It is not an exhaustive explanation, but covers what you need to interpret the profiling results.
In this section, you'll learn the memory hierarchy concepts the worked example builds on. It's not an exhaustive explanation, but it covers what you'll need to interpret the profiling results.

Modern Arm Neoverse server CPUs use a hierarchy of memories to reduce the cost of loading and storing data. The fastest storage sits close to each CPU core, while larger memories sit farther away and take more cycles to access.

You typically see:
You usually see the following:

- L1 data cache (`L1d`) and L1 instruction cache (`L1i`) close to each core with each access usually taking up to 10 cycles.
- L2 cache, often private to each core, with each access usually taking 10-20 cycles.
- Last-level cache, often shared across multiple cores, and usually taking 20+ cycles.
- L2 cache, often private to each core, with each access usually taking 10 to 20 cycles.
- Last-level cache, often shared across multiple cores, and usually taking more than 20 cycles.
- DRAM, which is much larger but much slower than on-chip cache.
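
To see how the share of loads served by each level combines into an average load cost, the following minimal sketch computes a weighted average latency. The `LevelShare` type, the `average_load_latency` function, and any numbers you pass in are illustrative assumptions, not measurements from a specific Neoverse core.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper type: the fraction of loads served by one level of the
// memory hierarchy and an assumed latency for that level, in cycles.
struct LevelShare {
    double fraction;  // share of loads served by this level, 0.0 to 1.0
    double cycles;    // assumed access latency for this level
};

// Returns the weighted average latency across all levels.
// The caller is expected to pass fractions that sum to 1.0.
inline double average_load_latency(const std::vector<LevelShare>& levels) {
    double total = 0.0;
    for (const LevelShare& level : levels) {
        assert(level.fraction >= 0.0 && level.cycles >= 0.0);
        total += level.fraction * level.cycles;
    }
    return total;
}
```

For example, if half of all loads are served in 4 cycles and the other half in 20 cycles, `average_load_latency({{0.5, 4.0}, {0.5, 20.0}})` returns 12 cycles. Even a modest share of slower accesses dominates the average.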

You can inspect cache topology on an Arm Neoverse server with Arm's [Sysreport](/learning-paths/servers-and-cloud-computing/sysreport/) tool or the `lscpu` command. Unlike `lscpu`, Sysreport also reports the set associativity for each cache level. For example, you can run the following command on a system with `git` and `python` installed:
To inspect cache topology on an Arm Neoverse server, see the [Learning Path for Arm's Sysreport tool](/learning-paths/servers-and-cloud-computing/sysreport/) or use the `lscpu` command. Unlike `lscpu`, Sysreport also reports the set associativity for each cache level. For example, you can run the following command on a system with `git` and `python` installed:

```bash
git clone https://github.com/ArmDeveloperEcosystem/sysreport.git
@@ -50,29 +50,34 @@ hwloc-ls --of png > topology.png

![Hardware locality topology for an Arm server showing per-core L1 and L2 caches and a shared L3 cache across all cores, which helps you verify cache hierarchy before profiling.#center](./topology.webp "Example hardware locality topology")

The diagram illustrates cache tiers on an AWS Graviton3 bare metal server based on Neoverse V1. Each of the 64 cores has private `L1d`, `L1i`, and `L2` caches, and all cores share one `L3` cache, sometimes referred to as last-level cache (LLC). Cache sizes, especially at later levels, are not fixed by the Neoverse architecture; implementers such as AWS or Google can configure larger or smaller caches based on design goals.
The diagram shows cache tiers on an AWS Graviton3 bare metal server based on Neoverse V1. Each of the 64 cores has private `L1d`, `L1i`, and `L2` caches, and all cores share one `L3` cache, sometimes referred to as last-level cache (LLC). Cache sizes, especially at later levels, are not fixed by the Neoverse architecture. Implementers such as AWS or Google can configure larger or smaller caches based on design goals.

NUMA, or non-uniform memory access, means memory latency can depend on which processor or socket owns the memory being accessed. On this AWS Graviton3 instance, there is only one NUMA node.
Non-uniform memory access (NUMA) means memory latency can depend on which processor or socket owns the memory being accessed. On this AWS Graviton3 instance, there is only one NUMA node.

To get a comprehensive system-level understanding of the memory subsystem, see the Learning Path on the [Arm system characterization tool](/learning-paths/servers-and-cloud-computing/memory-subsystem/).

## Memory and translation terminology

Applications use virtual addresses, which are the addresses a program sees instead of physical DRAM locations. Virtual addressing lets the operating system isolate processes, protect memory, and map each program's address space to available physical memory. The processor translates virtual addresses to physical addresses before it accesses memory.
Applications use virtual addresses, which are the addresses a program sees instead of physical DRAM locations. With virtual addressing, the operating system isolates processes, protects memory, and maps each program's address space to available physical memory. The processor translates virtual addresses to physical addresses before it accesses memory.

### Translation lookaside buffer (TLB)
### Translation lookaside buffer

The translation lookaside buffer (TLB) caches recent virtual-to-physical translations at page granularity to avoid page table walks. A TLB miss occurs when the needed translation is not cached, so the processor performs a page table walk to find the mapping. Page walks add latency before a load or store can complete. Large working sets and irregular access patterns, such as strides larger than the typical 4KB page size, can increase TLB pressure because the program touches many pages with little reuse.
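
The effect of stride on TLB pressure can be sketched with simple arithmetic. The following hypothetical helper, `pages_touched`, counts how many distinct pages a strided access pattern touches. It assumes the first access is page-aligned and uses a 4096-byte page by default, and it doesn't model any particular TLB.

```cpp
#include <cassert>
#include <cstddef>

// Counts how many distinct pages a strided access pattern touches.
// Assumes the first access is at a page-aligned address, so byte offsets
// start at 0 and grow by stride_bytes on every access.
inline std::size_t pages_touched(std::size_t access_count,
                                 std::size_t stride_bytes,
                                 std::size_t page_bytes = 4096) {
    assert(page_bytes > 0);
    if (access_count == 0) {
        return 0;
    }
    std::size_t pages = 1;          // the first access always touches one page
    std::size_t previous_page = 0;  // page index of offset 0
    for (std::size_t i = 1; i < access_count; ++i) {
        const std::size_t page = (i * stride_bytes) / page_bytes;
        if (page != previous_page) {
            ++pages;  // offsets only increase, so a new index is a new page
            previous_page = page;
        }
    }
    return pages;
}
```

With these assumptions, 1024 accesses at a 64-byte stride touch 16 pages, while the same number of accesses at a 4096-byte stride touch 1024 pages. Each of those pages needs its own translation.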

### Page faults

A minor page fault is usually harmless: the data is already in RAM, and the kernel only creates the mapping. This commonly happens during anonymous paging when Linux lazily backs newly allocated heap or stack memory on first touch. A major page fault is more expensive because the kernel must fetch the page from disk, such as from a file or swap, so repeated major faults are usually a real performance concern.
A minor page fault is usually harmless: the data is already in RAM, and the kernel only creates the mapping. This fault commonly happens during anonymous paging when Linux lazily backs newly allocated heap or stack memory on first touch. A major page fault is more expensive because the kernel must fetch the page from disk, such as from a file or swap, so repeated major faults are a real performance concern.

### Working set size

The working set is the data your program actively touches during a period of execution. It differs from resident set size (RSS), which is the amount of physical memory currently resident for a process. A process can have a large RSS while the hot loop actively uses only a smaller working set.

### Memory access from a programmer's perspective
## Memory access from a programmer's perspective

From a programmer's perspective, much of the cache and memory subsystem is a black box defined by processor architecture and implementation. Features such as cache associativity, prefetching, and translation caching are designed to hide latency across many workloads. Your main software levers are data structure layout, allocation patterns, and choices such as page size. The layout of your C++ data structures can determine whether the memory hierarchy helps or hurts performance. The compiler generally cannot reorder structure fields or split objects automatically because that would change program semantics.
From a programmer's perspective, much of the cache and memory subsystem is a black box defined by processor architecture and implementation. Features such as cache associativity, prefetching, and translation caching are designed to hide latency across many workloads. Your main software levers are data structure layout, allocation patterns, and choices such as page size. The layout of your C++ data structures can determine whether the memory hierarchy helps or hurts performance. The compiler generally can't reorder structure fields or split objects automatically because that would change program semantics.
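
As a small illustration of why layout is the programmer's responsibility, the following sketch declares the same three fields in two different orders. `PaddedRecord` and `ReorderedRecord` are hypothetical types, not part of the example repository. The compiler keeps the fields in declaration order, so the padding it inserts depends on the order you choose.

```cpp
#include <cstdint>

// Hypothetical record whose declaration order forces padding around `value`.
struct PaddedRecord {
    std::uint8_t flag;   // 1 byte, followed by padding up to alignof(double)
    double       value;  // typically 8 bytes with 8-byte alignment
    std::uint8_t kind;   // 1 byte, followed by tail padding
};

// The same fields with the largest member first, so the small members
// share one padded tail instead of each needing its own padding.
struct ReorderedRecord {
    double       value;
    std::uint8_t flag;
    std::uint8_t kind;
};

// Holds on common ABIs; the exact sizes depend on the platform.
static_assert(sizeof(ReorderedRecord) <= sizeof(PaddedRecord),
              "grouping small fields should not make the record larger");
```

On common 64-bit ABIs such as AArch64 Linux, `sizeof(PaddedRecord)` is typically 24 bytes and `sizeof(ReorderedRecord)` is typically 16 bytes, so more reordered records fit in each cache line.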

## What you've learned and what's next

You've now reviewed the CPU memory hierarchy and the memory and translation terminology that you need to interpret profiling results for the example application in this Learning Path.

Next, you'll set up and build the example C++ application.
@@ -8,15 +8,15 @@ layout: learningpathall

## Set up the build environment

On this page, you install the required system packages, clone the orbiting galaxies example repository, and build the workload binaries. You can also optionally run a visualization to confirm the simulation is working before you profile it.
In this section, you'll install the required system packages, clone the orbiting galaxies example repository, and build the workload binaries. You can also run a visualization to confirm the simulation is working before you profile it.

Use your remote Arm server for all build and run steps. This example uses an AWS `c7g.metal` instance running Ubuntu 24.04 LTS.
Use your remote Arm server for all build and run steps. This example uses an Amazon EC2 `c7g.metal` instance running Ubuntu 24.04 LTS.

## Install Arm Performix
### Install Arm Performix

Install and configure Arm Performix using the [Performix install guide](/install-guides/performix/) on both your local machine and the remote Arm server.

## Install the required system packages
### Install the required system packages

Run the following command, replacing `apt` with the package manager for your Linux distribution.

@@ -41,20 +41,24 @@ sudo apt install -y linux-modules-extra-$(uname -r)
sudo modprobe arm_spe_pmu
```

If you are using an AWS `c7g.metal` instance you also need turn Kernel Page Table Isolation (KPTI) off.
If you're using a `c7g.metal` instance, you also need to turn Kernel Page Table Isolation (KPTI) off.

The fastest way on AWS is to use an editor to add `kpti=off` to the `GRUB_CMDLINE_LINUX_DEFAULT` line in `/etc/default/grub.d/50-cloudimg-settings.cfg`.

After editing the file:
After editing the file, run:

```bash
sudo update-grub
sudo reboot
```

For a complete explanation of SPE refer to [Enable Arm SPE for Performix memory access analysis](/learning-paths/servers-and-cloud-computing/spe-on-performix/).
For a complete explanation of SPE, see [Enable Arm SPE for Performix memory access analysis](/learning-paths/servers-and-cloud-computing/spe-on-performix/).

## Clone the example repository
## Build the sample application

After setting up the build environment, clone and build the sample application.

### Clone the example repository

Clone the orbiting galaxies repository and check out the tagged release to work from a known starting point:

@@ -64,7 +68,7 @@ cd Orbiting-Galaxy-Example
git checkout -b my-work v1.0.3
```

## Build with CMake
### Build with CMake

Build the project using CMake:

@@ -83,7 +87,7 @@ This produces three binaries in `build/`:

## Set up a Python virtual environment and run visualization

From the repository root:
After building the application, from the repository root, run:

```bash
cd ..
@@ -105,4 +109,10 @@ The script reads simulation data from `galaxy_baseline.bin` and writes a GIF fil

![Animated orbiting galaxy simulation generated by the baseline workload, showing particle motion over time so you can verify that the simulation output looks correct before profiling.#center](galaxy_compressed.gif "Orbiting galaxies workload visualization")

Use `--visualize` only for understanding the workload behavior. Do not include visualization mode in profiling runs because file I/O alters the measured runtime characteristics.
Use `--visualize` only for understanding the workload behavior. Don't include visualization mode in profiling runs because file I/O alters the measured runtime characteristics.

## What you've accomplished and what's next

You've now set up a build environment on an Arm-based instance, cloned the orbiting galaxies example from GitHub, and built it. You've also run a visualization to confirm that the application works as expected.

Next, you'll profile memory access behavior using Arm Performix.
@@ -12,9 +12,9 @@ Start by inspecting the baseline particle model in `src/baseline/particle.hpp`.

{{% notice Tip %}}

If you are using an IDE or editor with an LLM-based coding assistant, the `AGENT.md` file can improve your learning experience. This file provides repository context and helps guide the agent to give more useful assistance.
If you are using an IDE or editor with an LLM-based coding assistant, the `AGENTS.md` file can improve your learning experience. The `AGENTS.md` file provides the repository context and helps guide the agent to give more useful assistance.

![Screenshot showing the AGENT.md file in the repository, highlighting the context file your coding assistant uses to provide more relevant guidance during this task.#center](./agent_screen_shot.webp "Screenshot of GitHub Copilot in VSCode using AGENTS.md as a system prompt to act as a learning assistant.")
![Screenshot showing the AGENTS.md file in the repository, highlighting the context file your coding assistant uses to provide more relevant guidance during this task.#center](./agent_screen_shot.webp "Screenshot of GitHub Copilot in VSCode using AGENTS.md as a system prompt to act as a learning assistant.")

{{% /notice %}}

@@ -51,20 +51,20 @@ for (int iter = 0; iter < iters; ++iter) {
This baseline design can create avoidable memory overhead:

- `ParticleOwner` stores pointers to separately allocated `Particle` objects, so the hot loop must follow an extra level of indirection.
- Each `Particle` is 64 bytes, but the position update only uses `x`, `y`, `z`, `vx`, `vy`, and `vz`.
- Loading whole particle objects can waste cache capacity and memory bandwidth when the loop only needs a subset of fields.
- Each `Particle` is 64 bytes, but the position update uses only `x`, `y`, `z`, `vx`, `vy`, and `vz`.
- Loading whole particle objects can waste cache capacity and memory bandwidth when the loop needs only a subset of fields.
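
The overheads in this list can be sketched in a few lines. The code below is a hypothetical reconstruction for illustration only, not the contents of `src/baseline/particle.hpp`. `SketchParticle`, its `float` fields, the `other` placeholder array, and both update functions are assumptions chosen to match the 64-byte size described above.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for the baseline type: 64 bytes in total, of which
// the position update reads or writes only the first six fields.
struct SketchParticle {
    float x, y, z;
    float vx, vy, vz;
    float other[10];  // placeholder for fields the update never touches
};
static_assert(sizeof(SketchParticle) == 64, "sketch assumes 64-byte particles");

// Pointer-based layout: every iteration loads a pointer first, then the
// separately allocated particle that the pointer refers to.
inline void update_positions_indirect(
        std::vector<std::unique_ptr<SketchParticle>>& particles, float dt) {
    for (auto& p : particles) {
        assert(p != nullptr);
        p->x += p->vx * dt;
        p->y += p->vy * dt;
        p->z += p->vz * dt;
    }
}

// Contiguous layout: particles sit back to back, so the loop walks memory
// linearly, but it still pulls in the unused bytes of each particle.
inline void update_positions_contiguous(std::vector<SketchParticle>& particles,
                                        float dt) {
    for (auto& p : particles) {
        p.x += p.vx * dt;
        p.y += p.vy * dt;
        p.z += p.vz * dt;
    }
}

// Fraction of each particle's bytes that the position update uses.
constexpr double useful_byte_fraction() {
    return static_cast<double>(6 * sizeof(float)) / sizeof(SketchParticle);
}
```

Under these assumed field types, `useful_byte_fraction()` evaluates to 0.375, so most of the bytes loaded for each particle are never used by the update. The real ratio depends on the actual field types in the repository.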

Before you optimize anything, profile and measure.

## Run the Performix Memory Access Recipe
## Run the Performix Memory Access recipe

Open the Performix GUI on your local machine and select the **Memory Access** recipe.

Configure the recipe to launch the baseline workload on your remote Arm target:

- Select the configured remote target.
- Set **Workload type** to **Launch a new process**.
- Set **Workload** to the baseline executable:
1. Select the configured remote target.
2. Set **Workload type** to **Launch a new process**.
3. Set **Workload** to the baseline executable:

```output
~/Orbiting-Galaxy-Example/build/baseline
@@ -76,13 +76,13 @@ Keep the default profiling duration so Performix records until the workload exit

Start the recipe and wait for the results to load.

## Assess Performance
## Assess performance

![Performix Memory Access results for the baseline binary showing update_positions with about 66 percent L1C load hits and around 26-cycle average L1C latency, indicating weak cache locality in the hot path.#center](./performix_before_optimizations.webp "Baseline memory access results before optimization")

Look at the memory access results for the baseline binary. Most samples are associated with the `update_positions()` function. The `L1C % Loads` value shows that only about two thirds of loads hit in L1 cache, and the average L1 cache load latency is about 26 cycles. A cache-friendly hot loop should have a much higher L1 hit rate and lower average latency.
Look at the memory access results for the baseline binary. Most samples are associated with the `update_positions()` function. The `L1C % Loads` value shows that only about two-thirds of loads hit in L1 cache, and the average L1 cache load latency is about 26 cycles. A cache-friendly hot loop should have a much higher L1 hit rate and lower average latency.

To investigate further, check the TLB walk data. As described in the background section, the TLB caches virtual-to-physical address translations. As per the image below, the `TLB Walk Breakdown` tab shows no significant TLB walks. That means address translation is not the main issue.
To investigate further, check the TLB walk data. As described in the background section, the TLB caches virtual-to-physical address translations. As the following image shows, the `TLB Walk Breakdown` tab reports no significant TLB walks, so address translation is not the main issue.

![Performix Memory Access results show 0% TLB walks across all functions in the baseline binary, indicating that TLB pressure and costly address translation misses are not contributing to the performance issue.#center](./no_tlb_walks.webp "TLB walk results showing 0 page table walks for all functions in baseline implementation")

@@ -96,6 +96,10 @@ Double-click the `update_positions()` row to open the source code view. The sour

![Performix source code view for update_positions showing sample concentration on the x, y, and z update statements, helping you confirm that this loop is the main optimization target.#center](./source_code.webp "Baseline source-level samples in update_positions")

Given that the majority of samples are associated with accessing the `Particle` data structure and that we fall back to L2 cache ~1/3 of the time, to improve the execution time of this example we will need to focus on more efficient ways, if any, of accessing the `Particle` member variables. For example, there may be an alternative data structure that has better cache utilization.
Most samples are associated with accessing the `Particle` data structure, and about one-third of loads fall back to L2 cache. To improve the execution time of the example, focus on more efficient ways of accessing the `Particle` member variables. For example, an alternative data structure might have better cache utilization.

In the next section, you use this evidence to guide optimization.
## What you've accomplished and what's next

You've now used Arm Performix to assess the memory performance of the orbiting galaxy particle simulator application using the Memory Access recipe.

Next, you'll use these performance results to guide optimization of the application.