From b8add25b48eea9365a5a8617ae49409eedfa033e Mon Sep 17 00:00:00 2001 From: anupras-mohapatra-arm Date: Thu, 21 May 2026 12:20:49 -0500 Subject: [PATCH 1/5] first pass --- .../performix-memory-access/_index.md | 4 --- .../performix-memory-access/how-to-0.md | 25 +++++++++++------- .../performix-memory-access/how-to-1.md | 22 ++++++++++------ .../performix-memory-access/how-to-2.md | 26 +++++++++++-------- .../performix-memory-access/how-to-3.md | 22 ++++++++-------- 5 files changed, 55 insertions(+), 44 deletions(-) diff --git a/content/learning-paths/servers-and-cloud-computing/performix-memory-access/_index.md b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/_index.md index 1bde54f4cc..14d5313318 100644 --- a/content/learning-paths/servers-and-cloud-computing/performix-memory-access/_index.md +++ b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/_index.md @@ -1,10 +1,6 @@ --- title: Analyze memory access behavior using Arm Performix and the Arm MCP Server -draft: true -cascade: - draft: true - description: Learn how to profile memory access behavior in a C++ particle simulation on Arm Linux using the Arm Performix Memory Access recipe through the Arm MCP Server. minutes_to_complete: 45 diff --git a/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-0.md b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-0.md index 1fd669d57e..f5d0a29aeb 100644 --- a/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-0.md +++ b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-0.md @@ -8,18 +8,18 @@ layout: learningpathall ## Review of the CPU memory hierarchy -This section recaps the memory hierarchy concepts the worked example builds on. It is not an exhaustive explanation, but covers what you need to interpret the profiling results. +In this section, you'll learn the memory hierarchy concepts the worked example builds on. It is not an exhaustive explanation, but covers what you'll need to interpret the profiling results. Modern Arm Neoverse server CPUs use a hierarchy of memories to reduce the cost of loading and storing data. The fastest storage sits close to each CPU core, while larger memories sit farther away and take more cycles to access. -You typically see: +You usually see the following: - L1 data cache (`L1d`) and L1 instruction cache (`L1i`) close to each core with each access usually taking up to 10 cycles. - L2 cache, often private to each core, with each access usually taking 10-20 cycles. - Last-level cache, often shared across multiple cores, and usually taking 20+ cycles. - DRAM, which is much larger but much slower than on-chip cache. -You can inspect cache topology on an Arm Neoverse server with Arm's [Sysreport](/learning-paths/servers-and-cloud-computing/sysreport/) tool or the `lscpu` command. Unlike `lscpu`, Sysreport also reports the set associativity for each cache level. For example, you can run the following command on a system with `git` and `python` installed: +To inspect cache topology on an Arm Neoverse server, see the [Learning Path for Arm's Sysreport tool](/learning-paths/servers-and-cloud-computing/sysreport/) or use the `lscpu` command. Unlike `lscpu`, Sysreport also reports the set associativity for each cache level. For example, you can run the following command on a system with `git` and `python` installed: ```bash git clone https://github.com/ArmDeveloperEcosystem/sysreport.git @@ -50,29 +50,34 @@ hwloc-ls --of png > topology.png ![Hardware locality topology for an Arm server showing per-core L1 and L2 caches and a shared L3 cache across all cores, which helps you verify cache hierarchy before profiling.#center](./topology.webp "Example hardware locality topology") -The diagram illustrates cache tiers on an AWS Graviton3 bare metal server based on Neoverse V1. Each of the 64 cores has private `L1d`, `L1i`, and `L2` caches, and all cores share one `L3` cache, sometimes referred to as last-level cache (LLC). Cache sizes, especially at later levels, are not fixed by the Neoverse architecture; implementers such as AWS or Google can configure larger or smaller caches based on design goals. +The diagram illustrates cache tiers on an AWS Graviton3 bare metal server based on Neoverse V1. Each of the 64 cores has private `L1d`, `L1i`, and `L2` caches, and all cores share one `L3` cache, sometimes referred to as last-level cache (LLC). Cache sizes, especially at later levels, are not fixed by the Neoverse architecture. Implementers such as AWS or Google can configure larger or smaller caches based on design goals. -NUMA, or non-uniform memory access, means memory latency can depend on which processor or socket owns the memory being accessed. On this AWS Graviton3 instance, there is only one NUMA node. +Non-uniform memory access (NUMA) means memory latency can depend on which processor or socket owns the memory being accessed. On this AWS Graviton3 instance, there is only one NUMA node. To get a comprehensive system-level understanding of the memory subsystem, see the Learning Path on the [Arm system characterization tool](/learning-paths/servers-and-cloud-computing/memory-subsystem/). ## Memory and translation terminology -Applications use virtual addresses, which are the addresses a program sees instead of physical DRAM locations. Virtual addressing lets the operating system isolate processes, protect memory, and map each program's address space to available physical memory. The processor translates virtual addresses to physical addresses before it accesses memory. +Applications use virtual addresses, which are the addresses a program sees instead of physical DRAM locations. With virtual addressing, the operating system isolates processes, protects memory, and maps each program's address space to available physical memory. The processor translates virtual addresses to physical addresses before it accesses memory. -### Translation lookaside buffer (TLB) +### Translation lookaside buffer The translation lookaside buffer (TLB) caches recent virtual-to-physical translations at page granularity to avoid page table walks. A TLB miss occurs when the needed translation is not cached, so the processor performs a page table walk to find the mapping. Page walks add latency before a load or store can complete. Large working sets and irregular access patterns, such as strides larger than the typical 4KB page size, can increase TLB pressure because the program touches many pages with little reuse. ### Page faults -A minor page fault is usually harmless: the data is already in RAM, and the kernel only creates the mapping. This commonly happens during anonymous paging when Linux lazily backs newly allocated heap or stack memory on first touch. A major page fault is more expensive because the kernel must fetch the page from disk, such as from a file or swap, so repeated major faults are usually a real performance concern. +A minor page fault is usually harmless: the data is already in RAM, and the kernel only creates the mapping. This fault commonly happens during anonymous paging when Linux lazily backs newly allocated heap or stack memory on first touch. A major page fault is more expensive because the kernel must fetch the page from disk, such as from a file or swap, so repeated major faults are a real performance concern. ### Working set size The working set is the data your program actively touches during a period of execution. It differs from resident set size (RSS), which is the amount of physical memory currently resident for a process. A process can have a large RSS while the hot loop actively uses only a smaller working set. -### Memory access from a programmer's perspective +## Memory access from a programmer's perspective -From a programmer's perspective, much of the cache and memory subsystem is a black box defined by processor architecture and implementation. Features such as cache associativity, prefetching, and translation caching are designed to hide latency across many workloads. Your main software levers are data structure layout, allocation patterns, and choices such as page size. The layout of your C++ data structures can determine whether the memory hierarchy helps or hurts performance. The compiler generally cannot reorder structure fields or split objects automatically because that would change program semantics. +From a programmer's perspective, much of the cache and memory subsystem is a black box defined by processor architecture and implementation. Features such as cache associativity, prefetching, and translation caching are designed to hide latency across many workloads. Your main software levers are data structure layout, allocation patterns, and choices such as page size. The layout of your C++ data structures can determine whether the memory hierarchy helps or hurts performance. The compiler generally can't reorder structure fields or split objects automatically because that would change program semantics. +## What you've learned and what's next + +You've now learned about CPU memory hierarchy, memory access, and relevant memory and translation terminology to understand profiling results for the example application that you'll use in this Learning Path. + +Next, you'll set up and build the example C++ application. \ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-1.md b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-1.md index 3ae6d9b9d1..50d46d6f11 100644 --- a/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-1.md +++ b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-1.md @@ -8,15 +8,15 @@ layout: learningpathall ## Set up the build environment -On this page, you install the required system packages, clone the orbiting galaxies example repository, and build the workload binaries. You can also optionally run a visualization to confirm the simulation is working before you profile it. +In this section, you'll install the required system packages, clone the orbiting galaxies example repository, and build the workload binaries. You can also run a visualization to confirm the simulation is working before you profile it. -Use your remote Arm server for all build and run steps. This example uses an AWS `c7g.metal` instance running Ubuntu 24.04 LTS. +Use your remote Arm server for all build and run steps. This example uses an Amazon EC2 `c7g.metal` instance running Ubuntu 24.04 LTS. -## Install Arm Performix +### Install Arm Performix Install and configure Arm Performix using the [Performix install guide](/install-guides/performix/) on both your local machine and the remote Arm server. -## Install the required system packages +### Install the required system packages Run the following command, replacing `apt` with the package manager for your Linux distribution. @@ -41,7 +41,7 @@ sudo apt install -y linux-modules-extra-$(uname -r) sudo modprobe arm_spe_pmu ``` -If you are using an AWS `c7g.metal` instance you also need turn Kernel Page Table Isolation (KPTI) off. +If you are using a `c7g.metal` instance, you also need turn Kernel Page Table Isolation (KPTI) off. The fastest way on AWS is to use an editor to add `kpti=off` to the `GRUB_CMDLINE_LINUX_DEFAULT` line in `/etc/default/grub.d/50-cloudimg-settings.cfg`. @@ -52,7 +52,7 @@ sudo update-grub sudo reboot ``` -For a complete explanation of SPE refer to [Enable Arm SPE for Performix memory access analysis](/learning-paths/servers-and-cloud-computing/spe-on-performix/). +For a complete explanation of SPE, see [Enable Arm SPE for Performix memory access analysis](/learning-paths/servers-and-cloud-computing/spe-on-performix/). ## Clone the example repository @@ -83,7 +83,7 @@ This produces three binaries in `build/`: ## Set up a Python virtual environment and run visualization -From the repository root: +From the repository root, run: ```bash cd .. @@ -105,4 +105,10 @@ The script reads simulation data from `galaxy_baseline.bin` and writes a GIF fil ![Animated orbiting galaxy simulation generated by the baseline workload, showing particle motion over time so you can verify that the simulation output looks correct before profiling.#center](galaxy_compressed.gif "Orbiting galaxies workload visualization") -Use `--visualize` only for understanding the workload behavior. Do not include visualization mode in profiling runs because file I/O alters the measured runtime characteristics. +Use `--visualize` only for understanding the workload behavior. Don't include visualization mode in profiling runs because file I/O alters the measured runtime characteristics. + +## What you've accomplished and what's next + +You've now set up and built an orbiting galaxy application on an Arm-based instance by setting up a build environment and cloning the app from a GitHub repo. You've also run a visualization to confirm that the application works as expected. + +Next, you'll profile memory access behavior using Arm Performix. \ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-2.md b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-2.md index 70d1bd85c7..0df961b666 100644 --- a/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-2.md +++ b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-2.md @@ -51,20 +51,20 @@ for (int iter = 0; iter < iters; ++iter) { This baseline design can create avoidable memory overhead: - `ParticleOwner` stores pointers to separately allocated `Particle` objects, so the hot loop must follow an extra level of indirection. -- Each `Particle` is 64 bytes, but the position update only uses `x`, `y`, `z`, `vx`, `vy`, and `vz`. -- Loading whole particle objects can waste cache capacity and memory bandwidth when the loop only needs a subset of fields. +- Each `Particle` is 64 bytes, but the position update uses only `x`, `y`, `z`, `vx`, `vy`, and `vz`. +- Loading whole particle objects can waste cache capacity and memory bandwidth when the loop needs only a subset of fields. Before you optimize anything, profile and measure. -## Run the Performix Memory Access Recipe +## Run the Performix Memory Access recipe Open the Performix GUI on your local machine and select the **Memory Access** recipe. Configure the recipe to launch the baseline workload on your remote Arm target: -- Select the configured remote target. -- Set **Workload type** to **Launch a new process**. -- Set **Workload** to the baseline executable: +1. Select the configured remote target. +2. Set **Workload type** to **Launch a new process**. +3. Set **Workload** to the baseline executable: ```output ~/Orbiting-Galaxy-Example/build/baseline @@ -76,13 +76,13 @@ Keep the default profiling duration so Performix records until the workload exit Start the recipe and wait for the results to load. -## Assess Performance +## Assess performance ![Performix Memory Access results for the baseline binary showing update_positions with about 66 percent L1C load hits and around 26-cycle average L1C latency, indicating weak cache locality in the hot path.#center](./performix_before_optimizations.webp "Baseline memory access results before optimization") -Look at the memory access results for the baseline binary. Most samples are associated with the `update_positions()` function. The `L1C % Loads` value shows that only about two thirds of loads hit in L1 cache, and the average L1 cache load latency is about 26 cycles. A cache-friendly hot loop should have a much higher L1 hit rate and lower average latency. +Look at the memory access results for the baseline binary. Most samples are associated with the `update_positions()` function. The `L1C % Loads` value shows that only about two-thirds of loads hit in L1 cache, and the average L1 cache load latency is about 26 cycles. A cache-friendly hot loop should have a much higher L1 hit rate and lower average latency. -To investigate further, check the TLB walk data. As described in the background section, the TLB caches virtual-to-physical address translations. As per the image below, the `TLB Walk Breakdown` tab shows no significant TLB walks. That means address translation is not the main issue. +To investigate further, check the TLB walk data. As described in the background section, the TLB caches virtual-to-physical address translations. As per the following image, the `TLB Walk Breakdown` tab shows no significant TLB walks. That means address translation is not the main issue. ![Performix Memory Access results show 0% TLB walks across all functions in the baseline binary, indicating that TLB pressure and costly address translation misses are not contributing to the performance issue.#center](./no_tlb_walks.webp "TLB walk results showing 0 page table walks for all functions in baseline implementation") @@ -96,6 +96,10 @@ Double-click the `update_positions()` row to open the source code view. The sour ![Performix source code view for update_positions showing sample concentration on the x, y, and z update statements, helping you confirm that this loop is the main optimization target.#center](./source_code.webp "Baseline source-level samples in update_positions") -Given that the majority of samples are associated with accessing the `Particle` data structure and that we fall back to L2 cache ~1/3 of the time, to improve the execution time of this example we will need to focus on more efficient ways, if any, of accessing the `Particle` member variables. For example, there may be an alternative data structure that has better cache utilization. +The majority of samples are associated with accessing the `Particle` data structure, and we fall back to L2 cache ~1/3 of the time. Considering this, to improve the execution time of the example, we'll need to focus on more efficient ways, if any, of accessing the `Particle` member variables. For example, there might be an alternative data structure that has better cache utilization. -In the next section, you use this evidence to guide optimization. +## What you've accomplished and what's next + +You've now used Arm Performix to assess the memory performance of the oribiting galaxy particle simulator application using the Memory Access recipe. + +Next, you'll use these performance results to guide optimization of the application. diff --git a/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-3.md b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-3.md index d4c68b77b7..9001f01681 100644 --- a/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-3.md +++ b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-3.md @@ -6,7 +6,7 @@ weight: 5 layout: learningpathall --- -## Optimize manually +## Manually optimize the application The `src/users_solution/` directory is an editable copy of `src/baseline`. Using the data collected from Performix, refactor the `Particle` data structure and associated function signatures and call sites to improve the L1 cache hit rate. The baseline result showed that `update_positions()` dominated the samples, had a low L1 cache hit rate, and did not show significant TLB walks. @@ -16,7 +16,7 @@ Consider how the `Particle` data structure maps to a 64-byte cache line. Also co {{% /notice %}} -Once you make changes in `src/users_solution/`, rebuild the binary with the following commands. +After you make changes in `src/users_solution/`, rebuild the binary with the following commands: ```bash cd ~/Orbiting-Galaxy-Example/build @@ -35,9 +35,9 @@ The hot loop is instrumented with `scopedTimer`, so you'll also see the loop dur ## (Optional) Optimize with an AI agent and the Arm MCP server -If you have access to a code assistant such as Kiro, Gemini, Codex, or GitHub Copilot, you can also use the Arm Model Context Protocol (MCP) server. The MCP server includes direct tool support to invoke Performix on a remote target. It integrates with MCP-compatible coding assistants and can provide performance insights to create a useful feedback loop. The following example shows how to connect to OpenAI Codex; for other tools, see [your preferred coding assistant](/learning-paths/servers-and-cloud-computing/arm-mcp-server/1-overview/). +If you have access to a code assistant such as Kiro, Gemini, Codex, or GitHub Copilot, you can also use the Arm Model Context Protocol (MCP) server. The MCP server includes direct tool support to invoke Performix on a remote target. It integrates with MCP-compatible coding assistants and can provide performance insights to create a useful feedback loop. The following example shows how to connect to OpenAI Codex. For other tools, see [your preferred coding assistant](/learning-paths/servers-and-cloud-computing/arm-mcp-server/1-overview/). -{{% notice Please Note %}} +{{% notice Note %}} You need an OpenAI account to use the Codex CLI. @@ -73,9 +73,9 @@ Alternatively, you can use the curated [arm-full-optimization.md](https://github ## Review the optimized solution -A reference solution is available in the `src/optimized` directory. The baseline stores a vector of `Particle*` values, where each `Particle` is allocated separately and contains all particle fields in one 64-byte structure. The hot loop only needs `x`, `y`, `z`, `vx`, `vy`, and `vz`, but the baseline layout still steps through whole particle objects and performs unnecessary pointer chasing. +A reference solution is available in the `src/optimized` directory of the repository. The baseline stores a vector of `Particle*` values, where each `Particle` is allocated separately and contains all particle fields in one 64-byte structure. The hot loop needs only `x`, `y`, `z`, `vx`, `vy`, and `vz`, but the baseline layout still steps through whole particle objects and performs unnecessary pointer chasing. -The optimized version changes the layout to a Struct of Arrays (SoA). Each field is stored in its own contiguous `std::vector`: +The optimized version changes the layout to a Structure of Arrays (SoA). Each field is stored in its own contiguous `std::vector`: ```cpp struct ParticlesSoA { @@ -101,7 +101,7 @@ void update_positions(ParticlesSoA& p, int n, float dt) { This removes `Particle*` indirection and improves cache-line utilization because the hot loop streams through only the data it uses. -The following diagram compares the baseline and optimized layouts. The baseline implementation is on the right. Even though each particle is padded to a 64-byte cache line, many struct members are not read or written in the hot loop, so they remain cold. With a structure-of-arrays layout, all particles are still owned together, but cache lines contain more of the data that the loop actually touches. +The following diagram compares the baseline and optimized layouts. Even though each particle is padded to a 64-byte cache line, many struct members are not read or written in the hot loop, so they remain cold. With a structure-of-arrays layout, all particles are still owned together, but cache lines contain more of the data that the loop actually touches. ![Animation comparing baseline and structure-of-arrays layouts, showing how the optimized layout packs hot fields together so cache lines carry useful data for position updates.#center](./data_layout_comparison_compressed.gif) @@ -109,7 +109,7 @@ The following diagram compares the baseline and optimized layouts. The baseline To see what fully optimized results look like, run the Performix Memory Access recipe against the pre-built reference binary. In the Performix GUI, rerun the recipe and change the binary path from `~/Orbiting-Galaxy-Example/build/baseline` to `~/Orbiting-Galaxy-Example/build/optimized`. -![Performix Memory Access results for the optimized binary showing 100 percent L1C load hits for the selected function and lower average L1C latency, confirming improved memory locality after the data layout change.#center](./performix_after_optimization.webp "Memory access results after the Struct of Arrays optimization") +![Performix Memory Access results for the optimized binary showing 100 percent L1C load hits for the selected function and lower average L1C latency, confirming improved memory locality after the data layout change.#center](./performix_after_optimization.webp "Memory access results after the Structure of Arrays optimization") The optimized result shows much stronger L1 cache behavior. The hot update path now has `100%` L1C loads in the captured result and a lower average L1C latency than the baseline. This confirms that the data layout change improved locality, not just wall-clock time. @@ -160,14 +160,14 @@ Optimized took 279 milliseconds | Metric | Baseline | Optimized | Explanation | |-----------------------|--------------|--------------|---------------------------------------------------------------------------------------------| | Wall time (ms) | 571 | 279 | The optimized layout improves cache usage and removes pointer chasing, roughly halving execution time. | -| Max RSS (KB) | 92,720 | 64,044 | Struct of Arrays reduces memory footprint by removing per-object overhead and cold fields. | +| Max RSS (KB) | 92,720 | 64,044 | Structure of Arrays reduces memory footprint by removing per-object overhead and cold fields. | | Minor page faults | 22,655 | 15,500 | Fewer pages are touched due to more compact, contiguous storage of only needed data fields. | | L1 cache hit rate (%) | 66.3 | 99.3 | Hot data is now accessed in a cache-friendly pattern, maximizing L1 cache effectiveness. | | L1 avg latency (cycles)| 26.2 | 11.7 | Each L1 load takes fewer cycles because pointer chasing is removed. | -## Summary +## What you've accomplished -You used Arm Performix and the Arm MCP Server to identify a memory access bottleneck in a C++ particle simulation. You connected the profile data to source code, found that the hot loop suffered from poor data layout and unnecessary pointer chasing, and improved the implementation with a Struct of Arrays layout. You then validated the change with direct wall-time measurements and a second Performix run. +You used Arm Performix and the Arm MCP Server to identify a memory access bottleneck in a C++ particle simulation. You then connected the profile data to source code, found that the hot loop suffered from poor data layout and unnecessary pointer chasing, and improved the implementation with a Structure of Arrays layout. You validated the change with direct wall-time measurements and a second Performix run. This approach combines measurement tools, code context, and focused prompts to iterate on real bottlenecks. From 35e22893d44479d60248ce3a34fc9a1a3f598614 Mon Sep 17 00:00:00 2001 From: anupras-mohapatra-arm Date: Thu, 21 May 2026 13:48:39 -0500 Subject: [PATCH 2/5] second pass edits --- .../performix-memory-access/_index.md | 4 ++-- .../performix-memory-access/how-to-0.md | 4 ++-- .../performix-memory-access/how-to-1.md | 12 ++++++++---- .../performix-memory-access/how-to-2.md | 8 ++++---- .../performix-memory-access/how-to-3.md | 8 ++++---- 5 files changed, 20 insertions(+), 16 deletions(-) diff --git a/content/learning-paths/servers-and-cloud-computing/performix-memory-access/_index.md b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/_index.md index 14d5313318..e499d7df18 100644 --- a/content/learning-paths/servers-and-cloud-computing/performix-memory-access/_index.md +++ b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/_index.md @@ -10,7 +10,7 @@ who_is_this_for: This is an introductory topic for C++ developers who want to us learning_objectives: - Explain how L1 cache hits, TLB misses, and page walks affect C++ application performance. - Build and visualize the orbiting galaxies example on an Arm Neoverse server. - - Inspect and optimize particle data structure using insights from the memory access recipe. + - Inspect and optimize the particle data structure using insights from the memory access recipe. - Use the Arm MCP Server in combination with Arm Performix for an agentic solution. prerequisites: @@ -29,7 +29,7 @@ armips: tools_software_languages: - Arm Performix - MCP - - C + - C++ - CMake - Python - perf diff --git a/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-0.md b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-0.md index f5d0a29aeb..d66a6657e6 100644 --- a/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-0.md +++ b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-0.md @@ -15,8 +15,8 @@ Modern Arm Neoverse server CPUs use a hierarchy of memories to reduce the cost o You usually see the following: - L1 data cache (`L1d`) and L1 instruction cache (`L1i`) close to each core with each access usually taking up to 10 cycles. -- L2 cache, often private to each core, with each access usually taking 10-20 cycles. -- Last-level cache, often shared across multiple cores, and usually taking 20+ cycles. +- L2 cache, often private to each core, with each access usually taking 10 to 20 cycles. +- Last-level cache, often shared across multiple cores, and usually taking more than 20 cycles. - DRAM, which is much larger but much slower than on-chip cache. To inspect cache topology on an Arm Neoverse server, see the [Learning Path for Arm's Sysreport tool](/learning-paths/servers-and-cloud-computing/sysreport/) or use the `lscpu` command. Unlike `lscpu`, Sysreport also reports the set associativity for each cache level. For example, you can run the following command on a system with `git` and `python` installed: diff --git a/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-1.md b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-1.md index 50d46d6f11..25e43b82d5 100644 --- a/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-1.md +++ b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-1.md @@ -41,7 +41,7 @@ sudo apt install -y linux-modules-extra-$(uname -r) sudo modprobe arm_spe_pmu ``` -If you are using a `c7g.metal` instance, you also need turn Kernel Page Table Isolation (KPTI) off. +If you're using a `c7g.metal` instance, you also need to turn Kernel Page Table Isolation (KPTI) off. The fastest way on AWS is to use an editor to add `kpti=off` to the `GRUB_CMDLINE_LINUX_DEFAULT` line in `/etc/default/grub.d/50-cloudimg-settings.cfg`. @@ -54,7 +54,11 @@ sudo reboot For a complete explanation of SPE, see [Enable Arm SPE for Performix memory access analysis](/learning-paths/servers-and-cloud-computing/spe-on-performix/). -## Clone the example repository +## Build the sample application + +After setting up the build environment, clone and build the sample application. + +### Clone the example repository Clone the orbiting galaxies repository and check out the tagged release to work from a known starting point: @@ -64,7 +68,7 @@ cd Orbiting-Galaxy-Example git checkout -b my-work v1.0.3 ``` -## Build with CMake +### Build with CMake Build the project using CMake: @@ -83,7 +87,7 @@ This produces three binaries in `build/`: ## Set up a Python virtual environment and run visualization -From the repository root, run: +After building the application, from the repository root, run: ```bash cd .. diff --git a/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-2.md b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-2.md index 0df961b666..caf48d5aa6 100644 --- a/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-2.md +++ b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-2.md @@ -12,9 +12,9 @@ Start by inspecting the baseline particle model in `src/baseline/particle.hpp`. {{% notice Tip %}} -If you are using an IDE or editor with an LLM-based coding assistant, the `AGENT.md` file can improve your learning experience. This file provides repository context and helps guide the agent to give more useful assistance. +If you are using an IDE or editor with an LLM-based coding assistant, the `AGENTS.md` file can improve your learning experience. This file provides repository context and helps guide the agent to give more useful assistance. -![Screenshot showing the AGENT.md file in the repository, highlighting the context file your coding assistant uses to provide more relevant guidance during this task.#center](./agent_screen_shot.webp "Screenshot of GitHub Copilot in VSCode using AGENTS.md as a system prompt to act as a learning assistant.") +![Screenshot showing the AGENTS.md file in the repository, highlighting the context file your coding assistant uses to provide more relevant guidance during this task.#center](./agent_screen_shot.webp "Screenshot of GitHub Copilot in VSCode using AGENTS.md as a system prompt to act as a learning assistant.") {{% /notice %}} @@ -96,10 +96,10 @@ Double-click the `update_positions()` row to open the source code view. The sour ![Performix source code view for update_positions showing sample concentration on the x, y, and z update statements, helping you confirm that this loop is the main optimization target.#center](./source_code.webp "Baseline source-level samples in update_positions") -The majority of samples are associated with accessing the `Particle` data structure, and we fall back to L2 cache ~1/3 of the time. Considering this, to improve the execution time of the example, we'll need to focus on more efficient ways, if any, of accessing the `Particle` member variables. For example, there might be an alternative data structure that has better cache utilization. +The majority of samples are associated with accessing the `Particle` data structure, and the samples fall back to L2 cache approximately one-third of the time. Considering this, to improve the execution time of the example, you'll need to focus on more efficient ways, if any, of accessing the `Particle` member variables. For example, there might be an alternative data structure that has better cache utilization. ## What you've accomplished and what's next -You've now used Arm Performix to assess the memory performance of the oribiting galaxy particle simulator application using the Memory Access recipe. +You've now used Arm Performix to assess the memory performance of the orbiting galaxy particle simulator application using the Memory Access recipe. Next, you'll use these performance results to guide optimization of the application. diff --git a/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-3.md b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-3.md index 9001f01681..a47b441d72 100644 --- a/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-3.md +++ b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-3.md @@ -1,5 +1,5 @@ --- -title: Optimize manually and with the Arm MCP server +title: Optimize manually and with the Arm MCP Server weight: 5 ### FIXED, DO NOT MODIFY @@ -31,11 +31,11 @@ To measure wall time and compare it against the baseline, run: /usr/bin/time -v ~/Orbiting-Galaxy-Example/build/users_solution ``` -The hot loop is instrumented with `scopedTimer`, so you'll also see the loop duration printed directly to the terminal. Compare it with the baseline result of 571 milliseconds shown at the end of this page. +The hot loop is instrumented with `scopedTimer`, so you'll also see the loop duration printed directly to the terminal. Compare it with the baseline result of 571 milliseconds shown at the end of the section. -## (Optional) Optimize with an AI agent and the Arm MCP server +## Optimize with an AI agent and the Arm MCP server -If you have access to a code assistant such as Kiro, Gemini, Codex, or GitHub Copilot, you can also use the Arm Model Context Protocol (MCP) server. The MCP server includes direct tool support to invoke Performix on a remote target. It integrates with MCP-compatible coding assistants and can provide performance insights to create a useful feedback loop. The following example shows how to connect to OpenAI Codex. For other tools, see [your preferred coding assistant](/learning-paths/servers-and-cloud-computing/arm-mcp-server/1-overview/). +If you have access to a code assistant such as Kiro, Gemini, Codex, or GitHub Copilot, you can also use the Arm Model Context Protocol (MCP) Server. The MCP server includes direct tool support to invoke Performix on a remote target. It integrates with MCP-compatible coding assistants and can provide performance insights to create a useful feedback loop. The following example shows how to connect to OpenAI Codex. For other tools, see [your preferred coding assistant](/learning-paths/servers-and-cloud-computing/arm-mcp-server/1-overview/). {{% notice Note %}} From ccd1ec2cb8b66effe5da3492b8e1511d58f9780e Mon Sep 17 00:00:00 2001 From: anupras-mohapatra-arm Date: Thu, 21 May 2026 14:05:18 -0500 Subject: [PATCH 3/5] edit --- .../performix-memory-access/how-to-3.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-3.md b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-3.md index a47b441d72..01878b544e 100644 --- a/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-3.md +++ b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-3.md @@ -1,5 +1,5 @@ --- -title: Optimize manually and with the Arm MCP Server +title: Optimize the application manually and with the Arm MCP Server weight: 5 ### FIXED, DO NOT MODIFY From 75678f56ce727e820a2fe8b679d877b84effae0b Mon Sep 17 00:00:00 2001 From: anupras-mohapatra-arm Date: Thu, 21 May 2026 15:42:24 -0500 Subject: [PATCH 4/5] more consistency updates --- .../performix-memory-access/_index.md | 4 ++-- .../performix-memory-access/how-to-0.md | 4 ++-- .../performix-memory-access/how-to-2.md | 2 +- .../performix-memory-access/how-to-3.md | 4 ++-- 4 files changed, 7 insertions(+), 7 deletions(-) diff --git a/content/learning-paths/servers-and-cloud-computing/performix-memory-access/_index.md b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/_index.md index e499d7df18..70da52675d 100644 --- a/content/learning-paths/servers-and-cloud-computing/performix-memory-access/_index.md +++ b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/_index.md @@ -1,7 +1,7 @@ --- -title: Analyze memory access behavior using Arm Performix and the Arm MCP Server +title: Optimize memory access behavior using Arm Performix and the Arm MCP Server -description: Learn how to profile memory access behavior in a C++ particle simulation on Arm Linux using the Arm Performix Memory Access recipe through the Arm MCP Server. +description: Learn how to profile and optimize memory access behavior in a C++ particle simulation on Arm Linux using the Arm Performix Memory Access recipe through the Arm MCP Server. minutes_to_complete: 45 diff --git a/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-0.md b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-0.md index d66a6657e6..bc2c910e96 100644 --- a/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-0.md +++ b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-0.md @@ -8,7 +8,7 @@ layout: learningpathall ## Review of the CPU memory hierarchy -In this section, you'll learn the memory hierarchy concepts the worked example builds on. It is not an exhaustive explanation, but covers what you'll need to interpret the profiling results. +In this section, you'll learn the memory hierarchy concepts the worked example builds on. It's not an exhaustive explanation, but it covers what you'll need to interpret the profiling results. Modern Arm Neoverse server CPUs use a hierarchy of memories to reduce the cost of loading and storing data. The fastest storage sits close to each CPU core, while larger memories sit farther away and take more cycles to access. @@ -50,7 +50,7 @@ hwloc-ls --of png > topology.png ![Hardware locality topology for an Arm server showing per-core L1 and L2 caches and a shared L3 cache across all cores, which helps you verify cache hierarchy before profiling.#center](./topology.webp "Example hardware locality topology") -The diagram illustrates cache tiers on an AWS Graviton3 bare metal server based on Neoverse V1. Each of the 64 cores has private `L1d`, `L1i`, and `L2` caches, and all cores share one `L3` cache, sometimes referred to as last-level cache (LLC). Cache sizes, especially at later levels, are not fixed by the Neoverse architecture. Implementers such as AWS or Google can configure larger or smaller caches based on design goals. +The diagram shows cache tiers on an AWS Graviton3 bare metal server based on Neoverse V1. Each of the 64 cores has private `L1d`, `L1i`, and `L2` caches, and all cores share one `L3` cache, sometimes referred to as last-level cache (LLC). Cache sizes, especially at later levels, are not fixed by the Neoverse architecture. Implementers such as AWS or Google can configure larger or smaller caches based on design goals. Non-uniform memory access (NUMA) means memory latency can depend on which processor or socket owns the memory being accessed. On this AWS Graviton3 instance, there is only one NUMA node. diff --git a/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-2.md b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-2.md index caf48d5aa6..c1f306b122 100644 --- a/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-2.md +++ b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-2.md @@ -12,7 +12,7 @@ Start by inspecting the baseline particle model in `src/baseline/particle.hpp`. {{% notice Tip %}} -If you are using an IDE or editor with an LLM-based coding assistant, the `AGENTS.md` file can improve your learning experience. This file provides repository context and helps guide the agent to give more useful assistance. +If you are using an IDE or editor with an LLM-based coding assistant, the `AGENTS.md` file can improve your learning experience. The `AGENTS.md` file provides the repository context and helps guide the agent to give more useful assistance. ![Screenshot showing the AGENTS.md file in the repository, highlighting the context file your coding assistant uses to provide more relevant guidance during this task.#center](./agent_screen_shot.webp "Screenshot of GitHub Copilot in VSCode using AGENTS.md as a system prompt to act as a learning assistant.") diff --git a/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-3.md b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-3.md index 01878b544e..c9ca107484 100644 --- a/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-3.md +++ b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-3.md @@ -33,9 +33,9 @@ To measure wall time and compare it against the baseline, run: The hot loop is instrumented with `scopedTimer`, so you'll also see the loop duration printed directly to the terminal. Compare it with the baseline result of 571 milliseconds shown at the end of the section. -## Optimize with an AI agent and the Arm MCP server +## Optimize with an AI agent and the Arm MCP Server -If you have access to a code assistant such as Kiro, Gemini, Codex, or GitHub Copilot, you can also use the Arm Model Context Protocol (MCP) Server. The MCP server includes direct tool support to invoke Performix on a remote target. It integrates with MCP-compatible coding assistants and can provide performance insights to create a useful feedback loop. The following example shows how to connect to OpenAI Codex. For other tools, see [your preferred coding assistant](/learning-paths/servers-and-cloud-computing/arm-mcp-server/1-overview/). +You can use the Arm Model Context Protocol (MCP) Server with a code assistant such as Kiro, Gemini, Codex, or GitHub Copilot to optimize the application. The MCP server includes direct tool support to invoke Performix on a remote target. It integrates with MCP-compatible coding assistants and can provide performance insights to create a useful feedback loop. The following example shows how to connect to OpenAI Codex. For other tools, see [your preferred coding assistant](/learning-paths/servers-and-cloud-computing/arm-mcp-server/1-overview/). {{% notice Note %}} From 85ec45abf097a3d739a2b133ca59bd77ed7eaa29 Mon Sep 17 00:00:00 2001 From: anupras-mohapatra-arm Date: Thu, 21 May 2026 16:02:04 -0500 Subject: [PATCH 5/5] edit --- .../performix-memory-access/how-to-1.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-1.md b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-1.md index 25e43b82d5..dec5b59822 100644 --- a/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-1.md +++ b/content/learning-paths/servers-and-cloud-computing/performix-memory-access/how-to-1.md @@ -45,7 +45,7 @@ If you're using a `c7g.metal` instance, you also need to turn Kernel Page Table The fastest way on AWS is to use an editor to add `kpti=off` to the `GRUB_CMDLINE_LINUX_DEFAULT` line in `/etc/default/grub.d/50-cloudimg-settings.cfg`. -After editing the file: +After editing the file, run: ```bash sudo update-grub