Original file line number Diff line number Diff line change
@@ -14,7 +14,7 @@ Integer division is ideal for benchmarking because it's significantly more expen

## What tools are needed to run a Google Benchmark example?

For this example, you can use any Arm Linux computer. For example, an AWS EC2 `c7g.xlarge` instance running Ubuntu 24.04 LTS can be used.
For this example, you can use any Arm Linux computer, such as a system based on the 1st generation Arm AGI CPU running Ubuntu 24.04 LTS.

Run the following commands to install the prerequisite packages:

@@ -57,30 +57,41 @@ Compile with the command:
g++ -O3 -std=c++17 div_bench.cpp -lbenchmark -lpthread -o div_bench.base
```

Run the program:
{{% notice Please Note %}}

```bash
./div_bench.base
```
Because the command above does not specify `-mcpu` or `-march`, GCC targets a generic baseline architecture. If you want to apply PGO specifically for the 1st generation Arm AGI CPU, use the `-mcpu=armagicpu` option, which was added in [GCC 16.1.0](https://github.com/gcc-mirror/gcc/commit/0f5f728854d2ea93e6806a8632c04383502b0386). As of May 2026, it enables the same architectural features as `-mcpu=neoverse-v3ae` from [GCC 15](https://gcc.gnu.org/gcc-15/changes.html). However, the two options may diverge in future releases.

### Example output
For this reason, install the latest version of GCC/G++ if you are targeting the Arm AGI CPU. Use the `-mcpu=native` flag when compiling on the target machine, or `-mcpu=armagicpu` when cross-compiling.

```output
{{% /notice %}}

Run the program:

```bash { command_line="user@localhost | 2-16" }
./div_bench.base
2026-05-21T08:12:08+00:00
Running ./div_bench.base
Run on (4 X 2100 MHz CPU s)
Run on (128 X 2800 MHz CPU s)
CPU Caches:
L1 Data 64 KiB (x4)
L1 Instruction 64 KiB (x4)
L2 Unified 1024 KiB (x4)
L3 Unified 32768 KiB (x1)
Load Average: 0.00, 0.00, 0.00
L1 Data 64 KiB (x128)
L1 Instruction 64 KiB (x128)
L2 Unified 2048 KiB (x128)
L3 Unified 131072 KiB (x1)
Load Average: 0.13, 0.05, 0.02
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
***WARNING*** Library was built as DEBUG. Timings may be affected.
-------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------
baseDiv/1500 7.90 us 7.90 us 88512
baseDiv/1500 7.37 us 7.37 us 79910
```

{{% notice Please Note %}}

Because the goal is to measure the relative change rather than exact timings, you can ignore the warnings about CPU scaling and the debug build of the library. The speedup from PGO is expected to exceed any reasonable margin of error caused by these effects.

{{% /notice %}}

### Inspect assembly

To inspect what assembly instructions are being executed most frequently, you can use the `perf` command. This is useful for identifying bottlenecks and understanding the performance characteristics of your code.
@@ -101,4 +112,3 @@ sudo perf report --input=perf-division-base
As the `perf report` graphic below shows, the program spends a significant amount of time in the short loops with no loop unrolling. There is also an expensive `sdiv` operation, and most of the execution time is spent storing the result of the operation.

![before-pgo](./before-pgo.gif)

Original file line number Diff line number Diff line change
@@ -32,29 +32,26 @@ g++ -O3 -std=c++17 -fprofile-use div_bench.cpp -lbenchmark -lpthread -o div_benc

Now run the optimized binary:

```bash
```bash { command_line="user@localhost | 2-16" }
./div_bench.opt
```

The following output shows the performance improvement:

```output
2026-05-21T09:20:23+00:00
Running ./div_bench.opt
Run on (4 X 2100 MHz CPU s)
Run on (128 X 2800 MHz CPU s)
CPU Caches:
L1 Data 64 KiB (x4)
L1 Instruction 64 KiB (x4)
L2 Unified 1024 KiB (x4)
L3 Unified 32768 KiB (x1)
Load Average: 0.10, 0.03, 0.01
L1 Data 64 KiB (x128)
L1 Instruction 64 KiB (x128)
L2 Unified 2048 KiB (x128)
L3 Unified 131072 KiB (x1)
Load Average: 0.00, 0.00, 0.00
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
***WARNING*** Library was built as DEBUG. Timings may be affected.
-------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------
baseDiv/1500 2.86 us 2.86 us 244429
baseDiv/1500 2.06 us 2.06 us 315065
```

As the terminal output above shows, the average execution time is reduced from 7.90 to 2.86 microseconds. This improvement occurs because the profile data informed the compiler that the input divisor was consistently 1500 during the profiled runs, allowing it to apply specific optimizations.
As the terminal output above shows, the average execution time is reduced from 7.37 to 2.06 microseconds. This improvement occurs because the profile data informed the compiler that the input divisor was consistently 1500 during the profiled runs, allowing it to apply specific optimizations.
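To make the idea of a divisor-specific optimization more concrete, the following is a minimal hand-written sketch of the kind of specialization a compiler can derive from value-profile data. It is an illustration under stated assumptions, not the code GCC actually generates for `div_bench.cpp`: the function names `divide_generic` and `divide_specialized` and the constant `kHotDivisor` are hypothetical, and the real transformation happens inside the compiler rather than in the source.

```cpp
#include <cstdint>

// Generic division: the divisor is only known at run time, so the compiler
// has to emit a hardware divide instruction (sdiv on AArch64).
int32_t divide_generic(int32_t numerator, int32_t divisor) {
    return numerator / divisor;
}

// Hypothetical constant standing in for the value the profile reported as
// the dominant divisor (1500 in the profiled runs described above).
constexpr int32_t kHotDivisor = 1500;

// Hand-written approximation of a profile-guided specialization.
// Assumption: the compiler guards a fast path with a cheap comparison and
// keeps the generic divide as a fallback for every other divisor.
int32_t divide_specialized(int32_t numerator, int32_t divisor) {
    if (divisor == kHotDivisor) {
        // Dividing by a compile-time constant lets the compiler replace the
        // divide with a multiply-and-shift sequence.
        return numerator / kHotDivisor;
    }
    // Fallback: behaves exactly like the generic version.
    return numerator / divisor;
}
```

Both functions return the same result for every valid input. Only the cost of the common case changes, which is why the benefit depends on the profiled divisor also being the divisor seen in production.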

Next, let's examine how the code was optimized at the assembly level.
