@@ -0,0 +1,43 @@
---
title: Fine-tune LLM performance on CPU with multithreading

minutes_to_complete: 20

who_is_this_for: ML Engineers looking to fine-tune the inference performance of LLMs running on CPU

learning_objectives:
- Understand how PyTorch uses multiple threads for CPU inference and the various tradeoffs involved
- Tune the thread count to improve performance for specific models and systems

prerequisites:
- Intermediate understanding of Python and PyTorch
- Access to an Arm-based system

author: Kieran Hejmadi

### Tags
skilllevels: Introductory
subjects: ML
armips:
- Neoverse
tools_software_languages:
- Python
- PyTorch
- Bash
operatingsystems:
- Linux


further_reading:
- resource:
title: Arm Tool Solutions
link: https://github.com/ARM-software/Tool-Solutions/tree/main
type: website


### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1 # _index.md always has weight of 1 to order correctly
layout: "learningpathall" # All files under learning paths have this same wrapper
learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---
@@ -0,0 +1,10 @@
---
# ================================================================================
# FIXED, DO NOT MODIFY THIS FILE
# ================================================================================
weight: 21 # The weight controls the order of the pages. _index.md always has weight 1.
title: "Next Steps" # Always the same, html page title.
layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing.
---

If you would like to explore other dials for improving the performance of an LLM, or of other types of AI models, on Arm, see the examples page in the [Tool-Solutions repository](https://github.com/ARM-software/Tool-Solutions/tree/main/ML-Frameworks/pytorch-aarch64/examples).
@@ -0,0 +1,153 @@
---
title: Background Information
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Background Information

A well-known challenge in parallel programming is choosing the right number of threads for a given amount of work. When multiple threads are created to perform a task, the actual computation must be large enough to justify the overhead of coordinating those threads.

For example, if a computation is split across many threads, the costs of:
- creating the threads, and
- synchronizing their results through shared memory

can easily outweigh any performance gains from parallel execution. The same principle applies to generative AI workloads running on CPU.

When work is distributed across multiple threads, communication and synchronization overhead increases the total amount of work the system must perform. This creates a trade-off between:

- **Latency** – the time to process a single request, and
- **Throughput** – the number of requests processed per unit time.

PyTorch attempts to automatically choose an appropriate number of threads. However, as we will show, in some cases you may want to manually fine-tune this configuration to improve performance.

## Multi-threading with PyTorch on CPU


The diagram below is taken from the [PyTorch documentation](https://docs.pytorch.org/docs/2.8/notes/cpu_threading_torchscript_inference.html). When running inference, PyTorch uses an **Application Thread Pool**. PyTorch supports two forms of parallelism: inter-op parallelism, which spawns threads to run separate operations in a graph concurrently (for example, one thread for a matmul and another for a softmax), and intra-op parallelism, which spawns multiple threads to work on the same operation.

![threading-in-pytorch](./pytorch-threading.jpg)

In PyTorch, the `torch.set_num_threads()` [API](https://docs.pytorch.org/docs/stable/generated/torch.set_num_threads.html) is used to set the maximum number of threads to spawn in the Application Thread Pool.
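
Both thread pools can be queried and configured from Python. The short sketch below is illustrative only (the thread counts are arbitrary); note that `torch.set_num_interop_threads()` must be called before any inter-op parallel work has started, otherwise PyTorch raises an error.

```python
import torch

# Configure inter-op parallelism (independent ops run concurrently).
# This must be called once, before the first parallel region executes.
torch.set_num_interop_threads(2)

# Configure intra-op parallelism (threads cooperating on a single op).
torch.set_num_threads(8)

print(f"inter-op threads: {torch.get_num_interop_threads()}")
print(f"intra-op threads: {torch.get_num_threads()}")
```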

As of PyTorch 2.8.0, the default number of threads is equal to the number of CPU cores (see the [PyTorch CPU threading documentation](https://docs.pytorch.org/docs/2.8/notes/cpu_threading_torchscript_inference.html) for more detail). PyTorch then chooses how many of these threads to use for a given parallel region, as shown in the following snippet taken from the PyTorch source file [ParallelOpenMP.h](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/ParallelOpenMP.h).

```cpp
int64_t num_threads = omp_get_num_threads();
if (grain_size > 0) {
  num_threads = std::min(num_threads, divup((end - begin), grain_size));
}

// ...

inline int64_t divup(int64_t x, int64_t y) {
  return (x + y - 1) / y;
}
```

In PyTorch builds that use OpenMP, the maximum size of the application’s thread pool can be configured once at runtime using the `OMP_NUM_THREADS` environment variable. The actual number of threads used will scale up to this limit depending on the workload and the `grain_size`.
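
To make the relationship between workload size and thread count concrete, the sketch below re-implements the same arithmetic in plain Python (an illustration of the formula above, not PyTorch code): at most `ceil((end - begin) / grain_size)` threads are useful for a loop of `end - begin` items, capped at the OpenMP thread limit.

```python
def divup(x: int, y: int) -> int:
    # Ceiling division, as in ParallelOpenMP.h
    return (x + y - 1) // y


def threads_used(work_items: int, grain_size: int, omp_threads: int) -> int:
    # Mirrors: num_threads = min(omp_get_num_threads(), divup(end - begin, grain_size))
    if grain_size > 0:
        return min(omp_threads, divup(work_items, grain_size))
    return omp_threads


# Example: a small workload cannot keep 96 threads busy
print(threads_used(work_items=4096, grain_size=1024, omp_threads=96))       # -> 4
print(threads_used(work_items=1_000_000, grain_size=1024, omp_threads=96))  # -> 96
```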

The short example below illustrates that the default settings on many-core systems may not provide optimal performance for all workloads.

## Basic PyTorch Example

Create a new file named `pytorch_omp_example.py` and paste in the Python script below. The script performs a matrix multiplication in eager mode on two 256×256 random matrices.
For this relatively small computation, we will:

- Observe the default performance of PyTorch’s parallelism
- Print the parallel configuration using `torch.__config__.parallel_info()`.

```python
import os
import time
import torch


def main():
    print(f"PyTorch version: {torch.__version__}")

    # Read OMP_NUM_THREADS from the environment
    omp_threads = os.environ.get("OMP_NUM_THREADS")
    print(f"OMP_NUM_THREADS in environment: {omp_threads}")

    # If it's set and looks like a number, use it to set PyTorch's intra-op threads
    if omp_threads and omp_threads.isdigit():
        torch.set_num_threads(int(omp_threads))

    # Show how many threads PyTorch will actually use for intra-op parallelism
    print(f"torch.get_num_threads(): {torch.get_num_threads()}\n")

    # A simple operation to illustrate parallelism:
    size = 256
    a = torch.randn(size, size)
    b = torch.randn(size, size)

    start = time.time()
    c = a @ b  # matrix multiplication (runs in a parallel region on CPU)
    end = time.time()

    print(f"Result shape: {c.shape}")
    print(f"Matrix multiply time: {end - start:.5f} seconds")
    print(f"\nThreading Information = {torch.__config__.parallel_info()}")


if __name__ == "__main__":
    main()
```
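
A single wall-clock timing of such a small operation is fairly noisy, since it includes one-off costs such as starting the OpenMP thread pool. If you want more stable measurements, one option is `torch.utils.benchmark`, as sketched below; this is an optional extra, not part of the script above.

```python
import torch
import torch.utils.benchmark as benchmark

a = torch.randn(256, 256)
b = torch.randn(256, 256)

# Repeatedly times the statement and reports summary statistics
timer = benchmark.Timer(stmt="a @ b", globals={"a": a, "b": b})
print(timer.blocked_autorange())
```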

Run the Python script:

```bash
python pytorch_omp_example.py
```

You will see output similar to the following. As you can see, the number of threads is set to the core count of 96 and the matrix multiply takes 2.24 ms.


```output
PyTorch version: 2.10.0.dev20251124
OMP_NUM_THREADS in environment: None
torch.get_num_threads(): 96

Result shape: torch.Size([256, 256])
Matrix multiply time: 0.00224 seconds

Threading Information = ATen/Parallel:
at::get_num_threads() : 96
at::get_num_interop_threads() : 96
OpenMP 201511 (a.k.a. OpenMP 4.5)
omp_get_max_threads() : 96
Intel(R) MKL-DNN v3.11.0 (Git Hash 0b8a866c009b03f322e6526d7c33cfec84a4a97a)
std::thread::hardware_concurrency() : 96
Environment variables:
OMP_NUM_THREADS : [not set]
ATen parallel backend: OpenMP
```

Now reduce the number of OpenMP threads using the `OMP_NUM_THREADS` environment variable and observe that the matrix multiply time falls to 0.64 ms.

```bash
OMP_NUM_THREADS=16 python pytorch_omp_example.py
```

```output
PyTorch version: 2.10.0.dev20251124
OMP_NUM_THREADS in environment: 16
torch.get_num_threads(): 16

Result shape: torch.Size([256, 256])
Matrix multiply time: 0.00064 seconds

Threading Information = ATen/Parallel:
at::get_num_threads() : 16
at::get_num_interop_threads() : 96
OpenMP 201511 (a.k.a. OpenMP 4.5)
omp_get_max_threads() : 16
Intel(R) MKL-DNN v3.11.0 (Git Hash 0b8a866c009b03f322e6526d7c33cfec84a4a97a)
std::thread::hardware_concurrency() : 96
Environment variables:
OMP_NUM_THREADS : 16
ATen parallel backend: OpenMP
```
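
Rather than testing one `OMP_NUM_THREADS` value at a time, you can also sweep the intra-op thread count from inside Python. The sketch below is illustrative (the thread counts and matrix size are arbitrary; adjust the list for your core count). With the OpenMP backend used here, `torch.set_num_threads()` adjusts the intra-op thread count at runtime, although the PyTorch documentation recommends setting it before running parallel code, so treat the sweep as an experiment rather than a guarantee.

```python
import time
import torch

size = 256
a = torch.randn(size, size)
b = torch.randn(size, size)

for n in (1, 2, 4, 8, 16, 32, 64, 96):
    torch.set_num_threads(n)
    a @ b  # warm-up run so thread pool start-up is not measured
    start = time.time()
    for _ in range(100):
        a @ b
    elapsed = (time.time() - start) / 100
    print(f"threads={n:3d}  avg matmul time={elapsed * 1000:.3f} ms")
```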

We will now move on from this trivial example to a much larger workload: a large language model (LLM).
@@ -0,0 +1,69 @@
---
title: Setup Environment
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Build

In this learning path, we will use Arm’s downstream canary release of PyTorch, which includes ready-to-use examples and scripts. While this release offers access to the latest downstream features, it is intended for experimentation rather than production use.

### 1. Create HuggingFace Account

Create a [Hugging Face account](https://huggingface.co/) if you do not already have one. Once created, request access to the [1B](https://huggingface.co/google/gemma-3-1b-it) and [270M](https://huggingface.co/google/gemma-3-270m-it) variants of Google's Gemma-3 model. Access is typically granted within around 15 minutes.

### 2. Connect to an Arm system and install Docker

Please see our [getting started guide](https://learn.arm.com/learning-paths/servers-and-cloud-computing/csp/) if it is your first time using Arm-based cloud instances.

In this example, I will be using an AWS Graviton 4 (`m8g.24xlarge`) instance running Ubuntu 24.04 LTS, which is based on the Arm Neoverse V2 core.


Install docker through the [official documentation](https://docs.docker.com/engine/install/ubuntu/) or our [Arm install guide](https://learn.arm.com/install-guides/docker/docker-desktop-arm-linux/). Make sure to follow the post-installation steps.


### 3. Build the PyTorch-AArch64 Docker Container

Connect to the Arm instance and run the following to clone the repository.

```bash
git clone https://github.com/ARM-software/Tool-Solutions.git
cd Tool-Solutions/ML-Frameworks/pytorch-aarch64/
```
Run the following bash script to build the container:

```bash
./build.sh -n $(($(nproc) - 1))
```

> **Note**: On a 96-core AWS `m8g.24xlarge` instance, the build takes approximately 20 minutes.

Once the build has finished, run the following command, replacing `<version>` with the versions of torch and torchao that were just built.

```bash
./dockerize.sh ./results/torch-<version>linux_aarch64.whl ./results/torchao-<version>-py3-none-any.whl
```

You should see a prompt similar to the following, confirming that you are now inside the Docker container.

```output
aarch64_pytorch ~>
```
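
Optionally, you can sanity-check the build before continuing. The snippet below (run it with `python` inside the container) simply confirms which PyTorch version is installed and that the OpenMP backend described earlier is active; exact version strings will differ on your system.

```python
import torch

# The version string should match the wheel produced by build.sh
print(f"PyTorch version: {torch.__version__}")

# Confirms the ATen parallel backend (OpenMP) and default thread counts
print(torch.__config__.parallel_info())
```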

### 4. Log in to Hugging Face

Create a new `read` token on Hugging Face by clicking [this link](https://huggingface.co/settings/tokens/new?tokenType=read).

![hf-token](./hf-access-token.jpg)

Provide a suitable token name, select **Create token**, and copy the generated token value. From within the Docker container, enter the following command and paste the token to log in.

```bash
huggingface-cli login
```

> **Note**: The login will not persist once the Docker session has ended.
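
As an alternative to the CLI, you can log in programmatically with the `huggingface_hub` Python library (the package that provides `huggingface-cli`). The sketch below assumes you have exported your token as the `HF_TOKEN` environment variable; like the CLI login, it does not persist across container sessions.

```python
import os
from huggingface_hub import login, whoami

# Read the access token from the environment rather than hard-coding it
login(token=os.environ["HF_TOKEN"])

# Prints your account name if authentication succeeded
print(whoami()["name"])
```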