diff --git a/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/_index.md b/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/_index.md
new file mode 100644
index 0000000000..47c207d7ec
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/_index.md
@@ -0,0 +1,43 @@
---
title: Fine-tune LLM performance on CPU with multithreading

minutes_to_complete: 20

who_is_this_for: ML engineers looking to fine-tune the inference performance of LLMs running on CPU

learning_objectives:
  - Understand how PyTorch uses multiple threads for CPU inference and the trade-offs involved
  - Tune the thread count to improve performance for specific models and systems

prerequisites:
  - Intermediate understanding of Python and PyTorch
  - Access to an Arm-based system

author: Kieran Hejmadi

### Tags
skilllevels: Introductory
subjects: ML
armips:
  - Neoverse
tools_software_languages:
  - Python
  - PyTorch
  - Bash
operatingsystems:
  - Linux


further_reading:
  - resource:
      title: Arm Tool-Solutions
      link: https://github.com/ARM-software/Tool-Solutions/tree/main
      type: website


### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1                       # _index.md always has weight of 1 to order correctly
layout: "learningpathall"       # All files under learning paths have this same wrapper
learning_path_main_page: "yes"  # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---
diff --git a/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/_next-steps.md b/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/_next-steps.md
new file mode 100644
index 0000000000..eccc3dab2b
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/_next-steps.md
@@ -0,0 +1,10 @@
---
# ================================================================================
# FIXED, DO NOT MODIFY THIS FILE
# ================================================================================
weight: 21                 # The weight controls the order of the pages. _index.md always has weight 1.
title: "Next Steps"        # Always the same, html page title.
layout: "learningpathall"  # All files under learning paths have this same wrapper for Hugo processing.
---

If you would like to explore other dials for improving the performance of an LLM, or of other types of AI models, on Arm, see the examples page in the [Tool-Solutions repository](https://github.com/ARM-software/Tool-Solutions/tree/main/ML-Frameworks/pytorch-aarch64/examples).
\ No newline at end of file
diff --git a/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/background.md b/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/background.md
new file mode 100644
index 0000000000..215f7cc9c2
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/background.md
@@ -0,0 +1,153 @@
---
title: Background Information
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Background Information

A well-known challenge in parallel programming is choosing the right number of threads for a given amount of work.
When multiple threads are created to perform a task, the actual computation must be large enough to justify the overhead of coordinating those threads.

For example, if a computation is split across many threads, the costs of:
- creating the threads, and
- synchronizing their results through shared memory

can easily outweigh any performance gains from parallel execution. The same principle applies to generative AI workloads running on CPU.

When work is distributed across multiple threads, communication and synchronization overhead increases the total amount of work the system must perform. This creates a trade-off between:

- **Latency** – the time to process a single request, and
- **Throughput** – the number of requests processed per unit time.

PyTorch attempts to automatically choose an appropriate number of threads. However, as we will show, in some cases you may want to manually fine-tune this configuration to improve performance.

## Multi-threading with PyTorch on CPU

The diagram below is taken from the [PyTorch documentation](https://docs.pytorch.org/docs/2.8/notes/cpu_threading_torchscript_inference.html). When running inference, PyTorch uses an **Application Thread Pool**. PyTorch distinguishes between inter-op parallelism, which spawns threads to run separate operations in a graph in parallel (for example, one thread for a matmul and another for a softmax), and intra-op parallelism, which spawns multiple threads to work on the same operation.

![threading-in-pytorch](./pytorch-threading.jpg)

In PyTorch, the `torch.set_num_threads()` [API](https://docs.pytorch.org/docs/stable/generated/torch.set_num_threads.html) sets the maximum number of threads to spawn in the Application Thread Pool.

As of PyTorch 2.8.0, the default number of threads is equal to the number of CPU cores (see the [PyTorch CPU threading documentation](https://docs.pytorch.org/docs/2.8/notes/cpu_threading_torchscript_inference.html) for more detail). PyTorch then scales the number of threads it actually uses for a parallel region, as shown in the following snippet from the PyTorch source file [ParallelOpenMP.h](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/ParallelOpenMP.h):

```cpp
int64_t num_threads = omp_get_num_threads();
if (grain_size > 0) {
  num_threads = std::min(num_threads, divup((end - begin), grain_size));
}

...

inline int64_t divup(int64_t x, int64_t y) {
  return (x + y - 1) / y;
}
```

In PyTorch builds that use OpenMP, the maximum size of the application's thread pool can be configured once at runtime using the `OMP_NUM_THREADS` environment variable. The actual number of threads used will scale up to this limit depending on the workload and the `grain_size`.
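To make that scaling rule concrete, here is a small back-of-the-envelope sketch of the same calculation in Python. The thread count, workload size, and grain size below are illustrative values, not figures taken from a real PyTorch run:

```python
def divup(x, y):
    # Rounding-up integer division, as in the PyTorch snippet above
    return (x + y - 1) // y


def threads_used(omp_threads, work_items, grain_size):
    # A parallel region never uses more threads than there are "grains"
    # of work, so a small workload runs on only a few threads even when
    # many OpenMP threads are available.
    return min(omp_threads, divup(work_items, grain_size))


# 96 OpenMP threads available, 65,536 elements of work, grain size 32,768:
# divup(65536, 32768) = 2, so only 2 threads actually participate.
print(threads_used(96, 65_536, 32_768))
```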
The short example below illustrates that the default settings on many-core systems may not provide optimal performance for all workloads.

## Basic PyTorch Example

Create a new file named `pytorch_omp_example.py` and paste in the Python script below. The script performs a matrix multiplication in eager mode on two 256×256 random matrices.
For this relatively small computation, we will:

- Observe the default performance of PyTorch's parallelism
- Print the parallel configuration using `torch.__config__.parallel_info()`

```python
import os
import time

import torch


def main():
    print(f"PyTorch version: {torch.__version__}")

    # Read OMP_NUM_THREADS from the environment
    omp_threads = os.environ.get("OMP_NUM_THREADS")
    print(f"OMP_NUM_THREADS in environment: {omp_threads}")

    # If it's set and looks like a number, use it to set PyTorch's intra-op threads
    if omp_threads and omp_threads.isdigit():
        torch.set_num_threads(int(omp_threads))

    # Show how many threads PyTorch will actually use for intra-op parallelism
    print(f"torch.get_num_threads(): {torch.get_num_threads()}\n")

    # A simple operation to illustrate parallelism
    size = 256
    a = torch.randn(size, size)
    b = torch.randn(size, size)

    start = time.time()
    c = a @ b  # matrix multiplication (runs in a parallel region on CPU)
    end = time.time()

    print(f"Result shape: {c.shape}")
    print(f"Matrix multiply time: {end - start:.5f} seconds")
    print(f"\nThreading Information = {torch.__config__.parallel_info()}")


if __name__ == "__main__":
    main()
```

Run the Python script:

```bash
python pytorch_omp_example.py
```

You will see output similar to the following. The number of threads defaults to the core count of 96, and the matrix multiply takes 2.24 ms.

```output
PyTorch version: 2.10.0.dev20251124
OMP_NUM_THREADS in environment: None
torch.get_num_threads(): 96

Result shape: torch.Size([256, 256])
Matrix multiply time: 0.00224 seconds

Threading Information = ATen/Parallel:
    at::get_num_threads() : 96
    at::get_num_interop_threads() : 96
OpenMP 201511 (a.k.a. OpenMP 4.5)
    omp_get_max_threads() : 96
Intel(R) MKL-DNN v3.11.0 (Git Hash 0b8a866c009b03f322e6526d7c33cfec84a4a97a)
std::thread::hardware_concurrency() : 96
Environment variables:
    OMP_NUM_THREADS : [not set]
ATen parallel backend: OpenMP
```

Now reduce the number of OpenMP threads with the `OMP_NUM_THREADS` environment variable and observe that the matrix multiply time drops to 0.64 ms.

```bash
OMP_NUM_THREADS=16 python pytorch_omp_example.py
```

```output
PyTorch version: 2.10.0.dev20251124
OMP_NUM_THREADS in environment: 16
torch.get_num_threads(): 16

Result shape: torch.Size([256, 256])
Matrix multiply time: 0.00064 seconds

Threading Information = ATen/Parallel:
    at::get_num_threads() : 16
    at::get_num_interop_threads() : 96
OpenMP 201511 (a.k.a. OpenMP 4.5)
    omp_get_max_threads() : 16
Intel(R) MKL-DNN v3.11.0 (Git Hash 0b8a866c009b03f322e6526d7c33cfec84a4a97a)
std::thread::hardware_concurrency() : 96
Environment variables:
    OMP_NUM_THREADS : 16
ATen parallel backend: OpenMP
```

We will now move on from this trivial example to a much larger workload: a large language model (LLM).
\ No newline at end of file
diff --git a/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/build.md b/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/build.md
new file mode 100644
index 0000000000..97f1cf7f92
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/build.md
@@ -0,0 +1,69 @@
---
title: Setup Environment
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Build

In this learning path, we will use Arm's downstream canary release of PyTorch, which includes ready-to-use examples and scripts. While this release offers access to the latest downstream features, it is intended for experimentation rather than production use.

### 1. Create a Hugging Face account
Create a [Hugging Face account](https://huggingface.co/) if you do not already have one. Once it is created, request access to the [1B](https://huggingface.co/google/gemma-3-1b-it) and [270M](https://huggingface.co/google/gemma-3-270m-it) variants of Google's Gemma-3 model. It will take around 15 minutes to be granted access.

### 2. Connect to an Arm system and install Docker

Please see our [getting started guide](https://learn.arm.com/learning-paths/servers-and-cloud-computing/csp/) if this is your first time using Arm-based cloud instances.

In this example we will be using an AWS Graviton4 (`m8g.24xlarge`) instance running Ubuntu 24.04 LTS, which is based on the Neoverse V2 architecture.

Install Docker by following the [official documentation](https://docs.docker.com/engine/install/ubuntu/) or our [Arm install guide](https://learn.arm.com/install-guides/docker/docker-desktop-arm-linux/). Make sure to follow the post-installation steps.

### 3. Build the PyTorch-AArch64 Docker Container

Connect to the Arm instance and run the following to clone the repository.

```bash
git clone https://github.com/ARM-software/Tool-Solutions.git
cd Tool-Solutions/ML-Frameworks/pytorch-aarch64/
```

Run the following bash script to build the container:

```bash
./build.sh -n $(($(nproc) - 1))
```

> **Note**: On a 96-core AWS `m8g.24xlarge` instance, the build takes approximately 20 minutes.

Once the build has finished, run the following command, replacing `<version>` with the versions of torch and torchao just built.

```bash
./dockerize.sh ./results/torch-<version>-linux_aarch64.whl ./results/torchao-<version>-py3-none-any.whl
```

You should see the following prompt in your terminal, confirming that you are now inside the Docker container.

```output
aarch64_pytorch ~>
```

### 4. Log in to Hugging Face

Create a new `read` token on Hugging Face by clicking [this link](https://huggingface.co/settings/tokens/new?tokenType=read).

![hf-token](./hf-access-token.jpg)

Provide a suitable token name, press **Create token**, and copy the generated token value. From within the Docker container, enter the following command and paste the token to log in.
```bash
huggingface-cli login
```

> **Note**: The login will not persist once the Docker session has ended.
\ No newline at end of file
diff --git a/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/hf-access-token.jpg b/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/hf-access-token.jpg
new file mode 100644
index 0000000000..c02ba89881
Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/hf-access-token.jpg differ
diff --git a/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/prefill_throughput.png b/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/prefill_throughput.png
new file mode 100644
index 0000000000..d67b397259
Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/prefill_throughput.png differ
diff --git a/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/pytorch-threading.jpg b/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/pytorch-threading.jpg
new file mode 100644
index 0000000000..6b63f93cf1
Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/pytorch-threading.jpg differ
diff --git a/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/tune.md b/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/tune.md
new file mode 100644
index 0000000000..73fe6f15a4
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/tune.md
@@ -0,0 +1,150 @@
---
title: Fine-Tune Thread Count
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Tuning the thread count

We will run inference on Google's [Gemma-3](https://huggingface.co/google/gemma-3-1b-it) model and measure how inference performance varies with thread count for both the 270-million-parameter and 1-billion-parameter variants. We will be running the `transformers_llm_text_gen.py` script, which by default applies groupwise, layout-aware INT4 quantization to the model.

Create a file named `comparison-1b.sh` and paste in the following script.
```bash
#!/usr/bin/env bash
set -euo pipefail

# Loop over OMP_NUM_THREADS: powers of 2 plus 96
for t in 2 4 8 16 32 64 96; do
  echo "==============================="
  echo "Running with OMP_NUM_THREADS=$t"
  echo "==============================="

  TORCHINDUCTOR_CPP_WRAPPER=1 \
  TORCHINDUCTOR_FREEZING=1 \
  OMP_NUM_THREADS="$t" \
  python transformers_llm_text_gen.py --model google/gemma-3-1b-it 2>&1 | \
    grep -E \
      "^(Prefill Tokens|Prefill time|E2E Generation time|Decoded Tokens|Decode time|Prefill Tokens per second|Decode Tokens per second):"

  echo # blank line between runs
done
```

Likewise, create a separate script named `comparison-270m.sh` for the 270M model.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Loop over OMP_NUM_THREADS: powers of 2 plus 96
for t in 2 4 8 16 32 64 96; do
  echo "==============================="
  echo "Running with OMP_NUM_THREADS=$t"
  echo "==============================="

  TORCHINDUCTOR_CPP_WRAPPER=1 \
  TORCHINDUCTOR_FREEZING=1 \
  OMP_NUM_THREADS="$t" \
  python transformers_llm_text_gen.py --model google/gemma-3-270m-it 2>&1 | \
    grep -E \
      "^(Prefill Tokens|Prefill time|E2E Generation time|Decoded Tokens|Decode time|Prefill Tokens per second|Decode Tokens per second):"

  echo # blank line between runs
done
```

Make both scripts executable and run them from the directory containing the `transformers_llm_text_gen.py` file using the commands below. For clarity, we print only the final statistics.

```bash
chmod +x comparison-1b.sh comparison-270m.sh
./comparison-1b.sh
./comparison-270m.sh
```

> **Note**: In a separate terminal session you can observe the real-time CPU utilization and the spawning of threads by running the following command.

```bash
watch -n 0.1 'pid=$(pgrep -n python); [ -n "$pid" ] && ps -L -p "$pid" -o pid,tid,psr,pcpu,stat,comm'
```

You should see output similar to the example below, showing the CPU utilization of each thread. This illustrates how new threads, both inter-op and intra-op, are created and used over time.

```output
   PID    TID PSR %CPU STAT COMMAND
 10600  10600  31 85.3 Rl+  python
 10600  10606  32  2.4 Sl+  python
 10600  10607  33  2.4 Sl+  python
 10600  10608  34  2.4 Sl+  python
```

## Results

You should see the following output summarizing the statistics for each run as we sweep through the number of threads.

```output
===============================
Running with OMP_NUM_THREADS=2
===============================
Prefill Tokens: 55
Prefill time: 0.07 seconds
E2E Generation time: 1.50 seconds
Decoded Tokens: 65
Decode time: 1.44 seconds
Prefill Tokens per second: 834.48
Decode Tokens per second: 45.23

...

```

The graph below shows how prefill throughput (tokens per second) changes with the number of OpenMP threads for the 270M and 1B variants of Gemma-3. As expected, the smaller 270M model runs faster. Both models reach their optimal token generation rate at around 16–32 threads, though the 270M model exhibits a sharper performance drop-off beyond this range compared with the 1B variant.

![comparison](./prefill_throughput.png)

## Using PyTorch Compilation Mode

So far, we have been running in PyTorch's eager execution mode. We can also observe the performance characteristics of PyTorch's compile mode. First, install a C++ compiler and the required dependencies.

```bash
sudo apt update && sudo apt install g++ python3.10-dev build-essential
```

Then run the `gemma-3-270m` model with the `--compile` flag, leaving the number of OpenMP threads at the default.
```bash
TORCHINDUCTOR_CPP_WRAPPER=1 TORCHINDUCTOR_FREEZING=1 python transformers_llm_text_gen.py --compile --model google/gemma-3-270m-it
```

```output
E2E Generation time: 6.15 seconds
Decoded Tokens: 65
Decode time: 5.74 seconds
Prefill Tokens per second: 133.52
Decode Tokens per second: 11.33
```

Now run with `OMP_NUM_THREADS` set to 16.

```bash
TORCHINDUCTOR_CPP_WRAPPER=1 TORCHINDUCTOR_FREEZING=1 OMP_NUM_THREADS=16 python transformers_llm_text_gen.py --compile --model google/gemma-3-270m-it
```

As the output below shows, reducing the thread count from the default of 96 to 16 gives a large reduction in end-to-end generation time.

```output
E2E Generation time: 0.63 seconds
Decoded Tokens: 65
Decode time: 0.61 seconds
Prefill Tokens per second: 2728.34
Decode Tokens per second: 107.37
```

### Summary

In this learning path, we explored how the number of OpenMP threads is a tunable parameter that can significantly affect the performance of a large language model. This is especially important when running such models on Arm systems with high core counts. You should also take the model's parameter count into account. In practice, a heuristic or trial-and-error approach is often the fastest way to determine the optimal thread count for a given model and system.
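As a starting point for that trial-and-error process, the sketch below shows one way to sweep intra-op thread counts directly from Python rather than from a shell loop. `run_workload` is a hypothetical stand-in for whatever inference call you want to time, and the candidate thread counts are only examples.

```python
import time

import torch


def run_workload():
    # Hypothetical stand-in for the inference you want to tune,
    # for example a prefill plus decode pass of your model.
    a = torch.randn(1024, 1024)
    b = torch.randn(1024, 1024)
    return a @ b


def sweep_threads(candidates=(2, 4, 8, 16, 32, 64)):
    timings = {}
    for n in candidates:
        torch.set_num_threads(n)  # cap the intra-op (OpenMP) thread pool
        run_workload()            # warm-up run, excluded from the timing
        start = time.time()
        run_workload()
        timings[n] = time.time() - start
    return timings


if __name__ == "__main__":
    results = sweep_threads()
    for n, t in sorted(results.items()):
        print(f"{n:>3} threads: {t:.4f} s")
    best = min(results, key=results.get)
    print(f"Fastest thread count on this run: {best}")
```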