@@ -0,0 +1,43 @@
---
title: Fine-tune LLM performance on CPU with multithreading

minutes_to_complete: 20

who_is_this_for: ML Engineers looking to fine-tune the inference performance of LLMs running on CPU

learning_objectives:
- Understand how PyTorch uses multiple threads for CPU inference and the various tradeoffs involved
- Tune the thread count to improve performance for specific models and systems

prerequisites:
- Intermediate understanding of Python and PyTorch
- Access to an Arm-based system

author: Kieran Hejmadi

### Tags
skilllevels: Introductory
subjects: ML
armips:
- Neoverse
tools_software_languages:
- Python
- PyTorch
- Bash
operatingsystems:
- Linux


further_reading:
- resource:
title: Arm Tool Solutions
link: https://github.com/ARM-software/Tool-Solutions/tree/main
type: website


### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1 # _index.md always has weight of 1 to order correctly
layout: "learningpathall" # All files under learning paths have this same wrapper
learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---
@@ -0,0 +1,10 @@
---
# ================================================================================
# FIXED, DO NOT MODIFY THIS FILE
# ================================================================================
weight: 21 # The weight controls the order of the pages. _index.md always has weight 1.
title: "Next Steps" # Always the same, html page title.
layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing.
---

If you would like to explore other dials for improving the performance of an LLM, or of other types of AI models, on Arm, see the examples page in the [Tool-Solutions repository](https://github.com/ARM-software/Tool-Solutions/tree/main/ML-Frameworks/pytorch-aarch64/examples).
@@ -0,0 +1,153 @@
---
title: Background Information
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Background Information

A well-known challenge in parallel programming is choosing the right number of threads for a given amount of work. When multiple threads are created to perform a task, the actual computation must be large enough to justify the overhead of coordinating those threads.

For example, if a computation is split across many threads, the costs of:
- creating the threads, and
- synchronizing their results through shared memory

can easily outweigh any performance gains from parallel execution. The same principle applies to generative AI workloads running on CPU.

When work is distributed across multiple threads, communication and synchronization overhead increases the total amount of work the system must perform. This creates a trade-off between:

- **Latency** – the time to process a single request, and
- **Throughput** – the number of requests processed per unit time.

PyTorch attempts to automatically choose an appropriate number of threads. However, as we will show, in some cases you may want to manually fine-tune this configuration to improve performance.

## Multi-threading with PyTorch on CPU


The diagram below is taken from the [PyTorch documentation](https://docs.pytorch.org/docs/2.8/notes/cpu_threading_torchscript_inference.html). When running inference, PyTorch uses an **Application Thread Pool**. PyTorch supports two forms of parallelism: inter-op parallelism, which spawns threads to run separate operations in a graph concurrently (for example, one thread for a matmul and another for a softmax), and intra-op parallelism, which spawns multiple threads to work on the same operation.

![threading-in-pytorch](./pytorch-threading.jpg)

In PyTorch, the `torch.set_num_threads()` [API](https://docs.pytorch.org/docs/stable/generated/torch.set_num_threads.html) is used to set the maximum number of threads to spawn in the Application Thread Pool.
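
Both thread pools can be queried and configured from Python. The short sketch below is illustrative only (the thread counts are arbitrary); note that `torch.set_num_interop_threads()` must be called before any inter-op parallel work has started, otherwise PyTorch raises an error.

```python
import torch

# Configure inter-op parallelism (independent ops run concurrently).
# This must be called once, before the first parallel region executes.
torch.set_num_interop_threads(2)

# Configure intra-op parallelism (threads cooperating on a single op).
torch.set_num_threads(8)

print(f"inter-op threads: {torch.get_num_interop_threads()}")
print(f"intra-op threads: {torch.get_num_threads()}")
```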

As of PyTorch 2.8.0, the default number of threads is equal to the number of CPU cores (see the [PyTorch CPU threading documentation](https://docs.pytorch.org/docs/2.8/notes/cpu_threading_torchscript_inference.html) for more detail). PyTorch then chooses how many of these threads to use for a given parallel region, as shown in the following snippet taken from the PyTorch source file [ParallelOpenMP.h](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/ParallelOpenMP.h).

```cpp
int64_t num_threads = omp_get_num_threads();
if (grain_size > 0) {
  num_threads = std::min(num_threads, divup((end - begin), grain_size));
}

// ...

inline int64_t divup(int64_t x, int64_t y) {
  return (x + y - 1) / y;
}
```

In PyTorch builds that use OpenMP, the maximum size of the application’s thread pool can be configured once at runtime using the `OMP_NUM_THREADS` environment variable. The actual number of threads used will scale up to this limit depending on the workload and the `grain_size`.
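
To make the relationship between workload size and thread count concrete, the sketch below re-implements the same arithmetic in plain Python (an illustration of the formula above, not PyTorch code): at most `ceil((end - begin) / grain_size)` threads are useful for a loop of `end - begin` items, capped at the OpenMP thread limit.

```python
def divup(x: int, y: int) -> int:
    # Ceiling division, as in ParallelOpenMP.h
    return (x + y - 1) // y


def threads_used(work_items: int, grain_size: int, omp_threads: int) -> int:
    # Mirrors: num_threads = min(omp_get_num_threads(), divup(end - begin, grain_size))
    if grain_size > 0:
        return min(omp_threads, divup(work_items, grain_size))
    return omp_threads


# Example: a small workload cannot keep 96 threads busy
print(threads_used(work_items=4096, grain_size=1024, omp_threads=96))       # -> 4
print(threads_used(work_items=1_000_000, grain_size=1024, omp_threads=96))  # -> 96
```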

The short example below illustrates that the default settings on many-core systems may not provide optimal performance for all workloads.

## Basic PyTorch Example

Create a new file named `pytorch_omp_example.py` and paste in the Python script below. The script performs a matrix multiplication in eager mode on two 256×256 random matrices.
For this relatively small computation, we will:

- Observe the default performance of PyTorch’s parallelism
- Print the parallel configuration using `torch.__config__.parallel_info()`.

```python
import os
import time
import torch


def main():
    print(f"PyTorch version: {torch.__version__}")

    # Read OMP_NUM_THREADS from the environment
    omp_threads = os.environ.get("OMP_NUM_THREADS")
    print(f"OMP_NUM_THREADS in environment: {omp_threads}")

    # If it's set and looks like a number, use it to set PyTorch's intra-op threads
    if omp_threads and omp_threads.isdigit():
        torch.set_num_threads(int(omp_threads))

    # Show how many threads PyTorch will actually use for intra-op parallelism
    print(f"torch.get_num_threads(): {torch.get_num_threads()}\n")

    # A simple operation to illustrate parallelism:
    size = 256
    a = torch.randn(size, size)
    b = torch.randn(size, size)

    start = time.time()
    c = a @ b  # matrix multiplication (runs in a parallel region on CPU)
    end = time.time()

    print(f"Result shape: {c.shape}")
    print(f"Matrix multiply time: {end - start:.5f} seconds")
    print(f"\nThreading Information = {torch.__config__.parallel_info()}")


if __name__ == "__main__":
    main()
```
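
A single wall-clock timing of such a small operation is fairly noisy, since it includes one-off costs such as starting the OpenMP thread pool. If you want more stable measurements, one option is `torch.utils.benchmark`, as sketched below; this is an optional extra, not part of the script above.

```python
import torch
import torch.utils.benchmark as benchmark

a = torch.randn(256, 256)
b = torch.randn(256, 256)

# Repeatedly times the statement and reports summary statistics
timer = benchmark.Timer(stmt="a @ b", globals={"a": a, "b": b})
print(timer.blocked_autorange())
```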

Run the Python script:

```bash
python pytorch_omp_example.py
```

You will see output similar to the following. As you can see, the number of threads is set to the core count of 96 and the matrix multiply takes 2.24 ms.


```output
PyTorch version: 2.10.0.dev20251124
OMP_NUM_THREADS in environment: None
torch.get_num_threads(): 96

Result shape: torch.Size([256, 256])
Matrix multiply time: 0.00224 seconds

Threading Information = ATen/Parallel:
at::get_num_threads() : 96
at::get_num_interop_threads() : 96
OpenMP 201511 (a.k.a. OpenMP 4.5)
omp_get_max_threads() : 96
Intel(R) MKL-DNN v3.11.0 (Git Hash 0b8a866c009b03f322e6526d7c33cfec84a4a97a)
std::thread::hardware_concurrency() : 96
Environment variables:
OMP_NUM_THREADS : [not set]
ATen parallel backend: OpenMP
```

Now reduce the number of OpenMP threads using the `OMP_NUM_THREADS` environment variable and observe that the matrix multiply time falls to 0.64 ms.

```bash
OMP_NUM_THREADS=16 python pytorch_omp_example.py
```

```output
PyTorch version: 2.10.0.dev20251124
OMP_NUM_THREADS in environment: 16
torch.get_num_threads(): 16

Result shape: torch.Size([256, 256])
Matrix multiply time: 0.00064 seconds

Threading Information = ATen/Parallel:
at::get_num_threads() : 16
at::get_num_interop_threads() : 96
OpenMP 201511 (a.k.a. OpenMP 4.5)
omp_get_max_threads() : 16
Intel(R) MKL-DNN v3.11.0 (Git Hash 0b8a866c009b03f322e6526d7c33cfec84a4a97a)
std::thread::hardware_concurrency() : 96
Environment variables:
OMP_NUM_THREADS : 16
ATen parallel backend: OpenMP
```
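
Rather than testing one `OMP_NUM_THREADS` value at a time, you can also sweep the intra-op thread count from inside Python. The sketch below is illustrative (the thread counts and matrix size are arbitrary; adjust the list for your core count). With the OpenMP backend used here, `torch.set_num_threads()` adjusts the intra-op thread count at runtime, although the PyTorch documentation recommends setting it before running parallel code, so treat the sweep as an experiment rather than a guarantee.

```python
import time
import torch

size = 256
a = torch.randn(size, size)
b = torch.randn(size, size)

for n in (1, 2, 4, 8, 16, 32, 64, 96):
    torch.set_num_threads(n)
    a @ b  # warm-up run so thread pool start-up is not measured
    start = time.time()
    for _ in range(100):
        a @ b
    elapsed = (time.time() - start) / 100
    print(f"threads={n:3d}  avg matmul time={elapsed * 1000:.3f} ms")
```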

We will now move on from this trivial example to a much larger workload: a large language model (LLM).
@@ -0,0 +1,69 @@
---
title: Setup Environment
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Build

In this learning path, we will use Arm’s downstream canary release of PyTorch, which includes ready-to-use examples and scripts. While this release offers access to the latest downstream features, it is intended for experimentation rather than production use.

### 1. Create HuggingFace Account

Create a [Hugging Face account](https://huggingface.co/) if you do not already have one. Once created, request access to the [1B](https://huggingface.co/google/gemma-3-1b-it) and [270M](https://huggingface.co/google/gemma-3-270m-it) variants of Google's Gemma-3 model. Access is typically granted within around 15 minutes.

### 2. Connect to an Arm system and install Docker

Please see our [getting started guide](https://learn.arm.com/learning-paths/servers-and-cloud-computing/csp/) if it is your first time using Arm-based cloud instances.

In this example, I will be using an AWS Graviton 4 (`m8g.24xlarge`) instance running Ubuntu 24.04 LTS, which is based on the Arm Neoverse V2 core.


Install docker through the [official documentation](https://docs.docker.com/engine/install/ubuntu/) or our [Arm install guide](https://learn.arm.com/install-guides/docker/docker-desktop-arm-linux/). Make sure to follow the post-installation steps.


### 3. Build the PyTorch-AArch64 Docker Container

Connect to the Arm instance and run the following to clone the repository.

```bash
git clone https://github.com/ARM-software/Tool-Solutions.git
cd Tool-Solutions/ML-Frameworks/pytorch-aarch64/
```
Run the following bash script to build the container:

```bash
./build.sh -n $(($(nproc) - 1))
```

> **Note**: On a 96-core AWS `m8g.24xlarge` instance, the build takes approximately 20 minutes.

Once the build has finished, run the following command, replacing `<version>` with the versions of torch and torchao that were just built.

```bash
./dockerize.sh ./results/torch-<version>linux_aarch64.whl ./results/torchao-<version>-py3-none-any.whl
```

You should see a prompt similar to the following, confirming that you are now inside the Docker container.

```output
aarch64_pytorch ~>
```
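
Optionally, you can sanity-check the build before continuing. The snippet below (run it with `python` inside the container) simply confirms which PyTorch version is installed and that the OpenMP backend described earlier is active; exact version strings will differ on your system.

```python
import torch

# The version string should match the wheel produced by build.sh
print(f"PyTorch version: {torch.__version__}")

# Confirms the ATen parallel backend (OpenMP) and default thread counts
print(torch.__config__.parallel_info())
```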

### 4. Log in to Hugging Face

Create a new `read` token on Hugging Face by clicking [this link](https://huggingface.co/settings/tokens/new?tokenType=read).

![hf-token](./hf-access-token.jpg)

Provide a suitable token name, select **Create token**, and copy the generated token value. From within the Docker container, enter the following command and paste the token to log in.

```bash
huggingface-cli login
```

> **Note**: The login will not persist once the Docker session has ended.
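
As an alternative to the CLI, you can log in programmatically with the `huggingface_hub` Python library (the package that provides `huggingface-cli`). The sketch below assumes you have exported your token as the `HF_TOKEN` environment variable; like the CLI login, it does not persist across container sessions.

```python
import os
from huggingface_hub import login, whoami

# Read the access token from the environment rather than hard-coding it
login(token=os.environ["HF_TOKEN"])

# Prints your account name if authentication succeeded
print(whoami()["name"])
```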