ggml-cpu: parallelize tensor repacking with OpenMP #18239
Conversation
Key changes:
- patches/: OpenMP parallelization of tensor repack functions
  - PR submitted: ggml-org/llama.cpp#18239
  - Measured: 19GB model loads in 5.3s vs 11.9s (2.2x faster)
- scripts/lib/executor.py: Remove OMP_NUM_THREADS=1
  - Enables parallel repack in benchmark scripts
  - Also improves prompt processing 2.4x (49 → 119 t/s)
- README.md: Add modded llama.cpp fork reference
  - Fork: https://github.com/pestopoppa/llama.cpp
  - Instructions for reproducing the setup
- CLAUDE.md: Document fork and patches directory
- research/RESULTS_SUMMARY.md: Add parallel repack section
- orchestration/progress/PROGRESS_2025-12-21.md: Progress report

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Can you fix the CI errors? https://github.com/ggml-org/llama.cpp/actions/runs/20401924391/job/58628429821?pr=18239#step:6:106
Add OpenMP parallelization to tensor repack functions to significantly speed up model loading on many-core CPUs. Measured on AMD EPYC 9655 (96 cores):

| Model Size | Before | After | Speedup |
|------------|--------|-------|---------|
| 6.8GB Q4_K | 5.0s   | 3.3s  | 1.5x    |
| 19GB Q4_K  | 11.9s  | 5.3s  | 2.2x    |
| 271GB Q4_K | ~150s  | ~60s  | ~2.5x   |

The repack functions convert quantized tensors from storage layout to SIMD-optimized layout for AVX-512. This was previously single-threaded and is now parallelized across row groups.

Key changes:
- Convert pointer-increment loops to explicit indexing
- Add `#pragma omp parallel for` to outer loops (guarded by `#ifdef _OPENMP`)
- Each thread processes independent row groups
- Move thread-local `dst_tmp` arrays inside parallel region

Functions parallelized:
- `repack_q4_0_to_q4_0_4_bl` (Q4_0 x4 interleave)
- `repack_q4_K_to_q4_K_8_bl` (Q4_K_M, Q4_K_S models)
- `repack_q2_K_to_q2_K_8_bl` (Q2_K models)
- `repack_q4_0_to_q4_0_8_bl` (Q4_0 x8 interleave)
- `repack_iq4_nl_to_iq4_nl_4_bl` (IQ4_NL x4)
- `repack_iq4_nl_to_iq4_nl_8_bl` (IQ4_NL x8)

Tested on: AMD EPYC 9655 "Turin" with 192 threads
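The loop restructuring described above can be sketched as follows. This is a simplified illustration, not the actual ggml code: `block`, `block_x8`, and `repack_x8` are hypothetical stand-ins for the real quantized block structs (such as `block_q4_K`) and `repack_*` functions.

```cpp
#include <cassert>

// Hypothetical simplified block types standing in for ggml's quantized
// block structs.
struct block    { int data; };
struct block_x8 { int data[8]; };

// Before: a single-threaded pointer-increment loop, roughly
//     const block *s = src; block_x8 *d = dst;
//     for (long long i = 0; i < n_groups; i++) { pack(d++, s); s += 8; }
// After: explicit indexing makes each iteration independent, so the outer
// loop can be distributed across OpenMP threads.
void repack_x8(block_x8 * dst, const block * src, long long n_groups) {
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (long long i = 0; i < n_groups; i++) {
        // Thread-local scratch declared inside the parallel region, as the
        // commit message describes for the dst_tmp arrays.
        block_x8 tmp = {};
        for (int j = 0; j < 8; j++) {
            tmp.data[j] = src[i * 8 + j].data;
        }
        dst[i] = tmp;
    }
}
```

Because every row group writes to a disjoint slice of `dst`, no synchronization beyond the implicit barrier at the end of the parallel loop is needed.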
Thanks for flagging! I investigated the CI failures:
I've just re-triggered the CI. Btw, your reply seems to have been made by AI, and I would like to let you know that we have just revised the AI usage policy (https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md#ai-usage-policy). Would be great if you could add an AI declaration to this PR if there was/wasn't any AI involved in creating this PR :)
Thanks for re-triggering the CI and for the heads-up about the AI policy. I want to be very clear about how AI has been used in this PR. For the past two weeks I have been systematically exploring optimization techniques for CPU inference on my 96-core EPYC machine. As this is a side project for me, and mostly a reimplementation of already-existing techniques, I have used AI tools to speed up the work. That said, I did stumble on this bottleneck myself while profiling large model loading on my 96-core EPYC system, and I couldn't quite find it documented elsewhere. I designed the OpenMP parallelization approach and decided to submit it in case others are interested in using it. It's a relatively small change involving:
Being unaware of the AI usage policy at the time, I did use AI tools to quickly investigate the CI failures. Claude:
CI Status

All 66 code-related checks pass. The 5 that still fail are infrastructure/hardware issues unrelated to this PR. Specifically:
Before this update, I ran the following CI configuration locally, as expected by the PR policy:

```shell
cmake -B build \
    -DLLAMA_CURL=OFF \
    -DLLAMA_OPENSSL=ON \
    -DLLAMA_FATAL_WARNINGS=ON \
    -DLLAMA_SANITIZE_THREAD=ON \
    -DCMAKE_BUILD_TYPE=Debug \
    -DGGML_OPENMP=OFF
cmake --build build -j 96
```

The build completed with no warnings or errors.

OpenMP Guard Pattern

As mentioned, the `#pragma omp parallel for` directives are guarded by `#ifdef _OPENMP`, so a build configured with `-DGGML_OPENMP=OFF` compiles the original single-threaded loops unchanged.
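The guard follows the standard OpenMP pattern: when the compiler is invoked without OpenMP support, the `_OPENMP` macro is undefined, the preprocessor drops the pragma, and the loop compiles as ordinary serial code. A minimal self-contained illustration (not the actual ggml code):

```cpp
#include <cassert>

#ifdef _OPENMP
#include <omp.h>
#endif

// With OpenMP disabled (e.g. -DGGML_OPENMP=OFF, so no -fopenmp flag),
// _OPENMP is undefined and the pragma below is never seen by the
// compiler, so this is plain serial code -- which is exactly what the
// thread-sanitizer CI configuration exercises.
int sum_of_squares(int n) {
    int acc = 0;
#ifdef _OPENMP
#pragma omp parallel for reduction(+:acc)
#endif
    for (int i = 0; i < n; i++) {
        acc += i * i;
    }
    return acc;
}
```

The same function body thus serves both build configurations, which is why the local `-DGGML_OPENMP=OFF` sanitizer build above is a meaningful check.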
Performance Results

I tested the following model sizes on my AMD EPYC 9655 "Turin" (96 cores, 192 threads) system:

| Model Size | Before | After | Speedup |
|------------|--------|-------|---------|
| 6.8GB Q4_K | 5.0s   | 3.3s  | 1.5x    |
| 19GB Q4_K  | 11.9s  | 5.3s  | 2.2x    |
| 271GB Q4_K | ~150s  | ~60s  | ~2.5x   |
The PR specifically parallelizes the following functions:

- `repack_q4_0_to_q4_0_4_bl` - Q4_0 x4 interleave
- `repack_q4_K_to_q4_K_8_bl` - Q4_K_M, Q4_K_S models
- `repack_q2_K_to_q2_K_8_bl` - Q2_K models
- `repack_q4_0_to_q4_0_8_bl` - Q4_0 x8 interleave
- `repack_iq4_nl_to_iq4_nl_4_bl` - IQ4_NL x4
- `repack_iq4_nl_to_iq4_nl_8_bl` - IQ4_NL x8
Summary
Add OpenMP parallelization to tensor repack functions to significantly speed up model loading on many-core CPUs.
Motivation
On systems with many cores (e.g., AMD EPYC), tensor repacking for AVX-512 optimization was single-threaded and became a significant bottleneck during model loading. The repack functions convert quantized tensors from storage layout to SIMD-optimized interleaved layout.
Benchmark Results
Measured on AMD EPYC 9655 "Turin" (96 cores, 192 threads):

| Model Size | Before | After | Speedup |
|------------|--------|-------|---------|
| 6.8GB Q4_K | 5.0s   | 3.3s  | 1.5x    |
| 19GB Q4_K  | 11.9s  | 5.3s  | 2.2x    |
| 271GB Q4_K | ~150s  | ~60s  | ~2.5x   |
Speedup increases with model size as repack time dominates over I/O.
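This scaling behavior matches a simple Amdahl-style model: if a fraction p of load time is repack work that now runs s times faster, the overall speedup is 1 / ((1 - p) + p / s), which grows toward s as p approaches 1. The fractions and the 8x repack speedup below are illustrative assumptions, not measurements:

```cpp
#include <cassert>
#include <cmath>

// Amdahl-style estimate (illustrative only): overall load-time speedup
// when a fraction p of load time is repack work accelerated by factor s
// and the remaining (1 - p) is serial work such as I/O.
double load_speedup(double p, double s) {
    return 1.0 / ((1.0 - p) + p / s);
}
```

Under an assumed 8x repack speedup, a small model spending half its load time repacking gains about 1.8x overall, while a large model spending 80% of its load time repacking gains about 3.3x, consistent with the trend in the table above.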
Changes
- Convert pointer-increment loops to explicit indexing
- Add `#pragma omp parallel for` to outer loops (guarded by `#ifdef _OPENMP`)
- Each thread processes independent row groups
- Move thread-local `dst_tmp` arrays inside parallel region

Functions Parallelized
- `repack_q4_0_to_q4_0_4_bl` - Q4_0 x4 interleave
- `repack_q4_K_to_q4_K_8_bl` - Q4_K models (most common)
- `repack_q2_K_to_q2_K_8_bl` - Q2_K models
- `repack_q4_0_to_q4_0_8_bl` - Q4_0 x8 interleave
- `repack_iq4_nl_to_iq4_nl_4_bl` - IQ4_NL x4
- `repack_iq4_nl_to_iq4_nl_8_bl` - IQ4_NL x8

Testing