

@pestopoppa

Summary

Add OpenMP parallelization to tensor repack functions to significantly speed up model loading on many-core CPUs.

Motivation

On systems with many cores (e.g., AMD EPYC), tensor repacking for AVX-512 optimization was single-threaded and became a significant bottleneck during model loading. The repack functions convert quantized tensors from storage layout to SIMD-optimized interleaved layout.
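For context: the storage layout keeps each row's quantized blocks contiguous, while the interleaved layout packs the matching blocks of 4 (or 8) consecutive rows together so one AVX-512 pass can work on all of them. A simplified illustration of the idea (not the exact ggml struct definitions, which differ per quant type):

// Simplified for illustration only; not the real ggml type definitions.
typedef unsigned short ggml_half;               // fp16 scale stored as raw bits

// Storage layout: one scale + 32 4-bit weights per block, each row's blocks contiguous.
struct block_q4_0   { ggml_half d;    unsigned char qs[16]; };

// Interleaved (x4) layout: the matching blocks of 4 consecutive rows packed together,
// so SIMD code can process 4 rows' data in one pass.
struct block_q4_0x4 { ggml_half d[4]; unsigned char qs[64]; };

Repacking is the load-time pass that rewrites the affected tensors from the first form into the second, and that pass is what this PR parallelizes.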

Benchmark Results

Measured on AMD EPYC 9655 "Turin" (96 cores, 192 threads):

| Model Size | Before | After | Speedup |
|------------|--------|-------|---------|
| 6.8GB Q4_K | 5.0s   | 3.3s  | 1.5x    |
| 19GB Q4_K  | 11.9s  | 5.3s  | 2.2x    |

Speedup increases with model size as repack time dominates over I/O.

Changes

  • Convert pointer-increment loops to explicit indexing (parallelizable)
  • Add #pragma omp parallel for to outer loops
  • Move thread-local dst_tmp arrays inside parallel region
  • Each thread processes independent row groups with no synchronization needed (see the sketch below)
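
A minimal, self-contained sketch of the loop transformation. Illustrative only: the structs, the toy interleave, and repack_one_group are placeholders, not the real ggml repack code.

// Illustrative sketch; struct layouts, names, and the toy interleave are placeholders.
struct group_src { unsigned char qs[64]; };   // one row group, storage layout
struct group_dst { unsigned char qs[64]; };   // one row group, interleaved layout

// Repacking a group depends only on that group's data, so groups are independent.
static void repack_one_group(group_dst * dst, const group_src * src) {
    for (int i = 0; i < 64; i++) {
        dst->qs[i] = src->qs[(i % 4) * 16 + i / 4];   // toy 4-way interleave
    }
}

// Before: the loop advanced src/dst pointers each iteration (src++, dst++), so
// iteration g only knew its position through all previous iterations.
// After: explicit indexing (src[g], dst[g]) makes every iteration self-contained,
// and the outer loop can be handed to OpenMP.
static void repack_all(group_dst * dst, const group_src * src, int n_row_groups) {
#ifdef _OPENMP
    #pragma omp parallel for
#endif
    for (int g = 0; g < n_row_groups; g++) {
        group_dst dst_tmp;                    // thread-local scratch, inside the loop body
        repack_one_group(&dst_tmp, &src[g]);
        dst[g] = dst_tmp;
    }
}

Built with -fopenmp, the outer loop fans out across all cores; without OpenMP the pragma is compiled out and the code is identical to the serial version. (The PR itself guards the pragma with ggml's GGML_USE_OPENMP macro rather than the bare _OPENMP shown here; see the CI discussion further down.)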

Functions Parallelized

  • repack_q4_0_to_q4_0_4_bl - Q4_0 x4 interleave
  • repack_q4_K_to_q4_K_8_bl - Q4_K models (most common)
  • repack_q2_K_to_q2_K_8_bl - Q2_K models
  • repack_q4_0_to_q4_0_8_bl - Q4_0 x8 interleave
  • repack_iq4_nl_to_iq4_nl_4_bl - IQ4_NL x4
  • repack_iq4_nl_to_iq4_nl_8_bl - IQ4_NL x8

Testing

  • Verified outputs match original implementation
  • Tested on multiple Q4_K_M models
  • Build verified with GCC 13.3.0 + OpenMP

@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) Dec 21, 2025
pestopoppa added a commit to pestopoppa/amd-epyc-inference that referenced this pull request Dec 21, 2025
Key changes:
- patches/: OpenMP parallelization of tensor repack functions
  - PR submitted: ggml-org/llama.cpp#18239
  - Measured: 19GB model loads in 5.3s vs 11.9s (2.2x faster)

- scripts/lib/executor.py: Remove OMP_NUM_THREADS=1
  - Enables parallel repack in benchmark scripts
  - Also improves prompt processing 2.4x (49 → 119 t/s)

- README.md: Add modded llama.cpp fork reference
  - Fork: https://github.com/pestopoppa/llama.cpp
  - Instructions for reproducing the setup

- CLAUDE.md: Document fork and patches directory
- research/RESULTS_SUMMARY.md: Add parallel repack section
- orchestration/progress/PROGRESS_2025-12-21.md: Progress report

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@taronaeo
Collaborator

taronaeo commented Jan 1, 2026

Add OpenMP parallelization to tensor repack functions to significantly
speed up model loading on many-core CPUs.

Measured on AMD EPYC 9655 (96 cores):

| Model Size | Before | After | Speedup |
|------------|--------|-------|---------|
| 6.8GB Q4_K | 5.0s   | 3.3s  | 1.5x    |
| 19GB Q4_K  | 11.9s  | 5.3s  | 2.2x    |
| 271GB Q4_K | ~150s  | ~60s  | ~2.5x   |

The repack functions convert quantized tensors from storage layout
to SIMD-optimized layout for AVX-512. This was previously single-threaded
and is now parallelized across row groups.

Key changes:
- Convert pointer-increment loops to explicit indexing
- Add #pragma omp parallel for to outer loops (guarded by #ifdef _OPENMP)
- Each thread processes independent row groups
- Move thread-local dst_tmp arrays inside parallel region

Functions parallelized:
- repack_q4_0_to_q4_0_4_bl (Q4_0 x4 interleave)
- repack_q4_K_to_q4_K_8_bl (Q4_K_M, Q4_K_S models)
- repack_q2_K_to_q2_K_8_bl (Q2_K models)
- repack_q4_0_to_q4_0_8_bl (Q4_0 x8 interleave)
- repack_iq4_nl_to_iq4_nl_4_bl (IQ4_NL x4)
- repack_iq4_nl_to_iq4_nl_8_bl (IQ4_NL x8)

Tested on: AMD EPYC 9655 "Turin" with 192 threads
@pestopoppa
Author

Thanks for flagging! I investigated the CI failures:

1. ubuntu-cmake-sanitizer-riscv64-native (THREAD, Debug)

Error: "The self-hosted runner lost communication with the server"

This is an infrastructure issue with the RISC-V self-hosted runner, not related to the code changes. The job also has continue-on-error: true in the workflow definition.

2. ubuntu-latest-cmake-sanitizer (THREAD, Debug)

Root cause identified and fixed:

The THREAD sanitizer build uses:

  • -DGGML_OPENMP=OFF (disables OpenMP)
  • -DLLAMA_FATAL_WARNINGS=ON (treats warnings as errors)

When OpenMP is disabled, bare #pragma omp parallel for directives trigger -Wunknown-pragmas warnings, which become errors with FATAL_WARNINGS=ON.

Fix applied: Wrapped all OpenMP pragmas with #ifdef GGML_USE_OPENMP guards, consistent with the existing pattern in ggml-cpu.c:

#ifdef GGML_USE_OPENMP
    #pragma omp parallel for
#endif
    for (int bg = 0; bg < n_row_groups; bg++) {
        // ... repack one independent row group ...
    }

Verification:

  • ✅ Builds successfully with -DGGML_OPENMP=OFF -DLLAMA_FATAL_WARNINGS=ON
  • ✅ Builds successfully with -DGGML_OPENMP=ON

@taronaeo
Collaborator

taronaeo commented Jan 1, 2026

I've just re-triggered the CI. Btw, your reply seems to have been made by AI, and I would like to let you know that we have just revised the AI usage policy (https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md#ai-usage-policy).

It would be great if you could add an AI declaration to this PR stating whether or not AI was involved in creating it :)

@pestopoppa
Author

Thanks for re-triggering the CI and for the heads-up about the AI policy. I want to be very clear about how AI has been used in this PR. For the past two weeks I have been systematically exploring optimization techniques for CPU inference on my 96-core EPYC machine:
https://github.com/pestopoppa/amd-epyc-inference/blob/main/research/RESULTS_SUMMARY.md

As this is a side project for me and mostly a reimplementation of existing techniques, I have used AI tools to speed up the work. That said, I did stumble on this bottleneck myself while profiling large-model loading on my 96-core EPYC system, and I couldn't find it documented elsewhere. I designed the OpenMP parallelization approach and decided to push it upstream in case others are interested in using it. It's a relatively small change involving:

  1. Converting pointer-increment loops to explicit indexing to enable parallelization
  2. Adding #pragma omp parallel for to the outer loops of 6 repack functions
  3. Moving the thread-local dst_tmp arrays inside the parallel region for thread safety (see the sketch below)
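
To illustrate point 3, a minimal sketch (placeholder names and block sizes, not the actual ggml code) of why the scratch buffer has to live inside the parallel loop:

// Illustrative only: block size, struct, and function names are placeholders.
#include <cstring>

struct interleaved_block { unsigned char qs[64]; };

// Toy stand-in for the real per-group repack work.
static void repack_group(interleaved_block * out, const unsigned char * src) {
    std::memcpy(out->qs, src, sizeof(out->qs));
}

static void repack_all(interleaved_block * dst, const unsigned char * src, int n_row_groups) {
    // A scratch buffer declared here, before the loop, would be shared by every
    // OpenMP thread and become a data race once the loop is parallelized.
#ifdef GGML_USE_OPENMP   // ggml's build macro; a standalone test would check _OPENMP
    #pragma omp parallel for
#endif
    for (int bg = 0; bg < n_row_groups; bg++) {
        interleaved_block dst_tmp;            // declared inside the loop body:
                                              // private to each iteration/thread
        repack_group(&dst_tmp, src + 64 * bg);
        dst[bg] = dst_tmp;
    }
}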

Unaware of the AI usage policy, I did use AI tools to quickly investigate the CI failures. Claude:

  1. helped trace the issue to the ubuntu-latest-cmake-sanitizer (THREAD) job,
  2. identified that -DGGML_OPENMP=OFF combined with -DLLAMA_FATAL_WARNINGS=ON caused unknown pragma warnings,
  3. helped me find the existing #ifdef GGML_USE_OPENMP pattern in the codebase, so the change stays consistent with the repo's conventions

CI Status

All 66 code-related checks pass. The 5 that still don't are infrastructure/hardware issues unrelated to this PR. Specifically:

  1. ubuntu-cpu-cmake-riscv64-native, ubuntu-cmake-sanitizer-riscv64-native (ADDRESS), ubuntu-cmake-sanitizer-riscv64-native (UNDEFINED), ubuntu-llguidance-riscv64-native: all cancelled because the self-hosted RISC-V runner lost communication with the server
  2. openEuler-latest-cmake-cann: skipped because it requires Huawei Ascend NPU

Before posting this update, I ran the following CI configuration locally, as the PR guidelines request:

cmake -B build \
  -DLLAMA_CURL=OFF \
  -DLLAMA_OPENSSL=ON \
  -DLLAMA_FATAL_WARNINGS=ON \
  -DLLAMA_SANITIZE_THREAD=ON \
  -DCMAKE_BUILD_TYPE=Debug \
  -DGGML_OPENMP=OFF
cmake --build build -j 96

The build completed with no warnings or errors.

OpenMP Guard Pattern

As mentioned, the #ifdef GGML_USE_OPENMP guards follow the existing codebase pattern used in ggml-cpu.c.

Performance Results

I tested the following model sizes on my AMD EPYC 9655 "Turin" (96 cores, 192 threads) system:

| Model Size | Quant | Before | After | Speedup |
|------------|-------|--------|-------|---------|
| 6.8GB      | Q4_K  | 5.0s   | 3.3s  | 1.5x    |
| 19GB       | Q4_K  | 11.9s  | 5.3s  | 2.2x    |
| 271GB      | Q4_K  | ~150s  | ~60s  | ~2.5x   |

The PR specifically parallelizes the following functions:

  1. repack_q4_0_to_q4_0_4_bl (Q4_0 x4 interleave)
  2. repack_q4_K_to_q4_K_8_bl (Q4_K_M, Q4_K_S models)
  3. repack_q2_K_to_q2_K_8_bl (Q2_K models)
  4. repack_q4_0_to_q4_0_8_bl (Q4_0 x8 interleave)
  5. repack_iq4_nl_to_iq4_nl_4_bl (IQ4_NL x4)
  6. repack_iq4_nl_to_iq4_nl_8_bl (IQ4_NL x8)
