ggml-cpu: parallelize tensor repacking with OpenMP #18239
Conversation
Key changes:
- patches/: OpenMP parallelization of tensor repack functions
  - PR submitted: ggml-org/llama.cpp#18239
  - Measured: 19GB model loads in 5.3s vs 11.9s (2.2x faster)
- scripts/lib/executor.py: Remove OMP_NUM_THREADS=1
  - Enables parallel repack in benchmark scripts
  - Also improves prompt processing 2.4x (49 → 119 t/s)
- README.md: Add modded llama.cpp fork reference
  - Fork: https://github.com/pestopoppa/llama.cpp
  - Instructions for reproducing the setup
- CLAUDE.md: Document fork and patches directory
- research/RESULTS_SUMMARY.md: Add parallel repack section
- orchestration/progress/PROGRESS_2025-12-21.md: Progress report

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Can you fix the CI errors? https://github.com/ggml-org/llama.cpp/actions/runs/20401924391/job/58628429821?pr=18239#step:6:106
Add OpenMP parallelization to tensor repack functions to significantly speed up model loading on many-core CPUs. Measured on AMD EPYC 9655 (96 cores):

| Model Size | Before | After | Speedup |
|------------|--------|-------|---------|
| 6.8GB Q4_K | 5.0s   | 3.3s  | 1.5x    |
| 19GB Q4_K  | 11.9s  | 5.3s  | 2.2x    |
| 271GB Q4_K | ~150s  | ~60s  | ~2.5x   |

The repack functions convert quantized tensors from storage layout to SIMD-optimized layout for AVX-512. This was previously single-threaded and is now parallelized across row groups.

Key changes:
- Convert pointer-increment loops to explicit indexing
- Add `#pragma omp parallel for` to outer loops (guarded by `#ifdef _OPENMP`)
- Each thread processes independent row groups
- Move thread-local `dst_tmp` arrays inside parallel region

Functions parallelized:
- `repack_q4_0_to_q4_0_4_bl` (Q4_0 x4 interleave)
- `repack_q4_K_to_q4_K_8_bl` (Q4_K_M, Q4_K_S models)
- `repack_q2_K_to_q2_K_8_bl` (Q2_K models)
- `repack_q4_0_to_q4_0_8_bl` (Q4_0 x8 interleave)
- `repack_iq4_nl_to_iq4_nl_4_bl` (IQ4_NL x4)
- `repack_iq4_nl_to_iq4_nl_8_bl` (IQ4_NL x8)

Tested on: AMD EPYC 9655 "Turin" with 192 threads
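The loop restructuring described above can be sketched as follows. This is a simplified illustration, not the actual ggml code: `block`, `block_x8`, and `repack_x8` are hypothetical stand-ins for the real quantized block structs (such as `block_q4_K`) and `repack_*` functions.

```cpp
#include <cassert>

// Hypothetical simplified block types standing in for ggml's quantized
// block structs.
struct block    { int data; };
struct block_x8 { int data[8]; };

// Before: a single-threaded pointer-increment loop, roughly
//     const block *s = src; block_x8 *d = dst;
//     for (long long i = 0; i < n_groups; i++) { pack(d++, s); s += 8; }
// After: explicit indexing makes each iteration independent, so the outer
// loop can be distributed across OpenMP threads.
void repack_x8(block_x8 * dst, const block * src, long long n_groups) {
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (long long i = 0; i < n_groups; i++) {
        // Thread-local scratch declared inside the parallel region, as the
        // commit message describes for the dst_tmp arrays.
        block_x8 tmp = {};
        for (int j = 0; j < 8; j++) {
            tmp.data[j] = src[i * 8 + j].data;
        }
        dst[i] = tmp;
    }
}
```

Because every row group writes to a disjoint slice of `dst`, no synchronization beyond the implicit barrier at the end of the parallel loop is needed.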
Thanks for flagging! I investigated the CI failures:
I've just re-triggered the CI. Btw, your reply seems to have been made by AI, and I would like to let you know that we have just revised the AI usage policy (https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md#ai-usage-policy). Would be great if you could add an AI declaration to this PR if there was/wasn't any AI involved in creating this PR :)
Thanks for re-triggering the CI and for the heads-up about the AI policy. I want to be very clear about how AI has been used in this PR. For the past two weeks I have been systematically exploring optimization techniques for CPU inference on my 96-core EPYC machine. As this is a side project for me, and mostly a reimplementation of already-existing techniques, I have used AI tools to speed up the work. That said, I did stumble on this bottleneck myself while profiling large model loading on my 96-core EPYC system, and I couldn't quite find it documented elsewhere. I designed the OpenMP parallelization approach and decided to submit it in case others are interested in using it. It's a relatively small change involving:
Being unaware of the AI usage policy at the time, I did use AI tools to quickly investigate the CI failures. Claude:
CI Status

All 66 code-related checks pass. The 5 that still fail are infrastructure/hardware issues unrelated to this PR. Specifically:
Before this update, I ran the following CI configuration locally, as expected by the PR policy:

```shell
cmake -B build \
    -DLLAMA_CURL=OFF \
    -DLLAMA_OPENSSL=ON \
    -DLLAMA_FATAL_WARNINGS=ON \
    -DLLAMA_SANITIZE_THREAD=ON \
    -DCMAKE_BUILD_TYPE=Debug \
    -DGGML_OPENMP=OFF
cmake --build build -j 96
```

The build completed with no warnings or errors.

OpenMP Guard Pattern

As mentioned, the `#pragma omp parallel for` directives are guarded by `#ifdef _OPENMP`, so a build configured with `-DGGML_OPENMP=OFF` compiles the original single-threaded loops unchanged.
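The guard follows the standard OpenMP pattern: when the compiler is invoked without OpenMP support, the `_OPENMP` macro is undefined, the preprocessor drops the pragma, and the loop compiles as ordinary serial code. A minimal self-contained illustration (not the actual ggml code):

```cpp
#include <cassert>

#ifdef _OPENMP
#include <omp.h>
#endif

// With OpenMP disabled (e.g. -DGGML_OPENMP=OFF, so no -fopenmp flag),
// _OPENMP is undefined and the pragma below is never seen by the
// compiler, so this is plain serial code -- which is exactly what the
// thread-sanitizer CI configuration exercises.
int sum_of_squares(int n) {
    int acc = 0;
#ifdef _OPENMP
#pragma omp parallel for reduction(+:acc)
#endif
    for (int i = 0; i < n; i++) {
        acc += i * i;
    }
    return acc;
}
```

The same function body thus serves both build configurations, which is why the local `-DGGML_OPENMP=OFF` sanitizer build above is a meaningful check.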
Performance Results

I tested the following model sizes on my AMD EPYC 9655 "Turin" (96 cores, 192 threads) system:

| Model Size | Before | After | Speedup |
|------------|--------|-------|---------|
| 6.8GB Q4_K | 5.0s   | 3.3s  | 1.5x    |
| 19GB Q4_K  | 11.9s  | 5.3s  | 2.2x    |
| 271GB Q4_K | ~150s  | ~60s  | ~2.5x   |
The PR specifically parallelizes the following functions:

- `repack_q4_0_to_q4_0_4_bl` - Q4_0 x4 interleave
- `repack_q4_K_to_q4_K_8_bl` - Q4_K_M, Q4_K_S models
- `repack_q2_K_to_q2_K_8_bl` - Q2_K models
- `repack_q4_0_to_q4_0_8_bl` - Q4_0 x8 interleave
- `repack_iq4_nl_to_iq4_nl_4_bl` - IQ4_NL x4
- `repack_iq4_nl_to_iq4_nl_8_bl` - IQ4_NL x8
Summary
Add OpenMP parallelization to tensor repack functions to significantly speed up model loading on many-core CPUs.
Motivation
On systems with many cores (e.g., AMD EPYC), tensor repacking for AVX-512 optimization was single-threaded and became a significant bottleneck during model loading. The repack functions convert quantized tensors from storage layout to SIMD-optimized interleaved layout.
Benchmark Results
Measured on AMD EPYC 9655 "Turin" (96 cores, 192 threads):

| Model Size | Before | After | Speedup |
|------------|--------|-------|---------|
| 6.8GB Q4_K | 5.0s   | 3.3s  | 1.5x    |
| 19GB Q4_K  | 11.9s  | 5.3s  | 2.2x    |
| 271GB Q4_K | ~150s  | ~60s  | ~2.5x   |
Speedup increases with model size as repack time dominates over I/O.
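This scaling behavior matches a simple Amdahl-style model: if a fraction p of load time is repack work that now runs s times faster, the overall speedup is 1 / ((1 - p) + p / s), which grows toward s as p approaches 1. The fractions and the 8x repack speedup below are illustrative assumptions, not measurements:

```cpp
#include <cassert>
#include <cmath>

// Amdahl-style estimate (illustrative only): overall load-time speedup
// when a fraction p of load time is repack work accelerated by factor s
// and the remaining (1 - p) is serial work such as I/O.
double load_speedup(double p, double s) {
    return 1.0 / ((1.0 - p) + p / s);
}
```

Under an assumed 8x repack speedup, a small model spending half its load time repacking gains about 1.8x overall, while a large model spending 80% of its load time repacking gains about 3.3x, consistent with the trend in the table above.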
Changes
- Convert pointer-increment loops to explicit indexing
- Add `#pragma omp parallel for` to outer loops (guarded by `#ifdef _OPENMP`)
- Each thread processes independent row groups
- Move thread-local `dst_tmp` arrays inside parallel region

Functions Parallelized
- `repack_q4_0_to_q4_0_4_bl` - Q4_0 x4 interleave
- `repack_q4_K_to_q4_K_8_bl` - Q4_K models (most common)
- `repack_q2_K_to_q2_K_8_bl` - Q2_K models
- `repack_q4_0_to_q4_0_8_bl` - Q4_0 x8 interleave
- `repack_iq4_nl_to_iq4_nl_4_bl` - IQ4_NL x4
- `repack_iq4_nl_to_iq4_nl_8_bl` - IQ4_NL x8

Testing