(Performance) Optimized x86 and generic q1_0(_g128) dot #10
pl752 wants to merge 10 commits into PrismML-Eng:prism
Conversation
The Q1_0_g128 vec_dot kernel had a bug where `sumi` was declared as `int` but accumulated `float` partial products (`d1 * sumi_block`), causing float-to-int truncation that destroyed dot product results and produced gibberish output on CPU.

Additionally, the x86 kernel was purely scalar (one bit at a time). This adds an AVX-512BW path that processes 32 elements per iteration using mask_sub + madd + fma, with a single horizontal reduction at the end.

Benchmarks (Bonsai-8B, CPU-only, AVX-512):
- Before: 0.73 t/s prompt, 0.65 t/s generation (gibberish output)
- After: 23.2 t/s prompt, 13.5 t/s generation (coherent output)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
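The truncation bug can be shown in isolation. This is a hedged sketch, not the actual ggml code; the variable names simply mirror the description above:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the accumulation pattern described in the PR.
 * Buggy version: `sumi` is an int, so each float partial product
 * d1[i] * sumi_block[i] is truncated toward zero on accumulation. */
float dot_buggy(const float *d1, const int *sumi_block, int nblocks) {
    int sumi = 0;                       /* BUG: integer accumulator */
    for (int i = 0; i < nblocks; ++i)
        sumi += d1[i] * sumi_block[i];  /* float result truncated to int */
    return (float) sumi;
}

/* Fixed version: accumulate in float, as the scale factors require. */
float dot_fixed(const float *d1, const int *sumi_block, int nblocks) {
    float sumf = 0.0f;
    for (int i = 0; i < nblocks; ++i)
        sumf += d1[i] * sumi_block[i];
    return sumf;
}
```

With small per-block scales every partial product can round to zero, which is why the buggy kernel produced gibberish rather than merely noisy output.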
|
Thanks, this looks great, nice write-up. I am not too familiar with SIMD/AVX stuff; what CPUs does this support? |
|
@khosravipasha You are welcome :)
As for perplexity, I have performed a run for a single 64-token wikitext-2-test chunk with the 1.7B model.
I will perform more runs. |
|
I have run 5 chunks of 512 tokens; it looks better, I think. Will run 100 chunks: |
|
|
I am somewhat in doubt now; the effect seems to be on the order of comparing CPU to CUDA, or somewhere between fp32->fp16 and fp32->q8_0. Maybe it comes from using a smaller model. |
|
@pl752 Awesome, thanks for the explanations. https://github.com/ggml-org/llama.cpp/tree/master/tools/perplexity Yeah, I used running the model in fp16 as the baseline, using these https://huggingface.co/collections/prism-ml/bonsai-auxiliary |
|
Okay, don't forget to thank the user from whom I've hijacked the AVX-512 implementation |
|
@pl752 good idea, which one was it? We can tag them here. After that's merged, we can all send a PR together to main llama.cpp, with everyone that contributed tagged, maybe. Note that there will be some naming changes (in summary, Q1_0_g128 is renamed to Q1_0, and the original Q1_0 will be deleted). This should not affect running the current models. |
|
|
Performed an additional 5x512 run against the unpacked gguf
|
|
UPD: I have reviewed how I was interleaving instructions when testing various register pressure options and found issues resulting in register spilling, so I just relied on the compiler doing its job properly and simply unrolled the inner loop with individual accumulators for SSSE3 (the compiler already did pretty well for the other flows). I also tried the same thing for AVX-512, but it resulted in a tiny performance regression. It had almost no effect on perplexity. Effects on performance (baseline has drifted due to using
|
| flow | run | baseline | updated | delta |
|---|---|---|---|---|
| SSSE3 | pp512 | 33.38 t/s | 39.18 t/s | +17.36% |
| SSSE3 | tg128 | 24.61 t/s | 29.24 t/s | +18.81% |
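The unrolling approach mentioned above relies on independent accumulators breaking the loop-carried dependency chain, so the compiler can keep each one in its own register and overlap the multiply-adds. A generic scalar sketch (not the actual SSSE3 kernel; `dot_unrolled` is a hypothetical name, and `n % 4 == 0` is assumed):

```c
#include <assert.h>
#include <stdint.h>

/* 4-way unrolled integer dot product with independent accumulators.
 * Each acc only depends on itself, so the four chains can execute
 * in parallel; a single accumulator would serialize every addition. */
int dot_unrolled(const int8_t *x, const int8_t *y, int n) {
    int acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    for (int i = 0; i < n; i += 4) {
        acc0 += x[i + 0] * y[i + 0];
        acc1 += x[i + 1] * y[i + 1];
        acc2 += x[i + 2] * y[i + 2];
        acc3 += x[i + 3] * y[i + 3];
    }
    /* Pairwise final reduction keeps the dependency tree shallow. */
    return (acc0 + acc1) + (acc2 + acc3);
}
```

The same idea applies per vector register in the SIMD paths, which is why over-aggressive manual interleaving can backfire: it raises register pressure and causes the spilling described above.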
Pull request overview
This PR focuses on improving CPU inference throughput by optimizing the q1_0 / q1_0_g128 dot-product kernels against q8_0, reducing bit-twiddling overhead in portable fallbacks and introducing additional optimized x86 SIMD execution paths.
Changes:
- Reworked generic fallbacks to process packed sign bits in a byte-oriented way (4 × 8-value groups per 32-element sub-block), eliminating per-element bit index arithmetic.
- Implemented x86-specialized kernels for `ggml_vec_dot_q1_0_q8_0` and `ggml_vec_dot_q1_0_g128_q8_0` with multiple SIMD paths (SSSE3 / AVX / AVX2 / AVX-512BW) plus a scalar byte-oriented fallback.
- Added small SSSE3 helpers to expand packed sign bits into byte masks and to reduce vector accumulators.
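The byte-oriented sign decoding can be sketched in portable C. This is a hedged illustration, not the ggml code: `dot_sign_byte` is a hypothetical name, and the sign convention (bit set means negate) is an assumption:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: apply one packed sign byte to 8 consecutive q8 values and
 * return their signed sum. The byte is consumed as a unit, so there is
 * no per-element bit-index arithmetic (index / 8, index % 8, etc.);
 * four such byte groups cover one 32-element sub-block. */
int dot_sign_byte(uint8_t signs, const int8_t *q) {
    int sum = 0;
    for (int j = 0; j < 8; ++j)
        sum += ((signs >> j) & 1) ? -(int) q[j] : (int) q[j];
    return sum;
}
```

The SIMD paths do the analogous thing in parallel: expand each sign byte into a lane-wise byte mask and conditionally negate a whole vector of q8 values at once.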
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `ggml/src/ggml-cpu/quants.c` | Optimizes portable q1_0 and q1_0_g128 generic dot fallbacks by switching to explicit byte-oriented sign decoding and removing per-element bit math. |
| `ggml/src/ggml-cpu/arch/x86/quants.c` | Replaces x86 dispatch to generic kernels with specialized SIMD implementations across AVX-512BW/AVX2/AVX/SSSE3, keeping a byte-oriented scalar fallback. |
|
I tested the AVX2 impl; it is slightly faster than #7 (see the estimated full test time) but slower than xor+sub. Maybe the reported 0.00022 KLD is arch-related (Tigerlake and Broadwell are both Intel CPUs). I have tried several impls on Broadwell; all hit the same KLD after the first few chunks, so there's little point in running the full test just to confirm the KLD. |
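The xor+sub trick referred to here is the standard branchless conditional negation: expand each sign bit into an all-zeros/all-ones mask, then `(q ^ mask) - mask` yields `q` or `-q`. A minimal scalar sketch (the vectorized versions apply the same identity lane-wise with packed xor/sub; `cond_negate` is a hypothetical name):

```c
#include <assert.h>
#include <stdint.h>

/* Branchless conditional negation. mask is 0x00 (keep) or 0xFF (negate).
 * With mask == 0xFF (-1): (q ^ -1) - (-1) == ~q + 1 == -q in
 * two's complement; with mask == 0 the value passes through unchanged.
 * (INT8_MIN wraps on negation, as in the wrapping SIMD equivalents.) */
static inline int8_t cond_negate(int8_t q, uint8_t sign_bit) {
    int8_t mask = (int8_t) -(int8_t)(sign_bit & 1); /* 0 -> 0x00, 1 -> 0xFF */
    return (int8_t) ((q ^ mask) - mask);
}
```

The appeal over a select-based approach is that xor and sub are cheap, widely available packed-integer operations, so the sign application fuses naturally into the multiply-accumulate loop.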
|
@zcattacz Thank you for the hint, it worked at least for AVX2. I will revise my current kernels and post updates |
|
@pl752, oh, my bad, I misread your KLD. Were they all tested on AMD? It's also around 0.00022. The xor+sub is adapted from PR4. If you are after speed, please give it a try. You can find the code I tested for AVX2 in my comment in #7. Even the shadowed variable gives it a 5%~10% boost. I also tested a double-accumulator impl, but it didn't give any edge. The compiler seems to be doing some magic here. |
|
@zcattacz They were all tested on an AMD Ryzen 5 7640HS (Zen 4) |
Hello
This is yet another PR about the fix of the truncation and optimization of the CPU inference. In this case I have:
Note that this PR is built on top of #3 by @jordankzf, who implemented the AVX-512 workflow.
Benchmarks were performed with:
`Bonsai-1.7B.gguf` (Q1_0_g128)

[Benchmark table: pp 512 t/s and tg 128 t/s across the SSSE3, AVX, AVX2+FMA, and AVX512BW flows]

\* extrapolated from pp 32/tg 16: 1.659 t/s pp and 0.862 t/s tg, as I was impatient.

\*\* new SIMD instruction kinds improve performance even on the AMD Zen 4 implementation of AVX-512, which uses the 256-bit pipeline twice instead of implementing a full 512-bit one.

I would appreciate your feedback