
(Performance) Optimized x86 and generic q1_0(_g128) dot #10

Open
pl752 wants to merge 10 commits into PrismML-Eng:prism from pl752:perf/q1_0_g128_no_nofma

Conversation

@pl752

@pl752 pl752 commented Apr 3, 2026

Hello!
This is yet another PR fixing the truncation bug and optimizing CPU inference.

In this case I have:

  • Replaced a large number of bit-masking operations, removed a redundant float multiplication, and unrolled the hot inner loop with constant masks for signed accumulation in the arch-agnostic fallback
  • Introduced paths filling the gap between the default fallback and AVX-512-capable CPUs
  • Performed tests to make sure the optimizations don't affect precision/correctness
  • Performed various experiments (most yielded worse performance), including:
      • a branchless variant of the unroll
      • various register and superscalar pipeline pressure options (the AVX2 path uses a doubled accumulation flow)
      • AVX-512 VNNI
      • explicitly precomputed masks for SIMD
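The byte-oriented rework of the fallback can be sketched in scalar C. This is a hypothetical reduction with illustrative names, not the actual kernel symbols: one byte of packed sign bits drives eight accumulations with constant masks, instead of computing a bit index per element.

```c
#include <assert.h>
#include <stdint.h>

/* Accumulate 8 q8 values, negating those whose sign bit is set.
 * Bit i of `signs` set means "negate element i". The unroll uses
 * constant masks, so there is no per-element shift/index math. */
static int32_t dot_signed_byte(uint8_t signs, const int8_t q8[8]) {
    int32_t sum = 0;
    sum += (signs & 0x01) ? -q8[0] : q8[0];
    sum += (signs & 0x02) ? -q8[1] : q8[1];
    sum += (signs & 0x04) ? -q8[2] : q8[2];
    sum += (signs & 0x08) ? -q8[3] : q8[3];
    sum += (signs & 0x10) ? -q8[4] : q8[4];
    sum += (signs & 0x20) ? -q8[5] : q8[5];
    sum += (signs & 0x40) ? -q8[6] : q8[6];
    sum += (signs & 0x80) ? -q8[7] : q8[7];
    return sum;
}
```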

Note that this PR is built on top of the #3 by @jordankzf, who implemented AVX-512 workflow

Benchmarks were performed with:

  • CPU: AMD Ryzen 5 7640HS (at 65w)
  • WSL vm
  • LPDDR5 @ 6400MT JEDEC
  • Model: Bonsai-1.7B.gguf (Q1_0_g128)
  • Threads: 6
| Flow | pp 512 t/s | tg 128 t/s | Speedup | Notes |
|---|---|---|---|---|
| Initial* | 1.59 | 0.85 | 1.0x / 1.0x | Slow |
| Scalar | 9.57 | 7.06 | 6.0x / 8.3x | Explicit byte-oriented unroll |
| SSSE3 | 26.13 | 19.51 | 16.5x / 22.9x | 128-bit specialization |
| AVX | 34.99 | 27.31 | 22.1x / 32.1x | Mixed-width specialization |
| AVX2 + FMA | 80.02 | 51.46 | 50.4x / 60.5x | 256-bit specialization |
| AVX512BW | 97.16 | 60.88 | 61.3x / 71.5x | Leverages new SIMD extensions** |

  • \* extrapolated from pp 32 / tg 16 (1.659 t/s pp and 0.862 t/s tg), as I was impatient.
  • \*\* the new SIMD instruction kinds improve performance even on AMD Zen 4's implementation of AVX-512, which runs each 512-bit op through a 256-bit pipeline twice instead of implementing a full 512-bit one

I would appreciate your feedback

jordankzf and others added 7 commits April 3, 2026 13:07
The Q1_0_g128 vec_dot kernel had a bug where `sumi` was declared as
`int` but accumulated `float` partial products (`d1 * sumi_block`),
causing float-to-int truncation that destroyed dot product results
and produced gibberish output on CPU.

Additionally, the x86 kernel was purely scalar (one bit at a time).
This adds an AVX-512BW path that processes 32 elements per iteration
using mask_sub + madd + fma, with a single horizontal reduction at
the end.
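The truncation bug described above can be reduced to a minimal sketch (illustrative names, not the actual kernel code): accumulating float partial products into an `int` drops the fractional part of the running sum on every addition.

```c
#include <assert.h>

/* Bug: `sumi` is an int, so `sumi + d[i]*sumi_block[i]` is computed in
 * float and then truncated back to int on every iteration. */
static float accum_buggy(const float *d, const int *sumi_block, int n) {
    int sumi = 0;
    for (int i = 0; i < n; ++i) sumi += d[i] * sumi_block[i];
    return (float)sumi;
}

/* Fix: keep a float accumulator for the scaled partial products. */
static float accum_fixed(const float *d, const int *sumi_block, int n) {
    float sumf = 0.0f;
    for (int i = 0; i < n; ++i) sumf += d[i] * sumi_block[i];
    return sumf;
}
```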

Benchmarks (Bonsai-8B, CPU-only, AVX-512):
  Before:  0.73 t/s prompt, 0.65 t/s generation (gibberish output)
  After:  23.2 t/s prompt, 13.5 t/s generation (coherent output)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added the ggml label Apr 3, 2026
@khosravipasha
Collaborator

khosravipasha commented Apr 4, 2026

Thanks, this looks great; nice write-up.
Could you also run correctness checks, similar to what we did here: #8?
See the KL divergence between the packed vs unpacked model (it should be close to 0).

I am not too familiar with SIMD/AVX stuff; what CPUs does this support?
I know there is some difference between AVX512BW, AVX2, SSSE3, and AVX; is this for different CPU architectures?

@pl752
Author

pl752 commented Apr 4, 2026

@khosravipasha You are welcome :)
These are various generations of x86 SIMD instruction sets (mostly backward compatible: an AVX-capable CPU also supports SSE, etc.):

  • The SSE family is 128-bit SIMD introduced with the Pentium III (fp32 only); SSE2 followed with fp64 and integer ops, then SSE3 with some additional utility instructions;
  • SSSE3 is the most interesting in the 128-bit case, as it provides instructions for shuffling/expansion of bit masks, sign-assignment ops, and the int dot product used in our and other implementations; this set has been available since the Core 2 generation and AMD Bobcat, so it covers essentially all realistic x86 targets (barring some edge cases);
  • AVX is 256-bit SIMD for fp32/fp64 introduced with Sandy Bridge (e.g. Core i7-2xxx) and AMD Bulldozer (e.g. FX-8100); it lacks integer ops, but can still be used for accumulation paired with SSSE3 instructions;
  • AVX2 adds most of the missing 256-bit integer instructions and arrived alongside FMA(3) (except for the VIA C4650), providing fused multiply-add; introduced with Haswell and AMD Zen, in our case it covers most modern-ish processors, bridging the gap between legacy and state-of-the-art CPUs;
  • AVX-512 extends AVX2 ops to 512-bit SIMD and has numerous extensions, including AVX512BW with support for int8 and int16 ops (used in the Q8_0 sign expansion to int16);
  • The scalar fallback is mostly for non-x86 CPUs and conformity; there I got rid of the bit-mask calculations and the int-to-float conversion with multiplication, and unrolled the inner loop with constant masks to minimize computation and improve pipeline utilization.
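As a scalar sketch of the trick the SSSE3 path performs with its shuffle/compare instructions (illustrative names; assumptions, not the actual kernel code): each packed sign bit is expanded into a full byte mask, 0x00 for "keep" and 0xFF for "negate", so it can later be applied vector-wide.

```c
#include <assert.h>
#include <stdint.h>

/* Expand sign bit i of `bits` into a whole-byte mask. */
static uint8_t sign_mask_byte(uint8_t bits, int i) {
    return (uint8_t)(((bits >> i) & 1) ? 0xFF : 0x00);
}

/* Expand all 8 sign bits into 8 byte masks packed in a uint64_t,
 * byte i of the result corresponding to element i. */
static uint64_t expand_sign_masks(uint8_t bits) {
    uint64_t out = 0;
    for (int i = 0; i < 8; ++i)
        out |= (uint64_t)sign_mask_byte(bits, i) << (8 * i);
    return out;
}
```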

As for perplexity, I have performed a run on a single 64-token wikitext-2-test chunk with the 1.7B model:

| Flow | PPL | ln(PPL(Q)/PPL(base)) | KL Divergence | Δp RMS | Same top p |
|---|---|---|---|---|---|
| Scalar | 17.1988 ± 9.6330 | -0.00000 ± -nan | -0.00000 ± 0.00000 | 0.001 ± 0.000 % | 100.000 ± 0.000 % |
| SSSE3 | 17.2739 ± 9.7094 | 0.00435 ± 0.00450 | 0.00024 ± 0.00004 | 0.218 ± 0.038 % | 100.000 ± 0.000 % |
| AVX | 17.2402 ± 9.6760 | 0.00240 ± 0.00352 | 0.00025 ± 0.00005 | 0.362 ± 0.062 % | 90.323 ± 5.398 % |
| AVX2 | 17.2321 ± 9.6740 | 0.00193 ± 0.00398 | 0.00023 ± 0.00004 | 0.379 ± 0.067 % | 96.774 ± 3.226 % |
| AVX-512 | 17.2463 ± 9.6895 | 0.00275 ± 0.00298 | 0.00023 ± 0.00005 | 0.279 ± 0.070 % | 93.548 ± 4.485 % |

I will perform more runs

@pl752
Author

pl752 commented Apr 4, 2026

I have run 5 chunks of 512 tokens; this looks better, I think. I will run 100 chunks:

| Flow | Mean PPL(Q) | Mean ln(PPL(Q)/PPL(base)) | Mean KLD | RMS Δp | Same top p |
|---|---|---|---|---|---|
| Scalar baseline | 20.943033 ± 2.071658 | 0.000000 | 0.000000 | 0.000 % | 100.000 % |
| SSSE3 | 21.076136 ± 2.102022 | 0.006335 ± 0.004656 | 0.000267 ± 0.000009 | 0.386 ± 0.017 % | 99.059 ± 0.271 % |
| AVX | 21.081167 ± 2.102227 | 0.006574 ± 0.004686 | 0.000285 ± 0.000011 | 0.404 ± 0.019 % | 99.451 ± 0.207 % |
| AVX2 | 21.087163 ± 2.103328 | 0.006858 ± 0.004650 | 0.000282 ± 0.000012 | 0.418 ± 0.027 % | 99.529 ± 0.192 % |
| AVX-512BW | 21.095567 ± 2.103673 | 0.007257 ± 0.004635 | 0.000279 ± 0.000010 | 0.399 ± 0.019 % | 99.294 ± 0.235 % |

@pl752
Author

pl752 commented Apr 4, 2026

I am somewhat in doubt now: the divergence looks comparable to the effect of comparing CPU to CUDA, or somewhere between fp32->fp16 and fp32->q8_0; maybe it comes from using a smaller model.
Note: I am comparing between my own implementations; I think I need to use fp16 as a baseline first.

@khosravipasha
Collaborator

khosravipasha commented Apr 4, 2026

@pl752 Awesome, thanks for the explanations.
The KLs look pretty good; being close to 0 is good. The rest is probably numerical noise, since on the llama.cpp side they also convert logits to fp16 to save time (I only ran a few chunks myself too). The llama.cpp tool was designed to see how good their quantizations are; for us the weights are equivalent packed and unpacked, so having KL close to zero for a few chunks is good enough (this mostly tests the kernels, not the quantization itself).

https://github.com/ggml-org/llama.cpp/tree/master/tools/perplexity

Yeah, I ran the model in fp16 as the baseline, using these: https://huggingface.co/collections/prism-ml/bonsai-auxiliary
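For reference, the measurement flow with llama.cpp's perplexity tool looks roughly like this. This is a hedged sketch: the model and output file names are made up, and the flag names should be checked against `llama-perplexity --help` for your llama.cpp version.

```shell
# 1) save the fp16 baseline logits over the eval text
llama-perplexity -m bonsai-1.7b-f16.gguf -f wiki.test.raw \
    --kl-divergence-base bonsai-f16.kld

# 2) score the quantized/packed model against that baseline,
#    reporting KL divergence, Δp RMS and same-top-p per chunk
llama-perplexity -m bonsai-1.7b-q1_0_g128.gguf -f wiki.test.raw \
    --kl-divergence-base bonsai-f16.kld --kl-divergence
```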

@pl752
Author

pl752 commented Apr 4, 2026

Okay, don't forget to thank the user from whom I've hijacked the AVX-512 implementation

@khosravipasha
Collaborator

khosravipasha commented Apr 4, 2026

@pl752 good idea, which one was it? We can tag them here.
Right now I'm only sending a PR to llama.cpp with the generic CPU path, to finalize the naming, formatting, etc.

After that's merged, we can all send a PR together to main llama.cpp, with everyone that contributed tagged, maybe.

Note that there will be some naming changes (in summary, Q1_0_g128 is renamed to Q1_0, and the original Q1_0 will be deleted). This should not affect running the current models.

ggml-org#21273

@pl752
Author

pl752 commented Apr 4, 2026

Note that this PR is built on top of the #3 by @jordankzf, who implemented AVX-512 workflow

@pl752
Author

pl752 commented Apr 5, 2026

Performed an additional 5×512 run against the unpacked gguf:

| Flow | Mean PPL(Q) | Mean ln(PPL(Q)/PPL(base)) | Mean KLD | RMS Δp | Same top p |
|---|---|---|---|---|---|
| Scalar | 21.082185 ± 2.102340 | 0.005412 ± 0.004643 | 0.000213 ± 0.000008 | 0.334 ± 0.017 % | 99.451 ± 0.207 % |
| SSSE3 | 21.076136 ± 2.102022 | 0.005125 ± 0.004661 | 0.000220 ± 0.000008 | 0.341 ± 0.016 % | 99.137 ± 0.259 % |
| AVX | 21.081167 ± 2.102227 | 0.005364 ± 0.004690 | 0.000235 ± 0.000010 | 0.362 ± 0.017 % | 99.373 ± 0.221 % |
| AVX2 | 21.087163 ± 2.103328 | 0.005649 ± 0.004643 | 0.000216 ± 0.000009 | 0.377 ± 0.023 % | 99.608 ± 0.175 % |
| AVX-512BW | 21.095567 ± 2.103673 | 0.006047 ± 0.004636 | 0.000222 ± 0.000009 | 0.365 ± 0.020 % | 99.059 ± 0.271 % |

@pl752
Author

pl752 commented Apr 5, 2026

UPD: I have reviewed how I was interleaving instructions when testing the various register-pressure options and found issues resulting in register spilling. So I relied on the compiler doing its job properly and simply unrolled the inner loop with individual accumulators for SSSE3 (the compiler already did pretty well for the other flows). I also tried the same thing for AVX-512, but it resulted in a tiny performance regression. It had almost no effect on perplexity.
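The individual-accumulators idea can be illustrated in scalar C (illustrative names; the real kernel sums vector lanes, not single bytes): independent sums break the loop-carried dependency chain, letting a superscalar core overlap the additions, although the compiler may already do this on its own at -O2/-O3.

```c
#include <assert.h>
#include <stdint.h>

/* Sum n bytes (n a multiple of 4) using four independent accumulators
 * so consecutive additions do not depend on each other. */
static int32_t sum_unrolled4(const int8_t *x, int n) {
    int32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < n; i += 4) {
        s0 += x[i + 0];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    /* single horizontal reduction at the end */
    return (s0 + s1) + (s2 + s3);
}
```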

Effects on performance (the baseline has drifted due to using -t10 instead of -t6):

| flow | run | baseline | updated | delta |
|---|---|---|---|---|
| SSSE3 | pp512 | 33.38 t/s | 39.18 t/s | +17.36% |
| SSSE3 | tg128 | 24.61 t/s | 29.24 t/s | +18.81% |


Copilot AI left a comment


Pull request overview

This PR focuses on improving CPU inference throughput by optimizing the q1_0 / q1_0_g128 dot-product kernels against q8_0, reducing bit-twiddling overhead in portable fallbacks and introducing additional optimized x86 SIMD execution paths.

Changes:

  • Reworked generic fallbacks to process packed sign bits in a byte-oriented way (4 × 8-value groups per 32-element sub-block), eliminating per-element bit index arithmetic.
  • Implemented x86-specialized kernels for ggml_vec_dot_q1_0_q8_0 and ggml_vec_dot_q1_0_g128_q8_0 with multiple SIMD paths (SSSE3 / AVX / AVX2 / AVX-512BW) plus scalar byte-oriented fallback.
  • Added small SSSE3 helpers to expand packed sign bits into byte masks and to reduce vector accumulators.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
|---|---|
| ggml/src/ggml-cpu/quants.c | Optimizes the portable q1_0 and q1_0_g128 generic dot fallbacks by switching to explicit byte-oriented sign decoding and removing per-element bit math. |
| ggml/src/ggml-cpu/arch/x86/quants.c | Replaces the x86 dispatch to generic kernels with specialized SIMD implementations across AVX-512BW/AVX2/AVX/SSSE3, keeping a byte-oriented scalar fallback. |


@zcattacz

zcattacz commented Apr 6, 2026

I tested the AVX2 impl; it is slightly faster than #7 (see the estimated full test time) but slower than xor+sub. Maybe the reported 0.00022 KLD is arch-related (Tiger Lake and Broadwell are both Intel CPUs). I have tried several impls on Broadwell and all hit the same KLD after the first few chunks, so there is little point in running the full test just to confirm the KLD.

PR10 with mul_sum_i8_pairs_float

system_info: n_threads = 2 (n_threads_batch = 2) / 4 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
kl_divergence: computing over 100 chunks, n_ctx=512, batch_size=2048, n_seq=4
kl_divergence: 150.86 seconds per pass - ETA 1 hours 2.85 minutes

chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   1      13.9557 ±    3.1807      -0.00019 ±    0.00239       0.00019 ±    0.00002     0.376 ±  0.047 %    99.608 ±  0.392 %
   2      20.1986 ±    3.4363       0.01428 ±    0.01146       0.00020 ±    0.00001     0.346 ±  0.030 %    99.608 ±  0.277 %
   3      20.8582 ±    2.7888       0.00944 ±    0.00766       0.00021 ±    0.00001     0.375 ±  0.025 %    99.216 ±  0.319 %
   4      21.2096 ±    2.3896       0.00684 ±    0.00577       0.00022 ±    0.00001     0.385 ±  0.026 %    99.412 ±  0.240 %
   5      21.0872 ±    2.1033       0.00566 ±    0.00464       0.00022 ±    0.00001     0.376 ±  0.023 %    99.529 ±  0.192 %
   6      21.2932 ±    1.9099       0.00549 ±    0.00390       0.00021 ±    0.00001     0.362 ±  0.020 %    99.477 ±  0.184 %
   7      21.4337 ±    1.7665       0.00508 ±    0.00335       0.00021 ±    0.00001     0.365 ±  0.020 %    99.440 ±  0.177 %
   8      23.1788 ±    1.8031       0.00527 ±    0.00297       0.00021 ±    0.00001     0.364 ±  0.018 %    99.412 ±  0.169 %
   9      24.6955 ±    1.8365       0.00752 ±    0.00337       0.00022 ±    0.00001     0.355 ±  0.017 %    99.390 ±  0.163 %
  10      25.4214 ±    1.7879       0.00672 ±    0.00303       0.00022 ±    0.00001     0.353 ±  0.015 %    99.294 ±  0.166 %
  11      26.0683 ±    1.7516       0.00617 ±    0.00276       0.00022 ±    0.00001     0.354 ±  0.014 %    99.287 ±  0.159 %
  12      26.5272 ±    1.7091       0.00582 ±    0.00254       0.00022 ±    0.00001     0.351 ±  0.013 %    99.346 ±  0.146 %
xor+sub (like PR4)

system_info: n_threads = 2 (n_threads_batch = 2) / 4 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
kl_divergence: computing over 100 chunks, n_ctx=512, batch_size=2048, n_seq=4
kl_divergence: 115.23 seconds per pass - ETA 48.00 minutes

chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   1      13.9528 ±    3.1791      -0.00040 ±    0.00223       0.00019 ±    0.00002     0.382 ±  0.053 %    99.608 ±  0.392 %
   2      20.1970 ±    3.4355       0.01420 ±    0.01145       0.00019 ±    0.00001     0.343 ±  0.033 %    99.608 ±  0.277 %
   3      20.8596 ±    2.7888       0.00950 ±    0.00765       0.00021 ±    0.00001     0.351 ±  0.026 %    99.346 ±  0.292 %
   4      21.2115 ±    2.3896       0.00693 ±    0.00576       0.00022 ±    0.00001     0.369 ±  0.025 %    99.510 ±  0.219 %
   5      21.0887 ±    2.1034       0.00573 ±    0.00463       0.00022 ±    0.00001     0.363 ±  0.022 %    99.608 ±  0.175 %
   6      21.2944 ±    1.9099       0.00555 ±    0.00389       0.00021 ±    0.00001     0.351 ±  0.019 %    99.542 ±  0.173 %
   7      21.4348 ±    1.7665       0.00513 ±    0.00334       0.00021 ±    0.00001     0.355 ±  0.020 %    99.496 ±  0.168 %
PR7 with _mm256_shuffle_epi8

system_info: n_threads = 2 (n_threads_batch = 2) / 4 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
kl_divergence: computing over 100 chunks, n_ctx=512, batch_size=2048, n_seq=4
kl_divergence: 186.99 seconds per pass - ETA 1 hours 17.90 minutes

chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   1      13.9733 ±    3.1846       0.00107 ±    0.00236       0.00020 ±    0.00002     0.402 ±  0.048 %    99.608 ±  0.392 %
   2      20.2038 ±    3.4373       0.01454 ±    0.01146       0.00022 ±    0.00002     0.375 ±  0.029 %    99.608 ±  0.277 %
   3      20.8431 ±    2.7865       0.00871 ±    0.00766       0.00023 ±    0.00001     0.387 ±  0.026 %    98.693 ±  0.411 %
   4      21.1827 ±    2.3859       0.00558 ±    0.00577       0.00023 ±    0.00001     0.378 ±  0.022 %    99.020 ±  0.309 %
   5      21.0675 ±    2.1012       0.00473 ±    0.00465       0.00022 ±    0.00001     0.379 ±  0.019 %    99.137 ±  0.259 %
   6      21.2662 ±    1.9072       0.00422 ±    0.00390       0.00022 ±    0.00001     0.381 ±  0.018 %    99.085 ±  0.244 %
   7      21.4126 ±    1.7643       0.00409 ±    0.00335       0.00022 ±    0.00001     0.374 ±  0.016 %    98.992 ±  0.237 %

@pl752
Author

pl752 commented Apr 6, 2026

@zcattacz Thank you for the hint; it worked, at least for AVX2. I will revise my current kernels and post updates.

@zcattacz

zcattacz commented Apr 6, 2026

@pl752, oh, my bad, I misread your KLD. Were they all tested on AMD? It's also around 0.00022. The xor+sub is adapted from PR4; if you are after speed, please give it a try. You can find the code I tested for AVX2 in my comment in #7. Even the shadowed variable gives it a 5%~10% boost. I also tested a double-accumulator impl, but it didn't give any edge; the compiler seems to be doing some magic here.
I did a similar SSSE3 test (code also in #7) for KLD. I didn't save the result, and since it's so slow on the i5, I'm not going to run it again; iirc, the KLD was also ~0.00022.
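For reference, the xor+sub trick discussed above in scalar form (a sketch; the SIMD versions apply it lane-wise with instructions such as `_mm256_xor_si256` and `_mm256_sub_epi8`, versus a `_mm256_sign_epi8`-style approach): with a mask m that is 0 (keep) or -1 (negate), `(x ^ m) - m` selects between x and -x branchlessly.

```c
#include <assert.h>
#include <stdint.h>

/* Branchless conditional negate: m == 0 returns x, m == -1 returns -x.
 * x ^ -1 is ~x, and ~x + 1 is -x in two's complement. */
static int32_t xor_sub_apply(int32_t x, int32_t m) {
    return (x ^ m) - m;
}
```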

@pl752
Author

pl752 commented Apr 6, 2026

@zcattacz They were all tested on an AMD Ryzen 5 7640HS (Zen 4).
