
x86: implement AVX2 kernel for ggml_vec_dot_q1_0_g128_q8_0#11

Open
SimesD61 wants to merge 1 commit into PrismML-Eng:prism from SimesD61:feat/avx2-q1_0_g128-kernel

Conversation

@SimesD61 SimesD61 commented Apr 5, 2026

Problem

The x86 implementation of ggml_vec_dot_q1_0_g128_q8_0 in ggml/src/ggml-cpu/arch/x86/quants.c was a stub that immediately fell through to the scalar generic fallback:

void ggml_vec_dot_q1_0_g128_q8_0(...) {
    ggml_vec_dot_q1_0_g128_q8_0_generic(...);
}

The ARM NEON implementation was already fully vectorized. On x86 this meant Bonsai 8B ran at ~0.04 tok/s — 67× slower than the ARM CPU path.

Solution

Full AVX2 implementation using the same algorithm as the NEON kernel:

  • vpshufb bit expansion: Each 32-bit sub-block is broadcast to 32 bytes via _mm256_shuffle_epi8, then AND+cmpeq decodes the 1-bit weights to sign bytes (+1/-1)
  • INT8 dot product: maddubs_epi16 + madd_epi16 for efficient 8-bit multiply-accumulate
  • 4 independent FMA accumulators: Hides the 5-cycle FMA latency on Skylake (matches one accumulator per sub-block of the block_q1_0_g128 layout)
  • Falls back to generic on non-AVX2 targets
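
The decode-and-dot sequence in the first two bullets can be sketched for a single 32-weight sub-block. This is an illustrative reconstruction under stated assumptions, not the PR's actual code; the function name and layout are hypothetical:

```c
#include <immintrin.h>
#include <stdint.h>

// Hypothetical sketch: dot one 32-bit sub-block of packed 1-bit weights
// against 32 int8 activations. The target attribute lets this compile
// without a global -mavx2 flag (GCC/Clang).
__attribute__((target("avx2")))
static int32_t dot_1bit_subblock_avx2(uint32_t packed, const int8_t *q8) {
    // Broadcast the 4 packed bytes so byte lane i holds packed byte i/8.
    // vpshufb shuffles within each 128-bit half, so the control vector
    // indexes bytes 0..1 in the low half and 2..3 in the high half.
    const __m256i bytes = _mm256_set1_epi32((int32_t)packed);
    const __m256i shuf  = _mm256_setr_epi8(
        0,0,0,0,0,0,0,0, 1,1,1,1,1,1,1,1,
        2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3);
    const __m256i rep = _mm256_shuffle_epi8(bytes, shuf);

    // Test bit (i & 7) in lane i; cmpeq yields 0xFF where the bit is set.
    const __m256i bit = _mm256_set1_epi64x(0x8040201008040201ULL);
    const __m256i cmp = _mm256_cmpeq_epi8(_mm256_and_si256(rep, bit), bit);

    // Map 0xFF/0x00 to +1/-1 sign bytes: (cmp & 2) - 1.
    const __m256i signs = _mm256_sub_epi8(
        _mm256_and_si256(cmp, _mm256_set1_epi8(2)), _mm256_set1_epi8(1));

    // INT8 dot product: |w| = 1 (unsigned) times sign-adjusted activations,
    // pairwise-summed to int16, then widened to int32 via madd with ones.
    const __m256i qy  = _mm256_loadu_si256((const __m256i *)q8);
    const __m256i p16 = _mm256_maddubs_epi16(_mm256_set1_epi8(1),
                                             _mm256_sign_epi8(qy, signs));
    const __m256i p32 = _mm256_madd_epi16(p16, _mm256_set1_epi16(1));

    // Horizontal sum of the 8 int32 lanes.
    __m128i s = _mm_add_epi32(_mm256_castsi256_si128(p32),
                              _mm256_extracti128_si256(p32, 1));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(s);
}
```

In the real kernel the int32 pair-sums would feed the per-block FMA accumulators rather than being reduced per sub-block; the horizontal sum here just keeps the sketch self-contained.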

Performance (Intel i7-8700B, AVX2, no AVX-512)

                       tok/s
Before (scalar stub)   ~0.04
After (AVX2)           ~8.0
Speedup                ~200x

The 8 tok/s result is at the compute-bound ceiling for Q1_0_g128 on this CPU — Q1_0_g128 is ~4x more compute-intensive per byte than Q4_0, so further gains would require AVX-512 or a fundamentally different algorithm.
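
The non-AVX2 fallback mentioned above is the usual compile-time guard in per-arch ggml kernels. A minimal self-contained sketch of the shape (the function here is an illustrative stand-in, not the PR's symbol):

```c
// Illustrative stand-in: reports which path this translation unit compiles.
// The real dispatch has the same shape — the exported symbol either runs
// the AVX2 body or calls the scalar _generic kernel.
static const char *vec_dot_path(void) {
#if defined(__AVX2__)
    return "avx2";     /* vectorized body would live here */
#else
    return "generic";  /* scalar fallback path */
#endif
}
```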

The x86 implementation was a stub that called the scalar generic fallback.
The ARM NEON kernel was already fully vectorized. This implements the same
algorithm using AVX2 intrinsics.

Key techniques:
- vpshufb (_mm256_shuffle_epi8) to broadcast each 4-byte sub-block to 32 lanes
- AND+cmpeq to decode 1-bit weights to sign bytes (+1/-1)
- maddubs_epi16 + madd_epi16 for INT8 dot product reduction
- 4 independent FMA accumulators to hide the 5-cycle FMA latency

Performance on Intel i7-8700B (no AVX-512):
- Before: ~0.04 tok/s (scalar fallback, 67x slower than ARM CPU)
- After:  ~8 tok/s (AVX2, matches compute-bound ceiling for Q1_0_g128)
- ~200x speedup over the stub

Falls back to generic implementation on non-AVX2 targets.
@github-actions github-actions bot added the ggml label Apr 5, 2026
