
x86: implement AVX2 kernel for ggml_vec_dot_q1_0_g128_q8_0#11

Open
SimesD61 wants to merge 1 commit into PrismML-Eng:prism from SimesD61:feat/avx2-q1_0_g128-kernel

Conversation

@SimesD61 SimesD61 commented Apr 5, 2026

Problem

The x86 implementation of ggml_vec_dot_q1_0_g128_q8_0 in ggml/src/ggml-cpu/arch/x86/quants.c was a stub that immediately fell through to the scalar generic fallback:

void ggml_vec_dot_q1_0_g128_q8_0(...) {
    ggml_vec_dot_q1_0_g128_q8_0_generic(...);
}

The ARM NEON implementation was already fully vectorized. On x86 this meant Bonsai 8B ran at ~0.04 tok/s — 67× slower than the ARM CPU path.

Solution

Full AVX2 implementation using the same algorithm as the NEON kernel:

  • vpshufb bit expansion: Each 32-bit sub-block is broadcast to 32 bytes via _mm256_shuffle_epi8, then AND+cmpeq decodes the 1-bit weights to sign bytes (+1/-1)
  • INT8 dot product: maddubs_epi16 + madd_epi16 for efficient 8-bit multiply-accumulate
  • 4 independent FMA accumulators: Hides the 5-cycle FMA latency on Skylake (matches one accumulator per sub-block of the block_q1_0_g128 layout)
  • Falls back to generic on non-AVX2 targets
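
The decode-and-dot sequence in the first two bullets can be sketched for a single 32-weight sub-block. This is an illustrative reconstruction under stated assumptions, not the PR's actual code; the function name and layout are hypothetical:

```c
#include <immintrin.h>
#include <stdint.h>

// Hypothetical sketch: dot one 32-bit sub-block of packed 1-bit weights
// against 32 int8 activations. The target attribute lets this compile
// without a global -mavx2 flag (GCC/Clang).
__attribute__((target("avx2")))
static int32_t dot_1bit_subblock_avx2(uint32_t packed, const int8_t *q8) {
    // Broadcast the 4 packed bytes so byte lane i holds packed byte i/8.
    // vpshufb shuffles within each 128-bit half, so the control vector
    // indexes bytes 0..1 in the low half and 2..3 in the high half.
    const __m256i bytes = _mm256_set1_epi32((int32_t)packed);
    const __m256i shuf  = _mm256_setr_epi8(
        0,0,0,0,0,0,0,0, 1,1,1,1,1,1,1,1,
        2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3);
    const __m256i rep = _mm256_shuffle_epi8(bytes, shuf);

    // Test bit (i & 7) in lane i; cmpeq yields 0xFF where the bit is set.
    const __m256i bit = _mm256_set1_epi64x(0x8040201008040201ULL);
    const __m256i cmp = _mm256_cmpeq_epi8(_mm256_and_si256(rep, bit), bit);

    // Map 0xFF/0x00 to +1/-1 sign bytes: (cmp & 2) - 1.
    const __m256i signs = _mm256_sub_epi8(
        _mm256_and_si256(cmp, _mm256_set1_epi8(2)), _mm256_set1_epi8(1));

    // INT8 dot product: |w| = 1 (unsigned) times sign-adjusted activations,
    // pairwise-summed to int16, then widened to int32 via madd with ones.
    const __m256i qy  = _mm256_loadu_si256((const __m256i *)q8);
    const __m256i p16 = _mm256_maddubs_epi16(_mm256_set1_epi8(1),
                                             _mm256_sign_epi8(qy, signs));
    const __m256i p32 = _mm256_madd_epi16(p16, _mm256_set1_epi16(1));

    // Horizontal sum of the 8 int32 lanes.
    __m128i s = _mm_add_epi32(_mm256_castsi256_si128(p32),
                              _mm256_extracti128_si256(p32, 1));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(s);
}
```

In the real kernel the int32 pair-sums would feed the per-block FMA accumulators rather than being reduced per sub-block; the horizontal sum here just keeps the sketch self-contained.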

Performance (Intel i7-8700B, AVX2, no AVX-512)

                       tok/s
Before (scalar stub)   ~0.04
After (AVX2)           ~8.0
Speedup                ~200x

The 8 tok/s result is at the compute-bound ceiling for Q1_0_g128 on this CPU — Q1_0_g128 is ~4x more compute-intensive per byte than Q4_0, so further gains would require AVX-512 or a fundamentally different algorithm.
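
The non-AVX2 fallback mentioned above is the usual compile-time guard in per-arch ggml kernels. A minimal self-contained sketch of the shape (the function here is an illustrative stand-in, not the PR's symbol):

```c
// Illustrative stand-in: reports which path this translation unit compiles.
// The real dispatch has the same shape — the exported symbol either runs
// the AVX2 body or calls the scalar _generic kernel.
static const char *vec_dot_path(void) {
#if defined(__AVX2__)
    return "avx2";     /* vectorized body would live here */
#else
    return "generic";  /* scalar fallback path */
#endif
}
```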

The x86 implementation was a stub that called the scalar generic fallback.
The ARM NEON kernel was already fully vectorized. This implements the same
algorithm using AVX2 intrinsics.

Key techniques:
- vpshufb (_mm256_shuffle_epi8) to broadcast each 4-byte sub-block to 32 lanes
- AND+cmpeq to decode 1-bit weights to sign bytes (+1/-1)
- maddubs_epi16 + madd_epi16 for INT8 dot product reduction
- 4 independent FMA accumulators to hide the 5-cycle FMA latency

Performance on Intel i7-8700B (no AVX-512):
- Before: ~0.04 tok/s (scalar fallback, 67x slower than ARM CPU)
- After:  ~8 tok/s (AVX2, matches compute-bound ceiling for Q1_0_g128)
- ~200x speedup over the stub

Falls back to generic implementation on non-AVX2 targets.
@github-actions github-actions bot added the ggml label Apr 5, 2026
