fix: Q1_0_g128 x86 CPU kernel — float truncation + AVX2 vectorization#7

Open
wildcattrio wants to merge 1 commit into PrismML-Eng:prism from wildcattrio:fix/x86-q1_0_g128-kernel

Conversation

@wildcattrio

Summary

The Q1_0_g128 x86 CPU kernel produces gibberish output at 0.25 tok/s on Intel CPUs. Two bugs:

Bug 1: Float-to-int truncation (causes gibberish)

The per-block accumulator is int, but d1 * sumi_block produces a float. The implicit cast truncates each Q8_0 block's scaled contribution toward zero (typically to 0 or ±1), destroying the output.

// Before (broken):
int sumi = 0;
sumi += d1 * sumi_block;  // float truncated to int

// After (fixed):
float block_sum = 0.0f;
block_sum += d1 * (float)sumi_block;

Bug 2: No SIMD (causes 0.25 tok/s)

The x86 kernel is scalar-only while the ARM NEON version has full vectorization. Added AVX2 using the same broadcast → shuffle → cmpeq → mul_sum_i8_pairs_float pattern from the existing ggml_vec_dot_q1_0_q8_0 kernel.

Results (i5-1135G7, 32GB, Bonsai 8B)

Version                  tok/s   Output
Before (MSVC, shipped)   0.25    ", with is it. and the. and the.... in........."
Bug fix only (scalar)    3.7     Correct
Bug fix + AVX2           6.9     Correct

Files changed

  • ggml/src/ggml-cpu/arch/x86/quants.c — AVX2 kernel + scalar fix
  • ggml/src/ggml-cpu/quants.c — generic scalar fallback fix

Test plan

  • Bonsai 8B: coherent output on Q&A, reasoning, and JSON extraction prompts
  • Bonsai 4B: coherent output (also tested)
  • Standard GGUF (Qwen 3.5 4B Q4_K_M): no regression, loads and runs correctly
  • Benchmark: 3 prompts × 8 configurations, all results in JSON

The Q1_0_g128 x86 kernel has two bugs causing gibberish output at
0.25 tok/s on Intel CPUs:

1. Float-to-int truncation: the per-block accumulator was `int`,
   truncating `d1 * sumi_block` (float * int → float → int). Each
   Q8_0 block's scale factor was rounded to 0 or ±1, destroying
   the output. Fix: `float block_sum` accumulator.

2. No SIMD: the x86 path was scalar-only while ARM NEON had full
   vectorization. Added AVX2 using the same broadcast/shuffle/cmpeq
   pattern from the existing Q1_0 kernel + mul_sum_i8_pairs_float.

Results on i5-1135G7 with Bonsai 8B:
  Before (MSVC):      0.25 tok/s, gibberish output
  Bug fix only:       3.7  tok/s, correct output
  Bug fix + AVX2:     6.9  tok/s, correct output

Both the x86-specific kernel (arch/x86/quants.c) and the generic
fallback (quants.c) are fixed.
@github-actions github-actions bot added the ggml label Apr 2, 2026
@khosravipasha
Collaborator

This looks great, thanks. There were a few CPU kernel fixes and I did not see them until I pushed my changes. For now I removed the buggy x86 path and will merge one of the correct AVX ones.

Could you run the KL divergence tests described here: #8

@zcattacz

zcattacz commented Apr 3, 2026

Running 8B on an i5 box with this PR, I get consistent
[ Prompt: 1.8 t/s | Generation: 2.2~2.4 t/s ] performance.

After swapping out the defined(__AVX2__) logic for PR #4's xor + sub logic (PR #4 is based on an old commit, so it is difficult to tinker with), I get consistent
[ Prompt: 2.7 t/s | Generation: 2.2~2.4 t/s ] performance.

The alternative code below on an i5 Broadwell gives:
~8 tps prompt processing and ~6 tps generation for 4B
~4 tps prompt processing and ~3 tps generation for 8B

/*
system_info: n_threads = 2 (n_threads_batch = 2) / 4 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
kl_divergence: computing over 100 chunks, n_ctx=512, batch_size=2048, n_seq=4
kl_divergence: 115.23 seconds per pass - ETA 48.00 minutes
chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   1      13.9528 ±    3.1791      -0.00040 ±    0.00223       0.00019 ±    0.00002     0.382 ±  0.053 %    99.608 ±  0.392 %
   2      20.1970 ±    3.4355       0.01420 ±    0.01145       0.00019 ±    0.00001     0.343 ±  0.033 %    99.608 ±  0.277 %
   3      20.8596 ±    2.7888       0.00950 ±    0.00765       0.00021 ±    0.00001     0.351 ±  0.026 %    99.346 ±  0.292 %
   4      21.2115 ±    2.3896       0.00693 ±    0.00576       0.00022 ±    0.00001     0.369 ±  0.025 %    99.510 ±  0.219 %
*/

#if defined(__AVX2__)
    const __m256i ones_8 = _mm256_set1_epi8(1);
    const __m256i ones_16 = _mm256_set1_epi16(1);
    const __m256i byte_shuf = _mm256_setr_epi8(0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3);
    const __m256i bit_masks = _mm256_setr_epi8(1,2,4,8,16,32,64,(char)-128,1,2,4,8,16,32,64,(char)-128,1,2,4,8,16,32,64,(char)-128,1,2,4,8,16,32,64,(char)-128);
    const __m256i zero = _mm256_setzero_si256();
    __m256 acc = _mm256_setzero_ps();
    for (int ib = 0; ib < nb; ++ib) {
        const float d0 = GGML_CPU_FP16_TO_FP32(x[ib].d);
        const uint32_t *qs32 = (const uint32_t *)x[ib].qs;
        const block_q8_0 *y_ptr = &y[ib * 4];
        // y is deliberately left shadowed for a measurable performance gain
        // Unrolling removes one if
        __m256 acc_block;
        {
            const __m256i y = _mm256_loadu_si256((const __m256i *)y_ptr[0].qs);
            const __m256i sm = _mm256_cmpeq_epi8(_mm256_and_si256(_mm256_shuffle_epi8(_mm256_set1_epi32((int)qs32[0]), byte_shuf), bit_masks), zero);
            const __m256i sy = _mm256_sub_epi8(_mm256_xor_si256(y, sm), sm);
            const __m256i s32 = _mm256_madd_epi16(_mm256_maddubs_epi16(ones_8, sy), ones_16);
            // Avoid high KLD Max caused by AxB+0
            acc_block = _mm256_mul_ps(_mm256_set1_ps(GGML_CPU_FP16_TO_FP32(y_ptr[0].d)), _mm256_cvtepi32_ps(s32));
        }
    #define Q1_AVX2_BLOCK(K) \
        { \
            const __m256i y = _mm256_loadu_si256((const __m256i *)y_ptr[K].qs); \
            const __m256i sm = _mm256_cmpeq_epi8(_mm256_and_si256(_mm256_shuffle_epi8(_mm256_set1_epi32((int)qs32[K]), byte_shuf), bit_masks), zero); \
            const __m256i sy = _mm256_sub_epi8(_mm256_xor_si256(y, sm), sm); \
            const __m256i s32 = _mm256_madd_epi16(_mm256_maddubs_epi16(ones_8, sy), ones_16); \
            acc_block = _mm256_fmadd_ps(_mm256_set1_ps(GGML_CPU_FP16_TO_FP32(y_ptr[K].d)), _mm256_cvtepi32_ps(s32), acc_block); \
        }
        Q1_AVX2_BLOCK(1) Q1_AVX2_BLOCK(2) Q1_AVX2_BLOCK(3)
    #undef Q1_AVX2_BLOCK
        acc = _mm256_fmadd_ps(_mm256_set1_ps(d0), acc_block, acc);
    }
    {
        const __m128 h = _mm_add_ps(_mm256_extractf128_ps(acc, 0), _mm256_extractf128_ps(acc, 1));
        const __m128 q = _mm_add_ps(h, _mm_movehl_ps(h, h));
        *s = _mm_cvtss_f32(_mm_add_ss(q, _mm_movehdup_ps(q)));
    }

The 0.0002 KLD seems persistent on AVX2 across different basic implementations.

//impl4-macro6
/*
system_info: n_threads = 2 (n_threads_batch = 2) / 4 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
kl_divergence: computing over 100 chunks, n_ctx=512, batch_size=2048, n_seq=4
kl_divergence: 164.69 seconds per pass - ETA 1 hours 8.62 minutes

chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   1      13.9528 ±    3.1791      -0.00040 ±    0.00223       0.00019 ±    0.00002     0.382 ±  0.053 %    99.608 ±  0.392 %
   2      20.1970 ±    3.4355       0.01420 ±    0.01145       0.00019 ±    0.00001     0.343 ±  0.033 %    99.608 ±  0.277 %
   3      20.8596 ±    2.7888       0.00950 ±    0.00765       0.00021 ±    0.00001     0.351 ±  0.026 %    99.346 ±  0.292 %
   4      21.2115 ±    2.3896       0.00693 ±    0.00576       0.00022 ±    0.00001     0.369 ±  0.025 %    99.510 ±  0.219 %
*/
#if defined(__AVX2__)
    const __m256i ones_8 = _mm256_set1_epi8(1);
    const __m256i ones_16 = _mm256_set1_epi16(1);
    const __m256i byte_shuf = _mm256_setr_epi8(0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3);
    const __m256i bit_masks = _mm256_setr_epi8(1,2,4,8,16,32,64,(char)-128,1,2,4,8,16,32,64,(char)-128,1,2,4,8,16,32,64,(char)-128,1,2,4,8,16,32,64,(char)-128);
    const __m256i zero = _mm256_setzero_si256();
    __m256 acc = _mm256_setzero_ps();
    for (int ib = 0; ib < nb; ++ib) {
        const float d0 = GGML_CPU_FP16_TO_FP32(x[ib].d);
        const uint32_t *qs32 = (const uint32_t *)x[ib].qs;
        const block_q8_0 *y_ptr = &y[ib * 4];
        // y is deliberately left shadowed for a measurable performance gain
        __m256 acc_block;
    #define Q1_AVX2_BLOCK(K) \
    { \
        const __m256i y = _mm256_loadu_si256((const __m256i *)y_ptr[K].qs); \
        /* sm is 0xFF where bit == 0 (should be subtracted), 0x00 where bit == 1 (should be added) */ \
        const __m256i sm = _mm256_cmpeq_epi8(_mm256_and_si256(_mm256_shuffle_epi8(_mm256_set1_epi32((int)qs32[K]), byte_shuf), bit_masks), zero); \
        \
        /* Isolate the negative and positive y values using masks */ \
        const __m256i y_neg = _mm256_and_si256(y, sm); \
        const __m256i y_pos = _mm256_andnot_si256(sm, y); \
        \
        /* Widen to 16-bit safely. Even if y_neg contains -128, it sits in a 16-bit register now */ \
        const __m256i sum_pos = _mm256_maddubs_epi16(ones_8, y_pos); \
        const __m256i sum_neg = _mm256_maddubs_epi16(ones_8, y_neg); \
        \
        /* Subtract at 16-bit precision: 0 - (-128) safely equals +128 */ \
        const __m256i s16 = _mm256_sub_epi16(sum_pos, sum_neg); \
        const __m256i s32 = _mm256_madd_epi16(s16, ones_16); \
        \
        /* Accumulate as float */ \
        acc_block = (K == 0) \
            ? _mm256_mul_ps(_mm256_set1_ps(GGML_CPU_FP16_TO_FP32(y_ptr[K].d)), _mm256_cvtepi32_ps(s32)) \
            : _mm256_fmadd_ps(_mm256_set1_ps(GGML_CPU_FP16_TO_FP32(y_ptr[K].d)), _mm256_cvtepi32_ps(s32), acc_block); \
    }
            Q1_AVX2_BLOCK(0) Q1_AVX2_BLOCK(1) Q1_AVX2_BLOCK(2) Q1_AVX2_BLOCK(3)
    #undef Q1_AVX2_BLOCK
        acc = _mm256_fmadd_ps(_mm256_set1_ps(d0), acc_block, acc);
    }
    {
        const __m128 h = _mm_add_ps(_mm256_extractf128_ps(acc, 0), _mm256_extractf128_ps(acc, 1));
        const __m128 q = _mm_add_ps(h, _mm_movehl_ps(h, h));
        *s = _mm_cvtss_f32(_mm_add_ss(q, _mm_movehdup_ps(q)));
    }
// impl5
/*
4B Model [ Prompt: 3.8 t/s | Generation: 3.0 t/s ]
system_info: n_threads = 2 (n_threads_batch = 2) / 4 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
kl_divergence: computing over 100 chunks, n_ctx=512, batch_size=2048, n_seq=4
kl_divergence: 232.33 seconds per pass - ETA 1 hours 36.80 minutes
chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   1      13.9553 ±    3.1774      -0.00022 ±    0.00221       0.00021 ±    0.00002     0.359 ±  0.043 %    98.431 ±  0.780 %
   2      20.1932 ±    3.4345       0.01401 ±    0.01147       0.00022 ±    0.00002     0.334 ±  0.030 %    99.020 ±  0.437 %
   3      20.8668 ±    2.7899       0.00985 ±    0.00767       0.00022 ±    0.00001     0.372 ±  0.028 %    98.824 ±  0.390 %
   4      21.2224 ±    2.3914       0.00745 ±    0.00577       0.00022 ±    0.00001     0.367 ±  0.023 %    98.824 ±  0.338 %
*/
#if defined(__AVX2__)
// STRICT SCALAR MATH REPRODUCTION
    const __m256i ones_8 = _mm256_set1_epi8(1);
    const __m256i ones_16 = _mm256_set1_epi16(1);
    const __m256i byte_shuf = _mm256_setr_epi8(0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3);
    const __m256i bit_masks = _mm256_setr_epi8(1,2,4,8,16,32,64,(char)-128,1,2,4,8,16,32,64,(char)-128,1,2,4,8,16,32,64,(char)-128,1,2,4,8,16,32,64,(char)-128);
    const __m256i zero = _mm256_setzero_si256();
    
    // We replace __m256 acc with a single scalar float!
    float final_acc = 0.0f; 

    for (int ib = 0; ib < nb; ++ib) {
        const float d0 = GGML_CPU_FP16_TO_FP32(x[ib].d);
        const uint32_t *qs32 = (const uint32_t *)x[ib].qs;
        const block_q8_0 *y_ptr = &y[ib * 4];
        
        float acc_block_scalar = 0.0f;

    #define Q1_AVX2_BLOCK(K) \
        { \
            const __m256i qy = _mm256_loadu_si256((const __m256i *)y_ptr[K].qs); \
            const __m256i sm = _mm256_cmpeq_epi8(_mm256_and_si256(_mm256_shuffle_epi8(_mm256_set1_epi32((int)qs32[K]), byte_shuf), bit_masks), zero); \
            const __m256i sy = _mm256_sub_epi8(_mm256_xor_si256(qy, sm), sm); \
            const __m256i s32 = _mm256_madd_epi16(_mm256_maddubs_epi16(ones_8, sy), ones_16); \
            \
            /* 1. Force horizontal integer sum of the 8 lanes down to 1 integer */ \
            __m128i sum128 = _mm_add_epi32(_mm256_castsi256_si128(s32), _mm256_extracti128_si256(s32, 1)); \
            sum128 = _mm_hadd_epi32(sum128, sum128); \
            sum128 = _mm_hadd_epi32(sum128, sum128); \
            int exact_dot_product = _mm_cvtsi128_si32(sum128); \
            \
            /* 2. Convert to float and multiply by scale exactly like C reference */ \
            acc_block_scalar += (float)exact_dot_product * GGML_CPU_FP16_TO_FP32(y_ptr[K].d); \
        }

        Q1_AVX2_BLOCK(0) Q1_AVX2_BLOCK(1) Q1_AVX2_BLOCK(2) Q1_AVX2_BLOCK(3)
    #undef Q1_AVX2_BLOCK

        // Accumulate into scalar
        final_acc += d0 * acc_block_scalar;
    }

    *s = final_acc;

@zcattacz

zcattacz commented Apr 3, 2026

The SSE path improves performance from 0.1 tps to 0.7~0.9 tps on an N2840 Atom.

#elif defined(__SSE4_2__) || defined(__SSSE3__)
    // Optimized SSE4.2/SSSE3 path for Q1_0_g128 · Q8_0
    // This uses 128-bit registers to process 16 elements at a time.
    const __m128i ones_8    = _mm_set1_epi8(1);
    const __m128i ones_16   = _mm_set1_epi16(1);
    
    // This shuffle mask spreads the 1st byte of a register to the first 8 slots,
    // and the 2nd byte to the next 8 slots.
    const __m128i byte_shuf = _mm_setr_epi8(
        0,0,0,0,0,0,0,0, 1,1,1,1,1,1,1,1);
        
    const __m128i bit_masks = _mm_setr_epi8(
        1,2,4,8,16,32,64,(char)-128,
        1,2,4,8,16,32,64,(char)-128);
    const __m128i zero = _mm_setzero_si128();

    __m128 acc = _mm_setzero_ps();

    for (int ib = 0; ib < nb; ++ib) {
        const float d0 = GGML_CPU_FP16_TO_FP32(x[ib].d);
        const uint32_t * qs32 = (const uint32_t *)x[ib].qs;
        const block_q8_0 * y_ptr = &y[ib * 4];

        __m128 acc_block = _mm_setzero_ps();

        for (int k = 0; k < 4; ++k) {
            const float dk = GGML_CPU_FP16_TO_FP32(y_ptr[k].d);
            const __m128 vdk = _mm_set1_ps(dk);
            
            // Load 32 bytes of Q8_0 weights into two 16-byte SSE registers
            const __m128i qy_l = _mm_loadu_si128((const __m128i *)(y_ptr[k].qs));
            const __m128i qy_h = _mm_loadu_si128((const __m128i *)(y_ptr[k].qs + 16));

            const uint32_t bits = qs32[k];

            // Process Lower 16 elements (using lower 16 bits of the mask)
            const __m128i mask_l = _mm_shuffle_epi8(_mm_set1_epi16((short)(bits & 0xFFFF)), byte_shuf);
            const __m128i sm_l   = _mm_cmpeq_epi8(_mm_and_si128(mask_l, bit_masks), zero);
            const __m128i sy_l   = _mm_sub_epi8(_mm_xor_si128(qy_l, sm_l), sm_l);
            const __m128i s32_l  = _mm_madd_epi16(_mm_maddubs_epi16(ones_8, sy_l), ones_16);

            // Process Upper 16 elements (using upper 16 bits of the mask)
            const __m128i mask_h = _mm_shuffle_epi8(_mm_set1_epi16((short)(bits >> 16)), byte_shuf);
            const __m128i sm_h   = _mm_cmpeq_epi8(_mm_and_si128(mask_h, bit_masks), zero);
            const __m128i sy_h   = _mm_sub_epi8(_mm_xor_si128(qy_h, sm_h), sm_h);
            const __m128i s32_h  = _mm_madd_epi16(_mm_maddubs_epi16(ones_8, sy_h), ones_16);

            // Convert integer sums to float, scale by dk, and add to block accumulator
            const __m128i s32_total = _mm_add_epi32(s32_l, s32_h);
            acc_block = _mm_add_ps(acc_block, _mm_mul_ps(vdk, _mm_cvtepi32_ps(s32_total)));
        }

        // Final scale by d0 (block scale) and add to global accumulator
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(d0), acc_block));
    }

    // Horizontal reduction of the 4 float lanes in the SSE register
    {
        acc = _mm_add_ps(acc, _mm_movehl_ps(acc, acc));
        acc = _mm_add_ss(acc, _mm_shuffle_ps(acc, acc, 1));
        *s = _mm_cvtss_f32(acc);
    }

AI suggested that meaningful acceleration for 1-bit nets on CPUs lacking AVX instructions could only be achieved by implementing the dot product as dot = 2 × popcount(A XNOR B) − total bits, using _mm_popcnt_u64 ...
@khosravipasha, I guess these devices need a different quantization format :-D ?

@wildcattrio
Author

KL Divergence Results — x86 AVX2 (PR #7)

Ran the KL divergence tests from PR #8 on the AVX2 kernel fix from this PR. Hardware: Intel i5-1135G7 (Tiger Lake), 32GB RAM, Windows 11. Build: 8194 (1179bfc82) with Clang 22.1.2.

System info: CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1

Test setup

  • F16 reference: dequantized from Bonsai-1.7B.gguf (Q1_0_g128) using llama-quantize --allow-requantize
  • Dataset: wikitext-2-raw, -c 512 --chunks 100
  • F16 perplexity: PPL = 24.04, 41.5 tok/s prompt processing
  • Q1_0_g128 perplexity: PPL = 24.09, 27.0 tok/s prompt processing

x86 AVX2 Divergences

Metric         Q1_0_g128 (1.13 BPW)
Same top p     99.220 ± 0.055 %
Mean KLD       0.000224 ± 0.000002
Maximum KLD    0.013303
99.9% KLD      0.002923
99.0% KLD      0.001273
Median KLD     0.000150
1.0% KLD       -0.000008
Minimum KLD    -0.000083
Mean Δp        -0.005 ± 0.002 %
Maximum Δp     4.779 %
99.9% Δp       2.198 %
99.0% Δp       1.122 %
95.0% Δp       0.511 %
Median Δp      -0.000 %
5.0% Δp        -0.534 %
1.0% Δp        -1.120 %
Minimum Δp     -3.440 %
RMS Δp         0.352 ± 0.005 %

Comparison with PR #8 reference (ARM NEON / generic scalar)

Metric        ARM NEON   x86 AVX2 (this PR)
Same top p    99.965 %   99.220 %
Mean KLD      0.000000   0.000224
Maximum KLD   0.000065   0.013303
RMS Δp        0.006 %    0.352 %

The AVX2 kernel shows measurably higher divergence compared to the NEON/scalar reference. The likely cause is floating-point operation ordering: our AVX2 path pre-multiplies d0 * d1 as a combined scale per sub-block, while the scalar reference accumulates d1 * sumi_block per sub-block then multiplies by d0 at the end. This difference in FP associativity accumulates across the 28 layers.

Note: @zcattacz's XOR+SUB approach posted above uses the two-level accumulation pattern (acc_block += dk * dot, then acc += d0 * acc_block) which more closely matches the scalar reference FP ordering — it may produce better KLD numbers. Worth testing.

Output quality is still good despite the divergence — text generation is coherent and the PPL difference is only 0.057 (24.09 vs 24.04).

@zcattacz

zcattacz commented Apr 5, 2026

Hi @wildcattrio, I updated the implementation and here is the combined result. You were right: xor+sub gives slightly better KLD with good tps. I also tried other implementations for tps; the best were on par, but this one is the simplest.

Metric                    Generic Scalar Fallback   PR7 on TigerLake      PR7 on Broadwell       xor+sub on Broadwell
Same top p                99.965 ± 0.012 %          99.220 ± 0.055 %      99.216 ± 0.055 %       99.220 ± 0.055 %
Mean KLD                  0.000000 ± 0.000000       0.000224 ± 0.000002   0.000223 ± 0.000002    0.000222 ± 0.000002
Maximum KLD               0.000065                  0.013303              0.012099               0.010006
99.9% KLD                 0.000051                  0.002923              0.003226               0.003014
99.0% KLD                 0.000036                  0.001273              0.001277               0.001232
Median KLD                0.000000                  0.000150              0.000151               0.000148
1.0% KLD                  -0.000036                 -0.000008             -0.000008              -0.000008
Minimum KLD               -0.000061                 -0.000083             -0.000124              -0.000114
Mean Δp                   -0.000 ± 0.000 %          -0.005 ± 0.002 %      -0.004 ± 0.002 %       -0.001 ± 0.002 %
Maximum Δp                0.120 %                   4.779 %               6.801 %                5.948 %
99.9% Δp                  0.039 %                   2.198 %               2.232 %                2.194 %
99.0% Δp                  0.017 %                   1.122 %               1.168 %                1.109 %
95.0% Δp                  0.007 %                   0.511 %               0.510 %                0.509 %
Median Δp                 0.000 %                   -0.000 %              -0.000 %               -0.000 %
5.0% Δp                   -0.008 %                  -0.534 %              -0.534 %               -0.517 %
1.0% Δp                   -0.017 %                  -1.120 %              -1.132 %               -1.096 %
0.1% Δp                   -0.039 %                  n/a                   -2.314 %               -2.139 %
Minimum Δp                -0.102 %                  -3.440 %              -3.624 %               -3.849 %
RMS Δp                    0.006 ± 0.000 %           0.352 ± 0.005 %       0.360 ± 0.006 %        0.349 ± 0.005 %
Mean PPL(Q)               n/a                       n/a                   24.089197 ± 0.527720   24.089508 ± 0.527763
Mean PPL(Base)            n/a                       n/a                   24.036712 ± 0.525238   24.036712 ± 0.525238
Mean PPL(Q) - PPL(Base)   n/a                       n/a                   0.052485 ± 0.012109    0.052796 ± 0.012114
prompt eval TPS           n/a                       n/a                   10.25                  17.68
