fix: Q1_0_g128 x86 CPU kernel — float truncation + AVX2 vectorization#7

Open
wildcattrio wants to merge 1 commit into PrismML-Eng:prism from wildcattrio:fix/x86-q1_0_g128-kernel

Conversation

@wildcattrio

Summary

The Q1_0_g128 x86 CPU kernel produces gibberish output at 0.25 tok/s on Intel CPUs. Two bugs:

Bug 1: Float-to-int truncation (causes gibberish)

The per-block accumulator is int, but d1 * sumi_block produces a float. The implicit cast truncates each Q8_0 block's scaled contribution toward zero (typically to 0 or ±1), destroying the output.

// Before (broken):
int sumi = 0;
sumi += d1 * sumi_block;  // float truncated to int

// After (fixed):
float block_sum = 0.0f;
block_sum += d1 * (float)sumi_block;

Bug 2: No SIMD (causes 0.25 tok/s)

The x86 kernel is scalar-only while the ARM NEON version has full vectorization. Added AVX2 using the same broadcast → shuffle → cmpeq → mul_sum_i8_pairs_float pattern from the existing ggml_vec_dot_q1_0_q8_0 kernel.

Results (i5-1135G7, 32GB, Bonsai 8B)

Version                  tok/s   Output
Before (MSVC, shipped)   0.25    ", with is it. and the. and the.... in........."
Bug fix only (scalar)    3.7     Correct
Bug fix + AVX2           6.9     Correct

Files changed

  • ggml/src/ggml-cpu/arch/x86/quants.c — AVX2 kernel + scalar fix
  • ggml/src/ggml-cpu/quants.c — generic scalar fallback fix

Test plan

  • Bonsai 8B: coherent output on Q&A, reasoning, and JSON extraction prompts
  • Bonsai 4B: coherent output (also tested)
  • Standard GGUF (Qwen 3.5 4B Q4_K_M): no regression, loads and runs correctly
  • Benchmark: 3 prompts × 8 configurations, all results in JSON

The Q1_0_g128 x86 kernel has two bugs causing gibberish output at
0.25 tok/s on Intel CPUs:

1. Float-to-int truncation: the per-block accumulator was `int`,
   truncating `d1 * sumi_block` (float * int → float → int). Each
   Q8_0 block's scale factor was rounded to 0 or ±1, destroying
   the output. Fix: `float block_sum` accumulator.

2. No SIMD: the x86 path was scalar-only while ARM NEON had full
   vectorization. Added AVX2 using the same broadcast/shuffle/cmpeq
   pattern from the existing Q1_0 kernel + mul_sum_i8_pairs_float.

Results on i5-1135G7 with Bonsai 8B:
  Before (MSVC):      0.25 tok/s, gibberish output
  Bug fix only:       3.7  tok/s, correct output
  Bug fix + AVX2:     6.9  tok/s, correct output

Both the x86-specific kernel (arch/x86/quants.c) and the generic
fallback (quants.c) are fixed.
@github-actions github-actions bot added the ggml label Apr 2, 2026
@khosravipasha
Collaborator

This looks great, thanks. There were a few CPU kernel fixes and I did not see them until I pushed my changes. For now I removed the buggy x86 path and will merge one of the correct AVX ones.

Could you run the KL divergence tests described here: #8

@zcattacz

zcattacz commented Apr 3, 2026

Running 8B on an i5 box with this PR, I get consistent
[ Prompt: 1.8 t/s | Generation: 2.2~2.4 t/s ] performance.

After swapping out the defined(__AVX2__) logic for PR #4's xor + sub logic (PR #4 is based on an old commit, so it is difficult to tinker with), I get consistent
[ Prompt: 2.7 t/s | Generation: 2.2~2.4 t/s ] performance.

The alternative code below on an i5 Broadwell gives:
~8 tps prompt processing and ~6 tps generation for 4B
~4 tps prompt processing and ~3 tps generation for 8B

/*
system_info: n_threads = 2 (n_threads_batch = 2) / 4 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
kl_divergence: computing over 100 chunks, n_ctx=512, batch_size=2048, n_seq=4
kl_divergence: 115.23 seconds per pass - ETA 48.00 minutes
chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   1      13.9528 ±    3.1791      -0.00040 ±    0.00223       0.00019 ±    0.00002     0.382 ±  0.053 %    99.608 ±  0.392 %
   2      20.1970 ±    3.4355       0.01420 ±    0.01145       0.00019 ±    0.00001     0.343 ±  0.033 %    99.608 ±  0.277 %
   3      20.8596 ±    2.7888       0.00950 ±    0.00765       0.00021 ±    0.00001     0.351 ±  0.026 %    99.346 ±  0.292 %
   4      21.2115 ±    2.3896       0.00693 ±    0.00576       0.00022 ±    0.00001     0.369 ±  0.025 %    99.510 ±  0.219 %
*/

#if defined(__AVX2__)
    const __m256i ones_8 = _mm256_set1_epi8(1);
    const __m256i ones_16 = _mm256_set1_epi16(1);
    const __m256i byte_shuf = _mm256_setr_epi8(0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3);
    const __m256i bit_masks = _mm256_setr_epi8(1,2,4,8,16,32,64,(char)-128,1,2,4,8,16,32,64,(char)-128,1,2,4,8,16,32,64,(char)-128,1,2,4,8,16,32,64,(char)-128);
    const __m256i zero = _mm256_setzero_si256();
    __m256 acc = _mm256_setzero_ps();
    for (int ib = 0; ib < nb; ++ib) {
        const float d0 = GGML_CPU_FP16_TO_FP32(x[ib].d);
        const uint32_t *qs32 = (const uint32_t *)x[ib].qs;
        const block_q8_0 *y_ptr = &y[ib * 4];
        // y is deliberately left shadowed for a measurable performance gain
        // Unrolling removes one if
        __m256 acc_block;
        {
            const __m256i y = _mm256_loadu_si256((const __m256i *)y_ptr[0].qs);
            const __m256i sm = _mm256_cmpeq_epi8(_mm256_and_si256(_mm256_shuffle_epi8(_mm256_set1_epi32((int)qs32[0]), byte_shuf), bit_masks), zero);
            const __m256i sy = _mm256_sub_epi8(_mm256_xor_si256(y, sm), sm);
            const __m256i s32 = _mm256_madd_epi16(_mm256_maddubs_epi16(ones_8, sy), ones_16);
            // Avoid high KLD Max caused by AxB+0
            acc_block = _mm256_mul_ps(_mm256_set1_ps(GGML_CPU_FP16_TO_FP32(y_ptr[0].d)), _mm256_cvtepi32_ps(s32));
        }
    #define Q1_AVX2_BLOCK(K) \
        { \
            const __m256i y = _mm256_loadu_si256((const __m256i *)y_ptr[K].qs); \
            const __m256i sm = _mm256_cmpeq_epi8(_mm256_and_si256(_mm256_shuffle_epi8(_mm256_set1_epi32((int)qs32[K]), byte_shuf), bit_masks), zero); \
            const __m256i sy = _mm256_sub_epi8(_mm256_xor_si256(y, sm), sm); \
            const __m256i s32 = _mm256_madd_epi16(_mm256_maddubs_epi16(ones_8, sy), ones_16); \
            acc_block = _mm256_fmadd_ps(_mm256_set1_ps(GGML_CPU_FP16_TO_FP32(y_ptr[K].d)), _mm256_cvtepi32_ps(s32), acc_block); \
        }
        Q1_AVX2_BLOCK(1) Q1_AVX2_BLOCK(2) Q1_AVX2_BLOCK(3)
    #undef Q1_AVX2_BLOCK
        acc = _mm256_fmadd_ps(_mm256_set1_ps(d0), acc_block, acc);
    }
    {
        const __m128 h = _mm_add_ps(_mm256_extractf128_ps(acc, 0), _mm256_extractf128_ps(acc, 1));
        const __m128 q = _mm_add_ps(h, _mm_movehl_ps(h, h));
        *s = _mm_cvtss_f32(_mm_add_ss(q, _mm_movehdup_ps(q)));
    }

The 0.0002 KLD seems persistent on AVX2 across different basic implementations.

//impl4-macro6
/*
system_info: n_threads = 2 (n_threads_batch = 2) / 4 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
kl_divergence: computing over 100 chunks, n_ctx=512, batch_size=2048, n_seq=4
kl_divergence: 164.69 seconds per pass - ETA 1 hours 8.62 minutes

chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   1      13.9528 ±    3.1791      -0.00040 ±    0.00223       0.00019 ±    0.00002     0.382 ±  0.053 %    99.608 ±  0.392 %
   2      20.1970 ±    3.4355       0.01420 ±    0.01145       0.00019 ±    0.00001     0.343 ±  0.033 %    99.608 ±  0.277 %
   3      20.8596 ±    2.7888       0.00950 ±    0.00765       0.00021 ±    0.00001     0.351 ±  0.026 %    99.346 ±  0.292 %
   4      21.2115 ±    2.3896       0.00693 ±    0.00576       0.00022 ±    0.00001     0.369 ±  0.025 %    99.510 ±  0.219 %
*/
#if defined(__AVX2__)
    const __m256i ones_8 = _mm256_set1_epi8(1);
    const __m256i ones_16 = _mm256_set1_epi16(1);
    const __m256i byte_shuf = _mm256_setr_epi8(0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3);
    const __m256i bit_masks = _mm256_setr_epi8(1,2,4,8,16,32,64,(char)-128,1,2,4,8,16,32,64,(char)-128,1,2,4,8,16,32,64,(char)-128,1,2,4,8,16,32,64,(char)-128);
    const __m256i zero = _mm256_setzero_si256();
    __m256 acc = _mm256_setzero_ps();
    for (int ib = 0; ib < nb; ++ib) {
        const float d0 = GGML_CPU_FP16_TO_FP32(x[ib].d);
        const uint32_t *qs32 = (const uint32_t *)x[ib].qs;
        const block_q8_0 *y_ptr = &y[ib * 4];
        // y is deliberately left shadowed for a measurable performance gain
        __m256 acc_block;
    #define Q1_AVX2_BLOCK(K) \
    { \
        const __m256i y = _mm256_loadu_si256((const __m256i *)y_ptr[K].qs); \
        /* sm is 0xFF where bit == 0 (should be subtracted), 0x00 where bit == 1 (should be added) */ \
        const __m256i sm = _mm256_cmpeq_epi8(_mm256_and_si256(_mm256_shuffle_epi8(_mm256_set1_epi32((int)qs32[K]), byte_shuf), bit_masks), zero); \
        \
        /* Isolate the negative and positive y values using masks */ \
        const __m256i y_neg = _mm256_and_si256(y, sm); \
        const __m256i y_pos = _mm256_andnot_si256(sm, y); \
        \
        /* Widen to 16-bit safely. Even if y_neg contains -128, it sits in a 16-bit register now */ \
        const __m256i sum_pos = _mm256_maddubs_epi16(ones_8, y_pos); \
        const __m256i sum_neg = _mm256_maddubs_epi16(ones_8, y_neg); \
        \
        /* Subtract at 16-bit precision: 0 - (-128) safely equals +128 */ \
        const __m256i s16 = _mm256_sub_epi16(sum_pos, sum_neg); \
        const __m256i s32 = _mm256_madd_epi16(s16, ones_16); \
        \
        /* Accumulate as float */ \
        acc_block = (K == 0) \
            ? _mm256_mul_ps(_mm256_set1_ps(GGML_CPU_FP16_TO_FP32(y_ptr[K].d)), _mm256_cvtepi32_ps(s32)) \
            : _mm256_fmadd_ps(_mm256_set1_ps(GGML_CPU_FP16_TO_FP32(y_ptr[K].d)), _mm256_cvtepi32_ps(s32), acc_block); \
    }
            Q1_AVX2_BLOCK(0) Q1_AVX2_BLOCK(1) Q1_AVX2_BLOCK(2) Q1_AVX2_BLOCK(3)
    #undef Q1_AVX2_BLOCK
        acc = _mm256_fmadd_ps(_mm256_set1_ps(d0), acc_block, acc);
    }
    {
        const __m128 h = _mm_add_ps(_mm256_extractf128_ps(acc, 0), _mm256_extractf128_ps(acc, 1));
        const __m128 q = _mm_add_ps(h, _mm_movehl_ps(h, h));
        *s = _mm_cvtss_f32(_mm_add_ss(q, _mm_movehdup_ps(q)));
    }
// impl5
/*
4B Model [ Prompt: 3.8 t/s | Generation: 3.0 t/s ]
system_info: n_threads = 2 (n_threads_batch = 2) / 4 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
kl_divergence: computing over 100 chunks, n_ctx=512, batch_size=2048, n_seq=4
kl_divergence: 232.33 seconds per pass - ETA 1 hours 36.80 minutes
chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   1      13.9553 ±    3.1774      -0.00022 ±    0.00221       0.00021 ±    0.00002     0.359 ±  0.043 %    98.431 ±  0.780 %
   2      20.1932 ±    3.4345       0.01401 ±    0.01147       0.00022 ±    0.00002     0.334 ±  0.030 %    99.020 ±  0.437 %
   3      20.8668 ±    2.7899       0.00985 ±    0.00767       0.00022 ±    0.00001     0.372 ±  0.028 %    98.824 ±  0.390 %
   4      21.2224 ±    2.3914       0.00745 ±    0.00577       0.00022 ±    0.00001     0.367 ±  0.023 %    98.824 ±  0.338 %
*/
#if defined(__AVX2__)
// STRICT SCALAR MATH REPRODUCTION
    const __m256i ones_8 = _mm256_set1_epi8(1);
    const __m256i ones_16 = _mm256_set1_epi16(1);
    const __m256i byte_shuf = _mm256_setr_epi8(0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3);
    const __m256i bit_masks = _mm256_setr_epi8(1,2,4,8,16,32,64,(char)-128,1,2,4,8,16,32,64,(char)-128,1,2,4,8,16,32,64,(char)-128,1,2,4,8,16,32,64,(char)-128);
    const __m256i zero = _mm256_setzero_si256();
    
    // We replace __m256 acc with a single scalar float!
    float final_acc = 0.0f; 

    for (int ib = 0; ib < nb; ++ib) {
        const float d0 = GGML_CPU_FP16_TO_FP32(x[ib].d);
        const uint32_t *qs32 = (const uint32_t *)x[ib].qs;
        const block_q8_0 *y_ptr = &y[ib * 4];
        
        float acc_block_scalar = 0.0f;

    #define Q1_AVX2_BLOCK(K) \
        { \
            const __m256i qy = _mm256_loadu_si256((const __m256i *)y_ptr[K].qs); \
            const __m256i sm = _mm256_cmpeq_epi8(_mm256_and_si256(_mm256_shuffle_epi8(_mm256_set1_epi32((int)qs32[K]), byte_shuf), bit_masks), zero); \
            const __m256i sy = _mm256_sub_epi8(_mm256_xor_si256(qy, sm), sm); \
            const __m256i s32 = _mm256_madd_epi16(_mm256_maddubs_epi16(ones_8, sy), ones_16); \
            \
            /* 1. Force horizontal integer sum of the 8 lanes down to 1 integer */ \
            __m128i sum128 = _mm_add_epi32(_mm256_castsi256_si128(s32), _mm256_extracti128_si256(s32, 1)); \
            sum128 = _mm_hadd_epi32(sum128, sum128); \
            sum128 = _mm_hadd_epi32(sum128, sum128); \
            int exact_dot_product = _mm_cvtsi128_si32(sum128); \
            \
            /* 2. Convert to float and multiply by scale exactly like C reference */ \
            acc_block_scalar += (float)exact_dot_product * GGML_CPU_FP16_TO_FP32(y_ptr[K].d); \
        }

        Q1_AVX2_BLOCK(0) Q1_AVX2_BLOCK(1) Q1_AVX2_BLOCK(2) Q1_AVX2_BLOCK(3)
    #undef Q1_AVX2_BLOCK

        // Accumulate into scalar
        final_acc += d0 * acc_block_scalar;
    }

    *s = final_acc;

@zcattacz

zcattacz commented Apr 3, 2026

The SSE path improves performance from 0.1 tps to 0.7~0.9 tps on an N2840 Atom.

#elif defined(__SSE4_2__) || defined(__SSSE3__)
    // Optimized SSE4.2/SSSE3 path for Q1_0_g128 · Q8_0
    // This uses 128-bit registers to process 16 elements at a time.
    const __m128i ones_8    = _mm_set1_epi8(1);
    const __m128i ones_16   = _mm_set1_epi16(1);
    
    // This shuffle mask spreads the 1st byte of a register to the first 8 slots,
    // and the 2nd byte to the next 8 slots.
    const __m128i byte_shuf = _mm_setr_epi8(
        0,0,0,0,0,0,0,0, 1,1,1,1,1,1,1,1);
        
    const __m128i bit_masks = _mm_setr_epi8(
        1,2,4,8,16,32,64,(char)-128,
        1,2,4,8,16,32,64,(char)-128);
    const __m128i zero = _mm_setzero_si128();

    __m128 acc = _mm_setzero_ps();

    for (int ib = 0; ib < nb; ++ib) {
        const float d0 = GGML_CPU_FP16_TO_FP32(x[ib].d);
        const uint32_t * qs32 = (const uint32_t *)x[ib].qs;
        const block_q8_0 * y_ptr = &y[ib * 4];

        __m128 acc_block = _mm_setzero_ps();

        for (int k = 0; k < 4; ++k) {
            const float dk = GGML_CPU_FP16_TO_FP32(y_ptr[k].d);
            const __m128 vdk = _mm_set1_ps(dk);
            
            // Load 32 bytes of Q8_0 weights into two 16-byte SSE registers
            const __m128i qy_l = _mm_loadu_si128((const __m128i *)(y_ptr[k].qs));
            const __m128i qy_h = _mm_loadu_si128((const __m128i *)(y_ptr[k].qs + 16));

            const uint32_t bits = qs32[k];

            // Process Lower 16 elements (using lower 16 bits of the mask)
            const __m128i mask_l = _mm_shuffle_epi8(_mm_set1_epi16((short)(bits & 0xFFFF)), byte_shuf);
            const __m128i sm_l   = _mm_cmpeq_epi8(_mm_and_si128(mask_l, bit_masks), zero);
            const __m128i sy_l   = _mm_sub_epi8(_mm_xor_si128(qy_l, sm_l), sm_l);
            const __m128i s32_l  = _mm_madd_epi16(_mm_maddubs_epi16(ones_8, sy_l), ones_16);

            // Process Upper 16 elements (using upper 16 bits of the mask)
            const __m128i mask_h = _mm_shuffle_epi8(_mm_set1_epi16((short)(bits >> 16)), byte_shuf);
            const __m128i sm_h   = _mm_cmpeq_epi8(_mm_and_si128(mask_h, bit_masks), zero);
            const __m128i sy_h   = _mm_sub_epi8(_mm_xor_si128(qy_h, sm_h), sm_h);
            const __m128i s32_h  = _mm_madd_epi16(_mm_maddubs_epi16(ones_8, sy_h), ones_16);

            // Convert integer sums to float, scale by dk, and add to block accumulator
            const __m128i s32_total = _mm_add_epi32(s32_l, s32_h);
            acc_block = _mm_add_ps(acc_block, _mm_mul_ps(vdk, _mm_cvtepi32_ps(s32_total)));
        }

        // Final scale by d0 (block scale) and add to global accumulator
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(d0), acc_block));
    }

    // Horizontal reduction of the 4 float lanes in the SSE register
    {
        acc = _mm_add_ps(acc, _mm_movehl_ps(acc, acc));
        acc = _mm_add_ss(acc, _mm_shuffle_ps(acc, acc, 1));
        *s = _mm_cvtss_f32(acc);
    }

AI suggested that meaningful acceleration for 1-bit nets on CPUs lacking AVX instructions could only be achieved by implementing the dot product as dot = 2 × popcount(A XNOR B) − total bits, using _mm_popcnt_u64 ...
@khosravipasha, I guess these devices need a different quantization format :-D ?

@wildcattrio
Author

KL Divergence Results — x86 AVX2 (PR #7)

Ran the KL divergence tests from PR #8 on the AVX2 kernel fix from this PR. Hardware: Intel i5-1135G7 (Tiger Lake), 32GB RAM, Windows 11. Build: 8194 (1179bfc82) with Clang 22.1.2.

System info: CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1

Test setup

  • F16 reference: dequantized from Bonsai-1.7B.gguf (Q1_0_g128) using llama-quantize --allow-requantize
  • Dataset: wikitext-2-raw, -c 512 --chunks 100
  • F16 perplexity: PPL = 24.04, 41.5 tok/s prompt processing
  • Q1_0_g128 perplexity: PPL = 24.09, 27.0 tok/s prompt processing

x86 AVX2 Divergences

Metric         Q1_0_g128 (1.13 BPW)
Same top p     99.220 ± 0.055 %
Mean KLD       0.000224 ± 0.000002
Maximum KLD    0.013303
99.9% KLD      0.002923
99.0% KLD      0.001273
Median KLD     0.000150
1.0% KLD       -0.000008
Minimum KLD    -0.000083
Mean Δp        -0.005 ± 0.002 %
Maximum Δp     4.779 %
99.9% Δp       2.198 %
99.0% Δp       1.122 %
95.0% Δp       0.511 %
Median Δp      -0.000 %
5.0% Δp        -0.534 %
1.0% Δp        -1.120 %
Minimum Δp     -3.440 %
RMS Δp         0.352 ± 0.005 %

Comparison with PR #8 reference (ARM NEON / generic scalar)

Metric        ARM NEON   x86 AVX2 (this PR)
Same top p    99.965 %   99.220 %
Mean KLD      0.000000   0.000224
Maximum KLD   0.000065   0.013303
RMS Δp        0.006 %    0.352 %

The AVX2 kernel shows measurably higher divergence compared to the NEON/scalar reference. The likely cause is floating-point operation ordering: our AVX2 path pre-multiplies d0 * d1 as a combined scale per sub-block, while the scalar reference accumulates d1 * sumi_block per sub-block then multiplies by d0 at the end. This difference in FP associativity accumulates across the 28 layers.

Note: @zcattacz's XOR+SUB approach posted above uses the two-level accumulation pattern (acc_block += dk * dot, then acc += d0 * acc_block) which more closely matches the scalar reference FP ordering — it may produce better KLD numbers. Worth testing.

Output quality is still good despite the divergence — text generation is coherent and the PPL difference is only 0.057 (24.09 vs 24.04).

@zcattacz

zcattacz commented Apr 5, 2026

Hi @wildcattrio, I updated the implementation and here is the combined result. You were right: xor+sub gives slightly better KLD with good tps. I also tried other implementations for tps; the best were on par, but this one is the simplest.

Metric                    Generic Scalar Fallback   PR7 on TigerLake      PR7 on Broadwell       xor+sub on Broadwell
Same top p                99.965 ± 0.012 %          99.220 ± 0.055 %      99.216 ± 0.055 %       99.220 ± 0.055 %
Mean KLD                  0.000000 ± 0.000000       0.000224 ± 0.000002   0.000223 ± 0.000002    0.000222 ± 0.000002
Maximum KLD               0.000065                  0.013303              0.012099               0.010006
99.9% KLD                 0.000051                  0.002923              0.003226               0.003014
99.0% KLD                 0.000036                  0.001273              0.001277               0.001232
Median KLD                0.000000                  0.000150              0.000151               0.000148
1.0% KLD                  -0.000036                 -0.000008             -0.000008              -0.000008
Minimum KLD               -0.000061                 -0.000083             -0.000124              -0.000114
Mean Δp                   -0.000 ± 0.000 %          -0.005 ± 0.002 %      -0.004 ± 0.002 %       -0.001 ± 0.002 %
Maximum Δp                0.120 %                   4.779 %               6.801 %                5.948 %
99.9% Δp                  0.039 %                   2.198 %               2.232 %                2.194 %
99.0% Δp                  0.017 %                   1.122 %               1.168 %                1.109 %
95.0% Δp                  0.007 %                   0.511 %               0.510 %                0.509 %
Median Δp                 0.000 %                   -0.000 %              -0.000 %               -0.000 %
5.0% Δp                   -0.008 %                  -0.534 %              -0.534 %               -0.517 %
1.0% Δp                   -0.017 %                  -1.120 %              -1.132 %               -1.096 %
0.1% Δp                   -0.039 %                  n/a                   -2.314 %               -2.139 %
Minimum Δp                -0.102 %                  -3.440 %              -3.624 %               -3.849 %
RMS Δp                    0.006 ± 0.000 %           0.352 ± 0.005 %       0.360 ± 0.006 %        0.349 ± 0.005 %
Mean PPL(Q)               n/a                       n/a                   24.089197 ± 0.527720   24.089508 ± 0.527763
Mean PPL(Base)            n/a                       n/a                   24.036712 ± 0.525238   24.036712 ± 0.525238
Mean PPL(Q) - PPL(Base)   n/a                       n/a                   0.052485 ± 0.012109    0.052796 ± 0.012114
prompt eval TPS           n/a                       n/a                   10.25                  17.68
