feat: port TQ3_0 KV cache from llama-turboquant#2
carlosfundora wants to merge 1 commit into PrismML-Eng:prism from
Conversation
TurboQuant 3-bit (3.5 bpw) KV cache compression:

- Per-block WHT rotation with 4-centroid MSE codebook
- QJL residual signs for error correction
- GPU kernels: vec_dot, MMVQ, convert, set-rows, cpy
- CPU: quantize/dequantize with WHT butterfly transform
- Flash attention auto-disabled for TQ3_0 K cache

Combined with PrismML's Q1_0 GPU inference, this enables 1-bit weights + 3-bit KV cache in a single build.
Thanks, this is pretty cool. How does it work? Is it good? Our main focus right now is getting our changes into llama.cpp, so we might not have time to look into the details yet, but we'd love to see the speed/quality output if you have tried it.
VRAM usage was reduced by roughly 35%.
Oh, how does it work with SGLang for 1-bit? Was it easy to add support there?
So-so; patience and preparation were key. I also crafted a few agents to run methodical research and debugging at a very detailed and slow pace during smoke tests. It ran well for chat in SGLang; I benchmarked it and moved on to implementing P-EAGLE, so I've been training heads for the models all day. I haven't yet tried it on any coding tasks, but I'm excited to see how they do. If you would be kind enough to release a 1-2 B or 0.3-0.6 B range 1-bit quant, that would be amazing and make it much easier for me to rapidly create PRs for advanced speculative decoding architectures. 🧠🤌
@carlosfundora Sounds exciting, yeah, good ideas. Let's chat more on the Discord server next week (I think you were there, right?)
Excellent! Will this work on Apple Silicon? If yes, I can report back with memory footprint improvements.
Yes, I commented on there today. Looking forward to it.
Pull request overview
This PR introduces a new GGML quantization type (GGML_TYPE_TQ3_0) intended for TurboQuant-style KV-cache compression, wiring it through GGML core traits, CUDA/HIP paths, and CLI/tooling so it can be selected as a KV cache type (tested on ROCm gfx1030 per description).
Changes:
- Add `GGML_TYPE_TQ3_0` type definition/traits and (de)quantization hooks in GGML core + CPU integration.
- Add CUDA kernels/support for writing (`SET_ROWS`) and using (MMVQ vecdot) TQ3_0 KV cache blocks.
- Expose `tq3_0` via CLI and bench tooling; disable flash attention when `type_k == TQ3_0`.
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| tools/llama-bench/llama-bench.cpp | Adds tq3_0 string → type mapping for benchmarking. |
| src/llama-context.cpp | Forces flash-attention off when using TQ3_0 K-cache. |
| ggml/src/ggml.c | Registers TQ3_0 type traits and enables chunk quantization dispatch. |
| ggml/src/ggml-quants.h | Adds TQ3_0 quantize/dequantize API declarations. |
| ggml/src/ggml-quants.c | Implements TQ3_0 reference quantize + dequantize + quantize wrapper and row validation. |
| ggml/src/ggml-cuda/vecdotq.cuh | Adds fused TQ3_0×Q8_1 vecdot for MMVQ. |
| ggml/src/ggml-cuda/set-rows.cu | Enables SET_ROWS into TQ3_0 buffers (KV updates). |
| ggml/src/ggml-cuda/mmvq.cu | Wires TQ3_0 into MMVQ type switches. |
| ggml/src/ggml-cuda/ggml-cuda.cu | Marks additional ops/types as CUDA-supported (incl. TQ3_0). |
| ggml/src/ggml-cuda/cpy-utils.cuh | Adds device quantization helper for TQ3_0 blocks. |
| ggml/src/ggml-cuda/convert.cu | Adds CUDA dequantization kernel for TQ3_0 → fp16/fp32. |
| ggml/src/ggml-cuda/common.cuh | Adds CUDA type-traits for TQ3_0 (qk/qr/qi). |
| ggml/src/ggml-cpu/quants.h | Declares CPU quantize entrypoint for TQ3_0. |
| ggml/src/ggml-cpu/quants.c | Implements CPU quantize wrapper calling reference quantizer. |
| ggml/src/ggml-cpu/ops.cpp | Allows TQ3_0 through quantized op switch cases. |
| ggml/src/ggml-cpu/ggml-cpu.cpp | Tightens CPU op support checks for MUL_MAT and FLASH_ATTN_EXT. |
| ggml/src/ggml-cpu/ggml-cpu.c | Registers TQ3_0 CPU type-traits (from_float). |
| ggml/src/ggml-common.h | Defines block_tq3_0 layout and constants. |
| ggml/include/ggml.h | Adds GGML_TYPE_TQ3_0 to the public enum/API. |
| common/arg.cpp | Exposes tq3_0 as an allowed KV cache type via CLI. |
```cpp
    // Scale: d_tq3 * d_q8 / 32 (two 1/sqrt(32) normalizations combined)
    const float d_q8 = __low2float(bq8_1[0].ds);
    return sumf * d * d_q8 * 0.03125f; // 0.03125 = 1/32
}
```
The CUDA vecdot applies an extra 1/32 factor even though K blocks are already stored in normalized WHT space (quantize path divides by sqrt(32)). Since the query-side WHT here does not apply the matching 1/sqrt(32) normalization, this scales the dot-product result down by ~5.66×. Align the normalization between CPU/GPU quantize (tq3_wht32_forward*) and this kernel (either apply 1/sqrt(32) to the query transform, or store K unnormalized and keep 1/32).
```cpp
// QJL correction constant: sqrt(pi/2) / block_size
static const float TQ3_QJL_SCALE = 0.03921875f; // sqrt(pi/2) / 32 ≈ 1.2533 / 32
```
TQ3_QJL_SCALE is defined but never used, and the encoded QJL residual-sign bits (qr) are not incorporated in either dequantization or dot-product paths. As written, this is effectively a 2-bit codebook + scale scheme with an unused extra 1-bit payload per value. Either implement the QJL correction term using qr/gamma, or remove the unused fields/constant and update the format/docs accordingly.
```cpp
void dequantize_row_tq3_0(const block_tq3_0 * GGML_RESTRICT x, float * GGML_RESTRICT y, int64_t k) {
    assert(k % QK_TQ3_0 == 0);
    const int64_t nb = k / QK_TQ3_0;

    for (int64_t i = 0; i < nb; ++i) {
        const float d = GGML_FP16_TO_FP32(x[i].gamma);

        // Dequantize to rotated space
        float rotated[QK_TQ3_0];
        for (int j = 0; j < QK_TQ3_0; j++) {
            const int idx = (x[i].qs[j / 4] >> (2 * (j % 4))) & 3;
            rotated[j] = d * tq3_centroids[idx];
        }

        // Apply inverse WHT to get back to original space
        tq3_wht32_inverse(rotated);
```
dequantize_row_tq3_0 reconstructs only the codebook centroids and ignores the stored residual-sign bits (qr) and any QJL correction described in the header comment. This makes quantize→dequantize inconsistent with the stated TQ3_0 scheme and wastes 1 bit/value in storage. Implement the QJL residual contribution (or drop qr and adjust the block layout).
```cpp
// Per TurboQuant paper (Algorithm 2: TurboQuant_prod), ICLR 2026
// Each block of 32 values is quantized as:
// - 2-bit MSE codebook indices (after random rotation Π·x)
// - 1-bit QJL residual signs (sign(S·r) where r = x - dequant_mse(quant_mse(x)))
// - FP16 residual norm ||r||₂ for QJL scaling
// Requires per-model rotation matrices Π and S (stored externally)
#define QK_TQ3_0 32
typedef struct {
    uint8_t qs[QK_TQ3_0 / 4]; // 2-bit codebook indices, 32 × 2 bits = 8 bytes
    uint8_t qr[QK_TQ3_0 / 8]; // QJL residual signs, 32 × 1 bit = 4 bytes
    ggml_half gamma;          // ||residual||₂ for QJL correction scaling
```
The block_tq3_0 documentation says gamma stores the residual norm and that external rotation matrices are required, but the implementation uses a fixed WHT+sign preconditioner and stores the per-block scale d in gamma (see quantize_row_tq3_0_ref). Update the comment/field description (or update the implementation) so the on-wire/layout semantics match the code.
Suggested change:

```diff
-// Per TurboQuant paper (Algorithm 2: TurboQuant_prod), ICLR 2026
-// Each block of 32 values is quantized as:
-// - 2-bit MSE codebook indices (after random rotation Π·x)
-// - 1-bit QJL residual signs (sign(S·r) where r = x - dequant_mse(quant_mse(x)))
-// - FP16 residual norm ||r||₂ for QJL scaling
-// Requires per-model rotation matrices Π and S (stored externally)
+// Implementation note: the on-wire format used here follows the current
+// ggml implementation, which applies a fixed WHT+sign preconditioner rather
+// than storing or requiring external rotation matrices.
+// Each block of 32 values stores:
+// - 2-bit quantized indices
+// - 1-bit residual/sign bits
+// - FP16 per-block scale d
 #define QK_TQ3_0 32
 typedef struct {
     uint8_t qs[QK_TQ3_0 / 4]; // 2-bit codebook indices, 32 × 2 bits = 8 bytes
-    uint8_t qr[QK_TQ3_0 / 8]; // QJL residual signs, 32 × 1 bit = 4 bytes
-    ggml_half gamma;          // ||residual||₂ for QJL correction scaling
+    uint8_t qr[QK_TQ3_0 / 8]; // residual/sign bits, 32 × 1 bit = 4 bytes
+    ggml_half gamma;          // per-block FP16 scale d
```
```cpp
    GGML_TYPE_Q1_0      = 40,
    GGML_TYPE_Q1_0_g128 = 41,
    GGML_TYPE_COUNT     = 42,
    GGML_TYPE_TQ3_0     = 42, // TurboQuant 3-bit polar + QJL (no per-block scale)
```
The enum comment says TQ3_0 has “no per-block scale”, but the block layout includes gamma and both CPU/GPU paths use it as a per-block scale. Please correct the type comment to avoid misleading API/CLI consumers.
Suggested change:

```diff
-    GGML_TYPE_TQ3_0 = 42, // TurboQuant 3-bit polar + QJL (no per-block scale)
+    GGML_TYPE_TQ3_0 = 42, // TurboQuant 3-bit polar + QJL (with per-block scale)
```
```cpp
        .nrows = 1,
    },
    [GGML_TYPE_TQ3_0] = {
        .from_float = quantize_row_tq3_0,
```
GGML_TYPE_TQ3_0 is added with from_float/to_float support but no CPU vec_dot/vec_dot_type. This will crash tests like tests/test-quantize-fns.cpp (it unconditionally calls qfns_cpu->vec_dot when from_float/to_float exist) and also makes CPU attention paths unable to use TQ3_0 KV. Add a CPU vec_dot implementation (and vec_dot_type) for TQ3_0, or ensure CPU code/tests skip dot-product checks for types without vec_dot and prevent selecting TQ3_0 on CPU-only backends.
```cpp
    GGML_TYPE_TQ3_0,
};

static ggml_type kv_cache_type_from_str(const std::string & s) {
```
TQ3_0 is advertised as an allowed KV cache type in the CLI, but the CPU backend has no vec_dot support for GGML_TYPE_TQ3_0 (see ggml/src/ggml-cpu/ggml-cpu.c), so CPU-only runs selecting this type will fail during graph planning/execution. Either add CPU support, or gate this option based on backend capabilities / emit a clear error earlier.
Suggested change:

```diff
-    GGML_TYPE_TQ3_0,
 };

 static ggml_type kv_cache_type_from_str(const std::string & s) {
+    if (s == "TQ3_0") {
+        throw std::runtime_error(
+            "Unsupported cache type: " + s +
+            " (not supported by the CPU backend for KV cache use)"
+        );
+    }
```
```cpp
// TurboQuant TQ3_0: 2-bit codebook dequantization + inverse WHT
// Dequantize to rotated space, then apply inverse WHT32 cooperatively
template<typename dst_t>
static __global__ void dequantize_block_tq3_0(const void * __restrict__ vx, dst_t * __restrict__ yy) {
    const float centroids[4] = { -1.510f, -0.4528f, 0.4528f, 1.510f };
    const int8_t signs[32] = {
        +1, -1, +1, +1, -1, -1, +1, -1, +1, +1, -1, +1, -1, +1, -1, -1,
        +1, -1, -1, +1, +1, -1, +1, -1, -1, +1, +1, +1, -1, -1, +1, -1
    };

    const int64_t i = blockIdx.x;
    const block_tq3_0 * x = (const block_tq3_0 *)vx;
    const int tid = threadIdx.x;
    if (tid >= 32) return;

    const float d = __half2float(x[i].gamma);

    // Step 1: Each thread dequantizes its value (in rotated space)
    const int byte_idx = tid / 4;
    const int bit_shift = 2 * (tid % 4);
    const int idx = (x[i].qs[byte_idx] >> bit_shift) & 3;

    __shared__ float shmem[32];
    shmem[tid] = d * centroids[idx];
    __syncthreads();

    // Step 2: Cooperative inverse WHT (5 butterfly stages)
    for (int step = 1; step < 32; step <<= 1) {
        int partner = tid ^ step; // butterfly partner
        float a = shmem[tid];
        float b = shmem[partner];
        __syncthreads();
        if (tid < partner) {
            shmem[tid]     = a + b;
            shmem[partner] = a - b;
        }
        __syncthreads();
    }

    // Step 3: Normalize and undo sign flips
    const float inv_sqrt32 = 0.17677669529663688f;
    yy[i * QK_TQ3_0 + tid] = shmem[tid] * inv_sqrt32 * signs[tid];
}
```
The GPU dequantization kernel reconstructs only d*centroid[idx] and ignores the stored QJL residual-sign bits (qr) and any QJL correction scaling. If qr/gamma are part of the format, dequantization should incorporate them; otherwise consider removing qr from the block layout to avoid wasting bandwidth/storage.
```cpp
    for (int j = 0; j < QK_TQ3_0; j++) rotated[j] = x[j];
    tq3_wht32_forward_device(rotated);

    memset(y, 0, sizeof(block_tq3_0));
```
quantize_f32_tq3_0_block uses memset() inside device code. This is the only device-side memset in this folder and can be problematic for some toolchains (e.g., HIP/device-lib availability) and may add overhead. Consider explicitly zeroing qs/qr/gamma with simple loops/assignments instead.
Suggested change:

```diff
-    memset(y, 0, sizeof(block_tq3_0));
+    y->gamma = __float2half(0.0f);
+    for (int j = 0; j < (int)(sizeof(y->qs) / sizeof(y->qs[0])); ++j) {
+        y->qs[j] = 0;
+    }
+    for (int j = 0; j < (int)(sizeof(y->qr) / sizeof(y->qr[0])); ++j) {
+        y->qr[j] = 0;
+    }
```
```cpp
    const float centroids[4] = { -1.510f, -0.4528f, 0.4528f, 1.510f };
    const int8_t signs[32] = {
        +1, -1, +1, +1, -1, -1, +1, -1, +1, +1, -1, +1, -1, +1, -1, -1,
        +1, -1, -1, +1, +1, -1, +1, -1, -1, +1, +1, +1, -1, -1, +1, -1
    };
```
The centroids and sign pattern are duplicated as local arrays across CPU quantize and multiple CUDA/HIP kernels. This duplication risks subtle mismatches and can increase register/local-memory pressure. Consider defining them once (e.g., in a shared header / constant memory) and reusing across quantize/dequant/vecdot paths.

TurboQuant 3-bit (3.5 bpw) KV cache compression combined with PrismML's Q1_0 GPU inference. Ported the TQ3_0 implementation from llama-turboquant, tested on ROCm gfx1030.