feat: port TQ3_0 KV cache from llama-turboquant#2
carlosfundora wants to merge 1 commit into PrismML-Eng:prism from
Conversation
TurboQuant 3-bit (3.5 bpw) KV cache compression:

- Per-block WHT rotation with 4-centroid MSE codebook
- QJL residual signs for error correction
- GPU kernels: vec_dot, MMVQ, convert, set-rows, cpy
- CPU: quantize/dequantize with WHT butterfly transform
- Flash attention auto-disabled for TQ3_0 K cache

Combined with PrismML's Q1_0 GPU inference, this enables 1-bit weights + 3-bit KV cache in a single build.
Thanks, this is pretty cool. How does it work? Is it good? Our main focus right now is getting our changes into llama.cpp, so we might not have time to look into the details yet, but we'd love to see the speed/quality output if you have tried it.
VRAM usage was reduced by roughly 35%.
Oh, how does it work with SGLang for 1-bit? Was it easy to add support there?
So-so; patience and preparation were key. I also crafted a few agents to run methodical research and debugging at a very detailed and slow pace during smoke tests. It ran well for chat in SGLang; I benchmarked it and moved on to implementing P-EAGLE, so I've been training heads for the models all day. I haven't yet tried it on any coding tasks, but I'm excited to see how they do. If you would be kind enough to release a 1-2 B or 0.3-0.6 B range 1-bit quant, that would be amazing and make it much easier for me to rapidly create PRs for advanced speculative decoding architectures. 🧠🤌
@carlosfundora Sounds exciting, yeah, good ideas. Let's chat more on the Discord server next week (I think you were there, right?)
Excellent! Will this work on Apple Silicon? If yes, I can report back with memory footprint improvements.
Yes, I commented on there today. Looking forward to it.
Pull request overview
This PR introduces a new GGML quantization type (GGML_TYPE_TQ3_0) intended for TurboQuant-style KV-cache compression, wiring it through GGML core traits, CUDA/HIP paths, and CLI/tooling so it can be selected as a KV cache type (tested on ROCm gfx1030 per description).
Changes:
- Add `GGML_TYPE_TQ3_0` type definition/traits and (de)quantization hooks in GGML core + CPU integration.
- Add CUDA kernels/support for writing (`SET_ROWS`) and using (MMVQ vecdot) TQ3_0 KV cache blocks.
- Expose `tq3_0` via CLI and bench tooling; disable flash attention when `type_k == TQ3_0`.
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| tools/llama-bench/llama-bench.cpp | Adds tq3_0 string → type mapping for benchmarking. |
| src/llama-context.cpp | Forces flash-attention off when using TQ3_0 K-cache. |
| ggml/src/ggml.c | Registers TQ3_0 type traits and enables chunk quantization dispatch. |
| ggml/src/ggml-quants.h | Adds TQ3_0 quantize/dequantize API declarations. |
| ggml/src/ggml-quants.c | Implements TQ3_0 reference quantize + dequantize + quantize wrapper and row validation. |
| ggml/src/ggml-cuda/vecdotq.cuh | Adds fused TQ3_0×Q8_1 vecdot for MMVQ. |
| ggml/src/ggml-cuda/set-rows.cu | Enables SET_ROWS into TQ3_0 buffers (KV updates). |
| ggml/src/ggml-cuda/mmvq.cu | Wires TQ3_0 into MMVQ type switches. |
| ggml/src/ggml-cuda/ggml-cuda.cu | Marks additional ops/types as CUDA-supported (incl. TQ3_0). |
| ggml/src/ggml-cuda/cpy-utils.cuh | Adds device quantization helper for TQ3_0 blocks. |
| ggml/src/ggml-cuda/convert.cu | Adds CUDA dequantization kernel for TQ3_0 → fp16/fp32. |
| ggml/src/ggml-cuda/common.cuh | Adds CUDA type-traits for TQ3_0 (qk/qr/qi). |
| ggml/src/ggml-cpu/quants.h | Declares CPU quantize entrypoint for TQ3_0. |
| ggml/src/ggml-cpu/quants.c | Implements CPU quantize wrapper calling reference quantizer. |
| ggml/src/ggml-cpu/ops.cpp | Allows TQ3_0 through quantized op switch cases. |
| ggml/src/ggml-cpu/ggml-cpu.cpp | Tightens CPU op support checks for MUL_MAT and FLASH_ATTN_EXT. |
| ggml/src/ggml-cpu/ggml-cpu.c | Registers TQ3_0 CPU type-traits (from_float). |
| ggml/src/ggml-common.h | Defines block_tq3_0 layout and constants. |
| ggml/include/ggml.h | Adds GGML_TYPE_TQ3_0 to the public enum/API. |
| common/arg.cpp | Exposes tq3_0 as an allowed KV cache type via CLI. |
```cpp
    // Scale: d_tq3 * d_q8 / 32 (two 1/sqrt(32) normalizations combined)
    const float d_q8 = __low2float(bq8_1[0].ds);
    return sumf * d * d_q8 * 0.03125f; // 0.03125 = 1/32
}
```
The CUDA vecdot applies an extra 1/32 factor even though K blocks are already stored in normalized WHT space (quantize path divides by sqrt(32)). Since the query-side WHT here does not apply the matching 1/sqrt(32) normalization, this scales the dot-product result down by ~5.66×. Align the normalization between CPU/GPU quantize (tq3_wht32_forward*) and this kernel (either apply 1/sqrt(32) to the query transform, or store K unnormalized and keep 1/32).
```cpp
// QJL correction constant: sqrt(pi/2) / block_size
static const float TQ3_QJL_SCALE = 0.03921875f; // sqrt(pi/2) / 32 ≈ 1.2533 / 32
```
TQ3_QJL_SCALE is defined but never used, and the encoded QJL residual-sign bits (qr) are not incorporated in either dequantization or dot-product paths. As written, this is effectively a 2-bit codebook + scale scheme with an unused extra 1-bit payload per value. Either implement the QJL correction term using qr/gamma, or remove the unused fields/constant and update the format/docs accordingly.
```cpp
void dequantize_row_tq3_0(const block_tq3_0 * GGML_RESTRICT x, float * GGML_RESTRICT y, int64_t k) {
    assert(k % QK_TQ3_0 == 0);
    const int64_t nb = k / QK_TQ3_0;

    for (int64_t i = 0; i < nb; ++i) {
        const float d = GGML_FP16_TO_FP32(x[i].gamma);

        // Dequantize to rotated space
        float rotated[QK_TQ3_0];
        for (int j = 0; j < QK_TQ3_0; j++) {
            const int idx = (x[i].qs[j / 4] >> (2 * (j % 4))) & 3;
            rotated[j] = d * tq3_centroids[idx];
        }

        // Apply inverse WHT to get back to original space
        tq3_wht32_inverse(rotated);
```
dequantize_row_tq3_0 reconstructs only the codebook centroids and ignores the stored residual-sign bits (qr) and any QJL correction described in the header comment. This makes quantize→dequantize inconsistent with the stated TQ3_0 scheme and wastes 1 bit/value in storage. Implement the QJL residual contribution (or drop qr and adjust the block layout).
```cpp
// Per TurboQuant paper (Algorithm 2: TurboQuant_prod), ICLR 2026
// Each block of 32 values is quantized as:
// - 2-bit MSE codebook indices (after random rotation Π·x)
// - 1-bit QJL residual signs (sign(S·r) where r = x - dequant_mse(quant_mse(x)))
// - FP16 residual norm ||r||₂ for QJL scaling
// Requires per-model rotation matrices Π and S (stored externally)
#define QK_TQ3_0 32
typedef struct {
    uint8_t qs[QK_TQ3_0 / 4]; // 2-bit codebook indices, 32 × 2 bits = 8 bytes
    uint8_t qr[QK_TQ3_0 / 8]; // QJL residual signs, 32 × 1 bit = 4 bytes
    ggml_half gamma;          // ||residual||₂ for QJL correction scaling
```
The block_tq3_0 documentation says gamma stores the residual norm and that external rotation matrices are required, but the implementation uses a fixed WHT+sign preconditioner and stores the per-block scale d in gamma (see quantize_row_tq3_0_ref). Update the comment/field description (or update the implementation) so the on-wire/layout semantics match the code.
Suggested change:

```diff
-// Per TurboQuant paper (Algorithm 2: TurboQuant_prod), ICLR 2026
-// Each block of 32 values is quantized as:
-// - 2-bit MSE codebook indices (after random rotation Π·x)
-// - 1-bit QJL residual signs (sign(S·r) where r = x - dequant_mse(quant_mse(x)))
-// - FP16 residual norm ||r||₂ for QJL scaling
-// Requires per-model rotation matrices Π and S (stored externally)
+// Implementation note: the on-wire format used here follows the current
+// ggml implementation, which applies a fixed WHT+sign preconditioner rather
+// than storing or requiring external rotation matrices.
+// Each block of 32 values stores:
+// - 2-bit quantized indices
+// - 1-bit residual/sign bits
+// - FP16 per-block scale d
 #define QK_TQ3_0 32
 typedef struct {
     uint8_t qs[QK_TQ3_0 / 4]; // 2-bit codebook indices, 32 × 2 bits = 8 bytes
-    uint8_t qr[QK_TQ3_0 / 8]; // QJL residual signs, 32 × 1 bit = 4 bytes
-    ggml_half gamma;          // ||residual||₂ for QJL correction scaling
+    uint8_t qr[QK_TQ3_0 / 8]; // residual/sign bits, 32 × 1 bit = 4 bytes
+    ggml_half gamma;          // per-block FP16 scale d
```
```cpp
    GGML_TYPE_Q1_0      = 40,
    GGML_TYPE_Q1_0_g128 = 41,
    GGML_TYPE_COUNT     = 42,
    GGML_TYPE_TQ3_0     = 42, // TurboQuant 3-bit polar + QJL (no per-block scale)
```
The enum comment says TQ3_0 has “no per-block scale”, but the block layout includes gamma and both CPU/GPU paths use it as a per-block scale. Please correct the type comment to avoid misleading API/CLI consumers.
Suggested change:

```diff
-    GGML_TYPE_TQ3_0 = 42, // TurboQuant 3-bit polar + QJL (no per-block scale)
+    GGML_TYPE_TQ3_0 = 42, // TurboQuant 3-bit polar + QJL (with per-block scale)
```
```cpp
        .nrows = 1,
    },
    [GGML_TYPE_TQ3_0] = {
        .from_float = quantize_row_tq3_0,
```
GGML_TYPE_TQ3_0 is added with from_float/to_float support but no CPU vec_dot/vec_dot_type. This will crash tests like tests/test-quantize-fns.cpp (it unconditionally calls qfns_cpu->vec_dot when from_float/to_float exist) and also makes CPU attention paths unable to use TQ3_0 KV. Add a CPU vec_dot implementation (and vec_dot_type) for TQ3_0, or ensure CPU code/tests skip dot-product checks for types without vec_dot and prevent selecting TQ3_0 on CPU-only backends.
```cpp
    GGML_TYPE_TQ3_0,
};

static ggml_type kv_cache_type_from_str(const std::string & s) {
```
TQ3_0 is advertised as an allowed KV cache type in the CLI, but the CPU backend has no vec_dot support for GGML_TYPE_TQ3_0 (see ggml/src/ggml-cpu/ggml-cpu.c), so CPU-only runs selecting this type will fail during graph planning/execution. Either add CPU support, or gate this option based on backend capabilities / emit a clear error earlier.
Suggested change:

```diff
-    GGML_TYPE_TQ3_0,
 };

 static ggml_type kv_cache_type_from_str(const std::string & s) {
+    if (s == "TQ3_0") {
+        throw std::runtime_error(
+            "Unsupported cache type: " + s +
+            " (not supported by the CPU backend for KV cache use)"
+        );
+    }
```
```cpp
// TurboQuant TQ3_0: 2-bit codebook dequantization + inverse WHT
// Dequantize to rotated space, then apply inverse WHT32 cooperatively
template<typename dst_t>
static __global__ void dequantize_block_tq3_0(const void * __restrict__ vx, dst_t * __restrict__ yy) {
    const float centroids[4] = { -1.510f, -0.4528f, 0.4528f, 1.510f };
    const int8_t signs[32] = {
        +1, -1, +1, +1, -1, -1, +1, -1, +1, +1, -1, +1, -1, +1, -1, -1,
        +1, -1, -1, +1, +1, -1, +1, -1, -1, +1, +1, +1, -1, -1, +1, -1
    };

    const int64_t i = blockIdx.x;
    const block_tq3_0 * x = (const block_tq3_0 *)vx;
    const int tid = threadIdx.x;
    if (tid >= 32) return;

    const float d = __half2float(x[i].gamma);

    // Step 1: Each thread dequantizes its value (in rotated space)
    const int byte_idx = tid / 4;
    const int bit_shift = 2 * (tid % 4);
    const int idx = (x[i].qs[byte_idx] >> bit_shift) & 3;

    __shared__ float shmem[32];
    shmem[tid] = d * centroids[idx];
    __syncthreads();

    // Step 2: Cooperative inverse WHT (5 butterfly stages)
    for (int step = 1; step < 32; step <<= 1) {
        int partner = tid ^ step; // butterfly partner
        float a = shmem[tid];
        float b = shmem[partner];
        __syncthreads();
        if (tid < partner) {
            shmem[tid]     = a + b;
            shmem[partner] = a - b;
        }
        __syncthreads();
    }

    // Step 3: Normalize and undo sign flips
    const float inv_sqrt32 = 0.17677669529663688f;
    yy[i * QK_TQ3_0 + tid] = shmem[tid] * inv_sqrt32 * signs[tid];
}
```
The GPU dequantization kernel reconstructs only d*centroid[idx] and ignores the stored QJL residual-sign bits (qr) and any QJL correction scaling. If qr/gamma are part of the format, dequantization should incorporate them; otherwise consider removing qr from the block layout to avoid wasting bandwidth/storage.
```cpp
    for (int j = 0; j < QK_TQ3_0; j++) rotated[j] = x[j];
    tq3_wht32_forward_device(rotated);

    memset(y, 0, sizeof(block_tq3_0));
```
quantize_f32_tq3_0_block uses memset() inside device code. This is the only device-side memset in this folder and can be problematic for some toolchains (e.g., HIP/device-lib availability) and may add overhead. Consider explicitly zeroing qs/qr/gamma with simple loops/assignments instead.
Suggested change:

```diff
-    memset(y, 0, sizeof(block_tq3_0));
+    y->gamma = __float2half(0.0f);
+    for (int j = 0; j < (int)(sizeof(y->qs) / sizeof(y->qs[0])); ++j) {
+        y->qs[j] = 0;
+    }
+    for (int j = 0; j < (int)(sizeof(y->qr) / sizeof(y->qr[0])); ++j) {
+        y->qr[j] = 0;
+    }
```
```cpp
    const float centroids[4] = { -1.510f, -0.4528f, 0.4528f, 1.510f };
    const int8_t signs[32] = {
        +1, -1, +1, +1, -1, -1, +1, -1, +1, +1, -1, +1, -1, +1, -1, -1,
        +1, -1, -1, +1, +1, -1, +1, -1, -1, +1, +1, +1, -1, -1, +1, -1
    };
```
The centroids and sign pattern are duplicated as local arrays across CPU quantize and multiple CUDA/HIP kernels. This duplication risks subtle mismatches and can increase register/local-memory pressure. Consider defining them once (e.g., in a shared header / constant memory) and reusing across quantize/dequant/vecdot paths.

TurboQuant 3-bit (3.5 bpw) KV cache compression combined with PrismML's Q1_0 GPU inference. Ported the TQ3_0 implementation from llama-turboquant, tested on ROCm gfx1030.