
feat: port TQ3_0 KV cache from llama-turboquant#2

Open
carlosfundora wants to merge 1 commit into PrismML-Eng:prism from carlosfundora:feature/tq3_0-kv-cache

Conversation

@carlosfundora

TurboQuant 3-bit (3.5 bpw) KV cache compression combined with PrismML's Q1_0 GPU inference. Ported the TQ3_0 implementation from llama-turboquant, tested on ROCm gfx1030.

TurboQuant 3-bit (3.5 bpw) KV cache compression:
- Per-block WHT rotation with 4-centroid MSE codebook
- QJL residual signs for error correction
- GPU kernels: vec_dot, MMVQ, convert, set-rows, cpy
- CPU: quantize/dequantize with WHT butterfly transform
- Flash attention auto-disabled for TQ3_0 K cache

Combined with PrismML's Q1_0 GPU inference, this enables
1-bit weights + 3-bit KV cache on a single build.
@khosravipasha
Collaborator

Thanks, this is pretty cool. How does it work? Is it good?

Our main focus right now is getting our changes into llama.cpp, so we might not have time to look into the details yet, but we'd love to see the speed/quality output if you have tried it.
What is the VRAM usage with long context after this change?

@carlosfundora
Author

(attached image: b2a77edf-03cb-4a58-b569-9a148a6ee24b.jpg)

It works great. I have SGLang nearly wired up for 1-bit support and TurboQuant as well.

@carlosfundora
Author

VRAM usage was reduced by roughly 35%.

@khosravipasha
Collaborator

Oh, how does it work with SGLang for 1-bit? Was it easy to add support there?

@carlosfundora
Author

carlosfundora commented Apr 4, 2026

So-so; patience and preparation were key. I also crafted a few agents to run methodical research and debugging at a very detailed and slow scale during smoke tests. It ran well for chat in SGLang; I benchmarked it and moved on to implementing P-EAGLE, so I've been training heads for the models all day. I haven't tried it on any coding tasks yet, but I'm excited to see how they do.

If you guys would be kind enough to release a 1/2 B or .3-.6 range 1-bit quant, that would be amazing and would make it much easier for me to rapidly create PRs for advanced speculative decoding architectures. 🧠🤌

@khosravipasha
Collaborator

@carlosfundora Sounds exciting, yeah, good ideas. Let's chat more on the Discord server next week (I think you were there, right?)

@rosmur

rosmur commented Apr 5, 2026

Excellent! Will this work on Apple Silicon? If yes, I can report back with the memory footprint improvement.

@carlosfundora
Author

> @carlosfundora Sounds exciting, yeah good ideas, lets chat more on the discord-server next week (I think you were there right?)

Yes, I commented there today. Looking forward to it.


Copilot AI left a comment


Pull request overview

This PR introduces a new GGML quantization type (GGML_TYPE_TQ3_0) intended for TurboQuant-style KV-cache compression, wiring it through GGML core traits, CUDA/HIP paths, and CLI/tooling so it can be selected as a KV cache type (tested on ROCm gfx1030 per description).

Changes:

  • Add GGML_TYPE_TQ3_0 type definition/traits and (de)quantization hooks in GGML core + CPU integration.
  • Add CUDA kernels/support for writing (SET_ROWS) and using (MMVQ vecdot) TQ3_0 KV cache blocks.
  • Expose tq3_0 via CLI and bench tooling; disable flash-attention when type_k == TQ3_0.

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 10 comments.

File Description
tools/llama-bench/llama-bench.cpp Adds tq3_0 string → type mapping for benchmarking.
src/llama-context.cpp Forces flash-attention off when using TQ3_0 K-cache.
ggml/src/ggml.c Registers TQ3_0 type traits and enables chunk quantization dispatch.
ggml/src/ggml-quants.h Adds TQ3_0 quantize/dequantize API declarations.
ggml/src/ggml-quants.c Implements TQ3_0 reference quantize + dequantize + quantize wrapper and row validation.
ggml/src/ggml-cuda/vecdotq.cuh Adds fused TQ3_0×Q8_1 vecdot for MMVQ.
ggml/src/ggml-cuda/set-rows.cu Enables SET_ROWS into TQ3_0 buffers (KV updates).
ggml/src/ggml-cuda/mmvq.cu Wires TQ3_0 into MMVQ type switches.
ggml/src/ggml-cuda/ggml-cuda.cu Marks additional ops/types as CUDA-supported (incl. TQ3_0).
ggml/src/ggml-cuda/cpy-utils.cuh Adds device quantization helper for TQ3_0 blocks.
ggml/src/ggml-cuda/convert.cu Adds CUDA dequantization kernel for TQ3_0 → fp16/fp32.
ggml/src/ggml-cuda/common.cuh Adds CUDA type-traits for TQ3_0 (qk/qr/qi).
ggml/src/ggml-cpu/quants.h Declares CPU quantize entrypoint for TQ3_0.
ggml/src/ggml-cpu/quants.c Implements CPU quantize wrapper calling reference quantizer.
ggml/src/ggml-cpu/ops.cpp Allows TQ3_0 through quantized op switch cases.
ggml/src/ggml-cpu/ggml-cpu.cpp Tightens CPU op support checks for MUL_MAT and FLASH_ATTN_EXT.
ggml/src/ggml-cpu/ggml-cpu.c Registers TQ3_0 CPU type-traits (from_float).
ggml/src/ggml-common.h Defines block_tq3_0 layout and constants.
ggml/include/ggml.h Adds GGML_TYPE_TQ3_0 to the public enum/API.
common/arg.cpp Exposes tq3_0 as an allowed KV cache type via CLI.


Comment on lines +1407 to +1410
    // Scale: d_tq3 * d_q8 / 32 (two 1/sqrt(32) normalizations combined)
    const float d_q8 = __low2float(bq8_1[0].ds);
    return sumf * d * d_q8 * 0.03125f; // 0.03125 = 1/32
}

Copilot AI Apr 6, 2026


The CUDA vecdot applies an extra 1/32 factor even though K blocks are already stored in normalized WHT space (quantize path divides by sqrt(32)). Since the query-side WHT here does not apply the matching 1/sqrt(32) normalization, this scales the dot-product result down by ~5.66×. Align the normalization between CPU/GPU quantize (tq3_wht32_forward*) and this kernel (either apply 1/sqrt(32) to the query transform, or store K unnormalized and keep 1/32).

Comment on lines +2432 to +2434
// QJL correction constant: sqrt(pi/2) / block_size
static const float TQ3_QJL_SCALE = 0.03921875f; // sqrt(pi/2) / 32 ≈ 1.2533 / 32


Copilot AI Apr 6, 2026


TQ3_QJL_SCALE is defined but never used, and the encoded QJL residual-sign bits (qr) are not incorporated in either dequantization or dot-product paths. As written, this is effectively a 2-bit codebook + scale scheme with an unused extra 1-bit payload per value. Either implement the QJL correction term using qr/gamma, or remove the unused fields/constant and update the format/docs accordingly.

Comment on lines +2545 to +2561
void dequantize_row_tq3_0(const block_tq3_0 * GGML_RESTRICT x, float * GGML_RESTRICT y, int64_t k) {
    assert(k % QK_TQ3_0 == 0);
    const int64_t nb = k / QK_TQ3_0;

    for (int64_t i = 0; i < nb; ++i) {
        const float d = GGML_FP16_TO_FP32(x[i].gamma);

        // Dequantize to rotated space
        float rotated[QK_TQ3_0];
        for (int j = 0; j < QK_TQ3_0; j++) {
            const int idx = (x[i].qs[j / 4] >> (2 * (j % 4))) & 3;
            rotated[j] = d * tq3_centroids[idx];
        }

        // Apply inverse WHT to get back to original space
        tq3_wht32_inverse(rotated);

Copilot AI Apr 6, 2026


dequantize_row_tq3_0 reconstructs only the codebook centroids and ignores the stored residual-sign bits (qr) and any QJL correction described in the header comment. This makes quantize→dequantize inconsistent with the stated TQ3_0 scheme and wastes 1 bit/value in storage. Implement the QJL residual contribution (or drop qr and adjust the block layout).

Comment on lines +280 to +290
// Per TurboQuant paper (Algorithm 2: TurboQuant_prod), ICLR 2026
// Each block of 32 values is quantized as:
// - 2-bit MSE codebook indices (after random rotation Π·x)
// - 1-bit QJL residual signs (sign(S·r) where r = x - dequant_mse(quant_mse(x)))
// - FP16 residual norm ||r||₂ for QJL scaling
// Requires per-model rotation matrices Π and S (stored externally)
#define QK_TQ3_0 32
typedef struct {
    uint8_t qs[QK_TQ3_0 / 4]; // 2-bit codebook indices, 32 × 2 bits = 8 bytes
    uint8_t qr[QK_TQ3_0 / 8]; // QJL residual signs, 32 × 1 bit = 4 bytes
    ggml_half gamma;          // ||residual||₂ for QJL correction scaling

Copilot AI Apr 6, 2026


The block_tq3_0 documentation says gamma stores the residual norm and that external rotation matrices are required, but the implementation uses a fixed WHT+sign preconditioner and stores the per-block scale d in gamma (see quantize_row_tq3_0_ref). Update the comment/field description (or update the implementation) so the on-wire/layout semantics match the code.

Suggested change
// Per TurboQuant paper (Algorithm 2: TurboQuant_prod), ICLR 2026
// Each block of 32 values is quantized as:
// - 2-bit MSE codebook indices (after random rotation Π·x)
// - 1-bit QJL residual signs (sign(S·r) where r = x - dequant_mse(quant_mse(x)))
// - FP16 residual norm ||r||₂ for QJL scaling
// Requires per-model rotation matrices Π and S (stored externally)
#define QK_TQ3_0 32
typedef struct {
    uint8_t qs[QK_TQ3_0 / 4]; // 2-bit codebook indices, 32 × 2 bits = 8 bytes
    uint8_t qr[QK_TQ3_0 / 8]; // QJL residual signs, 32 × 1 bit = 4 bytes
    ggml_half gamma;          // ||residual||₂ for QJL correction scaling
// Implementation note: the on-wire format used here follows the current
// ggml implementation, which applies a fixed WHT+sign preconditioner rather
// than storing or requiring external rotation matrices.
// Each block of 32 values stores:
// - 2-bit quantized indices
// - 1-bit residual/sign bits
// - FP16 per-block scale d
#define QK_TQ3_0 32
typedef struct {
    uint8_t qs[QK_TQ3_0 / 4]; // 2-bit codebook indices, 32 × 2 bits = 8 bytes
    uint8_t qr[QK_TQ3_0 / 8]; // residual/sign bits, 32 × 1 bit = 4 bytes
    ggml_half gamma;          // per-block FP16 scale d

GGML_TYPE_Q1_0 = 40,
GGML_TYPE_Q1_0_g128 = 41,
GGML_TYPE_COUNT = 42,
GGML_TYPE_TQ3_0 = 42, // TurboQuant 3-bit polar + QJL (no per-block scale)

Copilot AI Apr 6, 2026


The enum comment says TQ3_0 has “no per-block scale”, but the block layout includes gamma and both CPU/GPU paths use it as a per-block scale. Please correct the type comment to avoid misleading API/CLI consumers.

Suggested change
GGML_TYPE_TQ3_0 = 42, // TurboQuant 3-bit polar + QJL (no per-block scale)
GGML_TYPE_TQ3_0 = 42, // TurboQuant 3-bit polar + QJL (with per-block scale)

    .nrows = 1,
},
[GGML_TYPE_TQ3_0] = {
    .from_float = quantize_row_tq3_0,

Copilot AI Apr 6, 2026


GGML_TYPE_TQ3_0 is added with from_float/to_float support but no CPU vec_dot/vec_dot_type. This will crash tests like tests/test-quantize-fns.cpp (it unconditionally calls qfns_cpu->vec_dot when from_float/to_float exist) and also makes CPU attention paths unable to use TQ3_0 KV. Add a CPU vec_dot implementation (and vec_dot_type) for TQ3_0, or ensure CPU code/tests skip dot-product checks for types without vec_dot and prevent selecting TQ3_0 on CPU-only backends.

Suggested change
.from_float = quantize_row_tq3_0,

Comment on lines +401 to 404
GGML_TYPE_TQ3_0,
};

static ggml_type kv_cache_type_from_str(const std::string & s) {

Copilot AI Apr 6, 2026


TQ3_0 is advertised as an allowed KV cache type in the CLI, but the CPU backend has no vec_dot support for GGML_TYPE_TQ3_0 (see ggml/src/ggml-cpu/ggml-cpu.c), so CPU-only runs selecting this type will fail during graph planning/execution. Either add CPU support, or gate this option based on backend capabilities / emit a clear error earlier.

Suggested change
GGML_TYPE_TQ3_0,
};
static ggml_type kv_cache_type_from_str(const std::string & s) {
};
static ggml_type kv_cache_type_from_str(const std::string & s) {
    if (s == "TQ3_0") {
        throw std::runtime_error(
            "Unsupported cache type: " + s +
            " (not supported by the CPU backend for KV cache use)"
        );
    }

Comment on lines +489 to +531
// TurboQuant TQ3_0: 2-bit codebook dequantization + inverse WHT
// Dequantize to rotated space, then apply inverse WHT32 cooperatively
template<typename dst_t>
static __global__ void dequantize_block_tq3_0(const void * __restrict__ vx, dst_t * __restrict__ yy) {
const float centroids[4] = { -1.510f, -0.4528f, 0.4528f, 1.510f };
const int8_t signs[32] = {
+1, -1, +1, +1, -1, -1, +1, -1, +1, +1, -1, +1, -1, +1, -1, -1,
+1, -1, -1, +1, +1, -1, +1, -1, -1, +1, +1, +1, -1, -1, +1, -1
};

const int64_t i = blockIdx.x;
const block_tq3_0 * x = (const block_tq3_0 *)vx;
const int tid = threadIdx.x;
if (tid >= 32) return;

const float d = __half2float(x[i].gamma);

// Step 1: Each thread dequantizes its value (in rotated space)
const int byte_idx = tid / 4;
const int bit_shift = 2 * (tid % 4);
const int idx = (x[i].qs[byte_idx] >> bit_shift) & 3;

__shared__ float shmem[32];
shmem[tid] = d * centroids[idx];
__syncthreads();

// Step 2: Cooperative inverse WHT (5 butterfly stages)
for (int step = 1; step < 32; step <<= 1) {
int partner = tid ^ step; // butterfly partner
float a = shmem[tid];
float b = shmem[partner];
__syncthreads();
if (tid < partner) {
shmem[tid] = a + b;
shmem[partner] = a - b;
}
__syncthreads();
}

// Step 3: Normalize and undo sign flips
const float inv_sqrt32 = 0.17677669529663688f;
yy[i * QK_TQ3_0 + tid] = shmem[tid] * inv_sqrt32 * signs[tid];
}

Copilot AI Apr 6, 2026


The GPU dequantization kernel reconstructs only d*centroid[idx] and ignores the stored QJL residual-sign bits (qr) and any QJL correction scaling. If qr/gamma are part of the format, dequantization should incorporate them; otherwise consider removing qr from the block layout to avoid wasting bandwidth/storage.

for (int j = 0; j < QK_TQ3_0; j++) rotated[j] = x[j];
tq3_wht32_forward_device(rotated);

memset(y, 0, sizeof(block_tq3_0));

Copilot AI Apr 6, 2026


quantize_f32_tq3_0_block uses memset() inside device code. This is the only device-side memset in this folder and can be problematic for some toolchains (e.g., HIP/device-lib availability) and may add overhead. Consider explicitly zeroing qs/qr/gamma with simple loops/assignments instead.

Suggested change
memset(y, 0, sizeof(block_tq3_0));
y->gamma = __float2half(0.0f);
for (int j = 0; j < (int)(sizeof(y->qs) / sizeof(y->qs[0])); ++j) {
    y->qs[j] = 0;
}
for (int j = 0; j < (int)(sizeof(y->qr) / sizeof(y->qr[0])); ++j) {
    y->qr[j] = 0;
}

Comment on lines +1366 to +1370
const float centroids[4] = { -1.510f, -0.4528f, 0.4528f, 1.510f };
const int8_t signs[32] = {
+1, -1, +1, +1, -1, -1, +1, -1, +1, +1, -1, +1, -1, +1, -1, -1,
+1, -1, -1, +1, +1, -1, +1, -1, -1, +1, +1, +1, -1, -1, +1, -1
};

Copilot AI Apr 6, 2026


The centroids and sign pattern are duplicated as local arrays across CPU quantize and multiple CUDA/HIP kernels. This duplication risks subtle mismatches and can increase register/local-memory pressure. Consider defining them once (e.g., in a shared header / constant memory) and reusing across quantize/dequant/vecdot paths.


4 participants