KV cache memory is a bottleneck at long context lengths. NexusQuant offers training-free 7–10x KV compression via E8 lattice quantization + attention-aware token eviction (up to 17x with token merging).
Integration points:
- After prefill, compress the KV cache in-place
- Use the attention mask to exclude evicted tokens during generation
- API: `with nexusquant_evict(model): model.generate(...)` (expanded sketch below)
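A minimal usage sketch of the context-manager API above, assuming a Hugging Face-style `generate` loop; the `from nexusquant import ...` path and generation arguments are assumptions, only `nexusquant_evict(model)` is taken from the API shown:

```python
# Usage sketch: quantize the KV cache after prefill and evict low-attention
# tokens during decoding. The import path below is an assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from nexusquant import nexusquant_evict  # assumed package layout

model_id = "mistralai/Mistral-7B-v0.1"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "..."  # a long-context prompt
inputs = tok(prompt, return_tensors="pt").to(model.device)

# Inside the context manager, the compressed cache and eviction mask are
# handled transparently; the generation call itself is unchanged.
with nexusquant_evict(model):
    out = model.generate(**inputs, max_new_tokens=256)

print(tok.decode(out[0], skip_special_tokens=True))
```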
Why this matters for DeepSpeed-Inference:
DeepSpeed already supports ZeroQuant and SmoothQuant for weight/activation quantization. NexusQuant is complementary — it targets the KV cache specifically, which ZeroQuant doesn't cover. At long contexts, KV cache can exceed model weight memory, making it the dominant bottleneck for throughput.
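As a rough sanity check on that claim, here is the arithmetic for Llama-3-8B (32 layers, 8 KV heads under GQA, head dim 128, fp16 cache); the numbers are order-of-magnitude estimates, not benchmarks:

```python
# Back-of-envelope KV cache footprint for Llama-3-8B
# (GQA: 32 layers, 8 KV heads, head dim 128, fp16 cache).
layers, kv_heads, head_dim, bytes_fp16 = 32, 8, 128, 2
per_token = 2 * layers * kv_heads * head_dim * bytes_fp16  # K and V
print(per_token)                    # 131072 bytes ~= 128 KiB per token
ctx_len = 128 * 1024                # 128K-token context
print(per_token * ctx_len / 2**30)  # ~= 16 GiB per sequence, comparable to
                                    # the ~16 GB of fp16 weights; batching
                                    # multiplies the cache, not the weights
```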
DeepSpeed's inference kernel infrastructure (DeepSpeed-FastGen, blocked KV cache) would be a natural integration point — compress KV blocks before writing to the block pool, decompress at attention time inside the fused kernel.
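To make the shape of that integration concrete, here is a toy compress-on-write / decompress-on-read sketch around a block pool. Every class and function name below is a placeholder: this is not the DeepSpeed-FastGen block-pool API, and the per-block int8 quantizer merely stands in for the E8 lattice codec.

```python
# Hypothetical hook points around a blocked KV cache. Names are illustrative
# placeholders, not DeepSpeed-FastGen or NexusQuant APIs; a real backend would
# emit E8 lattice codes, and dequantization would live inside the fused kernel.
import torch

def compress_kv_block(block: torch.Tensor):
    """Stand-in quantizer: per-block absmax scale + int8 codes."""
    scale = block.abs().amax().clamp(min=1e-8) / 127.0
    codes = (block / scale).round().clamp(-127, 127).to(torch.int8)
    return codes, scale

def decompress_kv_block(codes: torch.Tensor, scale: torch.Tensor, dtype=torch.float16):
    """Dequantize at attention time (ideally fused into the attention kernel)."""
    return codes.to(dtype) * scale

class CompressedBlockPool:
    """Toy block pool: compress K/V blocks on write, decompress on read."""
    def __init__(self):
        self.blocks = {}

    def write(self, block_id: int, k_block: torch.Tensor, v_block: torch.Tensor):
        self.blocks[block_id] = (compress_kv_block(k_block), compress_kv_block(v_block))

    def read(self, block_id: int):
        (kc, ks), (vc, vs) = self.blocks[block_id]
        return decompress_kv_block(kc, ks), decompress_kv_block(vc, vs)
```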
Validated results:
- Mistral-7B: 7x compression, -2.26% PPL
- Llama-3-8B: 5.3x compression, -0.002% PPL
- Training-free, no calibration data required
Library details:
- Install: `pip install nexusquant-kv`
Would you be interested in exploring this as a KV compression backend for DeepSpeed-Inference? Happy to discuss the integration architecture and provide benchmarks.