
tqstack

TurboQuant-accelerated LLM inference stack for Apple Silicon that routes requests between Ollama and a compressed-KV sidecar through a single OpenAI-compatible API.



What It Does

tqstack provides a unified endpoint (localhost:8000) that automatically routes LLM requests to the best available backend: Ollama for short prompts and normal workloads, or a TurboQuant MLX sidecar for long-context generation with up to 3.5× KV cache compression. The router monitors sidecar health and falls back to Ollama seamlessly if it becomes unavailable. Both backends expose OpenAI-compatible /v1/chat/completions endpoints with full SSE streaming support.


Why this repo?

The TurboQuant paper first appeared on arXiv in April 2025 and was quietly accepted to ICLR 2026. Then on March 24, 2026, Google Research published a blog post spotlighting the work — and everything erupted. Within hours, memory chip stocks cratered: SK Hynix dropped over 6%, Samsung fell nearly 5%, and Micron slid into a multi-day decline that would eventually wipe out 20% of its value in under a week. The internet immediately drew comparisons to Pied Piper from HBO's Silicon Valley — a compression algorithm so good it destabilises an industry. Financial analysts scrambled to figure out whether a research paper about KV-cache quantization had just killed the bull case for memory semiconductors.

The scepticism came fast too. Critics on Hacker News called it "totally irrelevant compared to current quantization methods." The RaBitQ authors published a detailed rebuttal accusing Google of misrepresenting prior work. Others pointed out that newer hybrid-attention models like Qwen 3.5 already reduce KV-cache pressure by 75% architecturally, making TurboQuant's gains look marginal in context. Some suspected the blog post was timed to move markets rather than advance science.

But then the open-source community did what it does. Google released no official code — just the paper and a blog post. Within 48 hours, independent developers had working implementations in PyTorch, Triton, MLX, and llama.cpp. Within two weeks there were at least five separate open-source implementations, and someone was running a 104-billion-parameter model on a laptop. The paper's core ideas — polar coordinate quantization, random orthogonal rotation, Lloyd-Max codebooks — do work. On pure-softmax models, KV-cache compression of 4–6× with less than 1% perplexity loss is real and reproducible. The limitations are also real: throughput can drop significantly at high compression, the technique only targets the KV cache (not model weights), and it does nothing for the Gated DeltaNet layers found in modern hybrid architectures.

This repo exists because none of the existing implementations did exactly what I needed. I am running Apple Silicon with 64 GB of unified memory. I want to use MLX. I want to use Ollama 0.19's native MLX backend for fast inference. I want to run Qwen 3.5 and Qwen 3.6, which are hybrid GDN-softmax models where only 10 out of 40 layers even have a KV cache to compress. And I want a single OpenAI-compatible endpoint that intelligently routes between Ollama and a TurboQuant-compressed sidecar based on context length. So this repo is not a general-purpose TurboQuant library. It is a personal, opinionated stack built for a specific hardware target, a specific model family, and a specific workflow. If that happens to match yours, welcome.


Prerequisites

  • macOS 14.0 (Sonoma) or later on Apple Silicon (M1+)
  • Ollama 0.19.0 or later (provides MLX backend on 32 GB+ machines)
  • Python 3.10 or later
  • 30 GB of free disk space (the model download itself is ~22 GB)

Installation

git clone https://github.com/eplt/tqstack.git
cd tqstack
bash install.sh

After installation, quit and reopen Ollama so the environment variables (OLLAMA_MLX=1, OLLAMA_FLASH_ATTENTION=1) take effect.


Quick Start

# Check health
curl http://127.0.0.1:8000/health

# Non-streaming chat
curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3.5:35b-a3b-coding-nvfp4","messages":[{"role":"user","content":"What is 2+2?"}],"stream":false}'

# Streaming chat (SSE)
curl -N http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3.5:35b-a3b-coding-nvfp4","messages":[{"role":"user","content":"Write a haiku"}],"stream":true}'

Usage

cURL

# Force the TurboQuant sidecar
curl -s http://127.0.0.1:8000/v1/chat/completions \
  -d '{"model":"tq:auto","messages":[{"role":"user","content":"Hello"}]}'

# Force Ollama
curl -s http://127.0.0.1:8000/v1/chat/completions \
  -d '{"model":"ollama:qwen3.5:35b-a3b-coding-nvfp4","messages":[{"role":"user","content":"Hello"}]}'

OpenAI Python SDK

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="local")

r = client.chat.completions.create(
    model="qwen3.5:35b-a3b-coding-nvfp4",
    messages=[{"role": "user", "content": "Say hello"}],
    stream=True,
)
for chunk in r:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Model Prefixes

Prefix                  Behavior
tq:, turbo:, sidecar:   Force the TurboQuant sidecar (falls back to Ollama if unhealthy)
ollama:, local:         Force Ollama; the prefix is stripped from the model name
(none)                  Auto-routed based on estimated context length
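The prefix handling can be sketched in a few lines. This is an illustrative approximation, not the actual tqstack router code — function and constant names are hypothetical:

```python
# Hypothetical sketch of the router's model-prefix handling.
# Names and exact stripping semantics are illustrative, not tqstack internals.
SIDECAR_PREFIXES = ("tq:", "turbo:", "sidecar:")
OLLAMA_PREFIXES = ("ollama:", "local:")

def resolve(model: str):
    """Return (backend, model_name) for a request's model string."""
    for prefix in SIDECAR_PREFIXES:
        if model.startswith(prefix):
            return "sidecar", model[len(prefix):]
    for prefix in OLLAMA_PREFIXES:
        if model.startswith(prefix):
            return "ollama", model[len(prefix):]
    # No prefix: defer to the context-length heuristic.
    return "auto", model
```

For example, `resolve("ollama:qwen3.5:35b-a3b-coding-nvfp4")` selects Ollama with the prefix removed, while an unprefixed model name falls through to auto-routing.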

Routing Logic

The router makes per-request decisions. If no prefix is specified, it estimates the total token count across all messages in the conversation (roughly 4 characters per token plus ~40 characters overhead per message for the chat template). Long prompts are routed to the sidecar; short prompts go to Ollama. If the sidecar is unreachable, requests silently fall back to Ollama.
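The heuristic above (roughly 4 characters per token, plus ~40 characters of chat-template overhead per message) is simple enough to sketch directly. The function names here are illustrative, not the router's actual API:

```python
# Rough token estimate mirroring the README's rule of thumb:
# ~4 characters per token, plus ~40 characters of template overhead per message.
def estimate_tokens(messages):
    chars = sum(len(m.get("content", "")) + 40 for m in messages)
    return chars // 4

def route(messages, threshold=16_000):
    """Pick a backend from the estimated prompt size (threshold is per-model)."""
    return "sidecar" if estimate_tokens(messages) > threshold else "ollama"
```

A 360-character user message estimates to 100 tokens, so it stays on Ollama; only conversations past the model's threshold go to the sidecar.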

Model architecture                                 Default threshold
Pure SDPA (Llama, Mistral, Qwen 2.x, Phi, Gemma)   6,000 tokens
Hybrid GDN+SDPA (Qwen 3.5, Qwen 3.6)               16,000 tokens

Hybrid models have a higher threshold because only 10 of their 40 layers maintain KV caches, reducing memory pressure.

Open WebUI

Add http://127.0.0.1:8000/v1 as a connection (API key: anything). Streaming and tool calling work through both backends.


Configuration

Configuration lives in ~/ai/turboquant/config.env, generated by install.sh:

Variable                 Default                              Description
SIDECAR_ENABLED          auto (on ≥ 32 GB machines)           Enable the sidecar
SIDECAR_MODEL            mlx-community/Qwen3.5-35B-A3B-4bit   HuggingFace model for the sidecar
OLLAMA_MODEL             qwen3.5:35b-a3b-coding-nvfp4         Ollama model tag
TQ_KV_BITS               4                                    KV cache quantization bits (3 or 4)
TQ_KV_GROUP_SIZE         64                                   Quantization group size
TQ_USE_ROTATION          1                                    Enable QR rotation for quality
TQ_USE_NORMALIZATION     1                                    Enable norm baking for speed
TQ_BOUNDARY_LAYERS       0                                    Keep the first/last N softmax layers at higher precision
PROMPT_TOKEN_THRESHOLD   auto                                 Token threshold for sidecar routing

To override any setting, edit config.env before starting the services with bash scripts/run-sidecar.sh or bash scripts/run-router.sh; the scripts read the file at startup.
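A config.env loader in this shape is simple key=value parsing. This is a minimal sketch of what a loader like the one in services/shared.py might do — the real implementation may differ:

```python
# Minimal config.env parser: KEY=VALUE lines, '#' comments, optional quotes.
# A sketch only; the actual loader in services/shared.py may behave differently.
from pathlib import Path

def load_config(path):
    cfg = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        cfg[key.strip()] = value.strip().strip('"')
    return cfg
```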


Architecture

Client (curl, OpenAI SDK, Open WebUI, etc.)
                │
                ▼
    ┌───────────────────────┐
    │  Routing Proxy :8000  │  OpenAI-compatible API
    │  Health monitoring    │  SSE streaming
    │  Auto-routing         │  Graceful fallback
    └───────┬───────┬───────┘
            │       │
            ▼       ▼
    ┌──────────┐  ┌──────────────────┐
    │  Ollama  │  │  TQ Sidecar :8001│
    │  :11434  │  │  KV compression  │
    │  Default │  │  Long-context    │
    └──────────┘  └──────────────────┘

TurboQuant compresses the KV cache from float16 to 4-bit quantization with QR rotation, achieving ~3.5× memory savings. For hybrid models like Qwen 3.5 (30 GDN layers + 10 softmax layers), only the 10 softmax layers have compressible KV caches — the GDN layers manage their own fixed-size recurrent state.
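The scalar half of this scheme — group-wise low-bit quantization of K/V values — can be sketched as below. This deliberately omits the rotation and Lloyd-Max codebooks that TurboQuant layers on top; it shows only plain uniform 4-bit quantization over groups of 64 values (the TQ_KV_GROUP_SIZE default):

```python
# Group-wise uniform quantization: the scalar building block of KV compression.
# TurboQuant additionally applies a random orthogonal rotation and Lloyd-Max
# codebooks; this sketch shows only the uniform-grid part.
import numpy as np

def quantize_groups(x, bits=4, group=64):
    x = x.reshape(-1, group)
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    scale = (hi - lo) / (2**bits - 1)          # step size per group
    scale = np.where(scale == 0, 1.0, scale)   # guard constant groups
    q = np.round((x - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequantize_groups(q, scale, lo):
    return q * scale + lo
```

Each reconstructed value is within half a quantization step of the original, which is where the "<1% perplexity loss on pure-softmax models" figure ultimately comes from.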

KV Cache Memory (Qwen 3.5-35B)

Context       Standard KV cache   TurboQuant hybrid   Saved
8K tokens     ~80 MB              ~23 MB              ~57 MB
32K tokens    ~320 MB             ~91 MB              ~229 MB
128K tokens   ~1.28 GB            ~366 MB             ~914 MB
262K tokens   ~2.62 GB            ~749 MB             ~1.87 GB
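These figures follow from two numbers stated elsewhere in this README: fp16 KV storage of roughly 10 KB per token across the 10 softmax layers (an assumption inferred from the table, not measured here) and the ~3.5× compression ratio. A back-of-envelope check:

```python
# Reproduce the KV-cache table from two README figures (both assumptions here,
# not measurements): ~10 KB/token of fp16 KV state and ~3.5x compression.
FP16_BYTES_PER_TOKEN = 10 * 1024
COMPRESSION = 3.5

def kv_cache_mb(tokens):
    """Return (standard_mb, turboquant_mb) for a given context length."""
    std = tokens * FP16_BYTES_PER_TOKEN / 2**20
    return std, std / COMPRESSION
```

For an 8K context this gives 80 MB standard and ~23 MB compressed, matching the first table row.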

Project Structure

tqstack/
├── services/
│   ├── shared.py           # SSE formatting, config loader
│   ├── hybrid_cache.py     # Auto-detects SDPA vs hybrid architectures
│   ├── sidecar.py          # TurboQuant MLX inference server
│   ├── router.py           # Health-aware routing proxy
│   └── tq_patch_v2.py      # Fixed SDPA patch for mlx-lm 0.31+
├── scripts/
│   ├── preflight-check.sh
│   ├── start-ollama-optimized.sh
│   ├── run-sidecar.sh
│   ├── run-router.sh
│   ├── load-launchagents.sh
│   └── unload-launchagents.sh
├── launchagents/
│   ├── com.tqstack-router.plist.template
│   └── com.tqstack-sidecar.plist.template
├── tests/
│   ├── test_hybrid_cache.py
│   ├── test_router.py
│   ├── test_shared.py
│   ├── test_sidecar.py
│   └── test_tq_patch_v2.py
├── open-webui/
│   └── README.md
├── install.sh
└── uninstall.sh

Security

All services bind to 127.0.0.1 only — no network exposure. The router is a stateless pass-through proxy: it does not store, log, or modify request content. No API keys are persisted; the api_key parameter is accepted but ignored. TurboQuant is cloned from the upstream repository at install time and runs within the same Python process as the sidecar.


Troubleshooting

Sidecar not starting — check ~/Library/Logs/tqstack-sidecar.err.log for model download errors or out-of-memory conditions. On 32 GB machines the sidecar is disabled by default.

Requests time out — verify Ollama is running (ollama list) and that you quit and reopened Ollama after installation so the environment variables take effect.

Degenerate repeated-token output — ensure services/tq_patch_v2.py is being used (not the original turboquant.patch). The v2 patch fixes an import-aliasing bug in mlx-lm 0.31+.

Router routes everything to Ollama — check curl http://127.0.0.1:8000/health for sidecar_available. If false, the sidecar crashed or hasn't finished loading its model (~10-60 seconds for 4B models, longer for 70B).


Roadmap

  • Lazy model loading for 32 GB machines (unload Ollama model when sidecar is active)
  • Per-request cache memory reporting in the /health endpoint
  • Support for Qwen 3.6 with verified mlx-lm model compatibility

Contributing

Contributions welcome — open an issue or submit a pull request.


License

MIT


Acknowledgments


Author

Edward Tsang — blockchain & AI engineer. Open to consulting → Email · LinkedIn
