TurboQuant-accelerated LLM inference stack for Apple Silicon that routes requests between Ollama and a compressed-KV sidecar through a single OpenAI-compatible API.
tqstack provides a unified endpoint (localhost:8000) that automatically routes LLM requests to the best available backend: Ollama for short prompts and normal workloads, or a TurboQuant MLX sidecar for long-context generation with up to 3.5× KV cache compression. The router monitors sidecar health and falls back to Ollama seamlessly if it becomes unavailable. Both backends expose OpenAI-compatible /v1/chat/completions endpoints with full SSE streaming support.
Why this repo? The TurboQuant paper first appeared on arXiv in April 2025 and was quietly accepted to ICLR 2026. Then on March 24, 2026, Google Research published a blog post spotlighting the work — and everything erupted. Within hours, memory chip stocks cratered: SK Hynix dropped over 6%, Samsung fell nearly 5%, and Micron slid into a multi-day decline that would eventually wipe out 20% of its value in under a week. The internet immediately drew comparisons to Pied Piper from HBO's Silicon Valley — a compression algorithm so good it destabilises an industry. Financial analysts scrambled to figure out whether a research paper about KV-cache quantization had just killed the bull case for memory semiconductors.
The scepticism came fast too. Critics on Hacker News called it "totally irrelevant compared to current quantization methods." The RaBitQ authors published a detailed rebuttal accusing Google of misrepresenting prior work. Others pointed out that newer hybrid-attention models like Qwen 3.5 already reduce KV-cache pressure by 75% architecturally, making TurboQuant's gains look marginal in context. Some suspected the blog post was timed to move markets rather than advance science.
But then the open-source community did what it does. Google released no official code — just the paper and a blog post. Within 48 hours, independent developers had working implementations in PyTorch, Triton, MLX, and llama.cpp. Within two weeks there were at least five separate open-source implementations, and someone was running a 104-billion-parameter model on a laptop. The paper's core ideas — polar coordinate quantization, random orthogonal rotation, Lloyd-Max codebooks — do work. On pure-softmax models, KV-cache compression of 4–6× with less than 1% perplexity loss is real and reproducible. The limitations are also real: throughput can drop significantly at high compression, the technique only targets the KV cache (not model weights), and it does nothing for the Gated DeltaNet layers found in modern hybrid architectures.
This repo exists because none of the existing implementations did exactly what I needed. I am running Apple Silicon with 64 GB of unified memory. I want to use MLX. I want to use Ollama 0.19's native MLX backend for fast inference. I want to run Qwen 3.5 and Qwen 3.6, which are hybrid GDN-softmax models where only 10 out of 40 layers even have a KV cache to compress. And I want a single OpenAI-compatible endpoint that intelligently routes between Ollama and a TurboQuant-compressed sidecar based on context length. So this repo is not a general-purpose TurboQuant library. It is a personal, opinionated stack built for a specific hardware target, a specific model family, and a specific workflow. If that happens to match yours, welcome.
- macOS 14.0 (Sonoma) or later on Apple Silicon (M1+)
- Ollama 0.19.0 or later (provides MLX backend on 32 GB+ machines)
- Python 3.10 or later
- 30 GB of free disk space (the model download itself is ~22 GB)
```bash
git clone https://github.com/eplt/tqstack.git
cd tqstack
bash install.sh
```

After installation, quit and reopen Ollama so the environment variables (OLLAMA_MLX=1, OLLAMA_FLASH_ATTENTION=1) take effect.
```bash
# Check health
curl http://127.0.0.1:8000/health

# Non-streaming chat
curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3.5:35b-a3b-coding-nvfp4","messages":[{"role":"user","content":"What is 2+2?"}],"stream":false}'

# Streaming chat (SSE)
curl -N http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3.5:35b-a3b-coding-nvfp4","messages":[{"role":"user","content":"Write a haiku"}],"stream":true}'

# Force the TurboQuant sidecar
curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"tq:auto","messages":[{"role":"user","content":"Hello"}]}'

# Force Ollama
curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"ollama:qwen3.5:35b-a3b-coding-nvfp4","messages":[{"role":"user","content":"Hello"}]}'
```

Or from Python, using the OpenAI SDK:

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="local")
r = client.chat.completions.create(
    model="qwen3.5:35b-a3b-coding-nvfp4",
    messages=[{"role": "user", "content": "Say hello"}],
    stream=True,
)
for chunk in r:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

Model-name prefixes control routing explicitly:

| Prefix | Behavior |
|---|---|
| tq:, turbo:, sidecar: | Force the TurboQuant sidecar (falls back to Ollama if unhealthy) |
| ollama:, local: | Force Ollama; the prefix is stripped from the model name |
| (none) | Auto-routed based on estimated context length |
The router makes per-request decisions. If no prefix is specified, it estimates the total token count across all messages in the conversation (roughly 4 characters per token plus ~40 characters overhead per message for the chat template). Long prompts are routed to the sidecar; short prompts go to Ollama. If the sidecar is unreachable, requests silently fall back to Ollama.
| Model Architecture | Default Threshold |
|---|---|
| Pure SDPA (Llama, Mistral, Qwen 2.x, Phi, Gemma) | 6,000 tokens |
| Hybrid GDN+SDPA (Qwen 3.5, Qwen 3.6) | 16,000 tokens |
Hybrid models have a higher threshold because only 10 of their 40 layers maintain KV caches, reducing memory pressure.
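The heuristic is simple enough to sketch in a few lines of Python. This is an illustrative reimplementation of the behavior described above, not the router's actual code; the function names and the hybrid flag are made up for the example:

```python
def estimate_tokens(messages: list[dict]) -> int:
    # ~4 characters per token, plus ~40 characters of chat-template
    # overhead per message (the approximation described above)
    chars = sum(len(m.get("content", "")) + 40 for m in messages)
    return chars // 4

def pick_backend(messages: list[dict], hybrid: bool = True) -> str:
    # Default thresholds from the table above
    threshold = 16_000 if hybrid else 6_000
    return "sidecar" if estimate_tokens(messages) > threshold else "ollama"

print(pick_backend([{"role": "user", "content": "What is 2+2?"}]))  # -> ollama
```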
Add http://127.0.0.1:8000/v1 as a connection (API key: anything). Streaming and tool calling work through both backends.
Configuration lives in ~/ai/turboquant/config.env, generated by install.sh:
| Variable | Default | Description |
|---|---|---|
| SIDECAR_ENABLED | auto (≥ 32 GB) | Enable sidecar |
| SIDECAR_MODEL | mlx-community/Qwen3.5-35B-A3B-4bit | HuggingFace model for sidecar |
| OLLAMA_MODEL | qwen3.5:35b-a3b-coding-nvfp4 | Ollama model tag |
| TQ_KV_BITS | 4 | KV cache quantization bits (3 or 4) |
| TQ_KV_GROUP_SIZE | 64 | Quantization group size |
| TQ_USE_ROTATION | 1 | Enable QR rotation for quality |
| TQ_USE_NORMALIZATION | 1 | Enable norm baking for speed |
| TQ_BOUNDARY_LAYERS | 0 | Keep the first/last N softmax layers at higher precision |
| PROMPT_TOKEN_THRESHOLD | auto | Token threshold for sidecar routing |
Environment variables can be overridden at service startup by editing config.env before running bash scripts/run-sidecar.sh or bash scripts/run-router.sh.
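For example, to experiment with more aggressive compression you might edit config.env like this before restarting the services (illustrative values only; the remaining variables stay as install.sh generated them):

```bash
# ~/ai/turboquant/config.env (excerpt)
TQ_KV_BITS=3                 # 3-bit KV quantization instead of the default 4
TQ_KV_GROUP_SIZE=32          # smaller groups: finer-grained scales, more metadata
TQ_BOUNDARY_LAYERS=1         # keep first/last softmax layer at higher precision
PROMPT_TOKEN_THRESHOLD=8000  # send prompts above ~8K tokens to the sidecar
```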
```
Client (curl, OpenAI SDK, Open WebUI, etc.)
            │
            ▼
┌───────────────────────┐
│  Routing Proxy :8000  │  OpenAI-compatible API
│  Health monitoring    │  SSE streaming
│  Auto-routing         │  Graceful fallback
└───────┬───────┬───────┘
        │       │
        ▼       ▼
┌──────────┐  ┌──────────────────┐
│  Ollama  │  │ TQ Sidecar :8001 │
│  :11434  │  │  KV compression  │
│  Default │  │  Long-context    │
└──────────┘  └──────────────────┘
```
TurboQuant compresses the KV cache from float16 down to 4-bit values using group quantization with QR rotation, achieving ~3.5× memory savings. For hybrid models like Qwen 3.5 (30 GDN layers + 10 softmax layers), only the 10 softmax layers have compressible KV caches; the GDN layers manage their own fixed-size recurrent state.
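The core recipe is straightforward to sketch. The toy below is not the sidecar's implementation, just a minimal numpy illustration of group-wise 4-bit quantization behind a random orthogonal rotation (the QR trick); shapes and constants are illustrative:

```python
import numpy as np

def random_rotation(dim: int, seed: int = 0) -> np.ndarray:
    # QR decomposition of a Gaussian matrix yields a random orthogonal matrix
    q, _ = np.linalg.qr(np.random.default_rng(seed).standard_normal((dim, dim)))
    return q

def quantize(x: np.ndarray, bits: int = 4, group_size: int = 64):
    # Quantize each contiguous group of group_size values with its own scale
    groups = x.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / (2 ** (bits - 1) - 1)
    scale = np.where(scale == 0, 1.0, scale)
    codes = np.clip(np.round(groups / scale), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return codes.astype(np.int8), scale

def dequantize(codes, scale, shape):
    return (codes.astype(np.float32) * scale).reshape(shape)

# Rotate keys to spread outliers across dimensions before quantizing,
# then undo the rotation after dequantizing (R is orthogonal, so R @ R.T = I)
k = np.random.default_rng(1).standard_normal((1024, 128)).astype(np.float32)
R = random_rotation(128)
codes, scale = quantize(k @ R)
k_hat = dequantize(codes, scale, k.shape) @ R.T
print("reconstruction RMSE:", np.sqrt(np.mean((k - k_hat) ** 2)))
```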
| Context | Standard KV Cache | TurboQuant Hybrid | Saved |
|---|---|---|---|
| 8K tokens | ~80 MB | ~23 MB | 57 MB |
| 32K tokens | ~320 MB | ~91 MB | 229 MB |
| 128K tokens | ~1.28 GB | ~366 MB | ~914 MB |
| 262K tokens | ~2.62 GB | ~749 MB | ~1.87 GB |
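These figures follow from simple arithmetic: roughly 10 KB of fp16 KV data per token across the 10 compressible layers, divided by ~3.5 once compressed. A quick check, where the 10 KB/token figure corresponds to assumed shapes (e.g. 2 tensors × 10 layers × 256 KV values × 2 bytes; Qwen 3.5's exact head configuration may differ):

```python
PER_TOKEN_KB = 10   # ~2 tensors x 10 layers x 256 values x 2 bytes (assumed shapes)
COMPRESSION = 3.5   # hybrid compression ratio from the table above

for ctx_k in (8, 32, 128, 262):
    fp16_mb = ctx_k * PER_TOKEN_KB  # K tokens x KB/token = MB
    print(f"{ctx_k:>4}K tokens: {fp16_mb:5.0f} MB fp16 -> "
          f"{fp16_mb / COMPRESSION:4.0f} MB compressed")
```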
```
tqstack/
├── services/
│   ├── shared.py           # SSE formatting, config loader
│   ├── hybrid_cache.py     # Auto-detects SDPA vs hybrid architectures
│   ├── sidecar.py          # TurboQuant MLX inference server
│   ├── router.py           # Health-aware routing proxy
│   └── tq_patch_v2.py      # SDPA patch fix for mlx-lm 0.31+
├── scripts/
│   ├── preflight-check.sh
│   ├── start-ollama-optimized.sh
│   ├── run-sidecar.sh
│   ├── run-router.sh
│   ├── load-launchagents.sh
│   └── unload-launchagents.sh
├── launchagents/
│   ├── com.tqstack-router.plist.template
│   └── com.tqstack-sidecar.plist.template
├── tests/
│   ├── test_hybrid_cache.py
│   ├── test_router.py
│   ├── test_shared.py
│   ├── test_sidecar.py
│   └── test_tq_patch_v2.py
├── open-webui/
│   └── README.md
├── install.sh
└── uninstall.sh
```
All services bind to 127.0.0.1 only — no network exposure. The router is a stateless pass-through proxy: it does not store, log, or modify request content. No API keys are persisted; the api_key parameter is accepted but ignored. TurboQuant is cloned from the upstream repository at install time and runs within the same Python process as the sidecar.
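You can confirm the loopback-only binding yourself; for example, on macOS:

```bash
# Should show the router listening on 127.0.0.1:8000, not 0.0.0.0 or *
lsof -nP -iTCP:8000 -sTCP:LISTEN
```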
Sidecar not starting — check ~/Library/Logs/tqstack-sidecar.err.log for model download errors or out-of-memory conditions. On 32 GB machines the sidecar is disabled by default.
Requests time out — verify Ollama is running (ollama list) and that you quit and reopened Ollama after installation so the environment variables take effect.
Degenerate repeated-token output — ensure services/tq_patch_v2.py is being used (not the original turboquant.patch). The v2 patch fixes an import-aliasing bug in mlx-lm 0.31+.
Router routes everything to Ollama — check curl http://127.0.0.1:8000/health for sidecar_available. If false, the sidecar crashed or hasn't finished loading its model (~10-60 seconds for 4B models, longer for 70B).
- Lazy model loading for 32 GB machines (unload Ollama model when sidecar is active)
- Per-request cache memory reporting in the /health endpoint
- Support for Qwen 3.6 with verified mlx-lm model compatibility
Contributions welcome — open an issue or submit a pull request.
- TurboQuant MLX — KV cache compression library
- vakaobr/poorsman-mac-turboquant-stack-bundle — Original project (Dev.to article)
- mlx-lm — Apple's MLX inference framework
- Ollama — Local model serving
Edward Tsang — blockchain & AI engineer. Open to consulting → Email · LinkedIn