
tqstack

TurboQuant-accelerated LLM inference stack for Apple Silicon that routes requests between Ollama and a compressed-KV sidecar through a single OpenAI-compatible API.



What It Does

tqstack provides a unified endpoint (localhost:8000) that automatically routes LLM requests to the best available backend: Ollama for short prompts and normal workloads, or a TurboQuant MLX sidecar for long-context generation with up to 3.5× KV cache compression. The router monitors sidecar health and falls back to Ollama seamlessly if it becomes unavailable. Both backends expose OpenAI-compatible /v1/chat/completions endpoints with full SSE streaming support.


Why this repo?

The TurboQuant paper first appeared on arXiv in April 2025 and was quietly accepted to ICLR 2026. Then on March 24, 2026, Google Research published a blog post spotlighting the work — and everything erupted. Within hours, memory chip stocks cratered: SK Hynix dropped over 6%, Samsung fell nearly 5%, and Micron slid into a multi-day decline that would eventually wipe out 20% of its value in under a week. The internet immediately drew comparisons to Pied Piper from HBO's Silicon Valley — a compression algorithm so good it destabilises an industry. Financial analysts scrambled to figure out whether a research paper about KV-cache quantization had just killed the bull case for memory semiconductors.

The scepticism came fast too. Critics on Hacker News called it "totally irrelevant compared to current quantization methods." The RaBitQ authors published a detailed rebuttal accusing Google of misrepresenting prior work. Others pointed out that newer hybrid-attention models like Qwen 3.5 already reduce KV-cache pressure by 75% architecturally, making TurboQuant's gains look marginal in context. Some suspected the blog post was timed to move markets rather than advance science.

But then the open-source community did what it does. Google released no official code — just the paper and a blog post. Within 48 hours, independent developers had working implementations in PyTorch, Triton, MLX, and llama.cpp. Within two weeks there were at least five separate open-source implementations, and someone was running a 104-billion-parameter model on a laptop. The paper's core ideas — polar coordinate quantization, random orthogonal rotation, Lloyd-Max codebooks — do work. On pure-softmax models, KV-cache compression of 4–6× with less than 1% perplexity loss is real and reproducible. The limitations are also real: throughput can drop significantly at high compression, the technique only targets the KV cache (not model weights), and it does nothing for the Gated DeltaNet layers found in modern hybrid architectures.

This repo exists because none of the existing implementations did exactly what I needed. I am running Apple Silicon with 64 GB of unified memory. I want to use MLX. I want to use Ollama 0.19's native MLX backend for fast inference. I want to run Qwen 3.5 and Qwen 3.6, which are hybrid GDN-softmax models where only 10 out of 40 layers even have a KV cache to compress. And I want a single OpenAI-compatible endpoint that intelligently routes between Ollama and a TurboQuant-compressed sidecar based on context length. So this repo is not a general-purpose TurboQuant library. It is a personal, opinionated stack built for a specific hardware target, a specific model family, and a specific workflow. If that happens to match yours, welcome.


Prerequisites

  • macOS 14.0 (Sonoma) or later on Apple Silicon (M1+)
  • Ollama 0.19.0 or later (provides MLX backend on 32 GB+ machines)
  • Python 3.10 or later
  • 30 GB of free disk space (the model download itself is ~22 GB)

Installation

git clone https://github.com/eplt/tqstack.git
cd tqstack
bash install.sh

After installation, quit and reopen Ollama so the environment variables (OLLAMA_MLX=1, OLLAMA_FLASH_ATTENTION=1) take effect.


Quick Start

# Check health
curl http://127.0.0.1:8000/health

# Non-streaming chat
curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3.5:35b-a3b-coding-nvfp4","messages":[{"role":"user","content":"What is 2+2?"}],"stream":false}'

# Streaming chat (SSE)
curl -N http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3.5:35b-a3b-coding-nvfp4","messages":[{"role":"user","content":"Write a haiku"}],"stream":true}'

Usage

cURL

# Force the TurboQuant sidecar
curl -s http://127.0.0.1:8000/v1/chat/completions \
  -d '{"model":"tq:auto","messages":[{"role":"user","content":"Hello"}]}'

# Force Ollama
curl -s http://127.0.0.1:8000/v1/chat/completions \
  -d '{"model":"ollama:qwen3.5:35b-a3b-coding-nvfp4","messages":[{"role":"user","content":"Hello"}]}'

OpenAI Python SDK

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="local")

r = client.chat.completions.create(
    model="qwen3.5:35b-a3b-coding-nvfp4",
    messages=[{"role": "user", "content": "Say hello"}],
    stream=True,
)
for chunk in r:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Model Prefixes

Prefix                  Behavior
tq:, turbo:, sidecar:   Force the TurboQuant sidecar (falls back to Ollama if unhealthy)
ollama:, local:         Force Ollama; the prefix is stripped from the model name
(none)                  Auto-routed based on estimated context length
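The prefix handling can be sketched in a few lines. This is an illustrative approximation, not the actual tqstack router code — function and constant names are hypothetical:

```python
# Hypothetical sketch of the router's model-prefix handling.
# Names and exact stripping semantics are illustrative, not tqstack internals.
SIDECAR_PREFIXES = ("tq:", "turbo:", "sidecar:")
OLLAMA_PREFIXES = ("ollama:", "local:")

def resolve(model: str):
    """Return (backend, model_name) for a request's model string."""
    for prefix in SIDECAR_PREFIXES:
        if model.startswith(prefix):
            return "sidecar", model[len(prefix):]
    for prefix in OLLAMA_PREFIXES:
        if model.startswith(prefix):
            return "ollama", model[len(prefix):]
    # No prefix: defer to the context-length heuristic.
    return "auto", model
```

For example, `resolve("ollama:qwen3.5:35b-a3b-coding-nvfp4")` selects Ollama with the prefix removed, while an unprefixed model name falls through to auto-routing.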

Routing Logic

The router makes per-request decisions. If no prefix is specified, it estimates the total token count across all messages in the conversation (roughly 4 characters per token plus ~40 characters overhead per message for the chat template). Long prompts are routed to the sidecar; short prompts go to Ollama. If the sidecar is unreachable, requests silently fall back to Ollama.
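The heuristic above (roughly 4 characters per token, plus ~40 characters of chat-template overhead per message) is simple enough to sketch directly. The function names here are illustrative, not the router's actual API:

```python
# Rough token estimate mirroring the README's rule of thumb:
# ~4 characters per token, plus ~40 characters of template overhead per message.
def estimate_tokens(messages):
    chars = sum(len(m.get("content", "")) + 40 for m in messages)
    return chars // 4

def route(messages, threshold=16_000):
    """Pick a backend from the estimated prompt size (threshold is per-model)."""
    return "sidecar" if estimate_tokens(messages) > threshold else "ollama"
```

A 360-character user message estimates to 100 tokens, so it stays on Ollama; only conversations past the model's threshold go to the sidecar.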

Model architecture                                 Default threshold
Pure SDPA (Llama, Mistral, Qwen 2.x, Phi, Gemma)   6,000 tokens
Hybrid GDN+SDPA (Qwen 3.5, Qwen 3.6)               16,000 tokens

Hybrid models have a higher threshold because only 10 of their 40 layers maintain KV caches, reducing memory pressure.

Open WebUI

Add http://127.0.0.1:8000/v1 as a connection (API key: anything). Streaming and tool calling work through both backends.


Configuration

Configuration lives in ~/ai/turboquant/config.env, generated by install.sh:

Variable                 Default                              Description
SIDECAR_ENABLED          auto (on ≥ 32 GB machines)           Enable the sidecar
SIDECAR_MODEL            mlx-community/Qwen3.5-35B-A3B-4bit   HuggingFace model for the sidecar
OLLAMA_MODEL             qwen3.5:35b-a3b-coding-nvfp4         Ollama model tag
TQ_KV_BITS               4                                    KV cache quantization bits (3 or 4)
TQ_KV_GROUP_SIZE         64                                   Quantization group size
TQ_USE_ROTATION          1                                    Enable QR rotation for quality
TQ_USE_NORMALIZATION     1                                    Enable norm baking for speed
TQ_BOUNDARY_LAYERS       0                                    Keep the first/last N softmax layers at higher precision
PROMPT_TOKEN_THRESHOLD   auto                                 Token threshold for sidecar routing

To override any setting, edit config.env before starting the services with bash scripts/run-sidecar.sh or bash scripts/run-router.sh; the scripts read the file at startup.
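A config.env loader in this shape is simple key=value parsing. This is a minimal sketch of what a loader like the one in services/shared.py might do — the real implementation may differ:

```python
# Minimal config.env parser: KEY=VALUE lines, '#' comments, optional quotes.
# A sketch only; the actual loader in services/shared.py may behave differently.
from pathlib import Path

def load_config(path):
    cfg = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        cfg[key.strip()] = value.strip().strip('"')
    return cfg
```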


Architecture

Client (curl, OpenAI SDK, Open WebUI, etc.)
                │
                ▼
    ┌───────────────────────┐
    │  Routing Proxy :8000  │  OpenAI-compatible API
    │  Health monitoring    │  SSE streaming
    │  Auto-routing         │  Graceful fallback
    └───────┬───────┬───────┘
            │       │
            ▼       ▼
    ┌──────────┐  ┌──────────────────┐
    │  Ollama  │  │  TQ Sidecar :8001│
    │  :11434  │  │  KV compression  │
    │  Default │  │  Long-context    │
    └──────────┘  └──────────────────┘

TurboQuant compresses the KV cache from float16 to 4-bit quantization with QR rotation, achieving ~3.5× memory savings. For hybrid models like Qwen 3.5 (30 GDN layers + 10 softmax layers), only the 10 softmax layers have compressible KV caches — the GDN layers manage their own fixed-size recurrent state.
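The scalar half of this scheme — group-wise low-bit quantization of K/V values — can be sketched as below. This deliberately omits the rotation and Lloyd-Max codebooks that TurboQuant layers on top; it shows only plain uniform 4-bit quantization over groups of 64 values (the TQ_KV_GROUP_SIZE default):

```python
# Group-wise uniform quantization: the scalar building block of KV compression.
# TurboQuant additionally applies a random orthogonal rotation and Lloyd-Max
# codebooks; this sketch shows only the uniform-grid part.
import numpy as np

def quantize_groups(x, bits=4, group=64):
    x = x.reshape(-1, group)
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    scale = (hi - lo) / (2**bits - 1)          # step size per group
    scale = np.where(scale == 0, 1.0, scale)   # guard constant groups
    q = np.round((x - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequantize_groups(q, scale, lo):
    return q * scale + lo
```

Each reconstructed value is within half a quantization step of the original, which is where the "<1% perplexity loss on pure-softmax models" figure ultimately comes from.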

KV Cache Memory (Qwen 3.5-35B)

Context       Standard KV cache   TurboQuant hybrid   Saved
8K tokens     ~80 MB              ~23 MB              ~57 MB
32K tokens    ~320 MB             ~91 MB              ~229 MB
128K tokens   ~1.28 GB            ~366 MB             ~914 MB
262K tokens   ~2.62 GB            ~749 MB             ~1.87 GB
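These figures follow from two numbers stated elsewhere in this README: fp16 KV storage of roughly 10 KB per token across the 10 softmax layers (an assumption inferred from the table, not measured here) and the ~3.5× compression ratio. A back-of-envelope check:

```python
# Reproduce the KV-cache table from two README figures (both assumptions here,
# not measurements): ~10 KB/token of fp16 KV state and ~3.5x compression.
FP16_BYTES_PER_TOKEN = 10 * 1024
COMPRESSION = 3.5

def kv_cache_mb(tokens):
    """Return (standard_mb, turboquant_mb) for a given context length."""
    std = tokens * FP16_BYTES_PER_TOKEN / 2**20
    return std, std / COMPRESSION
```

For an 8K context this gives 80 MB standard and ~23 MB compressed, matching the first table row.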

Project Structure

tqstack/
├── services/
│   ├── shared.py           # SSE formatting, config loader
│   ├── hybrid_cache.py     # Auto-detects SDPA vs hybrid architectures
│   ├── sidecar.py          # TurboQuant MLX inference server
│   ├── router.py           # Health-aware routing proxy
│   └── tq_patch_v2.py      # Fixed SDPA patch for mlx-lm 0.31+
├── scripts/
│   ├── preflight-check.sh
│   ├── start-ollama-optimized.sh
│   ├── run-sidecar.sh
│   ├── run-router.sh
│   ├── load-launchagents.sh
│   └── unload-launchagents.sh
├── launchagents/
│   ├── com.tqstack-router.plist.template
│   └── com.tqstack-sidecar.plist.template
├── tests/
│   ├── test_hybrid_cache.py
│   ├── test_router.py
│   ├── test_shared.py
│   ├── test_sidecar.py
│   └── test_tq_patch_v2.py
├── open-webui/
│   └── README.md
├── install.sh
└── uninstall.sh

Security

All services bind to 127.0.0.1 only — no network exposure. The router is a stateless pass-through proxy: it does not store, log, or modify request content. No API keys are persisted; the api_key parameter is accepted but ignored. TurboQuant is cloned from the upstream repository at install time and runs within the same Python process as the sidecar.


Troubleshooting

Sidecar not starting — check ~/Library/Logs/tqstack-sidecar.err.log for model download errors or out-of-memory conditions. On 32 GB machines the sidecar is disabled by default.

Requests time out — verify Ollama is running (ollama list) and that you quit and reopened Ollama after installation so the environment variables take effect.

Degenerate repeated-token output — ensure services/tq_patch_v2.py is being used (not the original turboquant.patch). The v2 patch fixes an import-aliasing bug in mlx-lm 0.31+.

Router routes everything to Ollama — check curl http://127.0.0.1:8000/health for sidecar_available. If false, the sidecar crashed or hasn't finished loading its model (~10-60 seconds for 4B models, longer for 70B).


Roadmap

  • Lazy model loading for 32 GB machines (unload Ollama model when sidecar is active)
  • Per-request cache memory reporting in the /health endpoint
  • Support for Qwen 3.6 with verified mlx-lm model compatibility

Contributing

Contributions welcome — open an issue or submit a pull request.


License

MIT


Acknowledgments


Author

Edward Tsang — blockchain & AI engineer. Open to consulting → Email · LinkedIn
