TurboQuant KV cache compression plugin for vLLM — asymmetric K/V, 8 models validated, consumer GPUs
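TurboQuant's exact scheme isn't reproduced here, but a minimal sketch of the asymmetric idea, quantizing K per channel (key activations tend to carry per-channel outliers) and V per token, looks like this in plain PyTorch. The shapes and the 4-bit setting are illustrative assumptions:

```python
import torch

def quantize(x: torch.Tensor, dim: int, bits: int = 4):
    """Uniform asymmetric (min/max) quantization along `dim`."""
    qmax = 2 ** bits - 1
    xmin = x.amin(dim=dim, keepdim=True)
    xmax = x.amax(dim=dim, keepdim=True)
    scale = (xmax - xmin).clamp(min=1e-8) / qmax
    q = ((x - xmin) / scale).round().clamp(0, qmax).to(torch.uint8)
    return q, scale, xmin

def dequantize(q, scale, xmin):
    return q.float() * scale + xmin

# Hypothetical cache shapes: [seq_len, num_heads * head_dim]
k = torch.randn(1024, 4096)
v = torch.randn(1024, 4096)

# Asymmetric treatment: K grouped per channel (reduce over tokens, dim=0)
# because key activations show per-channel outliers; V grouped per token.
k_q, k_s, k_m = quantize(k, dim=0)
v_q, v_s, v_m = quantize(v, dim=1)
print((dequantize(k_q, k_s, k_m) - k).abs().mean())  # reconstruction error
```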
Lightweight Modular AI Routing Engine for Local LLMs — Run specialised experts efficiently on consumer GPUs using smart Mixture-of-Experts routing.
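The engine's real gating logic isn't shown here; as a rough illustration of the routing idea, a learned top-k linear gate over expert logits is the standard Mixture-of-Experts building block. The embedding size and expert count below are made up:

```python
import torch
import torch.nn.functional as F

class Router(torch.nn.Module):
    """Top-k MoE gate: scores experts for an input and picks the best k."""
    def __init__(self, embed_dim: int, num_experts: int, k: int = 1):
        super().__init__()
        self.gate = torch.nn.Linear(embed_dim, num_experts, bias=False)
        self.k = k

    def forward(self, x: torch.Tensor):
        logits = self.gate(x)                       # [batch, num_experts]
        weights, idx = logits.topk(self.k, dim=-1)  # keep top-k experts
        return F.softmax(weights, dim=-1), idx      # mixing weights + ids

router = Router(embed_dim=768, num_experts=4)
w, idx = router(torch.randn(1, 768))
print(idx)  # which specialised expert(s) would serve this request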
RAM-Backed MCP Memory Architecture for Consumer LLM Inference — 900K token context on 16GB VRAM
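The repo's MCP protocol isn't sketched here; the underlying trick, treating pinned system RAM as a backing store for KV pages and faulting them into VRAM on access, can be illustrated with a toy LRU pager (all sizes hypothetical):

```python
import torch

class PagedKV:
    """Toy pager: hot KV pages live in VRAM, cold ones in pinned host RAM."""
    def __init__(self, max_vram_pages: int = 32):
        self.pages: dict[int, torch.Tensor] = {}
        self.hot: list[int] = []
        self.max_vram_pages = max_vram_pages

    def put(self, page_id: int, kv: torch.Tensor) -> None:
        self.pages[page_id] = kv.cuda()
        self.hot.append(page_id)
        if len(self.hot) > self.max_vram_pages:       # evict oldest page
            old = self.hot.pop(0)
            self.pages[old] = self.pages[old].cpu().pin_memory()

    def get(self, page_id: int) -> torch.Tensor:
        kv = self.pages[page_id]
        if not kv.is_cuda:                            # page fault: copy back
            kv = kv.cuda(non_blocking=True)           # fast copy from pinned RAM
            self.pages[page_id] = kv
            self.hot.append(page_id)                  # (re-eviction omitted)
        return kv

cache = PagedKV(max_vram_pages=2)
for i in range(4):                                    # pages 0 and 1 spill to RAM
    cache.put(i, torch.randn(256, 4096))
print(cache.get(0).is_cuda)                           # True after fault-in
```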
Dynamic GPU Layer Swapping: Train large models on consumer GPUs with intelligent memory management
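Ignoring gradients and optimizer state (which the real project must also stage), the core idea, holding only the active layer in VRAM, reduces to a just-in-time device swap. Layer count and width below are arbitrary:

```python
import torch
import torch.nn as nn

# Forward-pass staging only; training additionally requires re-staging
# layers for backward and keeping optimizer state in host RAM.
layers = nn.ModuleList([nn.Linear(4096, 4096) for _ in range(32)]).cpu()

@torch.no_grad()
def forward_swapped(x: torch.Tensor) -> torch.Tensor:
    x = x.cuda()
    for layer in layers:
        layer.cuda()   # stage this layer's weights into VRAM just in time
        x = layer(x)
        layer.cpu()    # release VRAM before staging the next layer
    return x

print(forward_swapped(torch.randn(8, 4096)).shape)
```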
Self-hosted LLM chat client with a streaming terminal UI (Python/Rich) for vLLM servers. Runs Mistral-24B locally on an RTX 4090/3090 as a privacy-focused ChatGPT alternative for homelab and gaming PCs.
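A self-contained sketch of such a client: vLLM exposes an OpenAI-compatible /v1/chat/completions endpoint, so streaming reduces to parsing server-sent-event lines. The port and model name are assumptions:

```python
import json
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",  # local vLLM server
    json={
        "model": "mistralai/Mistral-Small-24B-Instruct-2501",
        "messages": [{"role": "user", "content": "Hello"}],
        "stream": True,
    },
    stream=True,
)
for line in resp.iter_lines():
    if not line or not line.startswith(b"data: ") or line == b"data: [DONE]":
        continue
    delta = json.loads(line[6:])["choices"][0]["delta"]
    print(delta.get("content", ""), end="", flush=True)  # token-by-token UI
```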
Surgical reasoning on consumer silicon. Hybrid SSM + causal memory architecture with entropy-gated System 1/2 dispatch, O(1) inference memory, and continual learning — designed for 16 GB VRAM.
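The dispatch criterion can be sketched independently of the SSM backbone: compute the entropy of the next-token distribution and escalate to the slow path when it crosses a threshold. The threshold value and vocabulary size here are invented:

```python
import torch

def dispatch(logits: torch.Tensor, threshold: float = 2.5) -> str:
    """Entropy gate: confident predictions take the fast System 1 path,
    uncertain ones escalate to deliberate System 2 computation."""
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1)
    return "system2" if entropy.item() > threshold else "system1"

print(dispatch(torch.randn(32000)))  # hypothetical 32k-token vocabulary
```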
RAMP: RL-guided Adaptive Mixed-Precision quantization for GGUF models. Data-free sensitivity analysis, evolutionary search, per-tensor type optimization. Produces hardware-optimized GGUF for consumer GPUs.
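RAMP's RL guidance and GGUF type space aren't reproduced here; a toy version of the search, evolving per-tensor bit-widths against a data-free fitness (weight reconstruction error plus a size penalty), shows the shape of the loop:

```python
import random
import torch

# Stand-in tensors and bit-width choices; the real search targets GGUF types.
tensors = {f"blk.{i}.ffn": torch.randn(256, 256) for i in range(4)}
choices = [2, 3, 4, 5, 6, 8]

def quant_err(w: torch.Tensor, bits: int) -> float:
    """Data-free sensitivity proxy: symmetric round-trip quantization error."""
    s = w.abs().max() / (2 ** (bits - 1) - 1)
    return (w - (w / s).round() * s).pow(2).mean().item()

def fitness(assign: dict) -> float:
    err = sum(quant_err(tensors[n], b) for n, b in assign.items())
    size = sum(assign.values())
    return err + 0.01 * size            # accuracy vs. model-size trade-off

pop = [{n: random.choice(choices) for n in tensors} for _ in range(16)]
for _ in range(20):                      # mutate-and-select evolutionary loop
    pop.sort(key=fitness)
    child = dict(pop[0])                 # clone the current best assignment
    child[random.choice(list(child))] = random.choice(choices)  # mutate one
    pop[-1] = child                      # replace the worst individual
print(min(pop, key=fitness))             # best per-tensor bit-width map found
```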
Technical notes on building a local-first AI coding assistant with local LLMs, Ollama, SwiftUI, and consumer GPU constraints.
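On the Ollama side, the assistant's round trips reduce to calls against the local HTTP API; a minimal streaming request (the model tag is just an example of a locally pulled coding model) looks like:

```python
import json
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",   # default local Ollama endpoint
    json={
        "model": "qwen2.5-coder:7b",
        "messages": [{"role": "user", "content": "Explain this Swift closure."}],
        "stream": True,
    },
    stream=True,
)
for line in resp.iter_lines():           # Ollama streams one JSON object per line
    if not line:
        continue
    chunk = json.loads(line)
    print(chunk["message"]["content"], end="", flush=True)
    if chunk.get("done"):
        break
```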
Tiered GPU memory architecture for consumer AI inference. VRAM as execution cache, system RAM as passive staging layer.
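One way to realize "VRAM as execution cache, RAM as staging": double-buffer weight blocks and prefetch the next one on a side CUDA stream while the current one computes. Everything below (block sizes, the matmul standing in for a layer) is illustrative:

```python
import torch

copy_stream = torch.cuda.Stream()
host_blocks = [torch.randn(2048, 2048).pin_memory() for _ in range(8)]  # staging
dev_buf = [torch.empty(2048, 2048, device="cuda") for _ in range(2)]    # cache
x = torch.randn(64, 2048, device="cuda")

dev_buf[0].copy_(host_blocks[0], non_blocking=True)   # stage first block
for i in range(len(host_blocks)):
    cur, nxt = dev_buf[i % 2], dev_buf[(i + 1) % 2]
    if i + 1 < len(host_blocks):
        # Don't overwrite nxt until the compute that last read it finished.
        copy_stream.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(copy_stream):          # overlap copy and compute
            nxt.copy_(host_blocks[i + 1], non_blocking=True)
    x = x @ cur                                       # "layer" runs from VRAM
    torch.cuda.current_stream().wait_stream(copy_stream)
print(x.shape)
```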
GPT-OSS 20B local execution: a lightweight environment for running the model with Python 3.12 and CUDA acceleration. Run GPT-OSS 20B entirely offline, accelerate text generation on the GPU, and enable fast, secure inference on consumer hardware.
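A minimal offline loading sketch with Hugging Face transformers, assuming the openai/gpt-oss-20b weights were downloaded beforehand (HF_HUB_OFFLINE=1 blocks any network fallback):

```python
import os
os.environ["HF_HUB_OFFLINE"] = "1"   # force fully offline loading

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")
model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    torch_dtype=torch.bfloat16,      # half-precision to fit consumer VRAM
    device_map="auto",               # spill to CPU RAM if VRAM runs out
)
inputs = tok("Hello,", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```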
A comprehensive, modular framework for fine-tuning Stable Diffusion 3.5 models using LoRA (Low-Rank Adaptation). Create custom AI image generators tailored to your artistic style, objects, or concepts with memory-efficient training on consumer GPUs.
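The memory saving comes from training only the low-rank adapter matrices; attaching them to the SD3.5 transformer with diffusers and peft takes a few lines. Rank, alpha, and the target module names below are common choices, not this framework's settings:

```python
import torch
from diffusers import StableDiffusion3Pipeline
from peft import LoraConfig

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-medium", torch_dtype=torch.float16)

# LoRA on the attention projections of the diffusion transformer.
lora = LoraConfig(r=16, lora_alpha=16,
                  target_modules=["to_q", "to_k", "to_v", "to_out.0"])
pipe.transformer.add_adapter(lora)

# Only the injected low-rank A/B matrices require gradients.
trainable = [p for p in pipe.transformer.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable), "trainable parameters")
```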
ismail is a from-scratch Turkish language model implementation designed for low-end hardware, built and trained on a single RTX 5070 (12GB).
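Training from scratch in 12 GB typically leans on bf16 autocast plus gradient accumulation; a generic pattern (not ismail's actual loop, all sizes invented) is:

```python
import torch

model = torch.nn.TransformerEncoderLayer(512, 8, batch_first=True).cuda()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
accum = 8                                   # effective batch = 8 micro-batches
for step in range(accum * 4):
    x = torch.randn(4, 128, 512, device="cuda")    # small micro-batch fits VRAM
    with torch.autocast("cuda", dtype=torch.bfloat16):
        loss = model(x).pow(2).mean()              # dummy loss for the sketch
    (loss / accum).backward()                      # scale for accumulation
    if (step + 1) % accum == 0:
        opt.step()
        opt.zero_grad()
```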
PILON (Primitive-Induced Linear Operator Network) explores a compositional weight parameterization for transformer FFN layers. The goal is to replace dense FFN matrices with shared low-rank primitives plus learned composition weights.
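Reading the description literally, each FFN weight is a learned mixture over a shared bank of low-rank primitives; a minimal parameterization (the sizes and the einsum composition are my assumptions) could look like:

```python
import torch
import torch.nn as nn

class PrimitiveFFN(nn.Module):
    """FFN weight composed from shared low-rank primitives U_i @ V_i,
    mixed by learned coefficients alpha; no dense matrix is a parameter."""
    def __init__(self, d: int, hidden: int, n_prim: int = 16, rank: int = 8):
        super().__init__()
        self.U = nn.Parameter(torch.randn(n_prim, d, rank) * 0.02)
        self.V = nn.Parameter(torch.randn(n_prim, rank, hidden) * 0.02)
        self.alpha = nn.Parameter(torch.ones(n_prim) / n_prim)  # composition

    def weight(self) -> torch.Tensor:
        # W = sum_i alpha_i * U_i @ V_i, composed on the fly each forward.
        return torch.einsum("p,pdr,prh->dh", self.alpha, self.U, self.V)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(x @ self.weight())

ffn = PrimitiveFFN(d=512, hidden=2048)
print(ffn(torch.randn(2, 512)).shape)   # torch.Size([2, 2048])
```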