🌸 Run LLMs at home, BitTorrent-style. Fine-tuning and inference up to 10x faster than offloading
Updated Sep 7, 2024 - Python
InternEvo is an open-source, lightweight training framework that aims to support model pre-training without the need for extensive dependencies.
Slicing a PyTorch Tensor Into Parallel Shards
LLM inference engine from scratch — paged KV cache, continuous batching, chunked prefill, prefix caching, speculative decoding, CUDA graph, tensor parallelism, OpenAI-compatible serving
Decentralized LLM fine-tuning and inference with offloading
Large-scale 4D parallelism pre-training for 🤗 transformers with Mixture of Experts *(still a work in progress)*
gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling
JORA: JAX Tensor-Parallel LoRA Library (ACL 2024)
A distributed training framework for large language models powered by Lightning.
Fast and easy distributed model training examples.
Tensor Parallelism with JAX + Shard Map
GPU Memory Calculator for LLM Training - Calculate GPU memory requirements for training Large Language Models with support for multiple training engines including PyTorch DDP, DeepSpeed ZeRO, Megatron-LM, and FSDP.
A production-grade, native Rust speculative inference engine for Apple Silicon with Metal GPU acceleration and paged attention.
Multi-GPU tensor/context parallel diffusion on AMD ROCm — with the patch that makes it actually work.
Production-grade LLM inference API built from scratch. NestJS gateway + Python GPU workers. Scheduling, batching, KV cache, tensor parallelism, multi-modal — all against real GPUs.
A reference implementation of Matrix Multiplication algorithms for ML on UPMEM PIM - a processing-in-memory platform
This repository focuses on distributed and parallel computing with PyTorch, covering model parallelism, data parallelism, and advanced optimization techniques. It provides resources for scaling AI training and inference efficiently across multiple devices.
Interactive 3D visualization of dense decoder-only LLM inference. Companion to the AI Inference Engineer 2026 course.
vLLM - High-throughput, memory-efficient LLM inference engine with PagedAttention, continuous batching, CUDA/HIP optimization, quantization (GPTQ/AWQ/INT4/INT8/FP8), tensor/pipeline parallelism, OpenAI-compatible API, multi-GPU/TPU/Neuron support, prefix caching, and multi-LoRA capabilities