
CUDA LLM Kernel Optimization


LLM-Speed is a CUDA kernel optimization project for LLM inference, covering FlashAttention, Tensor Core GEMM (HGEMM), pybind11 Python bindings, and correctness/benchmark verification workflows.

Repository Overview

  • CUDA kernels in src/ and reusable primitives in include/
  • Python bindings and packaging in python/, setup.py, and pyproject.toml
  • Tests and benchmarks in tests/ and benchmarks/
  • GitHub Pages site for documentation entry, reading paths, and project updates

Quick Start

# Install Python dependencies and the package in editable mode
pip install -r requirements.txt
pip install -e .

# Configure and build the CUDA kernels (requires CMake and the CUDA toolkit)
cmake --preset release
cmake --build build/release -j$(nproc)

# Run the test suite
pytest tests/ -v
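The verification workflow the tests follow (comparing kernel output against a reference implementation within a floating-point tolerance) can be sketched in plain Python. The function names below are illustrative only, not the project's actual API:

```python
def max_abs_error(actual, expected):
    """Largest element-wise absolute difference between two flat lists."""
    return max(abs(a - e) for a, e in zip(actual, expected))

def check_close(actual, expected, atol=1e-2):
    """Return True if every element agrees within atol, a typical
    tolerance when comparing fp16 kernel output to an fp32 reference."""
    return max_abs_error(actual, expected) <= atol

# Example: a hypothetical "kernel" result with small rounding error
# versus the reference values.
reference = [0.1, 0.2, 0.3]
kernel_out = [0.1001, 0.1999, 0.3002]
print(check_close(kernel_out, reference))  # True
```

In the real test suite this comparison would be done with framework tensor utilities rather than Python lists, but the tolerance-based pass/fail criterion is the same idea.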

Docs

  • Project docs: https://lessup.github.io/llm-speed/
  • Site home explains where to start, what to read next, and how the docs are organized
  • See CONTRIBUTING.md for contribution workflow

License

MIT License

About

CUDA kernel library for LLM inference acceleration: FlashAttention, HGEMM, and Tensor Core GEMM, with pybind11 Python bindings.
