English | 简体中文
A CUDA SGEMM engineering notebook designed both for in-depth learning and for interview presentation: from readable FP32 baselines to guarded Tensor Core WMMA, with cuBLAS-backed verification and explicit benchmark boundaries.
- Progressive kernel ladder: naive -> tiled -> bank-conflict-free -> double-buffer -> Tensor Core.
- Evidence-first reporting: performance claims are paired with correctness policy and scope labels.
- Comparable interfaces: FP32 kernels share a unified `(A, B, C, M, K, N, stream)` launcher contract.
- Interview-ready narrative: dedicated pages for project highlights, interview walkthrough, and references.
- Bilingual mirrored docs: English and Chinese public pages stay aligned.
```shell
git clone https://github.com/LessUp/sgemm-optimization.git
cd sgemm-optimization
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
./build/bin/sgemm_benchmark -a
ctest --test-dir build
```

Runtime tests and benchmarks require a CUDA-capable local machine. Hosted CI is limited to compile-time, formatting, repository-structure, OpenSpec, and Pages checks.
| Goal | Entry point |
|---|---|
| Open English home | Docs Home |
| Open Chinese home | 中文首页 |
| Build and run once | Getting Started |
| Understand differentiation | Project Highlights |
| Prepare interview explanation | Interview Playbook |
| Trace technical lineage | References |
| Read normative specs | OpenSpec Specs |

| Environment | What to trust |
|---|---|
| Hosted CI | Formatting, compile validity, OpenSpec structure, Pages buildability |
| Local CUDA GPU | Runtime correctness verification and benchmark performance |
This split is deliberate: hosted CI keeps the repository healthy, while real GPU hardware is the only place runtime behavior and speed claims are validated.
```
src/kernels/   CUDA SGEMM implementations
src/utils/     CUDA RAII, verification, benchmark helpers
src/main.cu    benchmark CLI
tests/         Google Test coverage against cuBLAS
docs/          learning documentation mirrored on Pages
openspec/      stable specs and change workflow
```
MIT. See LICENSE.md.