Merged
2 changes: 1 addition & 1 deletion CMakeLists.txt
@@ -6,7 +6,7 @@ option(USE_OMP "Use OpenMP as backend for Eigen" ON)
option(USE_NCCL "Build project for distributed running" ON)
option(BUILD_TEST "Build InfiniTrain tests" OFF)

-project(infini_train VERSION 0.5.0 LANGUAGES CXX)
+project(infini_train VERSION 0.6.0 LANGUAGES CXX)

set(CMAKE_CXX_STANDARD 20)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
41 changes: 39 additions & 2 deletions README.md
@@ -58,6 +58,7 @@ Build Options:
| ------------------------- | ------------------------------- | ---------------------------------------------------- | -------------- |
| Model Support | GPT-2 | Decoder-only Transformer language model | ✔ Supported |
| | LLaMA 3 | Modern LLaMA-family Transformer architecture | ✔ Supported |
| | Qwen3-8B | Qwen3 8B language model | 🗓 Planned |
| | DeepSeek-V3 | Large-scale MoE-based language model | 🗓 Planned |
| Precision | Multiple Data Type | FP32, BF16 | ✔ Supported |
| | Mixed Precision | Autocast-based BF16 compute with FP32 accumulation | ✔ Supported |
@@ -69,15 +70,23 @@ Build Options:
| | Hybrid Parallelism | Arbitrary combination of DDP + TP + SP + PP | ✔ Supported |
| Core Components | Multi-backend | CPU and CUDA execution backends | ✔ Supported |
| | Multi-node Distributed Training | Distributed execution across multiple nodes | ✔ Supported |
| | Transformer Abstraction | Generic Transformer structure abstraction | ✔ Supported |
| | Backend Registries | Device / CCL / dtype abstraction and registration | ✔ Supported |
| | Kernel Dispatcher | Kernel registration and dynamic dispatch mechanism | ✔ Supported |
| | Autograd | Automatic differentiation engine | ✔ Supported |
| | Autocast | Automatic mixed precision runtime | ✔ Supported |
| | Checkpointing | Training checkpoint save and restore | 🗓 Planned |
| Fine-tuning | LoRA | Memory-efficient fine-tuning with merge / unmerge | ✔ Supported |
| Memory Optimizations | ZeRO Stage-1 | Sharded optimizer states for DDP | ✔ Supported |
| | ZeRO Stage-2 | Sharded gradients across DDP ranks | ✔ Supported |
| | Activation Recomputation | Recompute activations to reduce memory usage | 🗓 Planned |
| Performance Optimizations | Compute–Comm Overlap | Explicit scheduling to hide communication latency | ✔ Supported |
| | DDP Gradient Bucketing | Deferred and bucketed gradient synchronization | ✔ Supported |
| | ZeRO-DP | DistributedOptimizer-based ZeRO-1 | 🚧 In Progress |
| Execution Mode | Training Mode | Full forward–backward training with autograd | ✔ Supported |
| | `no_grad` Inference | Forward-only execution without gradient tracking | ✔ Supported |
| Debugging & Tooling | Built-in Profiler | Kernel-level performance profiling | ✔ Supported |
| | Precision Alignment Checker | Function / Module precision checks and E2E loss diff | ✔ Supported |
| | CTest + GTest Infrastructure | Automated unit tests with CTest integration | ✔ Supported |
| | Automated Benchmarking | One-click execution, log analysis and Feishu export | ✔ Supported |

## 🏋️ Training
@@ -171,4 +180,32 @@ Multiple parallelism strategies (DDP, TP, SP, PP) can be freely combined to scal
Added Autocast, multi-dimensional distributed parallelism
(DDP, TP, SP, PP with GPipe / 1F1B / vPP),
multi-node training, `no_grad` mode,
-and communication–computation overlap with bucketed gradient synchronization.
+and communication–computation overlap with bucketed gradient synchronization.

- **2026/06/08** — InfiniTrain **v0.6.0**

Added loss alignment tooling for Function / Module level precision checks
and end-to-end loss comparison, with a unified hook mechanism.

Added memory optimizations for DDP training and Autograd execution.
ZeRO Stage-1 shards optimizer states across DDP ranks, while ZeRO Stage-2
further shards gradients. Autograd Tensor release timing was also optimized
to reduce peak memory usage.

Introduced LoRA fine-tuning with `merge` / `unmerge` support for efficient
training and inference-time weight merging.

Refactored core backend abstractions around device, communication, and
low-precision dtype registration. The framework layer now uses
`DeviceGuard`, `CclGroupGuard`, and backend-registered FP16 / BF16 native
types to avoid hardware-specialized framework code.

Introduced a generic Transformer structure abstraction backed by
`TransformerConfig`, providing a common foundation for GPT-2 and LLaMA 3
style model construction.

Improved BF16 training performance through autocast and elementwise kernel
optimizations.

Integrated a CTest + GTest based testing infrastructure to strengthen the
framework's automated test workflow.
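
As a rough illustration of the `merge` / `unmerge` idea in the v0.6.0 notes, the sketch below folds a low-rank update into a weight matrix and removes it again, using the standard LoRA formulation `W' = W + (alpha / r) * B * A`. The `Matrix` struct and the `ApplyLoRADelta`, `Merge`, and `Unmerge` functions are hypothetical names for this example and are not InfiniTrain's API; the `alpha / r` scaling convention is likewise an assumption.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical dense row-major matrix used only for this sketch;
// it is not an InfiniTrain tensor type.
struct Matrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<float> data;

    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0f) {}

    float &at(std::size_t i, std::size_t j) { return data[i * cols + j]; }
    float at(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Adds sign * scale * (B * A) to W in place.
// Shapes: W is (out x in), B is (out x r), A is (r x in).
void ApplyLoRADelta(Matrix &W, const Matrix &B, const Matrix &A, float scale, float sign) {
    for (std::size_t i = 0; i < W.rows; ++i) {
        for (std::size_t j = 0; j < W.cols; ++j) {
            float delta = 0.0f;
            for (std::size_t k = 0; k < A.rows; ++k) {
                delta += B.at(i, k) * A.at(k, j);
            }
            W.at(i, j) += sign * scale * delta;
        }
    }
}

// Folds the adapter into the base weight, so inference needs one matmul.
void Merge(Matrix &W, const Matrix &B, const Matrix &A, float alpha) {
    ApplyLoRADelta(W, B, A, alpha / static_cast<float>(A.rows), +1.0f);
}

// Subtracts the same update, restoring the base weight up to float rounding.
void Unmerge(Matrix &W, const Matrix &B, const Matrix &A, float alpha) {
    ApplyLoRADelta(W, B, A, alpha / static_cast<float>(A.rows), -1.0f);
}
```

Calling `Merge(W, B, A, alpha)` before inference avoids the extra low-rank matmul per forward pass, and `Unmerge` with the same arguments returns to the separable form so adapter training can continue. In general the round trip is exact only up to floating-point rounding.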