diff --git a/CMakeLists.txt b/CMakeLists.txt
index 4c6da822..e25de71d 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -6,7 +6,7 @@ option(USE_OMP "Use OpenMP as backend for Eigen" ON)
 option(USE_NCCL "Build project for distributed running" ON)
 option(BUILD_TEST "Build InfiniTrain tests" OFF)
 
-project(infini_train VERSION 0.5.0 LANGUAGES CXX)
+project(infini_train VERSION 0.6.0 LANGUAGES CXX)
 
 set(CMAKE_CXX_STANDARD 20)
 set(CMAKE_CXX_STANDARD_REQUIRED ON)
diff --git a/README.md b/README.md
index 3dc739ba..1a51d1a8 100644
--- a/README.md
+++ b/README.md
@@ -58,6 +58,7 @@ Build Options:
 | ------------------------- | ------------------------------- | ---------------------------------------------------- | -------------- |
 | Model Support | GPT-2 | Decoder-only Transformer language model | ✔ Supported |
 | | LLaMA 3 | Modern LLaMA-family Transformer architecture | ✔ Supported |
+| | Qwen3-8B | Qwen3 8B language model | 🗓 Planned |
 | | DeepSeek-V3 | Large-scale MoE-based language model | 🗓 Planned |
 | Precision | Multiple Data Type | FP32, BF16 | ✔ Supported |
 | | Mixed Precision | Autocast-based BF16 compute with FP32 accumulation | ✔ Supported |
@@ -69,15 +70,23 @@ Build Options:
 | | Hybrid Parallelism | Arbitrary combination of DDP + TP + SP + PP | ✔ Supported |
 | Core Components | Multi-backend | CPU and CUDA execution backends | ✔ Supported |
 | | Multi-node Distributed Training | Distributed execution across multiple nodes | ✔ Supported |
+| | Transformer Abstraction | Generic Transformer structure abstraction | ✔ Supported |
+| | Backend Registries | Device / CCL / dtype abstraction and registration | ✔ Supported |
 | | Kernel Dispatcher | Kernel registration and dynamic dispatch mechanism | ✔ Supported |
 | | Autograd | Automatic differentiation engine | ✔ Supported |
 | | Autocast | Automatic mixed precision runtime | ✔ Supported |
+| | Checkpointing | Training checkpoint save and restore | 🗓 Planned |
+| Fine-tuning | LoRA | Memory-efficient fine-tuning with merge / unmerge | ✔ Supported |
+| Memory Optimizations | ZeRO Stage-1 | Sharded optimizer states for DDP | ✔ Supported |
+| | ZeRO Stage-2 | Sharded gradients across DDP ranks | ✔ Supported |
+| | Activation Recomputation | Recompute activations to reduce memory usage | 🗓 Planned |
 | Performance Optimizations | Compute–Comm Overlap | Explicit scheduling to hide communication latency | ✔ Supported |
 | | DDP Gradient Bucketing | Deferred and bucketed gradient synchronization | ✔ Supported |
-| | ZeRO-DP | DistributedOptimizer-based ZeRO-1 | 🚧 In Progress |
 | Execution Mode | Training Mode | Full forward–backward training with autograd | ✔ Supported |
 | | `no_grad` Inference | Forward-only execution without gradient tracking | ✔ Supported |
 | Debugging & Tooling | Built-in Profiler | Kernel-level performance profiling | ✔ Supported |
+| | Precision Alignment Checker | Function / Module precision checks and E2E loss diff | ✔ Supported |
+| | CTest + GTest Infrastructure | Automated unit tests with CTest integration | ✔ Supported |
 | | Automated Benchmarking | One-click execution, log analysis and Feishu export | ✔ Supported |
 
 ## 🏋️ Training
@@ -171,4 +180,32 @@ Multiple parallelism strategies (DDP, TP, SP, PP) can be freely combined to scal
 
   Added Autocast, multi-dimensional distributed parallelism (DDP, TP, SP, PP
   with GPipe / 1F1B / vPP), multi-node training, `no_grad` mode,
-  and communication–computation overlap with bucketed gradient synchronization.
\ No newline at end of file
+  and communication–computation overlap with bucketed gradient synchronization.
+
+- **2026/06/08** — InfiniTrain **v0.6.0**
+
+  Added loss alignment tooling for Function / Module level precision checks
+  and end-to-end loss comparison, with a unified hook mechanism.
+
+  Added memory optimizations for DDP training and Autograd execution.
+  ZeRO Stage-1 shards optimizer states across DDP ranks, while ZeRO Stage-2
+  further shards gradients. Autograd Tensor release timing was also optimized
+  to reduce peak memory usage.
+
+  Introduced LoRA fine-tuning with `merge` / `unmerge` support for efficient
+  training and inference-time weight merging.
+
+  Refactored core backend abstractions around device, communication, and
+  low-precision dtype registration. The framework layer now uses
+  `DeviceGuard`, `CclGroupGuard`, and backend-registered FP16 / BF16 native
+  types to avoid hardware-specialized framework code.
+
+  Introduced a generic Transformer structure abstraction backed by
+  `TransformerConfig`, providing a common foundation for GPT-2 and LLaMA 3
+  style model construction.
+
+  Improved BF16 training performance through autocast and elementwise kernel
+  optimizations.
+
+  Integrated a CTest + GTest based testing infrastructure to strengthen the
+  framework's automated test workflow.
\ No newline at end of file