InfiniTensor · kilinchange · Jun 8, 2026 · Jun 8, 2026
diff --git a/CMakeLists.txt b/CMakeLists.txt
@@ -6,7 +6,7 @@ option(USE_OMP "Use OpenMP as backend for Eigen" ON)
 option(USE_NCCL "Build project for distributed running" ON)
 option(BUILD_TEST "Build InfiniTrain tests" OFF)
 
-project(infini_train VERSION 0.5.0 LANGUAGES CXX)
+project(infini_train VERSION 0.6.0 LANGUAGES CXX)
 
 set(CMAKE_CXX_STANDARD 20)
 set(CMAKE_CXX_STANDARD_REQUIRED ON)

diff --git a/README.md b/README.md
@@ -58,6 +58,7 @@ Build Options:
 | ------------------------- | ------------------------------- | ---------------------------------------------------- | -------------- |
 | Model Support             | GPT-2                           | Decoder-only Transformer language model              | ✔ Supported    |
 |                           | LLaMA 3                         | Modern LLaMA-family Transformer architecture         | ✔ Supported    |
+|                           | Qwen3-8B                        | Qwen3 8B language model                              | 🗓 Planned     |
 |                           | DeepSeek-V3                     | Large-scale MoE-based language model                 | 🗓 Planned     |
 | Precision                 | Multiple Data Type              | FP32, BF16                                           | ✔ Supported    |
 |                           | Mixed Precision                 | Autocast-based BF16 compute with FP32 accumulation   | ✔ Supported    |
@@ -69,15 +70,23 @@ Build Options:
 |                           | Hybrid Parallelism              | Arbitrary combination of DDP + TP + SP + PP          | ✔ Supported    |
 | Core Components           | Multi-backend                   | CPU and CUDA execution backends                      | ✔ Supported    |
 |                           | Multi-node Distributed Training | Distributed execution across multiple nodes          | ✔ Supported    |
+|                           | Transformer Abstraction         | Generic Transformer structure abstraction            | ✔ Supported    |
+|                           | Backend Registries              | Device / CCL / dtype abstraction and registration    | ✔ Supported    |
 |                           | Kernel Dispatcher               | Kernel registration and dynamic dispatch mechanism   | ✔ Supported    |
 |                           | Autograd                        | Automatic differentiation engine                     | ✔ Supported    |
 |                           | Autocast                        | Automatic mixed precision runtime                    | ✔ Supported    |
+|                           | Checkpointing                   | Training checkpoint save and restore                 | 🗓 Planned     |
+| Fine-tuning               | LoRA                            | Memory-efficient fine-tuning with merge / unmerge    | ✔ Supported    |
+| Memory Optimizations      | ZeRO Stage-1                    | Sharded optimizer states for DDP                     | ✔ Supported    |
+|                           | ZeRO Stage-2                    | Sharded gradients across DDP ranks                   | ✔ Supported    |
+|                           | Activation Recomputation        | Recompute activations to reduce memory usage         | 🗓 Planned     |
 | Performance Optimizations | Compute–Comm Overlap            | Explicit scheduling to hide communication latency    | ✔ Supported    |
 |                           | DDP Gradient Bucketing          | Deferred and bucketed gradient synchronization       | ✔ Supported    |
-|                           | ZeRO-DP                         | DistributedOptimizer-based ZeRO-1                    | 🚧 In Progress |
 | Execution Mode            | Training Mode                   | Full forward–backward training with autograd         | ✔ Supported    |
 |                           | `no_grad` Inference             | Forward-only execution without gradient tracking     | ✔ Supported    |
 | Debugging & Tooling       | Built-in Profiler               | Kernel-level performance profiling                   | ✔ Supported    |
+|                           | Precision Alignment Checker     | Function / Module precision checks and E2E loss diff | ✔ Supported    |
+|                           | CTest + GTest Infrastructure    | Automated unit tests with CTest integration          | ✔ Supported    |
 |                           | Automated Benchmarking          | One-click execution, log analysis and Feishu export  | ✔ Supported    |
 
 ## 🏋️ Training
@@ -171,4 +180,32 @@ Multiple parallelism strategies (DDP, TP, SP, PP) can be freely combined to scal
   Added Autocast, multi-dimensional distributed parallelism
    (DDP, TP, SP, PP with GPipe / 1F1B / vPP),
    multi-node training, `no_grad` mode,
-   and communication–computation overlap with bucketed gradient synchronization.
+   and communication–computation overlap with bucketed gradient synchronization.
+
+- **2026/06/08** — InfiniTrain **v0.6.0**
+
+  Added loss alignment tooling for Function- and Module-level precision checks
+   and end-to-end loss comparison, with a unified hook mechanism.
+
+  Added memory optimizations for DDP training and Autograd execution.
+   ZeRO Stage-1 shards optimizer states across DDP ranks, while ZeRO Stage-2
+   further shards gradients. Autograd Tensor release timing was also optimized
+   to reduce peak memory usage.
+
+  Introduced LoRA fine-tuning with `merge` / `unmerge` support for efficient
+   training and inference-time weight merging.
+
+  Refactored core backend abstractions around device, communication, and
+   low-precision dtype registration. The framework layer now uses
+   `DeviceGuard`, `CclGroupGuard`, and backend-registered FP16 / BF16 native
+   types to avoid hardware-specialized framework code.
+
+  Introduced a generic Transformer structure abstraction backed by
+   `TransformerConfig`, providing a common foundation for GPT-2 and LLaMA 3
+   style model construction.
+
+  Improved BF16 training performance through autocast and elementwise kernel
+   optimizations.
+
+  Integrated a testing infrastructure based on CTest + GTest to strengthen the
+   framework's automated test workflow.