Feat: add checkpoint loading mechanism by JYMiracle305 · Pull Request #146 · InfiniTensor/InfiniTrain

JYMiracle305 · 2026-04-21T05:56:22Z

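Main parameters of the checkpoint loading tool:

--checkpoint_dir directory where checkpoints are saved during training
--save_steps save a checkpoint every N steps; set to 0 to disable saving
--max_checkpoint_keep keep at most K checkpoints
--save_optimizer_state whether to save the optimizer state
--resume_from resume training from the specified checkpoint directory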
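As a rough illustration of how --save_steps and --max_checkpoint_keep could interact, here is a minimal sketch. The helper names `ShouldSave` and `CheckpointsToPrune` are hypothetical and are not taken from this PR; the directory naming follows the `checkpoint_step_000040` pattern used in the commands further down, and the sketch assumes every name uses the same zero-padding width.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: whether a checkpoint is due at `step`.
// save_steps == 0 disables saving, matching the flag description.
bool ShouldSave(long long step, long long save_steps) {
    return save_steps > 0 && step > 0 && step % save_steps == 0;
}

// Hypothetical helper: given checkpoint directory names such as
// "checkpoint_step_000040", return the ones to delete so that only the
// newest `max_keep` remain. Zero-padded step numbers sort lexicographically,
// assuming all names share the same padding width.
std::vector<std::string> CheckpointsToPrune(std::vector<std::string> names, std::size_t max_keep) {
    std::sort(names.begin(), names.end()); // oldest first
    if (names.size() <= max_keep) {
        return {};
    }
    names.resize(names.size() - max_keep); // only the oldest surplus is left
    return names;
}
```

With --save_steps 20 and --max_checkpoint_keep 10 over 100 iterations, this sketch would save at steps 20, 40, 60, 80, and 100 and prune nothing.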

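Checkpoint files can be produced by training from the original model parameters under /data/shared/....../llmc/gpt2 (or llama3); see REPORT.md in the repository for an example (the experiments also covered llama3, but only the GPT2 training commands were recorded). model.bin, optimizer.bin, and trainer_state.json can all be obtained from training, so they are not provided as attachments.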

Experiment

CUDA_VISIBLE_DEVICES=5,6,7 ./gpt2 --input_bin ../../data/llmc/gpt2/tinyshakespeare/tiny_shakespeare_train.bin --llmc_filepath ../../data/llmc/gpt2/gpt2_124M.bin --checkpoint_dir ../ckpt2/gpt2-noresume/ --num_iteration 100 --save_steps 20 --save_optimizer_state true --max_checkpoint_keep 10

CUDA_VISIBLE_DEVICES=5,6,7 ./gpt2 --input_bin ../../data/llmc/gpt2/tinyshakespeare/tiny_shakespeare_train.bin --llmc_filepath ../../data/llmc/gpt2/gpt2_124M.bin --checkpoint_dir ../ckpt2/gpt2-resumefrom40/ --num_iteration 100 --save_steps 20 --save_optimizer_state true --max_checkpoint_keep 10 --resume_from ../ckpt2/gpt2-noresume/checkpoint_step_000040/ > ../ckpt2/gpt2-resumefrom40/gpt2-resume.log 2>&1

(The same two training commands were also run with llama3.)

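Running compare_loss.py on the llama3 model: because training resumes from step 40, data for steps 1-40 is missing from the resumed log, while the loss for the remaining 60 steps matches under both FP32 and BF16. The missing steps are presumably why the overall summaries below still report 0/1 test cases passed.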

  Summary: 60/100 steps matched

==================================================
Overall Summary:
  fp32:    0/1 test cases passed (threshold: 1e-05)
  bfloat16: 0/0 test cases passed (threshold: 1e-02)
  Total:   0/1 test cases passed
==================================================

==================================================
Overall Summary:
  fp32:    0/0 test cases passed (threshold: 1e-05)
  bfloat16: 0/1 test cases passed (threshold: 1e-02)
  Total:   0/1 test cases passed
==================================================
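The per-step comparison behind a line like "60/100 steps matched" can be sketched as follows. This is not the actual compare_loss.py logic: the thresholds are the ones printed in the summaries above, but the use of an absolute difference and the function name `CountMatchedSteps` are assumptions made for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <map>

// Tolerances taken from the thresholds printed in the summaries above.
constexpr double kFp32Threshold = 1e-5;
constexpr double kBf16Threshold = 1e-2;

// Count steps present in both logs whose losses differ by at most `threshold`.
// Steps missing from either log (for example 1..40 in the resumed run) are skipped.
// Comparing the absolute difference is an assumption; the real script may differ.
std::size_t CountMatchedSteps(const std::map<int, double> &baseline,
                              const std::map<int, double> &resumed, double threshold) {
    std::size_t matched = 0;
    for (const auto &entry : resumed) {
        auto it = baseline.find(entry.first);
        if (it != baseline.end() && std::fabs(it->second - entry.second) <= threshold) {
            ++matched;
        }
    }
    return matched;
}
```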

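For GPT2, the model-saving logic was wrong: during training, lm_head and wte are not truly shared, yet the LLMC save/load path assumes that they are, so after a resume lm_head can easily diverge from the no-resume run. The fix is to switch the training checkpoint from the LLMC callback path to the native StateDict binary path and to explicitly rebuild the weight-tying semantics after loading (example/gpt2/main.cc). After this fix, the GPT2 comparison also passes.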
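To make "rebuilding the weight-tying semantics" concrete, the sketch below re-points one state-dict entry at the same tensor object as another, so the two names share storage instead of holding separate copies. The `Tensor` struct is a stand-in, and both `TieWeights` and the key names in the usage note are hypothetical; the actual change lives in example/gpt2/main.cc and may look different.

```cpp
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// Stand-in for the framework's tensor type, for illustration only.
struct Tensor {
    std::vector<float> data;
};

using StateDict = std::unordered_map<std::string, std::shared_ptr<Tensor>>;

// Re-establish weight tying after loading: make `tied_key` refer to the very
// same tensor object as `source_key`. Returns false if the source is missing.
bool TieWeights(StateDict &state, const std::string &source_key, const std::string &tied_key) {
    auto it = state.find(source_key);
    if (it == state.end()) {
        return false;
    }
    // Copy the pointer first, since inserting a new key may rehash the map.
    std::shared_ptr<Tensor> shared = it->second;
    state[tied_key] = shared; // share storage rather than copy values
    return true;
}
```

A call such as `TieWeights(state, "wte.weight", "lm_head.weight")` (key names assumed) would leave both entries pointing at one tensor, so an update through either name is visible through the other.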


chen2021673 · 2026-05-08T01:55:03Z

+DEFINE_string(checkpoint_format, "pth",
+              "checkpoint format: bin|pth. "
+              "'bin' generates model.bin/optimizer.bin (bin supports LLMC model format via callbacks); "
+              "'pth' generates model.pth/optimizer.pth (native StateDict binary).");


Do pth and bin store the same native StateDict binary format here? If so, could we drop pth?

chen2021673 · 2026-05-08T02:09:20Z

+std::unordered_map<std::string, std::shared_ptr<Tensor>> Adam::StateDict() const {
+    std::unordered_map<std::string, std::shared_ptr<Tensor>> state;
+    for (size_t i = 0; i < m_.size(); ++i) {
+        state.emplace(std::format("adam.m.{}", i), m_[i]);


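m_ here is derived from Module::Parameters(), which returns a vector whose order is not strictly guaranteed, so keying the state by index may be risky.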
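One possible way to address the ordering concern is to key optimizer state by parameter name instead of by position. The sketch below is only an illustration under assumptions: `NamedMomentState`, the `NamedTensors` alias, and the availability of a name for each parameter are all hypothetical and not part of this PR.

```cpp
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct Tensor {}; // stand-in for the framework's tensor type

using StateDict = std::unordered_map<std::string, std::shared_ptr<Tensor>>;
using NamedTensors = std::vector<std::pair<std::string, std::shared_ptr<Tensor>>>;

// Key the first-moment buffers by parameter name rather than by index, so a
// reordered parameter list cannot silently attach a buffer to the wrong parameter.
// `named_params` and `moments` are assumed to be aligned at save time.
StateDict NamedMomentState(const NamedTensors &named_params,
                           const std::vector<std::shared_ptr<Tensor>> &moments) {
    if (named_params.size() != moments.size()) {
        throw std::invalid_argument("parameter and moment counts differ");
    }
    StateDict state;
    for (std::size_t i = 0; i < moments.size(); ++i) {
        state.emplace("adam.m." + named_params[i].first, moments[i]);
    }
    return state;
}
```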

chen2021673 · 2026-05-08T02:14:31Z

    return state;
 }

+void Module::LoadStateDict(const std::unordered_map<std::string, std::shared_ptr<Tensor>> &state_dict) {


Should we check here whether the checkpoint contains extra keys?
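A check of this kind could compare the model's expected keys against the checkpoint's keys in both directions. The following is a minimal sketch; `KeyDiff` and `DiffKeys` are hypothetical names, and whether extra keys should be an error or only a warning is left to the caller.

```cpp
#include <algorithm>
#include <string>
#include <unordered_set>
#include <vector>

struct KeyDiff {
    std::vector<std::string> missing;    // expected by the model, absent from the checkpoint
    std::vector<std::string> unexpected; // present in the checkpoint, unknown to the model
};

// Compare the two key sets and report the differences in sorted order.
KeyDiff DiffKeys(const std::unordered_set<std::string> &model_keys,
                 const std::unordered_set<std::string> &checkpoint_keys) {
    KeyDiff diff;
    for (const auto &key : model_keys) {
        if (checkpoint_keys.count(key) == 0) {
            diff.missing.push_back(key);
        }
    }
    for (const auto &key : checkpoint_keys) {
        if (model_keys.count(key) == 0) {
            diff.unexpected.push_back(key);
        }
    }
    // Sort so that log messages are deterministic across runs.
    std::sort(diff.missing.begin(), diff.missing.end());
    std::sort(diff.unexpected.begin(), diff.unexpected.end());
    return diff;
}
```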

chen2021673 · 2026-05-08T02:27:37Z

@@ -0,0 +1,1060 @@
+#include "example/common/checkpoint_loader.h"


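This file is too heavy: at over a thousand lines, it mixes the checkpoint infrastructure with the llama / gpt save / load code. Should we split out example/common/checkpoint_utils.h/.cc and keep the gpt2- and llama3-specific calls separate? This is open for discussion.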

chen2021673 · 2026-05-08T06:14:13Z

+
+ResumeFromCheckpointResult ResumeFromCheckpoint(const ResumeFromCheckpointArgs &args) {
+    ResumeFromCheckpointResult result;
+    int ddp_world_size = nn::parallel::global::GetDataParallelSize();


Do we also need to check the TP / PP / SP sizes here?
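If the answer is yes, one option is to record the full parallel layout in the checkpoint and compare it on resume. The sketch below assumes a checkpoint is only valid for the exact layout it was saved with, which may be stricter than necessary; `ParallelLayout` and `DescribeLayoutMismatch` are hypothetical names.

```cpp
#include <string>

// Hypothetical record of the parallel layout stored alongside a checkpoint.
struct ParallelLayout {
    int dp; // data parallel size
    int tp; // tensor parallel size
    int pp; // pipeline parallel size
    int sp; // sequence parallel size
};

// Return an empty string if the layouts agree, otherwise a description of
// every dimension that differs.
std::string DescribeLayoutMismatch(const ParallelLayout &saved, const ParallelLayout &current) {
    std::string message;
    auto check = [&message](const char *name, int saved_size, int current_size) {
        if (saved_size != current_size) {
            message += std::string(name) + ": saved " + std::to_string(saved_size) + ", current "
                     + std::to_string(current_size) + "; ";
        }
    };
    check("DP", saved.dp, current.dp);
    check("TP", saved.tp, current.tp);
    check("PP", saved.pp, current.pp);
    check("SP", saved.sp, current.sp);
    return message;
}
```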

chen2021673 · 2026-05-08T06:20:32Z

+#include "gflags/gflags.h"
+#include "gflags/gflags_declare.h"
+#include "glog/logging.h"
+#include "infini_train/include/nn/parallel/global.h"


Are these includes still needed? Can they be removed?

chen2021673 · 2026-05-08T06:46:21Z

    return DataLoaderIterator(*dataset_, batch_size_, max_batch_idx_, max_batch_idx_);
 }

+DataLoaderIterator DataLoader::IteratorAtBatchIndex(size_t batch_idx) const {


Check that batch_idx % ddp_world_size == ddp_rank.
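The requested check can be sketched as below, assuming batches are dealt round-robin to data-parallel ranks (an assumption suggested by the condition in the comment, not confirmed by the diff). `OwnsBatch` and `NextOwnedBatch` are hypothetical helpers.

```cpp
#include <cstddef>

// True if `batch_idx` belongs to `ddp_rank` under round-robin assignment.
bool OwnsBatch(std::size_t batch_idx, std::size_t ddp_world_size, std::size_t ddp_rank) {
    return ddp_world_size > 0 && ddp_rank < ddp_world_size
        && batch_idx % ddp_world_size == ddp_rank;
}

// First batch index at or after `batch_idx` that `ddp_rank` owns.
// Precondition: ddp_world_size > 0 and ddp_rank < ddp_world_size.
std::size_t NextOwnedBatch(std::size_t batch_idx, std::size_t ddp_world_size,
                           std::size_t ddp_rank) {
    std::size_t offset = (ddp_rank + ddp_world_size - batch_idx % ddp_world_size) % ddp_world_size;
    return batch_idx + offset;
}
```

A caller could either reject a batch index the rank does not own or round it up with `NextOwnedBatch`; which behavior is appropriate depends on how the saved index is defined.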

chen2021673 · 2026-05-08T06:51:17Z

+    int64_t global_step = 0;
+    int64_t data_batch_idx = 0;
+    int64_t data_batch_stride = 1;
+    float best_loss = 0.0f;


best_loss should not be 0.0; it should be infinity.
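The reasoning is that with an initial value of 0.0 no positive loss can ever compare as an improvement. A minimal sketch of the suggested initialization follows; the struct mirrors the fields in the diff, while the `TrainerState` name and `UpdateBestLoss` are assumptions for illustration.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical mirror of the trainer-state fields shown in the diff.
struct TrainerState {
    std::int64_t global_step = 0;
    std::int64_t data_batch_idx = 0;
    std::int64_t data_batch_stride = 1;
    // Start at +infinity so the first observed loss always counts as an improvement.
    float best_loss = std::numeric_limits<float>::infinity();
};

// Update best_loss and report whether `loss` improved on it.
bool UpdateBestLoss(TrainerState &state, float loss) {
    if (loss < state.best_loss) {
        state.best_loss = loss;
        return true;
    }
    return false;
}
```

Standard JSON has no literal for infinity, so if this value is written to trainer_state.json before any loss has been recorded, the serializer would need a convention for it.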

ArcaLunar and others added 5 commits April 21, 2026 11:25

feat: checkpoint save & load

69a0729

format: format files in examples and infini_train

ade0893

format: use clang-format-16 instead

feat: extract resuming to utils

39e89bf

remove redundent arguments

feat: extract similar logic in ckpt_save

b363779

format files

feat(checkpoint): reorganize checkpoint code and improve robustness

0a3deb2

JYMiracle305 force-pushed the feature/add_checkpoint branch from e8c5dd5 to 0a3deb2 Compare April 24, 2026 09:22

JYMiracle305 requested a review from chen2021673 April 29, 2026 01:16

JYMiracle305 changed the title ~~[WIP] Feat: add checkpoint loading mechanism~~ Feat: add checkpoint loading mechanism Apr 29, 2026

chen2021673 reviewed May 8, 2026

JYMiracle305 wants to merge 5 commits into master from feature/add_checkpoint

3 participants
