Skip to content

Add async checkpoint feature#1703

Merged
HAOCHENYE merged 2 commits into
InternLM:mainfrom
VincentCheungKokomo:feature/async-checkpoint
May 21, 2026
Merged

Add async checkpoint feature#1703
HAOCHENYE merged 2 commits into
InternLM:mainfrom
VincentCheungKokomo:feature/async-checkpoint

Conversation

@VincentCheungKokomo

@VincentCheungKokomo VincentCheungKokomo commented Apr 22, 2026

Copy link
Copy Markdown
Contributor
  • feat(checkpoint): add async DCP checkpoint saving
  1. Use PyTorch DCP async_save to offload checkpoint file writes to a background process, allowing training to continue while checkpoint I/O is in progress.
  2. Create a dedicated gloo process group for async checkpoint collectives to avoid interfering with the training communication group.
  3. Write async checkpoints to an .incomplete directory first, then rename after all ranks finish, so partially written checkpoints are never exposed as valid checkpoints.
  4. Keep at most one async checkpoint in flight and wait for pending saves before the next checkpoint or trainer shutdown.
  5. Reuse a shared-memory staging cache in XtunerCacheWriter to reduce CPU buffer allocation and process handoff overhead between checkpoints.
  6. Add optional file write lock slots to throttle concurrent DCP file writes and reduce storage contention during async checkpointing.
  7. Store model and optimizer DCP state under a single weights directory and update trainer, RL worker, resume logic, and tests for the merged layout.

@VincentCheungKokomo VincentCheungKokomo force-pushed the feature/async-checkpoint branch 2 times, most recently from 7a7136b to 302b6ec Compare April 23, 2026 03:47
Comment thread xtuner/v1/engine/train_engine.py Outdated
from xtuner.v1.utils.grad_norm import cal_grad_norm


if BlockingAsyncStager is not None:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In [2]: fw = FileSystemWriter("./")

In [3]: from torch.distributed.checkpoint.staging import AsyncStager, BlockingAsyncStager

In [4]: isinstance(fw, AsyncStager)
Out[4]: True

is _CachingStagingWriter necessary?

Comment thread xtuner/v1/engine/train_engine.py Outdated
Comment thread xtuner/v1/train/trainer.py Outdated
Comment thread xtuner/v1/train/trainer.py Outdated
@VincentCheungKokomo VincentCheungKokomo force-pushed the feature/async-checkpoint branch 6 times, most recently from 695d2b3 to b6701ef Compare April 30, 2026 08:08
Comment thread xtuner/v1/engine/train_engine.py Outdated
Comment thread xtuner/v1/engine/train_engine.py Outdated
Comment thread xtuner/v1/engine/train_engine.py Outdated
Comment thread xtuner/v1/engine/train_engine.py Outdated
if cached_has_optim != save_optimizer:
self._async_state_dict_cache = None
storage_writer = FileSystemWriter(weights_dir, cache_staged_state_dict=True)
storage_writer.state_dict_cache = self._async_state_dict_cache

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this injection necessary?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cache_staged_state_dict keeps pinned staging buffers on the FileSystemWriter instance. XTuner creates one writer per checkpoint path, so carry the cache across writers to preserve steady-state async_save launch performance.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we hold the storage writer all the time if async_save is enabled?

Comment thread xtuner/v1/engine/train_engine.py
Comment thread xtuner/v1/train/trainer.py Outdated
@VincentCheungKokomo VincentCheungKokomo force-pushed the feature/async-checkpoint branch from b6701ef to b8c953d Compare May 7, 2026 09:39
Comment thread xtuner/v1/train/trainer.py Outdated
Comment thread xtuner/v1/train/trainer.py Outdated
@VincentCheungKokomo VincentCheungKokomo force-pushed the feature/async-checkpoint branch 2 times, most recently from 1eb91e5 to 962cc16 Compare May 8, 2026 03:19
@VincentCheungKokomo VincentCheungKokomo force-pushed the feature/async-checkpoint branch from 1231fce to a9b2a2e Compare May 19, 2026 13:31
Comment thread autotest/config/qwen3_moe_30BA3_ep8.py Outdated
Comment thread xtuner/v1/train/trainer.py Outdated
@@ -397,8 +337,10 @@ class TrainerConfig(BaseModel):
strict_load: bool = True
checkpoint_interval: int | None = -1
checkpoint_maxkeep: int | None = -1
checkpoint_save_optimizer: bool = True

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hhaAndroid Please evaluate whether we actually have a need to not store the optimizer

Comment thread xtuner/v1/train/trainer.py Outdated
Comment thread xtuner/v1/train/trainer.py Outdated
Comment thread run_train.sh Outdated
Comment thread xtuner/v1/engine/train_engine.py Outdated
Comment thread xtuner/v1/engine/train_engine.py Outdated
Comment thread xtuner/v1/engine/train_engine.py Outdated

threading.Thread(target=commit, daemon=True).start()

dcp_future.add_done_callback(commit_async_save)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is the purpose of using a committed future + dcp future here? I understand that committed simply involves recording the time consumed, printing some logs, and finally moving the path, which are all lightweight operations. What is the purpose of such an implementation?

Comment thread xtuner/v1/patch/xtuner_storage.py
Comment thread xtuner/v1/patch/xtuner_storage.py
@VincentCheungKokomo VincentCheungKokomo force-pushed the feature/async-checkpoint branch 6 times, most recently from 9d875e4 to 93eacaf Compare May 20, 2026 14:36
@VincentCheungKokomo VincentCheungKokomo force-pushed the feature/async-checkpoint branch 2 times, most recently from b7fe4fd to fe5ea66 Compare May 21, 2026 06:16
@VincentCheungKokomo VincentCheungKokomo force-pushed the feature/async-checkpoint branch from fe5ea66 to 98e9cf1 Compare May 21, 2026 07:21
@HAOCHENYE HAOCHENYE merged commit ed797dd into InternLM:main May 21, 2026
4 of 5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants