Add async checkpoint feature by VincentCheungKokomo · Pull Request #1703 · InternLM/xtuner

VincentCheungKokomo · 2026-04-22T16:35:37Z

feat(checkpoint): add async DCP checkpoint saving

Use PyTorch DCP async_save to offload checkpoint file writes to a background process, allowing training to continue while checkpoint I/O is in progress.
Create a dedicated gloo process group for async checkpoint collectives to avoid interfering with the training communication group.
Write async checkpoints to an .incomplete directory first, then rename after all ranks finish, so partially written checkpoints are never exposed as valid checkpoints.
Keep at most one async checkpoint in flight and wait for pending saves before the next checkpoint or trainer shutdown.
Reuse a shared-memory staging cache in XtunerCacheWriter to reduce CPU buffer allocation and process handoff overhead between checkpoints.
Add optional file write lock slots to throttle concurrent DCP file writes and reduce storage contention during async checkpointing.
Store model and optimizer DCP state under a single weights directory and update trainer, RL worker, resume logic, and tests for the merged layout.
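
The staging-then-rename and single-in-flight rules described above can be sketched with the standard library alone. This is a minimal illustration, not the PR's implementation: `AsyncSaver`, `save_async`, `wait`, and the `weights.bin` file name are hypothetical, a single worker thread stands in for the DCP background process, and the sketch runs in one process, so the cross-rank synchronization that precedes the rename is omitted.

```python
import os
import tempfile
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Optional


class AsyncSaver:
    """Hypothetical helper: write to `<dir>.incomplete`, rename on success."""

    def __init__(self) -> None:
        # One worker thread stands in for the background checkpoint process.
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._pending: Optional[Future] = None

    def wait(self) -> None:
        """Block until the in-flight save (if any) finishes; re-raise its error."""
        if self._pending is not None:
            pending, self._pending = self._pending, None
            pending.result()

    def save_async(self, payload: bytes, final_dir: str) -> Future:
        # Keep at most one save in flight: drain the previous one first.
        self.wait()
        self._pending = self._pool.submit(self._write_and_commit, payload, final_dir)
        return self._pending

    @staticmethod
    def _write_and_commit(payload: bytes, final_dir: str) -> str:
        staging_dir = final_dir + ".incomplete"
        os.makedirs(staging_dir, exist_ok=True)
        with open(os.path.join(staging_dir, "weights.bin"), "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
        # The final directory name appears only after every file is written.
        os.rename(staging_dir, final_dir)
        return final_dir

    def close(self) -> None:
        """Wait for any pending save, then stop the worker (trainer shutdown)."""
        self.wait()
        self._pool.shutdown()


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    saver = AsyncSaver()
    saver.save_async(b"step-1 weights", os.path.join(root, "step-1"))
    # Training would continue here while the write is in progress.
    saver.save_async(b"step-2 weights", os.path.join(root, "step-2"))
    saver.close()
    print(sorted(os.listdir(root)))
```

A reader that only ever lists directories without the `.incomplete` suffix will never pick up a half-written checkpoint, which is the property the description relies on.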

HAOCHENYE · 2026-04-27T17:43:41Z

 from xtuner.v1.utils.grad_norm import cal_grad_norm


+if BlockingAsyncStager is not None:


In [2]: fw = FileSystemWriter("./")
In [3]: from torch.distributed.checkpoint.staging import AsyncStager, BlockingAsyncStager
In [4]: isinstance(fw, AsyncStager)
Out[4]: True

Is _CachingStagingWriter necessary, given that FileSystemWriter is already an AsyncStager?

HAOCHENYE · 2026-05-07T06:18:13Z

+            if cached_has_optim != save_optimizer:
+                self._async_state_dict_cache = None
+        storage_writer = FileSystemWriter(weights_dir, cache_staged_state_dict=True)
+        storage_writer.state_dict_cache = self._async_state_dict_cache


Is this injection necessary?

cache_staged_state_dict keeps pinned staging buffers on the FileSystemWriter instance. XTuner creates one writer per checkpoint path, so carry the cache across writers to preserve steady-state async_save launch performance.

Could we hold the storage writer all the time if async_save is enabled?

HAOCHENYE · 2026-05-19T15:45:28Z

@@ -397,8 +337,10 @@ class TrainerConfig(BaseModel):
    strict_load: bool = True
    checkpoint_interval: int | None = -1
    checkpoint_maxkeep: int | None = -1
+    checkpoint_save_optimizer: bool = True


@hhaAndroid Please evaluate whether we actually need an option to skip saving the optimizer.

HAOCHENYE · 2026-05-19T16:35:06Z

+
+            threading.Thread(target=commit, daemon=True).start()
+
+        dcp_future.add_done_callback(commit_async_save)


What is the purpose of using a committed future plus a DCP future here? As I understand it, the commit step only records the elapsed time, prints some logs, and finally moves the path, all of which are lightweight operations. Why is it implemented this way?
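
For reference, the pattern under discussion can be sketched with plain `concurrent.futures`. This is an illustration only, not the PR's code: `chain_commit` and the `commit` callable are hypothetical, and the sketch assumes the DCP future behaves like a `concurrent.futures.Future`. One property worth keeping in mind is that `concurrent.futures` logs and ignores an exception raised inside a done callback, so a second future is one way to surface commit failures to a later `result()` call. The thread does not state whether that is the author's motivation.

```python
import threading
from concurrent.futures import Future
from typing import Callable


def chain_commit(dcp_future: Future, commit: Callable[[], None]) -> Future:
    """Return a future that completes only after `commit` has run.

    Hypothetical helper: `commit` stands in for the log-and-rename step.
    """
    committed: Future = Future()

    def run_commit() -> None:
        try:
            dcp_future.result()  # re-raises if the save itself failed
            commit()
            committed.set_result(None)
        except Exception as exc:
            # Propagate save or commit errors to whoever waits on `committed`.
            committed.set_exception(exc)

    def on_save_done(_: Future) -> None:
        # Hand off to a thread so the done callback returns immediately.
        threading.Thread(target=run_commit, daemon=True).start()

    dcp_future.add_done_callback(on_save_done)
    return committed
```

A caller would wait on the returned future, for example `chain_commit(dcp_future, commit).result()`, before starting the next checkpoint or shutting down. If the commit step really is lightweight and cannot fail in a way the trainer must see, running it directly inside the done callback would be the simpler alternative the reviewer appears to have in mind.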

VincentCheungKokomo force-pushed the feature/async-checkpoint branch 2 times, most recently from 7a7136b to 302b6ec on April 23, 2026 03:47

HAOCHENYE reviewed Apr 27, 2026

VincentCheungKokomo force-pushed the feature/async-checkpoint branch 6 times, most recently from 695d2b3 to b6701ef on April 30, 2026 08:08

HAOCHENYE reviewed May 7, 2026

VincentCheungKokomo force-pushed the feature/async-checkpoint branch from b6701ef to b8c953d on May 7, 2026 09:39

HAOCHENYE reviewed May 7, 2026

Comment thread xtuner/v1/train/trainer.py Outdated

Comment thread xtuner/v1/train/trainer.py Outdated

VincentCheungKokomo force-pushed the feature/async-checkpoint branch 2 times, most recently from 1eb91e5 to 962cc16 on May 8, 2026 03:19

VincentCheungKokomo force-pushed the feature/async-checkpoint branch from 1231fce to a9b2a2e on May 19, 2026 13:31

HAOCHENYE reviewed May 19, 2026

VincentCheungKokomo force-pushed the feature/async-checkpoint branch 6 times, most recently from 9d875e4 to 93eacaf on May 20, 2026 14:36

Add async checkpoint feature

667b5c8

VincentCheungKokomo force-pushed the feature/async-checkpoint branch 2 times, most recently from b7fe4fd to fe5ea66 on May 21, 2026 06:16

Optimize async DCP process checkpointing with shm cache and file locks

98e9cf1

VincentCheungKokomo force-pushed the feature/async-checkpoint branch from fe5ea66 to 98e9cf1 on May 21, 2026 07:21

HAOCHENYE approved these changes May 21, 2026

HAOCHENYE merged commit ed797dd into InternLM:main May 21, 2026
4 of 5 checks passed

2 participants
