Add async checkpoint feature #1703
7a7136b to 302b6ec
| from xtuner.v1.utils.grad_norm import cal_grad_norm |
| |
| if BlockingAsyncStager is not None: |
In [1]: from torch.distributed.checkpoint import FileSystemWriter
In [2]: fw = FileSystemWriter("./")
In [3]: from torch.distributed.checkpoint.staging import AsyncStager, BlockingAsyncStager
In [4]: isinstance(fw, AsyncStager)
Out[4]: True
Since FileSystemWriter already passes the AsyncStager isinstance check, is _CachingStagingWriter necessary?
695d2b3 to b6701ef
| if cached_has_optim != save_optimizer: |
|     self._async_state_dict_cache = None |
| storage_writer = FileSystemWriter(weights_dir, cache_staged_state_dict=True) |
| storage_writer.state_dict_cache = self._async_state_dict_cache |
Is this injection necessary?
cache_staged_state_dict keeps the pinned staging buffers on the FileSystemWriter instance. XTuner creates one writer per checkpoint path, so we carry the cache over to each new writer to preserve steady-state async_save launch performance.
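The cache hand-off described above can be sketched in plain Python. This is only an illustration under stated assumptions: `StubWriter`, `CheckpointSaver`, and `stage` are hypothetical stand-ins, not the real FileSystemWriter or XTuner code, and a plain dict stands in for the pinned staging buffers. Only the names `state_dict_cache`, `cache_staged_state_dict`, `_async_state_dict_cache`, `cached_has_optim`, and `save_optimizer` come from the diff.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class StubWriter:
    """Hypothetical stand-in for a storage writer (NOT the real FileSystemWriter)."""

    path: str
    cache_staged_state_dict: bool = False
    state_dict_cache: Optional[dict[str, Any]] = None

    def stage(self, state_dict: dict[str, Any]) -> dict[str, Any]:
        """Copy state into the cached buffers, allocating them only if missing."""
        if self.cache_staged_state_dict and self.state_dict_cache is not None:
            staged = self.state_dict_cache  # reuse: no new allocation
        else:
            staged = {}  # stands in for an expensive pinned-buffer allocation
            if self.cache_staged_state_dict:
                self.state_dict_cache = staged
        staged.update(state_dict)
        return staged


class CheckpointSaver:
    """Hypothetical owner that creates one writer per checkpoint path."""

    def __init__(self) -> None:
        self._async_state_dict_cache: Optional[dict[str, Any]] = None
        self._cached_has_optim: Optional[bool] = None

    def save(self, weights_dir: str, state_dict: dict[str, Any], save_optimizer: bool):
        # A layout change (optimizer present or not) invalidates the buffers.
        if self._cached_has_optim != save_optimizer:
            self._async_state_dict_cache = None
        writer = StubWriter(weights_dir, cache_staged_state_dict=True)
        # Inject the cache from the previous writer into the new one.
        writer.state_dict_cache = self._async_state_dict_cache
        staged = writer.stage(state_dict)
        # Keep the buffers alive after this writer is discarded.
        self._async_state_dict_cache = writer.state_dict_cache
        self._cached_has_optim = save_optimizer
        return staged
```

In this sketch, two consecutive saves with the same `save_optimizer` value reuse the same buffer object even though each save builds a fresh writer. Holding a single long-lived writer, as the next comment suggests, would make the injection step unnecessary if the writer's target path could be changed between saves.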
Could we hold the storage writer all the time if async_save is enabled?
b6701ef to b8c953d
1eb91e5 to 962cc16
1231fce to a9b2a2e
| @@ -397,8 +337,10 @@ class TrainerConfig(BaseModel): |
|     strict_load: bool = True |
|     checkpoint_interval: int | None = -1 |
|     checkpoint_maxkeep: int | None = -1 |
|     checkpoint_save_optimizer: bool = True |
@hhaAndroid Please evaluate whether we actually need an option to skip saving the optimizer state.
| threading.Thread(target=commit, daemon=True).start() |
| |
| dcp_future.add_done_callback(commit_async_save) |
What is the purpose of using a committed future together with the DCP future here? As I understand it, the commit step only records the elapsed time, prints some logs, and finally moves the path, all of which are lightweight operations. Why is this implementation needed?
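For reference, the two-future pattern under discussion can be sketched as follows. This is a minimal illustration, not the PR's actual code: it assumes `dcp_future` behaves like a `concurrent.futures.Future`, `launch_async_save` and `commit_steps` are hypothetical names, and the body of `commit` is a placeholder for the timing, logging, and path move mentioned above. One relevant fact about `concurrent.futures` is that a done callback runs in the thread that completes the future, so a callback that spawns a thread returns immediately and leaves that thread free.

```python
import threading
from concurrent.futures import Future


def launch_async_save(dcp_future: Future, commit_steps: list) -> Future:
    """Return a future that resolves only after the commit step has run."""
    committed: Future = Future()

    def commit_async_save(done: Future) -> None:
        def commit() -> None:
            try:
                done.result()  # re-raises if the save itself failed
                # Placeholder for: record elapsed time, log, move the path.
                commit_steps.append("committed")
                committed.set_result(True)
            except Exception as exc:
                committed.set_exception(exc)

        # Hand the commit off so the callback itself returns immediately.
        threading.Thread(target=commit, daemon=True).start()

    dcp_future.add_done_callback(commit_async_save)
    return committed


# Usage: callers wait on `committed`, not on `dcp_future`.
dcp_future: Future = Future()
steps: list = []
committed = launch_async_save(dcp_future, steps)
dcp_future.set_result(None)  # simulate the save finishing
committed.result(timeout=5)
```

If the commit work is as lightweight as the comment suggests, the same ordering guarantee could be obtained by running the commit directly inside the callback and dropping the extra thread. The second future is only needed when callers must distinguish "data written" from "checkpoint committed".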
9d875e4 to 93eacaf
b7fe4fd to fe5ea66
fe5ea66 to 98e9cf1