[bugfix] update results of state_dict loading, embedding resizing to secondary partitions (hpz) #7130
cyr0930 wants to merge 6 commits into deepspeedai:master
Conversation
@cyr0930, thanks for this PR. Can you provide more context about the failure this fixes? For example, did you encounter convergence issues after checkpoint loading?
Partitioned parameters are updated only when ds_secondary_partition_tensor is None, due to this commit (#4906). Currently, after parameter initialization, a ds_secondary_partition_tensor is created and exists for each param, so the secondary partitions are not updated when we perform state_dict loading or embedding resizing.
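To make that behavior concrete, here is a minimal, self-contained sketch (not the actual DeepSpeed source; the `Param` class and `update_partitions` helper are illustrative stand-ins) of the gating described above: once a secondary partition exists, later weight updates such as a state_dict load no longer reach it.

```python
import torch

# Toy stand-in for a ZeRO-3 parameter with an hpZ secondary partition.
class Param:
    def __init__(self, full):
        self.ds_tensor = full.clone()       # primary partition (sharding omitted for brevity)
        self.ds_secondary_tensor = None     # hpZ secondary partition, created lazily

def update_partitions(param, new_weights):
    param.ds_tensor.copy_(new_weights)
    # Guard analogous to the one described above: the secondary copy is only
    # (re)built when it does not exist yet.
    if param.ds_secondary_tensor is None:
        param.ds_secondary_tensor = new_weights.clone()

p = Param(torch.zeros(4))
update_partitions(p, torch.ones(4))          # initialization: secondary partition is created
update_partitions(p, torch.full((4,), 2.0))  # e.g. state_dict load after init
print(p.ds_tensor)            # tensor([2., 2., 2., 2.])
print(p.ds_secondary_tensor)  # tensor([1., 1., 1., 1.]) -> stale secondary copy
```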
@cyr0930, apologies for the delay on this. My understanding is the
I'm not sure, because this logic is a bit complicated, but IMO: while HfArgumentParser.parse_args_into_dataclasses is executed, DeepSpeed ZeRO-3 is enabled by this (https://github.com/huggingface/transformers/blob/v4.51.2/src/transformers/training_args.py#L2046). Then, while the model is loaded via the from_pretrained method, the deepspeed.zero.Init context is introduced by this (https://github.com/huggingface/transformers/blob/v4.51.2/src/transformers/modeling_utils.py#L3727). This context wraps the initialization of modules, so parameters are converted to DeepSpeed parameters during initialization here (https://github.com/deepspeedai/DeepSpeed/blob/master/deepspeed/runtime/zero/partition_parameters.py#L1107). That's what I've found from debugging so far.
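For readers following along, a hedged sketch of that flow (the config values and the `nn.Linear` stand-in model are assumptions, not taken from this thread; running it requires a DeepSpeed-capable distributed environment):

```python
import deepspeed
import torch.nn as nn

# Illustrative ZeRO-3 config with hpZ enabled; values are placeholders.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 3,
        "zero_hpz_partition_size": 8,  # enables hierarchical (hpZ) secondary partitions
    },
}

# transformers' from_pretrained effectively constructs the model inside this
# context when ZeRO-3 is active, so parameters are already partitioned
# DeepSpeed parameters before any pretrained weights are loaded.
with deepspeed.zero.Init(config_dict_or_path=ds_config):
    model = nn.Linear(16, 16)  # stand-in for the real model class

# Weights loaded after this point must be propagated to every partition,
# including the hpZ secondary one.
```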
@cyr0930, thanks for sharing this information and for debugging. As this is a critical part of hpz, can you please share your repro steps with me so I can try them on my side?
This is the most minimal reproducing code I was able to put together: deepspeed_init.py
After this commit (#4906), secondary partitioned tensors are updated only after optimizer.step().
When loading a state_dict or resizing embeddings after init, the secondary partitioned tensors should be updated as well.
e.g., https://github.com/huggingface/transformers/blob/1c4b62b219323a31011bac3bd3cece7675d9e4c3/src/transformers/integrations/deepspeed.py#L344
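For context, a hedged sketch of the kind of post-init update this description refers to (the helper name, the use of `strict=False`, and the rank handling are assumptions for illustration; the actual transformers code path is the link above):

```python
import torch
import deepspeed

def load_weights_after_init(model, state_dict):
    """Load a state_dict into a ZeRO-3 model after deepspeed.zero.Init."""
    params = list(model.parameters())
    # modifier_rank=0: rank 0 edits the gathered full weights; on context exit
    # DeepSpeed scatters them back into the per-rank partitions.
    with deepspeed.zero.GatheredParameters(params, modifier_rank=0):
        if torch.distributed.get_rank() == 0:
            model.load_state_dict(state_dict, strict=False)
    # Without this PR, the hpZ secondary partitions can still hold the values
    # from random initialization at this point, since they are only refreshed
    # after optimizer.step().
```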