fix(superoffload): preserve multi-group updates with shared cpu buffers (#7905) #7906
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f35fcc8790
…ZeRO-3 subgroup buffering
```diff
 if len(bucket.params) == 0:
     self._cur_bucket_index = i
-if getattr(param, "ds_grad_is_ready", True):
+if getattr(param, "ds_grad_is_ready", True) and param.grad is not None:
```
While this approach works, I am curious why execution gets this far for params without grads. We have a mechanism for excluding frozen params in ZeRO-1/2 and ZeRO-3. Could it be that we are incorrectly adding backward hooks to frozen parameters?
Looking further, I see that for ZeRO-1/2 we exclude frozen parameters from backward hooks, but there seems to be no similar exclusion in ZeRO-3/SuperOffload. Can you please double-check this logic?
…cro-steps and sync lr to worker process
💡 Codex Review
Reviewed commit: 7da1d33ca9
```python
if self.clip_grad:
    self._step_with_clipping(scaled_global_grad_norm, timer_names)
else:
    self._step_without_clipping(scaled_global_grad_norm, timer_names)
```
**Gate the synchronous path on actual clipping**
DeepSpeed's default `gradient_clipping` is 1.0 (deepspeed/runtime/constants.py:251), so branching on `self.clip_grad` here sends the default SuperOffload configuration into `_step_with_clipping()` on every step, even when `check_clip_grads(...)` would be false. Because `partition_grads()` now only launches async CPUAdam work when `not self.clip_grad`, this change disables the new backward/CPU overlap for most users and turns the patch into a throughput regression unless they explicitly set clipping to 0.0.
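A hedged sketch of the gating the reviewer suggests. `check_clip_grads` and the `_step_*` methods belong to the PR; this standalone version (hypothetical `choose_step_path`) uses a plain norm comparison to show the idea: take the synchronous clipped path only when the global grad norm actually exceeds the threshold, so the default `clip_grad=1.0` config still gets the async overlap on most steps.

```python
# Sketch only: decide the step path from the runtime grad norm,
# not from the mere presence of a clip threshold in the config.
def choose_step_path(global_grad_norm: float, clip_grad: float) -> str:
    # clip_grad is the configured max norm (DeepSpeed defaults to 1.0).
    # Clipping only changes the update when the norm exceeds it, so a
    # configured-but-inactive threshold should not force the sync path.
    needs_clipping = clip_grad > 0.0 and global_grad_norm > clip_grad
    return "sync_clipped" if needs_clipping else "async_overlapped"
```

Under this gating, a run with the default `clip_grad=1.0` whose gradients stay below the threshold keeps the overlapped CPUAdam path and only falls back to the synchronous path on the steps where clipping would actually fire.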
```python
optimizer_configs.append({
    "lr": pg["lr"],
    "betas": pg["betas"],
    "eps": pg["eps"],
    "weight_decay": pg["weight_decay"],
    "amsgrad": pg["amsgrad"],
})
```
**Preserve `bias_correction` when copying optimizer groups**
The worker-side `DeepSpeedCPUAdam` still reads `group['bias_correction']` during `step_subgroup()` (deepspeed/ops/adam/cpu_adam.py:199-202), but the new per-group config copy only forwards `lr`, `betas`, `eps`, `weight_decay`, and `amsgrad`. If a SuperOffload run has multiple optimizer groups with different `bias_correction` settings, every mirrored worker group silently falls back to the constructor default instead of the original group's value, so those subgroups will use the wrong Adam update rule.
Fixes issue #7905.
The figures below compare per-iteration time and GPU memory usage against the non-offload baseline. The second figure presents a correctness check of the updated version.
