Skip to content

fix(zero): guard ds_grads_remaining#7904

Closed
ailuntz wants to merge 2 commits intodeepspeedai:masterfrom
ailuntz:fix/ds-grads-remaining-guard
Closed

fix(zero): guard ds_grads_remaining#7904
ailuntz wants to merge 2 commits intodeepspeedai:masterfrom
ailuntz:fix/ds-grads-remaining-guard

Conversation

@ailuntz
Copy link
Copy Markdown

@ailuntz ailuntz commented Mar 14, 2026

Summary

  • Guard ds_grads_remaining to avoid unexpected errors in edge cases.

Testing

  • Not run (not requested).

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d997d509c8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@torch.compiler.disable
def _post_backward_module_hook(module, inputs):
module.ds_grads_remaining = 0
_ensure_ds_grads_remaining(module)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Reset ds_grads_remaining at each forward-pre hook

In _post_backward_module_hook, replacing the per-forward reset with _ensure_ds_grads_remaining(module) keeps any stale ds_grads_remaining value from previous iterations instead of clearing it. This counter is decremented in PostBackwardFunctionModule.backward, and the nearby comment already notes some backward hooks may not trigger in edge cases; when that happens, a leftover positive count now carries into the next forward, so remaining can stop reaching 0 and post_sub_module_backward_function is skipped indefinitely, which can block parameter release and lead to persistent memory growth.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ailuntz did you see this feedback?

@sfc-gh-truwase
Copy link
Copy Markdown
Collaborator

@ailuntz thanks for the PR. Given that large number of modifications this flag has had in the past, can you please explain the issue that triggered this PR?

@sfc-gh-truwase
Copy link
Copy Markdown
Collaborator

@ailuntz is this PR still needed? Thanks!

Signed-off-by: ailuntz <ailuntz@ailuntzdeMac-mini.local>
@ailuntz ailuntz force-pushed the fix/ds-grads-remaining-guard branch from d997d50 to 7380eb9 Compare April 1, 2026 08:06
@ailuntz
Copy link
Copy Markdown
Author

ailuntz commented Apr 1, 2026

The PR is still needed on current master. I checked the latest upstream parameter_offload.py on 2026-04-01, and the ds_grads_remaining accesses are still unguarded there.

The trigger for this fix is an edge case where a module reaches the backward hook path before ds_grads_remaining has ever been initialized on that module. In that case reads/writes like module.ds_grads_remaining += 1, module.ds_grads_remaining - 1, or if sub_module.ds_grads_remaining == 0 raise AttributeError and break the ZeRO-3 hook flow.

This patch keeps the behavior the same for the normal path, but lazily initializes the counter to 0 before the first read/write and preserves any existing non-zero value. I also added a focused unit test for the helper behavior.

I rebased the branch onto the latest upstream master and force-pushed a signed-off commit so the DCO check should be addressed now.

@sfc-gh-truwase
Copy link
Copy Markdown
Collaborator

@ailuntz thanks for explaining the possible need for this fix. However, it would be better to an actual repro of that scenario as you described earlier:
The trigger for this fix is an edge case where a module reaches the backward hook path before ds_grads_remaining has ever been initialized on that module.

I can't think of how the above could happen. The provided unit test does not show this either. An actual repro would be useful to exclude some other true root cause, or a better solution. For example, I am curious if #7929 which ensures that full initialization of ZeRO states before usage could be extended for this case.


def _run_after_backward_hook(*unused):
module.ds_grads_remaining = module.ds_grads_remaining - 1
remaining = _ensure_ds_grads_remaining(module) - 1
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that remaining could become negative here, right?

Signed-off-by: ailuntz <ailuntz@ailuntzdeMac-mini.local>
@ailuntz
Copy link
Copy Markdown
Author

ailuntz commented Apr 1, 2026

You're right that the previous version still did not provide a concrete standalone repro, and the broader read/write guards were more invasive than necessary.

I narrowed the patch in the latest push:

  • keep the original read/write logic for ds_grads_remaining
  • initialize ds_grads_remaining once during module registration, before the hook paths can use it

That makes the state fully initialized before usage, which seems closer to the direction you pointed out around ensuring ZeRO state initialization before use, without masking other potential root causes by guarding every access.

If you still want a concrete runtime repro before considering even this smaller hardening, I can keep digging there too. I mainly wanted to reduce the patch to the smallest defensible version first.

@ailuntz
Copy link
Copy Markdown
Author

ailuntz commented Apr 1, 2026

Closing this for now because I do not have a stable reproduction or a confirmed root cause yet. I will reopen or send a fresh PR once I can provide a concrete repro.

@ailuntz ailuntz closed this Apr 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants