Skip to content

support new logic of common state dict#1669

Open
dimapihtar wants to merge 2 commits into
NVIDIA:mainfrom
dimapihtar:fix_common_strategy
Open

support new logic of common state dict#1669
dimapihtar wants to merge 2 commits into
NVIDIA:mainfrom
dimapihtar:fix_common_strategy

Conversation

@dimapihtar

@dimapihtar dimapihtar commented Jun 10, 2026

Copy link
Copy Markdown

What does this PR do?

Adds support for new logic of common state dict in MCore. MCore's PR: NVIDIA/Megatron-LM#5160

Usage

# Add a code snippet demonstrating how to use this

Testing

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A
  • Did you get Claude approval on this PR?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

Bug Fixes

  • Enhanced checkpoint restoration to intelligently detect and support both legacy and modern checkpoint formats. This ensures complete backward compatibility with existing checkpoints while enabling seamless transitions to newer distributed storage systems.

Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
@dimapihtar dimapihtar requested a review from a team as a code owner June 10, 2026 14:28
@copy-pr-bot

copy-pr-bot Bot commented Jun 10, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6649946f-1562-4085-9a02-c67d0ba149d1

📥 Commits

Reviewing files that changed from the base of the PR and between bde162a and c4d963b.

📒 Files selected for processing (1)
  • modelopt/torch/opt/plugins/mcore_dist_checkpointing.py

📝 Walkthrough

Walkthrough

This PR adds backward compatibility to restore_sharded_modelopt_state by detecting and loading legacy standalone common-state checkpoint files. The function now checks for the presence of a legacy checkpoint before attempting to load from the newer distributed checkpoint format, ensuring compatibility with both checkpoint generations.

Changes

Backward-compatible modelopt checkpoint restoration

Layer / File(s) Summary
Legacy checkpoint format detection
modelopt/torch/opt/plugins/mcore_dist_checkpointing.py
restore_sharded_modelopt_state now conditionally loads the common replicated state: it checks for a standalone legacy checkpoint file at modelopt_checkpoint_name/COMMON_STATE_FNAME and loads it via safe_load if present; otherwise, it falls back to dist_checkpointing.load_common_state_dict for the newer distributed checkpoint format.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 6
✅ Passed checks (6 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: adding support for new common state dict logic in the checkpoint restoration function.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns ✅ Passed PR changes only mcore_dist_checkpointing.py, adding legacy checkpoint format detection using safe_load() and dist_checkpointing. No security anti-patterns detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov

codecov Bot commented Jun 10, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 0% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.30%. Comparing base (bde162a) to head (618b566).

Files with missing lines Patch % Lines
...lopt/torch/opt/plugins/mcore_dist_checkpointing.py 0.00% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1669      +/-   ##
==========================================
- Coverage   77.31%   77.30%   -0.01%     
==========================================
  Files         509      509              
  Lines       55912    55915       +3     
==========================================
- Hits        43226    43225       -1     
- Misses      12686    12690       +4     
Flag Coverage Δ
unit 54.44% <0.00%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@kevalmorabia97

Copy link
Copy Markdown
Collaborator

@kevalmorabia97 kevalmorabia97 added the cherry-pick-0.45.0 After code freeze, cherry-pick to release branch for next rc (bulk update). Only for bug fixes / doc label Jun 10, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cherry-pick-0.45.0 After code freeze, cherry-pick to release branch for next rc (bulk update). Only for bug fixes / doc

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants