
[sync] Save memory using main_param for moe in param_l2_norm#2091

Merged
ananthsub merged 2 commits into NVIDIA-NeMo:main from ananthsub:sync-2249
Jan 29, 2026

Conversation

Contributor

@ananthsub ananthsub commented Jan 27, 2026

What does this PR do ?

Sync with changes from NVIDIA/Megatron-LM#2249

Changelog

  • Add specific line-by-line info of the high-level changes in this PR.

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex, etc.)
    • Reviewer: Does the PR have correct import guards for all optional libraries? (A generic guard sketch follows below.)

If you haven't finished some of the above items, you can still open a "Draft" PR.
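On the import-guard item above, a guard for an optional dependency typically looks like the following (a generic sketch only; Apex is used purely as an example and HAVE_APEX is an illustrative name, not necessarily how this repository spells its guards):

```python
# Guard the optional import once at module scope so that importing this module
# never fails when the extra dependency is missing.
try:
    import apex  # noqa: F401

    HAVE_APEX = True
except ImportError:
    HAVE_APEX = False


def fused_code_path(x):
    # Fail loudly, and only at the point where the optional feature is requested.
    if not HAVE_APEX:
        raise ImportError("Apex is required for this code path; install it or use the default path.")
    ...
```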

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

  • Bug Fixes

    • Fixed a documentation typo in training utilities.
  • Tests

    • Added comprehensive test coverage for parameter handling in BF16 training mode, including edge cases with parameter configurations and distributed training scenarios.



copy-pr-bot Bot commented Jan 27, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Contributor Author

/ok to test 65342e6

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Contributor Author

/ok to test d37e4af

@ananthsub ananthsub marked this pull request as ready for review January 29, 2026 18:32
Contributor

coderabbitai Bot commented Jan 29, 2026

📝 Walkthrough


The PR fixes a typo and enhances parameter norm calculation logic in BF16 mode by adding conditional handling for main_param availability and sharding status, with fallback mechanisms to create FP32 copies when necessary. Comprehensive test coverage is added to validate BF16 and MoE parameter handling scenarios.
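To make the described branch concrete, here is a minimal sketch of the selection logic (the helper name fp32_tensor_for_norm is invented for illustration; the attribute names main_param and main_param_sharded follow the summary above, and the real calc_params_l2_norm also performs distributed reductions that are omitted here):

```python
import torch


def fp32_tensor_for_norm(param: torch.Tensor, force_create_fp32_copy: bool = False) -> torch.Tensor:
    """Pick the fp32 tensor whose squared sum contributes to the parameter L2 norm."""
    main_param = getattr(param, "main_param", None)
    if not force_create_fp32_copy and main_param is not None:
        # Reuse the optimizer's fp32 master weight instead of allocating a new
        # fp32 copy of the bf16 param; this reuse is where the memory is saved.
        # If the master weight is sharded (main_param_sharded is True), each rank
        # contributes only its shard, and the partial squared sums must later be
        # reduced across the data-parallel group.
        return main_param
    # No usable main_param, or the caller forced a copy: materialize a temporary
    # fp32 copy of the bf16 data.
    return param.data.float()


def calc_l2_norm_sketch(params, force_create_fp32_copy: bool = False) -> float:
    # The same selection is applied to dense and MoE/expert params alike; the
    # model-parallel and expert-parallel reductions are omitted for brevity.
    squared_sum = sum(
        fp32_tensor_for_norm(p, force_create_fp32_copy).pow(2).sum() for p in params
    )
    return float(torch.sqrt(squared_sum))
```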

Changes

Cohort / File(s) and Summary:

Training utility logic (src/megatron/bridge/training/utils/train_utils.py)
Corrects a comment typo ("Seperate" to "Separate") and refactors calc_params_l2_norm to introduce conditional logic for bf16 handling: it checks for main_param availability (sharded or non-sharded), uses main_param when force_create_fp32_copy is False, and otherwise falls back to param.data.float(); the same logic is applied to both the MoE and non-MoE branches for consistent parameter norm calculation.

Test coverage for BF16/MoE scenarios (tests/unit_tests/training/utils/test_train_utils.py)
Adds an extensive test suite covering calc_params_l2_norm behavior in BF16 mode with MoE, including scenarios with main_param present/absent, sharding status variations, dense/MoE parameter mixing, edge cases with the force_create_fp32_copy flag, and comprehensive mocking of distributed training components (data-parallel, model-parallel, and expert tensor groups); see the test sketch below.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • maanug-nv
  • yaoyu-33
🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed

❌ Failed checks (1 warning)

  • Test Results For Major Changes ⚠️ Warning
    Explanation: The PR contains major numerical computation changes to the L2 norm calculation in BF16 with MoE, but the PR description lacks test results, numerical validation, and memory measurements, and has incomplete placeholder text.
    Resolution: Add test execution results, numerical regression validation, and before-and-after memory measurements, and replace the placeholder text with detailed functional changes.

✅ Passed checks (3 passed)

  • Description Check ✅ Passed: Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title check ✅ Passed: The PR title clearly summarizes the main change: optimizing memory usage by using main_param for mixture-of-experts in the parameter L2 norm calculation.
  • Docstring Coverage ✅ Passed: Docstring coverage is 100.00%, which meets the required threshold of 80.00%.


@ananthsub ananthsub enabled auto-merge (squash) January 29, 2026 19:31
@ananthsub ananthsub merged commit 3a12356 into NVIDIA-NeMo:main Jan 29, 2026
80 of 85 checks passed
@ananthsub ananthsub deleted the sync-2249 branch January 29, 2026 19:50
conver334 pushed a commit to conver334/Megatron-Bridge that referenced this pull request Jan 30, 2026
…NeMo#2091)

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: conver334 <conver334@gmail.com>
