
[sync] Save memory using main_param for moe in param_l2_norm#2091

Merged
ananthsub merged 2 commits into NVIDIA-NeMo:main from ananthsub:sync-2249
Jan 29, 2026

Conversation

Contributor

@ananthsub ananthsub commented Jan 27, 2026

What does this PR do ?

Sync with changes from NVIDIA/Megatron-LM#2249

Changelog

  • Add specific line-by-line info of the high-level changes in this PR.

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex, etc.)
    • Reviewer: Does the PR have correct import guards for all optional libraries? (A generic guard sketch follows below.)

If you haven't finished some of the above items, you can still open a "Draft" PR.
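On the import-guard item above, a guard for an optional dependency typically looks like the following (a generic sketch only; Apex is used purely as an example and HAVE_APEX is an illustrative name, not necessarily how this repository spells its guards):

```python
# Guard the optional import once at module scope so that importing this module
# never fails when the extra dependency is missing.
try:
    import apex  # noqa: F401

    HAVE_APEX = True
except ImportError:
    HAVE_APEX = False


def fused_code_path(x):
    # Fail loudly, and only at the point where the optional feature is requested.
    if not HAVE_APEX:
        raise ImportError("Apex is required for this code path; install it or use the default path.")
    ...
```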

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

  • Bug Fixes

    • Fixed a documentation typo in training utilities.
  • Tests

    • Added comprehensive test coverage for parameter handling in BF16 training mode, including edge cases with parameter configurations and distributed training scenarios.



copy-pr-bot Bot commented Jan 27, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Contributor Author

/ok to test 65342e6

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Contributor Author

/ok to test d37e4af

@ananthsub ananthsub marked this pull request as ready for review January 29, 2026 18:32
Contributor

coderabbitai Bot commented Jan 29, 2026

📝 Walkthrough


The PR fixes a typo and enhances parameter norm calculation logic in BF16 mode by adding conditional handling for main_param availability and sharding status, with fallback mechanisms to create FP32 copies when necessary. Comprehensive test coverage is added to validate BF16 and MoE parameter handling scenarios.
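To make the described branch concrete, here is a minimal sketch of the selection logic (the helper name fp32_tensor_for_norm is invented for illustration; the attribute names main_param and main_param_sharded follow the summary above, and the real calc_params_l2_norm also performs distributed reductions that are omitted here):

```python
import torch


def fp32_tensor_for_norm(param: torch.Tensor, force_create_fp32_copy: bool = False) -> torch.Tensor:
    """Pick the fp32 tensor whose squared sum contributes to the parameter L2 norm."""
    main_param = getattr(param, "main_param", None)
    if not force_create_fp32_copy and main_param is not None:
        # Reuse the optimizer's fp32 master weight instead of allocating a new
        # fp32 copy of the bf16 param; this reuse is where the memory is saved.
        # If the master weight is sharded (main_param_sharded is True), each rank
        # contributes only its shard, and the partial squared sums must later be
        # reduced across the data-parallel group.
        return main_param
    # No usable main_param, or the caller forced a copy: materialize a temporary
    # fp32 copy of the bf16 data.
    return param.data.float()


def calc_l2_norm_sketch(params, force_create_fp32_copy: bool = False) -> float:
    # The same selection is applied to dense and MoE/expert params alike; the
    # model-parallel and expert-parallel reductions are omitted for brevity.
    squared_sum = sum(
        fp32_tensor_for_norm(p, force_create_fp32_copy).pow(2).sum() for p in params
    )
    return float(torch.sqrt(squared_sum))
```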

Changes

Cohort / File(s) and Summary:

Training utility logic (src/megatron/bridge/training/utils/train_utils.py)
Corrects a comment typo ("Seperate" to "Separate") and refactors calc_params_l2_norm to introduce conditional logic for bf16 handling: it checks for main_param availability (sharded or non-sharded), uses main_param when force_create_fp32_copy is False, and otherwise falls back to param.data.float(); the same logic is applied to both the MoE and non-MoE branches for consistent parameter norm calculation.

Test coverage for BF16/MoE scenarios (tests/unit_tests/training/utils/test_train_utils.py)
Adds an extensive test suite covering calc_params_l2_norm behavior in BF16 mode with MoE, including scenarios with main_param present/absent, sharding status variations, dense/MoE parameter mixing, edge cases with the force_create_fp32_copy flag, and comprehensive mocking of distributed training components (data-parallel, model-parallel, and expert tensor groups); see the test sketch below.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • maanug-nv
  • yaoyu-33
🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed

❌ Failed checks (1 warning)

  • Test Results For Major Changes ⚠️ Warning
    Explanation: The PR contains major numerical computation changes to the L2 norm calculation in BF16 with MoE, but the PR description lacks test results, numerical validation, and memory measurements, and has incomplete placeholder text.
    Resolution: Add test execution results, numerical regression validation, and before-and-after memory measurements, and replace the placeholder text with detailed functional changes.

✅ Passed checks (3 passed)

  • Description Check ✅ Passed: Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title check ✅ Passed: The PR title clearly summarizes the main change: optimizing memory usage by using main_param for mixture-of-experts in the parameter L2 norm calculation.
  • Docstring Coverage ✅ Passed: Docstring coverage is 100.00%, which meets the required threshold of 80.00%.


@ananthsub ananthsub enabled auto-merge (squash) January 29, 2026 19:31
@ananthsub ananthsub merged commit 3a12356 into NVIDIA-NeMo:main Jan 29, 2026
80 of 85 checks passed
@ananthsub ananthsub deleted the sync-2249 branch January 29, 2026 19:50
conver334 pushed a commit to conver334/Megatron-Bridge that referenced this pull request Jan 30, 2026
…NeMo#2091)

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: conver334 <conver334@gmail.com>
