[draft] bug for MoE distributed parallelism #752
Conversation
Signed-off-by: realAsma <akuriparambi@nvidia.com>
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
Codecov Report ❌
Additional details and impacted files:

@@            Coverage Diff             @@
##             main     #752      +/-   ##
==========================================
- Coverage   74.65%   74.63%   -0.03%
==========================================
  Files         192      192
  Lines       18969    18984      +15
==========================================
+ Hits        14162    14169       +7
- Misses       4807     4815       +8
==========================================
sync_quantizer_amax_across_dp_ep(
    child, module.parallel_state, get_module_device(module)
)
Could you please test locally that all MoE quantizers have amax after this line?
if "experts" in name and "weight_quantizer" in name:
    assert child.amax is not None
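A minimal sketch of such a local check might look like the following (the `experts`/`weight_quantizer` naming convention and the top-level `model` handle are assumptions, not confirmed by the diff):

# Hypothetical sanity check: after sync_quantizer_amax_across_dp_ep, every MoE
# expert weight quantizer should hold a non-None amax on every rank.
for name, child in model.named_modules():
    if "experts" in name and "weight_quantizer" in name:
        assert child.amax is not None, f"amax missing for {name}"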
if synced_amax is not None:
    # Move to target device
    if target_device is not None:
        synced_amax = synced_amax.to(target_device)
We need to add
synced_amax = synced_amax.clone().detach()
otherwise the sharding metadata of global_offset=(0, 0) on all ranks is kept when the checkpoint is saved.
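In context, the suggested fix might look roughly like this (variable names taken from the diff above; the exact placement is an assumption):

if synced_amax is not None:
    # Clone and detach so the synced tensor is a plain local tensor and does not
    # carry over sharding metadata such as a stale global_offset=(0, 0).
    synced_amax = synced_amax.clone().detach()
    # Move to target device
    if target_device is not None:
        synced_amax = synced_amax.to(target_device)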
Good catch, I am hoping you could take over the PR and address this
added below
Signed-off-by: jenchen13 <jennifchen@nvidia.com>
# Iterative max handles both scalar and tensor amax values
result = valid_amaxs[0]
for amax in valid_amaxs[1:]:
    result = torch.maximum(result, amax)
what happens if this line is comparing a scalar vs a tensor? how does it determine the max?
See https://docs.pytorch.org/docs/stable/generated/torch.maximum.html
It simply performs an element-wise maximum with broadcasting, so the shapes do not need to match as long as both operands are PyTorch tensors (including scalar tensors).
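For example, broadcasting makes the scalar-vs-tensor case work directly:

import torch

scalar_amax = torch.tensor(448.0)           # 0-dim (scalar) tensor
tensor_amax = torch.tensor([100.0, 500.0])  # per-channel amax
print(torch.maximum(scalar_amax, tensor_amax))  # tensor([448., 500.])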
| "supported by the current distributed backend. This warning can be ignored" | ||
| "if happening during modelopt restore." | ||
| ) | ||
| def sync_amax_across_distributed_group( |
The current sync_amax_across_distributed_group moves the amax to CPU to accommodate the case where some amaxs are None and some are tensors. However, this typically happens only for MoEs.
So can we keep the old sync method for non-MoEs:
dist.all_reduce(self._amax, op=dist.ReduceOp.MAX, group=parallel_group.group)
and use the object-based sync via CPU only for MoEs? A rough sketch is below.
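The split could look roughly like the following (the function name, `is_moe` flag, and attribute access are assumptions for illustration, not the actual modelopt API):

import torch
import torch.distributed as dist

def sync_amax(quantizer, parallel_group, is_moe: bool):
    """Hypothetical sketch: fast in-place all_reduce for non-MoE quantizers,
    object-based CPU sync only for MoE quantizers where some ranks may have amax=None."""
    if not is_moe and quantizer.amax is not None:
        # Old fast path: element-wise MAX reduction directly on the device tensor.
        dist.all_reduce(quantizer._amax, op=dist.ReduceOp.MAX, group=parallel_group.group)
        return
    # MoE path: gather amax values as Python objects via CPU so that ranks
    # holding None can still participate in the collective.
    local = quantizer.amax.cpu() if quantizer.amax is not None else None
    gathered = [None] * dist.get_world_size(parallel_group.group)
    dist.all_gather_object(gathered, local, group=parallel_group.group)
    valid = [a for a in gathered if a is not None]
    if valid:
        result = valid[0]
        for a in valid[1:]:
            result = torch.maximum(result, a)
        quantizer.amax = result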
What does this PR do?
Type of change: ?
Overview: ?
Usage
# Add a code snippet demonstrating how to use this
Testing
Before your PR is "Ready for review"
Additional Information