Autoquant and GPTQ in support in Megatron-Core [OMNIML-3151] by jenchen13 · Pull Request #1562 · NVIDIA/Model-Optimizer

jenchen13 · 2026-05-28T21:30:02Z

What does this PR do?

Type of change: New Feature

Adds AutoQuant and GPTQ support in Megatron-Core.

Usage

# Add a code snippet demonstrating how to use this

Testing

Tested AutoQuant on Nemotron Nano and Ultra.
Tested GPTQ on Nano 3.
Added unit tests for both AutoQuant and GPTQ.

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

Is this change backward compatible?: ✅ / ❌ / N/A
If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
Did you write any new necessary tests?: ✅ / ❌ / N/A
Did you update Changelog?: ✅ / ❌ / N/A
Did you get Claude approval on this PR?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

New Features
- Added Megatron-Core auto-quantization support with lazy Megatron plugin registration.
- Improved distributed AutoQuantize synchronization for expert-parallel (EP) models and made final recipe selection consistent across parallel groups.
- Expanded layerwise calibration discovery for Megatron decoder-layer workflows.
Bug Fixes
- Prevented division-by-zero during Hessian updates when calibration inputs contain zero tokens.
Tests
- Added coverage for EP auto-quantization and decoder-layer calibration discovery (including non-mutating behavior).

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>

…izers

The branch previously short-circuited mse_calibrate's Step 2 with an early `continue` that skipped any quantizer whose config didn't match the NVFP4 static pattern (num_bits=(2,1) + scale_bits=(4,3)).

This broke main's contract that:
- fp8_scale_sweep=True + registered backend -> backend factory called
- any enabled quantizer -> calibrator replaced with MseCalibrator (default)

Tests TestRegisterFP8SweepCalibrator::{test_mse_calibrate_dispatches_to_registered_factory, test_unregistered_backend_uses_default_mse_calibrator} regressed on this branch because they use INT8 quantizers, which were silently skipped.

Restructure so:
1. NVFP4-static promotion runs only when applicable (gated on module.is_nvfp4_static)
2. Backend factory dispatch runs for any backend with fp8_scale_sweep=True
3. NVFP4MSECalibrator runs only for NVFP4-static + fp8_scale_sweep
4. MseCalibrator default fallback runs for everything else (INT8, FP8, non-sweep NVFP4)

Also drops the misleading 'skipped non-NVFP4' warning (it implied we skip, but we now always set a calibrator).

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
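The restructured dispatch order in this commit message can be sketched as a small decision function. This is a simplified illustration, not the actual `mse_calibrate` code: `pick_calibrator`, its parameters, and the returned string labels are hypothetical, and the precedence between steps 2 and 3 is assumed from the order in which they are listed.

```python
def pick_calibrator(is_nvfp4_static, fp8_scale_sweep, backend, registered_backends):
    """Return a label for the calibrator a quantizer would receive.

    Hypothetical helper for illustration only. Step 1 (NVFP4-static promotion)
    changes the quantizer itself and is therefore not modeled here.
    """
    # Step 2: a registered backend factory is used whenever fp8_scale_sweep is
    # enabled, regardless of the quantizer format (INT8, FP8, NVFP4, ...).
    if fp8_scale_sweep and backend in registered_backends:
        return "backend_factory"
    # Step 3: the NVFP4-specific calibrator needs both conditions.
    if is_nvfp4_static and fp8_scale_sweep:
        return "NVFP4MSECalibrator"
    # Step 4: every other enabled quantizer gets the default MSE calibrator,
    # so nothing is silently skipped.
    return "MseCalibrator"
```

Under these assumptions an INT8 quantizer is never skipped: it reaches either the backend factory or the default `MseCalibrator`, which is the behavior the two regressed tests rely on.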

coderabbitai · 2026-05-28T21:31:00Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: eab7434b-45d1-4758-a348-feb19235f4a8

📥 Commits

Reviewing files that changed from the base of the PR and between 967d5ef and 2acbc03.

📒 Files selected for processing (1)

CHANGELOG.rst

✅ Files skipped from review due to trivial changes (1)

CHANGELOG.rst

📝 Walkthrough

AutoQuantize internals are updated to include the expert model parallel group in all distributed reductions (scores, costs, and final recipe selection). Weight-size computation is refactored to derive from candidate_stats via a new static helper. A Megatron plugin adds register_megatron_autoquant_support() and get_mcore_layerwise_calibration_layers(), lazily invoked from auto_quantize. A zero-input guard prevents division-by-zero in update_hessian.

Changes

AutoQuantize Expert Model Parallelism and Megatron Support

| Layer / File(s) | Summary |
| --- | --- |
| EP group inclusion in distributed synchronization and grouping rules `modelopt/torch/quantization/algorithms.py` | `get_score()`, `get_cost()`, and final `best_format` selection now reduce across `expert_model_parallel_group` in addition to TP/DP groups. An extra regex groups NemotronH MCore `local_experts` fused linear layers. |
| Weight size computation refactored to use candidate stats `modelopt/torch/quantization/algorithms.py` | `_get_total_weight_size_from_candidate_stats()` sums no-quant costs from `candidate_stats`. Both `run_search()` and `_resolve_best_recipe()` now call this helper instead of scanning module parameters. |
| Megatron auto-quantization and layerwise calibration plugin `modelopt/torch/quantization/plugins/megatron.py`, `modelopt/torch/quantization/model_quant.py` | Adds `register_megatron_autoquant_support()` (support predicate, no-op grad-checkpoint context, weight-name parameter filter) and `get_mcore_layerwise_calibration_layers()` registered with `LayerActivationCollector`. `auto_quantize` lazily imports and invokes the registration. |
| Calibration zero-input guard in `update_hessian` `modelopt/torch/quantization/utils/calib_utils.py`, `tests/gpu/torch/quantization/test_gptq.py` | Early return when `batch_size == 0` prevents division-by-zero in Hessian update; test asserts no-op behavior for zero-token inputs. |
| Unit test: weight budget derived from candidate stats `tests/unit/torch/quantization/test_autoquant.py` | Monkeypatches `_get_total_weight_size` to fail if called, then asserts `max_weight_size` equals the expected value from candidate costs. |
| Megatron EP auto-quantize and layerwise calibration tests `tests/gpu_megatron/torch/quantization/plugins/test_megatron.py` | Adds `_test_auto_quantize_moe_ep_helper` building a GPT MoE model with `expert_model_parallel_size`, running `auto_quantize_helper` with NVFP4/FP8 formats under `dist_workers_size_2`. Adds `test_mcore_layerwise_calibration_layers_do_not_mutate_decoder` asserting decoder layer immutability. |
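As a rough illustration of the "weight size from candidate stats" row above, the helper could look like the sketch below. The layout of `candidate_stats` (per-layer `formats` and `costs` lists), the `"NONE"` label for the unquantized candidate, and the function name are assumptions made for this example; the real `_get_total_weight_size_from_candidate_stats()` may differ.

```python
def total_weight_size_from_stats(candidate_stats, no_quant_label="NONE"):
    """Sum the cost of the unquantized candidate for every searched layer.

    Assumed layout (hypothetical):
        candidate_stats[layer_name] = {"formats": [...], "costs": [...]}
    where costs[i] is the weight size when the layer uses formats[i].
    """
    total = 0.0
    for stats in candidate_stats.values():
        # Locate the no-quant candidate and add its cost (the full-precision size).
        idx = stats["formats"].index(no_quant_label)
        total += stats["costs"][idx]
    return total


# Example with two layers and made-up sizes.
stats = {
    "layer0": {"formats": ["NVFP4", "NONE"], "costs": [25.0, 100.0]},
    "layer1": {"formats": ["FP8", "NONE"], "costs": [30.0, 60.0]},
}
baseline = total_weight_size_from_stats(stats)
```

Deriving the baseline from stored statistics rather than from module parameters means the same number can be recomputed when the stats are restored without re-scanning the model, which appears to be the motivation for the refactor.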

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Suggested reviewers

cjluo-nv
ChenhanYu

🚥 Pre-merge checks | ✅ 4 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 45.45% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title contains a grammatical error ('in support' should be 'support') and is somewhat vague about which specific changes are most important, though it does identify the main feature being added. | Consider revising the title to 'Add AutoQuant and GPTQ support for Megatron-Core models' for clarity and grammatical correctness. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Security Anti-Patterns | ✅ Passed | All Python code in the PR passes security review: no torch.load/numpy.load unsafe patterns, no eval/exec on external input, no hardcoded trust_remote_code, no nosec comments, lazy megatron plugin i... |

github-actions · 2026-05-28T21:35:00Z

PR Preview Action v1.8.1
🚀 View preview at https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1562/
Built to branch `gh-pages` at 2026-06-16 21:17 UTC. Preview will be ready when the GitHub Pages deployment is complete.

coderabbitai

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

modelopt/torch/quantization/utils/calib_utils.py (1)

60-61: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update the docstring note to reflect the new behavior.

The note states that "input must be non-empty" and "a zero-sized input causes division by zero", but the new guard clause at lines 66-67 now handles batch_size == 0 gracefully. Update the docstring to reflect that empty inputs are now supported.

📝 Proposed docstring update

-    Note: input must be non-empty (batch_size > 0); a zero-sized input causes division by zero.
+    Note: Empty inputs (batch_size == 0) are handled gracefully and return unchanged hessian/n_samples.
+          This can occur in MoE models when some experts receive no tokens.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/quantization/utils/calib_utils.py` around lines 60 - 61,
Update the docstring Note to reflect that empty inputs are now supported:
replace "input must be non-empty (batch_size > 0); a zero-sized input causes
division by zero" with a sentence stating that the function now handles
batch_size == 0 via the guard clause (which returns early when batch_size == 0)
and will not raise a division-by-zero error; mention that non-empty inputs are
still processed normally. Target the docstring for the function that contains
the guard checking batch_size == 0 (the docstring immediately above that guard)
and keep the wording brief and clear.
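To make the zero-input case concrete, here is a minimal pure-Python sketch of a GPTQ-style running Hessian update with the early-return guard. It is not the code in `calib_utils.py`: the function name, the list-of-lists representation, and the exact scaling are assumptions chosen so the example runs without PyTorch.

```python
def update_hessian_sketch(hessian, n_samples, inputs):
    """Running update H <- H * n/(n+b) + (2/(n+b)) * X^T X for a batch of b tokens.

    hessian: d x d list of lists; inputs: list of b token vectors of length d.
    Illustrative stand-in for a tensor-based implementation.
    """
    batch_size = len(inputs)
    if batch_size == 0:
        # Guard: with n_samples == 0 and batch_size == 0 the decay factor below
        # would be 0 / 0. An empty batch carries no information, so return the
        # state unchanged (this can happen for MoE experts that receive no tokens).
        return hessian, n_samples

    total = n_samples + batch_size
    decay = n_samples / total
    scale = 2.0 / total
    dim = len(hessian)
    updated = [
        [
            hessian[i][j] * decay + scale * sum(x[i] * x[j] for x in inputs)
            for j in range(dim)
        ]
        for i in range(dim)
    ]
    return updated, total
```

Calling it with an empty batch returns the same Hessian and sample count, while a non-empty batch is processed as before.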

🧹 Nitpick comments (2)

modelopt/torch/quantization/plugins/megatron.py (1)

810-837: ⚡ Quick win

Document and export the newly added public APIs.

register_megatron_autoquant_support and get_mcore_decoder_layers are public (non-underscore) but only one has a docstring, and neither is reflected in __all__.

As per coding guidelines, "Document public APIs with docstrings, including examples when useful" and "Define the public API with __all__ at the top of each module".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/quantization/plugins/megatron.py` around lines 810 - 837, Add
a docstring to the newly public function get_mcore_decoder_layers describing
purpose, parameters, return type and an example, and ensure
register_megatron_autoquant_support also has appropriate public-docstring
coverage if needed; then export both symbols by adding
"register_megatron_autoquant_support" and "get_mcore_decoder_layers" to the
module's __all__ list at the top of the file so they are part of the public API
surface.

modelopt/torch/quantization/model_quant.py (1)

510-515: ⚡ Quick win

Don’t silently swallow plugin import failures.

Line 514 currently suppresses all ImportErrors, which can hide real regressions and make Megatron auto-quant support silently disappear. Emit a warning (or gate the exception type more narrowly) so failures are diagnosable.

Proposed change

     try:
         from .plugins.megatron import register_megatron_autoquant_support

         register_megatron_autoquant_support()
-    except ImportError:
-        pass
+    except ImportError as exc:
+        warnings.warn(
+            f"Skipping Megatron auto-quant support registration due to import error: {exc}",
+            RuntimeWarning,
+            stacklevel=2,
+        )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/quantization/model_quant.py` around lines 510 - 515, The
current try/except around importing and calling
register_megatron_autoquant_support silently swallows ImportError; update the
block to either catch a more specific exception (e.g., ModuleNotFoundError for
the plugin import) or log a warning when import/call fails so failures are
visible; specifically wrap the import and call to
register_megatron_autoquant_support() and on failure call the module's logger or
warnings.warn/processLogger.warning with a clear message including the exception
text and that Megatron auto-quant support is disabled.
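The pattern suggested here, an optional plugin import that warns instead of failing silently, can be shown in isolation. The helper below is a hypothetical, generic version (its name and arguments are not part of ModelOpt); it uses only `importlib` and `warnings` from the standard library.

```python
import importlib
import warnings


def register_optional_plugin(module_name, register_fn_name):
    """Import an optional plugin and call its registration hook.

    Hypothetical helper: returns True if registration ran, False if the
    plugin could not be imported (a RuntimeWarning is emitted in that case).
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        # Surface the failure so a missing or broken plugin is diagnosable.
        warnings.warn(
            f"Skipping optional plugin {module_name!r} due to import error: {exc}",
            RuntimeWarning,
            stacklevel=2,
        )
        return False
    # Only the import is guarded; errors raised by the hook itself propagate.
    getattr(module, register_fn_name)()
    return True
```

Keeping the registration call outside the `try` block is a deliberate choice in this sketch: an `ImportError` raised from inside the hook would otherwise be mistaken for a missing plugin.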

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@modelopt/torch/quantization/plugins/megatron.py`:
- Around line 830-831: get_mcore_decoder_layers is mutating model.decoder.layers
by appending model.output_layer which causes duplicated entries on repeated
calls; instead return a new nn.ModuleList (e.g., copy model.decoder.layers into
a fresh list/ModuleList) and append the output_layer to that new collection or
check for existence before appending so augmentation is idempotent; update
get_mcore_decoder_layers (and calls from
LayerActivationCollector.get_decoder_layers /
LayerActivationCollector._patch_all_layers) to use the non-mutating copy so
_cleanup_layers need not undo permanent changes.
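The difference between the mutating and non-mutating variants can be demonstrated with plain Python stand-ins. The sketch below uses `SimpleNamespace` objects and lists in place of a real Megatron model and `nn.ModuleList`, and both function names are hypothetical; it only illustrates why returning a fresh collection keeps repeated calls idempotent.

```python
from types import SimpleNamespace


def get_layers_mutating(model):
    # Problematic pattern: appends to the model's own layer container,
    # so every call adds another copy of the output layer.
    model.decoder.layers.append(model.output_layer)
    return model.decoder.layers


def get_layers_non_mutating(model):
    # Copy into a fresh list first; the model's container is left untouched.
    layers = list(model.decoder.layers)
    if model.output_layer is not None:
        layers.append(model.output_layer)
    return layers


def make_model():
    # Stand-in for a model with two decoder layers and an output layer.
    return SimpleNamespace(
        decoder=SimpleNamespace(layers=["layer0", "layer1"]),
        output_layer="output_layer",
    )


safe_model = make_model()
first = get_layers_non_mutating(safe_model)
second = get_layers_non_mutating(safe_model)

leaky_model = make_model()
get_layers_mutating(leaky_model)
get_layers_mutating(leaky_model)
```

After two calls, `safe_model.decoder.layers` still holds the two original layers, whereas `leaky_model.decoder.layers` has grown to four entries.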

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 30c2390a-c99c-4b41-8c0c-0be68734dc77

📥 Commits

Reviewing files that changed from the base of the PR and between d63bf70 and 2ba29fd.

📒 Files selected for processing (7)

modelopt/torch/quantization/algorithms.py
modelopt/torch/quantization/model_quant.py
modelopt/torch/quantization/nn/modules/tensor_quantizer.py
modelopt/torch/quantization/plugins/megatron.py
modelopt/torch/quantization/utils/calib_utils.py
tests/gpu_megatron/torch/quantization/plugins/test_megatron.py
tests/unit/torch/quantization/test_autoquant.py

codecov · 2026-05-28T21:43:45Z

Codecov Report

❌ Patch coverage is 52.38095% with 20 lines in your changes missing coverage. Please review.
✅ Project coverage is 65.13%. Comparing base (1cccf66) to head (2acbc03).
⚠️ Report is 4 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| modelopt/torch/quantization/plugins/megatron.py | 38.46% | 16 Missing ⚠️ |
| modelopt/torch/quantization/utils/calib_utils.py | 0.00% | 2 Missing ⚠️ |
| modelopt/torch/quantization/algorithms.py | 90.90% | 1 Missing ⚠️ |
| modelopt/torch/quantization/model_quant.py | 66.66% | 1 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1562      +/-   ##
==========================================
+ Coverage   58.45%   65.13%   +6.67%     
==========================================
  Files         510      511       +1     
  Lines       56274    56390     +116     
==========================================
+ Hits        32896    36728    +3832     
+ Misses      23378    19662    -3716

| Flag | Coverage Δ | |
| --- | --- | --- |
| examples | `41.81% <52.38%> (+19.37%)` | ⬆️ |
| gpu | `20.58% <7.14%> (-0.02%)` | ⬇️ |
| regression | `14.69% <7.14%> (+0.06%)` | ⬆️ |
| unit | `54.33% <30.95%> (-0.02%)` | ⬇️ |

sugunav14 · 2026-06-02T17:53:12Z

+
+
+# GPTQ layerwise calibration support
+def get_mcore_decoder_layers(model: torch.nn.Module) -> torch.nn.ModuleList | None:


Since we are returning both the decoder layers and the output layer, could we rename this to better reflect that?

jenchen13

I can rename it to get_mcore_model_layers, but it is still called by LayerActivationCollector.register_decoder_layer_support.

sugunav14

Reviewed the GPTQ support! LGTM

copy-pr-bot · 2026-06-15T17:26:31Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

kevalmorabia97 · 2026-06-16T19:52:04Z

FAILED tests/unit/torch/quantization/test_autoquant.py::test_auto_quantize_budget_uses_no_quant_candidate_cost - AttributeError: '_BudgetCaptureSearcher' object has no attribute '_cost_model'. Did you mean: 'cost_model'?

jenchen13 added 14 commits May 22, 2026 10:52

MSE & mixed precision in mcore

676eac4

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>

lint

9143e02

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>

update mse test for mixed precision

2dea94a

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>

fix gpu test, no duplicate backend registration

918ed6a

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>

Merge branch 'main' into feature/mcore_mse_mixed_precision

05a436f

Signed-off-by: Jenny Chen <jennifchen@nvidia.com>

fix unit tests

ff20eca

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>

lazy init autoquant register

985da85

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>

revert breaking change on MSECalibrator

d88e54a

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>

revert autoquant and gptq changes

5f291a6

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>

fallback to copy HF remote code if no dir

5c9cd43

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>

Merge branch 'main' into feature/mcore_mse_mixed_precision

c5c7a2e

fix if else

d63bf70

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>

autoquant and gptq in mcore

2ba29fd

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>

jenchen13 requested a review from a team as a code owner May 28, 2026 21:30

jenchen13 requested review from ajrasane, realAsma and sugunav14 and removed request for a team May 28, 2026 21:30

jenchen13 changed the title ~~Autoquant and GPTQ in support in Megatron-Core~~ Autoquant and GPTQ in support in Megatron-Core [OMNIML-3151] May 28, 2026

coderabbitai Bot reviewed May 28, 2026

Comment thread modelopt/torch/quantization/plugins/megatron.py Outdated

jenchen13 added 3 commits May 29, 2026 06:06

fix logic again

e14fa62

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>

Merge branch 'main' into feature/mcore_mse_mixed_precision

e985e93

Merge branch 'feature/mcore_mse_mixed_precision' into jennifchen/mcor…

9088966

…e_autoquant_gptq

sugunav14 reviewed Jun 2, 2026

revert tensor quantizer changes

68e6837

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>

sugunav14 reviewed Jun 3, 2026

jenchen13 requested a review from kevalmorabia97 June 15, 2026 17:26

jenchen13 force-pushed the jennifchen/mcore_autoquant_gptq branch from a58dad0 to 967d5ef Compare June 15, 2026 17:29

jenchen13 removed request for a team June 15, 2026 17:30

jenchen13 force-pushed the jennifchen/mcore_autoquant_gptq branch from 23b65c8 to 967d5ef Compare June 15, 2026 21:13

jenchen13 changed the base branch from feature/mcore_mse_mixed_precision to main June 16, 2026 17:59

jenchen13 requested a review from a team as a code owner June 16, 2026 17:59

kevalmorabia97 reviewed Jun 16, 2026

Comment thread noxfile.py

kevalmorabia97 reviewed Jun 16, 2026

Comment thread CHANGELOG.rst

Merge branch 'main' into jennifchen/mcore_autoquant_gptq

5b141b5

jenchen13 added 2 commits June 16, 2026 13:49

Fix AutoQuant budget denominator for restored stats

55ed710

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>

Update changelog for MCore quantization support

2acbc03

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>


