Autoquant and GPTQ in support in Megatron-Core [OMNIML-3151]#1562

Open
jenchen13 wants to merge 22 commits into
mainfrom
jennifchen/mcore_autoquant_gptq

@jenchen13 jenchen13 commented May 28, 2026

Contributor

What does this PR do?

Type of change: New Feature

Adds AutoQuant and GPTQ support for Megatron-Core.

Usage

# Add a code snippet demonstrating how to use this

Testing

Tested AutoQuant on Nemotron Nano and Ultra.
Tested GPTQ on Nano 3.
Added unit tests for both AutoQuant and GPTQ.

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A
  • Did you get Claude approval on this PR?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit


  • New Features

    • Added Megatron-Core auto-quantization support with lazy Megatron plugin registration.
    • Improved distributed AutoQuantize synchronization for expert-parallel (EP) models and made final recipe selection consistent across parallel groups.
    • Expanded layerwise calibration discovery for Megatron decoder-layer workflows.
  • Bug Fixes

    • Prevented division-by-zero during Hessian updates when calibration inputs contain zero tokens.
  • Tests

    • Added coverage for EP auto-quantization and decoder-layer calibration discovery (including non-mutating behavior).

jenchen13 added 14 commits May 22, 2026 10:52
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
…izers

The branch previously short-circuited mse_calibrate's Step 2 with an early
`continue` that skipped any quantizer whose config didn't match the NVFP4
static pattern (num_bits=(2,1) + scale_bits=(4,3)). This broke main's
contract that:
  - fp8_scale_sweep=True + registered backend  -> backend factory called
  - any enabled quantizer                       -> calibrator replaced with
                                                  MseCalibrator (default)

Tests TestRegisterFP8SweepCalibrator::{
  test_mse_calibrate_dispatches_to_registered_factory,
  test_unregistered_backend_uses_default_mse_calibrator,
} regressed on this branch because they use INT8 quantizers which were
silently skipped.

Restructure so:
  1. NVFP4-static promotion runs only when applicable (gated on
     module.is_nvfp4_static)
  2. Backend factory dispatch runs for any backend with fp8_scale_sweep=True
  3. NVFP4MSECalibrator runs only for NVFP4-static + fp8_scale_sweep
  4. MseCalibrator default fallback runs for everything else (INT8, FP8,
     non-sweep NVFP4)

Also drops the misleading 'skipped non-NVFP4' warning (it implied we skip,
but we now always set a calibrator).

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: Jenny Chen <jennifchen@nvidia.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
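
The calibrator selection order described in the mse_calibrate commit message above can be sketched as follows. This is only an illustration of the ordering, not the code in this PR: `QuantizerInfo`, `choose_calibrator`, and the returned string labels are hypothetical stand-ins, and the NVFP4-static promotion step is left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class QuantizerInfo:
    """Hypothetical stand-in for the quantizer attributes the dispatch reads."""

    backend: Optional[str] = None  # name of a custom backend, if any
    is_nvfp4_static: bool = False  # True when the config matches the NVFP4 static pattern


def choose_calibrator(
    q: QuantizerInfo,
    fp8_scale_sweep: bool,
    registered_factories: Dict[str, Callable[[], str]],
) -> str:
    """Return a label for the calibrator that would be attached to `q`."""
    # A registered backend factory is used whenever the sweep is requested.
    if fp8_scale_sweep and q.backend in registered_factories:
        return registered_factories[q.backend]()
    # The NVFP4-specific calibrator applies only to NVFP4-static quantizers with sweep.
    if fp8_scale_sweep and q.is_nvfp4_static:
        return "NVFP4MSECalibrator"
    # Everything else (INT8, FP8, non-sweep NVFP4) gets the default calibrator.
    return "MseCalibrator"
```

The point of the ordering is that no enabled quantizer falls through without a calibrator, which is the contract the two regressed tests check.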
@jenchen13 jenchen13 requested a review from a team as a code owner May 28, 2026 21:30
@jenchen13 jenchen13 requested review from ajrasane, realAsma and sugunav14 and removed request for a team May 28, 2026 21:30
@jenchen13 jenchen13 changed the title Autoquant and GPTQ in support in Megatron-Core Autoquant and GPTQ in support in Megatron-Core [OMNIML-3151] May 28, 2026
@coderabbitai

coderabbitai Bot commented May 28, 2026

Contributor


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: eab7434b-45d1-4758-a348-feb19235f4a8

📥 Commits

Reviewing files that changed from the base of the PR and between 967d5ef and 2acbc03.

📒 Files selected for processing (1)
  • CHANGELOG.rst
✅ Files skipped from review due to trivial changes (1)
  • CHANGELOG.rst

📝 Walkthrough


AutoQuantize internals are updated to include the expert model parallel group in all distributed reductions (scores, costs, and final recipe selection). Weight-size computation is refactored to derive from candidate_stats via a new static helper. A Megatron plugin adds register_megatron_autoquant_support() and get_mcore_layerwise_calibration_layers(), lazily invoked from auto_quantize. A zero-input guard prevents division-by-zero in update_hessian.
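
Why the expert-parallel group matters for a consistent recipe can be seen in a small single-process simulation. This is a sketch under stated assumptions: the sum reduction, the four-rank layout, and the helper `all_reduce_sum` are hypothetical and do not use `torch.distributed` or the PR's actual reduction code.

```python
from typing import List, Sequence


def all_reduce_sum(values: List[float], groups: Sequence[Sequence[int]]) -> List[float]:
    """Simulate an all-reduce (sum) within each group of rank indices."""
    out = list(values)
    for group in groups:
        total = sum(values[r] for r in group)
        for r in group:
            out[r] = total
    return out


# One local score per rank (hypothetical values).
local_scores = [1.0, 2.0, 3.0, 4.0]
dp_groups = [[0, 1], [2, 3]]  # assumed data-parallel groups
ep_groups = [[0, 2], [1, 3]]  # assumed expert-parallel groups

after_dp = all_reduce_sum(local_scores, dp_groups)
after_dp_ep = all_reduce_sum(after_dp, ep_groups)

print(after_dp)     # ranks in different DP groups still disagree
print(after_dp_ep)  # every rank now holds the same value
```

If ranks pick a format from scores that differ across EP ranks, they can choose different recipes; reducing over the EP group as well removes that disagreement in this toy setup.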

Changes

AutoQuantize Expert Model Parallelism and Megatron Support

Layer / File(s) Summary
EP group inclusion in distributed synchronization and grouping rules
modelopt/torch/quantization/algorithms.py
get_score(), get_cost(), and final best_format selection now reduce across expert_model_parallel_group in addition to TP/DP groups. An extra regex groups NemotronH MCore local_experts fused linear layers.
Weight size computation refactored to use candidate stats
modelopt/torch/quantization/algorithms.py
_get_total_weight_size_from_candidate_stats() sums no-quant costs from candidate_stats. Both run_search() and _resolve_best_recipe() now call this helper instead of scanning module parameters.
Megatron auto-quantization and layerwise calibration plugin
modelopt/torch/quantization/plugins/megatron.py, modelopt/torch/quantization/model_quant.py
Adds register_megatron_autoquant_support() (support predicate, no-op grad-checkpoint context, weight-name parameter filter) and get_mcore_layerwise_calibration_layers() registered with LayerActivationCollector. auto_quantize lazily imports and invokes the registration.
Calibration zero-input guard in update_hessian
modelopt/torch/quantization/utils/calib_utils.py, tests/gpu/torch/quantization/test_gptq.py
Early return when batch_size == 0 prevents division-by-zero in Hessian update; test asserts no-op behavior for zero-token inputs.
Unit test: weight budget derived from candidate stats
tests/unit/torch/quantization/test_autoquant.py
Monkeypatches _get_total_weight_size to fail if called, then asserts max_weight_size equals the expected value from candidate costs.
Megatron EP auto-quantize and layerwise calibration tests
tests/gpu_megatron/torch/quantization/plugins/test_megatron.py
Adds _test_auto_quantize_moe_ep_helper building a GPT MoE model with expert_model_parallel_size, running auto_quantize_helper with NVFP4/FP8 formats under dist_workers_size_2. Adds test_mcore_layerwise_calibration_layers_do_not_mutate_decoder asserting decoder layer immutability.
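
The idea of deriving the weight budget from candidate stats rather than from module parameters can be sketched like this. The dictionary layout, the `"NONE"` label for the no-quant candidate, the module names, and the `effective_bits / 16` budget rule are all assumptions made for illustration; the real `candidate_stats` structure in `algorithms.py` may differ.

```python
from typing import Dict, List

# Hypothetical layout: per searchable module, parallel lists of formats and costs.
CandidateStats = Dict[str, Dict[str, List]]


def total_weight_size_from_candidate_stats(
    stats: CandidateStats, no_quant: str = "NONE"
) -> float:
    """Sum the cost of the no-quant candidate over all modules."""
    total = 0.0
    for entry in stats.values():
        idx = entry["formats"].index(no_quant)  # raises ValueError if absent
        total += entry["costs"][idx]
    return total


stats = {
    "decoder.layers.0.mlp": {"formats": ["NVFP4", "FP8", "NONE"], "costs": [25.0, 50.0, 100.0]},
    "decoder.layers.1.mlp": {"formats": ["NVFP4", "FP8", "NONE"], "costs": [50.0, 100.0, 200.0]},
}

total = total_weight_size_from_candidate_stats(stats)
max_weight_size = total * (8.0 / 16.0)  # assumed budget rule: effective_bits / 16
```

Because the costs are presumably already synchronized across parallel groups, a budget computed this way does not depend on which parameters happen to live on the local rank.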

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Suggested reviewers

  • cjluo-nv
  • ChenhanYu
🚥 Pre-merge checks | ✅ 4 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 45.45% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title contains a grammatical error ('in support' should be 'support') and is somewhat vague about which specific changes are most important, though it does identify the main feature being added. | Consider revising the title to 'Add AutoQuant and GPTQ support for Megatron-Core models' for clarity and grammatical correctness. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Security Anti-Patterns | ✅ Passed | All Python code in the PR passes security review: no torch.load/numpy.load unsafe patterns, no eval/exec on external input, no hardcoded trust_remote_code, no nosec comments, lazy megatron plugin i... |


@github-actions

github-actions Bot commented May 28, 2026

Contributor
PR Preview Action v1.8.1


🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1562/

Built to branch gh-pages at 2026-06-16 21:17 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@coderabbitai coderabbitai Bot left a comment

Contributor

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
modelopt/torch/quantization/utils/calib_utils.py (1)

60-61: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update the docstring note to reflect the new behavior.

The note states that "input must be non-empty" and "a zero-sized input causes division by zero", but the new guard clause at lines 66-67 now handles batch_size == 0 gracefully. Update the docstring to reflect that empty inputs are now supported.

📝 Proposed docstring update
-    Note: input must be non-empty (batch_size > 0); a zero-sized input causes division by zero.
+    Note: Empty inputs (batch_size == 0) are handled gracefully and return unchanged hessian/n_samples.
+          This can occur in MoE models when some experts receive no tokens.
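
For context, the guard can be illustrated with a pure-Python version of a GPTQ-style running Hessian update. This is a sketch, assuming the update follows the usual GPTQ rescale-then-accumulate form; it uses nested lists instead of tensors and is not the `update_hessian` implementation in `calib_utils.py`.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def update_hessian(rows: Matrix, hessian: Matrix, n_samples: int) -> Tuple[Matrix, int]:
    """Running-average Hessian update over `rows` (one row per token)."""
    batch_size = len(rows)
    if batch_size == 0:
        # Guard: nothing to accumulate. Without it, n_samples + batch_size
        # can be 0 and the rescale below would divide by zero.
        return hessian, n_samples
    scale_old = n_samples / (n_samples + batch_size)
    n_samples += batch_size
    scale_new = 2.0 / n_samples
    dim = len(hessian)
    for i in range(dim):
        for j in range(dim):
            # Entry (i, j) of the sum of outer products of the input rows.
            outer = sum(r[i] * r[j] for r in rows)
            hessian[i][j] = hessian[i][j] * scale_old + scale_new * outer
    return hessian, n_samples
```

An expert that receives no tokens during calibration passes an empty batch, and with the guard its Hessian and sample count simply stay as they were.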
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/quantization/utils/calib_utils.py` around lines 60 - 61,
Update the docstring Note to reflect that empty inputs are now supported:
replace "input must be non-empty (batch_size > 0); a zero-sized input causes
division by zero" with a sentence stating that the function now handles
batch_size == 0 via the guard clause (which returns early when batch_size == 0)
and will not raise a division-by-zero error; mention that non-empty inputs are
still processed normally. Target the docstring for the function that contains
the guard checking batch_size == 0 (the docstring immediately above that guard)
and keep the wording brief and clear.
🧹 Nitpick comments (2)
modelopt/torch/quantization/plugins/megatron.py (1)

810-837: ⚡ Quick win

Document and export the newly added public APIs.

register_megatron_autoquant_support and get_mcore_decoder_layers are public (non-underscore) but only one has a docstring, and neither is reflected in __all__.

As per coding guidelines, "Document public APIs with docstrings, including examples when useful" and "Define the public API with __all__ at the top of each module".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/quantization/plugins/megatron.py` around lines 810 - 837, Add
a docstring to the newly public function get_mcore_decoder_layers describing
purpose, parameters, return type and an example, and ensure
register_megatron_autoquant_support also has appropriate public-docstring
coverage if needed; then export both symbols by adding
"register_megatron_autoquant_support" and "get_mcore_decoder_layers" to the
module's __all__ list at the top of the file so they are part of the public API
surface.
modelopt/torch/quantization/model_quant.py (1)

510-515: ⚡ Quick win

Don’t silently swallow plugin import failures.

Line 514 currently suppresses all ImportErrors, which can hide real regressions and make Megatron auto-quant support silently disappear. Emit a warning (or gate the exception type more narrowly) so failures are diagnosable.

Proposed change
     try:
         from .plugins.megatron import register_megatron_autoquant_support

         register_megatron_autoquant_support()
-    except ImportError:
-        pass
+    except ImportError as exc:
+        warnings.warn(
+            f"Skipping Megatron auto-quant support registration due to import error: {exc}",
+            RuntimeWarning,
+            stacklevel=2,
+        )
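
The pattern suggested here, an optional import that warns instead of failing silently, can be tried in isolation. In this sketch `try_register_plugin` and its arguments are hypothetical helpers for illustration; note that the proposed diff also relies on `warnings` being imported in `model_quant.py`.

```python
import importlib
import warnings


def try_register_plugin(module_name: str, func_name: str) -> bool:
    """Import `module_name` and call `func_name()`; warn and return False if the import fails."""
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        warnings.warn(
            f"Skipping {module_name}.{func_name} due to import error: {exc}",
            RuntimeWarning,
            stacklevel=2,
        )
        return False
    # Only the import is guarded, so errors raised by the registration itself still surface.
    getattr(module, func_name)()
    return True
```

Keeping the registration call outside the `try` block is a deliberate choice in this sketch: an `ImportError` raised deep inside the plugin during registration is not mistaken for the plugin being absent.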
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/quantization/model_quant.py` around lines 510 - 515, The
current try/except around importing and calling
register_megatron_autoquant_support silently swallows ImportError; update the
block to either catch a more specific exception (e.g., ModuleNotFoundError for
the plugin import) or log a warning when import/call fails so failures are
visible; specifically wrap the import and call to
register_megatron_autoquant_support() and on failure call the module's logger or
warnings.warn/processLogger.warning with a clear message including the exception
text and that Megatron auto-quant support is disabled.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@modelopt/torch/quantization/plugins/megatron.py`:
- Around line 830-831: get_mcore_decoder_layers is mutating model.decoder.layers
by appending model.output_layer which causes duplicated entries on repeated
calls; instead return a new nn.ModuleList (e.g., copy model.decoder.layers into
a fresh list/ModuleList) and append the output_layer to that new collection or
check for existence before appending so augmentation is idempotent; update
get_mcore_decoder_layers (and calls from
LayerActivationCollector.get_decoder_layers /
LayerActivationCollector._patch_all_layers) to use the non-mutating copy so
_cleanup_layers need not undo permanent changes.
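
The non-mutating approach described in this comment can be sketched with plain Python stand-ins. `FakeDecoder`, `FakeModel`, and `get_layers_non_mutating` are hypothetical names; with real modules the analogous copy would presumably be `torch.nn.ModuleList(list(model.decoder.layers))`.

```python
from typing import List, Optional


class FakeDecoder:
    """Stand-in for a decoder that owns a list of layers."""

    def __init__(self, layers: List[str]) -> None:
        self.layers = layers


class FakeModel:
    """Stand-in for a model with a decoder and an optional output layer."""

    def __init__(self, layers: List[str], output_layer: Optional[str] = None) -> None:
        self.decoder = FakeDecoder(layers)
        self.output_layer = output_layer


def get_layers_non_mutating(model: FakeModel) -> List[str]:
    """Return decoder layers plus the output layer without touching the model."""
    layers = list(model.decoder.layers)  # shallow copy; the original list is untouched
    if model.output_layer is not None:
        layers.append(model.output_layer)
    return layers
```

Calling the function twice returns the same list both times and leaves `model.decoder.layers` at its original length, which is the idempotence the comment asks for.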


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 30c2390a-c99c-4b41-8c0c-0be68734dc77

📥 Commits

Reviewing files that changed from the base of the PR and between d63bf70 and 2ba29fd.

📒 Files selected for processing (7)
  • modelopt/torch/quantization/algorithms.py
  • modelopt/torch/quantization/model_quant.py
  • modelopt/torch/quantization/nn/modules/tensor_quantizer.py
  • modelopt/torch/quantization/plugins/megatron.py
  • modelopt/torch/quantization/utils/calib_utils.py
  • tests/gpu_megatron/torch/quantization/plugins/test_megatron.py
  • tests/unit/torch/quantization/test_autoquant.py

Comment thread modelopt/torch/quantization/plugins/megatron.py Outdated
@codecov

codecov Bot commented May 28, 2026


Codecov Report

❌ Patch coverage is 52.38095% with 20 lines in your changes missing coverage. Please review.
✅ Project coverage is 65.13%. Comparing base (1cccf66) to head (2acbc03).
⚠️ Report is 4 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| modelopt/torch/quantization/plugins/megatron.py | 38.46% | 16 Missing ⚠️ |
| modelopt/torch/quantization/utils/calib_utils.py | 0.00% | 2 Missing ⚠️ |
| modelopt/torch/quantization/algorithms.py | 90.90% | 1 Missing ⚠️ |
| modelopt/torch/quantization/model_quant.py | 66.66% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1562      +/-   ##
==========================================
+ Coverage   58.45%   65.13%   +6.67%     
==========================================
  Files         510      511       +1     
  Lines       56274    56390     +116     
==========================================
+ Hits        32896    36728    +3832     
+ Misses      23378    19662    -3716     
| Flag | Coverage Δ |
|---|---|
| examples | 41.81% <52.38%> (+19.37%) ⬆️ |
| gpu | 20.58% <7.14%> (-0.02%) ⬇️ |
| regression | 14.69% <7.14%> (+0.06%) ⬆️ |
| unit | 54.33% <30.95%> (-0.02%) ⬇️ |




# GPTQ layerwise calibration support
def get_mcore_decoder_layers(model: torch.nn.Module) -> torch.nn.ModuleList | None:

Contributor

Since we are returning both the decoder layers and the output layer, could we rename this to better reflect that?

Contributor Author

I can rename it to get_mcore_model_layers, but it's still called by LayerActivationCollector.register_decoder_layer_support.

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>

@sugunav14 sugunav14 left a comment

Contributor

Reviewed the GPTQ support! LGTM

@jenchen13 jenchen13 requested a review from kevalmorabia97 June 15, 2026 17:26
@copy-pr-bot

copy-pr-bot Bot commented Jun 15, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@jenchen13 jenchen13 force-pushed the jennifchen/mcore_autoquant_gptq branch from a58dad0 to 967d5ef Compare June 15, 2026 17:29
@jenchen13 jenchen13 removed request for a team June 15, 2026 17:30
@jenchen13 jenchen13 force-pushed the jennifchen/mcore_autoquant_gptq branch from 23b65c8 to 967d5ef Compare June 15, 2026 21:13
@jenchen13 jenchen13 changed the base branch from feature/mcore_mse_mixed_precision to main June 16, 2026 17:59
@jenchen13 jenchen13 requested a review from a team as a code owner June 16, 2026 17:59
Comment thread noxfile.py
Comment thread CHANGELOG.rst
@kevalmorabia97

Collaborator
FAILED tests/unit/torch/quantization/test_autoquant.py::test_auto_quantize_budget_uses_no_quant_candidate_cost - AttributeError: '_BudgetCaptureSearcher' object has no attribute '_cost_model'. Did you mean: 'cost_model'?

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>