Autoquant and GPTQ in support in Megatron-Core [OMNIML-3151]#1562
Autoquant and GPTQ in support in Megatron-Core [OMNIML-3151]#1562jenchen13 wants to merge 22 commits into
Conversation
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
…izers
The branch previously short-circuited mse_calibrate's Step 2 with an early
`continue` that skipped any quantizer whose config didn't match the NVFP4
static pattern (num_bits=(2,1) + scale_bits=(4,3)). This broke main's
contract that:
- fp8_scale_sweep=True + registered backend -> backend factory called
- any enabled quantizer -> calibrator replaced with
MseCalibrator (default)
Tests TestRegisterFP8SweepCalibrator::{
test_mse_calibrate_dispatches_to_registered_factory,
test_unregistered_backend_uses_default_mse_calibrator,
} regressed on this branch because they use INT8 quantizers which were
silently skipped.
Restructure so:
1. NVFP4-static promotion runs only when applicable (gated on
module.is_nvfp4_static)
2. Backend factory dispatch runs for any backend with fp8_scale_sweep=True
3. NVFP4MSECalibrator runs only for NVFP4-static + fp8_scale_sweep
4. MseCalibrator default fallback runs for everything else (INT8, FP8,
non-sweep NVFP4)
Also drops the misleading 'skipped non-NVFP4' warning (it implied we skip,
but we now always set a calibrator).
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: Jenny Chen <jennifchen@nvidia.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
✅ Files skipped from review due to trivial changes (1)
📝 WalkthroughWalkthroughAutoQuantize internals are updated to include the expert model parallel group in all distributed reductions (scores, costs, and final recipe selection). Weight-size computation is refactored to derive from ChangesAutoQuantize Expert Model Parallelism and Megatron Support
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/torch/quantization/utils/calib_utils.py (1)
60-61:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winUpdate the docstring note to reflect the new behavior.
The note states that "input must be non-empty" and "a zero-sized input causes division by zero", but the new guard clause at lines 66-67 now handles
batch_size == 0gracefully. Update the docstring to reflect that empty inputs are now supported.📝 Proposed docstring update
- Note: input must be non-empty (batch_size > 0); a zero-sized input causes division by zero. + Note: Empty inputs (batch_size == 0) are handled gracefully and return unchanged hessian/n_samples. + This can occur in MoE models when some experts receive no tokens.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modelopt/torch/quantization/utils/calib_utils.py` around lines 60 - 61, Update the docstring Note to reflect that empty inputs are now supported: replace "input must be non-empty (batch_size > 0); a zero-sized input causes division by zero" with a sentence stating that the function now handles batch_size == 0 via the guard clause (which returns early when batch_size == 0) and will not raise a division-by-zero error; mention that non-empty inputs are still processed normally. Target the docstring for the function that contains the guard checking batch_size == 0 (the docstring immediately above that guard) and keep the wording brief and clear.
🧹 Nitpick comments (2)
modelopt/torch/quantization/plugins/megatron.py (1)
810-837: ⚡ Quick winDocument and export the newly added public APIs.
register_megatron_autoquant_supportandget_mcore_decoder_layersare public (non-underscore) but only one has a docstring, and neither is reflected in__all__.As per coding guidelines, "Document public APIs with docstrings, including examples when useful" and "Define the public API with
__all__at the top of each module".🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modelopt/torch/quantization/plugins/megatron.py` around lines 810 - 837, Add a docstring to the newly public function get_mcore_decoder_layers describing purpose, parameters, return type and an example, and ensure register_megatron_autoquant_support also has appropriate public-docstring coverage if needed; then export both symbols by adding "register_megatron_autoquant_support" and "get_mcore_decoder_layers" to the module's __all__ list at the top of the file so they are part of the public API surface.modelopt/torch/quantization/model_quant.py (1)
510-515: ⚡ Quick winDon’t silently swallow plugin import failures.
Line 514 currently suppresses all
ImportErrors, which can hide real regressions and make Megatron auto-quant support silently disappear. Emit a warning (or gate the exception type more narrowly) so failures are diagnosable.Proposed change
try: from .plugins.megatron import register_megatron_autoquant_support register_megatron_autoquant_support() - except ImportError: - pass + except ImportError as exc: + warnings.warn( + f"Skipping Megatron auto-quant support registration due to import error: {exc}", + RuntimeWarning, + stacklevel=2, + )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modelopt/torch/quantization/model_quant.py` around lines 510 - 515, The current try/except around importing and calling register_megatron_autoquant_support silently swallows ImportError; update the block to either catch a more specific exception (e.g., ModuleNotFoundError for the plugin import) or log a warning when import/call fails so failures are visible; specifically wrap the import and call to register_megatron_autoquant_support() and on failure call the module's logger or warnings.warn/processLogger.warning with a clear message including the exception text and that Megatron auto-quant support is disabled.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@modelopt/torch/quantization/plugins/megatron.py`:
- Around line 830-831: get_mcore_decoder_layers is mutating model.decoder.layers
by appending model.output_layer which causes duplicated entries on repeated
calls; instead return a new nn.ModuleList (e.g., copy model.decoder.layers into
a fresh list/ModuleList) and append the output_layer to that new collection or
check for existence before appending so augmentation is idempotent; update
get_mcore_decoder_layers (and calls from
LayerActivationCollector.get_decoder_layers /
LayerActivationCollector._patch_all_layers) to use the non-mutating copy so
_cleanup_layers need not undo permanent changes.
---
Outside diff comments:
In `@modelopt/torch/quantization/utils/calib_utils.py`:
- Around line 60-61: Update the docstring Note to reflect that empty inputs are
now supported: replace "input must be non-empty (batch_size > 0); a zero-sized
input causes division by zero" with a sentence stating that the function now
handles batch_size == 0 via the guard clause (which returns early when
batch_size == 0) and will not raise a division-by-zero error; mention that
non-empty inputs are still processed normally. Target the docstring for the
function that contains the guard checking batch_size == 0 (the docstring
immediately above that guard) and keep the wording brief and clear.
---
Nitpick comments:
In `@modelopt/torch/quantization/model_quant.py`:
- Around line 510-515: The current try/except around importing and calling
register_megatron_autoquant_support silently swallows ImportError; update the
block to either catch a more specific exception (e.g., ModuleNotFoundError for
the plugin import) or log a warning when import/call fails so failures are
visible; specifically wrap the import and call to
register_megatron_autoquant_support() and on failure call the module's logger or
warnings.warn/processLogger.warning with a clear message including the exception
text and that Megatron auto-quant support is disabled.
In `@modelopt/torch/quantization/plugins/megatron.py`:
- Around line 810-837: Add a docstring to the newly public function
get_mcore_decoder_layers describing purpose, parameters, return type and an
example, and ensure register_megatron_autoquant_support also has appropriate
public-docstring coverage if needed; then export both symbols by adding
"register_megatron_autoquant_support" and "get_mcore_decoder_layers" to the
module's __all__ list at the top of the file so they are part of the public API
surface.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 30c2390a-c99c-4b41-8c0c-0be68734dc77
📒 Files selected for processing (7)
modelopt/torch/quantization/algorithms.pymodelopt/torch/quantization/model_quant.pymodelopt/torch/quantization/nn/modules/tensor_quantizer.pymodelopt/torch/quantization/plugins/megatron.pymodelopt/torch/quantization/utils/calib_utils.pytests/gpu_megatron/torch/quantization/plugins/test_megatron.pytests/unit/torch/quantization/test_autoquant.py
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #1562 +/- ##
==========================================
+ Coverage 58.45% 65.13% +6.67%
==========================================
Files 510 511 +1
Lines 56274 56390 +116
==========================================
+ Hits 32896 36728 +3832
+ Misses 23378 19662 -3716
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
|
|
||
|
|
||
| # GPTQ layerwise calibration support | ||
| def get_mcore_decoder_layers(model: torch.nn.Module) -> torch.nn.ModuleList | None: |
There was a problem hiding this comment.
since we are returning both decoder layers and output layer could we rename this to better reflect that?
There was a problem hiding this comment.
i can rename to get_mcore_model_layers, but it's still being called by LayerActivationCollector.register_decoder_layer_support
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
sugunav14
left a comment
There was a problem hiding this comment.
Reviewed the GPTQ support! LGTM
a58dad0 to
967d5ef
Compare
23b65c8 to
967d5ef
Compare
|
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
What does this PR do?
Type of change: New Feature
Autoquant and GPTQ in support in Megatron-Core
Usage
# Add a code snippet demonstrating how to use thisTesting
Tested AutoQuant on Nemotron Nano and Ultra.
Tested GPTQ on Nano 3.
Added unit tests for both AutoQuant and GPTQ
Before your PR is "Ready for review"
Make sure you read and follow Contributor guidelines and your commits are signed (
git commit -s -S).Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded
trust_remote_code=True,torch.load(..., weights_only=False),pickle, etc.).CONTRIBUTING.md: ✅ / ❌ / N/AAdditional Information
Summary by CodeRabbit
Summary by CodeRabbit
New Features
Bug Fixes
Tests