
Kinjal/vllm fix prequant scale#1242

Draft
kinjalpatel27 wants to merge 4 commits into kinjal/vllm_super_nano_support from kinjal/vllm_fix_prequant_scale

Conversation

@kinjalpatel27
Contributor

@kinjalpatel27 kinjalpatel27 commented Apr 13, 2026

What does this PR do?

Type of change: ?

Usage

# Add a code snippet demonstrating how to use this

Testing

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

  • Documentation

    • Removed AWQ reload limitation from known issues.
  • New Features

    • Added support for NemotronHf and NemotronHMOE model architectures.
    • Enhanced MoE/AWQ export with expert weight resmoothing and requantization.
  • Refactor

    • Improved checkpoint deserialization and quantizer state validation.
    • Optimized tensor-parallel sharding operations for quantized models.

Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Apr 13, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Apr 13, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 29cff1ec-6a38-4059-a59c-7af6ad70df3a

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • ✅ Review completed - (🔄 Check again to review again)
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch kinjal/vllm_fix_prequant_scale

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
examples/vllm_serve/fakequant_worker.py (2)

64-71: Adequate security justification for weights_only=False.

The inline comment correctly explains why weights_only=False is required (ModelOpt state contains metadata, nested dicts, dtypes that PyTorch's weights_only=True rejects) and appropriately warns users to only load trusted paths. This satisfies the coding guideline requirement for documented justification.

However, consider adding a note that this file format is internally-generated by export_hf_vllm_fq_checkpoint to strengthen the justification.

💡 Optional: Strengthen the security comment
         # map_location="cpu": load tensors on CPU so device ids in the file need not match this worker.
         # weights_only=False: ``vllm_fq_modelopt_state.pth`` is a full ModelOpt pickle (metadata,
         # nested dicts, dtypes, etc.); PyTorch's ``weights_only=True`` rejects that and only
         # allows tensor-only checkpoints. Loading arbitrary pickles can execute stored code—use
-        # paths you trust (your own exports or verified checkpoints).
+        # paths you trust. This file format is internally-generated by export_hf_vllm_fq_checkpoint;
+        # only load checkpoints you created or from trusted sources.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/vllm_serve/fakequant_worker.py` around lines 64 - 71, The comment
justifying weights_only=False is fine but should explicitly state that the
ModelOpt pickle format is produced by our exporter; update the inline comment
near the torch.load call that reads modelopt_state =
torch.load(quant_config["modelopt_state_path"], weights_only=False,
map_location="cpu") to mention that the checkpoint file is the internal format
created by export_hf_vllm_fq_checkpoint, so only trusted exporter outputs should
be loaded; keep the existing security warning about arbitrary pickle execution
and retain the explanation why weights_only=False is required.
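The loading pattern discussed above can be sketched as follows. This is a minimal illustration, not the worker's actual code: the helper name `load_modelopt_state` is hypothetical, and only the `torch.load` call with its security-justification comment mirrors what the review describes.

```python
import torch


def load_modelopt_state(path: str) -> dict:
    """Load a full ModelOpt state pickle from a trusted path (hypothetical helper)."""
    # map_location="cpu": load tensors on CPU so device ids stored in the file
    # need not match this worker's devices.
    # weights_only=False: the state is a full pickle (metadata, nested dicts,
    # dtypes) that PyTorch's weights_only=True safe loader rejects. Arbitrary
    # pickles can execute stored code, so only load paths you trust, e.g. your
    # own exporter outputs.
    return torch.load(path, weights_only=False, map_location="cpu")
```

The trade-off is the usual one: `weights_only=True` restricts deserialization to tensors and a small allowlist of types, while `weights_only=False` accepts the richer object graph ModelOpt needs at the cost of trusting the file's origin.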

136-136: Consider documenting why _dummy_run is needed here.

The _dummy_run(1) call was added before load_state_dict_from_path, but there's no comment explaining why this is necessary. If this initializes lazy modules or allocates buffers needed for state dict loading, a brief comment would help maintainability.

📝 Optional: Add explanatory comment
         if quantizer_file_path:
+            # Ensure model buffers/lazy modules are initialized before loading state dict
             self.model_runner._dummy_run(1)
             current_state_dict = load_state_dict_from_path(self, quantizer_file_path, model)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/vllm_serve/fakequant_worker.py` at line 136, Add a brief inline
comment above the self.model_runner._dummy_run(1) call explaining why the dummy
run is required (e.g., it triggers lazy module initialization or allocates
buffers so load_state_dict_from_path can succeed), referencing the involved
symbols (_dummy_run, load_state_dict_from_path, model_runner) so future
maintainers understand the dependency between the dummy run and subsequent state
dict loading.
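The rationale behind the suggested comment can be illustrated with plain PyTorch. This sketch does not use vLLM's `_dummy_run`; it shows the analogous situation with a lazy module, where parameters have no concrete shape until one forward pass materializes them, so a dummy run must precede any state-dict work.

```python
import torch
import torch.nn as nn

# A model containing a lazy module has uninitialized parameters until the
# first forward pass, so state-dict loading beforehand would see no shapes.
model = nn.Sequential(nn.LazyLinear(4), nn.ReLU(), nn.Linear(4, 2))

# "Dummy run": one forward pass with a placeholder batch materializes the
# lazy weights (analogous in spirit to model_runner._dummy_run(1)).
model(torch.zeros(1, 8))

# Now the state dict has concrete shapes and can be saved/loaded normally.
state = model.state_dict()
model.load_state_dict(state)
```

Whether vLLM's `_dummy_run` exists for lazy initialization, buffer allocation, or both is exactly what the review asks the author to document inline.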
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py`:
- Around line 56-70: In _collect_expert_pre_quant_scales replace the private
attribute check getattr(iq, "_disabled", False) with the public API by testing
iq.is_enabled (negated) so the condition becomes: if iq is None or not
iq.is_enabled or iq.pre_quant_scale is None: return None; update the check in
that function to use the public is_enabled property of the input_quantizer for
consistency with other code (e.g., other uses of is_enabled in the file).

---

Nitpick comments:
In `@examples/vllm_serve/fakequant_worker.py`:
- Around line 64-71: The comment justifying weights_only=False is fine but
should explicitly state that the ModelOpt pickle format is produced by our
exporter; update the inline comment near the torch.load call that reads
modelopt_state = torch.load(quant_config["modelopt_state_path"],
weights_only=False, map_location="cpu") to mention that the checkpoint file is
the internal format created by export_hf_vllm_fq_checkpoint, so only trusted
exporter outputs should be loaded; keep the existing security warning about
arbitrary pickle execution and retain the explanation why weights_only=False is
required.
- Line 136: Add a brief inline comment above the self.model_runner._dummy_run(1)
call explaining why the dummy run is required (e.g., it triggers lazy module
initialization or allocates buffers so load_state_dict_from_path can succeed),
referencing the involved symbols (_dummy_run, load_state_dict_from_path,
model_runner) so future maintainers understand the dependency between the dummy
run and subsequent state dict loading.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 74cd1761-c237-4334-a1d7-e995964be2e1

📥 Commits

Reviewing files that changed from the base of the PR and between 687ceea and b83afd8.

📒 Files selected for processing (5)
  • examples/vllm_serve/README.md
  • examples/vllm_serve/fakequant_worker.py
  • examples/vllm_serve/vllm_reload_utils.py
  • modelopt/torch/export/layer_utils.py
  • modelopt/torch/export/plugins/vllm_fakequant_hf.py
💤 Files with no reviewable changes (1)
  • examples/vllm_serve/README.md

Comment on lines +56 to +70
def _collect_expert_pre_quant_scales(
    experts: list[nn.Module],
) -> list[torch.Tensor] | None:
    """Return per-expert ``pre_quant_scale`` tensors if every expert can be averaged; else None.

    Skips groups where any expert has no input quantizer, no pqs (e.g. weight-only AWQ INT4),
    or a disabled input quantizer (pqs already folded / not used).
    """
    pqs_list: list[torch.Tensor] = []
    for ex in experts:
        iq = getattr(ex, "input_quantizer", None)
        if iq is None or getattr(iq, "_disabled", False) or iq.pre_quant_scale is None:
            return None
        pqs_list.append(iq.pre_quant_scale)
    return pqs_list
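To make the helper's behavior concrete, here is a self-contained sketch that exercises the same logic with stand-in classes. `_FakeQuantizer` and `_FakeExpert` are assumptions for the demo (not ModelOpt's real `TensorQuantizer`), and the final averaging step reflects the PR summary's expert-resmoothing idea rather than the exporter's exact code.

```python
import torch
import torch.nn as nn


class _FakeQuantizer:
    # Minimal stand-in for ModelOpt's TensorQuantizer (assumption for the demo).
    def __init__(self, pqs):
        self.pre_quant_scale = pqs


class _FakeExpert(nn.Module):
    def __init__(self, pqs):
        super().__init__()
        self.input_quantizer = _FakeQuantizer(pqs)


def _collect_expert_pre_quant_scales(experts):
    """Same logic as the function under review."""
    pqs_list = []
    for ex in experts:
        iq = getattr(ex, "input_quantizer", None)
        if iq is None or getattr(iq, "_disabled", False) or iq.pre_quant_scale is None:
            return None
        pqs_list.append(iq.pre_quant_scale)
    return pqs_list


experts = [_FakeExpert(torch.tensor([1.0, 3.0])), _FakeExpert(torch.tensor([3.0, 1.0]))]
scales = _collect_expert_pre_quant_scales(experts)
# Resmoothing would typically share one averaged scale across experts (assumption).
avg = torch.stack(scales).mean(dim=0)

# If any expert lacks a scale, the helper bails out with None for the whole group.
assert _collect_expert_pre_quant_scales([_FakeExpert(None)]) is None
```

The all-or-nothing return is the key design choice: averaging is only meaningful when every expert in the group carries a `pre_quant_scale`, so a single missing or disabled quantizer disqualifies the group.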
Contributor


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for how _disabled vs is_enabled is used in TensorQuantizer
rg -nC3 '_disabled|is_enabled' modelopt/torch/quantization/nn/modules/tensor_quantizer.py | head -60

Repository: NVIDIA/Model-Optimizer

Length of output: 2384


🏁 Script executed:

#!/bin/bash
# Check line 138 and surrounding context in vllm_fakequant_hf.py
sed -n '130,150p' modelopt/torch/export/plugins/vllm_fakequant_hf.py

Repository: NVIDIA/Model-Optimizer

Length of output: 988


🏁 Script executed:

#!/bin/bash
# Check how _disabled and is_enabled are used throughout vllm_fakequant_hf.py
rg -n '_disabled|is_enabled' modelopt/torch/export/plugins/vllm_fakequant_hf.py

Repository: NVIDIA/Model-Optimizer

Length of output: 568


🏁 Script executed:

#!/bin/bash
# Check if there are other files that use getattr(iq, "_disabled", False)
rg -n 'getattr.*_disabled' modelopt/torch --type py

Repository: NVIDIA/Model-Optimizer

Length of output: 207


Use the public is_enabled property instead of the private _disabled attribute for consistency.

Line 67 checks getattr(iq, "_disabled", False), but the file already uses the public is_enabled property elsewhere (lines 138, 201, 252). Replace with not iq.is_enabled to match the established pattern:

if iq is None or not iq.is_enabled or iq.pre_quant_scale is None:
    return None

The check is functionally equivalent—is_enabled wraps _disabled in TensorQuantizer (line 486: return not self._disabled)—but using the public API improves consistency and avoids relying on internal implementation details.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py` around lines 56 - 70, In
_collect_expert_pre_quant_scales replace the private attribute check getattr(iq,
"_disabled", False) with the public API by testing iq.is_enabled (negated) so
the condition becomes: if iq is None or not iq.is_enabled or iq.pre_quant_scale
is None: return None; update the check in that function to use the public
is_enabled property of the input_quantizer for consistency with other code
(e.g., other uses of is_enabled in the file).
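The public-vs-private point can be shown with a toy class. This is not ModelOpt's real `TensorQuantizer`; it only models the relationship the review cites, where `is_enabled` is a property wrapping the private `_disabled` flag.

```python
class TensorQuantizerSketch:
    """Toy model of the cited relationship (not ModelOpt's real class)."""

    def __init__(self):
        self._disabled = False
        self.pre_quant_scale = None

    @property
    def is_enabled(self) -> bool:
        # Mirrors the cited implementation: ``return not self._disabled``.
        return not self._disabled

    def disable(self):
        self._disabled = True


iq = TensorQuantizerSketch()
# The two checks are functionally equivalent, but only the property is public API:
before = (not getattr(iq, "_disabled", False), iq.is_enabled)
iq.disable()
after = (not getattr(iq, "_disabled", False), iq.is_enabled)
```

Because the property and the private flag always agree, the suggested change is purely about encapsulation: if the internals ever change (say, `_disabled` is renamed), code reading the public property keeps working while `getattr(iq, "_disabled", False)` silently returns the wrong default.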

