
Kinjal/vllm fix prequant scale#1242

Draft
kinjalpatel27 wants to merge 4 commits into kinjal/vllm_super_nano_support from kinjal/vllm_fix_prequant_scale

Conversation

@kinjalpatel27
Contributor

@kinjalpatel27 kinjalpatel27 commented Apr 13, 2026

What does this PR do?

Type of change: ?

Usage

# Add a code snippet demonstrating how to use this

Testing

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

  • Documentation

    • Removed AWQ reload limitation from known issues.
  • New Features

    • Added support for NemotronHf and NemotronHMOE model architectures.
    • Enhanced MoE/AWQ export with expert weight resmoothing and requantization.
  • Refactor

    • Improved checkpoint deserialization and quantizer state validation.
    • Optimized tensor-parallel sharding operations for quantized models.

Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Apr 13, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Apr 13, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 29cff1ec-6a38-4059-a59c-7af6ad70df3a

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • ✅ Review completed - (🔄 Check again to review again)
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch kinjal/vllm_fix_prequant_scale

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
examples/vllm_serve/fakequant_worker.py (2)

64-71: Adequate security justification for weights_only=False.

The inline comment correctly explains why weights_only=False is required (ModelOpt state contains metadata, nested dicts, dtypes that PyTorch's weights_only=True rejects) and appropriately warns users to only load trusted paths. This satisfies the coding guideline requirement for documented justification.

However, consider adding a note that this file format is internally-generated by export_hf_vllm_fq_checkpoint to strengthen the justification.

💡 Optional: Strengthen the security comment
         # map_location="cpu": load tensors on CPU so device ids in the file need not match this worker.
         # weights_only=False: ``vllm_fq_modelopt_state.pth`` is a full ModelOpt pickle (metadata,
         # nested dicts, dtypes, etc.); PyTorch's ``weights_only=True`` rejects that and only
         # allows tensor-only checkpoints. Loading arbitrary pickles can execute stored code—use
-        # paths you trust (your own exports or verified checkpoints).
+        # paths you trust. This file format is internally-generated by export_hf_vllm_fq_checkpoint;
+        # only load checkpoints you created or from trusted sources.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/vllm_serve/fakequant_worker.py` around lines 64 - 71, The comment
justifying weights_only=False is fine but should explicitly state that the
ModelOpt pickle format is produced by our exporter; update the inline comment
near the torch.load call that reads modelopt_state =
torch.load(quant_config["modelopt_state_path"], weights_only=False,
map_location="cpu") to mention that the checkpoint file is the internal format
created by export_hf_vllm_fq_checkpoint, so only trusted exporter outputs should
be loaded; keep the existing security warning about arbitrary pickle execution
and retain the explanation why weights_only=False is required.
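The loading pattern discussed above can be sketched as follows. This is a minimal illustration, not the worker's actual code: the helper name `load_modelopt_state` is hypothetical, and only the `torch.load` call with its security-justification comment mirrors what the review describes.

```python
import torch


def load_modelopt_state(path: str) -> dict:
    """Load a full ModelOpt state pickle from a trusted path (hypothetical helper)."""
    # map_location="cpu": load tensors on CPU so device ids stored in the file
    # need not match this worker's devices.
    # weights_only=False: the state is a full pickle (metadata, nested dicts,
    # dtypes) that PyTorch's weights_only=True safe loader rejects. Arbitrary
    # pickles can execute stored code, so only load paths you trust, e.g. your
    # own exporter outputs.
    return torch.load(path, weights_only=False, map_location="cpu")
```

The trade-off is the usual one: `weights_only=True` restricts deserialization to tensors and a small allowlist of types, while `weights_only=False` accepts the richer object graph ModelOpt needs at the cost of trusting the file's origin.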

136-136: Consider documenting why _dummy_run is needed here.

The _dummy_run(1) call was added before load_state_dict_from_path, but there's no comment explaining why this is necessary. If this initializes lazy modules or allocates buffers needed for state dict loading, a brief comment would help maintainability.

📝 Optional: Add explanatory comment
         if quantizer_file_path:
+            # Ensure model buffers/lazy modules are initialized before loading state dict
             self.model_runner._dummy_run(1)
             current_state_dict = load_state_dict_from_path(self, quantizer_file_path, model)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/vllm_serve/fakequant_worker.py` at line 136, Add a brief inline
comment above the self.model_runner._dummy_run(1) call explaining why the dummy
run is required (e.g., it triggers lazy module initialization or allocates
buffers so load_state_dict_from_path can succeed), referencing the involved
symbols (_dummy_run, load_state_dict_from_path, model_runner) so future
maintainers understand the dependency between the dummy run and subsequent state
dict loading.
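The rationale behind the suggested comment can be illustrated with plain PyTorch. This sketch does not use vLLM's `_dummy_run`; it shows the analogous situation with a lazy module, where parameters have no concrete shape until one forward pass materializes them, so a dummy run must precede any state-dict work.

```python
import torch
import torch.nn as nn

# A model containing a lazy module has uninitialized parameters until the
# first forward pass, so state-dict loading beforehand would see no shapes.
model = nn.Sequential(nn.LazyLinear(4), nn.ReLU(), nn.Linear(4, 2))

# "Dummy run": one forward pass with a placeholder batch materializes the
# lazy weights (analogous in spirit to model_runner._dummy_run(1)).
model(torch.zeros(1, 8))

# Now the state dict has concrete shapes and can be saved/loaded normally.
state = model.state_dict()
model.load_state_dict(state)
```

Whether vLLM's `_dummy_run` exists for lazy initialization, buffer allocation, or both is exactly what the review asks the author to document inline.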
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py`:
- Around line 56-70: In _collect_expert_pre_quant_scales replace the private
attribute check getattr(iq, "_disabled", False) with the public API by testing
iq.is_enabled (negated) so the condition becomes: if iq is None or not
iq.is_enabled or iq.pre_quant_scale is None: return None; update the check in
that function to use the public is_enabled property of the input_quantizer for
consistency with other code (e.g., other uses of is_enabled in the file).

---

Nitpick comments:
In `@examples/vllm_serve/fakequant_worker.py`:
- Around line 64-71: The comment justifying weights_only=False is fine but
should explicitly state that the ModelOpt pickle format is produced by our
exporter; update the inline comment near the torch.load call that reads
modelopt_state = torch.load(quant_config["modelopt_state_path"],
weights_only=False, map_location="cpu") to mention that the checkpoint file is
the internal format created by export_hf_vllm_fq_checkpoint, so only trusted
exporter outputs should be loaded; keep the existing security warning about
arbitrary pickle execution and retain the explanation why weights_only=False is
required.
- Line 136: Add a brief inline comment above the self.model_runner._dummy_run(1)
call explaining why the dummy run is required (e.g., it triggers lazy module
initialization or allocates buffers so load_state_dict_from_path can succeed),
referencing the involved symbols (_dummy_run, load_state_dict_from_path,
model_runner) so future maintainers understand the dependency between the dummy
run and subsequent state dict loading.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 74cd1761-c237-4334-a1d7-e995964be2e1

📥 Commits

Reviewing files that changed from the base of the PR and between 687ceea and b83afd8.

📒 Files selected for processing (5)
  • examples/vllm_serve/README.md
  • examples/vllm_serve/fakequant_worker.py
  • examples/vllm_serve/vllm_reload_utils.py
  • modelopt/torch/export/layer_utils.py
  • modelopt/torch/export/plugins/vllm_fakequant_hf.py
💤 Files with no reviewable changes (1)
  • examples/vllm_serve/README.md

Comment on lines +56 to +70
def _collect_expert_pre_quant_scales(
    experts: list[nn.Module],
) -> list[torch.Tensor] | None:
    """Return per-expert ``pre_quant_scale`` tensors if every expert can be averaged; else None.

    Skips groups where any expert has no input quantizer, no pqs (e.g. weight-only AWQ INT4),
    or a disabled input quantizer (pqs already folded / not used).
    """
    pqs_list: list[torch.Tensor] = []
    for ex in experts:
        iq = getattr(ex, "input_quantizer", None)
        if iq is None or getattr(iq, "_disabled", False) or iq.pre_quant_scale is None:
            return None
        pqs_list.append(iq.pre_quant_scale)
    return pqs_list
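To make the helper's behavior concrete, here is a self-contained sketch that exercises the same logic with stand-in classes. `_FakeQuantizer` and `_FakeExpert` are assumptions for the demo (not ModelOpt's real `TensorQuantizer`), and the final averaging step reflects the PR summary's expert-resmoothing idea rather than the exporter's exact code.

```python
import torch
import torch.nn as nn


class _FakeQuantizer:
    # Minimal stand-in for ModelOpt's TensorQuantizer (assumption for the demo).
    def __init__(self, pqs):
        self.pre_quant_scale = pqs


class _FakeExpert(nn.Module):
    def __init__(self, pqs):
        super().__init__()
        self.input_quantizer = _FakeQuantizer(pqs)


def _collect_expert_pre_quant_scales(experts):
    """Same logic as the function under review."""
    pqs_list = []
    for ex in experts:
        iq = getattr(ex, "input_quantizer", None)
        if iq is None or getattr(iq, "_disabled", False) or iq.pre_quant_scale is None:
            return None
        pqs_list.append(iq.pre_quant_scale)
    return pqs_list


experts = [_FakeExpert(torch.tensor([1.0, 3.0])), _FakeExpert(torch.tensor([3.0, 1.0]))]
scales = _collect_expert_pre_quant_scales(experts)
# Resmoothing would typically share one averaged scale across experts (assumption).
avg = torch.stack(scales).mean(dim=0)

# If any expert lacks a scale, the helper bails out with None for the whole group.
assert _collect_expert_pre_quant_scales([_FakeExpert(None)]) is None
```

The all-or-nothing return is the key design choice: averaging is only meaningful when every expert in the group carries a `pre_quant_scale`, so a single missing or disabled quantizer disqualifies the group.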
Contributor


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for how _disabled vs is_enabled is used in TensorQuantizer
rg -nC3 '_disabled|is_enabled' modelopt/torch/quantization/nn/modules/tensor_quantizer.py | head -60

Repository: NVIDIA/Model-Optimizer

Length of output: 2384


🏁 Script executed:

#!/bin/bash
# Check line 138 and surrounding context in vllm_fakequant_hf.py
sed -n '130,150p' modelopt/torch/export/plugins/vllm_fakequant_hf.py

Repository: NVIDIA/Model-Optimizer

Length of output: 988


🏁 Script executed:

#!/bin/bash
# Check how _disabled and is_enabled are used throughout vllm_fakequant_hf.py
rg -n '_disabled|is_enabled' modelopt/torch/export/plugins/vllm_fakequant_hf.py

Repository: NVIDIA/Model-Optimizer

Length of output: 568


🏁 Script executed:

#!/bin/bash
# Check if there are other files that use getattr(iq, "_disabled", False)
rg -n 'getattr.*_disabled' modelopt/torch --type py

Repository: NVIDIA/Model-Optimizer

Length of output: 207


Use the public is_enabled property instead of the private _disabled attribute for consistency.

Line 67 checks getattr(iq, "_disabled", False), but the file already uses the public is_enabled property elsewhere (lines 138, 201, 252). Replace with not iq.is_enabled to match the established pattern:

if iq is None or not iq.is_enabled or iq.pre_quant_scale is None:
    return None

The check is functionally equivalent—is_enabled wraps _disabled in TensorQuantizer (line 486: return not self._disabled)—but using the public API improves consistency and avoids relying on internal implementation details.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py` around lines 56 - 70, In
_collect_expert_pre_quant_scales replace the private attribute check getattr(iq,
"_disabled", False) with the public API by testing iq.is_enabled (negated) so
the condition becomes: if iq is None or not iq.is_enabled or iq.pre_quant_scale
is None: return None; update the check in that function to use the public
is_enabled property of the input_quantizer for consistency with other code
(e.g., other uses of is_enabled in the file).
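The public-vs-private point can be shown with a toy class. This is not ModelOpt's real `TensorQuantizer`; it only models the relationship the review cites, where `is_enabled` is a property wrapping the private `_disabled` flag.

```python
class TensorQuantizerSketch:
    """Toy model of the cited relationship (not ModelOpt's real class)."""

    def __init__(self):
        self._disabled = False
        self.pre_quant_scale = None

    @property
    def is_enabled(self) -> bool:
        # Mirrors the cited implementation: ``return not self._disabled``.
        return not self._disabled

    def disable(self):
        self._disabled = True


iq = TensorQuantizerSketch()
# The two checks are functionally equivalent, but only the property is public API:
before = (not getattr(iq, "_disabled", False), iq.is_enabled)
iq.disable()
after = (not getattr(iq, "_disabled", False), iq.is_enabled)
```

Because the property and the private flag always agree, the suggested change is purely about encapsulation: if the internals ever change (say, `_disabled` is renamed), code reading the public property keeps working while `getattr(iq, "_disabled", False)` silently returns the wrong default.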

