Kinjal/vllm fix prequant scale #1242
kinjalpatel27 wants to merge 4 commits into kinjal/vllm_super_nano_support from
Conversation
Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
**Important:** Review skipped. Draft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configuration.

Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Actionable comments posted: 1
🧹 Nitpick comments (2)
examples/vllm_serve/fakequant_worker.py (2)
**64-71: Adequate security justification for `weights_only=False`.**

The inline comment correctly explains why `weights_only=False` is required (ModelOpt state contains metadata, nested dicts, and dtypes that PyTorch's `weights_only=True` rejects) and appropriately warns users to only load trusted paths. This satisfies the coding-guideline requirement for documented justification. However, consider adding a note that this file format is internally generated by `export_hf_vllm_fq_checkpoint` to strengthen the justification.

💡 Optional: Strengthen the security comment

```diff
 # map_location="cpu": load tensors on CPU so device ids in the file need not match this worker.
 # weights_only=False: ``vllm_fq_modelopt_state.pth`` is a full ModelOpt pickle (metadata,
 # nested dicts, dtypes, etc.); PyTorch's ``weights_only=True`` rejects that and only
 # allows tensor-only checkpoints. Loading arbitrary pickles can execute stored code—use
-# paths you trust (your own exports or verified checkpoints).
+# paths you trust. This file format is internally-generated by export_hf_vllm_fq_checkpoint;
+# only load checkpoints you created or from trusted sources.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/vllm_serve/fakequant_worker.py` around lines 64 - 71, The comment justifying weights_only=False is fine but should explicitly state that the ModelOpt pickle format is produced by our exporter; update the inline comment near the torch.load call that reads modelopt_state = torch.load(quant_config["modelopt_state_path"], weights_only=False, map_location="cpu") to mention that the checkpoint file is the internal format created by export_hf_vllm_fq_checkpoint, so only trusted exporter outputs should be loaded; keep the existing security warning about arbitrary pickle execution and retain the explanation why weights_only=False is required.
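As a plain-Python illustration of why this checkpoint needs full unpickling, consider a sketch with hypothetical contents (the real file is written by ModelOpt, not this snippet, and `pickle` stands in for `torch.save`/`torch.load`): the state is a nested dict of metadata rather than a flat tensor-only mapping, which is exactly the shape of payload that `weights_only=True` loading rejects.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the ModelOpt state dict: nested dicts with
# metadata, not a flat tensor-only mapping -- the kind of payload that
# torch.load(..., weights_only=True) would refuse to load.
modelopt_state = {
    "quantizer_state": {"layer0.input_quantizer": {"amax": 0.5, "enabled": True}},
    "metadata": {"version": "0.27", "algorithm": "awq"},
}

path = os.path.join(tempfile.mkdtemp(), "vllm_fq_modelopt_state.pth")
with open(path, "wb") as f:
    pickle.dump(modelopt_state, f)

# This file format would be internally generated by the exporter; only load
# checkpoints you created or from trusted sources, since unpickling arbitrary
# files can execute stored code.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored["metadata"]["algorithm"])  # → awq
```

The same trust boundary applies to the real `torch.load(..., weights_only=False)` call: the round trip only stays safe because the producer of the file is known.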
**136: Consider documenting why `_dummy_run` is needed here.**

The `_dummy_run(1)` call was added before `load_state_dict_from_path`, but there's no comment explaining why this is necessary. If this initializes lazy modules or allocates buffers needed for state dict loading, a brief comment would help maintainability.

📝 Optional: Add explanatory comment

```diff
 if quantizer_file_path:
+    # Ensure model buffers/lazy modules are initialized before loading state dict
     self.model_runner._dummy_run(1)
     current_state_dict = load_state_dict_from_path(self, quantizer_file_path, model)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/vllm_serve/fakequant_worker.py` at line 136, Add a brief inline comment above the self.model_runner._dummy_run(1) call explaining why the dummy run is required (e.g., it triggers lazy module initialization or allocates buffers so load_state_dict_from_path can succeed), referencing the involved symbols (_dummy_run, load_state_dict_from_path, model_runner) so future maintainers understand the dependency between the dummy run and subsequent state dict loading.
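The dependency the requested comment would document can be sketched in pure Python (no vLLM; `LazyRunner` and its methods are hypothetical stand-ins): buffers only exist after a dummy run, so loading state into a fresh instance fails.

```python
# Toy illustration of lazy initialization: the runner below only allocates
# its buffers on the first (dummy) run, so load_state_dict on a fresh
# instance has nothing to load into.
class LazyRunner:
    def __init__(self):
        self.state = None  # buffers not allocated yet

    def _dummy_run(self, batch_size):
        # Allocates buffers, mirroring lazy module initialization.
        if self.state is None:
            self.state = {"scale": [1.0] * batch_size}

    def load_state_dict(self, sd):
        if self.state is None:
            raise RuntimeError("buffers not initialized; call _dummy_run first")
        self.state.update(sd)


runner = LazyRunner()
runner._dummy_run(1)  # initialize buffers/lazy modules before loading
runner.load_state_dict({"scale": [0.5]})
print(runner.state["scale"])  # → [0.5]
```

Skipping the `_dummy_run` call here raises the `RuntimeError`, which is the failure mode a one-line comment in the worker would explain.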
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py`:
- Around line 56-70: In _collect_expert_pre_quant_scales replace the private
attribute check getattr(iq, "_disabled", False) with the public API by testing
iq.is_enabled (negated) so the condition becomes: if iq is None or not
iq.is_enabled or iq.pre_quant_scale is None: return None; update the check in
that function to use the public is_enabled property of the input_quantizer for
consistency with other code (e.g., other uses of is_enabled in the file).
---
Nitpick comments:
In `@examples/vllm_serve/fakequant_worker.py`:
- Around line 64-71: The comment justifying weights_only=False is fine but
should explicitly state that the ModelOpt pickle format is produced by our
exporter; update the inline comment near the torch.load call that reads
modelopt_state = torch.load(quant_config["modelopt_state_path"],
weights_only=False, map_location="cpu") to mention that the checkpoint file is
the internal format created by export_hf_vllm_fq_checkpoint, so only trusted
exporter outputs should be loaded; keep the existing security warning about
arbitrary pickle execution and retain the explanation why weights_only=False is
required.
- Line 136: Add a brief inline comment above the self.model_runner._dummy_run(1)
call explaining why the dummy run is required (e.g., it triggers lazy module
initialization or allocates buffers so load_state_dict_from_path can succeed),
referencing the involved symbols (_dummy_run, load_state_dict_from_path,
model_runner) so future maintainers understand the dependency between the dummy
run and subsequent state dict loading.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 74cd1761-c237-4334-a1d7-e995964be2e1
📒 Files selected for processing (5)
- examples/vllm_serve/README.md
- examples/vllm_serve/fakequant_worker.py
- examples/vllm_serve/vllm_reload_utils.py
- modelopt/torch/export/layer_utils.py
- modelopt/torch/export/plugins/vllm_fakequant_hf.py
💤 Files with no reviewable changes (1)
- examples/vllm_serve/README.md
```python
def _collect_expert_pre_quant_scales(
    experts: list[nn.Module],
) -> list[torch.Tensor] | None:
    """Return per-expert ``pre_quant_scale`` tensors if every expert can be averaged; else None.

    Skips groups where any expert has no input quantizer, no pqs (e.g. weight-only AWQ INT4),
    or a disabled input quantizer (pqs already folded / not used).
    """
    pqs_list: list[torch.Tensor] = []
    for ex in experts:
        iq = getattr(ex, "input_quantizer", None)
        if iq is None or getattr(iq, "_disabled", False) or iq.pre_quant_scale is None:
            return None
        pqs_list.append(iq.pre_quant_scale)
    return pqs_list
```
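The all-or-nothing collection rule above can be exercised with lightweight stand-ins (no torch needed; `FakeQuantizer` and `FakeExpert` are hypothetical test doubles, and lists of floats stand in for tensors):

```python
# Minimal stand-ins for the quantizer/expert objects the function inspects.
class FakeQuantizer:
    def __init__(self, pre_quant_scale, disabled=False):
        self.pre_quant_scale = pre_quant_scale
        self._disabled = disabled


class FakeExpert:
    def __init__(self, quantizer):
        self.input_quantizer = quantizer


def collect_expert_pre_quant_scales(experts):
    # Same all-or-nothing rule as _collect_expert_pre_quant_scales above:
    # any missing/disabled quantizer or missing pqs disqualifies the group.
    pqs_list = []
    for ex in experts:
        iq = getattr(ex, "input_quantizer", None)
        if iq is None or getattr(iq, "_disabled", False) or iq.pre_quant_scale is None:
            return None
        pqs_list.append(iq.pre_quant_scale)
    return pqs_list


experts = [FakeExpert(FakeQuantizer([0.5, 1.0])), FakeExpert(FakeQuantizer([1.5, 2.0]))]
print(collect_expert_pre_quant_scales(experts))  # → [[0.5, 1.0], [1.5, 2.0]]

# One disabled quantizer makes the whole expert group unusable:
experts.append(FakeExpert(FakeQuantizer([1.0, 1.0], disabled=True)))
print(collect_expert_pre_quant_scales(experts))  # → None
```

The second call returning `None` is the "skip the group" behavior the docstring describes for already-folded or weight-only-quantized experts.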
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
# Search for how _disabled vs is_enabled is used in TensorQuantizer
rg -nC3 '_disabled|is_enabled' modelopt/torch/quantization/nn/modules/tensor_quantizer.py | head -60
```

Repository: NVIDIA/Model-Optimizer
Length of output: 2384

🏁 Script executed:

```shell
#!/bin/bash
# Check line 138 and surrounding context in vllm_fakequant_hf.py
sed -n '130,150p' modelopt/torch/export/plugins/vllm_fakequant_hf.py
```

Repository: NVIDIA/Model-Optimizer
Length of output: 988

🏁 Script executed:

```shell
#!/bin/bash
# Check how _disabled and is_enabled are used throughout vllm_fakequant_hf.py
rg -n '_disabled|is_enabled' modelopt/torch/export/plugins/vllm_fakequant_hf.py
```

Repository: NVIDIA/Model-Optimizer
Length of output: 568

🏁 Script executed:

```shell
#!/bin/bash
# Check if there are other files that use getattr(iq, "_disabled", False)
rg -n 'getattr.*_disabled' modelopt/torch --type py
```

Repository: NVIDIA/Model-Optimizer
Length of output: 207
**Use the public `is_enabled` property instead of the private `_disabled` attribute for consistency.**

Line 67 checks `getattr(iq, "_disabled", False)`, but the file already uses the public `is_enabled` property elsewhere (lines 138, 201, 252). Replace with `not iq.is_enabled` to match the established pattern:

```python
if iq is None or not iq.is_enabled or iq.pre_quant_scale is None:
    return None
```

The check is functionally equivalent—`is_enabled` wraps `_disabled` in `TensorQuantizer` (line 486: `return not self._disabled`)—but using the public API improves consistency and avoids relying on internal implementation details.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py` around lines 56 - 70, In
_collect_expert_pre_quant_scales replace the private attribute check getattr(iq,
"_disabled", False) with the public API by testing iq.is_enabled (negated) so
the condition becomes: if iq is None or not iq.is_enabled or iq.pre_quant_scale
is None: return None; update the check in that function to use the public
is_enabled property of the input_quantizer for consistency with other code
(e.g., other uses of is_enabled in the file).
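The equivalence claimed in the review comment can be checked with a minimal mock (a hypothetical `Quantizer` class mirroring the described `TensorQuantizer` behavior, where `is_enabled` simply negates the private `_disabled` flag):

```python
# Toy quantizer showing how a public is_enabled property can wrap the
# private _disabled flag, as the review comment describes for TensorQuantizer.
class Quantizer:
    def __init__(self):
        self._disabled = False
        self.pre_quant_scale = [0.5]

    @property
    def is_enabled(self):
        return not self._disabled

    def disable(self):
        self._disabled = True


iq = Quantizer()
assert iq.is_enabled  # enabled by default

iq.disable()
# Public-API check, equivalent to getattr(iq, "_disabled", False):
print(not iq.is_enabled)  # → True
```

Because the two checks are equivalent by construction, the suggested change is purely about depending on the public surface rather than an internal attribute.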
What does this PR do?
Type of change: ?
Usage
# Add a code snippet demonstrating how to use this

Testing
Before your PR is "Ready for review"
- Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- CONTRIBUTING.md: ✅ / ❌ / N/A

Additional Information
Summary by CodeRabbit
Documentation
New Features
Refactor