
vLLM fakequant: add recipe-based quantization support#1233

Open
kinjalpatel27 wants to merge 2 commits into main from kinjal/vllm_fq_recipe

Conversation

@kinjalpatel27
Contributor

@kinjalpatel27 kinjalpatel27 commented Apr 10, 2026

What does this PR do?

Type of change: example update

This PR adds recipe-based quantization support to the vLLM fakequant example.

Testing

docker run --gpus all -it --shm-size=160GB --network host --rm --entrypoint bash \
  -v <modelopt>:/home/modelopt vllm/vllm-openai:v0.15.0 \
  -c "cd /home/modelopt && pip install . && pip install datasets && \
    RECIPE_PATH=/home/modelopt/modelopt_recipes/general/ptq/nvfp4_mlp_only-fp8_kv.yml \
    python3 /home/modelopt/examples/vllm_serve/vllm_serve_fakequant.py Qwen/Qwen3-0.6B \
    -tp 1 --served-model-name Qwen3-0.6B --host 0.0.0.0 --port 8001 \
    --trust-remote-code --disable-custom-all-reduce --gpu-memory-utilization 0.8"
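For context, here is a minimal sketch (not the actual implementation) of the recipe-driven flow this PR enables: the RECIPE_PATH environment variable points at a ModelOpt PTQ recipe whose quantize field supplies the quantization config. The names load_recipe and ModelOptPTQRecipe mirror the diff discussed in this thread, but the loader body and the config values below are illustrative stubs.

```python
# Hypothetical sketch of recipe-driven quantization config selection.
# `load_recipe` and `ModelOptPTQRecipe` follow the names in this PR's
# diff; the stub loader and config values are illustrative only.
import os
from dataclasses import dataclass, field


@dataclass
class ModelOptPTQRecipe:
    """Stand-in for the real recipe class; only `quantize` is modeled."""
    quantize: dict = field(default_factory=dict)


def load_recipe(path: str) -> ModelOptPTQRecipe:
    # The real loader parses the recipe YAML file at `path`; stubbed here
    # with values suggested by the recipe filename in the test command.
    return ModelOptPTQRecipe(quantize={"algorithm": "nvfp4", "kv_cache": "fp8"})


def resolve_quant_cfg(default_cfg: dict) -> dict:
    """Prefer a recipe's quantize config when RECIPE_PATH is set."""
    recipe_path = os.environ.get("RECIPE_PATH")
    if not recipe_path:
        return default_cfg
    recipe = load_recipe(recipe_path)
    # Explicit validation rather than assert, per the review below.
    if not isinstance(recipe, ModelOptPTQRecipe):
        raise ValueError(
            f"Expected PTQ recipe, but got {type(recipe).__name__} from {recipe_path}"
        )
    return recipe.quantize
```

When RECIPE_PATH is unset, the default config is used unchanged; when set, the recipe's quantize config takes over.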

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: N/A
  • Did you update Changelog?: N/A

Additional Information

Summary by CodeRabbit

  • New Features

    • Added RECIPE_PATH environment variable support enabling users to specify ModelOpt PTQ recipe YAML files for quantization configuration in vLLM serving.
  • Documentation

    • Updated examples and documentation for recipe-driven quantization configuration, aligning the export workflow with the recipe-based setup.

Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Apr 10, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Apr 10, 2026


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error)

  • Security Anti-Patterns (❌ Error)
    Explanation: Critical security violation found at examples/vllm_serve/vllm_ptq_utils.py lines 147-149, where assert isinstance() is used for runtime validation; assertions can be disabled with Python optimization flags.
    Resolution: Replace assert isinstance(recipe, ModelOptPTQRecipe) with explicit validation (if not isinstance(...): raise ValueError(...)) so the check cannot be bypassed by optimization flags.
✅ Passed checks (3 passed)
  • Title check (✅ Passed): The PR title accurately describes the main change, adding recipe-based quantization support to vLLM fakequant, which is reflected across all modified files (README, fakequant_worker.py, vllm_ptq_utils.py, vllm_serve_fakequant.py).
  • Docstring Coverage (✅ Passed): No functions found in the changed files to evaluate; docstring coverage check skipped.
  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Contributor

PR Preview Action v1.8.1

🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1233/

Built to branch gh-pages at 2026-04-10 19:47 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@kinjalpatel27 kinjalpatel27 marked this pull request as ready for review April 10, 2026 19:54
@kinjalpatel27 kinjalpatel27 requested a review from a team as a code owner April 10, 2026 19:54
@kinjalpatel27 kinjalpatel27 requested a review from sugunav14 April 10, 2026 19:54
@codecov

codecov bot commented Apr 10, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.58%. Comparing base (3baa2da) to head (531fa3c).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1233      +/-   ##
==========================================
+ Coverage   76.03%   77.58%   +1.54%     
==========================================
  Files         350      350              
  Lines       40469    40537      +68     
==========================================
+ Hits        30772    31449     +677     
+ Misses       9697     9088     -609     
Flag Coverage Δ
examples 44.12% <ø> (+2.76%) ⬆️
unit 55.51% <ø> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/vllm_serve/vllm_ptq_utils.py`:
- Around line 145-150: Replace the unsafe assert in the recipe validation with
an explicit runtime check: after calling
load_recipe(quant_config["recipe_path"]) and assigning to recipe, verify type
with if not isinstance(recipe, ModelOptPTQRecipe): raise a ValueError containing
the same descriptive message (including the actual type and recipe path) and
then set quant_cfg = recipe.quantize; this ensures load_recipe,
ModelOptPTQRecipe, quant_cfg and recipe behavior remains the same but prevents
the check from being skipped under Python optimizations.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: fe9bb658-ad3b-4a06-aa28-6e2bb5155f6f

📥 Commits

Reviewing files that changed from the base of the PR and between da0e8ff and 531fa3c.

📒 Files selected for processing (4)
  • examples/vllm_serve/README.md
  • examples/vllm_serve/fakequant_worker.py
  • examples/vllm_serve/vllm_ptq_utils.py
  • examples/vllm_serve/vllm_serve_fakequant.py

Comment on lines +145 to +150
if quant_config["recipe_path"]:
    recipe = load_recipe(quant_config["recipe_path"])
    assert isinstance(recipe, ModelOptPTQRecipe), (
        f"Expected PTQ recipe, but got {type(recipe).__name__} from {quant_config['recipe_path']}"
    )
    quant_cfg = recipe.quantize
Contributor


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify current assertion usage in this path.
rg -n -C2 'assert isinstance\(recipe, ModelOptPTQRecipe\)' --type=py

# Demonstrate Python optimization removing assert checks.
python - <<'PY'
src = """def f(x):
    assert isinstance(x, int), "bad type"
    return x
"""
ns = {}
exec(compile(src, "<inline>", "exec", optimize=1), ns)
import dis
dis.dis(ns["f"])
PY

Repository: NVIDIA/Model-Optimizer

Length of output: 1818


Use explicit exception instead of assert for runtime validation.

At line 147, using assert isinstance() is unsafe because assertions can be disabled when Python runs with optimization flags (e.g., python -O), allowing invalid recipe types to bypass this check silently. Use an explicit if/raise ValueError() pattern instead.

Proposed fix
     if quant_config["recipe_path"]:
         recipe = load_recipe(quant_config["recipe_path"])
-        assert isinstance(recipe, ModelOptPTQRecipe), (
-            f"Expected PTQ recipe, but got {type(recipe).__name__} from {quant_config['recipe_path']}"
-        )
+        if not isinstance(recipe, ModelOptPTQRecipe):
+            raise ValueError(
+                f"Expected PTQ recipe, but got {type(recipe).__name__} from {quant_config['recipe_path']}"
+            )
         quant_cfg = recipe.quantize
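As a self-contained illustration of the reviewer's point (independent of this repository's code), compiling with Python's optimize mode strips assert statements entirely, while an explicit if/raise check survives:

```python
# Demonstration that assert-based validation disappears under Python's
# optimize mode (the effect of running `python -O`), while an explicit
# if/raise check does not. Names here are illustrative.

def check_with_raise(x):
    if not isinstance(x, int):
        raise ValueError(f"Expected int, got {type(x).__name__}")
    return x

# Compile an assert-based validator with optimization enabled
# (optimize=1 mimics `python -O`): the assert is removed from the
# compiled bytecode entirely.
src = (
    "def check_with_assert(x):\n"
    "    assert isinstance(x, int), 'Expected int'\n"
    "    return x\n"
)
ns = {}
exec(compile(src, "<demo>", "exec", optimize=1), ns)
check_with_assert_optimized = ns["check_with_assert"]

# Under optimization, the invalid input slips through silently:
print(check_with_assert_optimized("not an int"))  # prints: not an int

# The explicit check still raises:
try:
    check_with_raise("not an int")
except ValueError as e:
    print(e)  # prints: Expected int, got str
```

This is why the fix proposed above cannot be bypassed by interpreter flags, whereas the original assert can.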

Collaborator

@shengliangxu shengliangxu left a comment


LGTM
