Exclude multimodal vision branch from quantization by default (NVBug 6293731, 6293762)#1691
Edwardf0t1 wants to merge 2 commits into
…anch in sglang (NVBug 6293731, 6293762)
The general PTQ presets `fp8_default-kv_fp8` and `nvfp4_mlp_only-kv_fp8`
(and their `_cast` KV siblings) enable quantization with broad wildcards
that, on multimodal Gemma checkpoints (e.g. gemma-4-31B-it), also match the
SigLIP vision tower (`model.vision_tower.*`), the vision embedding projection
(`model.embed_vision.*`), and the vision block MLPs:
- `fp8_default`: the `w8a8_fp8_fp8` unit enables bare `*weight_quantizer` /
`*input_quantizer`, FP8-quantizing the whole vision branch. The exported
checkpoint then deploys but emits garbled text in sglang (NVBug 6293731).
- `nvfp4_mlp_only`: the `*mlp*` enables match
`vision_tower.encoder.layers.*.mlp`, so the FP4 kernel crashes at decode
with `ValueError: too many values to unpack (expected 2)` in sglang's
modelopt_quant apply path (NVBug 6293762).
Add trailing `*visual*` / `*vision_tower*` / `*embed_vision*` disable rules
(placed after the enables and `default_disabled_quantizers` so the disable
wins), keeping the vision branch in BF16. Mirrors the vision exclusions
already shipped in the gemma w4a8_awq / qwen3_5 / nemotron_vl recipes. The
rules are no-ops on text-only models.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
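The ordering argument in this commit message (a trailing disable beats an earlier broad enable) can be illustrated with a minimal sketch. It assumes glob-style matching and "last matching rule wins" semantics, as the message describes; the rule tuples, the `is_enabled` helper, and the module names below are hypothetical and are not modelopt's actual API or the real Gemma module paths.

```python
from fnmatch import fnmatchcase

# Hypothetical rule format: (glob pattern, enabled). Later entries override earlier ones.
ENABLES = [
    ("*weight_quantizer", True),
    ("*input_quantizer", True),
]
VISION_DISABLES = [
    ("*visual*", False),
    ("*vision_tower*", False),
    ("*embed_vision*", False),
]


def is_enabled(name, rules):
    """Return the state set by the last rule whose pattern matches (default: disabled)."""
    state = False
    for pattern, enabled in rules:
        if fnmatchcase(name, pattern):
            state = enabled
    return state


# Illustrative quantizer names only; real module paths may differ.
text_q = "model.language_model.layers.0.mlp.up_proj.weight_quantizer"
vision_q = "model.vision_tower.encoder.layers.0.mlp.fc1.weight_quantizer"

# Broad enables alone also catch the vision tower.
leaks = is_enabled(vision_q, ENABLES)
# Disables placed after the enables win for the vision branch only.
fixed = not is_enabled(vision_q, ENABLES + VISION_DISABLES)
# Disables placed before the enables are overridden again.
wrong_order = is_enabled(vision_q, VISION_DISABLES + ENABLES)
```

Under these assumptions the text-branch quantizer stays enabled in every ordering, which is why the rules are described as no-ops on text-only models.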
No actionable comments were generated in the recent review.
Walkthrough
This PR extends the PTQ default quantizer disable configuration to explicitly exclude vision and multimodal components from quantization by adding three new pattern-matching rules.
Changes: Quantization Recipe Configuration Update
Estimated code review effort: 1 (Trivial), ~3 minutes
Pre-merge checks: 6 passed
# (NVBug 6293731) and is accuracy-harmful generally. Must come after the
# enables so the disable wins (later entries override earlier). No-op on
# text-only models.
- quantizer_name: '*visual*'
Should we just move these to the default_disabled_quantizers YAML? Or do we want to quantize them in some cases?
I agree. I moved these to default_disabled_quantizers for diffusiongemma too; those changes are currently local.
Agree that vision modules need to be excluded by default.
…6293731, 6293762)
The general PTQ presets quantize via broad wildcards: `fp8_default` enables
bare `*weight_quantizer` / `*input_quantizer` (the `w8a8_fp8_fp8` unit) and
`nvfp4_mlp_only` enables `*mlp*`. On multimodal checkpoints (e.g. gemma-4-31B-it)
these also match the SigLIP vision tower (`model.vision_tower.*`,
`model.visual.*`) and the vision embedding projection (`model.embed_vision.*`):
- fp8_default-kv_fp8: FP8-quantizes the vision branch; the checkpoint deploys
but emits garbled text in sglang (NVBug 6293731).
- nvfp4_mlp_only-kv_fp8: NVFP4-quantizes the vision block MLPs; the FP4 kernel
crashes at decode with `too many values to unpack (expected 2)` (NVBug 6293762).
Add `*embed_vision*` / `*vision_tower*` / `*visual*` disable rules to the shared
`configs/ptq/units/default_disabled_quantizers` unit, alongside the existing
`*router*` / `*lm_head*` entries. Because both the composed `general/ptq/*`
recipes and the `configs/ptq/presets/model/*` presets import this unit, every
general recipe keeps the vision branch in BF16 by default and the YAML<->preset
parity test stays satisfied. No-op on text-only models; a recipe that
intentionally quantizes vision can re-enable after importing this unit.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
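The shared-unit design described above can be sketched as follows: every recipe appends one common disable list, so the vision rules reach all recipes at once, stay inert on text-only models, and can still be overridden by a rule added after the import. This is a sketch under assumptions, not the modelopt loader: `DEFAULT_DISABLED`, `compose`, `enabled_names`, and all module names are hypothetical, and glob matching with last-match-wins semantics is assumed.

```python
from fnmatch import fnmatchcase

# Hypothetical stand-in for the shared default_disabled_quantizers unit.
LEGACY_DISABLED = [("*router*", False), ("*lm_head*", False)]
VISION_DISABLED = [("*embed_vision*", False), ("*vision_tower*", False), ("*visual*", False)]
DEFAULT_DISABLED = LEGACY_DISABLED + VISION_DISABLED


def compose(enables, shared=DEFAULT_DISABLED, extra=()):
    """A recipe is its enables, then the shared disables, then recipe-specific overrides."""
    return list(enables) + list(shared) + list(extra)


def enabled_names(names, rules):
    """Names whose last matching rule is an enable."""
    result = set()
    for name in names:
        state = False
        for pattern, enabled in rules:
            if fnmatchcase(name, pattern):
                state = enabled
        if state:
            result.add(name)
    return result


FP8_ENABLES = [("*weight_quantizer", True), ("*input_quantizer", True)]
MLP_ONLY_ENABLES = [("*mlp*", True)]

# Illustrative quantizer names only; real module paths may differ.
TEXT_ONLY = [
    "model.language_model.layers.0.self_attn.q_proj.weight_quantizer",
    "model.language_model.layers.0.mlp.down_proj.weight_quantizer",
    "lm_head.weight_quantizer",
]
VISION = [
    "model.vision_tower.encoder.layers.0.mlp.fc1.weight_quantizer",
    "model.embed_vision.embedding_projection.input_quantizer",
]
MULTIMODAL = TEXT_ONLY + VISION

# Both recipes pick up the vision exclusions from the one shared list.
fp8_on = enabled_names(MULTIMODAL, compose(FP8_ENABLES))
mlp_on = enabled_names(MULTIMODAL, compose(MLP_ONLY_ENABLES))

# A recipe that wants vision quantized re-enables it after the shared unit.
reenabled = enabled_names(
    MULTIMODAL, compose(FP8_ENABLES, extra=[("*vision_tower*weight_quantizer", True)])
)
```

Keeping the disables in one list is what lets two independently composed views of a recipe stay in agreement: both sides read the same entries instead of each carrying its own copy.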
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@            Coverage Diff            @@
##             main    #1691    +/-   ##
==========================================
+ Coverage   67.72%   67.73%   +0.01%
==========================================
  Files         511      511
  Lines       56168    56168
==========================================
+ Hits        38037    38043      +6
+ Misses      18131    18125      -6
What does this PR do?
Type of change: Bug fix
Fixes two sglang deployment failures on multimodal Gemma (gemma-4-31B-it) caused by general PTQ presets leaking quantization into the SigLIP vision branch via broad wildcards:
- `general/ptq/fp8_default-kv_fp8`: the `w8a8_fp8_fp8` unit enables bare `*weight_quantizer` / `*input_quantizer`, which also match the vision tower (`model.vision_tower.*`, `model.visual.*`) and the vision embedding projection (`model.embed_vision.*`). The exported checkpoint deploys but emits garbled text in sglang.
- `general/ptq/nvfp4_mlp_only-kv_fp8`: the `*mlp*` enables also match the vision tower's block MLPs (`model.vision_tower.encoder.layers.*.mlp`), and an image request crashes the FP4 kernel at decode: `ValueError: too many values to unpack (expected 2)` in sglang's `modelopt_quant.py` apply.
Fix
Add `*embed_vision*` / `*vision_tower*` / `*visual*` disable rules to the shared `configs/ptq/units/default_disabled_quantizers` unit, alongside the existing `*router*` / `*lm_head*` entries.
Both the composed `general/ptq/*` recipes and the `configs/ptq/presets/model/*` presets import this unit, so:
- every general recipe (`fp8_default`, `nvfp4_default`, `nvfp4_mlp_only`, `nvfp4_omlp_only`, …) keeps the vision branch in BF16 by default, fixing the whole vision-overreach class, not just the two reported recipes;
- the `test_general_ptq_yaml_matches_config_dicts` YAML↔preset parity test stays satisfied (both sides pick up the new entries from the one shared unit).
The rules are no-ops on text-only models (nothing matches). A recipe that intentionally wants to quantize the vision branch can re-enable these after importing the unit.
Files changed:
- `modelopt_recipes/configs/ptq/units/default_disabled_quantizers.yaml` (+14)
Testing
Re-export of `gemma-4-31B-it` with the affected recipes and re-deploy in sglang (the env from the bug reports: `lmsysorg/sglang:v0.5.12.post1`, GB200) to confirm fp8_default no longer garbles text and nvfp4_mlp_only no longer crashes on image requests. (Results to be appended.)
Unit-level: `tests/unit/recipe/test_loader.py::test_general_ptq_yaml_matches_config_dicts` (parity) passes for all four general presets.
Additional Information
NVBug 6293731 and 6293762. Reported on modelopt 0.45.0rc0, GB200, gemma-4-31B-it, sglang 0.5.12.post1. Tracked under OMNIML-5034. Companion to PR #1690 (same vision-overreach class on the gemma-specific `w4a8_awq` recipe, NVBug 6294017).
🤖 Generated with Claude Code