Exclude multimodal vision branch from quantization by default (NVBug 6293731, 6293762) #1691

Open
Edwardf0t1 wants to merge 2 commits into main from fix-gemma4-fp8-nvfp4-vision-exclude

Conversation

@Edwardf0t1 Edwardf0t1 commented Jun 11, 2026

Contributor

What does this PR do?

Type of change: Bug fix

Fixes two sglang deployment failures on multimodal Gemma (gemma-4-31B-it) caused by general PTQ presets leaking quantization into the SigLIP vision branch via broad wildcards:

  • NVBug 6293731 (general/ptq/fp8_default-kv_fp8): the w8a8_fp8_fp8 unit enables bare *weight_quantizer / *input_quantizer, which also match the vision tower (model.vision_tower.*, model.visual.*) and the vision embedding projection (model.embed_vision.*). The exported checkpoint deploys but emits garbled text in sglang.
  • NVBug 6293762 (general/ptq/nvfp4_mlp_only-kv_fp8): the *mlp* enables also match the vision tower's block MLPs (model.vision_tower.encoder.layers.*.mlp), and an image request crashes the FP4 kernel at decode: ValueError: too many values to unpack (expected 2) in sglang's modelopt_quant.py apply.
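The overreach in both bullets can be reproduced without loading a model. The sketch below assumes the recipe patterns behave like fnmatch-style globs over quantizer names; the four names are hypothetical, loosely modeled on the module paths above, and are not taken from the real checkpoint.

```python
from fnmatch import fnmatch

# Hypothetical quantizer names (illustration only, not read from gemma-4-31B-it).
QUANTIZER_NAMES = [
    "model.language_model.layers.0.self_attn.q_proj.weight_quantizer",
    "model.language_model.layers.0.mlp.down_proj.input_quantizer",
    "model.vision_tower.encoder.layers.0.mlp.fc1.weight_quantizer",
    "model.embed_vision.embedding_projection.input_quantizer",
]


def matched(patterns, names):
    """Return the names matched by at least one glob pattern."""
    return [n for n in names if any(fnmatch(n, p) for p in patterns)]


# Bare enables from the w8a8_fp8_fp8 unit: every quantizer matches.
fp8_hits = matched(["*weight_quantizer", "*input_quantizer"], QUANTIZER_NAMES)
# The *mlp* enable from nvfp4_mlp_only: the vision block MLP matches too.
mlp_hits = matched(["*mlp*"], QUANTIZER_NAMES)

vision_in_fp8 = [n for n in fp8_hits if "vision" in n]
vision_in_mlp = [n for n in mlp_hits if "vision" in n]
print(vision_in_fp8)
print(vision_in_mlp)
```

Under these assumptions the FP8 enables pick up both vision entries, and *mlp* picks up the vision tower MLP alongside the language-model MLP.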

Fix

Add *embed_vision* / *vision_tower* / *visual* disable rules to the shared configs/ptq/units/default_disabled_quantizers unit, alongside the existing *router* / *lm_head* entries.

Both the composed general/ptq/* recipes and the configs/ptq/presets/model/* presets import this unit, so:

  • every general recipe (fp8_default, nvfp4_default, nvfp4_mlp_only, nvfp4_omlp_only, …) keeps the vision branch in BF16 by default — fixing the whole vision-overreach class, not just the two reported recipes;
  • the test_general_ptq_yaml_matches_config_dicts YAML↔preset parity test stays satisfied (both sides pick up the new entries from the one shared unit).

The rules are no-ops on text-only models (nothing matches). A recipe that intentionally wants to quantize the vision branch can re-enable these after importing the unit.
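A minimal sketch of the intended behavior, assuming rules are applied in order with the last matching entry winning (as the inline comment in the earlier revision of this change states) and that patterns are fnmatch-style globs. The `enable` key, the `is_enabled` helper, the `*vision_tower*mlp*` re-enable pattern, and the module names are hypothetical and only illustrate the idea; they are not the loader's actual schema.

```python
from fnmatch import fnmatch

# Stand-in for the shared default_disabled_quantizers unit.
# The "enable" key is an assumption made for this illustration.
DEFAULT_DISABLED = [
    {"quantizer_name": "*router*", "enable": False},
    {"quantizer_name": "*lm_head*", "enable": False},
    {"quantizer_name": "*embed_vision*", "enable": False},
    {"quantizer_name": "*vision_tower*", "enable": False},
    {"quantizer_name": "*visual*", "enable": False},
]


def is_enabled(name, rules):
    """Resolve a quantizer's state; the last matching rule wins."""
    state = False  # assume quantizers start disabled
    for rule in rules:
        if fnmatch(name, rule["quantizer_name"]):
            state = rule["enable"]
    return state


enables = [{"quantizer_name": "*mlp*", "enable": True}]
default_recipe = enables + DEFAULT_DISABLED
# A recipe that deliberately quantizes vision MLPs re-enables them
# after importing the shared unit.
vision_recipe = default_recipe + [
    {"quantizer_name": "*vision_tower*mlp*", "enable": True}
]

text_mlp = "model.language_model.layers.0.mlp.up_proj.weight_quantizer"
vision_mlp = "model.vision_tower.encoder.layers.0.mlp.fc1.weight_quantizer"

print(is_enabled(text_mlp, default_recipe))    # text path is unaffected
print(is_enabled(vision_mlp, default_recipe))  # vision stays unquantized
print(is_enabled(vision_mlp, vision_recipe))   # explicit opt-in
```

In this sketch the text-side MLP resolves the same way with or without the shared unit, which is the no-op property claimed for text-only models.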

Files changed:

  • modelopt_recipes/configs/ptq/units/default_disabled_quantizers.yaml (+14)

Testing

Re-export gemma-4-31B-it with the affected recipes and re-deploy in sglang (the env from the bug reports: lmsysorg/sglang:v0.5.12.post1, GB200) to confirm that fp8_default no longer garbles text and nvfp4_mlp_only no longer crashes on image requests. (Results to be appended.)

Unit-level: tests/unit/recipe/test_loader.py::test_general_ptq_yaml_matches_config_dicts (parity) passes for all four general presets.

Before your PR is "Ready for review"

  • Is this change backward compatible?: ✅ (text-only checkpoints unaffected; new rules only match vision modules that should never have been quantized by a general recipe)
  • If you copied code from any other sources or added a new PIP dependency: N/A
  • Did you write any new necessary tests?: N/A (recipe data fix; covered by the existing parity test + verified by real PTQ export + sglang deploy)
  • Did you update Changelog?: N/A
  • Did you get Claude approval on this PR?: ❌ (pending)

Additional Information

NVBug 6293731 and 6293762. Reported on modelopt 0.45.0rc0, GB200, gemma-4-31B-it, sglang 0.5.12.post1. Tracked under OMNIML-5034. Companion to PR #1690 (same vision-overreach class on the gemma-specific w4a8_awq recipe, NVBug 6294017).

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Chores
    • Updated quantization configuration to preserve BF16 precision for vision encoder components in multimodal models.

…anch in sglang (NVBug 6293731, 6293762)

The general PTQ presets `fp8_default-kv_fp8` and `nvfp4_mlp_only-kv_fp8`
(and their `_cast` KV siblings) enable quantization with broad wildcards
that, on multimodal Gemma checkpoints (e.g. gemma-4-31B-it), also match the
SigLIP vision tower (`model.vision_tower.*`), the vision embedding projection
(`model.embed_vision.*`), and the vision block MLPs:

  - `fp8_default`: the `w8a8_fp8_fp8` unit enables bare `*weight_quantizer` /
    `*input_quantizer`, FP8-quantizing the whole vision branch. The exported
    checkpoint then deploys but emits garbled text in sglang (NVBug 6293731).
  - `nvfp4_mlp_only`: the `*mlp*` enables match
    `vision_tower.encoder.layers.*.mlp`, so the FP4 kernel crashes at decode
    with `ValueError: too many values to unpack (expected 2)` in sglang's
    modelopt_quant apply path (NVBug 6293762).

Add trailing `*visual*` / `*vision_tower*` / `*embed_vision*` disable rules
(placed after the enables and `default_disabled_quantizers` so the disable
wins), keeping the vision branch in BF16. Mirrors the vision exclusions
already shipped in the gemma w4a8_awq / qwen3_5 / nemotron_vl recipes. The
rules are no-ops on text-only models.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
@Edwardf0t1 Edwardf0t1 requested a review from a team as a code owner June 11, 2026 21:29
@Edwardf0t1 Edwardf0t1 requested a review from sychen52 June 11, 2026 21:29
coderabbitai Bot commented Jun 11, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3c9f8ae1-09ab-40c2-9240-ae3f30f5b2ec

📥 Commits

Reviewing files that changed from the base of the PR and between 9e2acad and 513862e.

📒 Files selected for processing (1)
  • modelopt_recipes/configs/ptq/units/default_disabled_quantizers.yaml

📝 Walkthrough

This PR extends the PTQ default quantizer disable configuration to explicitly exclude vision and multimodal components from quantization by adding three new pattern-matching rules (*embed_vision*, *vision_tower*, *visual*) with documentation that these components remain in BF16 format unless downstream recipes re-enable them.

Changes

Quantization Recipe Configuration Update

| Layer / File(s) | Summary |
| --- | --- |
| Vision component exclusion patterns in default quantizer config (modelopt_recipes/configs/ptq/units/default_disabled_quantizers.yaml) | Three new quantizer disable entries for *embed_vision*, *vision_tower*, and *visual* patterns are added to the default disabled quantizers configuration, with accompanying comments explaining that vision encoders and multimodal embedding projections remain in BF16 by default. |

Possibly related PRs

  • NVIDIA/Model-Optimizer#1687: Both PRs modify the same PTQ quantizer-disabling YAML configuration to add rules keeping vision and multimodal components unquantized for NVFP4.

Suggested reviewers

  • shengliangxu

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks | ✅ 6
✅ Passed checks (6 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: excluding multimodal vision branches from quantization in PTQ recipes. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Security Anti-Patterns | ✅ Passed | git diff vs origin/main shows no Python changes under modelopt/ or examples/ (only YAML plus non-scope test/tool files). No SECURITY.md anti-patterns can be introduced in-scope Python. |


# (NVBug 6293731) and is accuracy-harmful generally. Must come after the
# enables so the disable wins (later entries override earlier). No-op on
# text-only models.
- quantizer_name: '*visual*'

Contributor

Should we just move these to the default_disabled_quantizers YAML? Or do we want to quantize them in some cases?

Contributor

I agree. I moved these to default_disabled_quantizers for diffusiongemma too; those changes are currently local.

Contributor Author

Agree that vision modules need to be excluded by default.

@Edwardf0t1 Edwardf0t1 added the cherry-pick-0.45.0 After code freeze, cherry-pick to release branch for next rc (bulk update). Only for bug fixes / doc label Jun 12, 2026
…6293731, 6293762)

The general PTQ presets quantize via broad wildcards: `fp8_default` enables
bare `*weight_quantizer` / `*input_quantizer` (the `w8a8_fp8_fp8` unit) and
`nvfp4_mlp_only` enables `*mlp*`. On multimodal checkpoints (e.g. gemma-4-31B-it)
these also match the SigLIP vision tower (`model.vision_tower.*`,
`model.visual.*`) and the vision embedding projection (`model.embed_vision.*`):

  - fp8_default-kv_fp8: FP8-quantizes the vision branch; the checkpoint deploys
    but emits garbled text in sglang (NVBug 6293731).
  - nvfp4_mlp_only-kv_fp8: NVFP4-quantizes the vision block MLPs; the FP4 kernel
    crashes at decode with `too many values to unpack (expected 2)` (NVBug 6293762).

Add `*embed_vision*` / `*vision_tower*` / `*visual*` disable rules to the shared
`configs/ptq/units/default_disabled_quantizers` unit, alongside the existing
`*router*` / `*lm_head*` entries. Because both the composed `general/ptq/*`
recipes and the `configs/ptq/presets/model/*` presets import this unit, every
general recipe keeps the vision branch in BF16 by default and the YAML<->preset
parity test stays satisfied. No-op on text-only models; a recipe that
intentionally quantizes vision can re-enable after importing this unit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
@Edwardf0t1 Edwardf0t1 changed the title from "Fix gemma-4 fp8_default / nvfp4_mlp_only recipes quantizing vision branch in sglang (NVBug 6293731, 6293762)" to "Exclude multimodal vision branch from quantization by default (NVBug 6293731, 6293762)" Jun 12, 2026
codecov Bot commented Jun 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 67.73%. Comparing base (dd49a46) to head (513862e).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1691      +/-   ##
==========================================
+ Coverage   67.72%   67.73%   +0.01%     
==========================================
  Files         511      511              
  Lines       56168    56168              
==========================================
+ Hits        38037    38043       +6     
+ Misses      18131    18125       -6     
Flag Coverage Δ
unit 54.34% <ø> (+0.01%) ⬆️

Labels

cherry-pick-0.45.0 After code freeze, cherry-pick to release branch for next rc (bulk update). Only for bug fixes / doc

3 participants