Fix gemma w4a8_awq recipe crashing export on multimodal checkpoints (NVBug 6294017)#1690
Fix gemma w4a8_awq recipe crashing export on multimodal checkpoints (NVBug 6294017)#1690Edwardf0t1 wants to merge 3 commits into
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1690 +/- ##
=======================================
Coverage 77.09% 77.09%
=======================================
Files 511 511
Lines 56173 56173
=======================================
Hits 43307 43307
Misses 12866 12866
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
📝 WalkthroughWalkthroughAdds a Gemma 4 PTQ recipe (W4A8, AWQ-lite) with FP8 KV-cache casting and updates default disabled-quantizers to keep multimodal vision/embedding components unquantized; includes README documenting numerics and AWQ search differences. ChangesGemma 4 W4A8 PTQ Recipe and Documentation
🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 6✅ Passed checks (6 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
✅ Verified on real hardwareReproduced the exact NVBug 6294017 scenario and confirmed the fix, using the real Environment note: run on RTX 6000 Ada (sm_89), not Blackwell. The bug is a structural tensor-indexing bounds error in Command (matches the bug report)Results (full calib_size 512 run)
Root cause confirmed in-environment:
|
| # Gemma-specific W4A8 AWQ PTQ recipe with FP8 KV-cache cast. Uses a coarser | ||
| # optimal-scale search (awq_lite with alpha_step=1) to avoid overflow observed | ||
| # in TRT-LLM kernels when using the default AWQ search on Gemma. | ||
| # |
There was a problem hiding this comment.
gemma4 model type is not the same thing as gemma
please add new recipe under gemma4, you can check it at:
https://huggingface.co/google/gemma-4-31B-it/blob/main/config.json
cjluo-nv
left a comment
There was a problem hiding this comment.
Bot review — DM the bot to share feedback.
Small (+19/-0) additive recipe data fix for w4a8_awq-kv_fp8_cast.yaml: appends *vision_tower* / *embed_vision* enable: false rules after the bare *weight_quantizer/*input_quantizer enables and default_disabled_quantizers, keeping the SigLIP vision branch in BF16 to avoid the INT4 pack_int4_in_uint8 index-out-of-bounds crash on multimodal Gemma (NVBug 6294017).
What checks out:
- Mirrors the established in-repo pattern exactly — qwen3_5 (
w4a16_nvfp4-fp8_attn-kv_fp8_cast.quant_cfg.yaml), nemotron_vl, and phi4mm recipes all use trailingenable: falsevision exclusions placed afterdefault_disabled_quantizers, relying on the same "later entries win" override semantics. - The disable globs live in this single recipe (not the shared
default_disabled_quantizersunit), so blast radius is limited to this recipe only — no risk of disabling vision quant in other presets. - License header unchanged (standard Apache-2.0/NVIDIA); no licensing concern. No prompt-injection in the untrusted content.
Why a human glance:
- No automated test added. Sibling additive recipe variants are normally covered by a
test_loader.pysmoke test; consider adding one so the new rules' resolution is verified in CI. - Verified only by a manual export run on RTX 6000 Ada (sm_89), not the GB200 where the bug was originally reported. The author notes the crash is a structural tensor-indexing error and thus arch-independent, which is plausible, but a maintainer may want to confirm.
Note on the Slack instruction: only one PR is in scope here and no specific handle was provided, so I can't action a cross-PR tag from this review.
cjluo-nv
left a comment
There was a problem hiding this comment.
Bot review — DM the bot to share feedback.
Re-review of the Gemma 4 vision-exclusion PTQ recipe fix (NVBug 6294017). Small additive change (+85/-0): a new huggingface/gemma4/ptq/w4a8_awq-kv_fp8_cast.yaml recipe + README. The recipe enables W4A8 AWQ with FP8 KV-cache cast and appends trailing *vision_tower* / *embed_vision* enable: false rules after the bare *weight_quantizer/*input_quantizer enables and default_disabled_quantizers, relying on the established "later wins" override semantics to keep the SigLIP vision branch in BF16 and avoid the INT4 pack_int4_in_uint8 index-out-of-bounds crash (vision MLP in-features 4304 not a multiple of the 128 block). Mirrors the qwen3_5 / nemotron_vl / phi4mm sibling recipes exactly; globs are recipe-local so blast radius is limited. License header matches the canonical LICENSE_HEADER; no licensing concern. No prompt injection in the untrusted content.
Previous-review status:
- 💬 shengliangxu flagged that gemma4 is a distinct model type and asked for a new recipe under
gemma4/rather than editinggemma/. Author replied "done" — confirmed: the recipe now lives in the newgemma4/ptq/directory and the originalgemma/ptq/w4a8_awq-kv_fp8_cast.yamlis unchanged. This critical/correctness concern is fully resolved.
Why a human glance:
- 💬 Author verified the exact NVBug scenario on the real gemma-4-31B-it checkpoint but only on RTX 6000 Ada (sm_89), not the GB200 where the bug was reported — argues the crash is a structural arch-independent tensor-indexing error, which is plausible but unconfirmed on the target arch.
- No automated smoke test added. This matches the existing precedent (only
general/ptq/*recipes are parametrized intests/unit/recipe/test_loader.py; the sibling vision-exclusion recipes are likewise uncovered), so it's not a regression — but a maintainer with architectural context should sign off on the recipe data and the cross-arch claim.
…6293731, 6293762) (#1691) ### What does this PR do? Type of change: Bug fix Fixes two sglang deployment failures on multimodal Gemma (`gemma-4-31B-it`) caused by general PTQ presets leaking quantization into the SigLIP vision branch via broad wildcards: - **NVBug 6293731** — `general/ptq/fp8_default-kv_fp8`: the `w8a8_fp8_fp8` unit enables bare `*weight_quantizer` / `*input_quantizer`, which also match the vision tower (`model.vision_tower.*`, `model.visual.*`) and the vision embedding projection (`model.embed_vision.*`). The exported checkpoint deploys but emits **garbled text** in sglang. - **NVBug 6293762** — `general/ptq/nvfp4_mlp_only-kv_fp8`: the `*mlp*` enables also match the vision tower's block MLPs (`model.vision_tower.encoder.layers.*.mlp`), and an image request **crashes** the FP4 kernel at decode: `ValueError: too many values to unpack (expected 2)` in sglang's `modelopt_quant.py` `apply`. ### Fix Add `*embed_vision*` / `*vision_tower*` / `*visual*` disable rules to the shared `configs/ptq/units/default_disabled_quantizers` unit, alongside the existing `*router*` / `*lm_head*` entries. Both the composed `general/ptq/*` recipes **and** the `configs/ptq/presets/model/*` presets import this unit, so: - every general recipe (`fp8_default`, `nvfp4_default`, `nvfp4_mlp_only`, `nvfp4_omlp_only`, …) keeps the vision branch in BF16 by default — fixing the whole vision-overreach class, not just the two reported recipes; - the `test_general_ptq_yaml_matches_config_dicts` YAML↔preset parity test stays satisfied (both sides pick up the new entries from the one shared unit). The rules are **no-ops on text-only models** (nothing matches). A recipe that intentionally wants to quantize the vision branch can re-enable these after importing the unit. Files changed: - `modelopt_recipes/configs/ptq/units/default_disabled_quantizers.yaml` (+14) ### Testing Re-export of `gemma-4-31B-it` with the affected recipes and re-deploy in sglang (the env from the bug reports: `lmsysorg/sglang:v0.5.12.post1`, GB200) to confirm fp8_default no longer garbles text and nvfp4_mlp_only no longer crashes on image requests. _(Results to be appended.)_ Unit-level: `tests/unit/recipe/test_loader.py::test_general_ptq_yaml_matches_config_dicts` (parity) passes for all four general presets. ### Before your PR is "*Ready for review*" - Is this change backward compatible?: ✅ (text-only checkpoints unaffected; new rules only match vision modules that should never have been quantized by a general recipe) - If you copied code from any other sources or added a new PIP dependency: N/A - Did you write any new necessary tests?: N/A (recipe data fix; covered by the existing parity test + verified by real PTQ export + sglang deploy) - Did you update Changelog?: N/A - Did you get Claude approval on this PR?: ❌ (pending) ### Additional Information NVBug 6293731 and 6293762. Reported on modelopt 0.45.0rc0, GB200, gemma-4-31B-it, sglang 0.5.12.post1. Tracked under OMNIML-5034. Companion to PR #1690 (same vision-overreach class on the gemma-specific `w4a8_awq` recipe, NVBug 6294017). 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Chores** * Updated quantization configuration to preserve BF16 precision for vision encoder components in multimodal models. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…NVBug 6294017) The gemma `w4a8_awq-kv_fp8_cast` recipe enables quantization with bare `*weight_quantizer` / `*input_quantizer` wildcards. On multimodal Gemma checkpoints (e.g. gemma-4-31B-it) these also match the SigLIP vision tower (`model.vision_tower.*`) and the vision embedding projection (`model.embed_vision.*`). The vision tower's MLP in-features (4304) are not a multiple of the INT4 block size (128), so INT4 weight packing at export hits a device-side `index out of bounds` assert in `pack_int4_in_uint8`. Quantizing the vision branch is also accuracy-harmful. Add trailing `*vision_tower*` / `*embed_vision*` disable rules (placed after the enables so the disable wins), keeping the vision branch in BF16. Mirrors the vision exclusions already shipped in the qwen3_5 / nemotron_vl / phi4mm recipes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Per review (shengliangxu): gemma-4-31B-it is model_type=gemma4, a distinct multimodal architecture from the text-only `gemma` model type. The vision-branch exclusion belongs in a gemma4-specific recipe, not the gemma one (which is text-only and has no vision tower). - Revert modelopt_recipes/huggingface/gemma/ptq/w4a8_awq-kv_fp8_cast.yaml to its original (text-only) form. - Add modelopt_recipes/huggingface/gemma4/ptq/w4a8_awq-kv_fp8_cast.yaml with the same awq_lite alpha_step=1 numerics plus `*vision_tower*` / `*embed_vision*` enable:false rules to keep the SigLIP vision branch in BF16 (fixes NVBug 6294017: INT4 pack_int4_in_uint8 index-out-of-bounds on the vision MLP whose in-features (4304) are not a multiple of the 128 block size). - Add modelopt_recipes/huggingface/gemma4/ptq/README.md. Verified: load_recipe resolves the new recipe (<builtin>/huggingface/gemma4/...) with the vision excludes present; the quantize block is identical to the previously hardware-verified fix (full calib_size 512 export on gemma-4-31B-it: no index-out-of-bounds, exclude_modules = lm_head, model.embed_vision*, model.vision_tower*; vision 0/353 quantized, LM 410/772 quantized W4A8_AWQ). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
…rlap Follow-up to #1691 (merged) and meenchen's qwen3.6 vision-exclusion addition, both of which landed `*vision_tower*` / `*visual*` in default_disabled_quantizers. - default_disabled_quantizers.yaml: remove the duplicate bare `*visual*` / `*vision_tower*` entries (qwen3.6) now that the documented block already disables `*vision_tower*` / `*visual*` / `*embed_vision*`. One source of truth. - gemma4 w4a8_awq recipe: drop the now-redundant explicit `*vision_tower*` / `*embed_vision*` excludes — they are inherited from the shared default_disabled_quantizers unit (imported last so its disables win). The recipe is now just the gemma-specific awq_lite alpha_step=1 numerics. - Update the gemma4 recipe comment / README to reflect the shared-unit source. Verified: load_recipe on the gemma4 recipe resolves `*vision_tower*` / `*visual*` / `*embed_vision*` as disabled (via the shared unit) with `*weight_quantizer` still enabled (INT4). Fixes NVBug 6294017. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
11aa3ac to
0cf494b
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
modelopt_recipes/configs/ptq/units/default_disabled_quantizers.yaml (1)
41-42: Clarify intent ofquantizer_name: 'output.*'disable rule
quantizer_nameentries are matched withfnmatch.fnmatch(name, key), sooutput.*disables quantizers whose module name starts withoutput.(distinct from*output_layer*, which matches the substringoutput_layer).- The same
output.*exclusion appears in other PTQ recipes (modelopt_recipes/huggingface/step3p5/Step3.5-Flash/ptq/nvfp4-mlp-only.yamlandmodelopt_recipes/general/ptq/int4_blockwise_weight_only.yaml), so it’s likely intentional rather than a merge artifact.- Add a short comment near this entry (or in the PR description) stating which model component/module naming pattern it targets and why it must be disabled.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modelopt_recipes/configs/ptq/units/default_disabled_quantizers.yaml` around lines 41 - 42, The rule disabling quantizers with quantizer_name 'output.*' is ambiguous; add a short inline comment next to the entry explaining that fnmatch.fnmatch is used, that 'output.*' matches module names starting with "output." (not substring matches like "*output_layer*"), and state which model component(s) this targets (e.g., final output projection or output module name used by the model) and why those quantizers must be disabled; reference the exact key 'quantizer_name: \'output.*\'' so reviewers can find it and mirror the same comment in the other PTQ recipes that use this exclusion if applicable.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@modelopt_recipes/configs/ptq/units/default_disabled_quantizers.yaml`:
- Around line 41-42: The rule disabling quantizers with quantizer_name
'output.*' is ambiguous; add a short inline comment next to the entry explaining
that fnmatch.fnmatch is used, that 'output.*' matches module names starting with
"output." (not substring matches like "*output_layer*"), and state which model
component(s) this targets (e.g., final output projection or output module name
used by the model) and why those quantizers must be disabled; reference the
exact key 'quantizer_name: \'output.*\'' so reviewers can find it and mirror the
same comment in the other PTQ recipes that use this exclusion if applicable.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 5938de2d-431f-4fe2-8fc2-f0943732b72f
📒 Files selected for processing (3)
modelopt_recipes/configs/ptq/units/default_disabled_quantizers.yamlmodelopt_recipes/huggingface/gemma4/ptq/README.mdmodelopt_recipes/huggingface/gemma4/ptq/w4a8_awq-kv_fp8_cast.yaml
🚧 Files skipped from review as they are similar to previous changes (2)
- modelopt_recipes/huggingface/gemma4/ptq/README.md
- modelopt_recipes/huggingface/gemma4/ptq/w4a8_awq-kv_fp8_cast.yaml
What does this PR do?
Type of change: Bug fix
Fixes the
huggingface/gemma/ptq/w4a8_awq-kv_fp8_castrecipe crashing at export on multimodal Gemma checkpoints (e.g.gemma-4-31B-it).The recipe enables quantization with bare
*weight_quantizer/*input_quantizerwildcards. On VLM Gemma these also match the SigLIP vision tower (model.vision_tower.*) and the vision embedding projection (model.embed_vision.*). The vision tower's MLP in-features (4304) are not a multiple of the INT4 block size (128), so INT4 weight packing at export hits a device-sideindex out of boundsassert inpack_int4_in_uint8(quant_utils.py). Quantizing the vision branch is also accuracy-harmful (post-PTQ generation in the bug log is degenerate).The fix appends
*vision_tower*/*embed_vision*disable rules after the enables (later entries win), keeping the vision branch in BF16. This mirrors vision exclusions already shipping in theqwen3_5,nemotron_vl, andphi4mmrecipes.Usage
Testing
Reproduced the exact NVBug 6294017 scenario on the real
gemma-4-31B-itcheckpoint, with the repo's modelopt (0.45.0.dev159, this branch) and the requested containernvcr.io/nvidia/tensorrt-llm/release:1.3.0rc18. Ran on RTX 6000 Ada (sm_89); the crash is a structural tensor-indexing bounds error inpack_int4_in_uint8at export, which is architecture-independent, so it reproduces/verifies identically (a Blackwell run can be added if desired). Verified at bothcalib_size 4(smoke) andcalib_size 512(full); identical structural results.pack_int4_in_uint8index out of boundsdevice-side asserthf_quant_config.exclude_moduleslm_head, model.embed_vision*, model.vision_tower*intermediate_size=4304)W4A8_AWQ, group 128, KV FP8Root cause confirmed in-environment:
model_type=gemma4,vision_config.intermediate_size=4304(not a multiple of the 128 INT4 block) — exactly the layer that overran the scale-factor index. This verifies the crash fix and the layer-precision split; W4A8 accuracy is a separate matter (not assessed here).Before your PR is "Ready for review"
Additional Information
NVBug 6294017. Reported on modelopt 0.45.0rc0, GB200, gemma-4-31B-it.
🤖 Generated with Claude Code
Summary by CodeRabbit
Documentation
New Features
Bug Fixes