Fix GPT-OSS MXFP4->NVFP4 PTQ load, export, and cast (nvbug 6295279, 6295242)#1678
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (8)
✅ Files skipped from review due to trivial changes (1)
🚧 Files skipped from review as they are similar to previous changes (4)
📝 WalkthroughWalkthroughThis PR fixes MXFP4→NVFP4 post-training quantization workflows by: (1) adding MXFP4 checkpoint detection and native model loading with forced dequantization and controlled device mapping; (2) resolving Hugging Face Hub IDs to local snapshots for casting operations; (3) aligning expert weight calibration to use transposed views matching forward-time behavior. ChangesMXFP4 PTQ Loading and Expert Calibration Shape Fixes
🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested reviewers:
🚥 Pre-merge checks | ✅ 5 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
|
There was a problem hiding this comment.
🧹 Nitpick comments (1)
examples/llm_ptq/example_utils.py (1)
722-723: ⚡ Quick winMove
Mxfp4Configimport to module scope (or add explicit local-import justification).Line 722 introduces a function-local import without an explicit circular/optional/heavy-import reason.
As per coding guidelines, keep imports at the top of the file; function-local imports are only for circular imports, optional deps, or explicitly justified heavy imports.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/llm_ptq/example_utils.py` around lines 722 - 723, Move the local import of Mxfp4Config out of the function and into the module-level imports (add "from transformers import Mxfp4Config" to the top import block) so imports follow the project guideline; if the import must remain local because it's optional or to avoid a circular/heavy dependency, add an explicit comment above the local import explaining the reason and wrap it in a try/except ImportError with a clear fallback or error message so callers understand the justification.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@examples/llm_ptq/example_utils.py`:
- Around line 722-723: Move the local import of Mxfp4Config out of the function
and into the module-level imports (add "from transformers import Mxfp4Config" to
the top import block) so imports follow the project guideline; if the import
must remain local because it's optional or to avoid a circular/heavy dependency,
add an explicit comment above the local import explaining the reason and wrap it
in a try/except ImportError with a clear fallback or error message so callers
understand the justification.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 650a8273-ec8a-4e7c-8d8f-6f1447eac20f
📒 Files selected for processing (6)
CHANGELOG.rstexamples/llm_ptq/example_utils.pyexamples/llm_ptq/hf_ptq.pymodelopt/torch/quantization/plugins/huggingface.pytests/examples/llm_ptq/test_example_utils.pytests/unit/torch/quantization/plugins/test_huggingface.py
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1678 +/- ##
==========================================
- Coverage 77.30% 76.20% -1.11%
==========================================
Files 509 511 +2
Lines 55914 56948 +1034
==========================================
+ Hits 43227 43399 +172
- Misses 12687 13549 +862
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
29f82ce to
eaff3f2
Compare
meenchen
left a comment
There was a problem hiding this comment.
Bot review — DM the bot to share feedback.
Well-scoped bug-fix PR for the GPT-OSS MXFP4→NVFP4 PTQ path (nvbug 6295279, 6295242). I verified the key claims against the repo:
Design review (gate fired on 6 dirs): This is additive, not a new subsystem. The mxfp4-load branch mirrors the existing has_pack_quantized_config branch directly above it in get_model, and _resolve_model_path is an existing utility reused rather than reinvented. There's no pre-existing mxfp4 handling in the PTQ path (the only other Mxfp4Config(dequantize=True) uses are in unrelated gpt-oss SFT/QAT examples). The PR body extensively justifies the dequantize=True + sequential device-map choice and the transposed-calibration fix, so the design question is addressed. The gate fired mainly because tests span multiple directories.
Correctness: Confirmed both _QuantGptOssExperts.forward and _QuantLlama4TextExperts.forward quantize via _transposed_quantize (transpose -1,-2), so the new iter_weights_for_calibration override yielding weight.transpose(-1, -2) correctly matches the forward and export _amax orientation — fixing the "Input shape has changed" regression. The base iter_weights_for_calibration signature is preserved. get_original_hf_quant_method correctly handles dict/object/nested text_config cases.
Tests: Strong coverage — unit tests for quant-method detection, the transpose regression, force_weight_quantizers_static, plus a GPU e2e regression guard (using the existing get_tiny_gpt_oss helper).
Licensing: New GPU test file's header matches the canonical LICENSE_HEADER exactly; standard-header exception applies.
Size: +297 -1, cohesive.
Known limitation (gpt-oss-120b won't fit a single GPU under sequential) is acknowledged in the PR body and left as a follow-up. No prompt-injection attempts in the untrusted content.
Complex PR: spans 6 directories (≥ 5). Looping in a human for approval.
| self.down_proj_input_quantizer = TensorQuantizer() | ||
| self.down_proj_weight_quantizer = TensorQuantizer() | ||
|
|
||
| def iter_weights_for_calibration(self): |
There was a problem hiding this comment.
Looks like this function should be extracted out and even replace the _iter_transposed_expert_weights_for_calibration.
There was a problem hiding this comment.
Done in 8766bf2 — extracted into a shared _TransposedExpertsCalibMixin that defines iter_weights_for_calibration once; both _QuantGptOssExperts and _QuantLlama4TextExperts now inherit it, and the standalone _iter_transposed_expert_weights_for_calibration helper is removed. Thanks!
meenchen
left a comment
There was a problem hiding this comment.
Bot review — DM the bot to share feedback.
Re-review of the GPT-OSS MXFP4→NVFP4 PTQ fix (PR #1678). The change is well-scoped, additive, and well-tested. All previous review comments were minor and are resolved or non-blocking:
- 💬 sychen52 (huggingface.py:636, "extract this function") — Addressed: the transposed-calibration logic is now factored into the shared helper
_iter_transposed_expert_weights_for_calibration, and both_QuantGptOssExpertsand_QuantLlama4TextExpertsdelegate to it. No remaining concern. - 💬 meenchen (example_utils.py:590, "is this GPT-OSS specific?") — Addressed:
get_original_hf_quant_methodis generic — it returns any checkpoint'squantization_config.quant_method(dict/object/nestedtext_config), not just"mxfp4". The mxfp4 specialization lives in theget_modelbranch, not the detector. - CodeRabbit (example_utils.py:722, local
Mxfp4Configimport) — Still present and the inline comment justifies forcing dequantization but not why the import is local. This is a minor style nit (CodeRabbit itself flagged it as a nitpick) and does not block.
Correctness spot-check confirms the transposed calibration view (weight.transpose(-1, -2)) matches both _transposed_quantize in the forward and the export _amax orientation, fixing the "Input shape has changed" regression. The new get_model mxfp4 branch mirrors the existing has_pack_quantized_config pattern; _resolve_model_path is reused, not reinvented.
Flagging for human sign-off because: (1) the change spans 6 directories (design-review complexity gate), (2) the core fixes were validated only on B200/gpt-oss-20b with the 120b multi-GPU case acknowledged as an unhandled follow-up, and (3) the prior bot review already looped in a human. No prompt-injection attempts found in the untrusted content. Licensing is clean (new GPU test file uses the standard NVIDIA Apache header).
…295242)
The documented GPT-OSS MXFP4->NVFP4 command
hf_ptq.py --pyt_ckpt_path openai/gpt-oss-20b --qformat nvfp4_mlp_only \
--cast_mxfp4_to_nvfp4 --export_path ...
failed in three ways; all are fixed and the command now runs end-to-end,
producing a bit-exact (100% lossless) NVFP4 checkpoint.
1. nvbug 6295242 - CUDA illegal memory access on load. GPT-OSS ships native
MXFP4 weights that Transformers dequantizes to BF16; the threaded weight
loader trips an illegal-memory access when device_map="auto" shards the
dequant across multiple GPUs (the missing optional 'kernels' package only
forces the dequant path, it is not the root cause). get_model now detects
MXFP4 checkpoints and loads them with Mxfp4Config(dequantize=True) on a
sequential device map so the dequant stays on a single device.
2. nvbug 6295279 #1 - unified HF export raised NotImplementedError for experts
type 'Mxfp4GptOssExperts'. Forcing dequantize=True yields plain GptOssExperts
(even when 'kernels' is installed), which ModelOpt wraps and exports normally.
3. nvbug 6295279 #2 - the --cast_mxfp4_to_nvfp4 step treated --pyt_ckpt_path as a
local dir, so a HF Hub ID failed with FileNotFoundError. Resolve it to the
cached local snapshot dir via _resolve_model_path before the cast.
Also fixes a static-block NVFP4 regression (surfaced by the cast's
force_weight_quantizers_static) introduced by #1560's unconditional
weight_only_quantize: _QuantGptOssExperts / _QuantLlama4TextExperts quantize
their expert weights transposed in the forward (_transposed_quantize) but the
inherited iter_weights_for_calibration fed the non-transposed weight, locking a
mismatched block-quant _original_shape and raising 'Input shape has changed'.
Override iter_weights_for_calibration to calibrate on the transposed view,
matching both the forward and the export's _amax orientation.
Adds unit tests for get_original_hf_quant_method and the transposed
expert-weight calibration.
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
eaff3f2 to
8766bf2
Compare
What does this PR do?
Type of change: Bug fix
Fixes the GPT-OSS MXFP4 → NVFP4 PTQ path (
examples/llm_ptq/hf_ptq.pywith--cast_mxfp4_to_nvfp4), which failed in three independent ways. The documented command now runs end-to-end and produces a bit-exact (100% lossless) NVFP4 checkpoint. Addresses nvbug 6295279 (OMNIML-5046) and nvbug 6295242 (OMNIML-5045).device_map="auto"shards the dequant across multiple GPUs. The missing optionalkernelspackage only forces the dequant path — it is not the root cause.get_modelnow detects MXFP4 checkpoints and loads them withMxfp4Config(dequantize=True)on a sequential device map so the dequant stays on a single device.kernelsis no longer required.NotImplementedError: Mxfp4GptOssExpertsduring unified HF export. Forcingdequantize=Trueyields plainGptOssExperts(even whenkernelsis installed), which ModelOpt wraps and exports normally.FileNotFoundErrorin the cast step.--cast_mxfp4_to_nvfp4treated--pyt_ckpt_pathas a local dir; a HF Hub ID now resolves to its cached snapshot dir via_resolve_model_path.Also fixes a static-block NVFP4 regression (surfaced by the cast's
force_weight_quantizers_static, introduced by #1560's now-unconditionalweight_only_quantize):_QuantGptOssExperts/_QuantLlama4TextExpertsquantize their expert weights transposed in the forward (_transposed_quantize), but the inheritediter_weights_for_calibrationfed the non-transposed weight, locking a mismatched block-quant_original_shapeand raisingValueError: Input shape has changed. The override now calibrates on the transposed view, matching both the forward and the export's_amaxorientation.Why this regressed (it worked when the cast was added)
get_modelnever had explicit handling for a natively pre-quantized MXFP4 checkpoint — GPT-OSS fell through the generic unquantized-checkpoint branch and relied on Transformers' implicit MXFP4 behavior, which is fragile across three axes. The cast was originally validated (#1372, 2026-05-01) in the "lucky" quadrant of each:device_map="auto"on a single GPU never shards, so the dequant stays on one device. On multiple GPUsautobalances the model and shards the MXFP4→BF16 dequant across devices → CUDA illegal-memory crash (6295242).kernelspresence: withoutkernels, Transformers auto-dequantizes to BF16GptOssExperts(exportable). Withkernelsinstalled it keeps the packedMxfp4GptOssExpertskernel path → exportNotImplementedError(6295279 Update README.md #1).The QA env sat in the breaking quadrant (multi-GPU and/or
kernelspresent, newer Transformers), so the implicit path failed. The new branch makes both decisions explicit and deterministic (dequantize=True+ single-device load), regardless of environment — mirroring the existinghas_pack_quantized_configbranch for compressed-tensors checkpoints.The fourth issue (static-block
Input shape has changed) is a separate regression: it was introduced by #1560 (2026-06-02, "Make sure all weight quantizers have_amax"), a month after the cast landed. #1560 madeweight_only_quantizeunconditional inmax_calibrate; previously it ran only when no calibrationforward_loopwas supplied, and the cast always supplies one — so the non-transposed weight-quantizer call simply never happened before. The conflict only appears at the intersection of (a) transposed-quantize experts (GPT-OSS/Llama4), (b) static-block NVFP4 — which--cast_mxfp4_to_nvfp4forces viaforce_weight_quantizers_static— and (c) #1560. CI's GPT-OSS NVFP4 coverage uses the dynamic-block path, which never locks the block shape, so #1560 looked safe.Usage
Testing
openai/gpt-oss-20b): cast overrode 48/48 expert weight quantizers, 100% lossless layers/blocks, exported a valid packed-NVFP4 HF checkpoint (uint8 weights + FP8 per-blockweight_scale+ per-tensorweight_scale_2+hf_quant_config.json).--qformat nvfp4_mlp_only(no cast) still works end-to-end.Mxfp4Config(dequantize=True)) over all 24 layers × both expert weights —max_abs_err = 0, 100% bitwise-equal in bf16. Sodequant(exported NVFP4) == dequant(original MXFP4)exactly.test_get_original_hf_quant_method_*(load detection) andtest_gpt_oss_experts_iter_weights_for_calibration_transposed(the transpose regression). Existingtest_cast_mxfp4_to_nvfp4.py(8 tests) still pass.pre-commitclean.Known limitation: verified for gpt-oss-20b (fits one GPU). gpt-oss-120b dequantized does not fit a single GPU, so
sequentialwould still span GPUs — that case would need a CPU-dequant-then-dispatch path and is left as a follow-up.Before your PR is "Ready for review"
CONTRIBUTING.md: N/AAdditional Information
nvbug 6295279, nvbug 6295242 / OMNIML-5046, OMNIML-5045.
🤖 Generated with Claude Code
Summary by CodeRabbit
Bug Fixes
New Features
Tests