Conversation


@danisereb danisereb commented Jan 5, 2026

What does this PR do?

Type of change: new feature

Overview: Add support for MXFP8 PTQ, enabling MXFP8 hardware acceleration during inference on Blackwell GPUs.

Usage

export MODEL_PATH=/my_home/hf_models/nvidia/OpenMath2-Llama3.1-8B
export OUTPUT_PATH=/my_home/hf_models/nvidia/OpenMath2-Llama3.1-8B-MXFP8
mkdir -p $OUTPUT_PATH

python examples/llm_ptq/hf_ptq.py \
--export_fmt hf \
--dataset cnn_dailymail \
--pyt_ckpt_path $MODEL_PATH \
--export_path $OUTPUT_PATH \
--qformat mxfp8

The hf_quant_config.json of the output checkpoint:

{
    "producer": {
        "name": "modelopt",
        "version": "0.41.0.dev50+g7a796a875"
    },
    "quantization": {
        "quant_algo": "MXFP8",
        "kv_cache_quant_algo": "FP8",
        "group_size": 32,
        "exclude_modules": [
            "lm_head"
        ]
    }
}

And config.json (only the quantization_config):

...
    "quantization_config": {
        "ignore": [
            "lm_head"
        ],
        "quant_algo": "MXFP8",
        "kv_cache_scheme": {
            "dynamic": false,
            "num_bits": 8,
            "type": "float"
        },
        "producer": {
            "name": "modelopt",
            "version": "0.41.0.dev50+g7a796a875"
        },
        "quant_method": "modelopt"
    }
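
For anyone reproducing this, a quick sanity check (not part of the PR; the paths follow the Usage section and the keys come from the dumps above) that the exported checkpoint carries the expected MXFP8 settings:

# Verify the exported checkpoint's quantization metadata (paths per Usage above).
import json
import os

output_path = os.environ["OUTPUT_PATH"]

with open(os.path.join(output_path, "hf_quant_config.json")) as f:
    quant_cfg = json.load(f)["quantization"]
assert quant_cfg["quant_algo"] == "MXFP8"
assert quant_cfg["group_size"] == 32  # MX block size
assert quant_cfg["kv_cache_quant_algo"] == "FP8"

with open(os.path.join(output_path, "config.json")) as f:
    hf_cfg = json.load(f)["quantization_config"]
assert hf_cfg["quant_method"] == "modelopt"
assert "lm_head" in hf_cfg["ignore"]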

Testing

Used hf_ptq.py to quantize the model nvidia/OpenMath2-Llama3.1-8B (available on Hugging Face); see the example command above.

Checked that the generated MXFP8 checkpoint can be loaded with vLLM (this required changes in vLLM that are not yet merged to its main branch).

Added tests for MXFP8QTensor in tests/gpu/torch/quantization/test_qtensor_cuda.py.
Added "mxfp8" to tests/examples/llm_ptq/test_llm_ptq.py.

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes/No
  • Did you write any new necessary tests?: Yes/No
  • Did you add or update any necessary documentation?: Yes/No
  • Did you update Changelog?: Yes/No

Additional Information


copy-pr-bot bot commented Jan 5, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@danisereb danisereb marked this pull request as ready for review January 6, 2026 12:31
@danisereb danisereb requested review from a team as code owners January 6, 2026 12:31
@danisereb danisereb requested review from mxinO and sugunav14 January 6, 2026 12:31
@sugunav14 (Contributor) commented:

Could you also add the corresponding unit tests for impacted functions in quant_utils.py here? Thanks!

# Convert E8M0 biased exponent to scale factor: scale = 2^(127 - exponent)
scale_factor = torch.exp2(127 - e8m0_scale.float())

# NOTE: vLLM/flashinfer may require this behavior:
A collaborator commented:

is this required? Should we assert e8m0_scale != 0?

@danisereb (Author) replied:

As far as I understand, it doesn't align with the MXFP8 specification, but one of my teammates said it worked for him in a certain case, so I wanted to leave some documentation here for future reference.
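
For reference, a small standalone sketch of the E8M0 handling as I read the OCP MX spec (the helper name is mine, not the one in quant_utils.py): the biased exponent decodes to 2^(e - 127), 0xFF encodes NaN, and there is no zero encoding, so e8m0 == 0 is a legal scale of 2^-127 rather than something to assert against. Note that 2^(127 - e), as in the snippet above, is simply the reciprocal of this decode.

# E8M0 decode per the OCP Microscaling (MX) spec; helper name is illustrative.
import torch

E8M0_BIAS = 127
E8M0_NAN = 255  # all-ones exponent encodes NaN; there is no encoding for zero

def e8m0_to_scale(e8m0: torch.Tensor) -> torch.Tensor:
    scale = torch.exp2(e8m0.float() - E8M0_BIAS)
    return torch.where(e8m0 == E8M0_NAN, torch.full_like(scale, float("nan")), scale)

# e8m0 == 0 is valid (2**-127), so only the NaN encoding needs special handling.
print(e8m0_to_scale(torch.tensor([0, 127, 254, 255], dtype=torch.uint8)))
# -> [2**-127, 1.0, 2**127, nan]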

# sm89
PTQCommand(quant="fp8", min_sm=89),
PTQCommand(quant="fp8", kv_cache_quant="none", min_sm=89),
# sm100
PTQCommand(quant="mxfp8", min_sm=100),
A collaborator commented:

Does Hopper support mxfp8?

@danisereb (Author) commented Jan 7, 2026:

Blackwell has hardware acceleration for MXFP8; Hopper does not.
https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/fp8_primer.html#MXFP8-and-block-scaling

From that page: "NVIDIA Blackwell architecture introduced support for a new variant of the FP8 format: MXFP8."

See what we already have for NVFP4 (the line just below the "mxfp8" one):

PTQCommand(quant="nvfp4", min_sm=100),


codecov bot commented Jan 6, 2026

Codecov Report

❌ Patch coverage is 21.50538% with 73 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.42%. Comparing base (d541324) to head (d5fced8).
⚠️ Report is 3 commits behind head on main.

Files with missing lines                                Patch %   Lines
...odelopt/torch/quantization/qtensor/mxfp8_tensor.py    21.59%   69 Missing ⚠️
.../torch/quantization/nn/modules/tensor_quantizer.py     0.00%    4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #736      +/-   ##
==========================================
- Coverage   74.69%   74.42%   -0.27%     
==========================================
  Files         192      193       +1     
  Lines       18948    19043      +95     
==========================================
+ Hits        14153    14173      +20     
- Misses       4795     4870      +75     


@meenchen (Contributor) left a comment:

LGTM

assert dequant_tensor.shape == input_shape, (
f"Expected dequantized shape {input_shape}, got {dequant_tensor.shape}"
)
assert torch.allclose(dequant_tensor, test_tensor, rtol=5e-2, atol=5e-2), (
A contributor commented:

We can also compare with the fake quant here.
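
For reference, a self-contained sketch of that comparison with stand-in helpers (block size 32, E4M3 elements, a power-of-two per-block scale chosen with ceil for simplicity; this is not the MXFP8QTensor API): the quantize-then-dequantize result should match the fake-quant output exactly, which is a much tighter check than allclose against the original tensor.

# Stand-in helpers (not the ModelOpt API) to illustrate comparing the real-quant
# round trip against fake quant with tight tolerances.
import torch

BLOCK = 32
E4M3_MAX = 448.0

def _block_scale(blocks: torch.Tensor) -> torch.Tensor:
    # Power-of-two per-block scale; ceil keeps every scaled value inside E4M3 range.
    amax = blocks.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12)
    return torch.exp2(torch.ceil(torch.log2(amax / E4M3_MAX)))

def quantize(x: torch.Tensor):
    blocks = x.reshape(-1, BLOCK).float()
    scale = _block_scale(blocks)
    return (blocks / scale).to(torch.float8_e4m3fn), scale

def dequantize(q: torch.Tensor, scale: torch.Tensor, shape) -> torch.Tensor:
    return (q.float() * scale).reshape(shape)

def fake_quant(x: torch.Tensor) -> torch.Tensor:
    q, scale = quantize(x)
    return dequantize(q, scale, x.shape)

torch.manual_seed(0)
test_tensor = torch.rand(4, 64) * 2 - 1  # values in [-1, 1)
q, scale = quantize(test_tensor)
dequant_tensor = dequantize(q, scale, test_tensor.shape)

# Loose check against the unquantized input (what the test already does).
assert torch.allclose(dequant_tensor, test_tensor, rtol=5e-2, atol=5e-2)
# Tight check against fake quant: both paths must round identically.
assert torch.equal(dequant_tensor, fake_quant(test_tensor))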

"test_input",
[
# FP8 E4M3 boundary test values (max is 448, various powers of 2)
torch.tensor(
A contributor commented:

The formatting looks odd; we can turn off auto-formatting for these tensors and define them at the top of the file.
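
One way to do that (a sketch; the constant name and the exact values are illustrative, with 448 and the powers of two taken from the existing comment) is to hoist the tensors to module level and fence them from the auto-formatter:

# Module-level test inputs fenced off from black/ruff auto-formatting so the
# boundary values stay on readable lines; names and values are illustrative.
import pytest
import torch

# fmt: off
FP8_E4M3_BOUNDARY_INPUTS = [
    torch.tensor([448.0, -448.0, 256.0, 128.0, 64.0, 32.0, 16.0, 8.0]),
    torch.tensor([4.0, 2.0, 1.0, 0.5, 0.25, 0.125, 0.0625, 0.03125]),
]
# fmt: on

@pytest.mark.parametrize("test_input", FP8_E4M3_BOUNDARY_INPUTS)
def test_mxfp8_boundary_values(test_input):
    # Placeholder body; the real assertions live in test_qtensor_cuda.py.
    assert test_input.abs().max() <= 448.0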
