Implement 4over6 NVFP4 recipe #2972

Open
zianglih wants to merge 60 commits into NVIDIA:main from zianglih:4over6

Conversation

@zianglih
Contributor

@zianglih zianglih commented May 9, 2026

Description

@HumansAnd

Implement 4over6 nvfp4 from:

FlashInfer PR:

Enable per-block map-to-4 versus map-to-6 candidate selection for 1D/2D NVFP4 quantization in the NVFP4BlockScaling recipe. This mode currently requires RHT and stochastic rounding to be disabled. Both the original per-tensor scaling and the row-scaled NVFP4 introduced by #2931 are supported.

This PR also fixes a few minor bugs for row-scaled NVFP4 from #2931.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Adds scoped NVFP4 4over6 control through NVTE_NVFP4_4OVER6=weights|activations|all, with unset preserving existing behavior, and threads the selected scope through recipes, quantizers, tensor metadata, split quantization, single-tensor quantization, and C++ tensor/config APIs.
  • Implements 1D & 2D NVFP4 4over6 quantization in the existing NVFP4 CUDA paths by comparing TE-style map-to-4 and map-to-6 FP4 candidates with the original 4over6 MSE rule, choosing map-to-6 on ties, honoring NVTE_USE_FAST_MATH, and rejecting unsupported combinations such as stochastic rounding, grouped tensors, and RHT.
  • Updates dequantization and NVFP4 GEMM scaling to respect per-tensor 4over6 metadata, using 256-based normalization for 4over6 tensors and 448-based normalization for regular NVFP4 tensors without requiring callers to do hidden rescaling.
  • Extends the Python reference implementation to mirror the intended ground truth, meaning TE-style candidate quantization plus original 4over6 MSE/compare logic, and uses this reference for bitwise exact tests where fast math is disabled.
  • Expands C++ and Python coverage across exact NVFP4 quantization, GEMM, dequantization, recipe scope resolution, quantized tensor handling, numerics, sanity, CUDA graph, torch compile, CPU offload, fusible ops, and backward override paths, while documenting the new environment variable and known unsupported modes.
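
The candidate selection described in the list above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the Transformer Engine kernel or its reference implementation: the helper names (`round_to_fp4`, `quantize_block`, `select_4over6`) are hypothetical, the block scale is kept in full precision instead of being rounded to E4M3, the per-tensor scale is omitted, and rounding ties go to the smaller magnitude rather than following hardware rounding rules.

```python
# Non-negative magnitudes representable in FP4 E2M1.
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def round_to_fp4(x):
    """Round x to the nearest E2M1 value (saturating at 6, sign preserved)."""
    mag = min(FP4_GRID, key=lambda g: abs(abs(x) - g))
    return -mag if x < 0 else mag


def quantize_block(block, target_max):
    """Scale the block so its amax maps to target_max, round, dequantize.

    Returns the dequantized values and the summed squared error.
    """
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return list(block), 0.0
    scale = amax / target_max  # simplification: not rounded to E4M3
    deq = [round_to_fp4(v / scale) * scale for v in block]
    err = sum((d - v) ** 2 for d, v in zip(deq, block))
    return deq, err


def select_4over6(block):
    """Pick the candidate with lower squared error; ties go to map-to-6."""
    deq6, err6 = quantize_block(block, 6.0)
    deq4, err4 = quantize_block(block, 4.0)
    if err4 < err6:
        return "map-to-4", deq4
    return "map-to-6", deq6


# 4, 3, 2, 1 land exactly on the FP4 grid when amax maps to 4,
# but 3 has no exact image when amax maps to 6.
choice, values = select_4over6([4.0, 3.0, 2.0, 1.0])
print(choice, values)
```

In this sketch a block whose values already sit on the FP4 grid under one mapping selects that mapping, and an all-zero block falls through to map-to-6 because the comparison is strict.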

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@zianglih zianglih marked this pull request as draft May 9, 2026 03:50
@zianglih zianglih changed the title Implement 4over6 nvfp4 Implement 4over6 nvfp4 recipe May 9, 2026
@zianglih zianglih changed the title Implement 4over6 nvfp4 recipe Implement 4over6 NVFP4 recipe May 9, 2026
@greptile-apps
Contributor

greptile-apps Bot commented May 9, 2026

Greptile Summary

This PR implements the 4over6 NVFP4 quantization algorithm in TransformerEngine, enabling per-block map-to-4 vs map-to-6 candidate selection to improve NVFP4 quantization accuracy. The implementation spans CUDA kernels, C++ tensor/config APIs, Python recipe infrastructure, and a reference implementation mirroring the CUDA path bitwise.

  • Adds a new quantize_4over6_nvfp4.cuh kernel (668 lines) with pipelined shared-memory staging, dual-candidate MSE/MAE error comparison, warp-level reduction for 1D/2D quantization, and strict/fast-math dispatching; threads nvfp4_e4m3_max, nvfp4_4over6_mode, and nvfp4_4over6_err_use_fast_math through all C++ tensor and config APIs.
  • Propagates 4over6 metadata (nvfp4_use_4over6, nvfp4_e4m3_max) through Python tensors, tensor storage, serialization, grouped tensors, and comm-gemm overlap; updates per-tensor GEMM scale computation to use the per-tensor E4M3 max instead of a hardcoded 448.
  • Adds a new NVTE_NVFP4_4OVER6 env-var scope system and recipe fields nvfp4_4over6, nvfp4_4over6_e4m3_use_256, and nvfp4_4over6_err_mode, with documentation and broad test coverage.

Confidence Score: 5/5

The change is safe to merge. Core quantization math, warp reductions, tensor metadata propagation, and dequant/GEMM scale adjustments are all implemented correctly and consistently across CUDA, C++, and Python layers.

The 4over6 kernel logic (dual-candidate scale computation, warp-level error reduction for both 1D and 2D modes, tie-breaking toward map-to-6) correctly mirrors the reference paper and the Python bitwise-exact reference implementation. The E4M3-max metadata flows consistently through tensor creation, serialization, comm-gemm chunking, dequantization, and GEMM scale computation. Guard checks for unsupported combinations (stochastic rounding, grouped tensors, RHT) are placed at multiple layers. The two findings are minor UX/documentation gaps that do not affect correctness on any supported configuration.

No files require special attention. The two flagged items in transformer_engine/common/recipe/__init__.py are documentation/validation style improvements that do not affect runtime correctness.

Important Files Changed

| Filename | Overview |
|---|---|
| transformer_engine/common/cast/nvfp4/quantize_4over6_nvfp4.cuh | New 668-line CUDA kernel implementing 4over6 quantization with pipelined shared-memory loads, dual-candidate error computation, warp-level reduce, and compile-time template dispatch across mode/fast-math/E4M3-max axes. Core logic and warp reduction masks are correct. |
| transformer_engine/common/recipe/__init__.py | Adds nvfp4_4over6, nvfp4_4over6_e4m3_use_256, and nvfp4_4over6_err_mode recipe fields with env-var defaults and __post_init__ validation. disable_rht/disable_stochastic_rounding enforcement exists only for "activations"/"all" scopes; it is omitted for "weights" by design since weight quantizers don't use RHT or SR. |
| transformer_engine/common/cast/dispatch/quantize.cuh | Routes 4over6-mode quantization to the new kernel in both forward and backward helpers before the existing optimized-kernel path. Guard checks for stochastic rounding and E4M3 max are consistent. |
| transformer_engine/common/cast/nvfp4/dequantize_nvfp4.cuh | Moves row_scaled_nvfp4 from runtime param to compile-time template parameter and adds E4M3_MAX template dispatch for correct factor_inv during dequantization. No issues found. |
| transformer_engine/pytorch/csrc/quantizer.cpp | Reads nvfp4_use_4over6/nvfp4_4over6_mode/nvfp4_e4m3_max from Python quantizer, threads them into tensor construction and quant configs, reads NVTE_NVFP4_4OVER6_ERR_USE_FAST_MATH env var consistently with the split-quantize paths. |
| transformer_engine/pytorch/tensor/nvfp4_tensor.py | Propagates nvfp4_use_4over6 and nvfp4_e4m3_max through __new__, copy, serialize/deserialize (__reduce_ex__/_from_saved_data), view/reshape autograd functions, and all-gather metadata. All propagation paths appear consistent. |
| transformer_engine/pytorch/custom_recipes/quantization_ref_nvfp4.py | Adds _quantize_blockwise_4over6_reference implementing dual-candidate map-to-4/map-to-6 selection in Python; used for bitwise-exact tests. GEMM factor computation now reads per-tensor fp8_max from metadata. Logic mirrors the CUDA path correctly. |
| transformer_engine/common/recipe/nvfp4.cu | compute_nvfp4_per_tensor_scale_kernel now receives fp8_max_A/fp8_max_B as parameters instead of using a hardcoded 448-based constexpr. Correctly handles mixed 4over6/non-4over6 tensor pairs. |
| transformer_engine/pytorch/csrc/extensions/cast.cpp | Guards all three split-quantize paths against 4over6 in unsupported configurations (grouped, RHT, SR), threads nvfp4_4over6_mode and nvfp4_4over6_err_use_fast_math into all quant_config_list entries, and propagates nvfp4_e4m3_max into bulk-allocated tensors. |

Reviews (11): Last reviewed commit: "Minor fix recipe naming" | Re-trigger Greptile

Comment thread transformer_engine/pytorch/csrc/extensions/cast.cpp Outdated
Comment thread transformer_engine/common/transpose/quantize_transpose_vector_blockwise_fp4.cu Outdated
Comment thread transformer_engine/common/recipe/__init__.py
Comment thread tests/pytorch/test_sanity.py Outdated
@zianglih
Contributor Author

zianglih commented May 11, 2026

Functionality has been verified by internal RL experiments.
We may want to allow separate 4over6 config for weights and activations, maybe NVTE_NVFP4_ENABLE_4OVER6=weights|activations|all.

@ptrendx ptrendx requested a review from negvet May 11, 2026 17:12
@ptrendx ptrendx added the community-contribution and fp4 labels May 11, 2026
@zianglih
Contributor Author

Need to rebase.

@zianglih zianglih marked this pull request as draft May 11, 2026 21:17
@zianglih zianglih marked this pull request as ready for review May 11, 2026 22:36
* its values are populated during quantization.
*/
kNVTERowScaledNVFP4 = 8,
kNVTENVFP44Over6 = 9, /*!< Whether an NVFP4 tensor uses 4over6 scaling */
Collaborator

@timmoon10 timmoon10 May 11, 2026

We are specifying this redundantly in NVTETensor and NVTEQuantizationConfig. If this option can be isolated to quantization, then we should not add clutter to the tensor. If the option is needed for downstream consumers (dequantization, GEMM), then it should be treated as part of the tensor data. I'm not especially familiar, but 4over6 seems like it should be specific to quantization.

Contributor Author

4over6 changes the decode convention from 1 / (6 * 448) to 1 / (6 * 256). Therefore, for our current representation 4over6 is part of the tensor data contract, not just a quantization option.
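
To make the point about the decode convention concrete, here is a small worked example. It assumes the per-tensor decode factor has the form amax / (6 * e4m3_max), as implied by the 1 / (6 * 448) and 1 / (6 * 256) conventions mentioned above; the function name `global_decode_scale` and the sample amax are hypothetical and chosen so the arithmetic is exact.

```python
FP4_MAX = 6.0  # largest FP4 E2M1 magnitude


def global_decode_scale(tensor_amax, e4m3_max):
    """Per-tensor decode factor under the amax / (6 * e4m3_max) convention."""
    return tensor_amax / (FP4_MAX * e4m3_max)


amax = 1344.0
regular = global_decode_scale(amax, 448.0)        # 1344 / 2688
four_over_six = global_decode_scale(amax, 256.0)  # 1344 / 1536

# A consumer that assumes 448 for a tensor encoded with 256 would be off
# by this constant factor on every element.
mismatch = four_over_six / regular
print(regular, four_over_six, mismatch)
```

Because the mismatch is a constant factor of 448 / 256, a dequantizer or GEMM that does not know which constant was used at quantization time cannot recover the original magnitudes, which is why the value has to travel with the tensor.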

  using namespace detail;
- constexpr float fp8_max = TypeExtrema<fp8e4m3>::max; // 448.0f;
  constexpr float fp4_max = TypeExtrema<fp4e2m1>::max; // 6.0f;
+ constexpr float fp8_max = USE_4OVER6 ? 256.0f : TypeExtrema<fp8e4m3>::max; // 448.0f;
Collaborator

How much benefit does changing the FP8 scale have on convergence? If we don't see a clear benefit, then it would be nicer to use the same scale for 4over6 and non-4over6. That way we can keep this logic confined to quantization, and downstream consumers are completely unaffected.

If there is an impact on training quality, we should still consider disentangling the FP8 scaling from 4over6. I don't see why other NVFP4 recipes might not benefit from tweaking the scaling.

Contributor Author

From the original paper:

Finally, we make one modification to the computation of the tensor scale α (Equation 1) when quantizing to NVFP4 with 4/6. When MFP4 × MFP8 is used to compute the tensor scale, it ensures that all quantized values will be less than 6 × 448. However, this makes it impossible to select a scale of 4 for the blocks that contain a tensor’s largest values, because the block’s scale would need to be 448 × 6/4 = 672, which would overflow since 448 is the maximum value that can be represented by E4M3. As a result, when computing the tensor scale, we replace MFP8 to 256 in Equation 1, since 256 is the largest E4M3 that can be multiplied by 6/4 and represented without error in E4M3, as 384.

Also:

In Section 3.1, we propose calculating the FP32 global tensor scale using 256 as the maximum FP8 E4M3 value rather than the default of 448, as this allows blocks with a tensor’s largest value to have the option to have a largest FP4 value of 4. In Figure 6, we find that this provides a marginal benefit over using the standard tensor scale calculation. Even though this adjustment only affects a small number of large values, this performance gain may come from the fact that larger activation values can have an outsize impact on model performance. This adjustment is incorporated into the remaining experiments in this section.

Collaborator

I'm not sure whether there are internal or external studies about the convergence, but this is required to make it work. We need the largest value below 448/1.5 such that both the value itself and its product with 1.5 are exactly representable in E4M3. This helps avoid quantization noise on both the map-to-4 and map-to-6 paths.

Collaborator

We did find the use of 256 to calculate the second level scaling factor helped convergence vs 448, but only slightly.

It's possible that the premise of the paper's argument (preventing saturation when map-to-4 scaling effectively multiplies the block decode scale by 1.5) is sound, but that a value larger than 256 can achieve this, and that exactly representing the block containing the global amax under both scalings is not worth the extra range loss.

Contributor Author

Let me make the 256 scaling a separate env var that is disabled by default.

Contributor Author

448, 320, 288, 256 are all potential candidates for map-to-6:

  • 448: effectively disable map-to-4 option above 256, preserve range
  • 320, 288: map-to-4 uses 448, no precise 1.5x
  • 256: map-to-4 uses 384, precise 1.5x

For now, let me refactor the interface to NVTE_NVFP4_4OVER6_E4M3="448"|"256", defaulting to "448", so that the value dispatches to a numeric template parameter in the C++ code instead of a boolean toggle. Others can add support for more values or make it more generic (for example, by parsing the env var digits directly) in the future.
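
The exactness argument behind these candidates can be checked by enumerating the E4M3 grid. The sketch below assumes the FP8 E4M3 variant discussed in this thread (bias 7, 3 mantissa bits, maximum 448, with the all-ones encoding reserved for NaN); the helper names are hypothetical and the check only tests exact representability, not how an inexact product would be rounded.

```python
def e4m3_values():
    """All finite non-negative values of FP8 E4M3 with maximum 448."""
    vals = set()
    for exp in range(16):
        for man in range(8):
            if exp == 15 and man == 7:
                continue  # reserved for NaN in this variant
            if exp == 0:
                vals.add(man / 8 * 2.0 ** -6)  # subnormals
            else:
                vals.add((1 + man / 8) * 2.0 ** (exp - 7))
    return vals


E4M3 = e4m3_values()


def map_to_4_scale_is_exact(candidate):
    """True if both the candidate and 1.5x the candidate are E4M3 values."""
    return candidate in E4M3 and candidate * 1.5 in E4M3


report = {c: map_to_4_scale_is_exact(c) for c in (448.0, 320.0, 288.0, 256.0)}
largest_exact = max(v for v in E4M3 if v * 1.5 in E4M3)
print(report, largest_exact)
```

Under these assumptions only 256 among the four candidates has an exactly representable 1.5x counterpart (384), and it is also the largest E4M3 value with that property, which matches the paper excerpt quoted earlier in this thread.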

Contributor Author

NVTE_NVFP4_4OVER6_E4M3_USE_256=weights|activations|all is a cleaner pattern and allows separate configuration.
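
As a rough illustration of how a scoped setting of this shape could be resolved per tensor role, consider the sketch below. The environment variable name comes from this thread; everything else (the helper names, the treatment of empty or invalid values, and the decision to ignore interactions with other settings) is an assumption and may differ from what the PR actually does.

```python
VALID_SCOPES = ("weights", "activations", "all")


def scope_applies(env_value, role):
    """Return True if a 'weights|activations|all' setting covers `role`."""
    if env_value is None or env_value == "":
        return False  # assumption: unset keeps the existing behavior
    if env_value not in VALID_SCOPES:
        raise ValueError(f"expected one of {VALID_SCOPES}, got {env_value!r}")
    return env_value == "all" or env_value == role


def resolve_e4m3_max(env, role):
    """Pick 256 or 448 for a role from a mapping such as os.environ."""
    use_256 = scope_applies(env.get("NVTE_NVFP4_4OVER6_E4M3_USE_256"), role)
    return 256.0 if use_256 else 448.0


env = {"NVTE_NVFP4_4OVER6_E4M3_USE_256": "weights"}
print(resolve_e4m3_max(env, "weights"), resolve_e4m3_max(env, "activations"))
```

Passing the environment as a mapping keeps the helper easy to test; `os.environ` can be passed directly in real use.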

Contributor Author

In our RL experiments, we do observe that 256 leads to less mismatch than 448.

Comment thread tests/pytorch/utils.py Outdated
Comment thread transformer_engine/common/cast/dispatch/quantize.cuh Outdated
Collaborator

This test is okay, but it would provide much more confidence if the NVFP4 quantization tests compared against a CPU reference impl.

Contributor Author

Extended tests/cpp/operator/test_cast_nvfp4_transpose.cu coverage in 3bb42b1.

@zianglih zianglih marked this pull request as draft May 12, 2026 02:01
@zianglih zianglih marked this pull request as ready for review May 12, 2026 06:45
@zianglih zianglih requested a review from timmoon10 May 12, 2026 06:47
@zianglih zianglih marked this pull request as draft May 12, 2026 09:03
@zianglih zianglih marked this pull request as ready for review May 12, 2026 10:10
Comment thread transformer_engine/common/recipe/__init__.py Outdated
@Oleg-Goncharov Oleg-Goncharov self-requested a review May 12, 2026 16:37
zianglih added 9 commits May 13, 2026 00:36
Signed-off-by: Ziang Li <ziangli@umich.edu>
Signed-off-by: Ziang Li <ziangli@umich.edu>
Signed-off-by: Ziang Li <ziangli@umich.edu>
Signed-off-by: Ziang Li <ziangli@umich.edu>
Signed-off-by: Ziang Li <ziangli@umich.edu>
Signed-off-by: Ziang Li <ziangli@umich.edu>
Signed-off-by: Ziang Li <ziangli@umich.edu>
Signed-off-by: Ziang Li <ziangli@umich.edu>
Signed-off-by: Ziang Li <ziangli@umich.edu>
Signed-off-by: Ziang Li <ziangli@umich.edu>
@negvet
Collaborator

negvet commented May 13, 2026

What is the e2e step time increase with 4/6 on some typical workload?

zianglih added 2 commits May 13, 2026 02:36
Signed-off-by: Ziang Li <ziangli@umich.edu>
Signed-off-by: Ziang Li <ziangli@umich.edu>
@zianglih
Contributor Author

zianglih commented May 13, 2026

Major changes from last time:

  • Use a standalone quantization kernel instead of folding 4over6 into the existing code. 4over6 quantization is heavily FP32-compute-bound (see #2972 (comment) and #2972 (comment)), and the latency-hiding techniques in TE's original NVFP4 quantization kernels lead to higher register pressure and worse performance. There is not much we can do about the FP32 arithmetic bottleneck without changing the heuristics. Even if we want to optimize performance or heuristics further, I think we should do it in a separate PR and expose the result as new error modes. cc @Oleg-Goncharov @kwyss-nvidia
  • Allow both 448 and 256 configurations. Users can choose between them by setting NVTE_NVFP4_4OVER6_E4M3_USE_256. However, all underlying implementations encode nvfp4_e4m3_max and an E4M3_MAX template parameter instead of a boolean flag, so other values can easily be added in the future. cc @timmoon10 @kwyss-nvidia @negvet
  • Add and default to MAE error mode. cc @negvet
  • For the 4over6 quantize C++ test, we no longer check the map-to-4 vs map-to-6 selection and instead accept a bitwise-exact match with either candidate. This avoids numerics drift across CPU architectures. The Python test still has strict candidate-selection coverage. cc @Oleg-Goncharov

@zianglih zianglih marked this pull request as ready for review May 13, 2026 09:48
@zianglih zianglih requested review from ksivaman and ptrendx as code owners May 13, 2026 09:48
@zianglih
Contributor Author

zianglih commented May 14, 2026

Hi @Oleg-Goncharov,
For our RL config (see env vars below), benchmark_grouped_linear.py shows a 1.28x~1.36x slowdown.

This is usable, especially considering that RL has very long-context attention and other communication overheads. The rollout-side end-to-end overhead is only around 1~2%. We also observe meaningful numerics improvements in the consistency between rollout and training fprop. Since RL is usually rollout-bound and very sensitive to mismatch, 4over6 delivers these improvements at an acceptable training-side performance overhead.

NVTE_NVFP4_ROW_SCALED_ACTIVATION=1 \
NVTE_BACKWARD_OVERRIDE=dequantized \
NVTE_NVFP4_DISABLE_2D_QUANTIZATION=1 \
NVTE_NVFP4_DISABLE_RHT=1 \
NVTE_NVFP4_DISABLE_STOCHASTIC_ROUNDING=1 \
python benchmarks/linear/benchmark_grouped_linear.py --recipe nvfp4
       m     k     n recipe  num_gemms  grouped_fwd_bwd_time_ms
0  16384  7168  2048  nvfp4          4                 1.443936
1  32768  7168  2048  nvfp4          4                 2.489801
2  65536  7168  2048  nvfp4          4                 4.548635
3  98304  7168  2048  nvfp4          4                 6.640535
0  16384  7168  2048  nvfp4          8                 1.836268
1  32768  7168  2048  nvfp4          8                 2.837006
2  65536  7168  2048  nvfp4          8                 4.977518
3  98304  7168  2048  nvfp4          8                 6.967243
NVTE_NVFP4_4OVER6=all \
NVTE_NVFP4_ROW_SCALED_ACTIVATION=1 \
NVTE_BACKWARD_OVERRIDE=dequantized \
NVTE_NVFP4_DISABLE_2D_QUANTIZATION=1 \
NVTE_NVFP4_DISABLE_RHT=1 \
NVTE_NVFP4_DISABLE_STOCHASTIC_ROUNDING=1 \
python benchmarks/linear/benchmark_grouped_linear.py --recipe nvfp4
       m     k     n recipe  num_gemms  grouped_fwd_bwd_time_ms
0  16384  7168  2048  nvfp4          4                 1.908519
1  32768  7168  2048  nvfp4          4                 3.313811
2  65536  7168  2048  nvfp4          4                 6.215076
3  98304  7168  2048  nvfp4          4                 9.027176
0  16384  7168  2048  nvfp4          8                 2.361491
1  32768  7168  2048  nvfp4          8                 3.768442
2  65536  7168  2048  nvfp4          8                 6.588285
3  98304  7168  2048  nvfp4          8                 9.480253

For the pretraining config, the performance overhead is 2.16x~2.57x, which is not usable at this stage. I turned off RHT and SR for a fair comparison:

NVTE_NVFP4_DISABLE_RHT=1 \
NVTE_NVFP4_DISABLE_STOCHASTIC_ROUNDING=1 \
python benchmarks/linear/benchmark_grouped_linear.py --recipe nvfp4
       m     k     n recipe  num_gemms  grouped_fwd_bwd_time_ms
0  16384  7168  2048  nvfp4          4                 0.774788
1  32768  7168  2048  nvfp4          4                 1.251587
2  65536  7168  2048  nvfp4          4                 2.249276
3  98304  7168  2048  nvfp4          4                 3.259345
0  16384  7168  2048  nvfp4          8                 0.952317
1  32768  7168  2048  nvfp4          8                 1.432820
2  65536  7168  2048  nvfp4          8                 2.436908
3  98304  7168  2048  nvfp4          8                 3.412981
NVTE_NVFP4_4OVER6=all \
NVTE_NVFP4_DISABLE_RHT=1 \
NVTE_NVFP4_DISABLE_STOCHASTIC_ROUNDING=1 \
python benchmarks/linear/benchmark_grouped_linear.py --recipe nvfp4
       m     k     n recipe  num_gemms  grouped_fwd_bwd_time_ms
0  16384  7168  2048  nvfp4          4                 1.753024
1  32768  7168  2048  nvfp4          4                 3.074884
2  65536  7168  2048  nvfp4          4                 5.711913
3  98304  7168  2048  nvfp4          4                 8.387917
0  16384  7168  2048  nvfp4          8                 2.060491
1  32768  7168  2048  nvfp4          8                 3.383869
2  65536  7168  2048  nvfp4          8                 6.018331
3  98304  7168  2048  nvfp4          8                 8.670583
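
The slowdown ranges quoted in this comment can be reproduced from the four tables above. The snippet below copies the grouped_fwd_bwd_time_ms column of each run in row order; the variable and function names are only for illustration.

```python
# grouped_fwd_bwd_time_ms, copied from the tables above in row order.
rl_base = [1.443936, 2.489801, 4.548635, 6.640535,
           1.836268, 2.837006, 4.977518, 6.967243]
rl_4over6 = [1.908519, 3.313811, 6.215076, 9.027176,
             2.361491, 3.768442, 6.588285, 9.480253]
pretrain_base = [0.774788, 1.251587, 2.249276, 3.259345,
                 0.952317, 1.432820, 2.436908, 3.412981]
pretrain_4over6 = [1.753024, 3.074884, 5.711913, 8.387917,
                   2.060491, 3.383869, 6.018331, 8.670583]


def slowdown_range(base, with_4over6):
    """Smallest and largest per-row ratio of 4over6 time to baseline time."""
    ratios = [t4 / t for t, t4 in zip(base, with_4over6)]
    return min(ratios), max(ratios)


rl_min, rl_max = slowdown_range(rl_base, rl_4over6)
pre_min, pre_max = slowdown_range(pretrain_base, pretrain_4over6)
print(f"RL config:       {rl_min:.3f}x to {rl_max:.3f}x")
print(f"Pretrain config: {pre_min:.3f}x to {pre_max:.3f}x")
```

The per-row ratios fall inside the ranges stated in the comment when the endpoints are truncated to two decimals.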

@Oleg-Goncharov
Collaborator

Hi @zianglih, from my side, this looks okay now. The reported slowdown doesn't seem like a blocker for merging, especially if the current tradeoff is acceptable for the target use case, and we can revisit performance later if needed.

Comment on lines +86 to +91
/*! Whether an NVFP4 tensor is encoded with 4over6 semantics.
*
* This records whether block scales were selected by comparing map-to-4
* and map-to-6 candidates.
*/
kNVTENVFP44Over6 = 9,
Collaborator

We are controlling 4over6 with 5 configs:

  • kNVTENVFP44Over6
  • kNVTENVFP4E4M3Max
  • kNVTEQuantizationConfigNVFP44Over6
  • kNVTEQuantizationConfigNVFP4E4M3Max
  • kNVTEQuantizationConfigNVFP44Over6ErrMode

We only need 2:

  • kNVTENVFP4E4M3Max: tensor attr, needed for both quant and dequant
  • kNVTEQuantizationConfigNVFP44Over6Mode: quant config, only needed for quant

Contributor Author

Done in 2980cb1.

Comment on lines +127 to +133
/*! \enum NVTENVFP44Over6ErrMode
* \brief Candidate-selection error mode for NVFP4 4over6 quantization.
*/
enum NVTENVFP44Over6ErrMode {
kNVTENVFP44Over6ErrMAE = 0, /*!< Select the candidate with lower summed absolute error */
kNVTENVFP44Over6ErrMSE = 1, /*!< Select the candidate with lower summed squared error */
};
Collaborator

If we add "disabled mode", this enum makes the bool configs for 4over6 redundant.

Suggested change
/*! \enum NVTENVFP44Over6ErrMode
* \brief Candidate-selection error mode for NVFP4 4over6 quantization.
*/
enum NVTENVFP44Over6ErrMode {
kNVTENVFP44Over6ErrMAE = 0, /*!< Select the candidate with lower summed absolute error */
kNVTENVFP44Over6ErrMSE = 1, /*!< Select the candidate with lower summed squared error */
};
/*! \enum NVTENVFP44Over6Mode
* \brief Method for NVFP4 4over6 quantization.
*/
enum NVTENVFP44Over6Mode {
kNVTENVFP44Over6Disabled = 0, /*!< 4over6 is not applied */
kNVTENVFP44Over6MinMAE = 1, /*!< Select the candidate with lower mean absolute error */
kNVTENVFP44Over6MinMSE = 2, /*!< Select the candidate with lower mean squared error */
};
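
To show why a single three-valued mode can replace a boolean plus an error-mode enum, and how the two error metrics can disagree, here is a small Python mirror of the suggested enum. The class and function names are hypothetical, and treating the disabled mode as plain map-to-6 is an assumption made for this sketch.

```python
from enum import IntEnum


class FourOverSixMode(IntEnum):
    """Python mirror of the suggested NVTENVFP44Over6Mode values."""
    DISABLED = 0
    MIN_MAE = 1
    MIN_MSE = 2


def pick_candidate(mode, err4, err6):
    """Choose a candidate from per-element errors of the two candidates."""
    if mode == FourOverSixMode.DISABLED:
        return "map-to-6"  # assumption: disabled behaves like regular NVFP4
    if mode == FourOverSixMode.MIN_MAE:
        total4 = sum(abs(e) for e in err4)
        total6 = sum(abs(e) for e in err6)
    else:
        total4 = sum(e * e for e in err4)
        total6 = sum(e * e for e in err6)
    # Strict comparison so that ties go to map-to-6.
    return "map-to-4" if total4 < total6 else "map-to-6"


# Many small errors versus one large error: the two metrics disagree.
err4 = [0.5, 0.5, 0.5, 0.5]   # abs sum 2.0, squared sum 1.0
err6 = [1.5, 0.0, 0.0, 0.0]   # abs sum 1.5, squared sum 2.25
for mode in FourOverSixMode:
    print(mode.name, pick_candidate(mode, err4, err6))
```

Comparing sums or means gives the same decision here because both candidates cover the same number of elements.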

Contributor Author

Done in 2980cb1. Also refactored the modes in the C++ tests.

@timmoon10
Collaborator

/te-ci

Signed-off-by: Ziang Li <ziangli@umich.edu>
@zianglih
Contributor Author

A few 4over6 CI failures:

=========================== short test summary info ============================
FAILED ../../tests/pytorch/test_fusible_ops.py::TestBasicOps::test_dropout[dtype1-shape2-fp8_current_scaling-True-0.5] - AssertionError: Number of zeros is outside 99% confidence interval (prob=0.5, prob_observed=0.488525390625)
assert 2.9375 < 2.5758
 +  where 2.9375 = abs(-2.9375)
FAILED ../../tests/pytorch/test_fusible_ops.py::TestSequentialModules::test_layernorm_mlp[nvfp4_4over6-dtype2-False-True-True-False] - AssertionError: Tensor-likes are not close!

Mismatched elements: 1 / 384 (0.3%)
Greatest absolute difference: 0.5703666875867625 at index (172,) (up to 0.5 allowed)
Greatest relative difference: 3.475372894576796 at index (172,) (up to 0.25 allowed)
FAILED ../../tests/pytorch/test_fusible_ops.py::TestSequentialModules::test_layernorm_mlp[nvfp4_4over6-dtype2-False-True-True-True] - AssertionError: Tensor-likes are not close!

Mismatched elements: 8 / 384 (2.1%)
Greatest absolute difference: 0.6054700590012174 at index (37,) (up to 0.5 allowed)
Greatest relative difference: 67.53061971695855 at index (36,) (up to 0.25 allowed)
FAILED ../../tests/pytorch/test_fusible_ops.py::TestSequentialModules::test_layernorm_mlp[nvfp4_4over6-dtype2-True-True-True-False] - AssertionError: Tensor-likes are not close!

Mismatched elements: 1 / 49152 (0.0%)
Greatest absolute difference: 0.5862411404167979 at index (38, 79) (up to 0.5 allowed)
Greatest relative difference: 10.66064707830139 at index (38, 79) (up to 0.25 allowed)
FAILED ../../tests/pytorch/test_fusible_ops.py::TestSequentialModules::test_layernorm_mlp[nvfp4_4over6-dtype2-True-True-True-True] - AssertionError: Tensor-likes are not close!

Mismatched elements: 35 / 384 (9.1%)
Greatest absolute difference: 0.6996637819152378 at index (23,) (up to 0.5 allowed)
Greatest relative difference: 688.2507391509421 at index (184,) (up to 0.25 allowed)
=== 5 failed, 3945 passed, 9607 skipped, 2966 warnings in 404.46s (0:06:44) ====
Error: sub-test failed: test_fusible_ops.py
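
For readers unfamiliar with these messages, the sketch below works through the first test_layernorm_mlp failure. It assumes the usual closeness criterion |actual - expected| <= atol + rtol * |expected| and that the reported relative difference is the absolute difference divided by |expected|; since both greatest differences are reported at the same index (172,), the magnitude of the expected value can be backed out from them. The names are illustrative only.

```python
def allowed_error(expected_abs, atol, rtol):
    """Tolerance for one element: atol + rtol * |expected|."""
    return atol + rtol * expected_abs


# Values reported for index (172,) in the first test_layernorm_mlp failure.
abs_diff = 0.5703666875867625
rel_diff = 3.475372894576796
atol, rtol = 0.5, 0.25

expected_abs = abs_diff / rel_diff  # |expected| implied by the two numbers
limit = allowed_error(expected_abs, atol, rtol)
fails = abs_diff > limit
print(f"|expected| = {expected_abs:.4f}, limit = {limit:.4f}, fails = {fails}")
```

Under these assumptions the expected value is small, so the relative term adds little headroom and the element misses the combined tolerance by a narrow margin.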

zianglih added 2 commits May 19, 2026 00:28
Signed-off-by: Ziang Li <ziangli@umich.edu>
Signed-off-by: Ziang Li <ziangli@umich.edu>
Labels

community-contribution PRs from external contributor outside the core maintainers, representing community-driven work. fp4

6 participants