GGEMM+srelu kernels for MxFP8 Nemotron by sraman-rgb · Pull Request #2981 · NVIDIA/TransformerEngine

sraman-rgb · 2026-05-12T19:11:28Z

Description

Please include a brief summary of the changes, relevant motivation and context.

Fixes # (issue)

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Change A
Change B

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

ksivaman · 2026-05-12T19:20:03Z

/te-ci pytorch

ksivaman · 2026-05-12T19:20:23Z

Please sign off your commits, @sraman-rgb.

greptile-apps · 2026-05-12T19:23:56Z

Greptile Summary

This PR refactors the fused GroupedMLP kernel hierarchy into a shared base class and adds ScaledSReLU (squared-ReLU with per-row post-scaling) as a second supported activation alongside the existing GLU variants, wiring up new cuDNN FE grouped_gemm_srelu_wrapper_sm100 / grouped_gemm_dsrelu_wrapper_sm100 kernels.

New ScaledSReLU op (activation.py): standard BasicOperation with num_extra_inputs=1, implements both unfused and fused forward/backward paths.
Refactored fused forward/backward: common logic moved to abstract base classes; GLU and Unary concrete subclasses wire their respective cuDNN FE kernels.
Fusion plumbing (_common.py): fuse_grouped_mlp_ops parameterised with activation_op_types; validate_grouped_mlp_dims extended for unary activations; separate forward/backward fusion functions registered for each activation family.
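
As a rough illustration of the "squared-ReLU with per-row post-scaling" semantics described above, here is a minimal pure-Python sketch. It is not the code from this PR: the function names are hypothetical, it works on plain nested lists rather than tensors, and the backward formulas are simply derived from y = scales * relu(x)^2.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def scaled_srelu_forward(x: Matrix, scales: List[float]) -> Matrix:
    """Reference forward: y[i][j] = scales[i] * max(x[i][j], 0) ** 2."""
    return [[s * max(v, 0.0) ** 2 for v in row] for row, s in zip(x, scales)]


def scaled_srelu_backward(
    x: Matrix, scales: List[float], grad_y: Matrix
) -> Tuple[Matrix, List[float]]:
    """Reference backward, derived from the forward formula.

    grad_x[i][j]   = 2 * scales[i] * max(x[i][j], 0) * grad_y[i][j]
    grad_scales[i] = sum_j max(x[i][j], 0) ** 2 * grad_y[i][j]
    """
    grad_x = [
        [2.0 * s * max(v, 0.0) * g for v, g in zip(row, grad_row)]
        for row, grad_row, s in zip(x, grad_y, scales)
    ]
    grad_scales = [
        sum(max(v, 0.0) ** 2 * g for v, g in zip(row, grad_row))
        for row, grad_row in zip(x, grad_y)
    ]
    return grad_x, grad_scales
```

The per-row scale gets its own gradient, which is consistent with the op taking one extra input and the backward kernel returning grad_scales.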

Confidence Score: 5/5

The refactor is well-structured and the SReLU kernel wiring follows the established GLU pattern closely; the two flagged items are clarifying questions rather than confirmed failures.

The class hierarchy generalisation is clean, dscales_tensor is always an allocated tensor, the recompute-FC2-input path is guarded by multiple independent checks, and test coverage spans both unit-level ScaledSReLU and the full grouped-MLP integration.

Flagged items: forward_grouped_mlp.py (prob_tensor dtype) and _common.py (_nvidia_cudnn_frontend_supports_wgrad guard).

Important Files Changed

| Filename | Overview |
| --- | --- |
| transformer_engine/pytorch/ops/basic/activation.py | Adds ScaledSReLU with correct unfused fuser_forward/fuser_backward; dtype handling and grad accumulation look sound. |
| transformer_engine/pytorch/ops/fused/forward_grouped_mlp.py | Base class refactor is clean; prob_tensor dtype (BF16/FP16 vs float32 fallback and backward) is an inconsistency worth confirming against the SReLU kernel spec. |
| transformer_engine/pytorch/ops/fused/backward_grouped_mlp.py | dSReLU backward kernel wiring, recompute path, and grad_scales handling are logically correct; dscales_tensor is always allocated. |
| transformer_engine/pytorch/ops/_common.py | validate_grouped_mlp_dims and fuse_grouped_mlp_ops generalised cleanly; _nvidia_cudnn_frontend_supports_wgrad is a thin alias with no distinct version check. |
| transformer_engine/pytorch/ops/fused/__init__.py | Export list updated to expose the four new concrete fused-op classes; no issues. |
| transformer_engine/pytorch/ops/basic/__init__.py | Adds ScaledSReLU to the public API; straightforward. |
| tests/pytorch/test_fusible_ops.py | New test_scaled_srelu unit test and scaled_srelu parametrize for test_grouped_mlp look correct; reference implementation matches expected SReLU*scales semantics. |

Sequence Diagram

sequenceDiagram
    participant Fuser
    participant GLUFwd as ForwardGroupedMLP_CuTeGEMMGLU_MXFP8
    participant SReLUFwd as ForwardGroupedMLP_CuTeGEMMUnary_MXFP8
    participant SReLUBwd as BackwardGroupedMLP_CuTeGEMMDUnary_MXFP8
    participant cuDNN as cuDNN FE Kernels

    Fuser->>GLUFwd: fuse_forward_ops GLU pattern
    GLUFwd->>cuDNN: grouped_gemm_glu_wrapper_sm100
    cuDNN-->>GLUFwd: fc2_in scales and activation_in

    Fuser->>SReLUFwd: fuse_forward_srelu_ops SReLU pattern
    SReLUFwd->>cuDNN: grouped_gemm_srelu_wrapper_sm100
    cuDNN-->>SReLUFwd: fc2_in scales and activation_in
    Note over SReLUFwd: Save activation_in and scales
    Note over SReLUFwd: optionally skip saving fc2_x

    Fuser->>SReLUBwd: fuse_backward_srelu_ops
    SReLUBwd->>cuDNN: grouped_gemm_dsrelu_wrapper_sm100
    cuDNN-->>SReLUBwd: FC1 dy tensors and grad_scales
    cuDNN-->>SReLUBwd: optional recomputed FC2 input
    SReLUBwd->>cuDNN: grouped_gemm_wgrad for FC1 and FC2

Reviews (8): Last reviewed commit: "Address grouped MLP ScaledSReLU review c..."

Signed-off-by: sraman-rgb <sraman@nvidia.com>

timmoon10

Overall looks good, but we've gotten to the point where we need to start thinking about how to gracefully handle adding new activations. It seems that every model has a different activation function.

timmoon10 · 2026-05-12T23:10:12Z

+        swiglu: Optional[ScaledSwiGLU | ScaledClampedQGeGLU] = None,
+        srelu: Optional[ScaledSReLU] = None,


Why not have a single arg?

Suggested change

swiglu: Optional[ScaledSwiGLU | ScaledClampedQGeGLU] = None,

srelu: Optional[ScaledSReLU] = None,

activation: Optional[FusibleOperation] = None,

It seems like we're adding one activation function after another, so we want interfaces that scale gracefully. Also, fused ops are basically internal to TE and these ops in particular are experimental, so backward compatibility is not a major concern.

The forward fused op should have a similar design. Changing to a consistent arg name would also let us get rid of the kwarg name messiness in the op fusion function.
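
To make the single-argument suggestion concrete, here is a minimal sketch of how one `activation` keyword could remove the per-activation kwarg names from the fusion function. All classes here are stand-ins written for illustration (including the `FusibleOperation` stub and the `build_fused_op` helper); none of this is the actual Transformer Engine code.

```python
class FusibleOperation:
    """Stand-in for the TE base class, only here so the sketch runs."""


class ScaledSwiGLU(FusibleOperation):
    """Stand-in GLU-style activation."""


class ScaledSReLU(FusibleOperation):
    """Stand-in unary activation."""


class FusedGroupedMLPSketch:
    """Hypothetical fused op taking one activation kwarg for every variant."""

    def __init__(self, *, fc1, activation, fc2):
        if not isinstance(activation, FusibleOperation):
            raise TypeError("activation must be a FusibleOperation")
        self.fc1 = fc1
        self.activation = activation
        self.fc2 = fc2


def build_fused_op(fused_cls, window):
    """Construct a fused op from a (fc1, activation, fc2) window.

    With a consistent kwarg name there is no need to choose between
    `swiglu=` and `srelu=` based on the activation type.
    """
    fc1, activation, fc2 = window
    return fused_cls(fc1=fc1, activation=activation, fc2=fc2)
```

The same helper then works unchanged when another activation is added, which is the scaling property the comment asks for.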

timmoon10 · 2026-05-12T23:29:56Z

        return fc2_out, [(), (), ()]


+class ForwardGroupedMLP_CuTeGEMMSReLU_MXFP8(ForwardGroupedMLP_CuTeGEMMSwiGLU_MXFP8):


This is an awkward class hierarchy. It would be better to have a virtual base class that both the GLU and non-GLU functions inherit from. The backward fused ops should have a similar design.

While we're messing with the existing classes, we should reconsider the names. The "SwiGLU" op is actually used for both SwiGLU and ClampedQGeGLU, so a name like "GLU" would be better. And there's no reason to expect "SReLU" won't be applied to other activations later, so maybe "Unary" would be more general.
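
One way to picture the suggested hierarchy is an abstract base class that owns the shared logic, with GLU and unary subclasses supplying only what differs. The sketch below is an assumption-laden illustration: the class and method names are hypothetical, and only the two kernel wrapper names are taken from this PR's summary.

```python
from abc import ABC, abstractmethod


class ForwardGroupedMLPSketch(ABC):
    """Hypothetical shared base: common logic lives here."""

    def describe(self) -> str:
        # Shared "template" logic that both activation families reuse.
        return f"fc1 -> {self.kernel_name()} -> fc2"

    @abstractmethod
    def kernel_name(self) -> str:
        """Name of the fused GEMM + activation kernel to call."""

    @abstractmethod
    def activation_out_features(self, fc1_out_features: int) -> int:
        """Width of the activation output for a given FC1 output width."""


class GLUSketch(ForwardGroupedMLPSketch):
    def kernel_name(self) -> str:
        return "grouped_gemm_glu_wrapper_sm100"

    def activation_out_features(self, fc1_out_features: int) -> int:
        # A gated activation consumes a gate half and a linear half.
        return fc1_out_features // 2


class UnarySketch(ForwardGroupedMLPSketch):
    def kernel_name(self) -> str:
        return "grouped_gemm_srelu_wrapper_sm100"

    def activation_out_features(self, fc1_out_features: int) -> int:
        # An elementwise activation keeps the width unchanged.
        return fc1_out_features
```

Neither concrete class inherits from the other, so a unary activation no longer has to pretend to be a SwiGLU variant.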

timmoon10 · 2026-05-12T23:33:10Z

            pytest.skip("Quantized group GEMM is only supported with BF16/FP16")
+        if activation == "scaled_srelu" and quantization != "mxfp8":
+            pytest.skip("ScaledSReLU grouped MLP fusion is only supported with MXFP8")
+        if activation == "scaled_srelu" and glu_interleave_size is not None:


Nit: This is assuming that activations are GLUs by default, and SReLU is weird. Isn't that kind of backward? In any case, it would be more logical to have a single point where we check is_glu_activation, and then use that everywhere.
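
A small sketch of the "single point" idea: decide once whether the activation is a GLU, then branch on that flag everywhere. The helper names and the GLU activation strings below are hypothetical, as is the second skip message, since the original diff is cut off at that point; only `"scaled_srelu"` and `"mxfp8"` appear in the PR's test code.

```python
from typing import Optional

# Hypothetical names for the GLU-style activations in the test parametrization.
GLU_ACTIVATIONS = frozenset({"scaled_swiglu", "scaled_clamped_qgeglu"})


def is_glu_activation(activation: str) -> bool:
    """Single place that classifies an activation as GLU or unary."""
    return activation in GLU_ACTIVATIONS


def skip_reason(
    activation: str,
    quantization: Optional[str],
    glu_interleave_size: Optional[int],
) -> Optional[str]:
    """Return a skip message for unsupported combinations, else None."""
    is_glu = is_glu_activation(activation)
    if not is_glu and quantization != "mxfp8":
        return "Unary grouped MLP fusion is only supported with MXFP8"
    if not is_glu and glu_interleave_size is not None:
        # Assumed message: interleaving is a GLU-only layout option.
        return "GLU interleaving does not apply to unary activations"
    return None
```

In a pytest test this could be used as `reason = skip_reason(...)` followed by `pytest.skip(reason)` when it is not None.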

timmoon10 · 2026-05-18T20:28:40Z

        *,
        fc1: GroupedLinear,
-        swiglu: ScaledSwiGLU | ScaledClampedQGeGLU,
+        activation: Optional[FusibleOperation] = None,


Nit: Python supports kwargs without defaults.

Suggested change

activation: Optional[FusibleOperation] = None,

activation: Optional[FusibleOperation],
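
For readers unfamiliar with the nit, here is a minimal sketch of a required keyword-only argument. The function name `make_fused_op` is hypothetical; the point is only that parameters after `*` can omit a default, in which case the caller must pass them by name (passing `None` explicitly is still allowed when the annotation is `Optional`).

```python
from typing import Any, Dict, Optional


def make_fused_op(*, fc1: Any, activation: Optional[Any], fc2: Any) -> Dict[str, Any]:
    """All three arguments are keyword-only and required (no defaults)."""
    return {"fc1": fc1, "activation": activation, "fc2": fc2}


# Explicitly passing None is accepted.
op = make_fused_op(fc1="fc1", activation=None, fc2="fc2")

# Omitting the argument is rejected at call time.
try:
    make_fused_op(fc1="fc1", fc2="fc2")
except TypeError as err:
    missing_arg_error = str(err)
```

Dropping the `= None` default turns a silently missing activation into an immediate `TypeError` at the call site.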

timmoon10 · 2026-05-18T20:57:48Z

            fc2_ctx.dtype = dtype
            fc2_ctx.input_requires_grad = input_requires_grad
            fc2_ctx.weight_requires_grad = weight_requires_grad
+            fc2_ctx.recompute_input_from_dsrelu = recompute_srelu_fc2_x


This option isn't supported in the unfused GroupedLinear op. This is a problem because the forward and backward fusions are performed independently, so everything needs to be compatible with the unfused op interfaces in case there are different forward and backward fusions. However, I also don't want to include this in the unfused op because it is so hyper-specific to this particular fusion.

The requirement that the fused and unfused ops are interchangeable has been causing some trouble with the grouped MLP block. It may be worth relaxing, but we would need some guarantee that the forward and backward fusions match exactly. I propose we change the op fuser to operate in three stages: fuse forward-backward ops together, fuse forward ops, fuse backward ops. For fused ops with matching forward and backward, we can tolerate tighter forward-backward integration.
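
The three-stage proposal could look roughly like the sketch below. This is purely illustrative and not the existing op fuser: ops are modeled as strings, a "pass" is any function from an op list to an op list, and `replace_window` and `fuse_three_stage` are hypothetical names.

```python
from typing import Callable, List, Sequence, Tuple

Pass = Callable[[List[str]], List[str]]


def replace_window(pattern: Sequence[str], fused: str) -> Pass:
    """Build a pass that replaces each occurrence of `pattern` with `fused`."""
    pattern = list(pattern)
    if not pattern:
        raise ValueError("pattern must not be empty")

    def run(ops: List[str]) -> List[str]:
        out, i = [], 0
        while i < len(ops):
            if ops[i : i + len(pattern)] == pattern:
                out.append(fused)
                i += len(pattern)
            else:
                out.append(ops[i])
                i += 1
        return out

    return run


def fuse_three_stage(
    ops: Sequence[str],
    paired_passes: Sequence[Pass],
    forward_passes: Sequence[Pass],
    backward_passes: Sequence[Pass],
) -> Tuple[List[str], List[str]]:
    """Stage 1 fuses forward and backward together; stages 2 and 3 diverge."""
    shared = list(ops)
    for fuse in paired_passes:
        shared = fuse(shared)  # the same fused op appears in both directions
    forward, backward = list(shared), list(shared)
    for fuse in forward_passes:
        forward = fuse(forward)
    for fuse in backward_passes:
        backward = fuse(backward)
    return forward, backward
```

Because ops fused in the first stage appear identically in both results, such an op could safely carry state like a recompute flag from its forward to its backward.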

Signed-off-by: Siddhartha Raman S <sraman@login-lyris01.lyris.clusters.nvidia.com>

Signed-off-by: Siddhartha Raman S <sraman@nvidia.com>

for more information, see https://pre-commit.ci

Signed-off-by: Siddhartha Raman S <sraman@nvidia.com>

vthumbe1503

LGTM. We might want to wait for the cuDNN release and until appropriate cuDNN guards are added.

vthumbe1503 · 2026-05-18T19:52:56Z

+        else:
+            try:
+                validate_grouped_mlp_dims(window[0], window[1], window[2])
+            except (TypeError, ValueError):
+                matches_pattern = False


Before the merge, we will eventually want to disable the SReLU fusion here based on the cuDNN version.
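
A version guard of the kind requested might be sketched as follows. The helper names are hypothetical and the minimum version is deliberately left as a caller-supplied parameter, because the PR does not state which cuDNN release ships the SReLU kernels. Pre-release suffixes such as `rc1` are simply ignored in this sketch.

```python
import re
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse 'major.minor.patch' leniently, padding missing parts with 0."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    while len(parts) < 3:
        parts.append(0)
    return (parts[0], parts[1], parts[2])


def srelu_fusion_supported(installed: str, minimum: str) -> bool:
    """True if the installed version is at least the required minimum."""
    return parse_version(installed) >= parse_version(minimum)
```

The fusion pattern match could then set `matches_pattern = False` whenever this check fails, alongside the existing dimension validation.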

vthumbe1503 · 2026-05-19T02:54:44Z

+                scales.detach().to(dtype=dtype).reshape(-1, 1, 1)
+                if scales is not None
+                else torch.ones((in_shape[0], 1, 1), dtype=torch.float32, device=device)


This might be a holdover from before, right? We expect the scales passed in to never be None, so can we revert this change?

greptile-apps Bot reviewed May 12, 2026

Comment thread transformer_engine/pytorch/ops/fused/forward_grouped_mlp.py

Comment thread transformer_engine/pytorch/ops/fused/forward_grouped_mlp.py Outdated

sraman-rgb force-pushed the fc1-srelu-main branch from 8373402 to 765d2e9 on May 12, 2026 20:33

greptile-apps Bot reviewed May 12, 2026

Comment thread transformer_engine/pytorch/ops/fused/backward_grouped_mlp.py Outdated

vthumbe1503 reviewed May 12, 2026

Comment thread transformer_engine/pytorch/ops/fused/backward_grouped_mlp.py Outdated

vthumbe1503 reviewed May 12, 2026

Comment thread transformer_engine/pytorch/ops/basic/activation.py

vthumbe1503 reviewed May 12, 2026

Comment thread transformer_engine/pytorch/ops/fused/forward_grouped_mlp.py Outdated

Add MXFP8 grouped MLP SReLU fusion

43093cc

Signed-off-by: sraman-rgb <sraman@nvidia.com>

sraman-rgb force-pushed the fc1-srelu-main branch from 765d2e9 to 43093cc on May 12, 2026 22:05

timmoon10 reviewed May 12, 2026

timmoon10 reviewed May 18, 2026

Siddhartha Raman S and others added 5 commits May 18, 2026 14:46

Address grouped MLP fused op review comments

e29544f

Signed-off-by: Siddhartha Raman S <sraman@login-lyris01.lyris.clusters.nvidia.com>

Avoid quantizing ScaledSReLU backward in basic op

2a4d310

Signed-off-by: Siddhartha Raman S <sraman@nvidia.com>

Wire ScaledSReLU recompute in grouped MLP

74a2395

Signed-off-by: Siddhartha Raman S <sraman@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

5c83920

for more information, see https://pre-commit.ci

Address grouped MLP ScaledSReLU review comments

46b3169

Signed-off-by: Siddhartha Raman S <sraman@nvidia.com>

sraman-rgb force-pushed the fc1-srelu-main branch from 912b1d9 to 46b3169 on May 18, 2026 21:47

vthumbe1503 reviewed May 19, 2026
