Support merging gating gmm kernels by BirdsOfAFthr · Pull Request #3199 · AI-Hypercomputer/maxtext

BirdsOfAFthr · 2026-02-20T05:53:18Z

Description

This PR optimizes the MoE compute block by merging the two gating GMM kernels ($W_0$ and $W_1$) into a single, unified matrix multiplication pass.

Motivation
In the previous SwiGLU/GLU implementation, the gate projection and up projection were computed by two sequential gmm_fn calls. Concatenating the two weight matrices and processing them in a single call effectively doubles the contiguous hidden dimension seen by the kernel. This matters most for FP8 runs that use Expert Parallelism (EP) sharded along the contracting dimension. Because this sharding strategy inherently shrinks the local MLP hidden dimension on each device, the matrix multiplications can become small and bottlenecked by memory bandwidth. Merging $W_0$ and $W_1$ gives a 2X increase in that local dimension, restoring arithmetic intensity and hardware utilization.
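
A minimal sketch of the merge, written in NumPy rather than with the actual MaxText kernels. `grouped_matmul` is a hypothetical stand-in for gmm_fn, the shapes and group sizes are made up, and treating $W_0$ as the SiLU-gated projection is an assumption. It only illustrates that one pass over concatenated weights followed by a split reproduces the two-pass result.

```python
import numpy as np


def silu(x):
    """SiLU activation, assumed here as the gate nonlinearity."""
    return x / (1.0 + np.exp(-x))


def grouped_matmul(x, w, group_sizes):
    """Hypothetical reference for a grouped matmul.

    Rows of x are sorted by expert: expert e owns group_sizes[e]
    consecutive rows and multiplies them by w[e].
    """
    out = np.empty((x.shape[0], w.shape[-1]), dtype=x.dtype)
    start = 0
    for expert, size in enumerate(group_sizes):
        out[start:start + size] = x[start:start + size] @ w[expert]
        start += size
    return out


def gated_two_pass(x, w0, w1, group_sizes):
    """Previous structure: two sequential grouped matmuls."""
    gate = grouped_matmul(x, w0, group_sizes)
    up = grouped_matmul(x, w1, group_sizes)
    return silu(gate) * up


def gated_merged(x, w0, w1, group_sizes):
    """Merged structure: one grouped matmul over [W0 | W1], then a split."""
    w01 = np.concatenate([w0, w1], axis=-1)  # [experts, model_dim, 2 * hidden_dim]
    fused = grouped_matmul(x, w01, group_sizes)
    gate, up = np.split(fused, 2, axis=-1)   # first half is W0's output
    return silu(gate) * up


# Illustrative sizes only; one expert deliberately receives no tokens.
rng = np.random.default_rng(0)
num_experts, model_dim, hidden_dim = 4, 16, 8
group_sizes = [3, 0, 5, 2]
x = rng.standard_normal((sum(group_sizes), model_dim))
w0 = rng.standard_normal((num_experts, model_dim, hidden_dim))
w1 = rng.standard_normal((num_experts, model_dim, hidden_dim))

two_pass = gated_two_pass(x, w0, w1, group_sizes)
merged = gated_merged(x, w0, w1, group_sizes)
print(np.allclose(two_pass, merged))
```

In a real kernel the weights would presumably be stored or concatenated once rather than on every call; the sketch concatenates inline only for brevity.
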
Expected Impact

Performance: Higher forward- and backward-pass throughput for the MoE layers, particularly on EP setups sharded along the contracting dimension, due to the 2X larger local GMM sizes.
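
To make the memory-bandwidth argument concrete, here is a rough roofline-style estimate. The function name, the problem sizes, and the simplification that every operand is one byte wide (as with FP8) and is moved exactly once per call are all assumptions for illustration, not measurements from this PR.

```python
def gmm_cost(m, k, n, bytes_per_element=1):
    """Return (flops, bytes_moved) for one (m x k) @ (k x n) matmul.

    Counts one read of each input and one write of the output,
    all at the same element width (a deliberate simplification).
    """
    flops = 2 * m * k * n
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops, bytes_moved


# Hypothetical per-expert sizes after sharding shrinks the hidden dimension.
tokens, model_dim, local_hidden = 512, 4096, 256

# Two separate gating calls: each reads the activations again.
flops_one, bytes_one = gmm_cost(tokens, model_dim, local_hidden)
separate_intensity = flops_one / bytes_one
separate_bytes = 2 * bytes_one

# One merged call with twice the output width.
flops_merged, merged_bytes = gmm_cost(tokens, model_dim, 2 * local_hidden)
merged_intensity = flops_merged / merged_bytes

print(f"separate: {separate_intensity:.1f} FLOPs/byte per call, {separate_bytes} bytes total")
print(f"merged:   {merged_intensity:.1f} FLOPs/byte, {merged_bytes} bytes total")
```

Under this model the merged call performs the same FLOPs as the two separate calls combined but reads the activations once instead of twice, so its FLOPs-per-byte ratio is higher.
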

Tests

The operation is mathematically equivalent to the previous implementation. Quality has been verified through a convergence test.
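
The equivalence follows from block matrix multiplication. As a sketch, write $x$ for the tokens routed to one expert, $\sigma$ for the gate activation, and $y$ for the gated output (these three symbols are introduced here for illustration, and treating $W_0$ as the gate projection is an assumption):

$$
x \, [\, W_0 \;\; W_1 \,] = [\, x W_0 \;\; x W_1 \,], \qquad y = \sigma(x W_0) \odot (x W_1).
$$

Splitting the merged output in half along the last dimension therefore recovers the same two operands that the sequential calls produced, up to floating-point rounding.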

Checklist

Before submitting this PR, please make sure (put X in square brackets):

I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
I have necessary comments in my code, particularly in hard-to-understand areas.
I have run end-to-end tests and provided workload links above if applicable.
I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

codecov · 2026-02-20T06:02:11Z

Codecov Report

❌ Patch coverage is 0% with 6 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/maxtext/models/deepseek_batchsplit.py	0.00%	6 Missing ⚠️

suexu1025

LGTM, thanks!

BirdsOfAFthr requested review from A9isha, NicoGrande, NuojCheng, RissyRan, SurbhiJainUSC, aireenmei, bvandermoon, gagika, gobbleturk, hengtaoguo, jesselu-google, jiangjy1982, khatwanimohit, richjames0, shralex, suexu1025 and vipannalla as code owners February 20, 2026 05:53

suexu1025 approved these changes Feb 20, 2026

BirdsOfAFthr force-pushed the amandaliang branch from afdb8c7 to 6911c2c Compare February 21, 2026 01:05

khatwanimohit approved these changes Feb 21, 2026

Support merging gating gmm kernels

2d43a9c

BirdsOfAFthr force-pushed the amandaliang branch from 6911c2c to 2d43a9c Compare February 21, 2026 18:51

BirdsOfAFthr wants to merge 1 commit into main from amandaliang

3 participants
