
Optimize qwen3 moe #1207

Open
WANDY666 wants to merge 5 commits into main from optimize_qwen3_moe

Conversation

WANDY666 (Contributor) commented on Feb 6, 2026

No description provided.

gemini-code-assist (Contributor)

Summary of Changes

Hello @WANDY666, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the performance and flexibility of the Qwen3 Mixture-of-Experts (MoE) model within LightLLM. It introduces a highly optimized fused RMSNorm Triton kernel for Query and Key tensors, streamlines MoE expert selection, and expands the autotune configurations for various Triton operations, leading to more efficient GPU utilization. Additionally, a new random sampling method is integrated into the inference pipeline, offering alternative token generation strategies.

Highlights

  • MoE Expert Selection Optimization: Conditional application of routed_scaling_factor to topk_weights in both deepgemm_impl.py and triton_impl.py to avoid unnecessary multiplication when the factor is 1.0, improving efficiency.
  • Triton Kernel Refinement: Removed redundant .float() conversion for gating_output in fused_moe/topk_select.py, streamlining data type handling and potentially improving performance.
  • Fused QK RMSNorm Kernel: Introduced a new Triton kernel (_qk_rms_norm_fused_kernel) and a corresponding Python wrapper (qk_rmsnorm_fused_forward) that perform in-place RMSNorm on the Query (Q) and Key (K) tensors in a single launch, which is especially beneficial for Grouped Query Attention (GQA) shapes (a reference sketch of the computation follows this list).
  • Qwen3 MoE Integration: Updated the Qwen3 MoE transformer layer to utilize the newly implemented qk_rmsnorm_fused_forward for Q and K normalization, replacing separate calls for improved performance.
  • Expanded Autotune Configurations: Added numerous new autotune kernel configurations across several Triton kernel types (grouped_matmul, moe_align_fused, moe_sum_reduce, silu_and_mul_fwd) for various tensor dimensions and parameters, aiming to improve performance on NVIDIA H200 GPUs.
  • Random Sampling Strategy: Implemented a new _random_sample function and integrated it into generic_post_process.py for model inference, allowing for an alternative token sampling approach based on the exponential distribution.
  • Unit Testing for Fused RMSNorm: Added a dedicated unit test (test_qk_rmsnorm_fused.py) to validate the correctness and numerical stability of the new fused QK RMSNorm kernel against a reference implementation.
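For reviewers who want to sanity-check the fused kernel, its output should match applying per-head RMSNorm to Q and K independently; the gain comes from covering both tensors in a single kernel launch. Below is a minimal PyTorch reference sketch, not the Triton kernel itself; the helper name qk_rmsnorm_reference, the tensor layouts, and the eps default are assumptions for illustration.

import torch

def qk_rmsnorm_reference(q, k, q_weight, k_weight, eps=1e-6):
    # Assumed layouts: q is (num_tokens, num_heads_q, head_dim),
    # k is (num_tokens, num_heads_k, head_dim),
    # q_weight / k_weight are (head_dim,) learned scale vectors.
    def rms_norm(x, w):
        x_f = x.float()
        variance = x_f.pow(2).mean(dim=-1, keepdim=True)
        return (x_f * torch.rsqrt(variance + eps)).to(x.dtype) * w
    return rms_norm(q, q_weight), rms_norm(k, k_weight)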


Changelog
  • lightllm/common/basemodel/layer_weights/meta_weights/fused_moe/impl/deepgemm_impl.py
    • Added conditional check if self.routed_scaling_factor != 1.0: before applying scaling to topk_weights.
  • lightllm/common/basemodel/layer_weights/meta_weights/fused_moe/impl/triton_impl.py
    • Added conditional check if self.routed_scaling_factor != 1.0: before applying scaling to topk_weights.
  • lightllm/common/basemodel/triton_kernel/fused_moe/topk_select.py
    • Changed gating_output.float() to gating_output in sgl_ops.topk_softmax call.
  • lightllm/common/basemodel/triton_kernel/norm/qk_norm.py
    • Added _qk_rms_norm_fused_kernel (Triton JIT kernel) for parallel RMSNorm on Q and K.
    • Added qk_rmsnorm_fused_forward function to orchestrate the fused QK RMSNorm.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/grouped_matmul:v1/{K=192,N=2048,expert_num=128,mul_routed_weight=true,out_dtype=torch.bfloat16,topk_num=1,use_fp8_w8a8=false}_NVIDIA_H200.json
    • Added new kernel configurations for K values 192, 384, 640, 768, 896, 96.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/grouped_matmul:v1/{K=2048,N=384,expert_num=128,mul_routed_weight=false,out_dtype=torch.bfloat16,topk_num=8,use_fp8_w8a8=false}_NVIDIA_H200.json
    • Added new kernel configurations for K values 112, 12, 24, 48, 80, 96.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/moe_align_fused:v1/{topk_num=8}_NVIDIA_H200.json
    • Added new kernel configurations for topk_num values 112, 12, 2, 24, 48, 80, 96.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/moe_sum_reduce:v1/{hidden_dim=2048,out_dtype=torch.bfloat16,topk_num=8}_NVIDIA_H200.json
    • Added new kernel configurations for hidden_dim values 112, 12, 24, 48, 80, 96.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/silu_and_mul_fwd:v1/{N=192,out_dtype=torch.bfloat16}_NVIDIA_H200.json
    • Added new kernel configurations for N values 192, 384, 640, 768, 896, 96.
  • lightllm/models/qwen3_moe/layer_infer/transformer_layer_infer.py
    • Imported qk_rmsnorm_fused_forward.
    • Replaced separate layer_weight.q_norm_weight_ and layer_weight.k_norm_weight_ calls with a single qk_rmsnorm_fused_forward call.
  • lightllm/server/router/model_infer/mode_backend/generic_post_process.py
    • Added is_all_random to the return of _get_post_sample_tensors.
    • Implemented _random_sample function for exponential-distribution sampling (a short sketch of the underlying trick follows this changelog).
    • Added logic to use _random_sample if is_all_random is true.
    • Updated _get_post_sample_tensors to set is_all_random based on top_k_val and shm_param.top_p.
  • unit_tests/common/basemodel/triton_kernel/test_qk_rmsnorm_fused.py
    • Added a new unit test file to verify qk_rmsnorm_fused_forward functionality and accuracy.
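The new random-sampling fast path relies on a standard identity: dividing each probability by an independent Exponential(1) draw and taking the argmax selects index i with probability probs[i], so the result matches torch.multinomial without launching a multinomial kernel. A minimal sketch, assuming probs is a (batch, vocab) tensor of row-normalized probabilities (the name random_sample_sketch is illustrative, not the PR's code):

import torch

def random_sample_sketch(probs):
    # Divide by Exponential(1) noise and take the argmax; index i wins
    # with probability probs[i], equivalent to torch.multinomial(probs, 1).
    # The out-of-place div leaves the caller's probs tensor untouched.
    noise = torch.empty_like(probs).exponential_(1.0)
    return probs.div(noise).argmax(dim=-1).view(-1)

# Illustrative usage:
# probs = torch.softmax(logits, dim=-1)
# next_token_ids = random_sample_sketch(probs)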
Activity
  • The pull request was initiated by WANDY666.
  • New Triton kernels and autotune configurations were introduced and integrated.
  • A new unit test was added to ensure the correctness of the fused QK RMSNorm.
  • No specific comments or review activities are detailed in the provided context.

gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request introduces several optimizations for the qwen3 MoE model, including a new fused Triton kernel for QK RMSNorm to reduce kernel launch overhead, a new fast path for random sampling, and other micro-optimizations. The changes are well-targeted for performance. My review has identified a critical bug in the new random sampling logic that results in incorrect log probabilities, and I have provided a fix. I also noted some maintainability issues with non-English comments in the newly added code, which should be addressed for consistency.

def _random_sample(probs: torch.Tensor):
q = torch.empty_like(probs)
q.exponential_()
return probs.div_(q).argmax(dim=-1).view(-1)

critical

The _random_sample function modifies the probs tensor in-place using div_. However, the caller in the sample function reuses this modified probs tensor on line 74 to calculate batch_next_token_probs, which will lead to incorrect log probabilities being returned. Please use the out-of-place div operation to avoid modifying the input tensor.

Suggested change
return probs.div_(q).argmax(dim=-1).view(-1)
return probs.div(q).argmax(dim=-1).view(-1)
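To make the failure mode concrete, here is an illustrative sketch of the pattern described above; the variable names and the gather step stand in for the caller's log-probability lookup and are not a copy of the actual sample() code.

import torch

probs = torch.softmax(torch.randn(2, 5), dim=-1)
noise = torch.empty_like(probs).exponential_()

# Out-of-place div: probs is left intact for later use.
token_ids = probs.div(noise).argmax(dim=-1)

# The caller subsequently reads the chosen tokens' probabilities from probs;
# with div_ this would read from probs / noise instead and be wrong.
token_probs = torch.gather(probs, 1, token_ids.view(-1, 1)).view(-1)
log_probs = torch.log(token_probs)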

stride_k_row,
stride_k_col,
# Dimensions
num_heads_q: tl.constexpr, # number of Q heads (used for boundary checks)

medium

The comments within the new _qk_rms_norm_fused_kernel Triton kernel are in Chinese (e.g., "Q 的头数 (用于判断边界)"). For consistency with the rest of the codebase and to ensure it's understandable for all contributors, please translate these comments to English.


torch.manual_seed(0)

# Simulated configuration: Batch=2, Seq=128, Head_Dim=128

medium

The comments in this new test file are in Chinese (e.g., "模拟配置..."). For consistency and to make the tests understandable to all contributors, please translate them to English.

