
Optimize qwen3 moe #1207

Open
WANDY666 wants to merge 5 commits into main from optimize_qwen3_moe

Conversation

WANDY666 (Contributor) commented on Feb 6, 2026

No description provided.

gemini-code-assist (Contributor)

Summary of Changes

Hello @WANDY666, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the performance and flexibility of the Qwen3 Mixture-of-Experts (MoE) model within LightLLM. It introduces a highly optimized fused RMSNorm Triton kernel for Query and Key tensors, streamlines MoE expert selection, and expands the autotune configurations for various Triton operations, leading to more efficient GPU utilization. Additionally, a new random sampling method is integrated into the inference pipeline, offering alternative token generation strategies.

Highlights

  • MoE Expert Selection Optimization: Conditional application of routed_scaling_factor to topk_weights in both deepgemm_impl.py and triton_impl.py to avoid unnecessary multiplication when the factor is 1.0, improving efficiency.
  • Triton Kernel Refinement: Removed redundant .float() conversion for gating_output in fused_moe/topk_select.py, streamlining data type handling and potentially improving performance.
  • Fused QK RMSNorm Kernel: Introduced a new Triton kernel (_qk_rms_norm_fused_kernel) and a corresponding Python wrapper (qk_rmsnorm_fused_forward) that perform in-place RMSNorm on the Query (Q) and Key (K) tensors in a single launch, which is especially beneficial for Grouped Query Attention (GQA) shapes (a reference sketch of the computation follows this list).
  • Qwen3 MoE Integration: Updated the Qwen3 MoE transformer layer to utilize the newly implemented qk_rmsnorm_fused_forward for Q and K normalization, replacing separate calls for improved performance.
  • Expanded Autotune Configurations: Added numerous new autotune kernel configurations across several Triton kernel types (grouped_matmul, moe_align_fused, moe_sum_reduce, silu_and_mul_fwd) for various tensor dimensions and parameters, aiming to improve performance on NVIDIA H200 GPUs.
  • Random Sampling Strategy: Implemented a new _random_sample function and integrated it into generic_post_process.py for model inference, allowing for an alternative token sampling approach based on the exponential distribution.
  • Unit Testing for Fused RMSNorm: Added a dedicated unit test (test_qk_rmsnorm_fused.py) to validate the correctness and numerical stability of the new fused QK RMSNorm kernel against a reference implementation.
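For reviewers who want to sanity-check the fused kernel, its output should match applying per-head RMSNorm to Q and K independently; the gain comes from covering both tensors in a single kernel launch. Below is a minimal PyTorch reference sketch, not the Triton kernel itself; the helper name qk_rmsnorm_reference, the tensor layouts, and the eps default are assumptions for illustration.

import torch

def qk_rmsnorm_reference(q, k, q_weight, k_weight, eps=1e-6):
    # Assumed layouts: q is (num_tokens, num_heads_q, head_dim),
    # k is (num_tokens, num_heads_k, head_dim),
    # q_weight / k_weight are (head_dim,) learned scale vectors.
    def rms_norm(x, w):
        x_f = x.float()
        variance = x_f.pow(2).mean(dim=-1, keepdim=True)
        return (x_f * torch.rsqrt(variance + eps)).to(x.dtype) * w
    return rms_norm(q, q_weight), rms_norm(k, k_weight)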


Changelog
  • lightllm/common/basemodel/layer_weights/meta_weights/fused_moe/impl/deepgemm_impl.py
    • Added conditional check if self.routed_scaling_factor != 1.0: before applying scaling to topk_weights.
  • lightllm/common/basemodel/layer_weights/meta_weights/fused_moe/impl/triton_impl.py
    • Added conditional check if self.routed_scaling_factor != 1.0: before applying scaling to topk_weights.
  • lightllm/common/basemodel/triton_kernel/fused_moe/topk_select.py
    • Changed gating_output.float() to gating_output in sgl_ops.topk_softmax call.
  • lightllm/common/basemodel/triton_kernel/norm/qk_norm.py
    • Added _qk_rms_norm_fused_kernel (Triton JIT kernel) for parallel RMSNorm on Q and K.
    • Added qk_rmsnorm_fused_forward function to orchestrate the fused QK RMSNorm.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/grouped_matmul:v1/{K=192,N=2048,expert_num=128,mul_routed_weight=true,out_dtype=torch.bfloat16,topk_num=1,use_fp8_w8a8=false}_NVIDIA_H200.json
    • Added new kernel configurations for K values 192, 384, 640, 768, 896, 96.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/grouped_matmul:v1/{K=2048,N=384,expert_num=128,mul_routed_weight=false,out_dtype=torch.bfloat16,topk_num=8,use_fp8_w8a8=false}_NVIDIA_H200.json
    • Added new kernel configurations for K values 112, 12, 24, 48, 80, 96.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/moe_align_fused:v1/{topk_num=8}_NVIDIA_H200.json
    • Added new kernel configurations for topk_num values 112, 12, 2, 24, 48, 80, 96.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/moe_sum_reduce:v1/{hidden_dim=2048,out_dtype=torch.bfloat16,topk_num=8}_NVIDIA_H200.json
    • Added new kernel configurations for hidden_dim values 112, 12, 24, 48, 80, 96.
  • lightllm/common/triton_utils/autotune_kernel_configs/triton_3.5.1/NVIDIA_H200/silu_and_mul_fwd:v1/{N=192,out_dtype=torch.bfloat16}_NVIDIA_H200.json
    • Added new kernel configurations for N values 192, 384, 640, 768, 896, 96.
  • lightllm/models/qwen3_moe/layer_infer/transformer_layer_infer.py
    • Imported qk_rmsnorm_fused_forward.
    • Replaced separate layer_weight.q_norm_weight_ and layer_weight.k_norm_weight_ calls with a single qk_rmsnorm_fused_forward call.
  • lightllm/server/router/model_infer/mode_backend/generic_post_process.py
    • Added is_all_random to the return of _get_post_sample_tensors.
    • Implemented _random_sample function for exponential-distribution sampling (a short sketch of the underlying trick follows this changelog).
    • Added logic to use _random_sample if is_all_random is true.
    • Updated _get_post_sample_tensors to set is_all_random based on top_k_val and shm_param.top_p.
  • unit_tests/common/basemodel/triton_kernel/test_qk_rmsnorm_fused.py
    • Added a new unit test file to verify qk_rmsnorm_fused_forward functionality and accuracy.
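The new random-sampling fast path relies on a standard identity: dividing each probability by an independent Exponential(1) draw and taking the argmax selects index i with probability probs[i], so the result matches torch.multinomial without launching a multinomial kernel. A minimal sketch, assuming probs is a (batch, vocab) tensor of row-normalized probabilities (the name random_sample_sketch is illustrative, not the PR's code):

import torch

def random_sample_sketch(probs):
    # Divide by Exponential(1) noise and take the argmax; index i wins
    # with probability probs[i], equivalent to torch.multinomial(probs, 1).
    # The out-of-place div leaves the caller's probs tensor untouched.
    noise = torch.empty_like(probs).exponential_(1.0)
    return probs.div(noise).argmax(dim=-1).view(-1)

# Illustrative usage:
# probs = torch.softmax(logits, dim=-1)
# next_token_ids = random_sample_sketch(probs)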
Activity
  • The pull request was initiated by WANDY666.
  • New Triton kernels and autotune configurations were introduced and integrated.
  • A new unit test was added to ensure the correctness of the fused QK RMSNorm.
  • No specific comments or review activities are detailed in the provided context.

gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request introduces several optimizations for the qwen3 MoE model, including a new fused Triton kernel for QK RMSNorm to reduce kernel launch overhead, a new fast path for random sampling, and other micro-optimizations. The changes are well-targeted for performance. My review has identified a critical bug in the new random sampling logic that results in incorrect log probabilities, and I have provided a fix. I also noted some maintainability issues with non-English comments in the newly added code, which should be addressed for consistency.

def _random_sample(probs: torch.Tensor):
q = torch.empty_like(probs)
q.exponential_()
return probs.div_(q).argmax(dim=-1).view(-1)

critical

The _random_sample function modifies the probs tensor in-place using div_. However, the caller in the sample function reuses this modified probs tensor on line 74 to calculate batch_next_token_probs, which will lead to incorrect log probabilities being returned. Please use the out-of-place div operation to avoid modifying the input tensor.

Suggested change
return probs.div_(q).argmax(dim=-1).view(-1)
return probs.div(q).argmax(dim=-1).view(-1)
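To make the failure mode concrete, here is an illustrative sketch of the pattern described above; the variable names and the gather step stand in for the caller's log-probability lookup and are not a copy of the actual sample() code.

import torch

probs = torch.softmax(torch.randn(2, 5), dim=-1)
noise = torch.empty_like(probs).exponential_()

# Out-of-place div: probs is left intact for later use.
token_ids = probs.div(noise).argmax(dim=-1)

# The caller subsequently reads the chosen tokens' probabilities from probs;
# with div_ this would read from probs / noise instead and be wrong.
token_probs = torch.gather(probs, 1, token_ids.view(-1, 1)).view(-1)
log_probs = torch.log(token_probs)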

stride_k_row,
stride_k_col,
# Dimensions
num_heads_q: tl.constexpr, # number of Q heads (used for boundary checks)

medium

The comments within the new _qk_rms_norm_fused_kernel Triton kernel are in Chinese (e.g., "Q 的头数 (用于判断边界)"). For consistency with the rest of the codebase and to ensure it's understandable for all contributors, please translate these comments to English.


torch.manual_seed(0)

# Simulated configuration: Batch=2, Seq=128, Head_Dim=128

medium

The comments in this new test file are in Chinese (e.g., "模拟配置..."). For consistency and to make the tests understandable to all contributors, please translate them to English.

