Conversation
There was a problem hiding this comment.
Summary of Changes
Hello @sufubao, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the performance and efficiency of Mixture-of-Experts (MoE) layers, primarily for DeepseekV2/V3 and Qwen3-MoE models. It achieves this by introducing a 'fused shared experts' mechanism, integrating custom PyTorch operations for critical computations, and applying extensive performance tuning to underlying Triton kernels. Additionally, it includes improvements in memory management and FP8 quantization handling.
Highlights
- MoE Optimization: Fused Shared Experts: Implemented support for 'fused shared experts' in Mixture-of-Experts (MoE) layers, particularly for DeepseekV2/V3 models. This allows a shared expert to be integrated directly into the MoE computation path, potentially improving efficiency. This involves modifying expert weight handling,
topk_idsgeneration, and conditional processing of shared expert outputs. - PyTorch Custom Operations Integration: Refactored the core
fused_expertslogic to leverage custom PyTorch operations (torch.ops.lightllm.inplace_fused_experts_implandtorch.ops.lightllm.outplace_fused_experts_impl). A new utility (lightllm/utils/torch_ops_utils.py) was introduced to directly register these custom ops, aiming for better performance and integration with PyTorch's dispatch mechanism. - Triton Kernel Performance Tuning: Added numerous new Triton kernel configurations for grouped MoE GEMM operations, targeting NVIDIA H100 and H200 GPUs across various model dimensions and quantization settings. A faster Triton kernel (
_silu_and_mul_kernel_fast) was introduced for SiLU and Mul operations, and the Deepseek2 rotary embedding kernel was split into specialized Q and K kernels for fine-grained optimization. - Memory and Quantization Enhancements: Optimized memory allocation for intermediate tensors in fused MoE operations by using a single large cache and
torch.empty. The FP8 activation quantization (per_token_group_quant_fp8) was enhanced to support column-major and TMA-aligned scales, and the weight fusion logic was refactored to avoidtorch.catand_flatten_dense_tensorsfor improved efficiency.
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | /gemini review |
Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary |
Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help |
Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
-
Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
There was a problem hiding this comment.
Code Review
This pull request introduces significant enhancements for Mixture-of-Experts (MoE) models, primarily by adding support for fused shared experts and leveraging custom torch operations for performance. The changes span from kernel configurations and low-level Triton kernels to model-specific layer implementations. Key improvements include more efficient memory management through pre-allocation and buffer reuse, and performance optimizations in rotary embedding and SiLU kernels. My review focuses on ensuring the correctness of new kernel logic, improving type safety, and maintaining code clarity by avoiding hardcoded values. Overall, these are solid improvements, with a few areas that need attention for correctness and maintainability.
| # 如果使用了 block wise 量化,分块大小不能超过 block size | ||
| BLOCK_SIZE_K = min(BLOCK_SIZE_K, block_size_k) | ||
| assert BLOCK_SIZE_K == triton.next_power_of_2(BLOCK_SIZE_K) | ||
| BLOCK_SIZE_K = triton.next_power_of_2(BLOCK_SIZE_K//2 + 1) |
There was a problem hiding this comment.
The logic for calculating BLOCK_SIZE_K appears to be incorrect. The expression triton.next_power_of_2(BLOCK_SIZE_K // 2 + 1) does not reliably compute the next power of two for BLOCK_SIZE_K. For instance, if BLOCK_SIZE_K is 63, this expression evaluates to 32, which is smaller and likely not the intended behavior. This could lead to incorrect results or performance issues.
If the goal is to ensure BLOCK_SIZE_K is a power of two, it is recommended to use triton.next_power_of_2(BLOCK_SIZE_K).
BLOCK_SIZE_K = triton.next_power_of_2(BLOCK_SIZE_K)| w2_scale: Optional[torch.Tensor] = None, | ||
| a1_scale: Optional[torch.Tensor] = None, | ||
| a2_scale: Optional[torch.Tensor] = None, | ||
| )-> None: |
|
|
||
| grid = (triton.cdiv(head_num_q, BLOCK_HEAD), triton.cdiv(total_len, BLOCK_SEQ)) | ||
| _rotary_kernel[grid]( | ||
| grid = (triton.next_power_of_2(head_num_q), triton.cdiv(total_len, BLOCK_SEQ)) |
There was a problem hiding this comment.
Using triton.next_power_of_2(head_num_q) for the grid size can launch more kernel instances than necessary if head_num_q is not a power of two. While the kernel includes a check to handle these extra instances, it is generally more efficient to launch the exact number of required blocks. Triton is optimized to handle non-power-of-two grid sizes effectively.
grid = (head_num_q, triton.cdiv(total_len, BLOCK_SEQ))| hidden_states, | ||
| router_logits=router_logits, | ||
| top_k=self.num_experts_per_tok, | ||
| top_k=8, |
There was a problem hiding this comment.
The top_k parameter is hardcoded to 8. For better maintainability and to avoid magic numbers, it is recommended to use the value from the configuration, self.num_experts_per_tok. This ensures that the code remains correct if the model configuration is updated.
| top_k=8, | |
| top_k=self.num_experts_per_tok, |
No description provided.