use torch ops by sufubao · Pull Request #1000 · ModelTC/LightLLM

sufubao · 2025-08-05T12:24:57Z

No description provided.

gemini-code-assist

Summary of Changes

Hello @sufubao, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the performance and efficiency of Mixture-of-Experts (MoE) layers, primarily for DeepseekV2/V3 and Qwen3-MoE models. It achieves this by introducing a 'fused shared experts' mechanism, integrating custom PyTorch operations for critical computations, and applying extensive performance tuning to underlying Triton kernels. Additionally, it includes improvements in memory management and FP8 quantization handling.

Highlights

MoE Optimization: Fused Shared Experts: Implemented support for 'fused shared experts' in Mixture-of-Experts (MoE) layers, particularly for DeepseekV2/V3 models. This allows a shared expert to be integrated directly into the MoE computation path, potentially improving efficiency. This involves modifying expert weight handling, topk_ids generation, and conditional processing of shared expert outputs.
PyTorch Custom Operations Integration: Refactored the core fused_experts logic to leverage custom PyTorch operations (torch.ops.lightllm.inplace_fused_experts_impl and torch.ops.lightllm.outplace_fused_experts_impl). A new utility (lightllm/utils/torch_ops_utils.py) was introduced to directly register these custom ops, aiming for better performance and integration with PyTorch's dispatch mechanism.
Triton Kernel Performance Tuning: Added numerous new Triton kernel configurations for grouped MoE GEMM operations, targeting NVIDIA H100 and H200 GPUs across various model dimensions and quantization settings. A faster Triton kernel (_silu_and_mul_kernel_fast) was introduced for SiLU and Mul operations, and the Deepseek2 rotary embedding kernel was split into specialized Q and K kernels for fine-grained optimization.
Memory and Quantization Enhancements: Optimized memory allocation for intermediate tensors in fused MoE operations by using a single large cache and torch.empty. The FP8 activation quantization (per_token_group_quant_fp8) was enhanced to support column-major and TMA-aligned scales, and the weight fusion logic was refactored to avoid torch.cat and _flatten_dense_tensors for improved efficiency.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in issue comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces significant enhancements for Mixture-of-Experts (MoE) models, primarily by adding support for fused shared experts and leveraging custom torch operations for performance. The changes span from kernel configurations and low-level Triton kernels to model-specific layer implementations. Key improvements include more efficient memory management through pre-allocation and buffer reuse, and performance optimizations in rotary embedding and SiLU kernels. My review focuses on ensuring the correctness of new kernel logic, improving type safety, and maintaining code clarity by avoiding hardcoded values. Overall, these are solid improvements, with a few areas that need attention for correctness and maintainability.

gemini-code-assist · 2025-08-05T12:28:27Z

        # 如果使用了 block wise 量化，分块大小不能超过 block size
        BLOCK_SIZE_K = min(BLOCK_SIZE_K, block_size_k)
-        assert BLOCK_SIZE_K == triton.next_power_of_2(BLOCK_SIZE_K)
+        BLOCK_SIZE_K = triton.next_power_of_2(BLOCK_SIZE_K//2 + 1)


The logic for calculating BLOCK_SIZE_K appears to be incorrect. The expression triton.next_power_of_2(BLOCK_SIZE_K // 2 + 1) does not reliably compute the next power of two for BLOCK_SIZE_K. For instance, if BLOCK_SIZE_K is 63, this expression evaluates to 32, which is smaller and likely not the intended behavior. This could lead to incorrect results or performance issues.

If the goal is to ensure BLOCK_SIZE_K is a power of two, it is recommended to use triton.next_power_of_2(BLOCK_SIZE_K).

BLOCK_SIZE_K = triton.next_power_of_2(BLOCK_SIZE_K)

gemini-code-assist · 2025-08-05T12:28:27Z

+    w2_scale: Optional[torch.Tensor] = None,
+    a1_scale: Optional[torch.Tensor] = None,
+    a2_scale: Optional[torch.Tensor] = None,
+)-> None:


The return type hint for outplace_fused_experts_impl is -> None, but the function's implementation returns a torch.Tensor. To ensure type correctness and improve code clarity, the type hint should be updated to -> torch.Tensor.

)-> torch.Tensor:

gemini-code-assist · 2025-08-05T12:28:27Z

-
-    grid = (triton.cdiv(head_num_q, BLOCK_HEAD), triton.cdiv(total_len, BLOCK_SEQ))
-    _rotary_kernel[grid](
+    grid = (triton.next_power_of_2(head_num_q), triton.cdiv(total_len, BLOCK_SEQ))


Using triton.next_power_of_2(head_num_q) for the grid size can launch more kernel instances than necessary if head_num_q is not a power of two. While the kernel includes a check to handle these extra instances, it is generally more efficient to launch the exact number of required blocks. Triton is optimized to handle non-power-of-two grid sizes effectively.

grid = (head_num_q, triton.cdiv(total_len, BLOCK_SEQ))

gemini-code-assist · 2025-08-05T12:28:27Z

            hidden_states,
            router_logits=router_logits,
-            top_k=self.num_experts_per_tok,
+            top_k=8,


The top_k parameter is hardcoded to 8. For better maintainability and to avoid magic numbers, it is recommended to use the value from the configuration, self.num_experts_per_tok. This ensures that the code remains correct if the model configuration is updated.

Suggested change

top_k=8,

top_k=self.num_experts_per_tok,

shihaobai and others added 21 commits June 13, 2025 20:22

speedup load

31c0729

speedup kernel

a440f2d

update tuning

bf33cdb

fix

eec0be4

add tuning config

7c04f48

improve the rotary_embed

9665ccc

improve per_token_quant

c3701e9

silu improve

14eb826

update

c1a02ac

add config

432c521

merge main

6362c4a

merge main

587e0a4

remove unused scripts

c1fe812

fix tuning

038e089

add fused expert

aeb80a5

update

7e299d6

merge main

7a8abec

back the fuese moe

8f24773

add grouped_moe_gemm_kernel configs (#994)

762ae34

optimize

48e1ee9

use torch ops

518b48d

sufubao closed this Aug 5, 2025

gemini-code-assist Bot reviewed Aug 5, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

use torch ops#1000

use torch ops#1000
sufubao wants to merge 21 commits into
mainfrom
torch.ops

sufubao commented Aug 5, 2025

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Aug 5, 2025

Uh oh!

gemini-code-assist Bot Aug 5, 2025

Uh oh!

gemini-code-assist Bot Aug 5, 2025

Uh oh!

gemini-code-assist Bot Aug 5, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

sufubao commented Aug 5, 2025

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Aug 5, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Aug 5, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Aug 5, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Aug 5, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants