Skip to content

use torch ops#1000

Closed
sufubao wants to merge 21 commits into
mainfrom
torch.ops
Closed

use torch ops#1000
sufubao wants to merge 21 commits into
mainfrom
torch.ops

Conversation

@sufubao

@sufubao sufubao commented Aug 5, 2025

Copy link
Copy Markdown
Collaborator

No description provided.

@sufubao sufubao closed this Aug 5, 2025

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary of Changes

Hello @sufubao, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the performance and efficiency of Mixture-of-Experts (MoE) layers, primarily for DeepseekV2/V3 and Qwen3-MoE models. It achieves this by introducing a 'fused shared experts' mechanism, integrating custom PyTorch operations for critical computations, and applying extensive performance tuning to underlying Triton kernels. Additionally, it includes improvements in memory management and FP8 quantization handling.

Highlights

  • MoE Optimization: Fused Shared Experts: Implemented support for 'fused shared experts' in Mixture-of-Experts (MoE) layers, particularly for DeepseekV2/V3 models. This allows a shared expert to be integrated directly into the MoE computation path, potentially improving efficiency. This involves modifying expert weight handling, topk_ids generation, and conditional processing of shared expert outputs.
  • PyTorch Custom Operations Integration: Refactored the core fused_experts logic to leverage custom PyTorch operations (torch.ops.lightllm.inplace_fused_experts_impl and torch.ops.lightllm.outplace_fused_experts_impl). A new utility (lightllm/utils/torch_ops_utils.py) was introduced to directly register these custom ops, aiming for better performance and integration with PyTorch's dispatch mechanism.
  • Triton Kernel Performance Tuning: Added numerous new Triton kernel configurations for grouped MoE GEMM operations, targeting NVIDIA H100 and H200 GPUs across various model dimensions and quantization settings. A faster Triton kernel (_silu_and_mul_kernel_fast) was introduced for SiLU and Mul operations, and the Deepseek2 rotary embedding kernel was split into specialized Q and K kernels for fine-grained optimization.
  • Memory and Quantization Enhancements: Optimized memory allocation for intermediate tensors in fused MoE operations by using a single large cache and torch.empty. The FP8 activation quantization (per_token_group_quant_fp8) was enhanced to support column-major and TMA-aligned scales, and the weight fusion logic was refactored to avoid torch.cat and _flatten_dense_tensors for improved efficiency.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in issue comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces significant enhancements for Mixture-of-Experts (MoE) models, primarily by adding support for fused shared experts and leveraging custom torch operations for performance. The changes span from kernel configurations and low-level Triton kernels to model-specific layer implementations. Key improvements include more efficient memory management through pre-allocation and buffer reuse, and performance optimizations in rotary embedding and SiLU kernels. My review focuses on ensuring the correctness of new kernel logic, improving type safety, and maintaining code clarity by avoiding hardcoded values. Overall, these are solid improvements, with a few areas that need attention for correctness and maintainability.

# 如果使用了 block wise 量化,分块大小不能超过 block size
BLOCK_SIZE_K = min(BLOCK_SIZE_K, block_size_k)
assert BLOCK_SIZE_K == triton.next_power_of_2(BLOCK_SIZE_K)
BLOCK_SIZE_K = triton.next_power_of_2(BLOCK_SIZE_K//2 + 1)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The logic for calculating BLOCK_SIZE_K appears to be incorrect. The expression triton.next_power_of_2(BLOCK_SIZE_K // 2 + 1) does not reliably compute the next power of two for BLOCK_SIZE_K. For instance, if BLOCK_SIZE_K is 63, this expression evaluates to 32, which is smaller and likely not the intended behavior. This could lead to incorrect results or performance issues.

If the goal is to ensure BLOCK_SIZE_K is a power of two, it is recommended to use triton.next_power_of_2(BLOCK_SIZE_K).

BLOCK_SIZE_K = triton.next_power_of_2(BLOCK_SIZE_K)

w2_scale: Optional[torch.Tensor] = None,
a1_scale: Optional[torch.Tensor] = None,
a2_scale: Optional[torch.Tensor] = None,
)-> None:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The return type hint for outplace_fused_experts_impl is -> None, but the function's implementation returns a torch.Tensor. To ensure type correctness and improve code clarity, the type hint should be updated to -> torch.Tensor.

)-> torch.Tensor:


grid = (triton.cdiv(head_num_q, BLOCK_HEAD), triton.cdiv(total_len, BLOCK_SEQ))
_rotary_kernel[grid](
grid = (triton.next_power_of_2(head_num_q), triton.cdiv(total_len, BLOCK_SEQ))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using triton.next_power_of_2(head_num_q) for the grid size can launch more kernel instances than necessary if head_num_q is not a power of two. While the kernel includes a check to handle these extra instances, it is generally more efficient to launch the exact number of required blocks. Triton is optimized to handle non-power-of-two grid sizes effectively.

grid = (head_num_q, triton.cdiv(total_len, BLOCK_SEQ))

hidden_states,
router_logits=router_logits,
top_k=self.num_experts_per_tok,
top_k=8,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The top_k parameter is hardcoded to 8. For better maintainability and to avoid magic numbers, it is recommended to use the value from the configuration, self.num_experts_per_tok. This ensures that the code remains correct if the model configuration is updated.

Suggested change
top_k=8,
top_k=self.num_experts_per_tok,

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants