
LoRA: Implementing kernels using CUBE computation unit#384

Merged
RuixuanZhang06 merged 5 commits into sgl-project:main from vlserov:vlserov/lora_kernels_cube
Apr 8, 2026
Conversation

Contributor

@vlserov commented Feb 27, 2026

Implement the LoRA kernels using the CUBE computation unit instead of the VECTOR computation unit.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request significantly enhances LoRA kernel performance by migrating key operations from the VECTOR computation unit to the more specialized CUBE computation unit. This involves introducing new sgemmc_expand and sgemmc_shrink kernels, complete with their host-side interfaces and dedicated tiling mechanisms. Additionally, existing LoRA kernels were refactored to share common utilities, streamlining the codebase and preparing for future optimizations.

Highlights

  • New LoRA Kernels with CUBE Unit: Introduced new sgemmc_expand and sgemmc_shrink operations, specifically designed to utilize the CUBE computation unit for LoRA (Low-Rank Adaptation) kernels, moving away from or complementing existing VECTOR unit implementations.
  • Tiling Infrastructure for CUBE Operations: Added dedicated tiling logic and data structures (sgemmc_tiling.cpp, sgemmc_tiling.h, sgemmc_tiling_data.h) to support the configuration and execution of the new CUBE-based sgemmc operations.
  • Kernel Code Refactoring and Reusability: Refactored existing sgmv_expand_kernel.cpp, sgmv_shrink_kernel.cpp, sgemmv_expand_kernel.cpp, and sgemmv_shrink_kernel.cpp to leverage a new common BlockIterator utility defined in lora_common_kernel.h, improving code modularity and reducing redundancy.
  • Build System and API Integration: Updated CMakeLists.txt to include the newly added host and kernel source files for sgemmc operations, linked necessary libraries, and registered the new sgemmc_expand and sgemmc_shrink functions within the PyTorch extension API (pytorch_extensions.cpp) and public headers (sgl_kenel_npu_ops.h).
  • Utility Enhancements: Enhanced common_tiling.h with a DataType enum and torch_helper.h with a ConvertDataType utility, alongside a new common_tiling_kernel.h for kernel-side tiling data copying, providing foundational support for the new kernel implementations.


Changelog
  • csrc/CMakeLists.txt
    • Added sgemmc_expand.cpp, sgemmc_shrink.cpp, and sgemmc_tiling.cpp to host source files.
    • Included sgemmc_expand_kernel.cpp and sgemmc_shrink_kernel.cpp in workspace kernel sources.
    • Added utils/kernel to include directories for both no_workspace_kernel and workspace_kernel.
    • Linked host_intf_pub to the OP_PLUGIN_NAME target.
    • Added ${ASCEND_INCLUDE_DIR} to target include directories.
  • csrc/lora/op_host/sgemmc_expand.cpp
    • Added new host-side implementation for sgemmc_expand using CUBE computation unit.
  • csrc/lora/op_host/sgemmc_shrink.cpp
    • Added new host-side implementation for sgemmc_shrink using CUBE computation unit.
  • csrc/lora/op_host/tiling/sgemmc_tiling.cpp
    • Added implementation for GenerateTiling function to configure CUBE tiling parameters for sgemmc operations.
  • csrc/lora/op_host/tiling/sgemmc_tiling.h
    • Added header file declaring the GenerateTiling function for sgemmc operations.
  • csrc/lora/op_host/tiling/sgemmc_tiling_data.h
    • Added data structure SGEMMCTilingData to hold tiling information for sgemmc kernels.
  • csrc/lora/op_kernel/lora_common_kernel.h
    • Added new header file defining BlockIterator for common LoRA kernel utilities.
  • csrc/lora/op_kernel/sgemmc_expand_kernel.cpp
    • Added new kernel-side implementation for sgemmc_expand utilizing the CUBE computation unit.
  • csrc/lora/op_kernel/sgemmc_shrink_kernel.cpp
    • Added new kernel-side implementation for sgemmc_shrink utilizing the CUBE computation unit.
  • csrc/lora/op_kernel/sgemmv_expand_kernel.cpp
    • Included lora_common_kernel.h.
    • Updated Process method to use lora_common::BlockIterator for LoRA index lookup.
    • Removed redundant CopyInIndex private method.
  • csrc/lora/op_kernel/sgemmv_shrink_kernel.cpp
    • Included lora_common_kernel.h.
    • Updated Process method to use lora_common::BlockIterator for LoRA index lookup.
    • Removed redundant CopyInIndex private method.
  • csrc/lora/op_kernel/sgmv_expand_kernel.cpp
    • Included lora_common_kernel.h.
    • Updated Process method to use lora_common::BlockIterator for LoRA index lookup.
    • Removed redundant CopyInIndex private method.
  • csrc/lora/op_kernel/sgmv_shrink_kernel.cpp
    • Included lora_common_kernel.h.
    • Updated Process method to use lora_common::BlockIterator for LoRA index lookup.
    • Removed redundant CopyInIndex private method.
  • csrc/pytorch_extensions.cpp
    • Registered sgemmc_expand and sgemmc_shrink operations in the PyTorch NPU library fragment.
    • Implemented sgemmc_expand and sgemmc_shrink functions for the PrivateUse1 backend.
  • csrc/utils/common_tiling.h
    • Added DataType enum to host_utils namespace.
  • csrc/utils/kernel/common_tiling_kernel.h
    • Added new header file defining CopyTiling utility for kernel-side tiling data handling.
  • csrc/utils/torch_helper.h
    • Included common_tiling.h.
    • Added ConvertDataType static method to TorchNpuHelper to convert at::ScalarType to host_utils::DataType.
  • include/sgl_kenel_npu_ops.h
    • Declared sgemmc_expand and sgemmc_shrink functions.
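For context on the tiling additions in the changelog, here is a minimal host-side sketch in the spirit of the `GenerateTiling` / `SGEMMCTilingData` pair: Ascend's CUBE unit multiplies fixed-size fractal tiles (16x16 for fp16), so a tiling generator typically rounds each GEMM dimension up to the fractal and records per-dimension tile counts for the kernel to consume. The struct fields, function name, and the fractal constant below are illustrative assumptions, not the actual contents of `sgemmc_tiling_data.h`.

```cpp
#include <cassert>
#include <cstdint>

// Assumed CUBE fractal edge length (16x16 tiles for fp16 on Ascend).
constexpr uint32_t kFractal = 16;

// Hypothetical tiling record, loosely modeled on SGEMMCTilingData.
struct SgemmcTilingSketch {
    uint32_t mTiles, nTiles, kTiles;        // tile counts per GEMM dimension
    uint32_t mAligned, nAligned, kAligned;  // dims padded up to the fractal
};

inline uint32_t CeilDiv(uint32_t a, uint32_t b) { return (a + b - 1) / b; }

// Host-side sketch: round each dimension of an (M,K) x (K,N) GEMM up to
// the fractal size and record how many CUBE tiles cover each dimension.
SgemmcTilingSketch GenerateTilingSketch(uint32_t m, uint32_t n, uint32_t k) {
    SgemmcTilingSketch t{};
    t.mTiles = CeilDiv(m, kFractal);
    t.nTiles = CeilDiv(n, kFractal);
    t.kTiles = CeilDiv(k, kFractal);
    t.mAligned = t.mTiles * kFractal;
    t.nAligned = t.nTiles * kFractal;
    t.kAligned = t.kTiles * kFractal;
    return t;
}
```

A structure like this also explains the new `common_tiling_kernel.h` `CopyTiling` utility mentioned above: the host serializes the tiling record, and the kernel side copies it out of the tiling buffer before launching the blocked matmul.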

Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces new LoRA kernels (sgemmc) that leverage the CUBE computation unit for better performance, which is a solid architectural improvement. The refactoring of existing kernels to use a common BlockIterator is also a good move for maintainability. However, my review uncovered several critical issues in the new host-side and kernel-side implementations: incorrect variable initializations, swapped function arguments, use of uninitialized variables, and missing template parameters, which will likely cause compilation errors or incorrect runtime behavior. I've also noted some medium-severity issues related to code quality, such as improper error handling and dead code, and the new BlockIterator is used incorrectly in places. I've provided specific suggestions to address the critical problems.

Comment thread csrc/lora/op_kernel/sgemmc_expand_kernel.cpp Outdated
Comment thread csrc/lora/op_kernel/sgemmv_expand_kernel.cpp
Comment thread csrc/lora/op_kernel/sgmv_shrink_kernel.cpp
Comment thread csrc/lora/op_kernel/sgemmc_expand_kernel.cpp Outdated
Comment thread csrc/lora/op_kernel/sgemmc_expand_kernel.cpp Outdated
Comment thread csrc/lora/op_kernel/sgemmv_shrink_kernel.cpp
Comment thread csrc/lora/op_kernel/sgemmc_shrink_kernel.cpp Outdated
Comment thread csrc/lora/op_kernel/sgemmc_shrink_kernel.cpp Outdated
Comment thread csrc/lora/op_kernel/sgemmc_expand_kernel.cpp Outdated
Comment thread csrc/lora/op_host/tiling/sgemmc_tiling.cpp Outdated
@vlserov vlserov force-pushed the vlserov/lora_kernels_cube branch 2 times, most recently from ba1b713 to 573bb5d Compare March 2, 2026 11:11
@vlserov vlserov marked this pull request as ready for review March 2, 2026 11:12
RuixuanZhang06
RuixuanZhang06 previously approved these changes Mar 10, 2026
@vlserov vlserov force-pushed the vlserov/lora_kernels_cube branch 2 times, most recently from b82a7ca to 9cc6799 Compare March 19, 2026 05:01
@vlserov vlserov force-pushed the vlserov/lora_kernels_cube branch from 9cc6799 to 27a056c Compare March 20, 2026 11:58
RuixuanZhang06
RuixuanZhang06 previously approved these changes Mar 30, 2026
@vlserov vlserov force-pushed the vlserov/lora_kernels_cube branch from 8d51c08 to a7455c6 Compare April 2, 2026 04:34
@RuixuanZhang06 RuixuanZhang06 merged commit 7b93364 into sgl-project:main Apr 8, 2026
5 of 7 checks passed
iforgetmyname added a commit that referenced this pull request Apr 8, 2026
iforgetmyname added a commit that referenced this pull request Apr 8, 2026