
Revert "LoRA: Implementing kernels using CUBE computation unit"#430

Merged
iforgetmyname merged 1 commit into main from revert-384-vlserov/lora_kernels_cube
Apr 8, 2026

Conversation

@iforgetmyname (Collaborator)

Reverts #384

@gemini-code-assist Bot (Contributor) left a comment


Code Review

This pull request removes the sgemmc operations and their associated tiling logic from the codebase, including host-side API definitions and kernel implementations. It also refactors the LoRA index lookup mechanism in the sgemmv and sgmv kernels, replacing the BlockIterator with a new CopyInIndex function. The review flags a significant performance regression introduced by this refactor in several kernel files: the new lookup logic performs a linear search with synchronous global memory reads inside the token loop, increasing the algorithmic complexity from $O(N + M)$ to $O(N \times M)$.

Comment on lines +147 to +161
__aicore__ inline void CopyInIndex(const int64_t idx)
{
// Look up the LoRA index
int64_t weightIdx = idx;
uint64_t i = 0;
for (; i < seqLenGm_.GetSize(); i++) {
int64_t repeatValue = seqLenGm_.GetValue(i);
if (weightIdx >= repeatValue) {
weightIdx -= repeatValue;
continue;
}
break;
}
reqLoRAIndex_ = (i < seqLenGm_.GetSize()) ? loraIndicesGm_.GetValue(i) : -1;
}
Contributor


high

The CopyInIndex function introduces a significant performance regression compared to the BlockIterator it replaces. It performs a linear search through seqLenGm_ for every token index idx in the batch. Since this is called inside the main loop in Process(), the overall complexity becomes $O(N \times M)$ where $N$ is the number of tokens and $M$ is the number of LoRAs. Furthermore, seqLenGm_.GetValue(i) is a synchronous read from Global Memory, which is extremely slow on AI Core when executed repeatedly in a loop.

Consider implementing a stateful iteration approach that maintains the current block index and cumulative offset across iterations of the token loop, which would restore $O(N + M)$ complexity and significantly reduce Global Memory access overhead.
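The stateful approach can be illustrated with a host-side sketch. This is not Ascend C and the names (`LoRAIndexIterator`, `Next`) are hypothetical; in the real kernel, `seqLen` and `loraIndices` would be read from Global Memory once per request block rather than once per token, which is what restores the $O(N + M)$ bound.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical host-side sketch of the stateful lookup suggested above.
// Instead of re-scanning the sequence-length table for every token, the
// iterator remembers which request block the previous token fell into.
struct LoRAIndexIterator {
    const std::vector<int64_t>& seqLen;       // tokens per request block
    const std::vector<int64_t>& loraIndices;  // LoRA index per request block
    uint64_t block = 0;                       // current request block
    int64_t consumed = 0;                     // tokens consumed in this block

    // Advance by one token and return its LoRA index, or -1 past the end.
    // Each block is entered at most once across all calls, so processing
    // N tokens over M blocks costs O(N + M) total instead of O(N * M).
    int64_t Next() {
        while (block < seqLen.size() && consumed >= seqLen[block]) {
            consumed = 0;
            ++block;  // move on to the next request block
        }
        if (block >= seqLen.size()) {
            return -1;  // past the last token, mirroring the -1 sentinel
        }
        ++consumed;
        return loraIndices[block];
    }
};
```

For example, with `seqLen = {2, 3}` and `loraIndices = {7, 9}`, successive calls to `Next()` yield 7, 7, 9, 9, 9, then -1. The token loop in `Process()` would call `Next()` once per token in place of the per-token linear search in `CopyInIndex`.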

Comment on lines +129 to +143
__aicore__ inline void CopyInIndex(const int64_t idx)
{
// look up the LoRA index
int64_t weightIdx = idx;
uint64_t i = 0;
for (; i < seqLenGm_.GetSize(); i++) {
int64_t repeatValue = seqLenGm_.GetValue(i);
if (weightIdx >= repeatValue) {
weightIdx -= repeatValue;
continue;
}
break;
}
reqLoRAIndex_ = (i < seqLenGm_.GetSize()) ? loraIndicesGm_.GetValue(i) : -1;
}
Contributor


high

The CopyInIndex function introduces a significant performance regression by performing a linear search through seqLenGm_ for every token. This results in $O(N \times M)$ complexity with many synchronous Global Memory reads via GetValue().

Consider using a stateful approach to track the current LoRA block across token iterations, reducing the complexity to $O(N + M)$.

Comment on lines +131 to +145
__aicore__ inline void CopyInIndex(const int64_t idx)
{
// Look up the LoRA index
int64_t weightIdx = idx;
uint64_t i = 0;
for (; i < seqLenGm_.GetSize(); i++) {
int64_t repeatValue = seqLenGm_.GetValue(i);
if (weightIdx >= repeatValue) {
weightIdx -= repeatValue;
continue;
}
break;
}
reqLoRAIndex_ = (i < seqLenGm_.GetSize()) ? loraIndicesGm_.GetValue(i) : -1;
}
Contributor


high

The CopyInIndex function introduces a significant performance regression by performing a linear search through seqLenGm_ for every token. This results in $O(N \times M)$ complexity with many synchronous Global Memory reads via GetValue().

Consider using a stateful approach to track the current LoRA block across token iterations, reducing the complexity to $O(N + M)$.

Comment on lines +119 to +133
__aicore__ inline void CopyInIndex(const int64_t idx)
{
// look up the LoRA index
int64_t weightIdx = idx;
uint64_t i = 0;
for (; i < seqLenGm_.GetSize(); i++) {
int64_t repeatValue = seqLenGm_.GetValue(i);
if (weightIdx >= repeatValue) {
weightIdx -= repeatValue;
continue;
}
break;
}
reqLoRAIndex_ = (i < seqLenGm_.GetSize()) ? loraIndicesGm_.GetValue(i) : -1;
}
Contributor


high

The CopyInIndex function introduces a significant performance regression by performing a linear search through seqLenGm_ for every token. This results in $O(N \times M)$ complexity with many synchronous Global Memory reads via GetValue().

Consider using a stateful approach to track the current LoRA block across token iterations, reducing the complexity to $O(N + M)$.

@iforgetmyname iforgetmyname merged commit cbc0409 into main Apr 8, 2026
14 of 17 checks passed
