
Revert "LoRA: Implementing kernels using CUBE computation unit"#430

Merged
iforgetmyname merged 1 commit into main from revert-384-vlserov/lora_kernels_cube
Apr 8, 2026

Conversation

@iforgetmyname (Collaborator)

Reverts #384

@gemini-code-assist Bot (Contributor) left a comment


Code Review

This pull request removes the sgemmc operations and their associated tiling logic from the codebase, including host-side API definitions and kernel implementations. It also refactors the LoRA index lookup mechanism in the sgemmv and sgmv kernels, replacing the BlockIterator with a new CopyInIndex function. The review flags a significant performance regression introduced by this refactor in several kernel files: the new lookup logic performs a linear search with synchronous global memory reads inside the token loop, increasing the algorithmic complexity from $O(N + M)$ to $O(N \times M)$.

Comment on lines +147 to +161
__aicore__ inline void CopyInIndex(const int64_t idx)
{
// Look up the LoRA index
int64_t weightIdx = idx;
uint64_t i = 0;
for (; i < seqLenGm_.GetSize(); i++) {
int64_t repeatValue = seqLenGm_.GetValue(i);
if (weightIdx >= repeatValue) {
weightIdx -= repeatValue;
continue;
}
break;
}
reqLoRAIndex_ = (i < seqLenGm_.GetSize()) ? loraIndicesGm_.GetValue(i) : -1;
}
Contributor


high

The CopyInIndex function introduces a significant performance regression compared to the BlockIterator it replaces. It performs a linear search through seqLenGm_ for every token index idx in the batch. Since this is called inside the main loop in Process(), the overall complexity becomes $O(N \times M)$ where $N$ is the number of tokens and $M$ is the number of LoRAs. Furthermore, seqLenGm_.GetValue(i) is a synchronous read from Global Memory, which is extremely slow on AI Core when executed repeatedly in a loop.

Consider implementing a stateful iteration approach that maintains the current block index and cumulative offset across iterations of the token loop, which would restore $O(N + M)$ complexity and significantly reduce Global Memory access overhead.
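The stateful approach can be illustrated with a host-side sketch. This is not Ascend C and the names (`LoRAIndexIterator`, `Next`) are hypothetical; in the real kernel, `seqLen` and `loraIndices` would be read from Global Memory once per request block rather than once per token, which is what restores the $O(N + M)$ bound.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical host-side sketch of the stateful lookup suggested above.
// Instead of re-scanning the sequence-length table for every token, the
// iterator remembers which request block the previous token fell into.
struct LoRAIndexIterator {
    const std::vector<int64_t>& seqLen;       // tokens per request block
    const std::vector<int64_t>& loraIndices;  // LoRA index per request block
    uint64_t block = 0;                       // current request block
    int64_t consumed = 0;                     // tokens consumed in this block

    // Advance by one token and return its LoRA index, or -1 past the end.
    // Each block is entered at most once across all calls, so processing
    // N tokens over M blocks costs O(N + M) total instead of O(N * M).
    int64_t Next() {
        while (block < seqLen.size() && consumed >= seqLen[block]) {
            consumed = 0;
            ++block;  // move on to the next request block
        }
        if (block >= seqLen.size()) {
            return -1;  // past the last token, mirroring the -1 sentinel
        }
        ++consumed;
        return loraIndices[block];
    }
};
```

For example, with `seqLen = {2, 3}` and `loraIndices = {7, 9}`, successive calls to `Next()` yield 7, 7, 9, 9, 9, then -1. The token loop in `Process()` would call `Next()` once per token in place of the per-token linear search in `CopyInIndex`.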

Comment on lines +129 to +143
__aicore__ inline void CopyInIndex(const int64_t idx)
{
// look up the LoRA index
int64_t weightIdx = idx;
uint64_t i = 0;
for (; i < seqLenGm_.GetSize(); i++) {
int64_t repeatValue = seqLenGm_.GetValue(i);
if (weightIdx >= repeatValue) {
weightIdx -= repeatValue;
continue;
}
break;
}
reqLoRAIndex_ = (i < seqLenGm_.GetSize()) ? loraIndicesGm_.GetValue(i) : -1;
}
Contributor


high

The CopyInIndex function introduces a significant performance regression by performing a linear search through seqLenGm_ for every token. This results in $O(N \times M)$ complexity with many synchronous Global Memory reads via GetValue().

Consider using a stateful approach to track the current LoRA block across token iterations, reducing the complexity to $O(N + M)$.

Comment on lines +131 to +145
__aicore__ inline void CopyInIndex(const int64_t idx)
{
// Look up the LoRA index
int64_t weightIdx = idx;
uint64_t i = 0;
for (; i < seqLenGm_.GetSize(); i++) {
int64_t repeatValue = seqLenGm_.GetValue(i);
if (weightIdx >= repeatValue) {
weightIdx -= repeatValue;
continue;
}
break;
}
reqLoRAIndex_ = (i < seqLenGm_.GetSize()) ? loraIndicesGm_.GetValue(i) : -1;
}
Contributor


high

The CopyInIndex function introduces a significant performance regression by performing a linear search through seqLenGm_ for every token. This results in $O(N \times M)$ complexity with many synchronous Global Memory reads via GetValue().

Consider using a stateful approach to track the current LoRA block across token iterations, reducing the complexity to $O(N + M)$.

Comment on lines +119 to +133
__aicore__ inline void CopyInIndex(const int64_t idx)
{
// look up the LoRA index
int64_t weightIdx = idx;
uint64_t i = 0;
for (; i < seqLenGm_.GetSize(); i++) {
int64_t repeatValue = seqLenGm_.GetValue(i);
if (weightIdx >= repeatValue) {
weightIdx -= repeatValue;
continue;
}
break;
}
reqLoRAIndex_ = (i < seqLenGm_.GetSize()) ? loraIndicesGm_.GetValue(i) : -1;
}
Contributor


high

The CopyInIndex function introduces a significant performance regression by performing a linear search through seqLenGm_ for every token. This results in $O(N \times M)$ complexity with many synchronous Global Memory reads via GetValue().

Consider using a stateful approach to track the current LoRA block across token iterations, reducing the complexity to $O(N + M)$.

@iforgetmyname iforgetmyname merged commit cbc0409 into main Apr 8, 2026
14 of 17 checks passed
