Revert "LoRA: Implementing kernels using CUBE computation unit" #430
iforgetmyname merged 1 commit into main from
Conversation
This reverts commit 7b93364.
Code Review
This pull request removes the sgemmc operations and their associated tiling logic from the codebase, including host-side API definitions and kernel implementations. It also refactors the LoRA index lookup mechanism in the sgemmv and sgmv kernels by replacing the BlockIterator with a new CopyInIndex function. Feedback is provided regarding a significant performance regression introduced by this refactor in several kernel files: the new lookup logic performs a linear search with synchronous global memory reads inside the token loop, increasing the algorithmic complexity from amortized O(1) per token to O(num_seqs) per token, i.e. O(num_tokens × num_seqs) overall.
```cpp
__aicore__ inline void CopyInIndex(const int64_t idx)
{
    // Look up the LoRA index
    int64_t weightIdx = idx;
    uint64_t i = 0;
    for (; i < seqLenGm_.GetSize(); i++) {
        int64_t repeatValue = seqLenGm_.GetValue(i);
        if (weightIdx >= repeatValue) {
            weightIdx -= repeatValue;
            continue;
        }
        break;
    }
    reqLoRAIndex_ = (i < seqLenGm_.GetSize()) ? loraIndicesGm_.GetValue(i) : -1;
}
```
The CopyInIndex function introduces a significant performance regression compared to the BlockIterator it replaces. It performs a linear search through seqLenGm_ for every token index idx in the batch. Since this is called inside the main loop in Process(), the overall complexity becomes O(num_tokens × num_seqs). Moreover, seqLenGm_.GetValue(i) is a synchronous read from Global Memory, which is extremely slow on AI Core when executed repeatedly in a loop.
Consider implementing a stateful iteration approach that maintains the current block index and cumulative offset across iterations of the token loop, which would restore amortized O(1) lookups per token.
```cpp
__aicore__ inline void CopyInIndex(const int64_t idx)
{
    // look up the LoRA index
    int64_t weightIdx = idx;
    uint64_t i = 0;
    for (; i < seqLenGm_.GetSize(); i++) {
        int64_t repeatValue = seqLenGm_.GetValue(i);
        if (weightIdx >= repeatValue) {
            weightIdx -= repeatValue;
            continue;
        }
        break;
    }
    reqLoRAIndex_ = (i < seqLenGm_.GetSize()) ? loraIndicesGm_.GetValue(i) : -1;
}
```
The CopyInIndex function introduces a significant performance regression by performing a linear search through seqLenGm_ for every token. This results in O(num_tokens × num_seqs) synchronous calls to GetValue().
Consider using a stateful approach to track the current LoRA block across token iterations, reducing the complexity to amortized O(1) per token.
Reverts #384