(tri_inv) (pto-isa) implement AIV triangular inverse using pto-isa#369

zouzias wants to merge 6 commits into `sgl-project:main`
Conversation
This MR implements a triangular inverse kernel using the [pto-isa](https://gitcode.com/cann/pto-isa/) virtual instruction set. The algorithm is identical to the AscendC kernel in class `KernelTriInvColumnSweep`.
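For intuition, a column-sweep triangular inverse solves `L @ x = e_j` by forward substitution for each unit basis vector `e_j` and collects the solutions as columns of the inverse. The NumPy sketch below illustrates this scheme only; the actual `KernelTriInvColumnSweep` blocking and data layout may differ.

```python
import numpy as np

def tri_inv_column_sweep(L):
    """Invert lower-triangular L one column at a time: for each j,
    solve L @ x = e_j by forward substitution (illustrative sketch)."""
    n = L.shape[0]
    inv = np.zeros_like(L)
    for j in range(n):
        x = np.zeros(n, dtype=L.dtype)
        for i in range(j, n):  # rows above j stay zero for column j
            rhs = 1.0 if i == j else 0.0
            x[i] = (rhs - L[i, :i] @ x[:i]) / L[i, i]
        inv[:, j] = x
    return inv

# Well-conditioned test matrix: random lower triangle + boosted diagonal.
L = np.tril(np.random.default_rng(0).random((4, 4))) + 4 * np.eye(4)
assert np.allclose(tri_inv_column_sweep(L) @ L, np.eye(4))
```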
**Summary of Changes**

Hello @zouzias, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed: this pull request introduces a new, optimized kernel for triangular matrix inversion.
Code Review
This pull request implements a triangular inverse kernel using pto-isa, including benchmarks and tests. My review identified several critical issues. The new benchmark script contains a few bugs, such as a default parameter that will cause a crash, and it relies on global variables, which is poor practice. More importantly, the new test for the pto kernel appears to have an incorrect assertion, which may be masking a bug in the kernel itself. The pto kernel implementation also seems to have a flaw in its back-substitution algorithm: it uses row-wise data where column-wise data is expected. A TODO comment in the code points to a potential issue there as well. These issues, particularly in the kernel and test logic, are critical and should be addressed.
```cpp
    if (k > 0) {
      // b[:k] -= A[:k, k] * x[k]
      TEXPANDS(diff, static_cast<T>(0));
      TMULS(diff, A_k, alpha);
      set_flag(PIPE_V, PIPE_S, EVENT_ID0);
      wait_flag(PIPE_V, PIPE_S, EVENT_ID0);

      TSUB(b, b, diff);
    }
  }
```
The back-substitution logic appears to be incorrect. The algorithm for solving Ux=b (where U is upper triangular) requires updating b using columns of U (b_i -= U_ik * x_k). However, the current implementation uses rows of the matrix (A_k, which is row k). This is evident from TASSIGN(A_k, k * S * sizeof(T)) on line 106 and TMULS(diff, A_k, alpha) on line 117. This is likely a bug and will produce incorrect results unless the matrix has some special symmetric properties, which is not generally the case. The TODO on line 86 also suggests an awareness of an issue with the update logic.
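To make the expected update concrete, here is a plain NumPy back-substitution for `Ux = b` that uses column `k` of `U` in the update, as the comment describes. This is a reference sketch of the textbook algorithm, not the kernel's actual tile-level code.

```python
import numpy as np

def back_substitute(U, b):
    """Solve U @ x = b for upper-triangular U. The update uses
    COLUMN k of U (b[:k] -= U[:k, k] * x[k]), not row k."""
    n = U.shape[0]
    x = np.zeros_like(b)
    b = b.copy()
    for k in range(n - 1, -1, -1):
        x[k] = b[k] / U[k, k]
        if k > 0:
            b[:k] -= U[:k, k] * x[k]  # column k, the correct slice
    return x

U = np.triu(np.random.default_rng(1).random((5, 5))) + 5 * np.eye(5)
b = np.random.default_rng(2).random(5)
assert np.allclose(U @ back_substitute(U, b), b)
```

Using row `k` instead of column `k` in the update only coincides with this result for matrices with special symmetry, which matches the reviewer's point.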
```python
    # rtol must be scaled w.r.t. the input size, see Higham's paper, Eq. (2.3)
    # https://nhigham.com/wp-content/uploads/2023/08/high89t.pdf
    scaled_rtol = min([0.05, 10 * (matrix_size + batch_size) * rtol])
    assert torch.allclose(actual, golden_numpy_as_torch, atol=atol, rtol=scaled_rtol)
```
The assertion logic seems incorrect. The test computes inv(L) as the golden reference, but passes L.T to the kernel. The kernel is expected to compute inv(L.T). The assertion then compares inv(L.T) (the actual result) with inv(L) (the golden_numpy_as_torch). This is only correct if inv(L) is symmetric, which is not generally true. The comparison should be against inv(L.T), which is (inv(L)).T.
```diff
-assert torch.allclose(actual, golden_numpy_as_torch, atol=atol, rtol=scaled_rtol)
+assert torch.allclose(actual, golden_numpy_as_torch.transpose(-1, -2), atol=atol, rtol=scaled_rtol)
```
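The identity behind the suggested fix is `inv(A.T) == inv(A).T` for any invertible `A`, so a kernel fed `L.T` should be compared against the transposed golden reference. A quick NumPy check:

```python
import numpy as np

rng = np.random.default_rng(0)
L = np.tril(rng.random((4, 4))) + 4 * np.eye(4)  # well-conditioned lower-triangular

# inv(L.T) equals inv(L).T, but NOT inv(L) itself (inv(L) is lower-
# triangular while inv(L.T) is upper-triangular).
assert np.allclose(np.linalg.inv(L.T), np.linalg.inv(L).T)
assert not np.allclose(np.linalg.inv(L.T), np.linalg.inv(L))
```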
```python
    times_ms = list(
        run_benchmark(
            run_solve_tril,
            args.warmup,
            args.repeats,
        )
    )
    avg_time_ms = sum(times_ms) / len(times_ms)
    avg_time_us = int(avg_time_ms * 1000)
    with open(filename, "a", encoding="UTF-8") as fd:
        line = f"{inverse_method},{B},{T},{H},{D},{chunk_size},{avg_time_us}"
        fd.write(f"{line}\n")
```
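One way to address the review's point about global variables is to pass the workload through the timing helper explicitly. The sketch below is illustrative (`run_benchmark` is the name from the diff, but this body is an assumption, not the PR's actual implementation):

```python
import time

def run_benchmark(fn, warmup, repeats, *args, **kwargs):
    """Yield per-iteration wall-clock times in milliseconds.
    Workload arguments are passed in explicitly so the helper
    does not depend on module-level globals (hypothetical sketch)."""
    for _ in range(warmup):
        fn(*args, **kwargs)
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        yield (time.perf_counter() - start) * 1000.0

times_ms = list(run_benchmark(lambda: sum(range(1000)), warmup=2, repeats=5))
avg_time_us = int(sum(times_ms) / len(times_ms) * 1000)
assert len(times_ms) == 5 and avg_time_us >= 0
```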
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
This MR implements a triangular inverse kernel using the pto-isa virtual instruction set. The algorithm is identical to the AscendC kernel in class `KernelTriInvColumnSweep`.

Joint work with @asobczyk, @gioelegott and @learning-chip.
To reproduce the results

Environments to test:

- `quay.io/ascend/cann:8.5.0-910b-ubuntu22.04-py3.11`
- `lmsysorg/sglang:main-cann8.5.0-910b`

Results (910B4 / 910B2)
Tensor dimensions `T, H, D = 500, 4, 32` are fixed, see here for details. All numbers are in microseconds (us).
910B2
910B4