
(tri_inv) (pto-isa) implement AIV triangular inverse using pto-isa #369

Open
zouzias wants to merge 6 commits into sgl-project:main from
zouzias:github_oss_10-triinv-aiv-column-sweep-written-in-pto-isa

Conversation

@zouzias
Contributor

@zouzias zouzias commented Feb 6, 2026

This MR implements a triangular inverse kernel using the [pto-isa](https://gitcode.com/cann/pto-isa/) virtual instruction set.

The algorithm is identical to the AscendC kernel in class `KernelTriInvColumnSweep`.
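For readers unfamiliar with the column-sweep scheme, here is a minimal NumPy sketch (the function name `triu_inv_col_sweep` is illustrative, not the kernel's API): it inverts an upper-triangular matrix one column at a time, solving `U @ x = e_col` by back-substitution for each unit vector.

```python
import numpy as np

def triu_inv_col_sweep(U):
    """Invert an upper-triangular matrix by column sweep (illustrative)."""
    n = U.shape[0]
    X = np.zeros((n, n))
    for col in range(n):
        b = np.zeros(n)
        b[col] = 1.0              # solve U @ x = e_col
        x = np.zeros(n)
        for k in range(col, -1, -1):   # entries below `col` stay zero
            x[k] = b[k] / U[k, k]
            b[:k] -= U[:k, k] * x[k]   # update with COLUMN k of U
        X[:, col] = x
    return X
```

The inner update `b[:k] -= U[:k, k] * x[k]` is the step the kernel's comments refer to.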

Joint work with @asobczyk, @gioelegott and @learning-chip.

To reproduce the results

Environments to test:

  • quay.io/ascend/cann:8.5.0-910b-ubuntu22.04-py3.11
  • lmsysorg/sglang:main-cann8.5.0-910b
bash build.sh -a kernels
pip install --force-reinstall output/sgl_kernel_npu-2026.2.1-cp310-cp310-linux_x86_64.whl
python benchmark/benchmark_tri_inv.py

Results (910B2 / 910B4)

Tensor dimensions T, H, D = 500, 4, 32 are fixed; see here for details.

All numbers are in microseconds (us).

910B2

| (batch_size, chunk_size) | ascendc_aiv | pto  | triton |
|--------------------------|-------------|------|--------|
| (10, 16)                 | 463         | 485  | 761    |
| (10, 32)                 | 795         | 829  | 1458   |
| (10, 64)                 | 1468        | 1515 | 3108   |
| (10, 128)                | 3207        | 3119 | nan    |
| (20, 16)                 | 815         | 875  | 1318   |
| (20, 32)                 | 1441        | 1522 | 2748   |
| (20, 64)                 | 2833        | 2940 | 6013   |
| (20, 128)                | 5558        | 5388 | nan    |
| (30, 16)                 | 1175        | 1245 | 1880   |
| (30, 32)                 | 2099        | 2213 | 3994   |
| (30, 64)                 | 4018        | 4169 | 8906   |
| (30, 128)                | 7890        | 7660 | nan    |

910B4

| (batch_size, chunk_size) | ascendc_aiv | pto  | triton |
|--------------------------|-------------|------|--------|
| (10, 16)                 | 525         | 560  | 949    |
| (10, 32)                 | 931         | 980  | 1706   |
| (10, 64)                 | 1768        | 1833 | 3802   |
| (10, 128)                | 3461        | 3361 | nan    |
| (20, 16)                 | 993         | 1056 | 1535   |
| (20, 32)                 | 1807        | 1888 | 3345   |
| (20, 64)                 | 3476        | 3593 | 7528   |
| (20, 128)                | 6857        | 6643 | nan    |
| (30, 16)                 | 1441        | 1550 | 2272   |
| (30, 32)                 | 2659        | 2814 | 4981   |
| (30, 64)                 | 5160        | 5355 | 11263  |
| (30, 128)                | 10244       | 9944 | nan    |

@gemini-code-assist
Contributor

Summary of Changes

Hello @zouzias, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new, optimized kernel for triangular matrix inversion, tri_inv_pto_col_sweep, which leverages the pto-isa virtual instruction set for enhanced performance on NPU devices. The changes involve adding the core C++ kernel implementation, integrating it into the PyTorch operator registry, and providing a comprehensive benchmarking script to compare its efficiency against existing Triton and AscendC methods. This enhancement aims to improve the computational efficiency of triangular inverse operations within the system.

Highlights

  • New pto-isa Triangular Inverse Kernel: Implemented a new kernel for triangular matrix inversion using the pto-isa virtual instruction set, specifically tri_inv_pto_col_sweep, to optimize performance on NPU devices.
  • Performance Benchmarking: Introduced a new benchmarking script (benchmark/benchmark_tri_inv.py) to compare the performance of the pto-isa kernel against existing Triton and AscendC implementations on NPU devices (910B2 and 910B4), showing competitive results.
  • Codebase Restructuring and Integration: Renamed existing triangular inverse files to explicitly include _col_sweep in their names for clarity, integrated the new pto_triangular_inverse operator into the PyTorch extension, and added dedicated Python tests to validate its correctness.


Changelog
  • benchmark/benchmark_tri_inv.py
    • Added a new Python script for benchmarking different triangular inverse methods, including triton, ascendc_aiv, and the newly introduced pto method.
    • Configured logging and NPU device settings for benchmarking.
    • Implemented run_benchmark and profile_solve_tril functions to execute and measure performance.
  • csrc/CMakeLists.txt
    • Renamed tri_inv.cpp to tri_inv_col_sweep.cpp and tri_inv_kernel.cpp to tri_inv_col_sweep_kernel.cpp.
    • Added tri_inv_pto_col_sweep.cpp and tri_inv_pto_col_sweep_kernel.cpp to the list of compiled source files.
    • Included ASCEND_INCLUDE_DIR for ascendc_include_directories.
  • csrc/pytorch_extensions.cpp
    • Registered a new PyTorch operator pto_triangular_inverse for the npu namespace.
    • Mapped pto_triangular_inverse to the sglang::npu_kernel::tri_inv_pto_col_sweep implementation.
  • csrc/tri_inv/op_host/tri_inv_col_sweep.cpp
    • Renamed from csrc/tri_inv/op_host/tri_inv.cpp.
  • csrc/tri_inv/op_host/tri_inv_pto_col_sweep.cpp
    • Added a new host-side C++ implementation for the pto-isa triangular inverse kernel.
    • Handles tiling data and dispatches to fp16 or fp32 kernel based on input dtype.
  • csrc/tri_inv/op_kernel/kernel_tri_inv_col_sweep.h
    • Renamed from csrc/tri_inv/op_kernel/kernel_tri_inv.h.
  • csrc/tri_inv/op_kernel/tri_inv_col_sweep_kernel.cpp
    • Renamed from csrc/tri_inv/op_kernel/tri_inv_kernel.cpp.
    • Updated include path from kernel_tri_inv.h to kernel_tri_inv_col_sweep.h.
  • csrc/tri_inv/op_kernel/tri_inv_pto_col_sweep_kernel.cpp
    • Added the core pto-isa kernel implementation for triangular inverse.
    • Implements runTTriInv for matrix sizes 16, 32, 64, 128 using pto-inst.hpp.
    • Provides fp16 and fp32 entry points for the kernel.
  • include/sgl_kenel_npu_ops.h
    • Declared the new at::Tensor tri_inv_pto_col_sweep(const at::Tensor &tensor_in) function.
  • tests/python/sgl_kernel_npu/test_triangular_inverse_pto.py
    • Added new Python unit tests for pto_triangular_inverse.
    • Includes np_triu_inv_cs for a NumPy reference implementation of column sweep.
    • Tests with random and all-ones lower triangular matrices across various batch and matrix sizes.
    • Compares results against numpy.linalg.inv with appropriate atol and rtol scaling.
Activity
  • The author, zouzias, has implemented a new triangular inverse kernel using pto-isa.
  • Detailed performance benchmarks on 910B2 and 910B4 devices have been provided, comparing the new pto method against existing Triton and AscendC implementations.
  • A new benchmarking script and comprehensive test cases have been included to validate the new implementation.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request implements a triangular inverse kernel using pto-isa, including benchmarks and tests. My review identified several critical issues. The new benchmark script contains a few bugs, such as a default parameter that will cause a crash and a reliance on global variables, which is poor practice. More importantly, the new test for the pto kernel appears to have an incorrect assertion, which may be masking a bug in the kernel itself. The pto kernel implementation seems to have a flaw in its back-substitution algorithm, using row-wise data where column-wise data is expected. A TODO comment in the code also points to a potential issue there. These issues, particularly in the kernel and test logic, are critical and should be addressed.

Comment on lines +114 to +123
    if (k > 0) {
      // b[:k] -= A[:k, k] * x[k]
      TEXPANDS(diff, static_cast<T>(0));
      TMULS(diff, A_k, alpha);
      set_flag(PIPE_V, PIPE_S, EVENT_ID0);
      wait_flag(PIPE_V, PIPE_S, EVENT_ID0);

      TSUB(b, b, diff);
    }
  }

critical

The back-substitution logic appears to be incorrect. The algorithm for solving Ux=b (where U is upper triangular) requires updating b using columns of U (b_i -= U_ik * x_k). However, the current implementation uses rows of the matrix (A_k, which is row k). This is evident from TASSIGN(A_k, k * S * sizeof(T)) on line 106 and TMULS(diff, A_k, alpha) on line 117. This is likely a bug and will produce incorrect results unless the matrix has some special symmetric properties, which is not generally the case. The TODO on line 86 also suggests an awareness of an issue with the update logic.
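To make the reviewer's point concrete, here is a hedged NumPy sketch of back-substitution with the column-based update (`U[:k, k]`, column k) that solving Ux=b requires; the function name `solve_triu` is illustrative, not part of the kernel:

```python
import numpy as np

def solve_triu(U, b):
    """Back-substitution for U @ x = b (U upper triangular, illustrative)."""
    n = U.shape[0]
    b = b.astype(float).copy()
    x = np.zeros(n)
    for k in range(n - 1, -1, -1):
        x[k] = b[k] / U[k, k]
        b[:k] -= U[:k, k] * x[k]   # COLUMN k of U, not row k
    return x
```

Replacing `U[:k, k]` with the row `U[k, :k]` in this sketch gives wrong results unless the matrix happens to be symmetric in the relevant entries, which mirrors the suspected bug.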

# rtol must be scaled w.r.t to the input size, see Higham's paper, Eq. (2.3)
# https://nhigham.com/wp-content/uploads/2023/08/high89t.pdf
scaled_rtol = min([0.05, 10 * (matrix_size + batch_size) * rtol])
assert torch.allclose(actual, golden_numpy_as_torch, atol=atol, rtol=scaled_rtol)

critical

The assertion logic seems incorrect. The test computes inv(L) as the golden reference, but passes L.T to the kernel. The kernel is expected to compute inv(L.T). The assertion then compares inv(L.T) (the actual result) with inv(L) (the golden_numpy_as_torch). This is only correct if inv(L) is symmetric, which is not generally true. The comparison should be against inv(L.T), which is (inv(L)).T.

Suggested change
assert torch.allclose(actual, golden_numpy_as_torch, atol=atol, rtol=scaled_rtol)
assert torch.allclose(actual, golden_numpy_as_torch.transpose(-1, -2), atol=atol, rtol=scaled_rtol)
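A quick NumPy check of the identity the suggestion relies on: `inv(L.T)` equals `inv(L).T`, while `inv(L)` itself is generally not symmetric, so comparing `inv(L.T)` against `inv(L)` directly would only pass for special matrices.

```python
import numpy as np

# Well-conditioned random lower-triangular matrix (illustrative sizes).
rng = np.random.default_rng(0)
L = np.tril(rng.standard_normal((8, 8))) + 8.0 * np.eye(8)

# The identity the suggested fix uses:
assert np.allclose(np.linalg.inv(L.T), np.linalg.inv(L).T)

# inv(L) is lower triangular, hence not symmetric in general:
assert not np.allclose(np.linalg.inv(L), np.linalg.inv(L).T)
```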

Comment thread benchmark/benchmark_tri_inv.py Outdated
Comment on lines +127 to +138
    times_ms = list(
        run_benchmark(
            run_solve_tril,
            args.warmup,
            args.repeats,
        )
    )
    avg_time_ms = sum(times_ms) / len(times_ms)
    avg_time_us = int(avg_time_ms * 1000)
    with open(filename, "a", encoding="UTF-8") as fd:
        line = f"{inverse_method},{B},{T},{H},{D},{chunk_size},{avg_time_us}"
        fd.write(f"{line}\n")

medium

The function profile_solve_tril uses args and filename which are defined in the if __name__ == "__main__" block. This is not a good practice as it makes the function non-reusable and hard to test. It's better to pass warmup, repeats, and filename as arguments to this function.
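A sketch of the suggested refactor, with all configuration passed as parameters; `run_benchmark` is stubbed here as a simple timer (its real signature in the script is assumed to be similar), and the parameter names are illustrative:

```python
import tempfile
import time

def run_benchmark(fn, warmup, repeats):
    # Stand-in for the script's run_benchmark (an assumption):
    # yields per-iteration wall-clock times in milliseconds.
    for _ in range(warmup):
        fn()
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        yield (time.perf_counter() - t0) * 1000.0

def profile_solve_tril(fn, warmup, repeats, filename, row_prefix):
    # Configuration arrives explicitly, so the function is reusable
    # and testable without module-level `args`/`filename` globals.
    times_ms = list(run_benchmark(fn, warmup, repeats))
    avg_time_us = int(sum(times_ms) / len(times_ms) * 1000)
    with open(filename, "a", encoding="UTF-8") as fd:
        fd.write(f"{row_prefix},{avg_time_us}\n")
    return avg_time_us
```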

zouzias and others added 2 commits February 6, 2026 16:22
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@zouzias zouzias marked this pull request as ready for review February 8, 2026 13:54