
(tri_inv) (pto-isa) implement AIV triangular inverse using pto-isa #369

Open
zouzias wants to merge 6 commits into sgl-project:main from
zouzias:github_oss_10-triinv-aiv-column-sweep-written-in-pto-isa

Conversation

@zouzias
Contributor

@zouzias zouzias commented Feb 6, 2026

This MR implements a triangular inverse kernel using the [pto-isa](https://gitcode.com/cann/pto-isa/) virtual instruction set.

The algorithm is identical to the AscendC kernel in class `KernelTriInvColumnSweep`.
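For readers unfamiliar with the column-sweep scheme, here is a minimal NumPy sketch (the function name `triu_inv_col_sweep` is illustrative, not the kernel's API): it inverts an upper-triangular matrix one column at a time, solving `U @ x = e_col` by back-substitution for each unit vector.

```python
import numpy as np

def triu_inv_col_sweep(U):
    """Invert an upper-triangular matrix by column sweep (illustrative)."""
    n = U.shape[0]
    X = np.zeros((n, n))
    for col in range(n):
        b = np.zeros(n)
        b[col] = 1.0              # solve U @ x = e_col
        x = np.zeros(n)
        for k in range(col, -1, -1):   # entries below `col` stay zero
            x[k] = b[k] / U[k, k]
            b[:k] -= U[:k, k] * x[k]   # update with COLUMN k of U
        X[:, col] = x
    return X
```

The inner update `b[:k] -= U[:k, k] * x[k]` is the step the kernel's comments refer to.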

Joint work with @asobczyk, @gioelegott and @learning-chip.

To reproduce the results

Environments to test:

  • quay.io/ascend/cann:8.5.0-910b-ubuntu22.04-py3.11
  • lmsysorg/sglang:main-cann8.5.0-910b
bash build.sh -a kernels
pip install --force-reinstall output/sgl_kernel_npu-2026.2.1-cp310-cp310-linux_x86_64.whl
python benchmark/benchmark_tri_inv.py

Results (910B2 / 910B4)

Tensor dimensions T, H, D = 500, 4, 32 are fixed; see here for details.

All numbers are in microseconds (us).

910B2

| (batch_size, chunk_size) | ascendc_aiv | pto  | triton |
|--------------------------|-------------|------|--------|
| (10, 16)                 | 463         | 485  | 761    |
| (10, 32)                 | 795         | 829  | 1458   |
| (10, 64)                 | 1468        | 1515 | 3108   |
| (10, 128)                | 3207        | 3119 | nan    |
| (20, 16)                 | 815         | 875  | 1318   |
| (20, 32)                 | 1441        | 1522 | 2748   |
| (20, 64)                 | 2833        | 2940 | 6013   |
| (20, 128)                | 5558        | 5388 | nan    |
| (30, 16)                 | 1175        | 1245 | 1880   |
| (30, 32)                 | 2099        | 2213 | 3994   |
| (30, 64)                 | 4018        | 4169 | 8906   |
| (30, 128)                | 7890        | 7660 | nan    |

910B4

| (batch_size, chunk_size) | ascendc_aiv | pto  | triton |
|--------------------------|-------------|------|--------|
| (10, 16)                 | 525         | 560  | 949    |
| (10, 32)                 | 931         | 980  | 1706   |
| (10, 64)                 | 1768        | 1833 | 3802   |
| (10, 128)                | 3461        | 3361 | nan    |
| (20, 16)                 | 993         | 1056 | 1535   |
| (20, 32)                 | 1807        | 1888 | 3345   |
| (20, 64)                 | 3476        | 3593 | 7528   |
| (20, 128)                | 6857        | 6643 | nan    |
| (30, 16)                 | 1441        | 1550 | 2272   |
| (30, 32)                 | 2659        | 2814 | 4981   |
| (30, 64)                 | 5160        | 5355 | 11263  |
| (30, 128)                | 10244       | 9944 | nan    |

@gemini-code-assist
Contributor

Summary of Changes

Hello @zouzias, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new, optimized kernel for triangular matrix inversion, tri_inv_pto_col_sweep, which leverages the pto-isa virtual instruction set for enhanced performance on NPU devices. The changes involve adding the core C++ kernel implementation, integrating it into the PyTorch operator registry, and providing a comprehensive benchmarking script to compare its efficiency against existing Triton and AscendC methods. This enhancement aims to improve the computational efficiency of triangular inverse operations within the system.

Highlights

  • New pto-isa Triangular Inverse Kernel: Implemented a new kernel for triangular matrix inversion using the pto-isa virtual instruction set, specifically tri_inv_pto_col_sweep, to optimize performance on NPU devices.
  • Performance Benchmarking: Introduced a new benchmarking script (benchmark/benchmark_tri_inv.py) to compare the performance of the pto-isa kernel against existing Triton and AscendC implementations on NPU devices (910B2 and 910B4), showing competitive results.
  • Codebase Restructuring and Integration: Renamed existing triangular inverse files to explicitly include _col_sweep in their names for clarity, integrated the new pto_triangular_inverse operator into the PyTorch extension, and added dedicated Python tests to validate its correctness.


Changelog
  • benchmark/benchmark_tri_inv.py
    • Added a new Python script for benchmarking different triangular inverse methods, including triton, ascendc_aiv, and the newly introduced pto method.
    • Configured logging and NPU device settings for benchmarking.
    • Implemented run_benchmark and profile_solve_tril functions to execute and measure performance.
  • csrc/CMakeLists.txt
    • Renamed tri_inv.cpp to tri_inv_col_sweep.cpp and tri_inv_kernel.cpp to tri_inv_col_sweep_kernel.cpp.
    • Added tri_inv_pto_col_sweep.cpp and tri_inv_pto_col_sweep_kernel.cpp to the list of compiled source files.
    • Included ASCEND_INCLUDE_DIR for ascendc_include_directories.
  • csrc/pytorch_extensions.cpp
    • Registered a new PyTorch operator pto_triangular_inverse for the npu namespace.
    • Mapped pto_triangular_inverse to the sglang::npu_kernel::tri_inv_pto_col_sweep implementation.
  • csrc/tri_inv/op_host/tri_inv_col_sweep.cpp
    • Renamed from csrc/tri_inv/op_host/tri_inv.cpp.
  • csrc/tri_inv/op_host/tri_inv_pto_col_sweep.cpp
    • Added a new host-side C++ implementation for the pto-isa triangular inverse kernel.
    • Handles tiling data and dispatches to fp16 or fp32 kernel based on input dtype.
  • csrc/tri_inv/op_kernel/kernel_tri_inv_col_sweep.h
    • Renamed from csrc/tri_inv/op_kernel/kernel_tri_inv.h.
  • csrc/tri_inv/op_kernel/tri_inv_col_sweep_kernel.cpp
    • Renamed from csrc/tri_inv/op_kernel/tri_inv_kernel.cpp.
    • Updated include path from kernel_tri_inv.h to kernel_tri_inv_col_sweep.h.
  • csrc/tri_inv/op_kernel/tri_inv_pto_col_sweep_kernel.cpp
    • Added the core pto-isa kernel implementation for triangular inverse.
    • Implements runTTriInv for matrix sizes 16, 32, 64, 128 using pto-inst.hpp.
    • Provides fp16 and fp32 entry points for the kernel.
  • include/sgl_kenel_npu_ops.h
    • Declared the new at::Tensor tri_inv_pto_col_sweep(const at::Tensor &tensor_in) function.
  • tests/python/sgl_kernel_npu/test_triangular_inverse_pto.py
    • Added new Python unit tests for pto_triangular_inverse.
    • Includes np_triu_inv_cs for a NumPy reference implementation of column sweep.
    • Tests with random and all-ones lower triangular matrices across various batch and matrix sizes.
    • Compares results against numpy.linalg.inv with appropriate atol and rtol scaling.
Activity
  • The author, zouzias, has implemented a new triangular inverse kernel using pto-isa.
  • Detailed performance benchmarks on 910B2 and 910B4 devices have been provided, comparing the new pto method against existing Triton and AscendC implementations.
  • A new benchmarking script and comprehensive test cases have been included to validate the new implementation.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request implements a triangular inverse kernel using pto-isa, including benchmarks and tests. My review identified several critical issues. The new benchmark script contains a few bugs, such as a default parameter that will cause a crash and a reliance on global variables, which is poor practice. More importantly, the new test for the pto kernel appears to have an incorrect assertion, which may be masking a bug in the kernel itself. The pto kernel implementation seems to have a flaw in its back-substitution algorithm, using row-wise data where column-wise data is expected. A TODO comment in the code also points to a potential issue there. These issues, particularly in the kernel and test logic, are critical and should be addressed.

Comment on lines +114 to +123
    if (k > 0) {
      // b[:k] -= A[:k, k] * x[k]
      TEXPANDS(diff, static_cast<T>(0));
      TMULS(diff, A_k, alpha);
      set_flag(PIPE_V, PIPE_S, EVENT_ID0);
      wait_flag(PIPE_V, PIPE_S, EVENT_ID0);

      TSUB(b, b, diff);
    }
  }

critical

The back-substitution logic appears to be incorrect. The algorithm for solving Ux=b (where U is upper triangular) requires updating b using columns of U (b_i -= U_ik * x_k). However, the current implementation uses rows of the matrix (A_k, which is row k). This is evident from TASSIGN(A_k, k * S * sizeof(T)) on line 106 and TMULS(diff, A_k, alpha) on line 117. This is likely a bug and will produce incorrect results unless the matrix has some special symmetric properties, which is not generally the case. The TODO on line 86 also suggests an awareness of an issue with the update logic.
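To make the reviewer's point concrete, here is a hedged NumPy sketch of back-substitution with the column-based update (`U[:k, k]`, column k) that solving Ux=b requires; the function name `solve_triu` is illustrative, not part of the kernel:

```python
import numpy as np

def solve_triu(U, b):
    """Back-substitution for U @ x = b (U upper triangular, illustrative)."""
    n = U.shape[0]
    b = b.astype(float).copy()
    x = np.zeros(n)
    for k in range(n - 1, -1, -1):
        x[k] = b[k] / U[k, k]
        b[:k] -= U[:k, k] * x[k]   # COLUMN k of U, not row k
    return x
```

Replacing `U[:k, k]` with the row `U[k, :k]` in this sketch gives wrong results unless the matrix happens to be symmetric in the relevant entries, which mirrors the suspected bug.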

# rtol must be scaled w.r.t to the input size, see Higham's paper, Eq. (2.3)
# https://nhigham.com/wp-content/uploads/2023/08/high89t.pdf
scaled_rtol = min([0.05, 10 * (matrix_size + batch_size) * rtol])
assert torch.allclose(actual, golden_numpy_as_torch, atol=atol, rtol=scaled_rtol)

critical

The assertion logic seems incorrect. The test computes inv(L) as the golden reference, but passes L.T to the kernel. The kernel is expected to compute inv(L.T). The assertion then compares inv(L.T) (the actual result) with inv(L) (the golden_numpy_as_torch). This is only correct if inv(L) is symmetric, which is not generally true. The comparison should be against inv(L.T), which is (inv(L)).T.

Suggested change
assert torch.allclose(actual, golden_numpy_as_torch, atol=atol, rtol=scaled_rtol)
assert torch.allclose(actual, golden_numpy_as_torch.transpose(-1, -2), atol=atol, rtol=scaled_rtol)
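A quick NumPy check of the identity the suggestion relies on: `inv(L.T)` equals `inv(L).T`, while `inv(L)` itself is generally not symmetric, so comparing `inv(L.T)` against `inv(L)` directly would only pass for special matrices.

```python
import numpy as np

# Well-conditioned random lower-triangular matrix (illustrative sizes).
rng = np.random.default_rng(0)
L = np.tril(rng.standard_normal((8, 8))) + 8.0 * np.eye(8)

# The identity the suggested fix uses:
assert np.allclose(np.linalg.inv(L.T), np.linalg.inv(L).T)

# inv(L) is lower triangular, hence not symmetric in general:
assert not np.allclose(np.linalg.inv(L), np.linalg.inv(L).T)
```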

Comment thread benchmark/benchmark_tri_inv.py Outdated
Comment on lines +127 to +138
    times_ms = list(
        run_benchmark(
            run_solve_tril,
            args.warmup,
            args.repeats,
        )
    )
    avg_time_ms = sum(times_ms) / len(times_ms)
    avg_time_us = int(avg_time_ms * 1000)
    with open(filename, "a", encoding="UTF-8") as fd:
        line = f"{inverse_method},{B},{T},{H},{D},{chunk_size},{avg_time_us}"
        fd.write(f"{line}\n")

medium

The function profile_solve_tril uses args and filename which are defined in the if __name__ == "__main__" block. This is not a good practice as it makes the function non-reusable and hard to test. It's better to pass warmup, repeats, and filename as arguments to this function.
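A sketch of the suggested refactor, with all configuration passed as parameters; `run_benchmark` is stubbed here as a simple timer (its real signature in the script is assumed to be similar), and the parameter names are illustrative:

```python
import tempfile
import time

def run_benchmark(fn, warmup, repeats):
    # Stand-in for the script's run_benchmark (an assumption):
    # yields per-iteration wall-clock times in milliseconds.
    for _ in range(warmup):
        fn()
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        yield (time.perf_counter() - t0) * 1000.0

def profile_solve_tril(fn, warmup, repeats, filename, row_prefix):
    # Configuration arrives explicitly, so the function is reusable
    # and testable without module-level `args`/`filename` globals.
    times_ms = list(run_benchmark(fn, warmup, repeats))
    avg_time_us = int(sum(times_ms) / len(times_ms) * 1000)
    with open(filename, "a", encoding="UTF-8") as fd:
        fd.write(f"{row_prefix},{avg_time_us}\n")
    return avg_time_us
```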

zouzias and others added 2 commits February 6, 2026 16:22
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@zouzias zouzias marked this pull request as ready for review February 8, 2026 13:54