
MMLU benchmark for different inverse implementations#374

Open
gioelegott wants to merge 17 commits into sgl-project:main from gioelegott:bench-fast-inv-accuracy

Conversation


@gioelegott gioelegott commented Feb 11, 2026

In this PR we propose a small change that allows the AIV triangular_inverse to be integrated into the Triton backend of fla.
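
At a high level, the change pads inputs whose sequence length is not a multiple of the chunk size, reshapes the per-chunk lower-triangular blocks before the NPU inverse, and restores the original shape afterwards. Below is a minimal illustrative sketch, not the actual diff: the helper name and the [B, T, C] layout are assumptions, and torch.linalg.inv stands in for the NPU triangular inverse op.

import torch

def padded_block_tril_inverse(A: torch.Tensor, chunk_size: int = 64) -> torch.Tensor:
    # Hypothetical illustration only. Assumes A is laid out as [B, T, C] with
    # C == chunk_size, where each group of C consecutive rows holds the
    # strictly-lower part of a unit lower-triangular C x C block.
    B, T, C = A.shape
    pad = (-T) % chunk_size
    if pad:
        # Pad the time dimension so the last chunk is complete; padded rows are
        # zero and become identity rows once the unit diagonal is added below.
        A = torch.nn.functional.pad(A, (0, 0, 0, pad))
    blocks = A.reshape(B, -1, chunk_size, C).float()   # [B, num_chunks, C, C]
    eye = torch.eye(C, device=A.device)
    lower = blocks.tril(-1) + eye                      # unit lower-triangular blocks
    # Stand-in for the NPU triangular inverse op used by the real kernel.
    inv = torch.linalg.inv(lower)
    return inv.reshape(B, -1, C)[:, :T].to(A.dtype)    # drop padding, restore shape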

Additionally, we provide instructions for running the sglang server on A2 and for evaluating different inverse implementations using the MMLU benchmark.

Running the sglang server

#!/bin/bash
#

SLANG_DOCKER_IMAGE="sglang:main-cann8.3.rc1-910b"

drun() {
    docker run -it --rm --privileged --network=host --ipc=host --shm-size=16g \
    --device=/dev/davinci0 --device=/dev/davinci1 --device=/dev/davinci2 --device=/dev/davinci3 \
    --device=/dev/davinci4 --device=/dev/davinci5 --device=/dev/davinci6 --device=/dev/davinci7 \
    --device=/dev/davinci_manager --device=/dev/hisi_hdc \
    --volume /usr/local/sbin:/usr/local/sbin --volume /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    --volume /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
    --volume /etc/ascend_install.info:/etc/ascend_install.info \
    --volume /var/queue_schedule:/var/queue_schedule --volume ~/.cache/:/root/.cache/ "$@"
}


MODEL_NAME="Qwen/Qwen3-Next-80B-A3B-Instruct"
drun --env "HF_ENDPOINT=https://hf-mirror.com" \
    ${SLANG_DOCKER_IMAGE} \
    python3 -m sglang.launch_server \
    --model-path ${MODEL_NAME} \
    --attention-backend ascend \
    --tp-size 8 \
    --max-total-tokens 4096
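
Once launched, you can check that the server is ready by polling the health endpoint (assuming the default port 30000):

curl http://localhost:30000/health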

Once the server is up and running, you can send requests with:

import requests

port = 30000
url = f"http://localhost:{port}/v1/chat/completions"

data = {
    "model": "qwen/qwen3-next-80b-a3b-instruct",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
}

response = requests.post(url, json=data)
print(response.json())
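
Since the endpoint is OpenAI-compatible, the same request can also be sent with the openai Python client (assuming the package is installed; the api_key can be any placeholder when the server is launched without one):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="qwen/qwen3-next-80b-a3b-instruct",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)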

Model's accuracy

To check and compare the model's accuracy for different implementations of the triangular inverse, clone sgl-kernel-npu and sglang.

Then run:

SLANG_DOCKER_IMAGE="sglang:main-cann8.3.rc1-910b"

drun() {
    docker run -it --rm --privileged --network=host --ipc=host --shm-size=16g \
    --device=/dev/davinci0 --device=/dev/davinci1 --device=/dev/davinci2 --device=/dev/davinci3 \
    --device=/dev/davinci4 --device=/dev/davinci5 --device=/dev/davinci6 --device=/dev/davinci7 \
    --device=/dev/davinci_manager --device=/dev/hisi_hdc \
    --volume /usr/local/sbin:/usr/local/sbin --volume /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    --volume /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
    --volume /etc/ascend_install.info:/etc/ascend_install.info \
    --volume ~/sgl-kernel-npu/:/tmp/sgl-kernel-npu/ --volume ~/sglang/:/tmp/sglang/ \
    --volume /var/queue_schedule:/var/queue_schedule --volume ~/.cache/:/root/.cache/ "$@"
}

drun --env "HF_ENDPOINT=https://hf-mirror.com" \
    ${SLANG_DOCKER_IMAGE} /usr/bin/bash

From inside the docker container, run the server, adjusting --tp-size and --max-total-tokens so that the model fits:

python3 -m sglang.launch_server \
    --model-path Qwen/Qwen3-Next-80B-A3B-Instruct \
    --disable-cuda-graph --mem-fraction-static=0.7 \
    --attention-backend ascend \
    --tp-size=8 --max-total-tokens 1024 --disable-radix-cache

From another terminal, run the sglang benchmark:

cd sglang/benchmark/mmlu/
bash download_data.sh
python3 bench_sglang.py --nsub 10

This runs the MMLU benchmark on 10 subjects (--nsub 10) against the model using the native implementation of GDN. Running this experiment, we obtain:

subject: abstract_algebra, #q:100, acc: 0.680
subject: anatomy, #q:135, acc: 0.793
subject: astronomy, #q:152, acc: 0.941
subject: business_ethics, #q:100, acc: 0.780
subject: clinical_knowledge, #q:265, acc: 0.838
subject: college_biology, #q:144, acc: 0.958
subject: college_chemistry, #q:100, acc: 0.630
subject: college_computer_science, #q:100, acc: 0.830
subject: college_mathematics, #q:100, acc: 0.670
subject: college_medicine, #q:173, acc: 0.769
Total latency: 2016.666
Average accuracy: 0.805

Triton backend

To run the Triton backend, force-reinstall sgl-kernel-npu with the latest version inside the docker container:

cd sgl-kernel-npu
bash build.sh -a kernels
pip install --force-reinstall output/sgl_kernel_npu*.whl

Then run the sglang benchmark as before. We obtained:

subject: abstract_algebra, #q:100, acc: 0.800
subject: anatomy, #q:135, acc: 0.807
subject: astronomy, #q:152, acc: 0.947
subject: business_ethics, #q:100, acc: 0.820
subject: clinical_knowledge, #q:265, acc: 0.891
subject: college_biology, #q:144, acc: 0.972
subject: college_chemistry, #q:100, acc: 0.670
subject: college_computer_science, #q:100, acc: 0.860
subject: college_mathematics, #q:100, acc: 0.770
subject: college_medicine, #q:173, acc: 0.861
Total latency: 436.129
Average accuracy: 0.855

Fast inverse backend

To run the fast inverse backend, some modifications to sgl-kernel-npu are required. More precisely, replace solve_tril with fast_inv_tril in https://github.com/sgl-project/sgl-kernel-npu/blob/main/python/sgl_kernel_npu/sgl_kernel_npu/fla/chunk.py#L215
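
For reference, the call being swapped (as quoted in the discussion below; the exact wrapper name and arguments may differ in the final diff) looks roughly like this:

# before: Triton-based inverse
A = solve_tril(A=A, cu_seqlens=cu_seqlens, output_dtype=k.dtype)

# after: AIV fast inverse
A = fast_inv_tril_wrapper(A=A)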

Then reinstall sgl-kernel-npu with:

bash build.sh -a kernels
pip install --force-reinstall output/sgl_kernel_npu*.whl

Then run the accuracy benchmark as before. We obtain:

subject: abstract_algebra, #q:100, acc: 0.800
subject: anatomy, #q:135, acc: 0.807
subject: astronomy, #q:152, acc: 0.941
subject: business_ethics, #q:100, acc: 0.810
subject: clinical_knowledge, #q:265, acc: 0.894
subject: college_biology, #q:144, acc: 0.972
subject: college_chemistry, #q:100, acc: 0.670
subject: college_computer_science, #q:100, acc: 0.840
subject: college_mathematics, #q:100, acc: 0.780
subject: college_medicine, #q:173, acc: 0.855
Total latency: 491.175
Average accuracy: 0.852

@gemini-code-assist
Contributor

Summary of Changes

Hello @gioelegott, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enables the use of the AIV triangular_inverse within the fla library's Triton backend by modifying the fast_inv_tril function to correctly preprocess and postprocess tensors for the NPU-specific inverse operation, including necessary padding and reshaping. Additionally, it provides detailed guidance for users to deploy the sglang server on A2 and conduct MMLU benchmark evaluations to compare the accuracy of various inverse implementations.

Highlights

  • AIV Triangular Inverse Integration: The fast_inv_tril function was enhanced to correctly handle tensor dimensions and padding, facilitating the integration of the AIV triangular_inverse into the fla library's Triton backend for NPU operations.
  • MMLU Benchmark Instructions: Comprehensive instructions were added for setting up and running the sglang server on A2, and for evaluating the accuracy of different triangular inverse implementations using the MMLU benchmark.


Changelog
  • python/sgl_kernel_npu/sgl_kernel_npu/fla/chunk.py
    • Modified the fast_inv_tril function to include padding logic for input tensors where T % chunk_size is not zero.
    • Implemented tensor reshaping and transposing operations to correctly prepare the input for torch.ops.npu.triangular_inverse.
    • Restored the original tensor shape after the inverse operation to maintain consistency.
Activity
  • No human activity has been recorded on this pull request yet.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request updates the fast_inv_tril function to handle a new tensor format and integrate with an NPU-specific triangular inverse operation, which is a good step towards optimizing for specific hardware. My review focuses on ensuring the correctness and robustness of this new implementation. I've found a critical issue in the padding logic that can cause unnecessary processing and memory allocation, and another high-severity issue related to tensor memory layout that could lead to unsafe operations. Code suggestions are provided to address these points.

Comment thread python/sgl_kernel_npu/sgl_kernel_npu/fla/chunk.py Outdated
Comment thread python/sgl_kernel_npu/sgl_kernel_npu/fla/chunk.py Outdated
ping1jing2 previously approved these changes Feb 11, 2026
@iforgetmyname iforgetmyname dismissed ping1jing2’s stale review February 12, 2026 01:09

The merge-base changed after approval.

Comment thread python/sgl_kernel_npu/sgl_kernel_npu/fla/chunk.py Outdated
@Napkin-AI
Contributor

When replacing A = solve_tril(A=A, cu_seqlens=cu_seqlens, output_dtype=k.dtype) with A = fast_inv_tril_wrapper(A=A) in chunk.py, line 231, locally the accuracy of Qwen3-Next-80B-A3B-Instruct is 0 on the gsm8k benchmark and the outputs differ for the same input tensors. How is solve_tril supposed to be replaced?

@gioelegott
Author

When replacing A = solve_tril(A=A, cu_seqlens=cu_seqlens, output_dtype=k.dtype) with A = fast_inv_tril_wrapper(A=A) in chunk.py, line 231, locally the accuracy of Qwen3-Next-80B-A3B-Instruct is 0 on the gsm8k benchmark and the outputs differ for the same input tensors. How is solve_tril supposed to be replaced?

I haven't tested the gsm8k benchmark myself, but I am surprised by the results you obtain. Do you get any errors during the benchmark execution?
Also, the results are expected to be slightly different due to fp precision, but there shouldn't be a large gap between triton and our implementation.

@Napkin-AI
Contributor

When replacing A = solve_tril(A=A, cu_seqlens=cu_seqlens, output_dtype=k.dtype) with A = fast_inv_tril_wrapper(A=A) in chunk.py, line 231, locally the accuracy of Qwen3-Next-80B-A3B-Instruct is 0 on the gsm8k benchmark and the outputs differ for the same input tensors. How is solve_tril supposed to be replaced?

I haven't tested the gsm8k benchmark myself, but I am surprised by the results you obtain. Do you get any errors during the benchmark execution? Also, the results are expected to be slightly different due to fp precision, but there shouldn't be a large gap between triton and our implementation.

No errors occurred during the benchmark.
gsm8k output:

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [01:09<00:00,  6.97s/it]
Accuracy: 0.000
Invalid: 0.200
Latency: 70.028 s
Output throughput: 31.602 token/s

Also, in a custom test:

def test():
    attn_trill = load_param("attn_triton") # torch.Tensor, [1, 6, 4, 64]

    out_ref = solve_tril(attn_trill)
    out = fast_inv_tril_wrapper(attn_trill)

    assert torch.allclose(out, out_ref.to(torch.float32), atol=1e-3), f"Max difference is {(out - out_ref).abs().max()}"
    print("OK!")

Output:
AssertionError: Max difference is 1.0

