
MMLU benchmark for different inverse implementations#374

Open
gioelegott wants to merge 17 commits into sgl-project:main from gioelegott:bench-fast-inv-accuracy

Conversation


@gioelegott gioelegott commented Feb 11, 2026

In this PR we propose a small change that allows the AIV triangular_inverse to be integrated into the Triton backend of fla.
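
At a high level, the change pads inputs whose sequence length is not a multiple of the chunk size, reshapes the per-chunk lower-triangular blocks before the NPU inverse, and restores the original shape afterwards. Below is a minimal illustrative sketch, not the actual diff: the helper name and the [B, T, C] layout are assumptions, and torch.linalg.inv stands in for the NPU triangular inverse op.

import torch

def padded_block_tril_inverse(A: torch.Tensor, chunk_size: int = 64) -> torch.Tensor:
    # Hypothetical illustration only. Assumes A is laid out as [B, T, C] with
    # C == chunk_size, where each group of C consecutive rows holds the
    # strictly-lower part of a unit lower-triangular C x C block.
    B, T, C = A.shape
    pad = (-T) % chunk_size
    if pad:
        # Pad the time dimension so the last chunk is complete; padded rows are
        # zero and become identity rows once the unit diagonal is added below.
        A = torch.nn.functional.pad(A, (0, 0, 0, pad))
    blocks = A.reshape(B, -1, chunk_size, C).float()   # [B, num_chunks, C, C]
    eye = torch.eye(C, device=A.device)
    lower = blocks.tril(-1) + eye                      # unit lower-triangular blocks
    # Stand-in for the NPU triangular inverse op used by the real kernel.
    inv = torch.linalg.inv(lower)
    return inv.reshape(B, -1, C)[:, :T].to(A.dtype)    # drop padding, restore shape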

Additionally, we provide instructions for running the sglang server on A2 and for evaluating different inverse implementations using the MMLU benchmark.

Running the sglang server

#!/bin/bash
#

SLANG_DOCKER_IMAGE="sglang:main-cann8.3.rc1-910b"

drun() {
    docker run -it --rm --privileged --network=host --ipc=host --shm-size=16g \
    --device=/dev/davinci0 --device=/dev/davinci1 --device=/dev/davinci2 --device=/dev/davinci3 \
    --device=/dev/davinci4 --device=/dev/davinci5 --device=/dev/davinci6 --device=/dev/davinci7 \
    --device=/dev/davinci_manager --device=/dev/hisi_hdc \
    --volume /usr/local/sbin:/usr/local/sbin --volume /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    --volume /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
    --volume /etc/ascend_install.info:/etc/ascend_install.info \
    --volume /var/queue_schedule:/var/queue_schedule --volume ~/.cache/:/root/.cache/ "$@"
}


MODEL_NAME="Qwen/Qwen3-Next-80B-A3B-Instruct"
drun --env "HF_ENDPOINT=https://hf-mirror.com" \
    ${SLANG_DOCKER_IMAGE} \
    python3 -m sglang.launch_server \
    --model-path ${MODEL_NAME} \
    --attention-backend ascend \
    --tp-size 8 \
    --max-total-tokens 4096
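
Once launched, you can check that the server is ready by polling the health endpoint (assuming the default port 30000):

curl http://localhost:30000/health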

Once the server is up and running, you can send requests with:

import requests

port = 30000
url = f"http://localhost:{port}/v1/chat/completions"

data = {
    "model": "qwen/qwen3-next-80b-a3b-instruct",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
}

response = requests.post(url, json=data)
print(response.json())
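
Since the endpoint is OpenAI-compatible, the same request can also be sent with the openai Python client (assuming the package is installed; the api_key can be any placeholder when the server is launched without one):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="qwen/qwen3-next-80b-a3b-instruct",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)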

Model's accuracy

To check and compare the model's accuracy for different implementations of the triangular inverse, clone sgl-kernel-npu and sglang.

Then run:

SLANG_DOCKER_IMAGE="sglang:main-cann8.3.rc1-910b"

drun() {
    docker run -it --rm --privileged --network=host --ipc=host --shm-size=16g \
    --device=/dev/davinci0 --device=/dev/davinci1 --device=/dev/davinci2 --device=/dev/davinci3 \
    --device=/dev/davinci4 --device=/dev/davinci5 --device=/dev/davinci6 --device=/dev/davinci7 \
    --device=/dev/davinci_manager --device=/dev/hisi_hdc \
    --volume /usr/local/sbin:/usr/local/sbin --volume /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    --volume /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
    --volume /etc/ascend_install.info:/etc/ascend_install.info \
    --volume ~/sgl-kernel-npu/:/tmp/sgl-kernel-npu/ --volume ~/sglang/:/tmp/sglang/ \
    --volume /var/queue_schedule:/var/queue_schedule --volume ~/.cache/:/root/.cache/ "$@"
}

drun --env "HF_ENDPOINT=https://hf-mirror.com" \
    ${SLANG_DOCKER_IMAGE} /usr/bin/bash

From inside the docker container, run the server, adjusting --tp-size and --max-total-tokens so that the model fits:

python3 -m sglang.launch_server \
    --model-path Qwen/Qwen3-Next-80B-A3B-Instruct \
    --disable-cuda-graph --mem-fraction-static=0.7 \
    --attention-backend ascend \
    --tp-size=8 --max-total-tokens 1024 --disable-radix-cache

From another terminal, run the sglang benchmark:

cd sglang/benchmark/mmlu/
bash download_data.sh
python3 bench_sglang.py --nsub 10

This runs the MMLU benchmark on 10 subjects (--nsub 10) against the model using the native implementation of GDN. Running this experiment, we obtain:

subject: abstract_algebra, #q:100, acc: 0.680
subject: anatomy, #q:135, acc: 0.793
subject: astronomy, #q:152, acc: 0.941
subject: business_ethics, #q:100, acc: 0.780
subject: clinical_knowledge, #q:265, acc: 0.838
subject: college_biology, #q:144, acc: 0.958
subject: college_chemistry, #q:100, acc: 0.630
subject: college_computer_science, #q:100, acc: 0.830
subject: college_mathematics, #q:100, acc: 0.670
subject: college_medicine, #q:173, acc: 0.769
Total latency: 2016.666
Average accuracy: 0.805

Triton backend

To run the Triton backend, force-reinstall sgl-kernel-npu with the latest version inside the docker container:

cd sgl-kernel-npu
bash build.sh -a kernels
pip install --force-reinstall output/sgl_kernel_npu*.whl

Then run the sglang benchmark as before. We obtained:

subject: abstract_algebra, #q:100, acc: 0.800
subject: anatomy, #q:135, acc: 0.807
subject: astronomy, #q:152, acc: 0.947
subject: business_ethics, #q:100, acc: 0.820
subject: clinical_knowledge, #q:265, acc: 0.891
subject: college_biology, #q:144, acc: 0.972
subject: college_chemistry, #q:100, acc: 0.670
subject: college_computer_science, #q:100, acc: 0.860
subject: college_mathematics, #q:100, acc: 0.770
subject: college_medicine, #q:173, acc: 0.861
Total latency: 436.129
Average accuracy: 0.855

Fast inverse backend

To run the fast inverse backend, some modifications to sgl-kernel-npu are required. More precisely, replace solve_tril with fast_inv_tril in https://github.com/sgl-project/sgl-kernel-npu/blob/main/python/sgl_kernel_npu/sgl_kernel_npu/fla/chunk.py#L215
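
For reference, the call being swapped (as quoted in the discussion below; the exact wrapper name and arguments may differ in the final diff) looks roughly like this:

# before: Triton-based inverse
A = solve_tril(A=A, cu_seqlens=cu_seqlens, output_dtype=k.dtype)

# after: AIV fast inverse
A = fast_inv_tril_wrapper(A=A)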

Then reinstall sgl-kernel-npu with:

bash build.sh -a kernels
pip install --force-reinstall output/sgl_kernel_npu*.whl

Then run the accuracy benchmark as before. We obtain:

subject: abstract_algebra, #q:100, acc: 0.800
subject: anatomy, #q:135, acc: 0.807
subject: astronomy, #q:152, acc: 0.941
subject: business_ethics, #q:100, acc: 0.810
subject: clinical_knowledge, #q:265, acc: 0.894
subject: college_biology, #q:144, acc: 0.972
subject: college_chemistry, #q:100, acc: 0.670
subject: college_computer_science, #q:100, acc: 0.840
subject: college_mathematics, #q:100, acc: 0.780
subject: college_medicine, #q:173, acc: 0.855
Total latency: 491.175
Average accuracy: 0.852

@gemini-code-assist
Contributor

Summary of Changes

Hello @gioelegott, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enables the use of the AIV triangular_inverse within the fla library's Triton backend by modifying the fast_inv_tril function to correctly preprocess and postprocess tensors for the NPU-specific inverse operation, including necessary padding and reshaping. Additionally, it provides detailed guidance for users to deploy the sglang server on A2 and conduct MMLU benchmark evaluations to compare the accuracy of various inverse implementations.

Highlights

  • AIV Triangular Inverse Integration: The fast_inv_tril function was enhanced to correctly handle tensor dimensions and padding, facilitating the integration of the AIV triangular_inverse into the fla library's Triton backend for NPU operations.
  • MMLU Benchmark Instructions: Comprehensive instructions were added for setting up and running the sglang server on A2, and for evaluating the accuracy of different triangular inverse implementations using the MMLU benchmark.


Changelog
  • python/sgl_kernel_npu/sgl_kernel_npu/fla/chunk.py
    • Modified the fast_inv_tril function to include padding logic for input tensors where T % chunk_size is not zero.
    • Implemented tensor reshaping and transposing operations to correctly prepare the input for torch.ops.npu.triangular_inverse.
    • Restored the original tensor shape after the inverse operation to maintain consistency.
Activity
  • No human activity has been recorded on this pull request yet.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request updates the fast_inv_tril function to handle a new tensor format and integrate with an NPU-specific triangular inverse operation, which is a good step towards optimizing for specific hardware. My review focuses on ensuring the correctness and robustness of this new implementation. I've found a critical issue in the padding logic that can cause unnecessary processing and memory allocation, and another high-severity issue related to tensor memory layout that could lead to unsafe operations. Code suggestions are provided to address these points.

Comment thread python/sgl_kernel_npu/sgl_kernel_npu/fla/chunk.py Outdated
Comment thread python/sgl_kernel_npu/sgl_kernel_npu/fla/chunk.py Outdated
ping1jing2 previously approved these changes Feb 11, 2026
@iforgetmyname iforgetmyname dismissed ping1jing2’s stale review February 12, 2026 01:09

The merge-base changed after approval.

Comment thread python/sgl_kernel_npu/sgl_kernel_npu/fla/chunk.py Outdated
@Napkin-AI
Contributor

When replacing A = solve_tril(A=A, cu_seqlens=cu_seqlens, output_dtype=k.dtype) with A = fast_inv_tril_wrapper(A=A) in chunk.py, line 231, locally the accuracy of Qwen3-Next-80B-A3B-Instruct is 0 on the gsm8k benchmark and the outputs differ for the same input tensors. How is solve_tril supposed to be replaced?

@gioelegott
Author

When replacing A = solve_tril(A=A, cu_seqlens=cu_seqlens, output_dtype=k.dtype) with A = fast_inv_tril_wrapper(A=A) in chunk.py, line 231, locally the accuracy of Qwen3-Next-80B-A3B-Instruct is 0 on the gsm8k benchmark and the outputs differ for the same input tensors. How is solve_tril supposed to be replaced?

I haven't tested the gsm8k benchmark myself, but I am surprised by the results you obtain. Do you get any errors during the benchmark execution?
Also, the results are expected to be slightly different due to fp precision, but there shouldn't be a large gap between triton and our implementation.

@Napkin-AI
Contributor

When replacing A = solve_tril(A=A, cu_seqlens=cu_seqlens, output_dtype=k.dtype) with A = fast_inv_tril_wrapper(A=A) in chunk.py, line 231, locally the accuracy of Qwen3-Next-80B-A3B-Instruct is 0 on the gsm8k benchmark and the outputs differ for the same input tensors. How is solve_tril supposed to be replaced?

I haven't tested the gsm8k benchmark myself, but I am surprised by the results you obtain. Do you get any errors during the benchmark execution? Also, the results are expected to be slightly different due to fp precision, but there shouldn't be a large gap between triton and our implementation.

No errors occurred during the benchmark.
gsm8k output:

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [01:09<00:00,  6.97s/it]
Accuracy: 0.000
Invalid: 0.200
Latency: 70.028 s
Output throughput: 31.602 token/s

Also, in a custom test:

def test():
    attn_trill = load_param("attn_triton") # torch.Tensor, [1, 6, 4, 64]

    out_ref = solve_tril(attn_trill)
    out = fast_inv_tril_wrapper(attn_trill)

    assert torch.allclose(out, out_ref.to(torch.float32), atol=1e-3), f"Max difference is {(out - out_ref).abs().max()}"
    print("OK!")

Output:
AssertionError: Max difference is 1.0

