[NPU][Feat] Add Ascend sampling backend by Alexhaoge · Pull Request #12692 · sgl-project/sglang

Alexhaoge · 2025-11-05T10:09:31Z

Motivation

Currently, Ascend NPU supports only the pytorch sampling backend, whose performance degrades sharply at large batch sizes (>100). We propose adding an ascend sampling backend for better performance on NPU.

Modifications

The main improvement comes from torch_npu.npu_top_k_top_p and torch.Tensor.masked_fill_.

If top_ks<=1024, use the fused op torch_npu.npu_top_k_top_p for top-k & top-p sampling. The return value of this op is the filtered logits.
- Because softmax of the raw logits is included in the fused op, the initial softmax can be skipped if return_logprob is false.
- torch_npu.npu_top_k_top_p applies top-p filtering on top of the top-k results. This is a little different from the pytorch backend, but it does not affect accuracy, since the final answer is the intersection of top-k and top-p filtering.
If top_ks>1024, fall back to the native torch interface, but use masked_fill_ instead of naive tensor indexing. [:] array-style indexing is the root cause of the sampling performance degradation on NPU at large batch sizes.
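
The filtering described above can be illustrated with a small pure-Python sketch. It is not the torch_npu.npu_top_k_top_p kernel and makes no claim about its exact semantics; the helper names softmax and top_k_top_p_filter are hypothetical, and the sketch assumes top-p is measured on the softmax of the raw logits, as stated above. A token is kept only if it is both among the k most likely tokens and inside the top-p nucleus; every other logit is set to -inf.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_top_p_filter(logits, top_k, top_p):
    """Return logits with everything outside the top-k / top-p intersection set to -inf."""
    probs = softmax(logits)
    # Token indices ordered from most to least likely.
    order = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)
    keep = set()
    cumulative = 0.0
    for rank, idx in enumerate(order):
        # Stop at the top-k boundary or once the nucleus already holds top_p mass.
        if rank >= top_k or cumulative >= top_p:
            break
        keep.add(idx)
        cumulative += probs[idx]
    return [x if i in keep else float("-inf") for i, x in enumerate(logits)]


# Illustrative values only.
filtered = top_k_top_p_filter([2.0, 1.0, 0.5, -1.0], top_k=3, top_p=0.8)
print(filtered)
```

Because each token must pass both checks, applying top-k first and top-p second (or the reverse) yields the same kept set in this sketch.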

Note: The ascend sampling backend does not currently support deterministic sampling, but it would be feasible to implement it in a similar way to the pytorch backend. We will open another PR once the overall NPU support work for deterministic inference is finished.

Accuracy Tests

Included in the unit test file test/srt/ascend/test_ascend_pytorch_sampling_backend.py. We use the same test cases as test/srt/test_pytorch_sampling_backend.py.

Benchmarking and Profiling

By default, sglang.bench_serving sets temperature to 0. To benchmark sampling, we modify the script by adding the Qwen3 recommended sampling parameters to the sglang API payload,

async def async_request_sglang_generate(...) -> RequestFuncOutput:
    ...
        payload = {
            ...
            "sampling_params": {
                "temperature": 0.6,
                "top_k": 20,
                "top_p": 0.95,
                ...
            },
            ...
        }
    ...

and use the following test command,

python3 -m sglang.bench_serving \
        --model Qwen/Qwen3-32B --apply-chat-template --flush-cache \
        --num-prompt 128 --max-concurrency 128 \
        --dataset-name random-ids --random-input-len 512 --random-output-len 512 --random-range-ratio 1.0

We run the test on a single Atlas 800T A2 server, serving the Qwen3-32B model on two NPUs.

# For pytorch
STREAMS_PER_DEVICE=32 python -m sglang.launch_server --model-path /model/qwen3_32b --tp 2 \
    --attention-backend ascend --sampling-backend pytorch --mem-fraction-static 0.85

# For ascend
STREAMS_PER_DEVICE=32 python -m sglang.launch_server --model-path /model/qwen3_32b --tp 2 \
    --attention-backend ascend --sampling-backend ascend --mem-fraction-static 0.85

The results show a large speedup with the ascend sampling backend.
At bs=128, E2E ITL drops from 361 ms to 91 ms, and sampling time drops from 274 ms to 5.4 ms.
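
As a quick sanity check on these figures (plain arithmetic on the numbers quoted above, not a new measurement), the sketch below derives the speedup factors and the per-token time left once sampling is subtracted from ITL. The variable names are illustrative.

```python
# Figures quoted above, in milliseconds (bs=128).
itl_pytorch, itl_ascend = 361.0, 91.0
sampling_pytorch, sampling_ascend = 274.0, 5.4

# Speedup factors of the ascend backend over the pytorch backend.
itl_speedup = itl_pytorch / itl_ascend
sampling_speedup = sampling_pytorch / sampling_ascend

# Per-token time that is not spent in sampling.
other_pytorch = itl_pytorch - sampling_pytorch
other_ascend = itl_ascend - sampling_ascend

print(f"ITL speedup:      {itl_speedup:.2f}x")
print(f"Sampling speedup: {sampling_speedup:.1f}x")
print(f"Non-sampling time: {other_pytorch:.1f} ms vs {other_ascend:.1f} ms")
```

The non-sampling remainder is nearly the same in both runs (87 ms vs. 85.6 ms), which is consistent with the ITL improvement coming almost entirely from sampling.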

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.

Alexhaoge · 2025-11-05T10:16:50Z

+            # For ascend backend, softmax is not needed before sampling
+            if not get_global_server_args().sampling_backend == "ascend" or (
+                return_logprob and not SGLANG_RETURN_ORIGINAL_LOGPROB
+            ):
+                logits[:] = torch.softmax(logits, dim=-1)


Because softmax of the raw logits is included in the fused op, the initial softmax can be skipped if return_logprob is false. The condition is copied from here:

sglang/python/sglang/srt/layers/sampler.py

Lines 158 to 169 in dc4f541

if return_logprob:
    if get_global_server_args().rl_on_policy_target == "fsdp":
        logprobs = logprobs_via_logsoftmax_kernel
        del logprobs_via_logsoftmax_kernel
    # clamp to avoid -inf
    elif SGLANG_RETURN_ORIGINAL_LOGPROB:
        logprobs = torch.log(probs_without_temp_scaling).clamp(
            min=torch.finfo(probs_without_temp_scaling.dtype).min
        )
        del probs_without_temp_scaling
    else:
        logprobs = torch.log(probs).clamp(min=torch.finfo(probs.dtype).min)

But if (return_logprob and not SGLANG_RETURN_ORIGINAL_LOGPROB) is True here, won't it perform two softmax operations?

Originally softmax was always executed.

sglang/python/sglang/srt/layers/sampler.py

Lines 114 to 115 in 0296f1c

logits.div_(sampling_info.temperatures)
logits[:] = torch.softmax(logits, dim=-1)

We want to skip this softmax whenever possible when using the ascend sampling backend, to speed up sampling. The if-clause condition is designed so that:

- if the sampling backend is not ascend, softmax is always executed;
- if the sampling backend is ascend, softmax is retained when the last branch of the return_logprob conditional clause is taken, because that branch computes logprobs from probs and needs probs to be correct.

sglang/python/sglang/srt/layers/sampler.py

Lines 168 to 169 in 0296f1c

else:
    logprobs = torch.log(probs).clamp(min=torch.finfo(probs.dtype).min)

This case arises only when using ascend sampling with (return_logprob and not SGLANG_RETURN_ORIGINAL_LOGPROB) being True, and yes, in that case there will be two softmax operations. This scenario is rather uncommon, and top_p/top_k sampling indexing will be its performance bottleneck, so it should be acceptable to compromise a little for the sake of correctness.

This does not affect other sampling backends: they always execute softmax once at the start and have no softmax operations inside their kernel implementations.

The code is just wrong when return_logprob and not SGLANG_RETURN_ORIGINAL_LOGPROB and ascend sampling.

You can never push incorrect code. You should raise exceptions.

Thank you for the feedback. I checked the code again and agree that the if-clause and the previous code comment can be quite confusing, especially for the corner cases on NPU with the ascend sampling backend.

Other backend behaviors

First, I would like to clarify that the changes did not affect the original behavior of other backends. The default sampling_backend is flashinfer or pytorch. The sampling backend will not be ascend unless users explicitly set --sampling-backend ascend in the launch command (even on NPU).

So for flashinfer and pytorch, the first condition

not get_global_server_args().sampling_backend == "ascend"

will be True, and short-circuiting guarantees that the entire condition is True, so there is one and only one softmax before sampling, exactly as in the code before this PR.
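
The behavior of this condition can be checked by enumerating its inputs. The sketch below mirrors the boolean expression from the diff in a standalone function; needs_pre_softmax and its parameter names are hypothetical stand-ins for get_global_server_args().sampling_backend, return_logprob, and SGLANG_RETURN_ORIGINAL_LOGPROB, and the block does not import sglang.

```python
from itertools import product


def needs_pre_softmax(sampling_backend, return_logprob, return_original_logprob):
    """Standalone mirror of the if-clause: True means softmax runs before sampling."""
    return sampling_backend != "ascend" or (
        return_logprob and not return_original_logprob
    )


# Enumerate every combination of backend and logprob flags.
table = {
    (backend, lp, orig): needs_pre_softmax(backend, lp, orig)
    for backend, lp, orig in product(
        ["flashinfer", "pytorch", "ascend"], [False, True], [False, True]
    )
}

for (backend, lp, orig), result in table.items():
    print(f"{backend:10s} return_logprob={lp!s:5s} original_logprob={orig!s:5s} -> {result}")
```

For flashinfer and pytorch every row is True. For ascend, only the row with return_logprob=True and the original-logprob flag off is True, which is the case discussed in this thread.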

Considerations for ascend sampling backend

NPU has poor performance with the pytorch sampling backend, mostly because of [:] indexing, as shown in the profiling screenshots above, so we use the fused op npu_top_k_top_p or a naive implementation using masked_fill_. We would like the ascend sampling backend to work for most cases on NPU.

The fused op npu_top_k_top_p has a softmax inside, before top-p filtering. When using this fused op, the softmax before sampling can be skipped. I wrote the if-clause for this scenario while keeping the softmax when return_logprob requires it.

When using the torch native implementation (as npu_top_k_top_p cannot take top_ks>1024), I added a softmax operation inside the ascend sampling backend.

sglang/python/sglang/srt/layers/sampler.py

Lines 327 to 329 in dda2b8d

else:
    probs = torch.softmax(probs, dim=-1)
    probs_sort, probs_idx = probs.sort(dim=-1, descending=True)

When falling back to this implementation with (return_logprob and not SGLANG_RETURN_ORIGINAL_LOGPROB) set, there are indeed two softmax operations: one in logits post-processing and one before top-k sampling. This does not affect the correctness of logprob, and the top-k/top-p filtering should theoretically be the same. However, I understand that this is incorrect for the model structure, and the duplicated softmax should be removed.
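
To make the effect of a duplicated softmax concrete, here is a toy pure-Python sketch (not the sampler code; softmax, ranking, and nucleus_size are hypothetical helpers and the logits are made up). Softmax is monotonic, so applying it twice preserves the token ranking, but it flattens the distribution, so the number of tokens needed to reach a given top-p mass can change.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def ranking(values):
    """Token indices ordered from largest to smallest value."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)


def nucleus_size(probs, top_p):
    """How many of the most likely tokens are needed to reach top_p mass."""
    cumulative = 0.0
    for count, p in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += p
        if cumulative >= top_p:
            return count
    return len(probs)


logits = [2.0, 1.0, 0.5, -1.0]  # made-up values
once = softmax(logits)
twice = softmax(once)  # softmax applied to probabilities, as in the duplicated case

print(ranking(once) == ranking(twice))
print(nucleus_size(once, 0.75), nucleus_size(twice, 0.75))
```

In this toy case the ranking is identical, so a top-k cut selects the same tokens, while the top-p nucleus grows after the second softmax, which supports removing the duplicate.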

Possible solutions

Immediate solution

Drop the npu_top_k_top_p implementation, use the torch native implementation only, and remove the latter softmax inside top_k_top_p_min_p_sampling_from_probs_ascend. In that case the if-clause can be removed, and the ascend sampling backend becomes consistent with other backends in the logit post-processing part.

Dropping npu_top_k_top_p is tolerable because the main performance gap lies in tensor indexing [:] vs. masked_fill_.
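
The two styles being compared can be sketched as follows. This is a minimal CPU example, assuming the indexing in question is boolean-mask assignment on the sorted probabilities; the function names are hypothetical, and the block shows only that both forms produce the same result, not their relative speed on NPU.

```python
import torch


def mask_top_k_indexing(probs_sort: torch.Tensor, top_ks: torch.Tensor) -> torch.Tensor:
    """Zero out entries beyond each row's top-k using boolean-mask index assignment."""
    out = probs_sort.clone()
    ranks = torch.arange(out.shape[-1], device=out.device).view(1, -1)
    out[ranks >= top_ks.view(-1, 1)] = 0.0
    return out


def mask_top_k_masked_fill(probs_sort: torch.Tensor, top_ks: torch.Tensor) -> torch.Tensor:
    """Same masking, expressed with the in-place masked_fill_ op."""
    out = probs_sort.clone()
    ranks = torch.arange(out.shape[-1], device=out.device).view(1, -1)
    out.masked_fill_(ranks >= top_ks.view(-1, 1), 0.0)
    return out


# Rows are already sorted in descending order; values are illustrative.
probs_sort = torch.tensor([[0.5, 0.25, 0.25], [0.5, 0.375, 0.125]])
top_ks = torch.tensor([1, 2])

by_indexing = mask_top_k_indexing(probs_sort, top_ks)
by_masked_fill = mask_top_k_masked_fill(probs_sort, top_ks)
print(torch.equal(by_indexing, by_masked_fill))
```

Since the outputs match, switching to masked_fill_ changes only how the mask is applied, which is why the fallback path can keep the same sampling logic.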

Future plans

A new fused op for NPU top-k/top-p/min-p sampling is under development ([feat]add apply_top_k_top_p_min_p op, sgl-kernel-npu#340). We will rework the ascend sampling backend with it for better performance. The new version will also have clearer logic and be consistent with other backends in the Sampler.forward part.

You need to make code always correct and readable. The readability should be there regardless of whether it changed the behavior of other backends or not.

Refactor here: #18915

hnyls2002 · 2025-11-13T16:31:15Z

@ping1jing2 When you add a new test file under test/ folder, you should add it to the run_suite.py

ping1jing2 · 2025-11-14T06:16:53Z

@ping1jing2 When you add a new test file under test/ folder, you should add it to the run_suite.py

ok, thanks for your reminder and we are trying

The NPU CI environments currently do not cache Llama3.1-8B.

Co-authored-by: ronnie_zheng <zl19940307@163.com>

Alexhaoge changed the title ~~Feat: add ascend sampling backend~~ [Ascend][Feat] Add Ascend sampling backend Nov 5, 2025


Feat: add ascend sampling backend

dda2b8d

Alexhaoge force-pushed the sampling branch from 41eec83 to dda2b8d November 6, 2025 08:48

github-actions bot added the documentation label Nov 6, 2025

Alexhaoge marked this pull request as ready for review November 6, 2025 09:16

Alexhaoge requested review from BBuf, Edwardf0t1, HaiShaw, Ying1123, ch-wan, ispobock, kushanam, merrymercy, ping1jing2 and zhyncs as code owners November 6, 2025 09:16

ping1jing2 added the run-ci label Nov 12, 2025

Merge branch 'main' into sampling

c5f4334

ping1jing2 requested a review from Fridge003 as a code owner November 12, 2025 06:24

hnyls2002 assigned hnyls2002, ping1jing2 and Alexhaoge Nov 13, 2025

Alexhaoge and others added 6 commits November 14, 2025 14:54

Merge remote-tracking branch 'upstream' into sampling

b8f433e

Add UT to run_suite.py

fb67ca3

Merge branch 'main' into sampling

f8a3311

Merge branch 'main' into sampling

d674d02

Change model weight path to fix CI

25753ce

The NPU CI environments currently do not cache Llama3.1-8B.

Merge branch 'main' into sampling

9961fbe

hnyls2002 merged commit 10592e9 into sgl-project:main Nov 15, 2025
51 of 61 checks passed

ping1jing2 changed the title ~~[Ascend][Feat] Add Ascend sampling backend~~ [NPU][Feat] Add Ascend sampling backend Dec 31, 2025

ping1jing2 deleted the sampling branch May 4, 2026 07:23

0826joyce pushed a commit to 0826joyce/sglang-perf-opt that referenced this pull request May 19, 2026

[Ascend][Feat] Add Ascend sampling backend (sgl-project#12692)

5ba1f42

Co-authored-by: ronnie_zheng <zl19940307@163.com>

hnyls2002 merged 8 commits into sgl-project:main from ping1jing2:sampling

Review thread comments, in order: Alexhaoge (Nov 5, 2025), Qiaolin-Yu (Dec 21, 2025), Alexhaoge (Dec 22, 2025, edited), merrymercy (Feb 10, 2026, edited), Alexhaoge (Feb 12, 2026), merrymercy (Feb 17, 2026, edited).

5 participants
