[NPU][Feat] Add Ascend sampling backend#12692
    # For ascend backend, softmax is not needed before sampling
    if not get_global_server_args().sampling_backend == "ascend" or (
        return_logprob and not SGLANG_RETURN_ORIGINAL_LOGPROB
    ):
        logits[:] = torch.softmax(logits, dim=-1)
There was a problem hiding this comment.
Because softmax of the raw logits is included in the fused op, the initial softmax can be skipped if return_logprob is false. Condition is copied from here
sglang/python/sglang/srt/layers/sampler.py
Lines 158 to 169 in dc4f541
There was a problem hiding this comment.
But if return_logprob and not SGLANG_RETURN_ORIGINAL_LOGPROB is True here, it will do two softmax operations?
There was a problem hiding this comment.
Originally softmax was always executed.
sglang/python/sglang/srt/layers/sampler.py
Lines 114 to 115 in 0296f1c
We want to skip this softmax whenever possible when using the ascend sampling backend, to speed things up. The if-clause condition is designed so that:
- if the sampling backend is not `ascend`, softmax is always executed;
- if the sampling backend is `ascend`, softmax is kept only where it is needed: the last branch in the `return_logprob` conditional clause computes `logprobs` based on `probs`, so we have to retain the softmax for the correctness of `probs`.
sglang/python/sglang/srt/layers/sampler.py
Lines 168 to 169 in 0296f1c
This branch is reached only when using ascend sampling and `return_logprob and not SGLANG_RETURN_ORIGINAL_LOGPROB` is True, and yes, in that case there will be two softmax operations. This scenario is rather uncommon, and top_p/top_k sampling indexing will be its performance bottleneck, so it should be okay to compromise a bit on speed for the sake of correctness.
Those will not affect other sampling backends as they will always execute softmax once at the start but do not have softmax operations inside their kernel implementation.
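The condition discussed above can be tabulated with a small standalone sketch. The function name `needs_pre_softmax` and its plain arguments are hypothetical stand-ins for `get_global_server_args().sampling_backend`, `return_logprob`, and `SGLANG_RETURN_ORIGINAL_LOGPROB`; this is not code from the PR, only an illustration of the boolean logic.

```python
def needs_pre_softmax(sampling_backend: str,
                      return_logprob: bool,
                      return_original_logprob: bool) -> bool:
    """Hypothetical mirror of the quoted if-condition, using plain arguments."""
    # `not backend == "ascend"` is the same test as `backend != "ascend"`.
    return sampling_backend != "ascend" or (
        return_logprob and not return_original_logprob
    )


# Enumerate every combination so the behaviour per backend is explicit.
TRUTH_TABLE = {
    (backend, rl, ro): needs_pre_softmax(backend, rl, ro)
    for backend in ("flashinfer", "pytorch", "ascend")
    for rl in (False, True)
    for ro in (False, True)
}

if __name__ == "__main__":
    for (backend, rl, ro), value in TRUTH_TABLE.items():
        print(f"{backend:10s} return_logprob={rl!s:5s} original={ro!s:5s} -> {value}")
```

In this sketch the only ascend row that keeps the pre-sampling softmax is `return_logprob=True` with `return_original_logprob=False`, which is the corner case debated in this thread.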
There was a problem hiding this comment.
The code is just wrong when return_logprob and not SGLANG_RETURN_ORIGINAL_LOGPROB and ascend sampling.
You can never push incorrect code. You should raise exceptions.
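One way the "raise exceptions" suggestion could look is sketched below. The function name `check_ascend_sampling_args`, its arguments, and the error message are all hypothetical; this is an assumption about the shape of such a guard, not what the PR or the later refactor actually does.

```python
def check_ascend_sampling_args(sampling_backend: str,
                               return_logprob: bool,
                               return_original_logprob: bool) -> None:
    """Hypothetical guard: reject the combination that would run softmax twice."""
    if (
        sampling_backend == "ascend"
        and return_logprob
        and not return_original_logprob
    ):
        raise ValueError(
            "ascend sampling backend: return_logprob without original "
            "logprobs would apply softmax twice; unsupported in this sketch"
        )


if __name__ == "__main__":
    check_ascend_sampling_args("ascend", False, False)  # accepted silently
    try:
        check_ascend_sampling_args("ascend", True, False)
    except ValueError as exc:
        print("rejected:", exc)
```

Failing early like this trades coverage of the corner case for a guarantee that no silently different code path is taken.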
There was a problem hiding this comment.
Thank you for the feedback. I checked the code again and agree that the if-clause and the previous code comment can be quite confusing, especially for the corner cases on NPU with the ascend sampling backend.
Other backend behaviors
First, I would like to clarify that the changes do not affect the original behavior of other backends. The default sampling_backend is flashinfer or pytorch. The sampling backend will not be ascend unless users explicitly set --sampling-backend ascend in the launch command (even on NPU).
So for flashinfer and pytorch, the first condition
not get_global_server_args().sampling_backend == "ascend"
will be True, and short-circuiting guarantees that the entire condition is True, so there will be one and only one softmax before sampling, exactly as in the code before this PR.
Considerations for ascend sampling backend
- NPU has poor performance with the pytorch sampling backend, mostly because of `[:]` indexing, as shown in the profiling screenshots above, so we use the fused op `npu_top_k_top_p`, or a naive implementation based on `masked_fill_`. We would like to make the ascend sampling backend work for most cases on NPU.
- The fused op `npu_top_k_top_p` has a softmax inside, before top-p filtering. When using this fused op, the softmax before sampling can be skipped. I wrote the if-clause for this scenario while keeping the softmax when return_logprob requires it.
- When using the torch native implementation (as `npu_top_k_top_p` cannot take `top_ks>1024`), I added a softmax operation inside the ascend sampling backend.
sglang/python/sglang/srt/layers/sampler.py
Lines 327 to 329 in dda2b8d
When falling back to this implementation together with (return_logprob and not SGLANG_RETURN_ORIGINAL_LOGPROB), there are indeed two softmax operations: one in logits post-processing and one before top-k sampling. This does not affect the correctness of logprob, and the filtering of top-k/top-p theoretically should be the same. However, I understand that this is incorrect for the model structure, and the duplicated softmax should be removed.
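A small numeric sketch can make the effect of a duplicated softmax concrete. The logits and the `top_p` value are toy assumptions, and `nucleus_size` is a simplified stand-in for top-p filtering, not the SGLang or NPU kernel. Softmax is monotonic, so the token ranking (and hence a top-k set) is unchanged, but the probability values themselves are flattened, which is what a cumulative-mass threshold operates on.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def ranking(xs):
    """Indices sorted from largest to smallest value."""
    return sorted(range(len(xs)), key=lambda i: xs[i], reverse=True)


def nucleus_size(probs, top_p):
    """Smallest count of highest-probability tokens whose mass reaches top_p."""
    cumulative = 0.0
    for count, p in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += p
        if cumulative >= top_p:
            return count
    return len(probs)


logits = [2.0, 1.0, 0.1]   # toy values, chosen only for illustration
once = softmax(logits)     # what a single softmax produces
twice = softmax(once)      # what a duplicated softmax produces

if __name__ == "__main__":
    print("once :", [round(p, 3) for p in once])
    print("twice:", [round(p, 3) for p in twice])
    print("same ranking:", ranking(once) == ranking(twice))
    print("nucleus sizes at top_p=0.6:",
          nucleus_size(once, 0.6), nucleus_size(twice, 0.6))
```

In this toy case the ranking is identical while the nucleus for the same `top_p` covers a different number of tokens, which supports removing the duplicated softmax.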
Possible solutions
- Immediate solution
  - Drop the `npu_top_k_top_p` implementation, use the torch native implementation only, and remove the latter softmax inside `top_k_top_p_min_p_sampling_from_probs_ascend`. In that case, the if-clause can be removed, and the ascend sampling backend becomes consistent with other backends in the logit post-processing part.
  - Dropping `npu_top_k_top_p` is tolerable because the main performance gap lies in tensor indexing `[:]` vs. `masked_fill_`.
- Future plans
  - A new fused op for NPU top-k/top-p/min-p sampling is under development ([feat]add apply_top_k_top_p_min_p op sgl-kernel-npu#340). We will rework the ascend sampling backend with it for better performance. The new version will also have clearer logic and be consistent with other backends in the `Sampler.forward` part.
There was a problem hiding this comment.
You need to make the code always correct and readable. The readability should be there regardless of whether it changes the behavior of other backends or not.
Refactor here: #18915
@ping1jing2 When you add a new test file under |
ok, thanks for your reminder and we are trying |
The NPU CI environments currently do not cache Llama3.1-8B.
Co-authored-by: ronnie_zheng <zl19940307@163.com>
Co-author: @Alexhaoge @shuy98
Motivation
Currently, Ascend NPU only supports the `pytorch` backend. The sampling performance degrades sharply under large batch sizes (>100). We propose adding an `ascend` sampling backend for better performance on NPU.
Modifications
The main improvement comes from `torch_npu.npu_top_k_top_p` and `torch.Tensor.masked_fill_`.
- For `top_ks<=1024`, use the fused op `torch_npu.npu_top_k_top_p` for top-k & top-p sampling. The return value of this op is the filtered logits.
  - The softmax before sampling is skipped when `return_logprob` is false.
  - `torch_npu.npu_top_k_top_p` applies top-p filtering on top of the top-k results. This is a little different from the pytorch backend, but does not affect accuracy since the final answer is the intersection of top-k and top-p filtering.
- For `top_ks>1024`, fall back to the native torch interface, but use `masked_fill_` instead of naive tensor indexing.
  - `[:]` array-style indexing is the root cause of the sampling performance degradation on NPU with large batch sizes.
Note: Currently the ascend sampling backend does not support deterministic sampling, but it is feasible to implement in a similar way to the pytorch backend. We will create another PR once the NPU support work for deterministic inference is finished as a whole.
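The "top-p on top of top-k results, returning filtered logits" behaviour described above can be sketched in plain Python. This is only an illustration under assumptions: the function name `top_k_then_top_p` is hypothetical, the exact cut-off rule of `torch_npu.npu_top_k_top_p` (for example, whether the token that crosses the threshold is kept) is assumed rather than taken from its documentation, and `top_k >= 1` is assumed.

```python
import math

NEG_INF = float("-inf")


def top_k_then_top_p(logits, top_k, top_p):
    """Return logits with everything outside the top-k/top-p set set to -inf."""
    # Step 1: keep the indices of the top_k largest logits, best first.
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    kept = order[:top_k]

    # Step 2: softmax over the survivors only (this is the internal softmax).
    m = max(logits[i] for i in kept)
    exps = {i: math.exp(logits[i] - m) for i in kept}
    total = sum(exps.values())

    # Step 3: walk the survivors in descending order and stop once the
    # cumulative probability reaches top_p (the crossing token is kept).
    filtered = [NEG_INF] * len(logits)
    cumulative = 0.0
    for i in kept:
        filtered[i] = logits[i]
        cumulative += exps[i] / total
        if cumulative >= top_p:
            break
    return filtered


if __name__ == "__main__":
    print(top_k_then_top_p([2.0, 1.0, 0.1, -1.0], top_k=3, top_p=0.8))
```

A caller would then sample from the softmax of the returned logits; masked entries contribute zero probability because `exp(-inf)` is 0.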
Accuracy Tests
Included in the unit test file `test/srt/ascend/test_ascend_pytorch_sampling_backend.py`. We use the same test cases as `test/srt/test_pytorch_sampling_backend.py`.

Benchmarking and Profiling
By default, `sglang.bench_serving` sets the temperature to 0. To benchmark sampling, we modify the script by adding the Qwen3 recommended sampling parameters to the sglang API call, and use the following test command,
We perform the test on a single Atlas 800T A2 server and use Qwen3-32B model on two NPUs.
The results show a large speedup with the ascend sampling backend.

bs=128 E2E ITL 361ms -> 91ms, sampling time 274ms -> 5.4ms
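The reported numbers can be turned into speedup factors with a few lines of arithmetic. The share-of-ITL figures below assume that the reported sampling time is contained within the reported ITL, which the summary line does not state explicitly.

```python
def speedup(before_ms: float, after_ms: float) -> float:
    """Ratio of the old latency to the new latency."""
    return before_ms / after_ms


# Figures quoted above for bs=128.
itl_before, itl_after = 361.0, 91.0
sampling_before, sampling_after = 274.0, 5.4

itl_speedup = speedup(itl_before, itl_after)
sampling_speedup = speedup(sampling_before, sampling_after)

# Assumption: sampling time is part of ITL, so these are fractions of ITL.
sampling_share_before = sampling_before / itl_before
sampling_share_after = sampling_after / itl_after

if __name__ == "__main__":
    print(f"ITL speedup:      {itl_speedup:.2f}x")
    print(f"sampling speedup: {sampling_speedup:.1f}x")
    print(f"sampling share of ITL: {sampling_share_before:.0%} -> "
          f"{sampling_share_after:.0%}")
```

Under that assumption, sampling drops from roughly three quarters of the per-token latency to a few percent.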