
[NPU][Feat] Add Ascend sampling backend#12692

Merged
hnyls2002 merged 8 commits into
sgl-project:mainfrom
ping1jing2:sampling
Nov 15, 2025

Contributor

@Alexhaoge Alexhaoge commented Nov 5, 2025

Co-author: @Alexhaoge @shuy98

Motivation

Currently, Ascend NPU supports only the pytorch sampling backend, and its sampling performance degrades sharply at large batch sizes (>100). We propose adding an ascend sampling backend for better performance on NPU.

Modifications

The main gains come from torch_npu.npu_top_k_top_p and torch.Tensor.masked_fill_.

  • If top_ks<=1024, use the fused op torch_npu.npu_top_k_top_p for top-k and top-p sampling. This op returns the filtered logits.
    • Because the fused op includes the softmax of the raw logits, the initial softmax can be skipped if return_logprob is false.
    • torch_npu.npu_top_k_top_p applies top-p filtering on top of the top-k results. This is a little different from the pytorch backend, but does not affect accuracy, since the final answer is the intersection of top-k and top-p filtering.
  • If top_ks>1024, fall back to the native torch interface, but use masked_fill_ instead of naive tensor indexing. [:] array-style indexing is the root cause of the sampling performance degradation on NPU at large batch sizes.
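To illustrate the native fallback idea, here is a minimal sketch in plain PyTorch (run on CPU) of top-k/top-p filtering that uses masked_fill_ rather than boolean index assignment. It is not the code from this PR: the helper name filter_top_k_top_p and the sample values are hypothetical, and it uses a common sort-and-cumsum formulation.

```python
import torch


def filter_top_k_top_p(
    probs: torch.Tensor, top_ks: torch.Tensor, top_ps: torch.Tensor
) -> torch.Tensor:
    """Zero out tokens outside the per-row top-k / top-p sets (hypothetical helper)."""
    probs_sort, probs_idx = probs.sort(dim=-1, descending=True)
    probs_sum = torch.cumsum(probs_sort, dim=-1)
    ranks = torch.arange(probs.shape[-1], device=probs.device).view(1, -1)

    # Entries to drop: rank is beyond top-k, or the cumulative mass
    # *before* this token already exceeds top-p.
    drop = (ranks >= top_ks.view(-1, 1)) | (
        (probs_sum - probs_sort) > top_ps.view(-1, 1)
    )

    # In-place masked fill, instead of `probs_sort[drop] = 0.0`.
    probs_sort.masked_fill_(drop, 0.0)

    # Scatter the filtered values back to their original vocabulary positions.
    return torch.zeros_like(probs).scatter_(-1, probs_idx, probs_sort)


# Illustrative values only: one request, vocabulary of four tokens.
probs = torch.tensor([[0.15, 0.50, 0.05, 0.30]])
filtered = filter_top_k_top_p(probs, torch.tensor([3]), torch.tensor([0.7]))
print(filtered)
```

Both conditions are combined into one mask before any mutation, so the kept set is the intersection of the top-k and top-p sets.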

Note: Currently the ascend sampling backend does not support deterministic sampling, but it is feasible to implement in a similar way to the pytorch backend. We will create another PR once NPU support for deterministic inference is finished as a whole.

Accuracy Tests

Included in the unit test file test/srt/ascend/test_ascend_pytorch_sampling_backend.py. We use the same test cases as test/srt/test_pytorch_sampling_backend.py.

Benchmarking and Profiling

By default, sglang.bench_serving sets the temperature to 0. To benchmark sampling, we modify the script to add the Qwen3 recommended sampling parameters to the sglang API payload:

async def async_request_sglang_generate(...) -> RequestFuncOutput:
    ...
        payload = {
            ...
            "sampling_params": {
                "temperature": 0.6,
                "top_k": 20,
                "top_p": 0.95,
                ...
            },
            ...
        }
    ...

and use the following test command,

python3 -m sglang.bench_serving \
        --model Qwen/Qwen3-32B --apply-chat-template --flush-cache \
        --num-prompt 128 --max-concurrency 128 \
        --dataset-name random-ids --random-input-len 512 --random-output-len 512 --random-range-ratio 1.0 

We ran the test on a single Atlas 800T A2 server, serving the Qwen3-32B model on two NPUs.

# For pytorch
STREAMS_PER_DEVICE=32 python -m sglang.launch_server --model-path /model/qwen3_32b --tp 2 \
    --attention-backend ascend --sampling-backend pytorch --mem-fraction-static 0.85

# For ascend
STREAMS_PER_DEVICE=32 python -m sglang.launch_server --model-path /model/qwen3_32b --tp 2 \
    --attention-backend ascend --sampling-backend ascend --mem-fraction-static 0.85

The results show a large speedup with the ascend sampling backend.
At bs=128, E2E ITL drops from 361 ms to 91 ms, and sampling time drops from 274 ms to 5.4 ms.
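As a quick sanity check on these numbers, the arithmetic below derives the speedup ratios and the share of each inter-token latency spent in sampling. It assumes the reported sampling time is per decoding step and therefore directly comparable to ITL; the variable names are illustrative.

```python
# Figures reported above for bs=128, in milliseconds.
itl_pytorch, itl_ascend = 361.0, 91.0
sampling_pytorch, sampling_ascend = 274.0, 5.4

itl_speedup = itl_pytorch / itl_ascend                  # about 3.97x
sampling_speedup = sampling_pytorch / sampling_ascend   # about 50.7x

# Share of each inter-token latency spent in sampling
# (assumes sampling time is measured per decoding step).
share_pytorch = sampling_pytorch / itl_pytorch          # about 0.76
share_ascend = sampling_ascend / itl_ascend             # about 0.06

# The ITL saving (270 ms) is almost entirely the sampling saving (268.6 ms).
itl_saved = itl_pytorch - itl_ascend
sampling_saved = sampling_pytorch - sampling_ascend

print(f"ITL speedup: {itl_speedup:.2f}x, sampling speedup: {sampling_speedup:.1f}x")
print(f"sampling share of ITL: {share_pytorch:.0%} -> {share_ascend:.0%}")
```

Under that assumption, sampling accounts for roughly three quarters of the ITL with the pytorch backend and only a few percent with the ascend backend.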

@Alexhaoge Alexhaoge changed the title Feat: add ascend sampling backend [Ascend][Feat] Add Ascend sampling backend Nov 5, 2025
Comment on lines +118 to +122
# For ascend backend, softmax is not needed before sampling
if not get_global_server_args().sampling_backend == "ascend" or (
    return_logprob and not SGLANG_RETURN_ORIGINAL_LOGPROB
):
    logits[:] = torch.softmax(logits, dim=-1)
Contributor Author

Because the softmax of the raw logits is included in the fused op, the initial softmax can be skipped if return_logprob is false. The condition is copied from here:

if return_logprob:
    if get_global_server_args().rl_on_policy_target == "fsdp":
        logprobs = logprobs_via_logsoftmax_kernel
        del logprobs_via_logsoftmax_kernel
    # clamp to avoid -inf
    elif SGLANG_RETURN_ORIGINAL_LOGPROB:
        logprobs = torch.log(probs_without_temp_scaling).clamp(
            min=torch.finfo(probs_without_temp_scaling.dtype).min
        )
        del probs_without_temp_scaling
    else:
        logprobs = torch.log(probs).clamp(min=torch.finfo(probs.dtype).min)

Collaborator

But if return_logprob and not SGLANG_RETURN_ORIGINAL_LOGPROB is True here, won't it perform two softmax operations?

Contributor Author

@Alexhaoge Alexhaoge Dec 22, 2025

Originally softmax was always executed.

logits.div_(sampling_info.temperatures)
logits[:] = torch.softmax(logits, dim=-1)

We want to skip this softmax whenever possible when using the ascend sampling backend, to speed up sampling. The if-clause condition is designed so that:

  1. if the sampling backend is not ascend, softmax is always executed;
  2. if the sampling backend is ascend, softmax is retained when it is needed for the correctness of probs, because the last branch in the return_logprob conditional clause computes logprobs from probs.
    else:
        logprobs = torch.log(probs).clamp(min=torch.finfo(probs.dtype).min)

This branch is reached only when using ascend sampling and return_logprob and not SGLANG_RETURN_ORIGINAL_LOGPROB is True, and yes, in that case there will be two softmax operations. This scenario is rather uncommon, and top_p/top_k sampling indexing will be its performance bottleneck, so it should be acceptable to compromise a bit on speed for the sake of correctness.

These changes do not affect other sampling backends: they always execute softmax once at the start and have no softmax operations inside their kernel implementations.
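To make the condition easier to check by eye, the sketch below enumerates it as a truth table. The function runs_pre_sampling_softmax is a hypothetical stand-in that takes the three inputs as plain arguments; in the real code they come from get_global_server_args().sampling_backend, return_logprob, and SGLANG_RETURN_ORIGINAL_LOGPROB.

```python
from itertools import product


def runs_pre_sampling_softmax(
    sampling_backend: str, return_logprob: bool, return_original_logprob: bool
) -> bool:
    """Mirror of the reviewed if-condition; the function name is hypothetical."""
    return not sampling_backend == "ascend" or (
        return_logprob and not return_original_logprob
    )


# Enumerate every combination of backend and the two logprob flags.
table = {
    args: runs_pre_sampling_softmax(*args)
    for args in product(("pytorch", "ascend"), (False, True), (False, True))
}

for (backend, logprob, original), softmax in table.items():
    print(
        f"{backend:8} return_logprob={logprob!s:5} "
        f"original={original!s:5} -> softmax={softmax}"
    )
```

For any non-ascend backend the result is always True. For ascend it is True only when return_logprob is True and the original-logprob flag is False, which is the case discussed in this thread.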

Contributor

@merrymercy merrymercy Feb 10, 2026

The code is just wrong when return_logprob and not SGLANG_RETURN_ORIGINAL_LOGPROB and ascend sampling.

You can never push incorrect code. You should raise exceptions.
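One possible reading of "raise exceptions" is an explicit guard that rejects the unsupported combination instead of silently running a second softmax. The sketch below is only an illustration of that idea: check_ascend_sampling_args, its parameters, and the error message are all hypothetical and do not come from the PR or from the later refactor.

```python
def check_ascend_sampling_args(
    sampling_backend: str, return_logprob: bool, return_original_logprob: bool
) -> None:
    """Hypothetical guard: fail loudly on the combination discussed in this thread."""
    if (
        sampling_backend == "ascend"
        and return_logprob
        and not return_original_logprob
    ):
        raise NotImplementedError(
            "ascend sampling backend does not support return_logprob on "
            "temperature-scaled probabilities; use --sampling-backend pytorch"
        )


# Supported combinations pass silently.
check_ascend_sampling_args("pytorch", True, False)
check_ascend_sampling_args("ascend", False, False)

# The disputed combination is rejected.
try:
    check_ascend_sampling_args("ascend", True, False)
except NotImplementedError as exc:
    print(f"rejected: {exc}")
```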

Contributor Author

Thank you for the feedback. I checked the code again and agree that the if-clause and the previous code comment can be quite confusing, especially for the corner cases on NPU with the ascend sampling backend.

Other backend behaviors

First, I would like to clarify that the changes did not affect the original behavior of the other backends. The default sampling_backend is flashinfer or pytorch. The sampling backend will not be ascend unless users explicitly set --sampling-backend ascend in the launch command (even on NPU).

So for flashinfer and pytorch, the first condition

not get_global_server_args().sampling_backend == "ascend"

will be True, and short-circuiting guarantees that the entire condition is True, so there is one and only one softmax before sampling, exactly as in the code before this PR.

Considerations for ascend sampling backend

  1. NPU performs poorly with the pytorch sampling backend, mostly because of [:] indexing, as shown in the profiling screenshots above, so we use the fused op npu_top_k_top_p, or a naive implementation based on masked_fill_. We would like the ascend sampling backend to work for most cases on NPU.
  2. The fused op npu_top_k_top_p has a softmax inside, before top-p filtering. When using this fused op, the softmax before sampling can be skipped. I wrote the if-clause for this scenario while keeping the softmax when it is needed for return_logprob.
  3. When using the torch native implementation (as npu_top_k_top_p cannot take top_ks>1024), I added a softmax operation inside the ascend sampling backend.
    else:
        probs = torch.softmax(probs, dim=-1)
        probs_sort, probs_idx = probs.sort(dim=-1, descending=True)

    When falling back to this implementation with (return_logprob and not SGLANG_RETURN_ORIGINAL_LOGPROB), there are indeed two softmax operations: one in logits post-processing and one before top-k sampling. This does not affect the correctness of logprob, and the filtering of top-k/top-p theoretically should be the same. However, I understand that this is incorrect for the model structure, and the duplicated softmax should be removed.
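For reference, a small standalone sketch (pure Python, not from the PR) of what a second softmax does to a distribution. Softmax is monotonic, so the ranking and therefore the top-k set are preserved. The probabilities are flattened, though, so the set kept by a top-p cutoff can differ. The logits and the top-p value below are illustrative only.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def ranking(xs):
    """Indices sorted by descending value."""
    return sorted(range(len(xs)), key=lambda i: -xs[i])


def nucleus_size(probs, top_p):
    """Count tokens kept by top-p: keep a token while the mass before it is <= top_p."""
    kept, mass_before = 0, 0.0
    for p in sorted(probs, reverse=True):
        if mass_before > top_p:
            break
        kept += 1
        mass_before += p
    return kept


logits = [4.0, 2.0, 1.0, 0.0]  # illustrative values
once = softmax(logits)         # roughly [0.83, 0.11, 0.04, 0.02]
twice = softmax(once)          # roughly [0.42, 0.20, 0.19, 0.19]

print("same ranking:", ranking(once) == ranking(twice))
print("top-p=0.9 keeps", nucleus_size(once, 0.9), "vs", nucleus_size(twice, 0.9))
```

In this example the ranking is identical after one or two softmax passes, but with top_p=0.9 the single-softmax distribution keeps two tokens while the double-softmax distribution keeps all four.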

Possible solutions

  1. Immediate solution

    • Drop the npu_top_k_top_p implementation, use only the torch native implementation, and remove the latter softmax inside top_k_top_p_min_p_sampling_from_probs_ascend. In that case the if-clause can be removed, and the ascend sampling backend becomes consistent with the other backends in the logit post-processing part.
    • Dropping npu_top_k_top_p is tolerable because the main performance gap lies in tensor indexing [:] vs. masked_fill_.
  2. Future plans

    • A new fused op for NPU top-k/top-p/min-p sampling is under development ([feat]add apply_top_k_top_p_min_p op sgl-kernel-npu#340). We will rework the ascend sampling backend with it for better performance. The new version will also have clearer logic and be consistent with the other backends in the Sampler.forward part.

Contributor

@merrymercy merrymercy Feb 17, 2026

You need to make code always correct and readable. The code should be readable regardless of whether it changed the behavior of other backends or not.

Refactor here: #18915

@github-actions github-actions Bot added the documentation Improvements or additions to documentation label Nov 6, 2025
@Alexhaoge Alexhaoge marked this pull request as ready for review November 6, 2025 09:16
@hnyls2002
Collaborator

@ping1jing2 When you add a new test file under test/ folder, you should add it to the run_suite.py

@ping1jing2
Collaborator

@ping1jing2 When you add a new test file under test/ folder, you should add it to the run_suite.py

ok, thanks for your reminder and we are trying

@hnyls2002 hnyls2002 merged commit 10592e9 into sgl-project:main Nov 15, 2025
51 of 61 checks passed
@ping1jing2 ping1jing2 changed the title [Ascend][Feat] Add Ascend sampling backend [NPU][Feat] Add Ascend sampling backend Dec 31, 2025
@ping1jing2 ping1jing2 deleted the sampling branch May 4, 2026 07:23
0826joyce pushed a commit to 0826joyce/sglang-perf-opt that referenced this pull request May 19, 2026
Co-authored-by: ronnie_zheng <zl19940307@163.com>

Labels

documentation Improvements or additions to documentation run-ci
