[Perf][vLLM-ATOM] Optimize Sparse MLA in vLLM-ATOM #765

Open
kliuae wants to merge 5 commits into ROCm:main from kliuae:kliuae/plugin_opt_dsa

Conversation

@kliuae
Contributor

@kliuae kliuae commented May 12, 2026

Motivation

This PR optimizes the sparse MLA attention path in vLLM-ATOM.

Technical Details

  • Enable the preshuffled indexer cache and add support for block_size 64
  • Use persistent MLA and add the corresponding workspace buffers
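
The second bullet refers to a common pattern: allocate scratch ("workspace") buffers once at startup and reuse them across decode steps, instead of paying an allocation on every call. The sketch below is purely illustrative — the class, names, and sizes are hypothetical and not the actual vLLM-ATOM code, which would allocate on-device (e.g. a HIP tensor) rather than in host memory.

```python
# Hypothetical sketch of a persistent workspace buffer, assuming fp16
# (2 bytes/element) scratch sized for a worst-case token count.
class PersistentMLAWorkspace:
    def __init__(self, max_tokens: int, head_dim: int):
        # Allocated once; a real implementation would use device memory.
        self.buf = bytearray(max_tokens * head_dim * 2)

    def view(self, num_tokens: int, head_dim: int) -> memoryview:
        # Hand out a view over the prefix needed this step; no new
        # allocation or free happens per decode step.
        need = num_tokens * head_dim * 2
        assert need <= len(self.buf), "workspace too small"
        return memoryview(self.buf)[:need]

ws = PersistentMLAWorkspace(max_tokens=8192, head_dim=576)
step1 = ws.view(128, 576)  # same underlying allocation...
step2 = ws.view(256, 576)  # ...reused on the next step
```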

Test Plan

Accuracy test with lm_eval on MI355X

Model: deepseek-ai/DeepSeek-V3.2

Server command:

ATOM_DISABLE_VLLM_PLUGIN=0 \
ATOM_DISABLE_VLLM_PLUGIN_ATTENTION=0 \
VLLM_ROCM_USE_AITER=1 \
vllm serve deepseek-ai/DeepSeek-V3.2 \
  -tp 8 \
  --gpu-memory-utilization 0.8 \
  --no-enable-prefix-caching \
  --disable-uvicorn-access-log \
  --trust-remote-code \
  --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
  --kv-cache-dtype fp8

lm_eval command:

lm_eval --model local-completions \
  --model_args model=deepseek-ai/DeepSeek-V3.2,base_url=http://localhost:8000/v1/completions,num_concurrent=64,tokenized_requests=False \
  --tasks gsm8k \
  --num_fewshot 20

Benchmark command:

vllm bench serve --model deepseek-ai/DeepSeek-V3.2 \
  --port 8000 \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --max-concurrency $CONC \
  --num-prompts $(( CONC * 10 )) \
  --ignore-eos \
  --temperature 0

Test Result

lm_eval

| Tasks | Version | Filter           | n-shot | Metric      | Value  | Stderr   |
|-------|---------|------------------|--------|-------------|--------|----------|
| gsm8k | 3       | flexible-extract | 20     | exact_match | 0.9500 | ± 0.0060 |
|       |         | strict-match     | 20     | exact_match | 0.9484 | ± 0.0061 |

Performance on MI355X, TP8

| ISL/OSL | Concurrency | KV Cache | PR req/s | Main req/s | PR over Main (req/s) | PR Total tok/s | Main Total tok/s | PR over Main (tok/s) |
|---------|-------------|----------|----------|------------|----------------------|----------------|------------------|----------------------|
| 1k/1k   | 64          | fp8      | 2.50     | 2.17       | +15.21%              | 5123.55        | 4440.94          | +15.37%              |
| 1k/1k   | 128         | fp8      | 4.06     | 3.47       | +17.00%              | 8320.14        | 7103.16          | +17.13%              |
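
The "PR over Main" columns appear to be simple relative improvements, PR / Main − 1, expressed as percentages. A quick sketch to reproduce them from the raw numbers (the helper name is ours, not from the PR):

```python
def speedup_pct(pr: float, main: float) -> float:
    """Relative improvement of PR over Main, in percent."""
    return (pr / main - 1.0) * 100.0

rows = [
    # (concurrency, PR req/s, Main req/s, PR tok/s, Main tok/s)
    (64, 2.50, 2.17, 5123.55, 4440.94),
    (128, 4.06, 3.47, 8320.14, 7103.16),
]

for conc, pr_rps, main_rps, pr_tps, main_tps in rows:
    print(f"conc={conc}: req/s +{speedup_pct(pr_rps, main_rps):.2f}%, "
          f"tok/s +{speedup_pct(pr_tps, main_tps):.2f}%")
```

Running this matches the table's +15.21% / +15.37% and +17.00% / +17.13% figures.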

Submission Checklist

kliuae added 2 commits May 12, 2026 07:18
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
@zejunchen-zejun
Collaborator

Thank you @kliuae

Hi, @XiaobingSuper
Could you help review this PR?

@kliuae kliuae marked this pull request as ready for review May 13, 2026 10:15
@XiaobingSuper XiaobingSuper self-requested a review May 14, 2026 01:41
XiaobingSuper previously approved these changes May 14, 2026
The sparse MLA plugin now defaults to block_size=64 for the preshuffled
indexer cache. The hardcoded --block-size 1 in CI configs would override
this default and prevent the performance gains from taking effect.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
@zejunchen-zejun zejunchen-zejun dismissed stale reviews from XiaobingSuper and themself via 99ae2a0 May 14, 2026 05:49