[Perf][vLLM-ATOM] Optimize Sparse MLA in vLLM-ATOM #765

Open
kliuae wants to merge 5 commits into ROCm:main from kliuae:kliuae/plugin_opt_dsa

Conversation

@kliuae
Contributor

@kliuae kliuae commented May 12, 2026

Motivation

This PR optimizes the sparse MLA attention path in vLLM-ATOM.

Technical Details

  • Enable the preshuffled indexer cache and add support for block_size 64
  • Use persistent MLA and add the corresponding workspace buffers
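
The second bullet refers to a common pattern: allocate scratch ("workspace") buffers once at startup and reuse them across decode steps, instead of paying an allocation on every call. The sketch below is purely illustrative — the class, names, and sizes are hypothetical and not the actual vLLM-ATOM code, which would allocate on-device (e.g. a HIP tensor) rather than in host memory.

```python
# Hypothetical sketch of a persistent workspace buffer, assuming fp16
# (2 bytes/element) scratch sized for a worst-case token count.
class PersistentMLAWorkspace:
    def __init__(self, max_tokens: int, head_dim: int):
        # Allocated once; a real implementation would use device memory.
        self.buf = bytearray(max_tokens * head_dim * 2)

    def view(self, num_tokens: int, head_dim: int) -> memoryview:
        # Hand out a view over the prefix needed this step; no new
        # allocation or free happens per decode step.
        need = num_tokens * head_dim * 2
        assert need <= len(self.buf), "workspace too small"
        return memoryview(self.buf)[:need]

ws = PersistentMLAWorkspace(max_tokens=8192, head_dim=576)
step1 = ws.view(128, 576)  # same underlying allocation...
step2 = ws.view(256, 576)  # ...reused on the next step
```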

Test Plan

Accuracy test with lm_eval on MI355X

Model: deepseek-ai/DeepSeek-V3.2

Server command:

ATOM_DISABLE_VLLM_PLUGIN=0 \
ATOM_DISABLE_VLLM_PLUGIN_ATTENTION=0 \
VLLM_ROCM_USE_AITER=1 \
vllm serve deepseek-ai/DeepSeek-V3.2 \
  -tp 8 \
  --gpu-memory-utilization 0.8 \
  --no-enable-prefix-caching \
  --disable-uvicorn-access-log \
  --trust-remote-code \
  --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
  --kv-cache-dtype fp8

lm_eval command:

lm_eval --model local-completions \
  --model_args model=deepseek-ai/DeepSeek-V3.2,base_url=http://localhost:8000/v1/completions,num_concurrent=64,tokenized_requests=False \
  --tasks gsm8k \
  --num_fewshot 20

Benchmark command:

vllm bench serve --model deepseek-ai/DeepSeek-V3.2 \
  --port 8000 \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --max-concurrency $CONC \
  --num-prompts $(( CONC * 10 )) \
  --ignore-eos \
  --temperature 0

Test Result

lm_eval

| Tasks | Version | Filter           | n-shot | Metric      | Value  | Stderr   |
|-------|---------|------------------|--------|-------------|--------|----------|
| gsm8k | 3       | flexible-extract | 20     | exact_match | 0.9500 | ± 0.0060 |
|       |         | strict-match     | 20     | exact_match | 0.9484 | ± 0.0061 |

Performance on MI355X, TP8

| ISL/OSL | Concurrency | KV Cache | PR req/s | Main req/s | PR over Main (req/s) | PR Total tok/s | Main Total tok/s | PR over Main (tok/s) |
|---------|-------------|----------|----------|------------|----------------------|----------------|------------------|----------------------|
| 1k/1k   | 64          | fp8      | 2.50     | 2.17       | +15.21%              | 5123.55        | 4440.94          | +15.37%              |
| 1k/1k   | 128         | fp8      | 4.06     | 3.47       | +17.00%              | 8320.14        | 7103.16          | +17.13%              |
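
The "PR over Main" columns appear to be simple relative improvements, PR / Main − 1, expressed as percentages. A quick sketch to reproduce them from the raw numbers (the helper name is ours, not from the PR):

```python
def speedup_pct(pr: float, main: float) -> float:
    """Relative improvement of PR over Main, in percent."""
    return (pr / main - 1.0) * 100.0

rows = [
    # (concurrency, PR req/s, Main req/s, PR tok/s, Main tok/s)
    (64, 2.50, 2.17, 5123.55, 4440.94),
    (128, 4.06, 3.47, 8320.14, 7103.16),
]

for conc, pr_rps, main_rps, pr_tps, main_tps in rows:
    print(f"conc={conc}: req/s +{speedup_pct(pr_rps, main_rps):.2f}%, "
          f"tok/s +{speedup_pct(pr_tps, main_tps):.2f}%")
```

Running this matches the table's +15.21% / +15.37% and +17.00% / +17.13% figures.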

Submission Checklist

kliuae added 2 commits May 12, 2026 07:18
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
@zejunchen-zejun
Collaborator

Thank you @kliuae

Hi, @XiaobingSuper
Could you help review this PR?

@kliuae kliuae marked this pull request as ready for review May 13, 2026 10:15
@XiaobingSuper XiaobingSuper self-requested a review May 14, 2026 01:41
XiaobingSuper previously approved these changes May 14, 2026
The sparse MLA plugin now defaults to block_size=64 for the preshuffled
indexer cache. The hardcoded --block-size 1 in CI configs would override
this default and prevent the performance gains from taking effect.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
@zejunchen-zejun zejunchen-zejun dismissed stale reviews from XiaobingSuper and themself via 99ae2a0 May 14, 2026 05:49