[ROCm][aiter] Add DSv3.2 BF16 GEMM tuned configs for gfx950 (K=7168)#3109
Open
Conversation
The bf16_tuned_gemm.csv shipped in v0.1.13-rc5 is empty (header only),
forcing every BF16 GEMM lookup in DSv3.2's sparse_attn_indexer onto the
torch fallback solver. Real-host perf runs on MI355X 8-GPU TP=8 hit
2928 unique M:N:K shapes (M ∈ [232, 16321], N ∈ {192, 256}, K=7168 =
DSv3.2 hidden dim) — all missing the cache.
This PR adds 30 representative seed shapes (15 M × 2 N) and the matching
tuned results from a gradlib gemm_tuner sweep across all backends (flydsl,
asm, hipblaslt, triton).
Files (follow the existing model_configs/ pattern):
- aiter/configs/model_configs/dsv32_bf16_untuned_gemm.csv (30 seed rows)
- aiter/configs/model_configs/dsv32_bf16_tuned_gemm.csv (30 tuned rows)
Backend auto-selection by tuner:
M ∈ [ 64, 768): flydsl (small batch, 27-108 TFLOPS)
M ∈ [ 768, 1536): asm (mid batch, 143-225 TFLOPS)
M ∈ [ 1536,16384]: hipblaslt (large batch, peak 831 TFLOPS @ 16384/256)
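For orientation, a minimal sketch of the M-range dispatch above — `pick_backend` is a hypothetical helper written from the numbers in this list, not aiter API:

```python
# Hypothetical sketch of the backend split the tuner converged on for
# N in {192, 256}, K = 7168 on gfx950 -- derived from the ranges above.
def pick_backend(m: int) -> str:
    if 64 <= m < 768:
        return "flydsl"      # small batch, 27-108 TFLOPS
    if 768 <= m < 1536:
        return "asm"         # mid batch, 143-225 TFLOPS
    if 1536 <= m <= 16384:
        return "hipblaslt"   # large batch, peak 831 TFLOPS @ 16384/256
    raise ValueError(f"M={m} outside tuned range")
```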
Validation: the vllm-rocm:v1-rc5 image (mainline 4140faa4a + 8 ROCm PRs)
with these tuned configs should recover the conc=64 throughput regression
(currently -32% vs aiter 0.1.10) on the DSv3.2 FP8 TP=8 1024/1024 sweep.
Tuned for gfx950, 256 CU. For other gfx targets, regenerate via
`gradlib/gradlib/gemm_tuner.py --input_file ... --tuned_file ...`.
Pull request overview
Adds DeepSeek v3.2-specific BF16 GEMM seed shapes and tuned kernel selections (gfx950, K=7168) under aiter/configs/model_configs/, intended to prevent BF16 GEMM lookups from falling back to default solvers for DSv3.2 hot-path shapes.
Changes:
- Add an “untuned” seed-shape CSV for DSv3.2 BF16 GEMM at K=7168 and N ∈ {192, 256}.
- Add a corresponding “tuned” CSV with backend/libtype selections and performance metadata for gfx950 / 256 CU.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| aiter/configs/model_configs/dsv32_bf16_untuned_gemm.csv | Adds DSv3.2 seed M/N/K shape list for BF16 GEMM tuning/regeneration. |
| aiter/configs/model_configs/dsv32_bf16_tuned_gemm.csv | Adds tuned solutions for those shapes on gfx950/256CU (flydsl/asm/triton/hipblaslt mix). |
Comment on lines +17 to +19
gfx950,256,64,256,7168,False,torch.bfloat16,torch.bfloat16,False,False,flydsl,1119,14,6.5154,flydsl_gemm2_abf16_wbf16_bf16_t32x64x256_split_k14_block_m_warp2_block_n_warp2_async_copyTrue_b_to_ldsTrue_b_preshuffleFalse_c_to_ldsFalse_gfx950,0.0281,36.05,709.13
gfx950,256,128,256,7168,False,torch.bfloat16,torch.bfloat16,False,False,flydsl,1825,14,8.7679,flydsl_gemm2_abf16_wbf16_bf16_t32x64x256_split_k14_block_m_warp2_block_n_warp2_async_copyTrue_b_to_ldsTrue_b_preshuffleFalse_c_to_ldsFalse_gfx950,0.0297,53.58,635.34
gfx950,256,256,256,7168,False,torch.bfloat16,torch.bfloat16,False,False,flydsl,1852,7,10.8758,flydsl_gemm2_abf16_wbf16_bf16_t32x64x256_split_k7_block_m_warp1_block_n_warp4_async_copyTrue_b_to_ldsFalse_b_preshuffleFalse_c_to_ldsFalse_gfx950,0.0206,86.39,686.95
The FlyDSL AOT build asserts on duplicate shapes during config merge.
Three of our seed shapes (M ∈ {64,128,256}, N=256, K=7168) duplicate
existing entries in aiter/configs/model_configs/dsv3_bf16_tuned_gemm.csv.
At runtime the glob lookup in get_config_file() picks those shapes
from dsv3_bf16_tuned_gemm.csv automatically, so DSv3.2 coverage is
preserved. Net: 27 unique shapes for dsv32 (15 M × 2 N minus 3 dupes
already in dsv3). The N=192 set is fully unique and retained.
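For illustration, a hedged sketch of the merge/lookup semantics described here, assuming the positional columns visible in the rows above (0=arch, 1=CU count, 2=M, 3=N, 4=K, 10=libtype, 14=kernel name); this is not aiter's actual `get_config_file()` implementation:

```python
# Sketch only: merge all bf16 tuned-GEMM CSVs under model_configs/,
# keyed by (arch, M, N, K). Column positions are read off the rows shown
# in this diff; the real aiter lookup code may differ.
import csv
import glob

def load_tuned(pattern="aiter/configs/model_configs/*bf16_tuned_gemm.csv"):
    table = {}
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="") as f:
            next(f)  # skip the header row
            for row in csv.reader(f):
                key = (row[0], int(row[2]), int(row[3]), int(row[4]))  # arch, M, N, K
                # First occurrence wins: a duplicated shape resolves to
                # whichever file the glob yields first, so each shape maps
                # to exactly one tuned solution.
                table.setdefault(key, {"libtype": row[10], "kernel": row[14]})
    return table
```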
Problem
The `aiter/configs/bf16_tuned_gemm.csv` shipped in v0.1.13-rc5 is empty (header only, 1 line). Every BF16 GEMM lookup falls through to the torch default solver. For DSv3.2's `sparse_attn_indexer` this is on the hot path, and server logs are flooded with fallback messages.

Concrete impact measured on `mi355-gpu-15` (MI355X 8-GPU host) with `vllm-rocm:v1-rc5` (mainline `4140faa4a` + 8 ROCm PRs + aiter rc5): aiter rc5 wins single-stream (kernel autotune) but loses at high concurrency because the tuned cache is empty.
What this PR adds
Two CSV files following the existing `aiter/configs/model_configs/` pattern (same as `a8w8_blockscale_tuned_gemm_ds_v3.csv` etc.):
- `aiter/configs/model_configs/dsv32_bf16_untuned_gemm.csv` — 30 seed shapes (15 M × 2 N values @ K=7168)
- `aiter/configs/model_configs/dsv32_bf16_tuned_gemm.csv` — gradlib `gemm_tuner.py` results, all-backends sweep on gfx950 / 256 CU
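For orientation, the seed list in the untuned CSV can be reconstructed along these lines; the exact M grid below is a guess (the committed file is authoritative):

```python
# Hypothetical reconstruction of the 30 seed shapes: 15 M values
# (roughly geometric, 64 -> 16384) x 2 N values at fixed K = 7168.
MS = [64, 128, 256, 384, 512, 768, 1024, 1536,
      2048, 3072, 4096, 6144, 8192, 12288, 16384]
SHAPES = [(m, n, 7168) for m in MS for n in (192, 256)]
assert len(SHAPES) == 30  # 15 x 2
```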
Workload coverage justification

Real perf runs hit 2928 unique M:N:K shapes (M ∈ [232, 16321], N ∈ {192, 256}, K=7168). The 30 representative M values (roughly geometric, 64 → 16384) cover that range; `aiter.tuned_gemm` does a nearest-key M lookup, so every actual M finds a tuned solution.
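As an illustration of that lookup (not the library's actual code), nearest-key snapping is just a bisect over the sorted tuned M values:

```python
# Illustrative nearest-key M lookup, as described for aiter.tuned_gemm;
# not the library's actual implementation.
import bisect

def nearest_m(tuned_ms, m):
    """Snap a runtime M to the closest tuned M. tuned_ms must be sorted."""
    i = bisect.bisect_left(tuned_ms, m)
    if i == 0:
        return tuned_ms[0]
    if i == len(tuned_ms):
        return tuned_ms[-1]
    lo, hi = tuned_ms[i - 1], tuned_ms[i]
    return lo if m - lo <= hi - m else hi

# e.g. with the seed grid above, a runtime M of 232 snaps to the 256 entry
```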
Tuner backend auto-selection (data-driven):
- `flydsl` (split_k=14, t32x64x256)
- `asm` (bf16gemm_fp32bf16_tn_96x64_splitk_clean)
- `triton`
- `hipblaslt` (Cijk_Alik_Bljk MT* kernels)

Peak: 831.4 TFLOPS @ M=16384, N=256, K=7168 (hipblaslt, gfx950, 256 CU).
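As a quick sanity check on that peak figure (a GEMM does 2·M·N·K FLOPs):

```python
# Arithmetic check of the peak number using only the figures quoted above.
m, n, k = 16384, 256, 7168
flops = 2 * m * n * k                  # ~60.1 GFLOP per call
us = flops / 831.4e12 * 1e6            # ~72 us per call at 831.4 TFLOPS
print(f"{flops / 1e9:.1f} GFLOP -> {us:.1f} us at peak")
```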
How to reproduce / regenerate for another gfx target
```
cd aiter
python3 gradlib/gradlib/gemm_tuner.py \
  --input_file aiter/configs/model_configs/dsv32_bf16_untuned_gemm.csv \
  --tuned_file aiter/configs/model_configs/dsv32_bf16_tuned_gemm.csv \
  --indtype bf16 --outdtype bf16 --libtype all
```

Tuning wall time ≈ 90 min on 8x MI355X (parallel via `mp_tuner`).
Validation plan (post-merge)
Run the V1-rc5 image (or rc6, if that's where this lands) DSv3.2 FP8 TP=8 1024/1024 conc-{1,4,16,64} sweep with `VLLM_ROCM_USE_AITER=1`. Expect conc=64 out_tps to recover from the current 786 to the ≥1100 tps band (parity with aiter 0.1.10).

Caveats
- Tuned for `gfx950, 256 CU` only; other gfx targets need regeneration (script above).
- Covers the `K=7168` hidden dim only.

Related