
[ROCm][aiter] Add DSv3.2 BF16 GEMM tuned configs for gfx950 (K=7168) #3109

Open
sunway513 wants to merge 4 commits into ROCm:main from sunway513:feat/dsv32-bf16-tuned-configs

Conversation

@sunway513
Collaborator

Problem

The aiter/configs/bf16_tuned_gemm.csv shipped in v0.1.13-rc5 is empty (header only, 1 line). Every BF16 GEMM lookup falls through to the torch default solver. For DSv3.2's sparse_attn_indexer, this is on the hot path — server logs are flooded with:

[aiter] shape is M:..., N:192/256, K:7168 ... not found tuned config in
        /tmp/aiter_configs/bf16_tuned_gemm.csv, will use default config!
        using torch solution:0

Concrete impact measured on mi355-gpu-15 (MI355X 8-GPU host) with vllm-rocm:v1-rc5 (mainline 4140faa4a + 8 ROCm PRs + aiter rc5):

Config                              aiter 0.1.10 (V1-old)   aiter 0.1.13rc5 (V1-rc5)   Δ
DSv3.2 FP8 TP=8 1024/1024 conc=64   1164 out_tps            786 out_tps                -32.5% 🔴

aiter rc5 wins single-stream (kernel autotune) but loses at high concurrency because the tuned-config cache is empty.

What this PR adds

Two CSV files following the existing aiter/configs/model_configs/ pattern (same as a8w8_blockscale_tuned_gemm_ds_v3.csv etc.):

  • aiter/configs/model_configs/dsv32_bf16_untuned_gemm.csv — 30 seed shapes (15 M × 2 N values @ K=7168)
  • aiter/configs/model_configs/dsv32_bf16_tuned_gemm.csv — gradlib gemm_tuner.py results, all-backends sweep on gfx950 / 256 CU

Workload coverage justification

Real perf runs hit 2928 unique M:N:K shapes (M ∈ [232, 16321], N ∈ {192, 256}, K=7168). The 30 representative M values (roughly geometric spacing from 64 to 16384) cover that range; aiter.tuned_gemm does nearest-key M lookup, so every actual M resolves to a tuned solution.
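
For intuition, here is a minimal Python sketch of a roughly geometric seed list and a nearest-key M lookup. It is illustrative only: the exact 15 seed M values and the lookup helper are assumptions for this sketch, not the actual CSV contents or aiter.tuned_gemm's implementation.

# Illustrative only: seed M values assumed (roughly geometric, 64 → 16384);
# not aiter.tuned_gemm's real lookup code.
import bisect

SEED_MS = [64, 128, 256, 384, 512, 768, 1024, 1536, 2048,
           3072, 4096, 6144, 8192, 12288, 16384]   # 15 assumed values
SEED_NS = [192, 256]
K = 7168                                           # DSv3.2 hidden dim

def nearest_seed_m(m: int) -> int:
    """Map an arbitrary runtime M to the closest tuned seed M."""
    i = bisect.bisect_left(SEED_MS, m)
    if i == 0:
        return SEED_MS[0]
    if i == len(SEED_MS):
        return SEED_MS[-1]
    lo, hi = SEED_MS[i - 1], SEED_MS[i]
    return lo if m - lo <= hi - m else hi

# Every runtime M in [232, 16321] snaps to one of the 30 (M, N) seed keys.
print(nearest_seed_m(232), nearest_seed_m(16321))   # e.g. 256 16384 with these seeds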

Tuner backend auto-selection (data-driven)

M range          Backend                                         Sample TFLOPS
[64, 768)        flydsl (split_k=14, t32x64x256)                 27 → 108
[768, 1536)      asm (bf16gemm_fp32bf16_tn_96x64_splitk_clean)   143 → 225
M=512, N=256     triton                                          140
[1536, 16384]    hipblaslt (Cijk_Alik_Bljk MT* kernels)          307 → 831

Peak: 831.4 TFLOPS @ M=16384, N=256, K=7168 (hipblaslt, gfx950, 256 CU).
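
The table can be read as a simple M-range dispatch; the sketch below is illustrative only. The real selection is made per shape by the gemm_tuner benchmark sweep, not by a hard-coded table, and the pick_backend helper name is hypothetical.

# Hypothetical helper mirroring the tuned table above; not aiter's runtime logic.
def pick_backend(m: int, n: int) -> str:
    if m == 512 and n == 256:          # single shape where triton won the sweep
        return "triton"
    if 64 <= m < 768:
        return "flydsl"                # split_k=14, t32x64x256 tile
    if 768 <= m < 1536:
        return "asm"                   # bf16gemm_fp32bf16_tn_96x64_splitk_clean
    return "hipblaslt"                 # Cijk_Alik_Bljk MT* kernels, peaks at 831 TFLOPS

assert pick_backend(16384, 256) == "hipblaslt"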

How to reproduce / regenerate for another gfx target

cd aiter
python3 gradlib/gradlib/gemm_tuner.py \
  --input_file aiter/configs/model_configs/dsv32_bf16_untuned_gemm.csv \
  --tuned_file aiter/configs/model_configs/dsv32_bf16_tuned_gemm.csv \
  --indtype bf16 --outdtype bf16 --libtype all
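
Note on --libtype all: per the description above, this makes the tuner sweep every available backend (hipblaslt, flydsl, asm, triton) for each seed shape and keep the fastest result, which is where the per-M-range backend split in the table above comes from.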

Tuning wall time ≈ 90 min on 8x MI355X (parallel via mp_tuner).
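
To sanity-check what a regenerated CSV covers, a small script like the sketch below can list the distinct (M, N) keys. The column positions (M, N, K at indices 2-4) are inferred from the sample rows quoted later in this thread and should be verified against the CSV header before relying on them.

# Coverage-check sketch; column indices are an assumption, verify against the header row.
import csv
from collections import defaultdict

shapes = defaultdict(set)
with open("aiter/configs/model_configs/dsv32_bf16_tuned_gemm.csv") as f:
    reader = csv.reader(f)
    next(reader)                      # skip header
    for row in reader:
        m, n, k = int(row[2]), int(row[3]), int(row[4])
        shapes[(n, k)].add(m)

for (n, k), ms in sorted(shapes.items()):
    print(f"N={n} K={k}: {len(ms)} tuned M values, range {min(ms)}..{max(ms)}")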

Validation plan (post-merge)

Run the V1-rc5 image (or rc6, if that's where this lands) through the DSv3.2 FP8 TP=8 1024/1024 conc-{1,4,16,64} sweep with VLLM_ROCM_USE_AITER=1. Expect conc=64 out_tps to recover from the current 786 to the ≥1100 tps band (parity with aiter 0.1.10).

Caveats

  • Tuned for gfx950, 256 CU only. Other gfx targets need regeneration (script above).
  • N values fixed to {192, 256}; if DSv3.2 evolves to use different N for sparse-attn indexer, re-seed.
  • Same shape set should also benefit Kimi-K2.5 and other DSv3-family models that share K=7168 hidden dim.

Related

The bf16_tuned_gemm.csv shipped in v0.1.13-rc5 is empty (header only),
forcing every BF16 GEMM lookup in DSv3.2's sparse_attn_indexer onto the
torch fallback solver. Real-host perf runs on MI355X 8-GPU TP=8 hit
2928 unique M:N:K shapes (M ∈ [232, 16321], N ∈ {192, 256}, K=7168 =
DSv3.2 hidden dim) — all missing the cache.

This PR adds 30 representative seed shapes (15 M × 2 N) and the matching
tuned results from gradlib gemm_tuner sweep (all backends: flydsl, asm,
hipblaslt, triton).

Files (follow the existing model_configs/ pattern):
- aiter/configs/model_configs/dsv32_bf16_untuned_gemm.csv  (30 seed rows)
- aiter/configs/model_configs/dsv32_bf16_tuned_gemm.csv    (30 tuned rows)

Backend auto-selection by tuner:
  M ∈ [   64,  768): flydsl     (small batch, 27-108 TFLOPS)
  M ∈ [  768, 1536): asm        (mid batch,  143-225 TFLOPS)
  M ∈ [ 1536,16384]: hipblaslt  (large batch, peak 831 TFLOPS @ 16384/256)

Validation: vllm-rocm:v1-rc5 image (mainline 4140faa4a + 8 ROCm PRs)
with these tuned configs should restore conc=64 throughput regression
(currently -32% vs aiter 0.1.10) on DSv3.2 FP8 TP=8 1024/1024 sweep.

Tuned for gfx950, 256 CU. For other gfx targets, regenerate via
gradlib/gradlib/gemm_tuner.py --input_file ... --tuned_file ...
@sunway513 sunway513 requested review from a team and Copilot May 10, 2026 15:16
@github-actions
Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

Label             Tests
ci:triton-300x    Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X
ci:sglang         SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy
ci:atom           ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B
ci:atom_full      ATOM accuracy suite for PR and main models from ATOM models_accuracy.json
ci:vllm           vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5
ci:all            All standard extended tests (excludes ci:atom_full)

Only add ci:atom_full for FlyDSL or Triton upgrades.
Add labels via the sidebar or gh pr edit 3109 --add-label <label>

Copilot AI (Contributor) left a comment

Pull request overview

Adds DeepSeek v3.2-specific BF16 GEMM seed shapes and tuned kernel selections (gfx950, K=7168) under aiter/configs/model_configs/, intended to prevent BF16 GEMM lookups from falling back to default solvers for DSv3.2 hot-path shapes.

Changes:

  • Add an “untuned” seed-shape CSV for DSv3.2 BF16 GEMM at K=7168 and N ∈ {192, 256}.
  • Add a corresponding “tuned” CSV with backend/libtype selections and performance metadata for gfx950 / 256 CU.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File                                                       Description
aiter/configs/model_configs/dsv32_bf16_untuned_gemm.csv    Adds DSv3.2 seed M/N/K shape list for BF16 GEMM tuning/regeneration.
aiter/configs/model_configs/dsv32_bf16_tuned_gemm.csv      Adds tuned solutions for those shapes on gfx950/256CU (flydsl/asm/triton/hipblaslt mix).


Comment on lines +17 to +19
gfx950,256,64,256,7168,False,torch.bfloat16,torch.bfloat16,False,False,flydsl,1119,14,6.5154,flydsl_gemm2_abf16_wbf16_bf16_t32x64x256_split_k14_block_m_warp2_block_n_warp2_async_copyTrue_b_to_ldsTrue_b_preshuffleFalse_c_to_ldsFalse_gfx950,0.0281,36.05,709.13
gfx950,256,128,256,7168,False,torch.bfloat16,torch.bfloat16,False,False,flydsl,1825,14,8.7679,flydsl_gemm2_abf16_wbf16_bf16_t32x64x256_split_k14_block_m_warp2_block_n_warp2_async_copyTrue_b_to_ldsTrue_b_preshuffleFalse_c_to_ldsFalse_gfx950,0.0297,53.58,635.34
gfx950,256,256,256,7168,False,torch.bfloat16,torch.bfloat16,False,False,flydsl,1852,7,10.8758,flydsl_gemm2_abf16_wbf16_bf16_t32x64x256_split_k7_block_m_warp1_block_n_warp4_async_copyTrue_b_to_ldsFalse_b_preshuffleFalse_c_to_ldsFalse_gfx950,0.0206,86.39,686.95
sunway513 and others added 3 commits May 10, 2026 16:08
The FlyDSL AOT build asserts on duplicate shapes during config merge.
Three of our seed shapes (M ∈ {64,128,256}, N=256, K=7168) duplicate
existing entries in aiter/configs/model_configs/dsv3_bf16_tuned_gemm.csv.
At runtime the glob lookup in get_config_file() picks those shapes
from dsv3_bf16_tuned_gemm.csv automatically, so DSv3.2 coverage is
preserved. Net: 27 unique shapes for dsv32 (15 M × 2 N minus 3 dupes
already in dsv3). N=192 set is fully unique and retained.
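
For context, here is a minimal sketch of the kind of duplicate-shape check that would trip a config merge when two model_configs CSVs list the same (M, N, K) key. It is illustrative only: the actual FlyDSL AOT build script, its glob pattern, and the column layout are assumptions here.

# Illustrative duplicate-shape check across model_configs CSVs; not the real AOT build code.
import csv
import glob

seen = {}
for path in glob.glob("aiter/configs/model_configs/*_bf16_tuned_gemm.csv"):
    with open(path) as f:
        reader = csv.reader(f)
        next(reader)                                        # skip header
        for row in reader:
            key = (int(row[2]), int(row[3]), int(row[4]))   # assumed (M, N, K) columns
            assert key not in seen, f"duplicate shape {key} in {path} and {seen[key]}"
            seen[key] = path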
