[ROCm][aiter] Add DSv3.2 BF16 GEMM tuned configs for gfx950 (K=7168)#3109
Open
Conversation
The bf16_tuned_gemm.csv shipped in v0.1.13-rc5 is empty (header only),
forcing every BF16 GEMM lookup in DSv3.2's sparse_attn_indexer onto the
torch fallback solver. Real-host perf runs on MI355X 8-GPU TP=8 hit
2928 unique M:N:K shapes (M ∈ [232, 16321], N ∈ {192, 256}, K=7168 =
DSv3.2 hidden dim) — all missing the cache.
This PR adds 30 representative seed shapes (15 M × 2 N) and the matching
tuned results from a gradlib gemm_tuner sweep across all backends (flydsl,
asm, hipblaslt, triton).
Files (follow the existing model_configs/ pattern):
- aiter/configs/model_configs/dsv32_bf16_untuned_gemm.csv (30 seed rows)
- aiter/configs/model_configs/dsv32_bf16_tuned_gemm.csv (30 tuned rows)
Backend auto-selection by tuner:
M ∈ [ 64, 768): flydsl (small batch, 27-108 TFLOPS)
M ∈ [ 768, 1536): asm (mid batch, 143-225 TFLOPS)
M ∈ [ 1536,16384]: hipblaslt (large batch, peak 831 TFLOPS @ 16384/256)
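For orientation, a minimal sketch of the M-range dispatch above — `pick_backend` is a hypothetical helper written from the numbers in this list, not aiter API:

```python
# Hypothetical sketch of the backend split the tuner converged on for
# N in {192, 256}, K = 7168 on gfx950 -- derived from the ranges above.
def pick_backend(m: int) -> str:
    if 64 <= m < 768:
        return "flydsl"      # small batch, 27-108 TFLOPS
    if 768 <= m < 1536:
        return "asm"         # mid batch, 143-225 TFLOPS
    if 1536 <= m <= 16384:
        return "hipblaslt"   # large batch, peak 831 TFLOPS @ 16384/256
    raise ValueError(f"M={m} outside tuned range")
```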
Validation: the vllm-rocm:v1-rc5 image (mainline 4140faa4a + 8 ROCm PRs)
with these tuned configs should recover the conc=64 throughput regression
(currently -32% vs aiter 0.1.10) on the DSv3.2 FP8 TP=8 1024/1024 sweep.
Tuned for gfx950, 256 CU. For other gfx targets, regenerate via
`gradlib/gradlib/gemm_tuner.py --input_file ... --tuned_file ...`.
Pull request overview
Adds DeepSeek v3.2-specific BF16 GEMM seed shapes and tuned kernel selections (gfx950, K=7168) under aiter/configs/model_configs/, intended to prevent BF16 GEMM lookups from falling back to default solvers for DSv3.2 hot-path shapes.
Changes:
- Add an “untuned” seed-shape CSV for DSv3.2 BF16 GEMM at K=7168 and N ∈ {192, 256}.
- Add a corresponding “tuned” CSV with backend/libtype selections and performance metadata for gfx950 / 256 CU.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| aiter/configs/model_configs/dsv32_bf16_untuned_gemm.csv | Adds DSv3.2 seed M/N/K shape list for BF16 GEMM tuning/regeneration. |
| aiter/configs/model_configs/dsv32_bf16_tuned_gemm.csv | Adds tuned solutions for those shapes on gfx950/256CU (flydsl/asm/triton/hipblaslt mix). |
Comment on lines +17 to +19
gfx950,256,64,256,7168,False,torch.bfloat16,torch.bfloat16,False,False,flydsl,1119,14,6.5154,flydsl_gemm2_abf16_wbf16_bf16_t32x64x256_split_k14_block_m_warp2_block_n_warp2_async_copyTrue_b_to_ldsTrue_b_preshuffleFalse_c_to_ldsFalse_gfx950,0.0281,36.05,709.13
gfx950,256,128,256,7168,False,torch.bfloat16,torch.bfloat16,False,False,flydsl,1825,14,8.7679,flydsl_gemm2_abf16_wbf16_bf16_t32x64x256_split_k14_block_m_warp2_block_n_warp2_async_copyTrue_b_to_ldsTrue_b_preshuffleFalse_c_to_ldsFalse_gfx950,0.0297,53.58,635.34
gfx950,256,256,256,7168,False,torch.bfloat16,torch.bfloat16,False,False,flydsl,1852,7,10.8758,flydsl_gemm2_abf16_wbf16_bf16_t32x64x256_split_k7_block_m_warp1_block_n_warp4_async_copyTrue_b_to_ldsFalse_b_preshuffleFalse_c_to_ldsFalse_gfx950,0.0206,86.39,686.95
The FlyDSL AOT build asserts on duplicate shapes during config merge.
Three of our seed shapes (M ∈ {64,128,256}, N=256, K=7168) duplicate
existing entries in aiter/configs/model_configs/dsv3_bf16_tuned_gemm.csv.
At runtime the glob lookup in get_config_file() picks those shapes
from dsv3_bf16_tuned_gemm.csv automatically, so DSv3.2 coverage is
preserved. Net: 27 unique shapes for dsv32 (15 M × 2 N minus 3 dupes
already in dsv3). The N=192 set is fully unique and retained.
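For illustration, a hedged sketch of the merge/lookup semantics described here, assuming the positional columns visible in the rows above (0=arch, 1=CU count, 2=M, 3=N, 4=K, 10=libtype, 14=kernel name); this is not aiter's actual `get_config_file()` implementation:

```python
# Sketch only: merge all bf16 tuned-GEMM CSVs under model_configs/,
# keyed by (arch, M, N, K). Column positions are read off the rows shown
# in this diff; the real aiter lookup code may differ.
import csv
import glob

def load_tuned(pattern="aiter/configs/model_configs/*bf16_tuned_gemm.csv"):
    table = {}
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="") as f:
            next(f)  # skip the header row
            for row in csv.reader(f):
                key = (row[0], int(row[2]), int(row[3]), int(row[4]))  # arch, M, N, K
                # First occurrence wins: a duplicated shape resolves to
                # whichever file the glob yields first, so each shape maps
                # to exactly one tuned solution.
                table.setdefault(key, {"libtype": row[10], "kernel": row[14]})
    return table
```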
Problem
The `aiter/configs/bf16_tuned_gemm.csv` shipped in v0.1.13-rc5 is empty (header only, 1 line). Every BF16 GEMM lookup falls through to the torch default solver. For DSv3.2's `sparse_attn_indexer` this is on the hot path, and server logs are flooded with fallback messages.

Concrete impact measured on `mi355-gpu-15` (MI355X 8-GPU host) with `vllm-rocm:v1-rc5` (mainline `4140faa4a` + 8 ROCm PRs + aiter rc5): aiter rc5 wins single-stream (kernel autotune) but loses at high concurrency because the tuned cache is empty.
What this PR adds
Two CSV files following the existing `aiter/configs/model_configs/` pattern (same as `a8w8_blockscale_tuned_gemm_ds_v3.csv` etc.):
- `aiter/configs/model_configs/dsv32_bf16_untuned_gemm.csv` — 30 seed shapes (15 M × 2 N values @ K=7168)
- `aiter/configs/model_configs/dsv32_bf16_tuned_gemm.csv` — gradlib `gemm_tuner.py` results, all-backends sweep on gfx950 / 256 CU
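For orientation, the seed list in the untuned CSV can be reconstructed along these lines; the exact M grid below is a guess (the committed file is authoritative):

```python
# Hypothetical reconstruction of the 30 seed shapes: 15 M values
# (roughly geometric, 64 -> 16384) x 2 N values at fixed K = 7168.
MS = [64, 128, 256, 384, 512, 768, 1024, 1536,
      2048, 3072, 4096, 6144, 8192, 12288, 16384]
SHAPES = [(m, n, 7168) for m in MS for n in (192, 256)]
assert len(SHAPES) == 30  # 15 x 2
```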
Workload coverage justification

Real perf runs hit 2928 unique M:N:K shapes (M ∈ [232, 16321], N ∈ {192, 256}, K=7168). The 30 representative M values (roughly geometric, 64 → 16384) cover that range; `aiter.tuned_gemm` does a nearest-key M lookup, so every actual M finds a tuned solution.
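As an illustration of that lookup (not the library's actual code), nearest-key snapping is just a bisect over the sorted tuned M values:

```python
# Illustrative nearest-key M lookup, as described for aiter.tuned_gemm;
# not the library's actual implementation.
import bisect

def nearest_m(tuned_ms, m):
    """Snap a runtime M to the closest tuned M. tuned_ms must be sorted."""
    i = bisect.bisect_left(tuned_ms, m)
    if i == 0:
        return tuned_ms[0]
    if i == len(tuned_ms):
        return tuned_ms[-1]
    lo, hi = tuned_ms[i - 1], tuned_ms[i]
    return lo if m - lo <= hi - m else hi

# e.g. with the seed grid above, a runtime M of 232 snaps to the 256 entry
```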
Tuner backend auto-selection (data-driven):
- `flydsl` (split_k=14, t32x64x256)
- `asm` (bf16gemm_fp32bf16_tn_96x64_splitk_clean)
- `triton`
- `hipblaslt` (Cijk_Alik_Bljk MT* kernels)

Peak: 831.4 TFLOPS @ M=16384, N=256, K=7168 (hipblaslt, gfx950, 256 CU).
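As a quick sanity check on that peak figure (a GEMM does 2·M·N·K FLOPs):

```python
# Arithmetic check of the peak number using only the figures quoted above.
m, n, k = 16384, 256, 7168
flops = 2 * m * n * k                  # ~60.1 GFLOP per call
us = flops / 831.4e12 * 1e6            # ~72 us per call at 831.4 TFLOPS
print(f"{flops / 1e9:.1f} GFLOP -> {us:.1f} us at peak")
```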
How to reproduce / regenerate for another gfx target
```
cd aiter
python3 gradlib/gradlib/gemm_tuner.py \
  --input_file aiter/configs/model_configs/dsv32_bf16_untuned_gemm.csv \
  --tuned_file aiter/configs/model_configs/dsv32_bf16_tuned_gemm.csv \
  --indtype bf16 --outdtype bf16 --libtype all
```

Tuning wall time ≈ 90 min on 8x MI355X (parallel via `mp_tuner`).
Validation plan (post-merge)
Run the V1-rc5 image (or rc6, if that's where this lands) DSv3.2 FP8 TP=8 1024/1024 conc-{1,4,16,64} sweep with `VLLM_ROCM_USE_AITER=1`. Expect conc=64 out_tps to recover from the current 786 to the ≥1100 tps band (parity with aiter 0.1.10).

Caveats
- Tuned for `gfx950, 256 CU` only; other gfx targets need regeneration (script above).
- Covers the `K=7168` hidden dim only.

Related