Fix address overflow extra issue by JaxChen29 · Pull Request #3124 · ROCm/aiter

JaxChen29 · 2026-05-11T08:53:42Z

Motivation

PR #2189 ("[FIX] address overflow fix on fmha_bwd of gfx942/gfx950") shipped
two unintended regressions on gfx950 that fail in production-shape tests but
were not covered by the existing smoke_test_bwd_v3.sh:

Bucket 1: bwd_hd128_*_causal_br_a32_psskddv_group.co (bf16 + fp16)
takes a GPU memory access fault or returns wrong values for any
group-mode + bottom-right causal + GQA shape such as
b=2 h=24 h_k=1 s=2048 s_k=2048 mask=b -bwd_v3=1 -v3_atomic_fp32=1 -mode=1.
Originally surfaced by a downstream FlashAttention test.
Bucket 2: all six D128 A16 group-mode kernels
(bwd_hd128_{bf16,fp16}_a16_psskddv_group.co,
bwd_hd128_{bf16,fp16}_causal_a16_psskddv_group.co,
bwd_hd128_{bf16,fp16}_causal_br_a16_psskddv_group.co)
return wrong values on every shape including SMOKE
(b=2 h=3 h_k=3 s=200 s_k=200). Cross-batch data corruption from
dq_acc layout mismatch.
Both regressions trace to a single upstream commit
f3b06a7 in poc_kl_merg ("fix address overflow on fmha bwd"), shipped as
PR [FIX] address overflow fix on fmha_bwd of gfx942/gfx950 #2189 on the aiter side. The SP3 source fixes live in poc_kl_merg
under scripts/fmha_bwd/prod_kernels/ (3 SP3 files); this PR ships only
the regenerated .co binaries so they can be picked up at runtime via
AITER_ASM_DIR.

Technical Details

kernel changes in
https://github.com/niels-zhang/poc_kl_merg/pull/33

Three SP3 source files (in poc_kl_merg, separately committed) and eight
regenerated .co files (this PR):

Bucket 1 - `FMHA_BWD_D128_..._A32_cas_br_kb_Genl.sp3` (commits `a3812ca` + `4a2508c`)

PR #2189 lifted the per-block dK/dV byte offset
Seqs_dk * loop_idx_k from a per-thread vector add on v_dK_addr to a
scalar add on s_dK_buf[0:1] (with s_addc_u32 for the carry). The change
is algebraically equivalent for idxen:1 reads, but it failed to also
adjust s_dK_buf[2] (num_records, the OOB envelope). With a bumped base
but un-bumped envelope, the buffer-OOB rule index < num_records never
fires, so the bottom partial-block of the per-batch dK region writes
spill into the next batch's region, surfacing as KGrad/VGrad mismatches at
the batch-1 boundary or full GPU memory faults under GQA replication.
Fix: keep PR #2189's 64-bit-safe scalar bump, additionally shrink/restore
s_dK_buf[2] and s_dV_buf[2] by the same offset in DW around the writes
and at code_exit_mask:. 4 hunks, +14 lines net.

Bucket 2 - `FMHA_BWD_D128_..._A16_Genl.sp3` + `FMHA_BWD_D128_DQ_SHUFFLE.sp3` (commit `3bf422d`)

The host allocates dq_acc for A16 group as (b, h, padded_max_sq, d)
with a16_dq_acc_seq = (max_seqlen_q + 15) / 16 * 16
(see op_tests/cpp/mha/benchmark_mha_bwd.cpp:357-360, 535, 550).
PR #2189 rewrote 3 separate MODE==1 indexing sites to use a flat
(h, sum_padded_seqlen, d) layout and unpadded seqlen_q for OOB,
mismatching the host allocation:

Sub-bug	File	Buggy MODE==1	Fixed (unconditional)
A	`FMHA_BWD_D128_DQ_SHUFFLE.sp3` (~line 292)	`sq_start * Seqs_dq_acc` for dq_acc READ batch_offset	`tg_idz * BAs_dq_acc`
B	`FMHA_BWD_D128_..._A16_Genl.sp3` (~line 2857)	`sq_start * Seqs_dQ` for dq_acc WRITE batch_offset	`tg_idz * BAs_dq`
C	`FMHA_BWD_D128_..._A16_Genl.sp3` (~line 2871)	`seqlen_q * Seqs_dQ` for OOB envelope (unpadded -> rows in `[seqlen, padded)` get garbage)	`padded_seqlen_q * Seqs_dQ`
PR #2189's actual 64-bit overflow protection (the `s_mul_hi_u32` +
`s_addc_u32` carry into `s_dQ_buf[1]`) is preserved - the fix only drops
the buggy MODE==1 indexing formulas and unifies both modes onto the
already-correct MODE==0 64-bit-safe pattern. 3 hunks, +10 lines, -24 lines net.

Files in this PR (8 .co binaries)

Regenerated from the SP3 source fixes above. Each file's MODE==0 (batch
mode) bytes are functionally equivalent to the prior shipped versions; only
the MODE==1 (group mode) code paths change.

Test Plan

Test Result

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

github-actions · 2026-05-11T08:53:53Z

🏷️ CI Guide

Runs automatically on every PR:

✅ Pre-checks (submodule verification, code formatting)
✅ Aiter op tests (gfx942 + gfx950)
✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

Label	Tests
`ci:triton-300x`	Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X
`ci:sglang`	SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy
`ci:atom`	ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B
`ci:atom_full`	ATOM accuracy suite for PR and main models from ATOM `models_accuracy.json`
`ci:vllm`	vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5
`ci:all`	All standard extended tests (excludes `ci:atom_full`)

Only add ci:atom_full for FlyDSL or Triton upgrades.
Add labels via the sidebar or gh pr edit 3124 --add-label <label>

JaxChen29 added 3 commits May 11, 2026 08:42

fix bug

a3812ca

fix bug

4a2508c

fix_bug_0511

3bf422d

JaxChen29 requested a review from a team May 11, 2026 08:53

valarLip approved these changes May 12, 2026

View reviewed changes

valarLip merged commit 28fc36f into ROCm:main May 12, 2026
28 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix address overflow extra issue#3124

Fix address overflow extra issue#3124
valarLip merged 3 commits into
ROCm:mainfrom
JaxChen29:fix_address_overflow_extra_issue

JaxChen29 commented May 11, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 11, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

JaxChen29 commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Technical Details

Bucket 1 - FMHA_BWD_D128_..._A32_cas_br_kb_Genl.sp3 (commits a3812ca + 4a2508c)

Bucket 2 - FMHA_BWD_D128_..._A16_Genl.sp3 + FMHA_BWD_D128_DQ_SHUFFLE.sp3 (commit 3bf422d)

Files in this PR (8 .co binaries)

Test Plan

Test Result

Submission Checklist

Uh oh!

github-actions Bot commented May 11, 2026

🏷️ CI Guide

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

JaxChen29 commented May 11, 2026 •

edited

Loading

Bucket 1 - `FMHA_BWD_D128_..._A32_cas_br_kb_Genl.sp3` (commits `a3812ca` + `4a2508c`)

Bucket 2 - `FMHA_BWD_D128_..._A16_Genl.sp3` + `FMHA_BWD_D128_DQ_SHUFFLE.sp3` (commit `3bf422d`)