Skip to content

Fix address overflow extra issue#3124

Merged
valarLip merged 3 commits into
ROCm:mainfrom
JaxChen29:fix_address_overflow_extra_issue
May 12, 2026
Merged

Fix address overflow extra issue#3124
valarLip merged 3 commits into
ROCm:mainfrom
JaxChen29:fix_address_overflow_extra_issue

Conversation

@JaxChen29
Copy link
Copy Markdown
Contributor

@JaxChen29 JaxChen29 commented May 11, 2026

Motivation

PR #2189 ("[FIX] address overflow fix on fmha_bwd of gfx942/gfx950") shipped
two unintended regressions on gfx950 that fail in production-shape tests but
were not covered by the existing smoke_test_bwd_v3.sh:

  • Bucket 1: bwd_hd128_*_causal_br_a32_psskddv_group.co (bf16 + fp16)
    takes a GPU memory access fault or returns wrong values for any
    group-mode + bottom-right causal + GQA shape such as
    b=2 h=24 h_k=1 s=2048 s_k=2048 mask=b -bwd_v3=1 -v3_atomic_fp32=1 -mode=1.
    Originally surfaced by a downstream FlashAttention test.
  • Bucket 2: all six D128 A16 group-mode kernels
    (bwd_hd128_{bf16,fp16}_a16_psskddv_group.co,
    bwd_hd128_{bf16,fp16}_causal_a16_psskddv_group.co,
    bwd_hd128_{bf16,fp16}_causal_br_a16_psskddv_group.co)
    return wrong values on every shape including SMOKE
    (b=2 h=3 h_k=3 s=200 s_k=200). Cross-batch data corruption from
    dq_acc layout mismatch.
    Both regressions trace to a single upstream commit
    f3b06a7 in poc_kl_merg ("fix address overflow on fmha bwd"), shipped as
    PR [FIX] address overflow fix on fmha_bwd of gfx942/gfx950 #2189 on the aiter side. The SP3 source fixes live in poc_kl_merg
    under scripts/fmha_bwd/prod_kernels/ (3 SP3 files); this PR ships only
    the regenerated .co binaries so they can be picked up at runtime via
    AITER_ASM_DIR.

Technical Details

kernel changes in
https://github.com/niels-zhang/poc_kl_merg/pull/33

Three SP3 source files (in poc_kl_merg, separately committed) and eight
regenerated .co files (this PR):

Bucket 1 - FMHA_BWD_D128_..._A32_cas_br_kb_Genl.sp3 (commits a3812ca + 4a2508c)

PR #2189 lifted the per-block dK/dV byte offset
Seqs_dk * loop_idx_k from a per-thread vector add on v_dK_addr to a
scalar add on s_dK_buf[0:1] (with s_addc_u32 for the carry). The change
is algebraically equivalent for idxen:1 reads, but it failed to also
adjust s_dK_buf[2] (num_records, the OOB envelope). With a bumped base
but un-bumped envelope, the buffer-OOB rule index < num_records never
fires, so the bottom partial-block of the per-batch dK region writes
spill into the next batch's region, surfacing as KGrad/VGrad mismatches at
the batch-1 boundary or full GPU memory faults under GQA replication.
Fix: keep PR #2189's 64-bit-safe scalar bump, additionally shrink/restore
s_dK_buf[2] and s_dV_buf[2] by the same offset in DW around the writes
and at code_exit_mask:. 4 hunks, +14 lines net.

Bucket 2 - FMHA_BWD_D128_..._A16_Genl.sp3 + FMHA_BWD_D128_DQ_SHUFFLE.sp3 (commit 3bf422d)

The host allocates dq_acc for A16 group as (b, h, padded_max_sq, d)
with a16_dq_acc_seq = (max_seqlen_q + 15) / 16 * 16
(see op_tests/cpp/mha/benchmark_mha_bwd.cpp:357-360, 535, 550).
PR #2189 rewrote 3 separate MODE==1 indexing sites to use a flat
(h, sum_padded_seqlen, d) layout and unpadded seqlen_q for OOB,
mismatching the host allocation:

Sub-bug File Buggy MODE==1 Fixed (unconditional)
A FMHA_BWD_D128_DQ_SHUFFLE.sp3 (~line 292) sq_start * Seqs_dq_acc for dq_acc READ batch_offset tg_idz * BAs_dq_acc
B FMHA_BWD_D128_..._A16_Genl.sp3 (~line 2857) sq_start * Seqs_dQ for dq_acc WRITE batch_offset tg_idz * BAs_dq
C FMHA_BWD_D128_..._A16_Genl.sp3 (~line 2871) seqlen_q * Seqs_dQ for OOB envelope (unpadded -> rows in [seqlen, padded) get garbage) padded_seqlen_q * Seqs_dQ
PR #2189's actual 64-bit overflow protection (the s_mul_hi_u32 +
s_addc_u32 carry into s_dQ_buf[1]) is preserved - the fix only drops
the buggy MODE==1 indexing formulas and unifies both modes onto the
already-correct MODE==0 64-bit-safe pattern. 3 hunks, +10 lines, -24 lines net.

Files in this PR (8 .co binaries)

Regenerated from the SP3 source fixes above. Each file's MODE==0 (batch
mode) bytes are functionally equivalent to the prior shipped versions; only
the MODE==1 (group mode) code paths change.

Test Plan

Test Result

Submission Checklist

@JaxChen29 JaxChen29 requested a review from a team May 11, 2026 08:53
@github-actions
Copy link
Copy Markdown
Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

Label Tests
ci:triton-300x Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X
ci:sglang SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy
ci:atom ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B
ci:atom_full ATOM accuracy suite for PR and main models from ATOM models_accuracy.json
ci:vllm vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5
ci:all All standard extended tests (excludes ci:atom_full)

Only add ci:atom_full for FlyDSL or Triton upgrades.
Add labels via the sidebar or gh pr edit 3124 --add-label <label>

@valarLip valarLip merged commit 28fc36f into ROCm:main May 12, 2026
28 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants