Skip to content

[ck_tile][fmha_bwd] Fix sink_host OOB in group mode reference runner#7272

Open
LJ-underdog wants to merge 2 commits intodevelopfrom
fix/fmha-bwd-sink-host-group-mode-oob
Open

[ck_tile][fmha_bwd] Fix sink_host OOB in group mode reference runner#7272
LJ-underdog wants to merge 2 commits intodevelopfrom
fix/fmha-bwd-sink-host-group-mode-oob

Conversation

@LJ-underdog
Copy link
Copy Markdown
Contributor

@LJ-underdog LJ-underdog commented May 11, 2026

Summary

In fmha_bwd_runner.hpp, the sink_host HostTensor is allocated with first
dimension shape_batch (= 1 in group mode), but the reference forward loop
accesses sink_host(wb, i_h) with wb ∈ [0, batch-1]. For any wb >= 1 this
is an out-of-bounds heap read, silently corrupting the reference forward math
chain (lse_host, o_host) and turning the bwd-side d_sink_head_acc
reference into non-deterministic garbage.

HostTensor::operator() does not bounds check, so the OOB is not caught at
runtime. This manifests as intermittent tile_example_fmha_bwd failures
(25–67% fail rate) when -sink_grad=1 is combined with -mode=1 (group mode),
with bit-exact but spurious max_err values like 4.27 / 14.6.

Fix

One-line: allocate sink_host with batch (the real per-batch dim) instead of
shape_batch, mirroring how sink_host is accessed by the loop.

-    sink_grad ? std::array<ck_tile::index_t, 2>{shape_batch, nhead}
+    sink_grad ? std::array<ck_tile::index_t, 2>{batch, nhead}

Repro

tile_example_fmha_bwd -b=2 -h=2 -s=516 -s_k=253 -prec=bf16 -d=72 \
  -bias=n -dbias=0 -p_drop=0 -iperm=1 -operm=1 -deterministic=0 \
  -v=3 -mode=1 -kname=1 -sink_grad=1

Verification

- 0/30 fail on the repro config after fix
- Baselines (before fix):
  - sink=1, mask=n: 25% fail rate (p ≈ 1.8e-4)
  - sink=1, mask=t: 67% fail rate (p ≈ 6e-15)

Attribution

Shape bug introduced together with sink_grad in #5504. Unrelated to #6914
(which is a fwd-only fix on a different code path)

Submission Checklist

In group mode, shape_batch is hard-coded to 1 (vs batch in batch mode),
but the reference forward loop accesses sink_host(wb, i_h) with
wb in [0, batch-1]. Allocating sink_host's first dimension as
shape_batch (=1) causes out-of-bounds heap reads for any wb >= 1,
silently corrupting the reference forward math chain (lse_host,
o_host) and turning the bwd-side d_sink_head_acc reference into
non-deterministic garbage.

HostTensor::operator() does not bounds check (host_tensor.hpp:541),
so the OOB is not caught at runtime. This manifests as intermittent
fmha_bwd test failures (25-67% fail rate) when -sink_grad=1 is
combined with -mode=1 (group mode), with bit-exact but spurious
max_err values like 4.27 / 14.6.

Fix: allocate sink_host with batch (the real per-batch dim) instead
of shape_batch, mirroring how sink_host is accessed by the loop.

Repro:
  tile_example_fmha_bwd -b=2 -h=2 -s=516 -s_k=253 -prec=bf16 -d=72 \
    -bias=n -dbias=0 -p_drop=0 -iperm=1 -operm=1 -deterministic=0 \
    -v=3 -mode=1 -kname=1 -sink_grad=1

Verification: 0/30 fail on the repro config after fix
(baselines: 25%/67% fail rates depending on mask config;
p-values 1.8e-4 and 6e-15 respectively).

Attribution: shape bug introduced together with sink_grad in
PR #5504; unrelated to PR #6914 (which is fwd-only).

Signed-off-by: junlin12 <junlin12@amd.com>
@LJ-underdog LJ-underdog marked this pull request as ready for review May 11, 2026 08:48
@LJ-underdog LJ-underdog requested a review from a team as a code owner May 11, 2026 08:48
@LJ-underdog LJ-underdog requested review from DDEle, Copilot and poyenc May 11, 2026 08:48
@LJ-underdog
Copy link
Copy Markdown
Contributor Author

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes an out-of-bounds host-side access in the ck_tile FMHA backward example/reference runner when sink_grad is enabled in group mode, by allocating sink_host with the true per-batch dimension (batch) instead of the packed/grouped dimension (shape_batch).

Changes:

  • Allocate sink_host as [batch, nhead] when sink_grad is enabled, matching the reference loop’s indexing (sink_host(wb, i_h)).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…e OOB

Adds a parameterized gtest case SinkGradGroupMode in test_fmha_bwd.cpp
that exercises sink_grad=true + mode=group + batch>=2, the exact
combination that triggered the out-of-bounds read fixed in commit
95a8c05.

Configurations covered:
- batch=2, h=2, s=516, s_k=253, hdim=72/64/128, mask in {no, causal-tl, causal-br}
  (the original bf16 repro that failed 25-67% of 30 runs without the fix)
- batch=3, h=4, hk=2, s=259, square (no mask)
- batch=4, h=2, s=200, s_k=180 (multi-batch stress)

Without the runner.hpp fix, fwd reference reads heap garbage from the
under-allocated sink_host buffer and validation fails non-deterministically;
with the fix this test is expected to PASS deterministically.

Addresses Copilot review comment on PR #7272.
Copy link
Copy Markdown
Contributor

@poyenc poyenc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean, correct, minimal fix with good regression test coverage. The one-line change directly addresses the root cause — sink_host was allocated with shape_batch (=1 in group mode) but accessed with wb ∈ [0, batch-1], causing OOB heap reads. The new SinkGradGroupMode test class properly targets the exact bug condition and will catch regressions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants