[ck_tile][fmha_bwd] Fix sink_host OOB in group mode reference runner#7272
[ck_tile][fmha_bwd] Fix sink_host OOB in group mode reference runner#7272LJ-underdog wants to merge 2 commits intodevelopfrom
Conversation
In group mode, shape_batch is hard-coded to 1 (vs batch in batch mode),
but the reference forward loop accesses sink_host(wb, i_h) with
wb in [0, batch-1]. Allocating sink_host's first dimension as
shape_batch (=1) causes out-of-bounds heap reads for any wb >= 1,
silently corrupting the reference forward math chain (lse_host,
o_host) and turning the bwd-side d_sink_head_acc reference into
non-deterministic garbage.
HostTensor::operator() does not bounds check (host_tensor.hpp:541),
so the OOB is not caught at runtime. This manifests as intermittent
fmha_bwd test failures (25-67% fail rate) when -sink_grad=1 is
combined with -mode=1 (group mode), with bit-exact but spurious
max_err values like 4.27 / 14.6.
Fix: allocate sink_host with batch (the real per-batch dim) instead
of shape_batch, mirroring how sink_host is accessed by the loop.
Repro:
tile_example_fmha_bwd -b=2 -h=2 -s=516 -s_k=253 -prec=bf16 -d=72 \
-bias=n -dbias=0 -p_drop=0 -iperm=1 -operm=1 -deterministic=0 \
-v=3 -mode=1 -kname=1 -sink_grad=1
Verification: 0/30 fail on the repro config after fix
(baselines: 25%/67% fail rates depending on mask config;
p-values 1.8e-4 and 6e-15 respectively).
Attribution: shape bug introduced together with sink_grad in
PR #5504; unrelated to PR #6914 (which is fwd-only).
Signed-off-by: junlin12 <junlin12@amd.com>
There was a problem hiding this comment.
Pull request overview
Fixes an out-of-bounds host-side access in the ck_tile FMHA backward example/reference runner when sink_grad is enabled in group mode, by allocating sink_host with the true per-batch dimension (batch) instead of the packed/grouped dimension (shape_batch).
Changes:
- Allocate
sink_hostas[batch, nhead]whensink_gradis enabled, matching the reference loop’s indexing (sink_host(wb, i_h)).
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…e OOB Adds a parameterized gtest case SinkGradGroupMode in test_fmha_bwd.cpp that exercises sink_grad=true + mode=group + batch>=2, the exact combination that triggered the out-of-bounds read fixed in commit 95a8c05. Configurations covered: - batch=2, h=2, s=516, s_k=253, hdim=72/64/128, mask in {no, causal-tl, causal-br} (the original bf16 repro that failed 25-67% of 30 runs without the fix) - batch=3, h=4, hk=2, s=259, square (no mask) - batch=4, h=2, s=200, s_k=180 (multi-batch stress) Without the runner.hpp fix, fwd reference reads heap garbage from the under-allocated sink_host buffer and validation fails non-deterministically; with the fix this test is expected to PASS deterministically. Addresses Copilot review comment on PR #7272.
poyenc
left a comment
There was a problem hiding this comment.
Clean, correct, minimal fix with good regression test coverage. The one-line change directly addresses the root cause — sink_host was allocated with shape_batch (=1 in group mode) but accessed with wb ∈ [0, batch-1], causing OOB heap reads. The new SinkGradGroupMode test class properly targets the exact bug condition and will catch regressions.
Summary
In
fmha_bwd_runner.hpp, thesink_hostHostTensoris allocated with firstdimension
shape_batch(= 1 in group mode), but the reference forward loopaccesses
sink_host(wb, i_h)withwb ∈ [0, batch-1]. For anywb >= 1thisis an out-of-bounds heap read, silently corrupting the reference forward math
chain (
lse_host,o_host) and turning the bwd-sided_sink_head_accreference into non-deterministic garbage.
HostTensor::operator()does not bounds check, so the OOB is not caught atruntime. This manifests as intermittent
tile_example_fmha_bwdfailures(25–67% fail rate) when
-sink_grad=1is combined with-mode=1(group mode),with bit-exact but spurious
max_errvalues like 4.27 / 14.6.Fix
One-line: allocate
sink_hostwithbatch(the real per-batch dim) instead ofshape_batch, mirroring howsink_hostis accessed by the loop.Submission Checklist