Commit ad4dea5
dot/schedule_bench: fix A_stride_m for transpose_a kernels
Dot kernels that set dot_flag::transpose_a (e.g. the SME/SME2 ones)
consume A packed as [k1/tile_k, k3, k2, {i, tile_k}]. They advance A_k1
per inner-k step by `A_stride_m * dot_factor` bytes, so the value that
belongs in that slot is the stride along the k1 dimension of the packed
tensor, which is `a_k_strides[0]`, not the (i, tile_k) intra-row stride
that schedule_bench was passing as `a_stride_m`.
subgraph/dot.cc already does this swap in `call_kernel`; without it in
schedule_bench, a K step advances A_k1 by one element (4 bytes for fp32)
instead of one full packed row (M×sizeof(TA) bytes). The kernel still
runs and the built-in correctness check passes because A and B are filled
with 1s, but the measured bandwidth and cache behaviour don't match the
production path: benchmark numbers reported against this bench were
artificially hot because successive k-steps re-read near-identical
addresses.
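
To make the size of that discrepancy concrete, the sketch below computes how far A_k1 moves per k step under each stride. The helper names, M = 256, the 64 k steps, and a dot_factor of 1 are assumptions chosen for illustration; only the 4-byte fp32 element, the M×sizeof(TA) packed row, and the `A_stride_m * dot_factor` step come from the description above.

```cpp
#include <cstddef>

namespace stride_sketch {

// Hypothetical helpers, not taken from schedule_bench or subgraph/dot.cc.

// Bytes A_k1 advances per inner-k step: A_stride_m * dot_factor.
constexpr std::size_t step_bytes(std::size_t a_stride_m_bytes,
                                 std::size_t dot_factor) {
  return a_stride_m_bytes * dot_factor;
}

// Stride passed before the fix: one element of TA (the intra-row stride).
template <typename TA>
constexpr std::size_t intra_row_stride_bytes() {
  return sizeof(TA);
}

// Stride passed after the fix: one full packed row of M elements.
template <typename TA>
constexpr std::size_t packed_row_stride_bytes(std::size_t m) {
  return m * sizeof(TA);
}

// Address range of A covered by k_steps consecutive steps.
constexpr std::size_t span_bytes(std::size_t step, std::size_t k_steps) {
  return step * k_steps;
}

// Assumed example: fp32, M = 256, dot_factor = 1, 64 k steps.
static_assert(step_bytes(intra_row_stride_bytes<float>(), 1) == 4,
              "before the fix: one fp32 element per k step");
static_assert(step_bytes(packed_row_stride_bytes<float>(256), 1) == 1024,
              "after the fix: one packed row per k step");
static_assert(span_bytes(4, 64) == 256, "before: 64 steps stay within 256 B");
static_assert(span_bytes(1024, 64) == 65536, "after: 64 steps cover 64 KiB");

}  // namespace stride_sketch
```

Under these assumed sizes, 64 k steps stay within 256 bytes with the old stride but walk 64 KiB with the corrected one, which is consistent with the hotter cache behaviour described above.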
Mirror the transpose_a-aware swap from subgraph/dot.cc so
schedule_bench measures the same work the production path executes.
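
A minimal sketch of what such a swap could look like, assuming a hypothetical `a_strides` struct and a stand-in flag enum. The real `call_kernel` in subgraph/dot.cc and the schedule_bench call site are not reproduced on this page and may be structured differently.

```cpp
#include <cstddef>

namespace swap_sketch {

// Hypothetical stand-ins; the names echo the commit message, not the real headers.
enum dot_flag : unsigned { no_flags = 0u, transpose_a = 1u << 0 };

struct a_strides {
  std::size_t stride_m;    // value handed to the kernel's A_stride_m slot
  std::size_t k_stride_0;  // stands in for a_k_strides[0]
};

// For transpose_a kernels, exchange the two strides so that the A_stride_m
// slot carries the stride along k1 of the packed tensor. Other kernels get
// the strides unchanged.
constexpr a_strides strides_for_kernel(a_strides s, unsigned flags) {
  return (flags & transpose_a) ? a_strides{s.k_stride_0, s.stride_m} : s;
}

// Assumed example: 4-byte intra-row stride, 1024-byte k1 stride.
static_assert(strides_for_kernel(a_strides{4, 1024}, transpose_a).stride_m == 1024,
              "transpose_a kernels receive the k1 stride in the A_stride_m slot");
static_assert(strides_for_kernel(a_strides{4, 1024}, no_flags).stride_m == 4,
              "other kernels keep the original stride");

}  // namespace swap_sketch
```

Keeping the selection in one small function would let the bench and the production path share it, so the two cannot drift apart again; whether the real code is factored that way is not stated here.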
After the fix, `dot_fp32_sme2` on the new (auto:8 MiB) schedule
reports realistic numbers that track the subgraph production bench.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent ef69e72, commit ad4dea5
1 file changed
Lines changed: 6 additions & 1 deletion
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| 152 | 152 | |
| 153 | 153 | |
| 154 | 154 | |
| 155 | | - |
| | 155 | + |
| | 156 | + |
| | 157 | + |
| | 158 | + |
| | 159 | + |
| | 160 | + |
| 156 | 161 | |
| 157 | 162 | |
| 158 | 163 | |