webgpu: adjust the parms for gemm-subgroup kernel#28760
Conversation
|
Nice work! A small comment: The code seems to be special-casing |
I had tested many architectures, saw there was few regression on xe-lpg and xe-3lpg when merging |
|
|
@qjia7, please help me review it, thank you. |
There was a problem hiding this comment.
Pull request overview
This PR adjusts WebGPU MatMul/Gemm subgroup-kernel tuning, primarily for Intel GPUs, by changing heuristics used for batch-folding and per-thread tile sizing based on matrix dimensions and detected adapter architecture. It also adds (currently disabled) large-shape regression tests intended to exercise these kernels.
Changes:
- Add Intel-architecture–aware
elements_per_thread_yselection for subgroup GEMM/MatMul (e.g., Xe-LPG/Xe-3LPG tuning). - Refine the “fold batchA into M” heuristic (including an
M < 128threshold and Intel-architecture gating). - Add disabled large-shape WebGPU MatMul/Gemm tests with a CPU reference implementation.
Reviewed changes
Copilot reviewed 3 out of 7 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| onnxruntime/test/providers/webgpu/matmul_large_test.cc | Adds disabled large-shape MatMul tests with a CPU reference path. |
| onnxruntime/test/providers/webgpu/gemm_large_test.cc | Adds disabled large-shape Gemm tests with a CPU reference path and bias-broadcast coverage. |
| onnxruntime/core/providers/webgpu/vendor/intel/math/matmul.cc | Adjusts Intel MatMul batch-folding heuristic and wires architecture-aware tile selection. |
| onnxruntime/core/providers/webgpu/vendor/intel/math/gemm.cc | Updates subgroup GEMM dispatch tiling to use the new architecture-aware helper. |
| onnxruntime/core/providers/webgpu/vendor/intel/math/gemm_subgroup.h | Changes ElementsPerThreadY signature to accept context (used for architecture-based tuning). |
| onnxruntime/core/providers/webgpu/vendor/intel/math/gemm_subgroup.cc | Implements Xe-LPG/Xe-3LPG-specific ElementsPerThreadY tuning. |
| onnxruntime/core/providers/webgpu/math/matmul.cc | Restricts batch-folding in the generic WebGPU MatMul path to cases with small M. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Just thinking out loud — @daijh ever hit an Intel driver issue in #25908, where an unbalanced dispatch shape caused a big perf drop on LNL. Is it possible that the fold path with large M is exposing the same class of driver behavior on Xe-LPG / Xe-3LPG? If yes, reshaping the dispatch (rather than skipping fold) might be a more general fix. |
|
The PR is not related to #25908, the dispatch issue only happens on LNL, and the regression here does not happen on LNL. |
d819dee to
9bade98
Compare
Review of PR #28760 —
|
| Architecture | M % 32 == 0 |
M % 32 ∈ [1, 24] |
M % 32 ∈ [25, 31] |
|---|---|---|---|
| Non Xe-LPG/3LPG | fold | fold | fold |
| Xe-LPG or Xe-3LPG | don't fold | fold | don't fold |
Rewriting using a named local mirrors what gemm_subgroup.cc:99-101 already does, avoids the double field access, and makes the intent obvious:
const int64_t M = output_shape[output_shape.NumDimensions() - 2];
const auto& arch = context.AdapterInfo().architecture;
const bool is_xe_lpg = arch == std::string_view("xe-lpg") || arch == std::string_view("xe-3lpg");
const int64_t m_mod_32 = M % 32;
// On Xe-LPG/3LPG, folding a batched matmul into a single large M loses Z-dispatch
// parallelism. Only fold when each per-batch M leaves a large fraction of invalid
// threads in its trailing workgroup (m_mod_32 ∈ [1, 24] = 25%–97% wasted), so the
// fold actually claws that waste back. Otherwise the Z-dispatch path wins.
const bool xe_lpg_fold_ok = (m_mod_32 > 0 && m_mod_32 <= 24);
if (batchA != 1 && batchB == 1 && (!is_xe_lpg || xe_lpg_fold_ok)) {Same semantics, one architecture lookup, and the comment captures the actual heuristic instead of the abstract "when the proportion of workgroups containing invalid threads is relatively small ." sentence (which also has a stray space before the period).
2. The 32 magic constant is implicitly coupled to two other constants
32 = kSubgroupLogicalWorkGroupSizeY (= 8) × ElementsPerThreadY-for-large-M-on-Xe-LPG (= 4). If either of those changes, this heuristic silently breaks. A one-liner near the comparison would prevent that:
// 32 = kSubgroupLogicalWorkGroupSizeY * ElementsPerThreadY(M > 32) on Xe-LPG/3LPG
const int64_t m_mod_32 = M % 32;Or, if you want it to track changes automatically, derive it: static_assert(kSubgroupLogicalWorkGroupSizeY * 4 == 32, "..."); — not strictly necessary, but cheap insurance.
3. GetBiasValue silently returns 0 on shape mismatch
if (c_shape.NumDimensions() == 2) {
if (c_shape[0] == M && c_shape[1] == 1) return c_vals[m];
if (c_shape[0] == M && c_shape[1] == N) return c_vals[m * N + n];
if (c_shape[0] == 1 && c_shape[1] == N) return c_vals[n];
}
return 0.0f;If someone writes a future test case with an unsupported bias shape (e.g., {1, M}, or a typo), the reference silently treats bias as zero and the test passes regardless of what the kernel did with the bias. Replacing the trailing return 0.0f; with ADD_FAILURE() << "Unsupported bias shape in test reference"; return 0.0f; keeps the function total but loudly flags the test-author mistake. Minor.
4. Test naming nit (not a real issue)
Gemm_Large / MatMul_Large with single tests DISABLED_AllShapes means there's one giant catch-all that, when it fails, dumps all sub-cases together. Splitting into a few smaller DISABLED_* tests (e.g., DISABLED_Aligned, DISABLED_Unaligned, DISABLED_TransAB, DISABLED_BiasBroadcast) would make manual triage easier when one shape regresses. Optional.
Confirmed correct
- The architecture-list scope is conservative. Xe-2LPG was checked and behaves fine without the workaround (per author's measurements: M=129/B=128 → +20–30% with fold on all three; M=127/B=128 → +10% on Xe-2LPG but −2/−4% on Xe-LPG/3LPG). Not expanding the list to Xe-2LPG or speculative-future Xe variants is the right call.
- The
Mused for the gate is the per-batch M (read offoutput_shape[NumDimensions()-2]before the reshape), which is what the heuristic is actually reasoning about. Confirmed by reading order of statements. ConfigEp(std::move(webgpu_ep))moves the unique_ptr so the EP isn't reused across multipleRunWithConfig()calls within the sameRunTestTyped. Each test call constructs a fresh EP. Fine.- Test reference uses
int64_tloop counters matchingM/N/Ktypes — no narrowing concerns.
Bottom line
Approve. The PR fixes a measured perf regression with an empirical heuristic that the author validated across the three Xe variants, the test infrastructure for catching future regressions is in place (even if opt-in), and the cleanup to drop the leading 1 from the fold reshape is independently nice. The boolean-restructure / comment / magic-constant items above would all improve readability and are worth a small follow-up commit if you're already touching the file, but none are blockers — @qjia7 has already signed off and the measurements support shipping this as-is.
|
@hariharans29, addressed your comments, please take a look again, thanks |
Re-review of PR #28760 —
|
|
Please update the motivation and description of the PR ? |
Done |
Description
Motivation and Context
Fix the Intel WebGPU performance regression reported in [Web] Performance regression in 20260502 version when using intel xe-lpg iGPU #28531, and improve test for large-shape MatMul and Gemm scenarios.