webgpu: adjust the parms for gemm-subgroup kernel by xhcao · Pull Request #28760 · microsoft/onnxruntime

xhcao · 2026-06-03T08:41:28Z

Description

Fixed the regression in the Intel xe-lpg path.
Tuned parameters/heuristics for xe-lpg and xe-3lpg.
Added large-shape MatMul and Gemm test coverage, following webgpu: add MatMul and Gemm cases with large shapes webgpu: add MatMul and Gemm cases with large shapes #26572 (comment), kept these large-shape tests with DISABLED_ prefix by default; run them with --gtest_also_run_disabled_tests when needed.

Motivation and Context

Fix the Intel WebGPU performance regression reported in [Web] Performance regression in 20260502 version when using intel xe-lpg iGPU #28531, and improve test for large-shape MatMul and Gemm scenarios.

xmcp · 2026-06-03T14:02:41Z

Nice work! A small comment: The code seems to be special-casing xe-lpg and xe-3lpg. So xe-2lpg (is that LNL?) and future intel cards are not getting ths behavior. Is that intended?

xhcao · 2026-06-05T03:32:52Z

Nice work! A small comment: The code seems to be special-casing xe-lpg and xe-3lpg. So xe-2lpg (is that LNL?) and future intel cards are not getting ths behavior. Is that intended?

I had tested many architectures, saw there was few regression on xe-lpg and xe-3lpg when merging batch dim and M dim. And more important, observed that 4 elements per thread of row dim is optimal when M is large on xe-lpg and xe-3lpg, so adjust the parms for these architectures

xhcao · 2026-06-05T03:40:32Z

Fixed the issue, [Web] Performance regression in 20260502 version when using intel xe-lpg iGPU #28531
Tuned the parameters for xe-lpg and xe-3lpg
Added the test cases, followed the comment webgpu: add MatMul and Gemm cases with large shapes #26572 (comment) to use DISABLED_ as prefix, use --gtest_also_run_disabled_tests when running the cases

xhcao · 2026-06-05T03:46:45Z

@qjia7, please help me review it, thank you.

Copilot

Pull request overview

This PR adjusts WebGPU MatMul/Gemm subgroup-kernel tuning, primarily for Intel GPUs, by changing heuristics used for batch-folding and per-thread tile sizing based on matrix dimensions and detected adapter architecture. It also adds (currently disabled) large-shape regression tests intended to exercise these kernels.

Changes:

Add Intel-architecture–aware elements_per_thread_y selection for subgroup GEMM/MatMul (e.g., Xe-LPG/Xe-3LPG tuning).
Refine the “fold batchA into M” heuristic (including an M < 128 threshold and Intel-architecture gating).
Add disabled large-shape WebGPU MatMul/Gemm tests with a CPU reference implementation.

Reviewed changes

Copilot reviewed 3 out of 7 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
onnxruntime/test/providers/webgpu/matmul_large_test.cc	Adds disabled large-shape MatMul tests with a CPU reference path.
onnxruntime/test/providers/webgpu/gemm_large_test.cc	Adds disabled large-shape Gemm tests with a CPU reference path and bias-broadcast coverage.
onnxruntime/core/providers/webgpu/vendor/intel/math/matmul.cc	Adjusts Intel MatMul batch-folding heuristic and wires architecture-aware tile selection.
onnxruntime/core/providers/webgpu/vendor/intel/math/gemm.cc	Updates subgroup GEMM dispatch tiling to use the new architecture-aware helper.
onnxruntime/core/providers/webgpu/vendor/intel/math/gemm_subgroup.h	Changes `ElementsPerThreadY` signature to accept context (used for architecture-based tuning).
onnxruntime/core/providers/webgpu/vendor/intel/math/gemm_subgroup.cc	Implements Xe-LPG/Xe-3LPG-specific `ElementsPerThreadY` tuning.
onnxruntime/core/providers/webgpu/math/matmul.cc	Restricts batch-folding in the generic WebGPU MatMul path to cases with small M.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

qjia7 · 2026-06-05T09:53:16Z

Just thinking out loud — @daijh ever hit an Intel driver issue in #25908, where an unbalanced dispatch shape caused a big perf drop on LNL. Is it possible that the fold path with large M is exposing the same class of driver behavior on Xe-LPG / Xe-3LPG? If yes, reshaping the dispatch (rather than skipping fold) might be a more general fix.

xhcao · 2026-06-08T08:58:10Z

The PR is not related to #25908, the dispatch issue only happens on LNL, and the regression here does not happen on LNL.
The root cause of regression on lpg or 3lpg was the tile size. Take [128, 32, 768] * [768, 768] for an example, if there was no folding, the M = 32, ElementsPerThreadY returned 4, the tile size was [32, 128]. If folding A, [1, 4096, 768] * [768, 768], the M = 4096, ElementsPerThreadY returned 8, the tile size was [64, 128], there was a regression lpg and a light regression on 3lpg, so I adjusted the params here.

hariharans29 · 2026-06-15T18:15:22Z

Review of PR #28760 — `webgpu: adjust the parms for gemm-subgroup kernel`

Verdict: approve. The empirical changes are backed by measured numbers across Xe-LPG / Xe-2LPG / Xe-3LPG (documented in the thread), already approved by @qjia7, and the regression target (issue #28531) is addressed. A few readability improvements worth landing as a small follow-up are below — none are blocking.

Things that look good

The non-vendor cleanup in matmul.cc is pure simplification. Dropping the leading 1 from dims_a = {1, batchAndM, K} / dims_b = {1, K, N} / output_shape = {1, batchAndM, N} removes a meaningless rank-3 wrapping that the kernel didn't need. Cleaner. The same cleanup is consistently applied to intel/math/matmul.cc.
ElementsPerThreadY simplification. Dropping is_vec4 and collapsing to a single M-only formula:
```
return M <= 8 ? 1 : (M <= 16 ? 2 : (M <= 32 ? 4 : (is_xe_lpg || is_xe_3lpg ? 4 : 8)));
```
is a real behavior change in the !is_vec4 path (was always 4, now follows the M ladder) — but the author re-tuned in response to @qjia7's question and confirmed it. Two call sites updated atomically (gemm.cc:87 and intel/math/matmul.cc:101). The Xe-LPG/3LPG cap at 4 for M > 32 matches the reported [128, 32, 768] × [768, 768] regression case (folded path was picking 8 → tile [64×128] → regression).
Test EP selection. The new tests explicitly grab DefaultWebGpuExecutionProvider(), GTEST_SKIP() when it's not present, and ConfigEp(std::move(webgpu_ep)).RunWithConfig(). That's exactly the fix Copilot suggested, applied consistently to both new files. Without this, RunWithConfig() would have silently passed on CPU-only configurations without ever exercising WebGPU.
DISABLED_AllShapes opt-in pattern follows the precedent from webgpu: add MatMul and Gemm cases with large shapes #26572. The tests are reproducible (RandomValueGenerator{1234} seeded), cover aligned + unaligned + transA/transB + various bias-broadcast shapes (gemm) and 2D/3D/4D + broadcast (matmul), and won't slow CI.

Worth tightening (non-blocking)

1. The fold gate in `intel/math/matmul.cc:60-67` is hard to parse

const int64_t M = output_shape[output_shape.NumDimensions() - 2];
const int64_t m_mod_32 = M % 32;
if (batchA != 1 && batchB == 1 &&
    (!(context.AdapterInfo().architecture == std::string_view("xe-lpg") ||
       context.AdapterInfo().architecture == std::string_view("xe-3lpg")) ||
     (m_mod_32 > 0 && m_mod_32 <= 24))) {

Negation of a disjunction with a positive on the other side, plus the architecture string is fetched twice. It took me a couple of passes to convince myself the truth table is:

Architecture	`M % 32 == 0`	`M % 32 ∈ [1, 24]`	`M % 32 ∈ [25, 31]`
Non Xe-LPG/3LPG	fold	fold	fold
Xe-LPG or Xe-3LPG	don't fold	fold	don't fold

Rewriting using a named local mirrors what gemm_subgroup.cc:99-101 already does, avoids the double field access, and makes the intent obvious:

const int64_t M = output_shape[output_shape.NumDimensions() - 2];
const auto& arch = context.AdapterInfo().architecture;
const bool is_xe_lpg = arch == std::string_view("xe-lpg") || arch == std::string_view("xe-3lpg");
const int64_t m_mod_32 = M % 32;
// On Xe-LPG/3LPG, folding a batched matmul into a single large M loses Z-dispatch
// parallelism. Only fold when each per-batch M leaves a large fraction of invalid
// threads in its trailing workgroup (m_mod_32 ∈ [1, 24] = 25%–97% wasted), so the
// fold actually claws that waste back. Otherwise the Z-dispatch path wins.
const bool xe_lpg_fold_ok = (m_mod_32 > 0 && m_mod_32 <= 24);
if (batchA != 1 && batchB == 1 && (!is_xe_lpg || xe_lpg_fold_ok)) {

Same semantics, one architecture lookup, and the comment captures the actual heuristic instead of the abstract "when the proportion of workgroups containing invalid threads is relatively small ." sentence (which also has a stray space before the period).

2. The `32` magic constant is implicitly coupled to two other constants

32 = kSubgroupLogicalWorkGroupSizeY (= 8) × ElementsPerThreadY-for-large-M-on-Xe-LPG (= 4). If either of those changes, this heuristic silently breaks. A one-liner near the comparison would prevent that:

// 32 = kSubgroupLogicalWorkGroupSizeY * ElementsPerThreadY(M > 32) on Xe-LPG/3LPG
const int64_t m_mod_32 = M % 32;

Or, if you want it to track changes automatically, derive it: static_assert(kSubgroupLogicalWorkGroupSizeY * 4 == 32, "..."); — not strictly necessary, but cheap insurance.

3. `GetBiasValue` silently returns 0 on shape mismatch

gemm_large_test.cc:14-31:

if (c_shape.NumDimensions() == 2) {
  if (c_shape[0] == M && c_shape[1] == 1) return c_vals[m];
  if (c_shape[0] == M && c_shape[1] == N) return c_vals[m * N + n];
  if (c_shape[0] == 1 && c_shape[1] == N) return c_vals[n];
}
return 0.0f;

If someone writes a future test case with an unsupported bias shape (e.g., {1, M}, or a typo), the reference silently treats bias as zero and the test passes regardless of what the kernel did with the bias. Replacing the trailing return 0.0f; with ADD_FAILURE() << "Unsupported bias shape in test reference"; return 0.0f; keeps the function total but loudly flags the test-author mistake. Minor.

4. Test naming nit (not a real issue)

Gemm_Large / MatMul_Large with single tests DISABLED_AllShapes means there's one giant catch-all that, when it fails, dumps all sub-cases together. Splitting into a few smaller DISABLED_* tests (e.g., DISABLED_Aligned, DISABLED_Unaligned, DISABLED_TransAB, DISABLED_BiasBroadcast) would make manual triage easier when one shape regresses. Optional.

Confirmed correct

The architecture-list scope is conservative. Xe-2LPG was checked and behaves fine without the workaround (per author's measurements: M=129/B=128 → +20–30% with fold on all three; M=127/B=128 → +10% on Xe-2LPG but −2/−4% on Xe-LPG/3LPG). Not expanding the list to Xe-2LPG or speculative-future Xe variants is the right call.
The M used for the gate is the per-batch M (read off output_shape[NumDimensions()-2] before the reshape), which is what the heuristic is actually reasoning about. Confirmed by reading order of statements.
ConfigEp(std::move(webgpu_ep)) moves the unique_ptr so the EP isn't reused across multiple RunWithConfig() calls within the same RunTestTyped. Each test call constructs a fresh EP. Fine.
Test reference uses int64_t loop counters matching M/N/K types — no narrowing concerns.

Bottom line

Approve. The PR fixes a measured perf regression with an empirical heuristic that the author validated across the three Xe variants, the test infrastructure for catching future regressions is in place (even if opt-in), and the cleanup to drop the leading 1 from the fold reshape is independently nice. The boolean-restructure / comment / magic-constant items above would all improve readability and are worth a small follow-up commit if you're already touching the file, but none are blockers — @qjia7 has already signed off and the measurements support shipping this as-is.

xhcao · 2026-06-17T07:48:31Z

@hariharans29, addressed your comments, please take a look again, thanks

hariharans29 · 2026-06-17T07:58:23Z

Re-review of PR #28760 — `webgpu: adjust the parms for gemm-subgroup kernel`

Verdict: re-approve on d6f2f3e. All four follow-up items from my prior review have been addressed cleanly. The stale-reviews flag is just because of the main-merge in 2bd1e14; the substantive changes since 5687cea are concentrated in d6f2f3e and they're all readability fixes (no logic change).

What was already good before this round

Unchanged from my prior review — still holds:

Dropping the leading 1 from the fold reshape on both math/matmul.cc:194-206 and vendor/intel/math/matmul.cc.
ElementsPerThreadY simplification (Xe-LPG/3LPG cap at 4 for M > 32) with both call sites updated atomically.
Test EP selection via DefaultWebGpuExecutionProvider() + GTEST_SKIP() + ConfigEp(std::move(webgpu_ep)).
DISABLED_ opt-in pattern, seeded RandomValueGenerator{1234}, no CI cost.

What changed in `d6f2f3e` — each prior follow-up item is now resolved

Fold-gate readability in vendor/intel/math/matmul.cc:59-70:

// On Xe-LPG/3LPG, folding a batched matmul into a single large M loses Z-dispatch
// parallelism. Only fold when each per-batch M leaves a large fraction of invalid
// threads in its trailing workgroup (m_mod_32 ∈ [1, 24] = 25%–97% wasted), so the
// fold actually claws that waste back. Otherwise the Z-dispatch path wins.
const int64_t M = output_shape[output_shape.NumDimensions() - 2];
const auto& arch = context.AdapterInfo().architecture;
const bool is_xe_lpg_or_xe_3lpg = arch == std::string_view("xe-lpg") ||
                                  arch == std::string_view("xe-3lpg");
// 32 = kSubgroupLogicalWorkGroupSizeY * ElementsPerThreadY(M > 32) on Xe-LPG/3LPG
const int64_t m_mod_32 = M % 32;
const bool xe_lpg_or_xe_3lpg_fold_ok = (m_mod_32 > 0 && m_mod_32 <= 24);
if (batchA != 1 && batchB == 1 && (!is_xe_lpg_or_xe_3lpg || xe_lpg_or_xe_3lpg_fold_ok)) {

Named local + single architecture lookup + boolean restructure with the truth-table-obvious form (!is_xe_lpg_or_xe_3lpg || xe_lpg_or_xe_3lpg_fold_ok). Reads as intended now.

32 magic constant — // 32 = kSubgroupLogicalWorkGroupSizeY * ElementsPerThreadY(M > 32) on Xe-LPG/3LPG. Documents the implicit coupling so a future change to either constant will surface here at review time. Resolved.
GetBiasValue silent fallback in gemm_large_test.cc:30-31:
```
ADD_FAILURE() << "Unsupported bias shape in test reference";
return 0.0f;
```
Now a future test author writing an unsupported bias shape (e.g. {1, M}) gets a loud test-author error instead of a silent pass. Resolved.
Test naming / triage — the single DISABLED_AllShapes is gone, replaced by:
- Gemm_Large.{DISABLED_Aligned, DISABLED_Unaligned, DISABLED_TransAB, DISABLED_BiasBroadcast}
- MatMul_Large.{DISABLED_Aligned, DISABLED_Unaligned, DISABLED_Broadcast3D, DISABLED_Broadcast4D}
When a future regression hits, manual triage starts with the failing subgroup name. Resolved.

The same is_xe_lpg_or_xe_3lpg pattern now lives in two places (here and in vendor/intel/math/gemm_subgroup.cc:99-101). Tiny duplication, but the natural extraction point is a kIntelArchitectureFamilies helper that I don't think is worth carving out for two call sites. Leave it.

Nothing new to flag

I re-read the full delta from a3afac0 through 2bd1e14. No new logic paths, no new heuristics, the merge in 2bd1e14 is a clean main-merge with no conflict-resolution surprises. qjia7 had already approved on 5687cea, the only changes since are the four readability fixes you requested.

Sanity check on the heuristic math (unchanged from prior review)

kSubgroupLogicalWorkGroupSizeY = 8 × ElementsPerThreadY(M > 32) on Xe-LPG/3LPG = 4 → 32 rows per Y-workgroup. m_mod_32 is the rows in the trailing workgroup:

m_mod_32 = 0 → fully utilized, fold gains nothing.
m_mod_32 ∈ [1, 24] → 25%–97% waste per batch's trailing workgroup → folding reclaims it.
m_mod_32 ∈ [25, 31] → only 3%–22% waste → not enough to offset losing Z-dispatch.

Comment and constants are consistent.

Bottom line

Ship it. d6f2f3e is a textbook follow-up commit — pure readability and test-hygiene fixes, no behavior change. Once a reviewer re-clicks approve, this is good to merge.

hariharans29 · 2026-06-17T17:37:33Z

Please update the motivation and description of the PR ?

xhcao · 2026-06-18T02:32:34Z

Please update the motivation and description of the PR ?

Done

webgpu: adjust the parms for gemm-subgroup kernel

a3afac0

xhcao mentioned this pull request Jun 3, 2026

[Web] Performance regression in 20260502 version when using intel xe-lpg iGPU #28531

Open

Add comments and cases

a7c63bd

xhcao marked this pull request as ready for review June 5, 2026 03:40

qjia7 reviewed Jun 5, 2026

View reviewed changes

Comment thread onnxruntime/core/providers/webgpu/math/matmul.cc Outdated

Comment thread onnxruntime/core/providers/webgpu/vendor/intel/math/gemm_subgroup.cc Outdated

qjia7 requested a review from Copilot June 5, 2026 09:24

Copilot started reviewing on behalf of qjia7 June 5, 2026 09:25 View session

Copilot AI reviewed Jun 5, 2026

View reviewed changes

Comment thread onnxruntime/test/providers/webgpu/matmul_large_test.cc Outdated

Comment thread onnxruntime/test/providers/webgpu/gemm_large_test.cc Outdated

Address comments

5f7f673

remove the condition M < 128

9bade98

xhcao force-pushed the adjust-subgroup-kernel-args branch from d819dee to 9bade98 Compare June 10, 2026 06:15

xhcao commented Jun 10, 2026

View reviewed changes

Comment thread onnxruntime/core/providers/webgpu/vendor/intel/math/matmul.cc Outdated

qjia7 reviewed Jun 10, 2026

View reviewed changes

Comment thread onnxruntime/core/providers/webgpu/math/matmul.cc Outdated

Comment thread onnxruntime/core/providers/webgpu/vendor/intel/math/matmul.cc Outdated

update comments

5687cea

qjia7 previously approved these changes Jun 12, 2026

View reviewed changes

qjia7 requested a review from hariharans29 June 12, 2026 07:19

hariharans29 previously approved these changes Jun 15, 2026

View reviewed changes

xhcao added 2 commits June 17, 2026 15:31

Address comments

d6f2f3e

Merge branch 'main' into adjust-subgroup-kernel-args

2bd1e14

xhcao dismissed stale reviews from hariharans29 and qjia7 via 2bd1e14 June 17, 2026 07:44

hariharans29 requested a review from qjia7 June 17, 2026 07:57

hariharans29 requested a review from guschmue June 17, 2026 07:57

hariharans29 approved these changes Jun 17, 2026

View reviewed changes

hariharans29 merged commit 26b9cfd into microsoft:main Jun 17, 2026
85 checks passed

Uh oh!

Conversation

xhcao commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Motivation and Context

Uh oh!

xmcp commented Jun 3, 2026

Uh oh!

xhcao commented Jun 5, 2026

Uh oh!

xhcao commented Jun 5, 2026

Uh oh!

xhcao commented Jun 5, 2026

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

qjia7 commented Jun 5, 2026

Uh oh!

xhcao commented Jun 8, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

hariharans29 commented Jun 15, 2026

Review of PR #28760 — webgpu: adjust the parms for gemm-subgroup kernel

Things that look good

Worth tightening (non-blocking)

1. The fold gate in intel/math/matmul.cc:60-67 is hard to parse

2. The 32 magic constant is implicitly coupled to two other constants

3. GetBiasValue silently returns 0 on shape mismatch

4. Test naming nit (not a real issue)

Confirmed correct

Bottom line

Uh oh!

xhcao commented Jun 17, 2026

Uh oh!

hariharans29 commented Jun 17, 2026

Re-review of PR #28760 — webgpu: adjust the parms for gemm-subgroup kernel

What was already good before this round

What changed in d6f2f3e — each prior follow-up item is now resolved

Nothing new to flag

Sanity check on the heuristic math (unchanged from prior review)

Bottom line

Uh oh!

hariharans29 commented Jun 17, 2026

Uh oh!

Uh oh!

xhcao commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

xhcao commented Jun 3, 2026 •

edited

Loading

Review of PR #28760 — `webgpu: adjust the parms for gemm-subgroup kernel`

1. The fold gate in `intel/math/matmul.cc:60-67` is hard to parse

2. The `32` magic constant is implicitly coupled to two other constants

3. `GetBiasValue` silently returns 0 on shape mismatch

Re-review of PR #28760 — `webgpu: adjust the parms for gemm-subgroup kernel`

What changed in `d6f2f3e` — each prior follow-up item is now resolved