Skip to content

webgpu: adjust the parms for gemm-subgroup kernel#28760

Merged
hariharans29 merged 7 commits into
microsoft:mainfrom
xhcao:adjust-subgroup-kernel-args
Jun 17, 2026
Merged

webgpu: adjust the parms for gemm-subgroup kernel#28760
hariharans29 merged 7 commits into
microsoft:mainfrom
xhcao:adjust-subgroup-kernel-args

Conversation

@xhcao

@xhcao xhcao commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Description

  1. Fixed the regression in the Intel xe-lpg path.
  2. Tuned parameters/heuristics for xe-lpg and xe-3lpg.
  3. Added large-shape MatMul and Gemm test coverage, following webgpu: add MatMul and Gemm cases with large shapes webgpu: add MatMul and Gemm cases with large shapes #26572 (comment), kept these large-shape tests with DISABLED_ prefix by default; run them with --gtest_also_run_disabled_tests when needed.

Motivation and Context

Fix the Intel WebGPU performance regression reported in [Web] Performance regression in 20260502 version when using intel xe-lpg iGPU #28531, and improve test for large-shape MatMul and Gemm scenarios.

@xmcp

xmcp commented Jun 3, 2026

Copy link
Copy Markdown

Nice work! A small comment: The code seems to be special-casing xe-lpg and xe-3lpg. So xe-2lpg (is that LNL?) and future intel cards are not getting ths behavior. Is that intended?

@xhcao

xhcao commented Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

Nice work! A small comment: The code seems to be special-casing xe-lpg and xe-3lpg. So xe-2lpg (is that LNL?) and future intel cards are not getting ths behavior. Is that intended?

I had tested many architectures, saw there was few regression on xe-lpg and xe-3lpg when merging batch dim and M dim. And more important, observed that 4 elements per thread of row dim is optimal when M is large on xe-lpg and xe-3lpg, so adjust the parms for these architectures

@xhcao

xhcao commented Jun 5, 2026

Copy link
Copy Markdown
Contributor Author
  1. Fixed the issue, [Web] Performance regression in 20260502 version when using intel xe-lpg iGPU #28531
  2. Tuned the parameters for xe-lpg and xe-3lpg
  3. Added the test cases, followed the comment webgpu: add MatMul and Gemm cases with large shapes #26572 (comment) to use DISABLED_ as prefix, use --gtest_also_run_disabled_tests when running the cases

@xhcao xhcao marked this pull request as ready for review June 5, 2026 03:40
@xhcao

xhcao commented Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

@qjia7, please help me review it, thank you.

Comment thread onnxruntime/core/providers/webgpu/math/matmul.cc Outdated
Comment thread onnxruntime/core/providers/webgpu/vendor/intel/math/gemm_subgroup.cc Outdated

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adjusts WebGPU MatMul/Gemm subgroup-kernel tuning, primarily for Intel GPUs, by changing heuristics used for batch-folding and per-thread tile sizing based on matrix dimensions and detected adapter architecture. It also adds (currently disabled) large-shape regression tests intended to exercise these kernels.

Changes:

  • Add Intel-architecture–aware elements_per_thread_y selection for subgroup GEMM/MatMul (e.g., Xe-LPG/Xe-3LPG tuning).
  • Refine the “fold batchA into M” heuristic (including an M < 128 threshold and Intel-architecture gating).
  • Add disabled large-shape WebGPU MatMul/Gemm tests with a CPU reference implementation.

Reviewed changes

Copilot reviewed 3 out of 7 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
onnxruntime/test/providers/webgpu/matmul_large_test.cc Adds disabled large-shape MatMul tests with a CPU reference path.
onnxruntime/test/providers/webgpu/gemm_large_test.cc Adds disabled large-shape Gemm tests with a CPU reference path and bias-broadcast coverage.
onnxruntime/core/providers/webgpu/vendor/intel/math/matmul.cc Adjusts Intel MatMul batch-folding heuristic and wires architecture-aware tile selection.
onnxruntime/core/providers/webgpu/vendor/intel/math/gemm.cc Updates subgroup GEMM dispatch tiling to use the new architecture-aware helper.
onnxruntime/core/providers/webgpu/vendor/intel/math/gemm_subgroup.h Changes ElementsPerThreadY signature to accept context (used for architecture-based tuning).
onnxruntime/core/providers/webgpu/vendor/intel/math/gemm_subgroup.cc Implements Xe-LPG/Xe-3LPG-specific ElementsPerThreadY tuning.
onnxruntime/core/providers/webgpu/math/matmul.cc Restricts batch-folding in the generic WebGPU MatMul path to cases with small M.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread onnxruntime/test/providers/webgpu/matmul_large_test.cc Outdated
Comment thread onnxruntime/test/providers/webgpu/gemm_large_test.cc Outdated
@qjia7

qjia7 commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Just thinking out loud — @daijh ever hit an Intel driver issue in #25908, where an unbalanced dispatch shape caused a big perf drop on LNL. Is it possible that the fold path with large M is exposing the same class of driver behavior on Xe-LPG / Xe-3LPG? If yes, reshaping the dispatch (rather than skipping fold) might be a more general fix.

@xhcao

xhcao commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

The PR is not related to #25908, the dispatch issue only happens on LNL, and the regression here does not happen on LNL.
The root cause of regression on lpg or 3lpg was the tile size. Take [128, 32, 768] * [768, 768] for an example, if there was no folding, the M = 32, ElementsPerThreadY returned 4, the tile size was [32, 128]. If folding A, [1, 4096, 768] * [768, 768], the M = 4096, ElementsPerThreadY returned 8, the tile size was [64, 128], there was a regression lpg and a light regression on 3lpg, so I adjusted the params here.

@xhcao xhcao force-pushed the adjust-subgroup-kernel-args branch from d819dee to 9bade98 Compare June 10, 2026 06:15
Comment thread onnxruntime/core/providers/webgpu/vendor/intel/math/matmul.cc Outdated
Comment thread onnxruntime/core/providers/webgpu/math/matmul.cc Outdated
Comment thread onnxruntime/core/providers/webgpu/vendor/intel/math/matmul.cc Outdated
qjia7
qjia7 previously approved these changes Jun 12, 2026
@qjia7 qjia7 requested a review from hariharans29 June 12, 2026 07:19
@hariharans29

Copy link
Copy Markdown
Member

Review of PR #28760webgpu: adjust the parms for gemm-subgroup kernel

Verdict: approve. The empirical changes are backed by measured numbers across Xe-LPG / Xe-2LPG / Xe-3LPG (documented in the thread), already approved by @qjia7, and the regression target (issue #28531) is addressed. A few readability improvements worth landing as a small follow-up are below — none are blocking.


Things that look good

  1. The non-vendor cleanup in matmul.cc is pure simplification. Dropping the leading 1 from dims_a = {1, batchAndM, K} / dims_b = {1, K, N} / output_shape = {1, batchAndM, N} removes a meaningless rank-3 wrapping that the kernel didn't need. Cleaner. The same cleanup is consistently applied to intel/math/matmul.cc.

  2. ElementsPerThreadY simplification. Dropping is_vec4 and collapsing to a single M-only formula:

    return M <= 8 ? 1 : (M <= 16 ? 2 : (M <= 32 ? 4 : (is_xe_lpg || is_xe_3lpg ? 4 : 8)));

    is a real behavior change in the !is_vec4 path (was always 4, now follows the M ladder) — but the author re-tuned in response to @qjia7's question and confirmed it. Two call sites updated atomically (gemm.cc:87 and intel/math/matmul.cc:101). The Xe-LPG/3LPG cap at 4 for M > 32 matches the reported [128, 32, 768] × [768, 768] regression case (folded path was picking 8 → tile [64×128] → regression).

  3. Test EP selection. The new tests explicitly grab DefaultWebGpuExecutionProvider(), GTEST_SKIP() when it's not present, and ConfigEp(std::move(webgpu_ep)).RunWithConfig(). That's exactly the fix Copilot suggested, applied consistently to both new files. Without this, RunWithConfig() would have silently passed on CPU-only configurations without ever exercising WebGPU.

  4. DISABLED_AllShapes opt-in pattern follows the precedent from webgpu: add MatMul and Gemm cases with large shapes #26572. The tests are reproducible (RandomValueGenerator{1234} seeded), cover aligned + unaligned + transA/transB + various bias-broadcast shapes (gemm) and 2D/3D/4D + broadcast (matmul), and won't slow CI.


Worth tightening (non-blocking)

1. The fold gate in intel/math/matmul.cc:60-67 is hard to parse

const int64_t M = output_shape[output_shape.NumDimensions() - 2];
const int64_t m_mod_32 = M % 32;
if (batchA != 1 && batchB == 1 &&
    (!(context.AdapterInfo().architecture == std::string_view("xe-lpg") ||
       context.AdapterInfo().architecture == std::string_view("xe-3lpg")) ||
     (m_mod_32 > 0 && m_mod_32 <= 24))) {

Negation of a disjunction with a positive on the other side, plus the architecture string is fetched twice. It took me a couple of passes to convince myself the truth table is:

Architecture M % 32 == 0 M % 32 ∈ [1, 24] M % 32 ∈ [25, 31]
Non Xe-LPG/3LPG fold fold fold
Xe-LPG or Xe-3LPG don't fold fold don't fold

Rewriting using a named local mirrors what gemm_subgroup.cc:99-101 already does, avoids the double field access, and makes the intent obvious:

const int64_t M = output_shape[output_shape.NumDimensions() - 2];
const auto& arch = context.AdapterInfo().architecture;
const bool is_xe_lpg = arch == std::string_view("xe-lpg") || arch == std::string_view("xe-3lpg");
const int64_t m_mod_32 = M % 32;
// On Xe-LPG/3LPG, folding a batched matmul into a single large M loses Z-dispatch
// parallelism. Only fold when each per-batch M leaves a large fraction of invalid
// threads in its trailing workgroup (m_mod_32 ∈ [1, 24] = 25%–97% wasted), so the
// fold actually claws that waste back. Otherwise the Z-dispatch path wins.
const bool xe_lpg_fold_ok = (m_mod_32 > 0 && m_mod_32 <= 24);
if (batchA != 1 && batchB == 1 && (!is_xe_lpg || xe_lpg_fold_ok)) {

Same semantics, one architecture lookup, and the comment captures the actual heuristic instead of the abstract "when the proportion of workgroups containing invalid threads is relatively small ." sentence (which also has a stray space before the period).

2. The 32 magic constant is implicitly coupled to two other constants

32 = kSubgroupLogicalWorkGroupSizeY (= 8) × ElementsPerThreadY-for-large-M-on-Xe-LPG (= 4). If either of those changes, this heuristic silently breaks. A one-liner near the comparison would prevent that:

// 32 = kSubgroupLogicalWorkGroupSizeY * ElementsPerThreadY(M > 32) on Xe-LPG/3LPG
const int64_t m_mod_32 = M % 32;

Or, if you want it to track changes automatically, derive it: static_assert(kSubgroupLogicalWorkGroupSizeY * 4 == 32, "..."); — not strictly necessary, but cheap insurance.

3. GetBiasValue silently returns 0 on shape mismatch

gemm_large_test.cc:14-31:

if (c_shape.NumDimensions() == 2) {
  if (c_shape[0] == M && c_shape[1] == 1) return c_vals[m];
  if (c_shape[0] == M && c_shape[1] == N) return c_vals[m * N + n];
  if (c_shape[0] == 1 && c_shape[1] == N) return c_vals[n];
}
return 0.0f;

If someone writes a future test case with an unsupported bias shape (e.g., {1, M}, or a typo), the reference silently treats bias as zero and the test passes regardless of what the kernel did with the bias. Replacing the trailing return 0.0f; with ADD_FAILURE() << "Unsupported bias shape in test reference"; return 0.0f; keeps the function total but loudly flags the test-author mistake. Minor.

4. Test naming nit (not a real issue)

Gemm_Large / MatMul_Large with single tests DISABLED_AllShapes means there's one giant catch-all that, when it fails, dumps all sub-cases together. Splitting into a few smaller DISABLED_* tests (e.g., DISABLED_Aligned, DISABLED_Unaligned, DISABLED_TransAB, DISABLED_BiasBroadcast) would make manual triage easier when one shape regresses. Optional.


Confirmed correct

  • The architecture-list scope is conservative. Xe-2LPG was checked and behaves fine without the workaround (per author's measurements: M=129/B=128 → +20–30% with fold on all three; M=127/B=128 → +10% on Xe-2LPG but −2/−4% on Xe-LPG/3LPG). Not expanding the list to Xe-2LPG or speculative-future Xe variants is the right call.
  • The M used for the gate is the per-batch M (read off output_shape[NumDimensions()-2] before the reshape), which is what the heuristic is actually reasoning about. Confirmed by reading order of statements.
  • ConfigEp(std::move(webgpu_ep)) moves the unique_ptr so the EP isn't reused across multiple RunWithConfig() calls within the same RunTestTyped. Each test call constructs a fresh EP. Fine.
  • Test reference uses int64_t loop counters matching M/N/K types — no narrowing concerns.

Bottom line

Approve. The PR fixes a measured perf regression with an empirical heuristic that the author validated across the three Xe variants, the test infrastructure for catching future regressions is in place (even if opt-in), and the cleanup to drop the leading 1 from the fold reshape is independently nice. The boolean-restructure / comment / magic-constant items above would all improve readability and are worth a small follow-up commit if you're already touching the file, but none are blockers — @qjia7 has already signed off and the measurements support shipping this as-is.

hariharans29
hariharans29 previously approved these changes Jun 15, 2026
@xhcao xhcao dismissed stale reviews from hariharans29 and qjia7 via 2bd1e14 June 17, 2026 07:44
@xhcao

xhcao commented Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

@hariharans29, addressed your comments, please take a look again, thanks

@hariharans29 hariharans29 requested a review from qjia7 June 17, 2026 07:57
@hariharans29 hariharans29 requested a review from guschmue June 17, 2026 07:57
@hariharans29

Copy link
Copy Markdown
Member

Re-review of PR #28760webgpu: adjust the parms for gemm-subgroup kernel

Verdict: re-approve on d6f2f3e. All four follow-up items from my prior review have been addressed cleanly. The stale-reviews flag is just because of the main-merge in 2bd1e14; the substantive changes since 5687cea are concentrated in d6f2f3e and they're all readability fixes (no logic change).

What was already good before this round

Unchanged from my prior review — still holds:

  • Dropping the leading 1 from the fold reshape on both math/matmul.cc:194-206 and vendor/intel/math/matmul.cc.
  • ElementsPerThreadY simplification (Xe-LPG/3LPG cap at 4 for M > 32) with both call sites updated atomically.
  • Test EP selection via DefaultWebGpuExecutionProvider() + GTEST_SKIP() + ConfigEp(std::move(webgpu_ep)).
  • DISABLED_ opt-in pattern, seeded RandomValueGenerator{1234}, no CI cost.

What changed in d6f2f3e — each prior follow-up item is now resolved

  1. Fold-gate readability in vendor/intel/math/matmul.cc:59-70:

    // On Xe-LPG/3LPG, folding a batched matmul into a single large M loses Z-dispatch
    // parallelism. Only fold when each per-batch M leaves a large fraction of invalid
    // threads in its trailing workgroup (m_mod_32 ∈ [1, 24] = 25%–97% wasted), so the
    // fold actually claws that waste back. Otherwise the Z-dispatch path wins.
    const int64_t M = output_shape[output_shape.NumDimensions() - 2];
    const auto& arch = context.AdapterInfo().architecture;
    const bool is_xe_lpg_or_xe_3lpg = arch == std::string_view("xe-lpg") ||
                                      arch == std::string_view("xe-3lpg");
    // 32 = kSubgroupLogicalWorkGroupSizeY * ElementsPerThreadY(M > 32) on Xe-LPG/3LPG
    const int64_t m_mod_32 = M % 32;
    const bool xe_lpg_or_xe_3lpg_fold_ok = (m_mod_32 > 0 && m_mod_32 <= 24);
    if (batchA != 1 && batchB == 1 && (!is_xe_lpg_or_xe_3lpg || xe_lpg_or_xe_3lpg_fold_ok)) {

    Named local + single architecture lookup + boolean restructure with the truth-table-obvious form (!is_xe_lpg_or_xe_3lpg || xe_lpg_or_xe_3lpg_fold_ok). Reads as intended now.

  2. 32 magic constant// 32 = kSubgroupLogicalWorkGroupSizeY * ElementsPerThreadY(M > 32) on Xe-LPG/3LPG. Documents the implicit coupling so a future change to either constant will surface here at review time. Resolved.

  3. GetBiasValue silent fallback in gemm_large_test.cc:30-31:

    ADD_FAILURE() << "Unsupported bias shape in test reference";
    return 0.0f;

    Now a future test author writing an unsupported bias shape (e.g. {1, M}) gets a loud test-author error instead of a silent pass. Resolved.

  4. Test naming / triage — the single DISABLED_AllShapes is gone, replaced by:

    • Gemm_Large.{DISABLED_Aligned, DISABLED_Unaligned, DISABLED_TransAB, DISABLED_BiasBroadcast}
    • MatMul_Large.{DISABLED_Aligned, DISABLED_Unaligned, DISABLED_Broadcast3D, DISABLED_Broadcast4D}

    When a future regression hits, manual triage starts with the failing subgroup name. Resolved.

The same is_xe_lpg_or_xe_3lpg pattern now lives in two places (here and in vendor/intel/math/gemm_subgroup.cc:99-101). Tiny duplication, but the natural extraction point is a kIntelArchitectureFamilies helper that I don't think is worth carving out for two call sites. Leave it.

Nothing new to flag

I re-read the full delta from a3afac0 through 2bd1e14. No new logic paths, no new heuristics, the merge in 2bd1e14 is a clean main-merge with no conflict-resolution surprises. qjia7 had already approved on 5687cea, the only changes since are the four readability fixes you requested.

Sanity check on the heuristic math (unchanged from prior review)

kSubgroupLogicalWorkGroupSizeY = 8 × ElementsPerThreadY(M > 32) on Xe-LPG/3LPG = 4 → 32 rows per Y-workgroup. m_mod_32 is the rows in the trailing workgroup:

  • m_mod_32 = 0 → fully utilized, fold gains nothing.
  • m_mod_32 ∈ [1, 24] → 25%–97% waste per batch's trailing workgroup → folding reclaims it.
  • m_mod_32 ∈ [25, 31] → only 3%–22% waste → not enough to offset losing Z-dispatch.

Comment and constants are consistent.

Bottom line

Ship it. d6f2f3e is a textbook follow-up commit — pure readability and test-hygiene fixes, no behavior change. Once a reviewer re-clicks approve, this is good to merge.

@hariharans29

Copy link
Copy Markdown
Member

Please update the motivation and description of the PR ?

@hariharans29 hariharans29 merged commit 26b9cfd into microsoft:main Jun 17, 2026
85 checks passed
@xhcao

xhcao commented Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

Please update the motivation and description of the PR ?

Done

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants