[MLAS] KleidiAI fix igemm regression by martin-klacer-arm · Pull Request #28571 · microsoft/onnxruntime

martin-klacer-arm · 2026-05-19T15:32:18Z

Description

This PR fixes a convolution performance regression affecting some OCR models with large-kernel convolutions when the KleidiAI SME IGEMM convolution path is selected.

The change has 2 parts:

updates to the KleidiAI IGEMM LHS packing to pack rows in bounded chunks instead of packing the full LHS buffer up front, which reduces memory usage and improves cache locality for large convolutions,
a new route selection function ArmKleidiAI::SelectConvRoute that decides between Igemm, GemmFallback and None based on convolution parameters and a workload size-based heuristic.

The function CheckCapabilitiesSme runs SelectConvRoute and only returns true if the selected route is Igemm. The patch also adds a standard GEMM fallback to the ConvRoute possibilities, and runs MlasGemm if said fallback is selected. If the function selects None, then the convolution falls back to MlasSgemmOperation.

Motivation and Context

Fixes #27633.

Copilot

Pull request overview

This PR addresses a CPU convolution performance regression on ARM64 when the KleidiAI SME IGEMM path is selected, by introducing a new convolution route-selection heuristic and changing the KleidiAI LHS packing strategy to reduce peak temporary memory and improve cache locality on large workloads.

Changes:

Added ArmKleidiAI::SelectConvRoute (with ConvRoute options Igemm, GemmFallback, None) to decide whether to run KleidiAI IGEMM, fall back to GEMM, or use the existing path.
Updated KleidiAI convolution to pack the IGEMM LHS in bounded chunks instead of packing the full LHS up front.
In the generic MLAS convolution implementation, routed certain workloads to MlasGemm (instead of MlasSgemmOperation) when SelectConvRoute chooses GemmFallback.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File	Description
onnxruntime/core/mlas/lib/kleidiai/mlasi_kleidiai.h	Adds `ConvRoute` + `SelectConvRoute` heuristic helpers for choosing conv execution route.
onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp	Reworks KleidiAI SME IGEMM LHS packing into chunked packing and integrates route selection into capability checks.
onnxruntime/core/mlas/lib/convolve.cpp	Uses `SelectConvRoute` to optionally switch im2col+SGEMM slices to `MlasGemm` for the fallback route.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

martin-klacer-arm · 2026-05-20T09:29:13Z

@microsoft-github-policy-service agree company="Arm"

hariharans29 · 2026-05-20T20:53:29Z

Can you please address the Copliot comments ?

martin-klacer-arm · 2026-05-21T16:07:24Z

Hi, thanks for the reminder. The comments Copilot left are sensible, I'm addressing them with a new patch that's currently work in progress. Once it's done and clears internal review, I'll add it to the PR.

hariharans29 · 2026-05-21T18:48:49Z

Hi, thanks for the reminder. The comments Copilot left are sensible, I'm addressing them with a new patch that's currently work in progress. Once it's done and clears internal review, I'll add it to the PR.

Thank you!

martin-klacer-arm · 2026-05-27T09:33:46Z

Hi, I pushed a new patch addressing the Copilot comments, and left a reply with further details on one of the comments.

hariharans29 · 2026-05-27T15:59:08Z

Review — PR #28571 `[MLAS] KleidiAI fix igemm regression`

Verdict: good direction, accept after a few fixes. The two changes (chunked LHS packing + workload-based route selection) are the right shape for fixing #27633. There are some real concerns around the heuristic, the new per-tile allocation, and use of ORT_ENFORCE inside a parallel worker — but nothing structural.

Substantive concerns

1. ORT_ENFORCE inside MlasTrySimpleParallel. The inner loop in ConvolveSme now does:

ORT_ENFORCE(bytes_per_m_step != 0,
            "KleidiAI LHS packed size must be non-zero.");

inside the worker lambda. MlasTrySimpleParallel does not catch exceptions thrown from workers — in exception-disabled builds ORT_ENFORCE calls abort(), and in exception-enabled builds the throw will likely terminate the process from a worker thread. Also: bytes_per_m_step depends only on m_step, d_kh*d_kw, ci — none of which change between iterations. Compute it (and check it) once before entering the parallel region. CheckCapabilitiesSme already validates this same quantity for the Igemm route, so by the time we reach ConvolveSme it cannot be zero; the enforce here is dead weight. Hoist the call out of the lambda and either drop the check or convert to an assertion.

2. Per-tile std::make_unique<std::byte[]> allocation. Each parallel worker now does:

auto lhs = std::make_unique<std::byte[]>(lhs_buffer_bytes);

inside the MlasTrySimpleParallel lambda, so this is one heap allocation per (m_tile, n_tile). For large co/n_tile_step, this is a lot of churn that the old code avoided by allocating the full LHS once. A thread_local grow-only std::vector<std::byte> (mirroring the pad_ptr pattern just above) would eliminate it. Yes, that re-introduces the cross-call state pattern this codebase has been bitten by (see #28654), so document the invariant clearly — but the cost of per-tile allocation in a hot conv loop is not free.

3. MAX_LHS_CHUNK_BYTES = 2 MiB is a magic constant. It's commented as "cache-friendly," but there is no cache-size discovery and no per-CPU tuning — Apple M-series, Snapdragon X, AWS Graviton, and Ampere Altra all have very different L2 sizes (and SME implementations). At minimum:

Add a comment explaining why 2 MiB specifically (target L2? something else?).
Consider deriving from MlasGetPreferredBufferAlignment or a cpuinfo-derived size.
Add a runtime override (env var / session option) for tuning experiments before this is baked in.

Also: the chunk computation

size_t m_chunk = std::max<size_t>(m_step,
                                  MAX_LHS_CHUNK_BYTES / bytes_per_m_step * m_step);

silently truncates MAX_LHS_CHUNK_BYTES / bytes_per_m_step to zero when bytes_per_m_step > MAX_LHS_CHUNK_BYTES, then the std::max rescues it back to m_step. That works, but it means the "2 MiB cap" silently does nothing for any workload whose per-m_step packed size exceeds 2 MiB. Make this explicit with a comment, or restructure so the intent is obvious.

4. ConvIgemmMaxWork = 1'000'000 threshold. This is the heuristic boundary between Igemm and GemmFallback. There is no data in the PR description showing how this was chosen (which models, which CPUs, which kernel sizes). The linked issue #27633 specifically mentions Apple M-series and 9×9 OCR kernels; what does the curve look like at, say, output_m × effective_k × filter_count near the threshold? At minimum, please add a comment citing the measurement methodology, and ideally make this overridable.

The formula

const auto igemm_max_output_m = (ConvIgemmMaxWork / effective_k / filter_count);
if (output_m > igemm_max_output_m) return ConvRoute::GemmFallback;

is equivalent to output_m * effective_k * filter_count > 1'000'000 modulo integer-division rounding. The integer-division form silently lowers the effective threshold when effective_k * filter_count doesn't divide cleanly into 1M. Use SafeInt/mul_overflow_size_t_builtin and a single comparison:

size_t total_work;
if (mul_overflow_size_t_builtin(output_m, effective_k, &total_work) ||
    mul_overflow_size_t_builtin(total_work, filter_count, &total_work)) {
    return ConvRoute::GemmFallback;  // assume too large
}
if (total_work > ConvIgemmMaxWork) return ConvRoute::GemmFallback;

5. New cross-file coupling. convolve.cpp now #includes kleidiai/mlasi_kleidiai.h and calls ArmKleidiAI::SelectConvRoute directly:

#if defined(USE_KLEIDIAI)
#include "kleidiai/mlasi_kleidiai.h"
#endif
...
#if defined(USE_KLEIDIAI)
use_gemm_batch_override =
    ArmKleidiAI::SelectConvRoute(Parameters) == ArmKleidiAI::ConvRoute::GemmFallback;
#endif

This means the generic MLAS convolve path now has compile-time and call-site knowledge of one specific backend. The existing pattern in MLAS for this is the dispatch override / MlasPlatform selector. Wrap this behind a MLAS_CONV_ROUTE_OVERRIDE function pointer set up in platform.cpp, the same way every other backend hook works. That keeps convolve.cpp backend-agnostic and avoids more #if defined(USE_KLEIDIAI) scaffolding creeping in.

Also: SelectConvRoute is now called twice per Compute() — once in CheckCapabilitiesSme and once at the top of the convolve op (via the new flag). It does non-trivial overflow-checked arithmetic. Cache the result on MLAS_CONV_PARAMETERS (there is already a BackendKernelSelectorConfig pointer there) so it isn't recomputed.

6. MlasGemm vs MlasSgemmOperation substitution semantics. The change does:

if (use_gemm_batch_override) {
    MlasGemm(CblasNoTrans, CblasNoTrans, FilterCount, CountN, CountK, 1.0f,
             Filter + k, K, ColumnBuffer, CountN, beta,
             SegmentOutput, OutputSize, nullptr, Parameters->BackendKernelSelectorConfig);
} else {
    MlasSgemmOperation(...);
}

MlasGemm accepts a threadpool (passed as nullptr here) and may parallelize internally, while MlasSgemmOperation is a single-segment operation that the caller (MlasConvOperation) is already running inside its own segment loop. Confirm that calling MlasGemm(..., nullptr, ...) from inside the per-thread segment loop doesn't accidentally re-enter the threadpool or serialize what was previously parallelized. A quick perf check on the regression case from #27633 would also confirm the fallback actually wins where the heuristic claims it does.

7. mul_overflow_size_t_builtin forward declaration in a public-ish header. mlasi_kleidiai.h now has:

inline bool mul_overflow_size_t_builtin(size_t a, size_t b, size_t* out);

with the comment "implementation is at the bottom of the file." Forward-declaring an inline function whose definition lives in the same TU is fine, but per AGENTS.md there is already MlasMultiplyOverflowsSizeT in mlasi.h from PR #28416. Use that and delete the local helper (PR #28487 already introduces a third copy as MlasTryMultiplySizeT). Three names for the same operation is too many.

8. CheckCapabilitiesSme redundant work. The new code path:

const auto route = ArmKleidiAI::SelectConvRoute(Parameters);
if (route == ArmKleidiAI::ConvRoute::Igemm) {
    const size_t d_kh = ComputeKernelSize(...);
    const size_t d_kw = ComputeKernelSize(...);
    const size_t m_step = imatmul_conv.ukernel.get_m_step();
    const size_t bytes_per_m_step = kai_get_lhs_packed_size_...(m_step, d_kh*d_kw, ...);
    if (bytes_per_m_step == 0) { ... return false; }
    return true;
}

SelectConvRoute already computes effective_kernel_h/w. Returning the dilated kernel sizes from SelectConvRoute (or expanding to a struct return) would avoid recomputation. Minor.

Nits

mlasi_kleidiai.h: the #include <iostream> move into the KLEIDIAI_DEBUG_LOGGING || KLEIDIAI_KERNEL_LOGGING block is good. Drive-by cleanup, +1.
TryComputeConvOutputSize returns true with *result = 0 for the padded_input < kernel case and for stride == 0. The caller (SelectConvRoute) then checks output_m == 0 and returns ConvRoute::None. That works but is non-obvious — the function returns true on what is semantically failure. Consider returning false for these (or rename the function to reflect "compute output size, with zero meaning 'no valid output'").
SelectConvRoute rejects filter_count == 1 outright. Comment why (presumably because the SME imatmul has min-N requirements). Future readers will wonder.
The comment //RhsPackWeightsBiasSme() - to perform dilation increases kernel size and masks unused weights was deleted — fine, since it was just a note, but the new code still does the same dilation expansion; consider keeping the note near ComputeKernelSize calls.
MIdx/NIdx → m_idx/n_idx rename is good; please apply the same to other CamelCase locals (ATile, BTile, CTile, TileSize*) for consistency in the touched block.
ConvRoute::None value ordering: None, Igemm, GemmFallback. Consider documenting the contract — most readers will expect None to mean "this PR's logic doesn't apply, defer to caller's default" rather than "fail." That's how convolve.cpp interprets it, but the enum name doesn't make it obvious.
inline constexpr size_t ConvIgemmMaxWork = 1000000ULL; — please use 1'000'000 for readability.

Things to verify before merging

Re-run the regression model from [Performance] CPU Conv kernel regression for large kernels (9x9) on macOS ARM64 in ORT 1.24.x #27633 on both Apple M-series and Snapdragon X with the chunked packing; confirm the regression is gone and there is no slowdown on the cases that previously hit the IGEMM path cleanly.
Confirm MlasGemm(..., nullptr, ...) from inside MlasConvOperation's segment loop doesn't cause re-entrant threadpool dispatch.
Run existing MLAS conv unit tests under TSAN if available — the thread_local caches plus chunked packing change cross-thread access patterns.
Confirm the bytes_per_m_step == 0 early-out in CheckCapabilitiesSme makes the ORT_ENFORCE in ConvolveSme provably unreachable, then drop the enforce.

Bottom line

Approve after:

Hoist or drop the ORT_ENFORCE from the worker lambda.
Replace per-tile make_unique with TLS scratch (or document why per-tile is acceptable).
Route SelectConvRoute through the existing dispatch-override pattern instead of #if defined(USE_KLEIDIAI) in convolve.cpp.
Document/justify the 2 MiB and 1M thresholds (or make them overridable).
Converge on one mul_overflow_size_t helper.

The chunked-packing idea and the route-selection split are both correct fixes for the reported regression. With the above tightening this should be safe to land.

martin-klacer-arm · 2026-05-28T11:30:27Z

Thank you for the thorough review! The comments are sensible and understandable. I will update the PR as soon as we have a patch to address them finished and internally reviewed.

martin-klacer-arm · 2026-06-09T17:48:51Z

Hi @hariharans29, thanks again for the thorough review. I've pushed an updated patch addressing the comments.

Summary of the changes with respect to the previous review comment points:

Moved ORT_ENFORCE out of the MlasTrySimpleParallel inner loop and changed it to a sanity-check assert
Replaced per-tile std::make_unique allocation with a thread-local grow-only reusable buffer
Updated MAX_LHS_CHUNK_BYTES

Renamed MAX_LHS_CHUNK_BYTES and improved explanation comment for consistency and clarity
Made the m_step > MAX_LHS_CHUNK_BYTES case explicit, using m_step directly when over the byte limit

Added session option override for ConvIgemmMaxWork
Removed the new convolve.cpp cross-file coupling by moving route selection behind MLAS platform override hook
No code changes made, MlasGemm(..., nullptr, ...) call correctness addressed in reply
Removed duplicate mul_overflow_size_t_builtin helper, reused MlasMultiplyOverflowsSizeT instead
Cached dilated kernel size calculation in SelectConvRoute to avoid redundant work

Nits:

Changed TryComputeConvOutputSize to return false when there's no valid output shape
Added a comment explaining why SelectConvRoute rejects filter_count == 1
Restored the kernel dilation explanation comment
Minor readability fixes

Additional verification and notes

Re-running the regressing model

I reran the regressing PP-OCR model from #27633 on a macOS Arm64 device based on the original report.

Using onnxruntime_perf_test, the patch gives an average total inference time of about 1,057 ms. This is a substantial improvement over the regressing 1.24.3 result of about 2,837 ms, also improving on the previously tested version 1.21.1 with a result of about 1,431 ms.

I didn't include a Snapdragon X run because the original issue reported the regression on macOS Arm64.

`ConvIgemmMaxWork` default value

The default ConvIgemmMaxWork threshold of 1,000,000 is intended as a conservative heuristic anchored to the existing MLAS SGEMM threading scale.

The source file onnxruntime/core/mlas/lib/mlasi.h defines the constants MLAS_SGEMM_THREAD_COMPLEXITY to be 64 * 1024 and MLAS_MAXIMUM_THREAD_COUNT to be 16, the existing MLAS SGEMM threading heuristic uses MLAS_SGEMM_THREAD_COMPLEXITY * MLAS_MAXIMUM_THREAD_COUNT as the scale for reaching the maximum thread budget, which puts the boundary just over 1e6.

Below this value, the KleidiAI SME IGEMM path may be competitive by avoiding extra im2col work. Above the value, since MLAS can use the maximum available thread count, the SME IGEMM path's additional indirect/LHS packing overhead cost becomes more difficult to justify. As such, for Conv nodes above this size, the safer default is to route to the SGEMM-backed fallback.

I also swept ConvIgemmMaxWork from 0 to 1e9 on the regressing PP-OCR det model, plus mobilenetv1_ssd_f32 and retinaface_f32. The smaller model sweeps are affected by normal run-to-run noise, and while they don't show 1e6 as the clearly optimal threshold, they also don't clearly point to a better default threshold. The reported PP-OCR regression case strongly supports falling back for the large convolution nodes involved in the reported regression.

Given that, I kept 1e6 as a conservative default. It is tied to an existing MLAS work scale, fixes the reported large-convolution regression, and avoids making a broader design change to the IGEMM path, which would be out of scope for this regression-fix PR. The new session option leaves the threshold overridable for follow-up tuning.

For reference, the PP-OCR, mobilenet and retinaface sweep data is shown in the following graphs:

MlasGemm vs MlasSgemmOperation per-thread segment (review point 6.)

The MlasGemm(..., nullptr, ...) call from MlasConvOperation under use_gemm_batch_override should not cause re-entrant threadpool dispatch.

In this call path, MlasGemm reaches MlasGemmBatch, which uses MlasTrySimpleParallel. With a nullptr threadpool, the internally used TrySimpleParallelFor collapses to a single-threaded loop, so passing nullptr here keeps the inner MlasGemm execution single-threaded.

Unit Tests run under TSAN

I built onnxruntime_mlas_test with -fsanitize=thread and ran existing MLAS Conv unit tests. The tests completed without any TSAN issues.

hariharans29 · 2026-06-09T18:23:57Z

Re-review — PR #28571 `[MLAS] KleidiAI fix igemm regression`

Verdict: approve. Commit 3d7acb4 ("Addressed maintainer review feedback") plus the inline replies cover every one of the eight substantive concerns from the prior round, and the author backed up the heuristic with actual measurement data. One minor non-blocker remains, called out below.

What got fixed since the last pass

1. ORT_ENFORCE in the parallel worker — gone. The bytes_per_m_step computation is now hoisted above MlasTrySimpleParallel and reduced to assert(bytes_per_m_step != 0); — which is correct, since CheckCapabilitiesSme already rejects the zero-size case for the Igemm route. Clean.

2. Per-tile std::make_unique — replaced with TLS scratch. The pattern now mirrors the pad_ptr thread-local cache:

// Thread-local grow-only reusable buffer for LHS packing
thread_local std::vector<std::byte> lhs_buffer;
if (lhs_buffer.size() < lhs_buffer_bytes) {
    lhs_buffer.resize(lhs_buffer_bytes);
}
auto* lhs = lhs_buffer.data();

No per-tile heap churn, and the invariant (single-threaded inside the lambda, grow-only) is the same one already accepted for pad_ptr and lhs_ptrs_cache_by_pad. The lifecycle is fine because the buffer is only read/written within the lambda body — no cross-call state escapes.

3. The 2 MiB chunk constant — documented and renamed. kMaxLhsPackedChunkBytes now has a four-line comment explaining the budget choice (common L2 size) and the per-m_step fall-through case, and the silent-truncation arithmetic is now explicit:

const size_t max_m_step_chunks = kMaxLhsPackedChunkBytes / bytes_per_m_step;
// If bytes_per_m_step exceeds kMaxLhsPackedChunkBytes, process m_step at a time
size_t m_chunk = max_m_step_chunks == 0 ? m_step : m_step * max_m_step_chunks;

Intent is now readable. The cpuinfo-derived tuning ask was correctly deferred — it's a separate concern from this regression fix.

4. ConvIgemmMaxWork — overflow-safe, overridable, and justified. Three good fixes here:

The formula is now the explicit output_m * effective_k * filter_count > limit form using MlasMultiplyOverflowsSizeT, with overflow correctly treated as "too large → fall back":

size_t total_work;
if (MlasMultiplyOverflowsSizeT(output_m, effective_k, &total_work) ||
    MlasMultiplyOverflowsSizeT(total_work, filter_count, &total_work)) {
    return ConvRouteSelection{ConvRoute::GemmFallback, ...};
}

New session option kOrtSessionOptionsMlasKleidiAiConvIgemmMaxWork ("mlas.kleidiai.conv_igemm_max_work") with the docstring "0" or unset uses the MLAS default heuristic. Lets us tune this per deployment without a rebuild.
The 1'000'000ULL default is now justified in the PR thread: anchored to MLAS_SGEMM_THREAD_COMPLEXITY * MLAS_MAXIMUM_THREAD_COUNT == 64*1024 * 16 ≈ 1.05M, i.e., the existing point at which MLAS SGEMM already saturates its thread budget. That's a defensible anchor, not a freshly invented magic number.

The PP-OCR re-run is the part I most wanted to see: 2837 ms → 1057 ms on macOS Arm64, also beating the pre-regression 1.21.1 baseline of 1431 ms. That's a real fix, not a heuristic that papers over the symptom. The threshold sweeps on mobilenet/retinaface didn't show a sharper optimum, which is consistent with "1M is where SGEMM-with-threads catches up to single-thread IGEMM-with-im2col."

5. Cross-file coupling — removed via the dispatch hook. This is exactly the pattern I asked for:

// mlasi.h
enum MLAS_CONV_ROUTE { MlasConvRouteDefault, MlasConvRouteGemmFallback };
typedef MLAS_CONV_ROUTE (MLASCALL MLAS_CONV_ROUTE_OVERRIDE)(const MLAS_CONV_PARAMETERS*);
struct MLAS_PLATFORM {
    ...
    MLAS_CONV_ROUTE_OVERRIDE* MlasConvRouteOverride = nullptr;
};

// platform.cpp
this->MlasConvRouteOverride = ArmKleidiAI::MlasConvRoute;

// convolve.cpp
bool use_gemm_batch_override =
    GetMlasPlatform().MlasConvRouteOverride != nullptr &&
    GetMlasPlatform().MlasConvRouteOverride(Parameters) == MlasConvRouteGemmFallback;

No #if defined(USE_KLEIDIAI) scaffolding in convolve.cpp, the generic path stays backend-agnostic, and non-KleidiAI builds get the unchanged old behavior (nullptr override → MlasSgemmOperation).

6. MlasGemm(..., nullptr, ...) re-entrancy. Answered in the thread, and the answer is right: MlasGemmBatch with a null threadpool goes through MlasTrySimpleParallel → TrySimpleParallelFor, which with nullptr collapses to a serial loop. No re-entrant dispatch, no nested parallelism. Acceptable.

7. Duplicate overflow helper — converged. mul_overflow_size_t_builtin is deleted; the canonical MlasMultiplyOverflowsSizeT is hoisted into mlasi.h and the three call sites (sgemm_kleidiai.cpp, sbgemm_kleidiai.cpp, mlasi_kleidiai.h) all use it. One name, one definition. Good.

8. CheckCapabilitiesSme redundant kernel-size compute — cached. SelectConvRoute now returns a ConvRouteSelection { route, effective_kernel_h, effective_kernel_w }, and the caller in CheckCapabilitiesSme reads the dilated sizes off the result instead of recomputing.

Nits — all addressed

TryComputeConvOutputSize now returns false on degenerate shapes instead of true with *result = 0.
The filter_count == 1 rejection has a comment explaining the SME N-tile fill rationale.
Dilation-expansion comment restored above the ComputeKernelSize calls in ConvolveSme.
1'000'000ULL uses digit separators.
CamelCase locals (ATile/BTile/CTile/MIdx/NIdx/TileSizeM/TileSizeN) all converted to snake_case in the touched block.
#include <iostream> correctly moved under the KLEIDIAI_*_LOGGING guard.

Still open — one minor item

SelectConvRoute is still called twice per Compute() when the convolution would route to GemmFallback:

Once inside CheckCapabilitiesSme (returns false, so the KleidiAI conv override declines).
Once inside convolve.cpp via GetMlasPlatform().MlasConvRouteOverride(Parameters) (returns MlasConvRouteGemmFallback, so MlasGemm is used).

The arithmetic is overflow-checked and non-trivial: two TryComputeDilatedKernelSize, two TryComputeConvOutputSize, three MlasMultiplyOverflowsSizeT. Not catastrophic — these are integer ops, not memory traffic — but the result is deterministic for a given MLAS_CONV_PARAMETERS, so caching it on the parameters struct (or on BackendKernelSelectorConfig) would be cheap. Non-blocking; fine as a follow-up.

(The MlasConvRoute override itself also reads BackendKernelSelectorConfig->use_kleidiai before delegating to SelectConvRoute — which is correct, because the user-disable path should win without doing any heuristic arithmetic.)

Things to confirm in CI

The new session option key "mlas.kleidiai.conv_igemm_max_work" is parsed via TryParseStringWithClassicLocale<size_t> and the failure path is ORT_ENFORCE. That's session creation — failure throws cleanly, which is the right behavior. No unit test covering bad input (-1, "foo", "1e9"); a single negative test in test/framework/session_state_test.cc or similar would lock in the contract.
TSAN run on onnxruntime_mlas_test was reported clean — good. With three thread_local containers in play (pad_ptr, lhs_ptrs_cache_by_pad, lhs_buffer) and the new chunked write pattern, that was worth checking explicitly.
85/85 checks green, 4 commits, 349 / 155 lines.

Bottom line

Every blocking concern from the prior review is addressed, the perf justification is now grounded in PP-OCR numbers and an MLAS-internal anchor for the threshold, the dispatch coupling went through the right platform-override seam, and the overflow-helper triplication is converged. Ship it. Squash on merge (4 commits, the second and fourth are review-driven cleanup that don't need to survive history). The remaining SelectConvRoute double-evaluation is a follow-up, not a blocker.

hariharans29

I (and Copilot) left a few more comments. Also, some builds are failing - worth checking if they are to do with your change. Could you please take a look ?

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

martin-klacer-arm · 2026-06-11T18:11:23Z

Thank you for the comments! They are all clear and sensible. Good catch with the NHWC padding symmetry correctness comment.

I have a patch completed resolving the comments, I'll push it here once it clears internal review.

martin-klacer-arm · 2026-06-12T11:18:33Z

Hi @hariharans29, thank you for the review! I made changes to address the generated comments.

In addition, I found that the pipeline failures were indeed related to the patch, included in the latest commit is a fix moving from TryGetConfigEntry (not always supported) to GetConfigEntry (supported in all cases) in onnxruntime/core/providers/cpu/mlas_backend_kernel_selector_config_utils.h.

hariharans29 · 2026-06-12T17:06:13Z

Third pass — PR #28571

Verdict: still approve. Commit ebf3f4e ("Addressed review comments") plus the follow-up pipeline-failure fix are all positive changes. One real latent-bug fix slipped in alongside the requested edits, and the new config-option contract is now locked in by unit tests. Nothing blocking.

New since the last pass

1. NHWC padding gate is now strictly uniform — and that's a correctness fix

onnxruntime/core/mlas/lib/convolve.cpp (around line 1393-1398):

// The channels-last float convolution path is only implemented by the Arm® KleidiAI™
// SME conv override, which currently uses a single padding value in its LHS indirection table.
// Require uniform padding until separate H/W padding is supported there.
if (Padding[0] != Padding[2] ||
    Padding[1] != Padding[3] ||
    Padding[0] != Padding[1]) {
  return false;
}

The previous gate accepted any H-symmetric AND W-symmetric padding (e.g., {1, 2, 1, 2} was passing). Since ConvolveSme calls LhsPackImageDataSme(..., sh, sw, padding, ...) with a single scalar padding, the prior shape {1, 2, 1, 2} would silently compute wrong outputs (the kernel would use Padding[0]=1 for both H and W).

This isn't just a tightening for the new IGEMM route — MlasConvSupportsSymmetricChannelsLast2DFloatKernel is also the gate for the dense NHWC float path landed in #26834 / #28565. None of the existing tests use H≠W padding (all {1,1,1,1}), so the bug was never tripped. Real latent-bug fix. The matching pre-check inside SelectConvRoute (Padding[0] != Padding[1]) is consistent.

Worth a one-line note in the PR description acknowledging this is a behavioral change beyond the IGEMM regression scope — anyone bisecting an NHWC accuracy issue across this PR will appreciate it.

2. Enum and override renaming clarify scope

ConvRoute::{None, Igemm, GemmFallback} → ConvRoute::{NoKleidiAi, IGemm, SGemmFallback}. NoKleidiAi is now self-documenting ("decline the conv, caller runs unchanged"); my prior nit on None being ambiguous is resolved.
MLAS_CONV_ROUTE_OVERRIDE → MLAS_CONV_SGEMM_ROUTE_OVERRIDE with enum MLAS_CONV_SGEMM_ROUTE { MlasConvSGemmRouteDirect, MlasConvSGemmRouteDispatch }. The name now signals exactly what the hook controls — the choice between MlasSgemmOperation (direct) and MlasGemm (dispatched through backend overrides) inside MlasConvOperation. Not the whole conv route.
New helper MlasConvSGemmRoute in mlasi_kleidiai.h short-circuits on the user opt-out (use_kleidiai == false) before running any heuristic arithmetic. Correct order of operations.

3. Config-option parsing — contract is now tested

onnxruntime/test/providers/cpu/cpu_execution_provider_test.cc (around line 37-63) adds two tests:

MlasBackendKernelSelectorParsesKleidiAiConvIgemmMaxWork — positive path, checks 1234567 is parsed.
MlasBackendKernelSelectorRejectsInvalidKleidiAiConvIgemmMaxWork — negative path, verifies ORT_ENFORCE throws on "Not a Number" and that the error message mentions both the option key and the offending value.

Exactly the missing-test gap I flagged in the prior round. Note: the negative test catches OnnxRuntimeException, which assumes exceptions are enabled — fine, since gtest itself requires that. No need to add an MLAS_NO_EXCEPTION variant.

4. `TryGetConfigEntry` → `GetConfigEntry` build fix

onnxruntime/core/providers/cpu/mlas_backend_kernel_selector_config_utils.h:

if (auto conv_igemm_max_work =
        config_options.GetConfigEntry(kOrtSessionOptionsMlasKleidiAiConvIgemmMaxWork)) {
  ORT_ENFORCE(TryParseStringWithClassicLocale<size_t>(*conv_igemm_max_work,
                                                       config.kleidiai_conv_igemm_max_work),
              "Invalid value for ", kOrtSessionOptionsMlasKleidiAiConvIgemmMaxWork,
              ": ", *conv_igemm_max_work, ". Expected a non-negative integer.");
}

Author's note explains TryGetConfigEntry isn't available in all build configs (which is what was failing CI). GetConfigEntry returning std::optional<std::string> works in every flavor of ConfigOptions we ship. Right call.

5. `MlasMultiplyOverflowsSizeT` canonicalization

Moved into mlasi.h (around line 122-147) as MLAS_FORCEINLINE, with __has_builtin guard for __builtin_mul_overflow and a manual if (b != 0 && a > SIZE_MAX/b) fallback. The local mul_overflow_size_t_builtin is deleted from mlasi_kleidiai.h and all three call sites (sgemm_kleidiai.cpp, sbgemm_kleidiai.cpp, the new SelectConvRoute paths) migrated. One name, one definition — addresses my prior triplication concern.

Tiny detail: when the builtin reports overflow it still writes truncated low bits to *out, but every caller checks the return value first and only consumes *out on the false path. Correct.

Tiny nits — not blocking

Inside the inline MlasConvSGemmRoute (which itself lives in the ArmKleidiAI namespace), the body explicitly qualifies ArmKleidiAI::SelectConvRoute(...) and ArmKleidiAI::ConvRoute::SGemmFallback. The qualifiers are redundant; unqualified SelectConvRoute and ConvRoute::SGemmFallback would read cleaner. Cosmetic.
The previously-open SelectConvRoute double-evaluation (once in CheckCapabilitiesSme, once via MlasConvSGemmRoute) is still open. Still a minor follow-up, not a blocker — the arithmetic is integer-only.

Bottom line

Every concern from the prior two rounds is addressed; the latest commits add a latent NHWC correctness fix, lock in the new config-option contract with two tests, fix the pipeline regression from TryGetConfigEntry, and converge on canonical helper names. CI is still settling on the latest push (ebf3f4e) — once it goes green this is ready. Squash on merge (5 commits → 1 logical change).

hariharans29 · 2026-06-13T19:08:52Z

Can you please fix the conflict ?

martin-klacer-arm · 2026-06-16T14:15:53Z

Hi @Hariharan29, thanks for the comment! I rebased the branch which should resolve the conflict. In addition, I added a small change to fix a numeric issue related to NCHW/NHWC correctness that was flagged by failing tests. With the latest commits, all tests are passing locally.

previously, the convolution kernel for KLEIDIAI would allocate a large contiguous buffer for the LHS (left-hand-side) matrix packing, which could consume excessive memory and reduce cache efficiency. This patch modifies the packing strategy to use a chunked approach: - Introduce a compile-time upper bound for temporary LHS packing buffers - Allocate a moderate-sized temporary buffer once. - Pack LHS rows in chunks, perform computation, then reuse the buffer for the next chunk. Benefits: - Significantly reduces peak memory usage. - Improves cache utilization and overall computation efficiency. - Avoids potential memory allocation failures for large convolutions. Performance improvement: - Test with model https://huggingface.co/garavv/arcface-onnx on MTK D9500 Before this patch ``` ./build/RelWithDebInfo/onnxruntime_perf_test -x 1 -r 1000 arc.onnx Number of inferences per second: 4.25327 ``` After this patch ``` ./build/RelWithDebInfo/onnxruntime_perf_test -x 1 -r 1000 arc.onnx Number of inferences per second: 5.03257 ``` ---------------------------------------------------------------------------------------------------------------- sme_with_patch | sme_without_patch | Diff (μs) | Change % | Node Name ---------------------------------------------------------------------------------------------------------------- 11975.10 | 31235.30 | 19260.20 | 160.84% | StatefulPartitionedCall/ResNet34/conv2_block1_1_conv/Conv2D 4514.80 | 7691.70 | 3176.90 | 70.37% | StatefulPartitionedCall/ResNet34/conv2_block2_1_conv/Conv2D 4220.20 | 7120.70 | 2900.50 | 68.73% | StatefulPartitionedCall/ResNet34/conv2_block3_1_conv/Conv2D 5429.20 | 8279.60 | 2850.40 | 52.50% | StatefulPartitionedCall/ResNet34/conv3_block1_1_conv/Conv2D 4497.80 | 5478.40 | 980.60 | 21.80% | StatefulPartitionedCall/ResNet34/conv4_block1_1_conv/Conv2D 3474.30 | 4351.80 | 877.50 | 25.26% | StatefulPartitionedCall/ResNet34/conv3_block3_1_conv/Conv2D 3627.30 | 4504.00 | 876.70 | 24.17% | StatefulPartitionedCall/ResNet34/conv3_block4_1_conv/Conv2D 5244.20 | 5961.10 | 716.90 | 13.67% | StatefulPartitionedCall/ResNet34/conv1_conv/Conv2D 3439.80 | 4050.90 | 611.10 | 17.77% | StatefulPartitionedCall/ResNet34/conv3_block2_1_conv/Conv2D 9749.80 | 10195.50 | 445.70 | 4.57% | StatefulPartitionedCall/ResNet34/conv2_block2_2_conv/Conv2D 3814.00 | 4209.80 | 395.80 | 10.38% | StatefulPartitionedCall/ResNet34/conv5_block2_2_conv/Conv2D 2715.90 | 3034.70 | 318.80 | 11.74% | StatefulPartitionedCall/ResNet34/conv4_block6_1_conv/Conv2D 4089.10 | 4367.80 | 278.70 | 6.82% | StatefulPartitionedCall/ResNet34/conv5_block1_1_conv/Conv2D 2698.00 | 2959.50 | 261.50 | 9.69% | StatefulPartitionedCall/ResNet34/conv4_block5_1_conv/Conv2D 3869.20 | 4102.80 | 233.60 | 6.04% | StatefulPartitionedCall/ResNet34/conv5_block3_2_conv/Conv2D 2767.90 | 2966.80 | 198.90 | 7.19% | StatefulPartitionedCall/ResNet34/conv4_block4_1_conv/Conv2D 9652.10 | 9816.60 | 164.50 | 1.70% | StatefulPartitionedCall/ResNet34/conv2_block3_2_conv/Conv2D 2897.50 | 3054.60 | 157.10 | 5.42% | StatefulPartitionedCall/ResNet34/conv4_block3_1_conv/Conv2D 4601.20 | 4748.60 | 147.40 | 3.20% | StatefulPartitionedCall/ResNet34/conv5_block1_2_conv/Conv2D 3134.00 | 3246.10 | 112.10 | 3.58% | StatefulPartitionedCall/ResNet34/conv4_block2_1_conv/Conv2D Signed-off-by: Qxiang Xu <Qixiang.Xu@arm.com> Signed-off-by: Jonathan Clohessy <Jonathan.Clohessy@arm.com>

- Added SelectConvRoute function to mlasi_kleidiai.h to decide between GemmFallback and Igemm based on the convolution workload parameters - Updated CheckCapabilitiesSme function in convolve_kleidiai.cpp to use the new SelectConvRoute function Co-authored-by: Damien Dooley <damien.dooley@arm.com> Signed-off-by: Martin Klacer <martin.klacer@arm.com>

Signed-off-by: Martin Klacer <martin.klacer@arm.com>

- Move route selection behind the MLAS platform override hook so generic convolve.cpp doesn't depend directly on KleidiAI headers - Replace local overflow helper with MlasMultiplyOverflowsSizeT and use overflow-checked arithmetic for output/work estimates - Return effective kernel dimensions from route selection so capability checks don't recompute them - Move the LHS packed-size sanity check out of the parallel worker - Replace the per-tile heap allocation with a thread-local grow-only scratch buffer - Document the IGEMM workload heuristic and add a session config override for mlas.kleidiai.conv_igemm_max_work - Add comments for ConvRoute::None, single-output-channel routing and LHS chunk size constant Signed-off-by: Martin Klacer <martin.klacer@arm.com>

* Updated mlasi_kleidiai.h ConvRoute enum names and comments in line with reviewer suggestions * Updated mlasi.h MLAS_CONV_ROUTE name, values and comments in line with reviewer suggestions * Changed MlasConvSupportsSymmetricChannelsLast2DFloatKernel to accurately reflect Arm® KleidiAI™ symmetric padding requirement * Added valid and invalid tests for ConvIgemmMaxWork session option to cpu_execution_provider_test.cc Signed-off-by: Martin Klacer <martin.klacer@arm.com>

* Added check for NHWC layout in SelectConvRoute in onnxruntime/core/mlas/lib/kleidiai/mlasi_kleidiai.h * Fallback routes in SelectConvRoute currently assume NCHW layout, added check to ensure NHWC stays on IGEMM Signed-off-by: Martin Klacer <martin.klacer@arm.com>

Signed-off-by: Martin Klacer <martin.klacer@arm.com>

hariharans29 requested a review from Copilot May 19, 2026 16:07

Copilot started reviewing on behalf of hariharans29 May 19, 2026 16:08 View session

Copilot AI reviewed May 19, 2026

View reviewed changes

Comment thread onnxruntime/core/mlas/lib/kleidiai/mlasi_kleidiai.h Outdated

Comment thread onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp

Comment thread onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp Outdated

Comment thread onnxruntime/core/mlas/lib/convolve.cpp Outdated

hariharans29 changed the title ~~KleidiAI fix igemm regression~~ [MLAS] KleidiAI fix igemm regression May 19, 2026

hariharans29 approved these changes Jun 9, 2026

View reviewed changes

hariharans29 reviewed Jun 9, 2026

View reviewed changes

Comment thread onnxruntime/core/mlas/lib/kleidiai/mlasi_kleidiai.h Outdated

hariharans29 reviewed Jun 9, 2026

View reviewed changes

Comment thread onnxruntime/core/mlas/lib/mlasi.h Outdated

hariharans29 requested a review from Copilot June 9, 2026 19:22

Copilot started reviewing on behalf of hariharans29 June 9, 2026 19:22 View session

hariharans29 requested changes Jun 9, 2026

View reviewed changes

Copilot AI reviewed Jun 9, 2026

View reviewed changes

Comment thread onnxruntime/core/mlas/lib/kleidiai/mlasi_kleidiai.h

Comment thread onnxruntime/core/mlas/lib/convolve.cpp Outdated

Comment thread onnxruntime/core/providers/cpu/mlas_backend_kernel_selector_config_utils.h Outdated

hariharans29 previously approved these changes Jun 12, 2026

View reviewed changes

hariharans29 mentioned this pull request Jun 16, 2026

[WebGPU] Make max_num_pending_dispatches configurable #28894

Merged

martin-klacer-arm dismissed hariharans29’s stale review via 8571e8f June 16, 2026 14:12

martin-klacer-arm force-pushed the markla01_fix_igemm_regression branch from ebf3f4e to 8571e8f Compare June 16, 2026 14:12

hariharans29 reviewed Jun 16, 2026

View reviewed changes

Comment thread include/onnxruntime/core/session/onnxruntime_session_options_config_keys.h

hariharans29 added release:1.27.1 regression issues that demonstrate a regression in ORT functionality and need to be addressed immediately labels Jun 18, 2026

JonathanC-ARM and others added 7 commits June 19, 2026 15:50

Addressed upstream Copilot comments

e80a622

Signed-off-by: Martin Klacer <martin.klacer@arm.com>

Update IGEMM threshold tuning option documentation comment

75f0249

Signed-off-by: Martin Klacer <martin.klacer@arm.com>

martin-klacer-arm force-pushed the markla01_fix_igemm_regression branch from 8571e8f to 75f0249 Compare June 19, 2026 15:36

hariharans29 removed the release:1.27.1 label Jun 19, 2026

hariharans29 approved these changes Jun 19, 2026

View reviewed changes

hariharans29 enabled auto-merge (squash) June 19, 2026 18:22

hariharans29 merged commit bd0cb9a into microsoft:main Jun 19, 2026
86 checks passed

hariharans29 added the release:1.27.1 label Jun 19, 2026

Uh oh!

Conversation

martin-klacer-arm commented May 19, 2026

Description

Motivation and Context

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

martin-klacer-arm commented May 20, 2026

Uh oh!

hariharans29 commented May 20, 2026

Uh oh!

martin-klacer-arm commented May 21, 2026

Uh oh!

hariharans29 commented May 21, 2026

Uh oh!

martin-klacer-arm commented May 27, 2026

Uh oh!

hariharans29 commented May 27, 2026

Review — PR #28571 [MLAS] KleidiAI fix igemm regression

Substantive concerns

Nits

Things to verify before merging

Bottom line

Uh oh!

martin-klacer-arm commented May 28, 2026

Uh oh!

martin-klacer-arm commented Jun 9, 2026

Additional verification and notes

Re-running the regressing model

ConvIgemmMaxWork default value

MlasGemm vs MlasSgemmOperation per-thread segment (review point 6.)

Unit Tests run under TSAN

Uh oh!

hariharans29 commented Jun 9, 2026

Re-review — PR #28571 [MLAS] KleidiAI fix igemm regression

What got fixed since the last pass

Nits — all addressed

Still open — one minor item

Things to confirm in CI

Bottom line

Uh oh!

Uh oh!

Uh oh!

hariharans29 left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

martin-klacer-arm commented Jun 11, 2026

Uh oh!

martin-klacer-arm commented Jun 12, 2026

Uh oh!

hariharans29 commented Jun 12, 2026

Third pass — PR #28571

New since the last pass

1. NHWC padding gate is now strictly uniform — and that's a correctness fix

2. Enum and override renaming clarify scope

3. Config-option parsing — contract is now tested

4. TryGetConfigEntry → GetConfigEntry build fix

5. MlasMultiplyOverflowsSizeT canonicalization

Tiny nits — not blocking

Bottom line

Uh oh!

hariharans29 commented Jun 13, 2026

Uh oh!

martin-klacer-arm commented Jun 16, 2026

Review — PR #28571 `[MLAS] KleidiAI fix igemm regression`

`ConvIgemmMaxWork` default value

Re-review — PR #28571 `[MLAS] KleidiAI fix igemm regression`

hariharans29 left a comment •

edited

Loading

4. `TryGetConfigEntry` → `GetConfigEntry` build fix

5. `MlasMultiplyOverflowsSizeT` canonicalization