Skip to content

[MLAS] KleidiAI fix igemm regression#28571

Merged
hariharans29 merged 7 commits into
microsoft:mainfrom
martin-klacer-arm:markla01_fix_igemm_regression
Jun 19, 2026
Merged

[MLAS] KleidiAI fix igemm regression#28571
hariharans29 merged 7 commits into
microsoft:mainfrom
martin-klacer-arm:markla01_fix_igemm_regression

Conversation

@martin-klacer-arm

Copy link
Copy Markdown
Contributor

Description

This PR fixes a convolution performance regression affecting some OCR models with large-kernel convolutions when the KleidiAI SME IGEMM convolution path is selected.

The change has 2 parts:

  1. updates to the KleidiAI IGEMM LHS packing to pack rows in bounded chunks instead of packing the full LHS buffer up front, which reduces memory usage and improves cache locality for large convolutions,
  2. a new route selection function ArmKleidiAI::SelectConvRoute that decides between Igemm, GemmFallback and None based on convolution parameters and a workload size-based heuristic.

The function CheckCapabilitiesSme runs SelectConvRoute and only returns true if the selected route is Igemm. The patch also adds a standard GEMM fallback to the ConvRoute possibilities, and runs MlasGemm if said fallback is selected. If the function selects None, then the convolution falls back to MlasSgemmOperation.

Motivation and Context

Fixes #27633.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a CPU convolution performance regression on ARM64 when the KleidiAI SME IGEMM path is selected, by introducing a new convolution route-selection heuristic and changing the KleidiAI LHS packing strategy to reduce peak temporary memory and improve cache locality on large workloads.

Changes:

  • Added ArmKleidiAI::SelectConvRoute (with ConvRoute options Igemm, GemmFallback, None) to decide whether to run KleidiAI IGEMM, fall back to GEMM, or use the existing path.
  • Updated KleidiAI convolution to pack the IGEMM LHS in bounded chunks instead of packing the full LHS up front.
  • In the generic MLAS convolution implementation, routed certain workloads to MlasGemm (instead of MlasSgemmOperation) when SelectConvRoute chooses GemmFallback.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
onnxruntime/core/mlas/lib/kleidiai/mlasi_kleidiai.h Adds ConvRoute + SelectConvRoute heuristic helpers for choosing conv execution route.
onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp Reworks KleidiAI SME IGEMM LHS packing into chunked packing and integrates route selection into capability checks.
onnxruntime/core/mlas/lib/convolve.cpp Uses SelectConvRoute to optionally switch im2col+SGEMM slices to MlasGemm for the fallback route.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread onnxruntime/core/mlas/lib/kleidiai/mlasi_kleidiai.h Outdated
Comment thread onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp
Comment thread onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp Outdated
Comment thread onnxruntime/core/mlas/lib/convolve.cpp Outdated
@hariharans29 hariharans29 changed the title KleidiAI fix igemm regression [MLAS] KleidiAI fix igemm regression May 19, 2026
@martin-klacer-arm

Copy link
Copy Markdown
Contributor Author

@microsoft-github-policy-service agree company="Arm"

@hariharans29

Copy link
Copy Markdown
Member

Can you please address the Copliot comments ?

@martin-klacer-arm

Copy link
Copy Markdown
Contributor Author

Hi, thanks for the reminder. The comments Copilot left are sensible, I'm addressing them with a new patch that's currently work in progress. Once it's done and clears internal review, I'll add it to the PR.

@hariharans29

Copy link
Copy Markdown
Member

Hi, thanks for the reminder. The comments Copilot left are sensible, I'm addressing them with a new patch that's currently work in progress. Once it's done and clears internal review, I'll add it to the PR.

Thank you!

@martin-klacer-arm

Copy link
Copy Markdown
Contributor Author

Hi, I pushed a new patch addressing the Copilot comments, and left a reply with further details on one of the comments.

@hariharans29

Copy link
Copy Markdown
Member

Review — PR #28571 [MLAS] KleidiAI fix igemm regression

Verdict: good direction, accept after a few fixes. The two changes (chunked LHS packing + workload-based route selection) are the right shape for fixing #27633. There are some real concerns around the heuristic, the new per-tile allocation, and use of ORT_ENFORCE inside a parallel worker — but nothing structural.

Substantive concerns

1. ORT_ENFORCE inside MlasTrySimpleParallel. The inner loop in ConvolveSme now does:

ORT_ENFORCE(bytes_per_m_step != 0,
            "KleidiAI LHS packed size must be non-zero.");

inside the worker lambda. MlasTrySimpleParallel does not catch exceptions thrown from workers — in exception-disabled builds ORT_ENFORCE calls abort(), and in exception-enabled builds the throw will likely terminate the process from a worker thread. Also: bytes_per_m_step depends only on m_step, d_kh*d_kw, ci — none of which change between iterations. Compute it (and check it) once before entering the parallel region. CheckCapabilitiesSme already validates this same quantity for the Igemm route, so by the time we reach ConvolveSme it cannot be zero; the enforce here is dead weight. Hoist the call out of the lambda and either drop the check or convert to an assertion.

2. Per-tile std::make_unique<std::byte[]> allocation. Each parallel worker now does:

auto lhs = std::make_unique<std::byte[]>(lhs_buffer_bytes);

inside the MlasTrySimpleParallel lambda, so this is one heap allocation per (m_tile, n_tile). For large co/n_tile_step, this is a lot of churn that the old code avoided by allocating the full LHS once. A thread_local grow-only std::vector<std::byte> (mirroring the pad_ptr pattern just above) would eliminate it. Yes, that re-introduces the cross-call state pattern this codebase has been bitten by (see #28654), so document the invariant clearly — but the cost of per-tile allocation in a hot conv loop is not free.

3. MAX_LHS_CHUNK_BYTES = 2 MiB is a magic constant. It's commented as "cache-friendly," but there is no cache-size discovery and no per-CPU tuning — Apple M-series, Snapdragon X, AWS Graviton, and Ampere Altra all have very different L2 sizes (and SME implementations). At minimum:

  • Add a comment explaining why 2 MiB specifically (target L2? something else?).
  • Consider deriving from MlasGetPreferredBufferAlignment or a cpuinfo-derived size.
  • Add a runtime override (env var / session option) for tuning experiments before this is baked in.

Also: the chunk computation

size_t m_chunk = std::max<size_t>(m_step,
                                  MAX_LHS_CHUNK_BYTES / bytes_per_m_step * m_step);

silently truncates MAX_LHS_CHUNK_BYTES / bytes_per_m_step to zero when bytes_per_m_step > MAX_LHS_CHUNK_BYTES, then the std::max rescues it back to m_step. That works, but it means the "2 MiB cap" silently does nothing for any workload whose per-m_step packed size exceeds 2 MiB. Make this explicit with a comment, or restructure so the intent is obvious.

4. ConvIgemmMaxWork = 1'000'000 threshold. This is the heuristic boundary between Igemm and GemmFallback. There is no data in the PR description showing how this was chosen (which models, which CPUs, which kernel sizes). The linked issue #27633 specifically mentions Apple M-series and 9×9 OCR kernels; what does the curve look like at, say, output_m × effective_k × filter_count near the threshold? At minimum, please add a comment citing the measurement methodology, and ideally make this overridable.

The formula

const auto igemm_max_output_m = (ConvIgemmMaxWork / effective_k / filter_count);
if (output_m > igemm_max_output_m) return ConvRoute::GemmFallback;

is equivalent to output_m * effective_k * filter_count > 1'000'000 modulo integer-division rounding. The integer-division form silently lowers the effective threshold when effective_k * filter_count doesn't divide cleanly into 1M. Use SafeInt/mul_overflow_size_t_builtin and a single comparison:

size_t total_work;
if (mul_overflow_size_t_builtin(output_m, effective_k, &total_work) ||
    mul_overflow_size_t_builtin(total_work, filter_count, &total_work)) {
    return ConvRoute::GemmFallback;  // assume too large
}
if (total_work > ConvIgemmMaxWork) return ConvRoute::GemmFallback;

5. New cross-file coupling. convolve.cpp now #includes kleidiai/mlasi_kleidiai.h and calls ArmKleidiAI::SelectConvRoute directly:

#if defined(USE_KLEIDIAI)
#include "kleidiai/mlasi_kleidiai.h"
#endif
...
#if defined(USE_KLEIDIAI)
use_gemm_batch_override =
    ArmKleidiAI::SelectConvRoute(Parameters) == ArmKleidiAI::ConvRoute::GemmFallback;
#endif

This means the generic MLAS convolve path now has compile-time and call-site knowledge of one specific backend. The existing pattern in MLAS for this is the dispatch override / MlasPlatform selector. Wrap this behind a MLAS_CONV_ROUTE_OVERRIDE function pointer set up in platform.cpp, the same way every other backend hook works. That keeps convolve.cpp backend-agnostic and avoids more #if defined(USE_KLEIDIAI) scaffolding creeping in.

Also: SelectConvRoute is now called twice per Compute() — once in CheckCapabilitiesSme and once at the top of the convolve op (via the new flag). It does non-trivial overflow-checked arithmetic. Cache the result on MLAS_CONV_PARAMETERS (there is already a BackendKernelSelectorConfig pointer there) so it isn't recomputed.

6. MlasGemm vs MlasSgemmOperation substitution semantics. The change does:

if (use_gemm_batch_override) {
    MlasGemm(CblasNoTrans, CblasNoTrans, FilterCount, CountN, CountK, 1.0f,
             Filter + k, K, ColumnBuffer, CountN, beta,
             SegmentOutput, OutputSize, nullptr, Parameters->BackendKernelSelectorConfig);
} else {
    MlasSgemmOperation(...);
}

MlasGemm accepts a threadpool (passed as nullptr here) and may parallelize internally, while MlasSgemmOperation is a single-segment operation that the caller (MlasConvOperation) is already running inside its own segment loop. Confirm that calling MlasGemm(..., nullptr, ...) from inside the per-thread segment loop doesn't accidentally re-enter the threadpool or serialize what was previously parallelized. A quick perf check on the regression case from #27633 would also confirm the fallback actually wins where the heuristic claims it does.

7. mul_overflow_size_t_builtin forward declaration in a public-ish header. mlasi_kleidiai.h now has:

inline bool mul_overflow_size_t_builtin(size_t a, size_t b, size_t* out);

with the comment "implementation is at the bottom of the file." Forward-declaring an inline function whose definition lives in the same TU is fine, but per AGENTS.md there is already MlasMultiplyOverflowsSizeT in mlasi.h from PR #28416. Use that and delete the local helper (PR #28487 already introduces a third copy as MlasTryMultiplySizeT). Three names for the same operation is too many.

8. CheckCapabilitiesSme redundant work. The new code path:

const auto route = ArmKleidiAI::SelectConvRoute(Parameters);
if (route == ArmKleidiAI::ConvRoute::Igemm) {
    const size_t d_kh = ComputeKernelSize(...);
    const size_t d_kw = ComputeKernelSize(...);
    const size_t m_step = imatmul_conv.ukernel.get_m_step();
    const size_t bytes_per_m_step = kai_get_lhs_packed_size_...(m_step, d_kh*d_kw, ...);
    if (bytes_per_m_step == 0) { ... return false; }
    return true;
}

SelectConvRoute already computes effective_kernel_h/w. Returning the dilated kernel sizes from SelectConvRoute (or expanding to a struct return) would avoid recomputation. Minor.

Nits

  • mlasi_kleidiai.h: the #include <iostream> move into the KLEIDIAI_DEBUG_LOGGING || KLEIDIAI_KERNEL_LOGGING block is good. Drive-by cleanup, +1.
  • TryComputeConvOutputSize returns true with *result = 0 for the padded_input < kernel case and for stride == 0. The caller (SelectConvRoute) then checks output_m == 0 and returns ConvRoute::None. That works but is non-obvious — the function returns true on what is semantically failure. Consider returning false for these (or rename the function to reflect "compute output size, with zero meaning 'no valid output'").
  • SelectConvRoute rejects filter_count == 1 outright. Comment why (presumably because the SME imatmul has min-N requirements). Future readers will wonder.
  • The comment //RhsPackWeightsBiasSme() - to perform dilation increases kernel size and masks unused weights was deleted — fine, since it was just a note, but the new code still does the same dilation expansion; consider keeping the note near ComputeKernelSize calls.
  • MIdx/NIdxm_idx/n_idx rename is good; please apply the same to other CamelCase locals (ATile, BTile, CTile, TileSize*) for consistency in the touched block.
  • ConvRoute::None value ordering: None, Igemm, GemmFallback. Consider documenting the contract — most readers will expect None to mean "this PR's logic doesn't apply, defer to caller's default" rather than "fail." That's how convolve.cpp interprets it, but the enum name doesn't make it obvious.
  • inline constexpr size_t ConvIgemmMaxWork = 1000000ULL; — please use 1'000'000 for readability.

Things to verify before merging

  1. Re-run the regression model from [Performance] CPU Conv kernel regression for large kernels (9x9) on macOS ARM64 in ORT 1.24.x #27633 on both Apple M-series and Snapdragon X with the chunked packing; confirm the regression is gone and there is no slowdown on the cases that previously hit the IGEMM path cleanly.
  2. Confirm MlasGemm(..., nullptr, ...) from inside MlasConvOperation's segment loop doesn't cause re-entrant threadpool dispatch.
  3. Run existing MLAS conv unit tests under TSAN if available — the thread_local caches plus chunked packing change cross-thread access patterns.
  4. Confirm the bytes_per_m_step == 0 early-out in CheckCapabilitiesSme makes the ORT_ENFORCE in ConvolveSme provably unreachable, then drop the enforce.

Bottom line

Approve after:

  • Hoist or drop the ORT_ENFORCE from the worker lambda.
  • Replace per-tile make_unique with TLS scratch (or document why per-tile is acceptable).
  • Route SelectConvRoute through the existing dispatch-override pattern instead of #if defined(USE_KLEIDIAI) in convolve.cpp.
  • Document/justify the 2 MiB and 1M thresholds (or make them overridable).
  • Converge on one mul_overflow_size_t helper.

The chunked-packing idea and the route-selection split are both correct fixes for the reported regression. With the above tightening this should be safe to land.

@martin-klacer-arm

Copy link
Copy Markdown
Contributor Author

Thank you for the thorough review! The comments are sensible and understandable. I will update the PR as soon as we have a patch to address them finished and internally reviewed.

@martin-klacer-arm

Copy link
Copy Markdown
Contributor Author

Hi @hariharans29, thanks again for the thorough review. I've pushed an updated patch addressing the comments.

Summary of the changes with respect to the previous review comment points:

  1. Moved ORT_ENFORCE out of the MlasTrySimpleParallel inner loop and changed it to a sanity-check assert
  2. Replaced per-tile std::make_unique allocation with a thread-local grow-only reusable buffer
  3. Updated MAX_LHS_CHUNK_BYTES
  • Renamed MAX_LHS_CHUNK_BYTES and improved explanation comment for consistency and clarity
  • Made the m_step > MAX_LHS_CHUNK_BYTES case explicit, using m_step directly when over the byte limit
  1. Added session option override for ConvIgemmMaxWork
  2. Removed the new convolve.cpp cross-file coupling by moving route selection behind MLAS platform override hook
  3. No code changes made, MlasGemm(..., nullptr, ...) call correctness addressed in reply
  4. Removed duplicate mul_overflow_size_t_builtin helper, reused MlasMultiplyOverflowsSizeT instead
  5. Cached dilated kernel size calculation in SelectConvRoute to avoid redundant work

Nits:

  • Changed TryComputeConvOutputSize to return false when there's no valid output shape
  • Added a comment explaining why SelectConvRoute rejects filter_count == 1
  • Restored the kernel dilation explanation comment
  • Minor readability fixes

Additional verification and notes

Re-running the regressing model

I reran the regressing PP-OCR model from #27633 on a macOS Arm64 device based on the original report.

Using onnxruntime_perf_test, the patch gives an average total inference time of about 1,057 ms. This is a substantial improvement over the regressing 1.24.3 result of about 2,837 ms, also improving on the previously tested version 1.21.1 with a result of about 1,431 ms.

I didn't include a Snapdragon X run because the original issue reported the regression on macOS Arm64.

ConvIgemmMaxWork default value

The default ConvIgemmMaxWork threshold of 1,000,000 is intended as a conservative heuristic anchored to the existing MLAS SGEMM threading scale.

The source file onnxruntime/core/mlas/lib/mlasi.h defines the constants MLAS_SGEMM_THREAD_COMPLEXITY to be 64 * 1024 and MLAS_MAXIMUM_THREAD_COUNT to be 16, the existing MLAS SGEMM threading heuristic uses MLAS_SGEMM_THREAD_COMPLEXITY * MLAS_MAXIMUM_THREAD_COUNT as the scale for reaching the maximum thread budget, which puts the boundary just over 1e6.

Below this value, the KleidiAI SME IGEMM path may be competitive by avoiding extra im2col work. Above the value, since MLAS can use the maximum available thread count, the SME IGEMM path's additional indirect/LHS packing overhead cost becomes more difficult to justify. As such, for Conv nodes above this size, the safer default is to route to the SGEMM-backed fallback.

I also swept ConvIgemmMaxWork from 0 to 1e9 on the regressing PP-OCR det model, plus mobilenetv1_ssd_f32 and retinaface_f32. The smaller model sweeps are affected by normal run-to-run noise, and while they don't show 1e6 as the clearly optimal threshold, they also don't clearly point to a better default threshold. The reported PP-OCR regression case strongly supports falling back for the large convolution nodes involved in the reported regression.

Given that, I kept 1e6 as a conservative default. It is tied to an existing MLAS work scale, fixes the reported large-convolution regression, and avoids making a broader design change to the IGEMM path, which would be out of scope for this regression-fix PR. The new session option leaves the threshold overridable for follow-up tuning.

For reference, the PP-OCR, mobilenet and retinaface sweep data is shown in the following graphs:
mobilenetv1_log_heuristic
OCR_log_heuristic
retinaface_log_heuristic

MlasGemm vs MlasSgemmOperation per-thread segment (review point 6.)

The MlasGemm(..., nullptr, ...) call from MlasConvOperation under use_gemm_batch_override should not cause re-entrant threadpool dispatch.

In this call path, MlasGemm reaches MlasGemmBatch, which uses MlasTrySimpleParallel. With a nullptr threadpool, the internally used TrySimpleParallelFor collapses to a single-threaded loop, so passing nullptr here keeps the inner MlasGemm execution single-threaded.

Unit Tests run under TSAN

I built onnxruntime_mlas_test with -fsanitize=thread and ran existing MLAS Conv unit tests. The tests completed without any TSAN issues.

@hariharans29

Copy link
Copy Markdown
Member

Re-review — PR #28571 [MLAS] KleidiAI fix igemm regression

Verdict: approve. Commit 3d7acb4 ("Addressed maintainer review feedback") plus the inline replies cover every one of the eight substantive concerns from the prior round, and the author backed up the heuristic with actual measurement data. One minor non-blocker remains, called out below.

What got fixed since the last pass

1. ORT_ENFORCE in the parallel worker — gone. The bytes_per_m_step computation is now hoisted above MlasTrySimpleParallel and reduced to assert(bytes_per_m_step != 0); — which is correct, since CheckCapabilitiesSme already rejects the zero-size case for the Igemm route. Clean.

2. Per-tile std::make_unique — replaced with TLS scratch. The pattern now mirrors the pad_ptr thread-local cache:

// Thread-local grow-only reusable buffer for LHS packing
thread_local std::vector<std::byte> lhs_buffer;
if (lhs_buffer.size() < lhs_buffer_bytes) {
    lhs_buffer.resize(lhs_buffer_bytes);
}
auto* lhs = lhs_buffer.data();

No per-tile heap churn, and the invariant (single-threaded inside the lambda, grow-only) is the same one already accepted for pad_ptr and lhs_ptrs_cache_by_pad. The lifecycle is fine because the buffer is only read/written within the lambda body — no cross-call state escapes.

3. The 2 MiB chunk constant — documented and renamed. kMaxLhsPackedChunkBytes now has a four-line comment explaining the budget choice (common L2 size) and the per-m_step fall-through case, and the silent-truncation arithmetic is now explicit:

const size_t max_m_step_chunks = kMaxLhsPackedChunkBytes / bytes_per_m_step;
// If bytes_per_m_step exceeds kMaxLhsPackedChunkBytes, process m_step at a time
size_t m_chunk = max_m_step_chunks == 0 ? m_step : m_step * max_m_step_chunks;

Intent is now readable. The cpuinfo-derived tuning ask was correctly deferred — it's a separate concern from this regression fix.

4. ConvIgemmMaxWork — overflow-safe, overridable, and justified. Three good fixes here:

  • The formula is now the explicit output_m * effective_k * filter_count > limit form using MlasMultiplyOverflowsSizeT, with overflow correctly treated as "too large → fall back":
    size_t total_work;
    if (MlasMultiplyOverflowsSizeT(output_m, effective_k, &total_work) ||
        MlasMultiplyOverflowsSizeT(total_work, filter_count, &total_work)) {
        return ConvRouteSelection{ConvRoute::GemmFallback, ...};
    }
  • New session option kOrtSessionOptionsMlasKleidiAiConvIgemmMaxWork ("mlas.kleidiai.conv_igemm_max_work") with the docstring "0" or unset uses the MLAS default heuristic. Lets us tune this per deployment without a rebuild.
  • The 1'000'000ULL default is now justified in the PR thread: anchored to MLAS_SGEMM_THREAD_COMPLEXITY * MLAS_MAXIMUM_THREAD_COUNT == 64*1024 * 16 ≈ 1.05M, i.e., the existing point at which MLAS SGEMM already saturates its thread budget. That's a defensible anchor, not a freshly invented magic number.

The PP-OCR re-run is the part I most wanted to see: 2837 ms → 1057 ms on macOS Arm64, also beating the pre-regression 1.21.1 baseline of 1431 ms. That's a real fix, not a heuristic that papers over the symptom. The threshold sweeps on mobilenet/retinaface didn't show a sharper optimum, which is consistent with "1M is where SGEMM-with-threads catches up to single-thread IGEMM-with-im2col."

5. Cross-file coupling — removed via the dispatch hook. This is exactly the pattern I asked for:

// mlasi.h
enum MLAS_CONV_ROUTE { MlasConvRouteDefault, MlasConvRouteGemmFallback };
typedef MLAS_CONV_ROUTE (MLASCALL MLAS_CONV_ROUTE_OVERRIDE)(const MLAS_CONV_PARAMETERS*);
struct MLAS_PLATFORM {
    ...
    MLAS_CONV_ROUTE_OVERRIDE* MlasConvRouteOverride = nullptr;
};

// platform.cpp
this->MlasConvRouteOverride = ArmKleidiAI::MlasConvRoute;

// convolve.cpp
bool use_gemm_batch_override =
    GetMlasPlatform().MlasConvRouteOverride != nullptr &&
    GetMlasPlatform().MlasConvRouteOverride(Parameters) == MlasConvRouteGemmFallback;

No #if defined(USE_KLEIDIAI) scaffolding in convolve.cpp, the generic path stays backend-agnostic, and non-KleidiAI builds get the unchanged old behavior (nullptr override → MlasSgemmOperation).

6. MlasGemm(..., nullptr, ...) re-entrancy. Answered in the thread, and the answer is right: MlasGemmBatch with a null threadpool goes through MlasTrySimpleParallel → TrySimpleParallelFor, which with nullptr collapses to a serial loop. No re-entrant dispatch, no nested parallelism. Acceptable.

7. Duplicate overflow helper — converged. mul_overflow_size_t_builtin is deleted; the canonical MlasMultiplyOverflowsSizeT is hoisted into mlasi.h and the three call sites (sgemm_kleidiai.cpp, sbgemm_kleidiai.cpp, mlasi_kleidiai.h) all use it. One name, one definition. Good.

8. CheckCapabilitiesSme redundant kernel-size compute — cached. SelectConvRoute now returns a ConvRouteSelection { route, effective_kernel_h, effective_kernel_w }, and the caller in CheckCapabilitiesSme reads the dilated sizes off the result instead of recomputing.

Nits — all addressed

  • TryComputeConvOutputSize now returns false on degenerate shapes instead of true with *result = 0.
  • The filter_count == 1 rejection has a comment explaining the SME N-tile fill rationale.
  • Dilation-expansion comment restored above the ComputeKernelSize calls in ConvolveSme.
  • 1'000'000ULL uses digit separators.
  • CamelCase locals (ATile/BTile/CTile/MIdx/NIdx/TileSizeM/TileSizeN) all converted to snake_case in the touched block.
  • #include <iostream> correctly moved under the KLEIDIAI_*_LOGGING guard.

Still open — one minor item

SelectConvRoute is still called twice per Compute() when the convolution would route to GemmFallback:

  1. Once inside CheckCapabilitiesSme (returns false, so the KleidiAI conv override declines).
  2. Once inside convolve.cpp via GetMlasPlatform().MlasConvRouteOverride(Parameters) (returns MlasConvRouteGemmFallback, so MlasGemm is used).

The arithmetic is overflow-checked and non-trivial: two TryComputeDilatedKernelSize, two TryComputeConvOutputSize, three MlasMultiplyOverflowsSizeT. Not catastrophic — these are integer ops, not memory traffic — but the result is deterministic for a given MLAS_CONV_PARAMETERS, so caching it on the parameters struct (or on BackendKernelSelectorConfig) would be cheap. Non-blocking; fine as a follow-up.

(The MlasConvRoute override itself also reads BackendKernelSelectorConfig->use_kleidiai before delegating to SelectConvRoute — which is correct, because the user-disable path should win without doing any heuristic arithmetic.)

Things to confirm in CI

  • The new session option key "mlas.kleidiai.conv_igemm_max_work" is parsed via TryParseStringWithClassicLocale<size_t> and the failure path is ORT_ENFORCE. That's session creation — failure throws cleanly, which is the right behavior. No unit test covering bad input (-1, "foo", "1e9"); a single negative test in test/framework/session_state_test.cc or similar would lock in the contract.
  • TSAN run on onnxruntime_mlas_test was reported clean — good. With three thread_local containers in play (pad_ptr, lhs_ptrs_cache_by_pad, lhs_buffer) and the new chunked write pattern, that was worth checking explicitly.
  • 85/85 checks green, 4 commits, 349 / 155 lines.

Bottom line

Every blocking concern from the prior review is addressed, the perf justification is now grounded in PP-OCR numbers and an MLAS-internal anchor for the threshold, the dispatch coupling went through the right platform-override seam, and the overflow-helper triplication is converged. Ship it. Squash on merge (4 commits, the second and fourth are review-driven cleanup that don't need to survive history). The remaining SelectConvRoute double-evaluation is a follow-up, not a blocker.

Comment thread onnxruntime/core/mlas/lib/kleidiai/mlasi_kleidiai.h Outdated
Comment thread onnxruntime/core/mlas/lib/mlasi.h Outdated

@hariharans29 hariharans29 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I (and Copilot) left a few more comments. Also, some builds are failing - worth checking if they are to do with your change. Could you please take a look ?

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Comment thread onnxruntime/core/mlas/lib/kleidiai/mlasi_kleidiai.h
Comment thread onnxruntime/core/mlas/lib/convolve.cpp Outdated
Comment thread onnxruntime/core/providers/cpu/mlas_backend_kernel_selector_config_utils.h Outdated
@martin-klacer-arm

Copy link
Copy Markdown
Contributor Author

Thank you for the comments! They are all clear and sensible. Good catch with the NHWC padding symmetry correctness comment.

I have a patch completed resolving the comments, I'll push it here once it clears internal review.

@martin-klacer-arm

Copy link
Copy Markdown
Contributor Author

Hi @hariharans29, thank you for the review! I made changes to address the generated comments.

In addition, I found that the pipeline failures were indeed related to the patch, included in the latest commit is a fix moving from TryGetConfigEntry (not always supported) to GetConfigEntry (supported in all cases) in onnxruntime/core/providers/cpu/mlas_backend_kernel_selector_config_utils.h.

@hariharans29

Copy link
Copy Markdown
Member

Third pass — PR #28571

Verdict: still approve. Commit ebf3f4e ("Addressed review comments") plus the follow-up pipeline-failure fix are all positive changes. One real latent-bug fix slipped in alongside the requested edits, and the new config-option contract is now locked in by unit tests. Nothing blocking.


New since the last pass

1. NHWC padding gate is now strictly uniform — and that's a correctness fix

onnxruntime/core/mlas/lib/convolve.cpp (around line 1393-1398):

// The channels-last float convolution path is only implemented by the Arm® KleidiAI™
// SME conv override, which currently uses a single padding value in its LHS indirection table.
// Require uniform padding until separate H/W padding is supported there.
if (Padding[0] != Padding[2] ||
    Padding[1] != Padding[3] ||
    Padding[0] != Padding[1]) {
  return false;
}

The previous gate accepted any H-symmetric AND W-symmetric padding (e.g., {1, 2, 1, 2} was passing). Since ConvolveSme calls LhsPackImageDataSme(..., sh, sw, padding, ...) with a single scalar padding, the prior shape {1, 2, 1, 2} would silently compute wrong outputs (the kernel would use Padding[0]=1 for both H and W).

This isn't just a tightening for the new IGEMM route — MlasConvSupportsSymmetricChannelsLast2DFloatKernel is also the gate for the dense NHWC float path landed in #26834 / #28565. None of the existing tests use H≠W padding (all {1,1,1,1}), so the bug was never tripped. Real latent-bug fix. The matching pre-check inside SelectConvRoute (Padding[0] != Padding[1]) is consistent.

Worth a one-line note in the PR description acknowledging this is a behavioral change beyond the IGEMM regression scope — anyone bisecting an NHWC accuracy issue across this PR will appreciate it.

2. Enum and override renaming clarify scope

  • ConvRoute::{None, Igemm, GemmFallback}ConvRoute::{NoKleidiAi, IGemm, SGemmFallback}. NoKleidiAi is now self-documenting ("decline the conv, caller runs unchanged"); my prior nit on None being ambiguous is resolved.
  • MLAS_CONV_ROUTE_OVERRIDEMLAS_CONV_SGEMM_ROUTE_OVERRIDE with enum MLAS_CONV_SGEMM_ROUTE { MlasConvSGemmRouteDirect, MlasConvSGemmRouteDispatch }. The name now signals exactly what the hook controls — the choice between MlasSgemmOperation (direct) and MlasGemm (dispatched through backend overrides) inside MlasConvOperation. Not the whole conv route.
  • New helper MlasConvSGemmRoute in mlasi_kleidiai.h short-circuits on the user opt-out (use_kleidiai == false) before running any heuristic arithmetic. Correct order of operations.

3. Config-option parsing — contract is now tested

onnxruntime/test/providers/cpu/cpu_execution_provider_test.cc (around line 37-63) adds two tests:

  • MlasBackendKernelSelectorParsesKleidiAiConvIgemmMaxWork — positive path, checks 1234567 is parsed.
  • MlasBackendKernelSelectorRejectsInvalidKleidiAiConvIgemmMaxWork — negative path, verifies ORT_ENFORCE throws on "Not a Number" and that the error message mentions both the option key and the offending value.

Exactly the missing-test gap I flagged in the prior round. Note: the negative test catches OnnxRuntimeException, which assumes exceptions are enabled — fine, since gtest itself requires that. No need to add an MLAS_NO_EXCEPTION variant.

4. TryGetConfigEntryGetConfigEntry build fix

onnxruntime/core/providers/cpu/mlas_backend_kernel_selector_config_utils.h:

if (auto conv_igemm_max_work =
        config_options.GetConfigEntry(kOrtSessionOptionsMlasKleidiAiConvIgemmMaxWork)) {
  ORT_ENFORCE(TryParseStringWithClassicLocale<size_t>(*conv_igemm_max_work,
                                                       config.kleidiai_conv_igemm_max_work),
              "Invalid value for ", kOrtSessionOptionsMlasKleidiAiConvIgemmMaxWork,
              ": ", *conv_igemm_max_work, ". Expected a non-negative integer.");
}

Author's note explains TryGetConfigEntry isn't available in all build configs (which is what was failing CI). GetConfigEntry returning std::optional<std::string> works in every flavor of ConfigOptions we ship. Right call.

5. MlasMultiplyOverflowsSizeT canonicalization

Moved into mlasi.h (around line 122-147) as MLAS_FORCEINLINE, with __has_builtin guard for __builtin_mul_overflow and a manual if (b != 0 && a > SIZE_MAX/b) fallback. The local mul_overflow_size_t_builtin is deleted from mlasi_kleidiai.h and all three call sites (sgemm_kleidiai.cpp, sbgemm_kleidiai.cpp, the new SelectConvRoute paths) migrated. One name, one definition — addresses my prior triplication concern.

Tiny detail: when the builtin reports overflow it still writes truncated low bits to *out, but every caller checks the return value first and only consumes *out on the false path. Correct.


Tiny nits — not blocking

  • Inside the inline MlasConvSGemmRoute (which itself lives in the ArmKleidiAI namespace), the body explicitly qualifies ArmKleidiAI::SelectConvRoute(...) and ArmKleidiAI::ConvRoute::SGemmFallback. The qualifiers are redundant; unqualified SelectConvRoute and ConvRoute::SGemmFallback would read cleaner. Cosmetic.
  • The previously-open SelectConvRoute double-evaluation (once in CheckCapabilitiesSme, once via MlasConvSGemmRoute) is still open. Still a minor follow-up, not a blocker — the arithmetic is integer-only.

Bottom line

Every concern from the prior two rounds is addressed; the latest commits add a latent NHWC correctness fix, lock in the new config-option contract with two tests, fix the pipeline regression from TryGetConfigEntry, and converge on canonical helper names. CI is still settling on the latest push (ebf3f4e) — once it goes green this is ready. Squash on merge (5 commits → 1 logical change).

hariharans29
hariharans29 previously approved these changes Jun 12, 2026
@hariharans29

Copy link
Copy Markdown
Member

Can you please fix the conflict ?

@martin-klacer-arm

Copy link
Copy Markdown
Contributor Author

Hi @Hariharan29, thanks for the comment! I rebased the branch which should resolve the conflict. In addition, I added a small change to fix a numeric issue related to NCHW/NHWC correctness that was flagged by failing tests. With the latest commits, all tests are passing locally.

@hariharans29 hariharans29 added release:1.27.1 regression issues that demonstrate a regression in ORT functionality and need to be addressed immediately labels Jun 18, 2026
JonathanC-ARM and others added 7 commits June 19, 2026 15:50
previously, the convolution kernel for KLEIDIAI would allocate a large contiguous buffer
for the LHS (left-hand-side) matrix packing, which could consume excessive memory
and reduce cache efficiency.

This patch modifies the packing strategy to use a chunked approach:
- Introduce a compile-time upper bound for temporary LHS packing buffers
- Allocate a moderate-sized temporary buffer once.
- Pack LHS rows in chunks, perform computation, then reuse the buffer for the next chunk.

Benefits:
- Significantly reduces peak memory usage.
- Improves cache utilization and overall computation efficiency.
- Avoids potential memory allocation failures for large convolutions.

Performance improvement:
- Test with model https://huggingface.co/garavv/arcface-onnx on MTK D9500

    Before this patch
    ```
    ./build/RelWithDebInfo/onnxruntime_perf_test -x 1  -r 1000 arc.onnx

    Number of inferences per second: 4.25327
    ```

    After this patch
    ```
    ./build/RelWithDebInfo/onnxruntime_perf_test -x 1  -r 1000 arc.onnx

    Number of inferences per second: 5.03257
    ```

----------------------------------------------------------------------------------------------------------------
sme_with_patch | sme_without_patch  | Diff (μs) | Change % | Node Name
----------------------------------------------------------------------------------------------------------------
  11975.10 |    31235.30 |     19260.20 |    160.84% | StatefulPartitionedCall/ResNet34/conv2_block1_1_conv/Conv2D
   4514.80 |     7691.70 |      3176.90 |     70.37% | StatefulPartitionedCall/ResNet34/conv2_block2_1_conv/Conv2D
   4220.20 |     7120.70 |      2900.50 |     68.73% | StatefulPartitionedCall/ResNet34/conv2_block3_1_conv/Conv2D
   5429.20 |     8279.60 |      2850.40 |     52.50% | StatefulPartitionedCall/ResNet34/conv3_block1_1_conv/Conv2D
   4497.80 |     5478.40 |       980.60 |     21.80% | StatefulPartitionedCall/ResNet34/conv4_block1_1_conv/Conv2D
   3474.30 |     4351.80 |       877.50 |     25.26% | StatefulPartitionedCall/ResNet34/conv3_block3_1_conv/Conv2D
   3627.30 |     4504.00 |       876.70 |     24.17% | StatefulPartitionedCall/ResNet34/conv3_block4_1_conv/Conv2D
   5244.20 |     5961.10 |       716.90 |     13.67% | StatefulPartitionedCall/ResNet34/conv1_conv/Conv2D
   3439.80 |     4050.90 |       611.10 |     17.77% | StatefulPartitionedCall/ResNet34/conv3_block2_1_conv/Conv2D
   9749.80 |    10195.50 |       445.70 |      4.57% | StatefulPartitionedCall/ResNet34/conv2_block2_2_conv/Conv2D
   3814.00 |     4209.80 |       395.80 |     10.38% | StatefulPartitionedCall/ResNet34/conv5_block2_2_conv/Conv2D
   2715.90 |     3034.70 |       318.80 |     11.74% | StatefulPartitionedCall/ResNet34/conv4_block6_1_conv/Conv2D
   4089.10 |     4367.80 |       278.70 |      6.82% | StatefulPartitionedCall/ResNet34/conv5_block1_1_conv/Conv2D
   2698.00 |     2959.50 |       261.50 |      9.69% | StatefulPartitionedCall/ResNet34/conv4_block5_1_conv/Conv2D
   3869.20 |     4102.80 |       233.60 |      6.04% | StatefulPartitionedCall/ResNet34/conv5_block3_2_conv/Conv2D
   2767.90 |     2966.80 |       198.90 |      7.19% | StatefulPartitionedCall/ResNet34/conv4_block4_1_conv/Conv2D
   9652.10 |     9816.60 |       164.50 |      1.70% | StatefulPartitionedCall/ResNet34/conv2_block3_2_conv/Conv2D
   2897.50 |     3054.60 |       157.10 |      5.42% | StatefulPartitionedCall/ResNet34/conv4_block3_1_conv/Conv2D
   4601.20 |     4748.60 |       147.40 |      3.20% | StatefulPartitionedCall/ResNet34/conv5_block1_2_conv/Conv2D
   3134.00 |     3246.10 |       112.10 |      3.58% | StatefulPartitionedCall/ResNet34/conv4_block2_1_conv/Conv2D

Signed-off-by: Qxiang Xu <Qixiang.Xu@arm.com>
Signed-off-by: Jonathan Clohessy <Jonathan.Clohessy@arm.com>
 - Added SelectConvRoute function to mlasi_kleidiai.h to decide between
   GemmFallback and Igemm based on the convolution workload parameters
 - Updated CheckCapabilitiesSme function in convolve_kleidiai.cpp
   to use the new SelectConvRoute function

Co-authored-by: Damien Dooley <damien.dooley@arm.com>
Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Signed-off-by: Martin Klacer <martin.klacer@arm.com>
- Move route selection behind the MLAS platform override hook so generic
  convolve.cpp doesn't depend directly on KleidiAI headers
- Replace local overflow helper with MlasMultiplyOverflowsSizeT and use
  overflow-checked arithmetic for output/work estimates
- Return effective kernel dimensions from route selection so capability
  checks don't recompute them
- Move the LHS packed-size sanity check out of the parallel worker
- Replace the per-tile heap allocation with a thread-local grow-only
  scratch buffer
- Document the IGEMM workload heuristic and add a session config
  override for mlas.kleidiai.conv_igemm_max_work
- Add comments for ConvRoute::None, single-output-channel routing
  and LHS chunk size constant

Signed-off-by: Martin Klacer <martin.klacer@arm.com>
 * Updated mlasi_kleidiai.h ConvRoute enum names and comments
   in line with reviewer suggestions
 * Updated mlasi.h MLAS_CONV_ROUTE name, values and comments
   in line with reviewer suggestions
 * Changed MlasConvSupportsSymmetricChannelsLast2DFloatKernel
   to accurately reflect Arm® KleidiAI™ symmetric padding requirement
 * Added valid and invalid tests for ConvIgemmMaxWork session
   option to cpu_execution_provider_test.cc

Signed-off-by: Martin Klacer <martin.klacer@arm.com>
 * Added check for NHWC layout in SelectConvRoute in onnxruntime/core/mlas/lib/kleidiai/mlasi_kleidiai.h
 * Fallback routes in SelectConvRoute currently assume NCHW layout, added check to ensure NHWC stays on IGEMM

Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Signed-off-by: Martin Klacer <martin.klacer@arm.com>
@hariharans29 hariharans29 enabled auto-merge (squash) June 19, 2026 18:22
@hariharans29 hariharans29 merged commit bd0cb9a into microsoft:main Jun 19, 2026
86 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

regression issues that demonstrate a regression in ORT functionality and need to be addressed immediately release:1.27.1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Performance] CPU Conv kernel regression for large kernels (9x9) on macOS ARM64 in ORT 1.24.x

4 participants