[MLAS] KleidiAI fix igemm regression#28571
Conversation
There was a problem hiding this comment.
Pull request overview
This PR addresses a CPU convolution performance regression on ARM64 when the KleidiAI SME IGEMM path is selected, by introducing a new convolution route-selection heuristic and changing the KleidiAI LHS packing strategy to reduce peak temporary memory and improve cache locality on large workloads.
Changes:
- Added
ArmKleidiAI::SelectConvRoute(withConvRouteoptionsIgemm,GemmFallback,None) to decide whether to run KleidiAI IGEMM, fall back to GEMM, or use the existing path. - Updated KleidiAI convolution to pack the IGEMM LHS in bounded chunks instead of packing the full LHS up front.
- In the generic MLAS convolution implementation, routed certain workloads to
MlasGemm(instead ofMlasSgemmOperation) whenSelectConvRoutechoosesGemmFallback.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| onnxruntime/core/mlas/lib/kleidiai/mlasi_kleidiai.h | Adds ConvRoute + SelectConvRoute heuristic helpers for choosing conv execution route. |
| onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp | Reworks KleidiAI SME IGEMM LHS packing into chunked packing and integrates route selection into capability checks. |
| onnxruntime/core/mlas/lib/convolve.cpp | Uses SelectConvRoute to optionally switch im2col+SGEMM slices to MlasGemm for the fallback route. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@microsoft-github-policy-service agree company="Arm" |
|
Can you please address the Copliot comments ? |
|
Hi, thanks for the reminder. The comments Copilot left are sensible, I'm addressing them with a new patch that's currently work in progress. Once it's done and clears internal review, I'll add it to the PR. |
Thank you! |
|
Hi, I pushed a new patch addressing the Copilot comments, and left a reply with further details on one of the comments. |
Review — PR #28571
|
|
Thank you for the thorough review! The comments are sensible and understandable. I will update the PR as soon as we have a patch to address them finished and internally reviewed. |
|
Hi @hariharans29, thanks again for the thorough review. I've pushed an updated patch addressing the comments. Summary of the changes with respect to the previous review comment points:
Nits:
Additional verification and notesRe-running the regressing modelI reran the regressing PP-OCR model from #27633 on a macOS Arm64 device based on the original report. Using I didn't include a Snapdragon X run because the original issue reported the regression on macOS Arm64.
|
Re-review — PR #28571
|
|
Thank you for the comments! They are all clear and sensible. Good catch with the NHWC padding symmetry correctness comment. I have a patch completed resolving the comments, I'll push it here once it clears internal review. |
|
Hi @hariharans29, thank you for the review! I made changes to address the generated comments. In addition, I found that the pipeline failures were indeed related to the patch, included in the latest commit is a fix moving from |
Third pass — PR #28571Verdict: still approve. Commit New since the last pass1. NHWC padding gate is now strictly uniform — and that's a correctness fix
// The channels-last float convolution path is only implemented by the Arm® KleidiAI™
// SME conv override, which currently uses a single padding value in its LHS indirection table.
// Require uniform padding until separate H/W padding is supported there.
if (Padding[0] != Padding[2] ||
Padding[1] != Padding[3] ||
Padding[0] != Padding[1]) {
return false;
}The previous gate accepted any H-symmetric AND W-symmetric padding (e.g., This isn't just a tightening for the new IGEMM route — Worth a one-line note in the PR description acknowledging this is a behavioral change beyond the IGEMM regression scope — anyone bisecting an NHWC accuracy issue across this PR will appreciate it. 2. Enum and override renaming clarify scope
3. Config-option parsing — contract is now tested
Exactly the missing-test gap I flagged in the prior round. Note: the negative test catches 4.
|
|
Can you please fix the conflict ? |
ebf3f4e to
8571e8f
Compare
|
Hi @Hariharan29, thanks for the comment! I rebased the branch which should resolve the conflict. In addition, I added a small change to fix a numeric issue related to NCHW/NHWC correctness that was flagged by failing tests. With the latest commits, all tests are passing locally. |
previously, the convolution kernel for KLEIDIAI would allocate a large contiguous buffer for the LHS (left-hand-side) matrix packing, which could consume excessive memory and reduce cache efficiency. This patch modifies the packing strategy to use a chunked approach: - Introduce a compile-time upper bound for temporary LHS packing buffers - Allocate a moderate-sized temporary buffer once. - Pack LHS rows in chunks, perform computation, then reuse the buffer for the next chunk. Benefits: - Significantly reduces peak memory usage. - Improves cache utilization and overall computation efficiency. - Avoids potential memory allocation failures for large convolutions. Performance improvement: - Test with model https://huggingface.co/garavv/arcface-onnx on MTK D9500 Before this patch ``` ./build/RelWithDebInfo/onnxruntime_perf_test -x 1 -r 1000 arc.onnx Number of inferences per second: 4.25327 ``` After this patch ``` ./build/RelWithDebInfo/onnxruntime_perf_test -x 1 -r 1000 arc.onnx Number of inferences per second: 5.03257 ``` ---------------------------------------------------------------------------------------------------------------- sme_with_patch | sme_without_patch | Diff (μs) | Change % | Node Name ---------------------------------------------------------------------------------------------------------------- 11975.10 | 31235.30 | 19260.20 | 160.84% | StatefulPartitionedCall/ResNet34/conv2_block1_1_conv/Conv2D 4514.80 | 7691.70 | 3176.90 | 70.37% | StatefulPartitionedCall/ResNet34/conv2_block2_1_conv/Conv2D 4220.20 | 7120.70 | 2900.50 | 68.73% | StatefulPartitionedCall/ResNet34/conv2_block3_1_conv/Conv2D 5429.20 | 8279.60 | 2850.40 | 52.50% | StatefulPartitionedCall/ResNet34/conv3_block1_1_conv/Conv2D 4497.80 | 5478.40 | 980.60 | 21.80% | StatefulPartitionedCall/ResNet34/conv4_block1_1_conv/Conv2D 3474.30 | 4351.80 | 877.50 | 25.26% | StatefulPartitionedCall/ResNet34/conv3_block3_1_conv/Conv2D 3627.30 | 4504.00 | 876.70 | 24.17% | StatefulPartitionedCall/ResNet34/conv3_block4_1_conv/Conv2D 5244.20 | 5961.10 | 716.90 | 13.67% | StatefulPartitionedCall/ResNet34/conv1_conv/Conv2D 3439.80 | 4050.90 | 611.10 | 17.77% | StatefulPartitionedCall/ResNet34/conv3_block2_1_conv/Conv2D 9749.80 | 10195.50 | 445.70 | 4.57% | StatefulPartitionedCall/ResNet34/conv2_block2_2_conv/Conv2D 3814.00 | 4209.80 | 395.80 | 10.38% | StatefulPartitionedCall/ResNet34/conv5_block2_2_conv/Conv2D 2715.90 | 3034.70 | 318.80 | 11.74% | StatefulPartitionedCall/ResNet34/conv4_block6_1_conv/Conv2D 4089.10 | 4367.80 | 278.70 | 6.82% | StatefulPartitionedCall/ResNet34/conv5_block1_1_conv/Conv2D 2698.00 | 2959.50 | 261.50 | 9.69% | StatefulPartitionedCall/ResNet34/conv4_block5_1_conv/Conv2D 3869.20 | 4102.80 | 233.60 | 6.04% | StatefulPartitionedCall/ResNet34/conv5_block3_2_conv/Conv2D 2767.90 | 2966.80 | 198.90 | 7.19% | StatefulPartitionedCall/ResNet34/conv4_block4_1_conv/Conv2D 9652.10 | 9816.60 | 164.50 | 1.70% | StatefulPartitionedCall/ResNet34/conv2_block3_2_conv/Conv2D 2897.50 | 3054.60 | 157.10 | 5.42% | StatefulPartitionedCall/ResNet34/conv4_block3_1_conv/Conv2D 4601.20 | 4748.60 | 147.40 | 3.20% | StatefulPartitionedCall/ResNet34/conv5_block1_2_conv/Conv2D 3134.00 | 3246.10 | 112.10 | 3.58% | StatefulPartitionedCall/ResNet34/conv4_block2_1_conv/Conv2D Signed-off-by: Qxiang Xu <Qixiang.Xu@arm.com> Signed-off-by: Jonathan Clohessy <Jonathan.Clohessy@arm.com>
- Added SelectConvRoute function to mlasi_kleidiai.h to decide between GemmFallback and Igemm based on the convolution workload parameters - Updated CheckCapabilitiesSme function in convolve_kleidiai.cpp to use the new SelectConvRoute function Co-authored-by: Damien Dooley <damien.dooley@arm.com> Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Signed-off-by: Martin Klacer <martin.klacer@arm.com>
- Move route selection behind the MLAS platform override hook so generic convolve.cpp doesn't depend directly on KleidiAI headers - Replace local overflow helper with MlasMultiplyOverflowsSizeT and use overflow-checked arithmetic for output/work estimates - Return effective kernel dimensions from route selection so capability checks don't recompute them - Move the LHS packed-size sanity check out of the parallel worker - Replace the per-tile heap allocation with a thread-local grow-only scratch buffer - Document the IGEMM workload heuristic and add a session config override for mlas.kleidiai.conv_igemm_max_work - Add comments for ConvRoute::None, single-output-channel routing and LHS chunk size constant Signed-off-by: Martin Klacer <martin.klacer@arm.com>
* Updated mlasi_kleidiai.h ConvRoute enum names and comments in line with reviewer suggestions * Updated mlasi.h MLAS_CONV_ROUTE name, values and comments in line with reviewer suggestions * Changed MlasConvSupportsSymmetricChannelsLast2DFloatKernel to accurately reflect Arm® KleidiAI™ symmetric padding requirement * Added valid and invalid tests for ConvIgemmMaxWork session option to cpu_execution_provider_test.cc Signed-off-by: Martin Klacer <martin.klacer@arm.com>
* Added check for NHWC layout in SelectConvRoute in onnxruntime/core/mlas/lib/kleidiai/mlasi_kleidiai.h * Fallback routes in SelectConvRoute currently assume NCHW layout, added check to ensure NHWC stays on IGEMM Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Signed-off-by: Martin Klacer <martin.klacer@arm.com>
8571e8f to
75f0249
Compare



Description
This PR fixes a convolution performance regression affecting some OCR models with large-kernel convolutions when the KleidiAI SME IGEMM convolution path is selected.
The change has 2 parts:
ArmKleidiAI::SelectConvRoutethat decides betweenIgemm,GemmFallbackandNonebased on convolution parameters and a workload size-based heuristic.The function
CheckCapabilitiesSmerunsSelectConvRouteand only returns true if the selected route isIgemm. The patch also adds a standard GEMM fallback to theConvRoutepossibilities, and runsMlasGemmif said fallback is selected. If the function selectsNone, then the convolution falls back toMlasSgemmOperation.Motivation and Context
Fixes #27633.