[GH 28654] Address keidiai security issue by yuslepukhin · Pull Request #28678 · microsoft/onnxruntime

yuslepukhin · 2026-05-26T19:37:56Z

This pull request improves the reliability and determinism of convolution kernel packing and expands the convolution unit test coverage to ensure buffer management is robust against allocation and reuse issues.

Reliability and determinism improvements:

In convolve_kleidiai.cpp, the packed buffer for the LHS (left-hand side) image data is now explicitly zero-initialized before use, ensuring that any stale heap data does not affect computations, especially in cases where some packing paths may not overwrite every byte.

Testing and validation enhancements:

In test_conv2d.h, the convolution buffer growth test is expanded to run both increasing and decreasing input channel (CI) orders, validating that the results remain stable regardless of invocation sequence and increasing the chances of catching buffer reuse or reallocation bugs.

hariharans29 · 2026-05-26T19:41:00Z

CC: @JonathanC-ARM - can you please take a look at this PR ?

Copilot

Pull request overview

This PR targets a reported KleidiAI convolution robustness/security concern by ensuring deterministic packed-buffer contents and by strengthening regression coverage around internal buffer growth/reuse behavior.

Changes:

Zero-initialize the KleidiAI Conv2D LHS packed buffer before running the LHS packing kernel.
Expand the existing KleidiAI Conv2D regression sequence to exercise both increasing and decreasing CI invocation orders.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File	Description
`onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp`	Adds explicit zero-initialization of the allocated packed LHS buffer to avoid stale bytes impacting computation/determinism.
`onnxruntime/test/mlas/unittest/test_conv2d.h`	Extends the KleidiAI Conv2D regression loop to cover both CI growth and shrink orders for better buffer reuse/reallocation coverage.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

    auto lhs = std::make_unique<std::byte[]>(lhs_size);
+    // Some ukernel packing paths may not overwrite every byte in the packed buffer for partial tiles.
+    // Start from a deterministic zero state so stale heap contents cannot influence later math.
+    std::fill_n(lhs.get(), lhs_size, std::byte{0});


hariharans29 · 2026-05-26T19:44:05Z

Review — PR #28678 `[GH 28654] Address keidiai security issue`

Verdict: does not fix the reported bug. I'd push back on merging this as-is.

The "fix" is a no-op

auto lhs = std::make_unique<std::byte[]>(lhs_size);
std::fill_n(lhs.get(), lhs_size, std::byte{0});

std::make_unique<T[]>(n) value-initializes the array (new T[n]()), which for std::byte (a trivially-default-constructible type) is zero-initialization per [expr.new]/22 and [dcl.init]. The std::fill_n is redundant — it zeroes a buffer that is already zero. The Copilot bot already flagged this and it's correct.

So the code change has zero observable behavior change in any conforming toolchain. The commit message and PR description claim "stale heap data" was affecting computations, but the allocation path being modified cannot produce stale bytes — it's a fresh make_unique per call, not a reused scratchpad.

This is not where the bug lives

Issue #28654 reports that running A.onnx then B.onnx (or vice versa) yields the wrong result for the second model — classic cross-invocation contamination. That requires state that persists across Compute() calls. The buffer being modified here is a stack-scoped std::unique_ptr allocated and freed inside LhsPackImageDataSme for a single call. It cannot carry state between invocations.

The actual scratchpad-reuse offenders in kleidiai/ are the thread_local buffers — g_kai_tls_qgemm.lhs_packed / lhs_base_table in qgemm_kleidiai.cpp, the g_kai_half_tls introduced by PR #28487, and the lhs_ptrs_cache_by_pad / pad_ptr caching in convolve_kleidiai.cpp itself (just above the modified line). The qgemm BatchSize→BatchN rename in #28487 and the lhs_ptrs_cache_by_pad keying fix in that same PR look much more like the real culprits than this change.

A short experiment that would have caught this: revert the fill_n, re-run the user's repro from #28654. I'd bet money the bug still reproduces.

Test change

The test_conv2d.h addition exercises both growth orders (small→large→small, then large→small→small) and is a legitimate improvement to the regression coverage. Keep it. But note: this test runs in MLAS unit tests, which means it only exercises MlasConv directly. The user's repro in #28654 goes through the full inference session with disable_prepacking, enable_cpu_mem_arena=0, etc. — the MLAS-level test cannot reach the higher-layer scratchpad that's likely contaminated.

Recommendations

Drop the std::fill_n or replace make_unique with make_unique_for_overwrite if the intent is to document non-init followed by explicit zeroing. Otherwise the comment is misleading.
Keep the test change but on its own commit.
Investigate the actual contamination source. Suggested starting points:
- g_kai_tls_qgemm.lhs_packed resize-without-clear semantics across two sessions with different shapes.
- The lhs_ptrs_cache_by_pad[cur_pad_ptr] cache in this same file — cur_pad_ptr is a heap pointer whose address may be reused across calls if the pad buffer is reallocated to the same address.
- Any static/thread_local state in the f16/f32 KleidiAI paths added recently.
Reproduce [Bug] Memory Contamination due to Scratchpad Reuse #28654 against the patched build before claiming the issue is closed. The PR currently links to the issue with "may be closed by this pull request" but there's no evidence the symptom goes away.

Process note

The PR title says "security issue" but #28654 is filed as a correctness bug ("Memory Contamination due to Scratchpad Reuse"). If there's a separate security analysis backing the "security" framing (information leak across sessions, etc.), please link it. Otherwise the title overstates the severity and the change does not match it.

Bottom line: the test improvement is fine, but the code change is a no-op and the linked issue will almost certainly still reproduce after this lands. Please do not close #28654 on the basis of this PR.

yuslepukhin · 2026-05-26T20:41:21Z

I would defer this to Kleidai

JonathanC-ARM · 2026-05-26T21:50:23Z

Hi @yuslepukhin, I had a look at the issue related to this pr. So I found that the issue is no longer reproducible in the latest version of convolve_kleidiai.cpp. My debugging pointed me towards an issue where we were invalidly reusing hashed cached values for rhs segments. The code that is a culprit for this has already be removed via #26834. So at least in the case of the latest code its not an issue any longer. For 1.26 however the key would be to disable he key would be to disable/remove that RHS cache path. A conservative release-branch fix would be to always repack RHS in RhsPackWeightsBiasSme instead of using the thread_local cache keyed by the partial weight hash.

yuslepukhin · 2026-05-26T22:23:24Z

Hi @yuslepukhin, I had a look at the issue related to this pr. So I found that the issue is no longer

The author of the issue claims it is a repro in 1.26 which is more recent than the PR

hariharans29 · 2026-05-26T22:37:41Z

Hi @yuslepukhin, I had a look at the issue related to this pr. So I found that the issue is no longer

The author of the issue claims it is a repro in 1.26 which is more recent than the PR

No it's not. the PR was merged after 1.26.0 - maybe 1.26.1 was produced after this PR made into main (not too sure about that), but this PR was never a cherry-pick candidate.

Also @JonathanC-ARM was able to repro OP's issue with 1.26.0 and the issue went away with the main branch

JonathanC-ARM · 2026-05-26T22:37:52Z

Hi @yuslepukhin, I had a look at the issue related to this pr. So I found that the issue is no longer

The author of the issue claims it is a repro in 1.26 which is more recent than the PR

Hi @yuslepukhin The particular pr was quite long running, it was opened last year but only merged 2 weeks ago. At first glance the date would lead you to believe it's very old.

I checked the latest code from 1.26 https://github.com/microsoft/onnxruntime/blob/v1.26.0/onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp which definitely doesn't contain the change from the referenced pr.

Address keidiai security issue

00f29a1

yuslepukhin requested a review from Copilot May 26, 2026 19:37

yuslepukhin linked an issue May 26, 2026 that may be closed by this pull request

[Bug] Memory Contamination due to Scratchpad Reuse #28654

Open

Copilot started reviewing on behalf of yuslepukhin May 26, 2026 19:38 View session

Copilot AI reviewed May 26, 2026

View reviewed changes

yuslepukhin mentioned this pull request May 26, 2026

[Bug] Memory Contamination due to Scratchpad Reuse #28654

Open

yuslepukhin closed this May 26, 2026

yuslepukhin deleted the yuslepukhin/scratch_pad_reuse branch May 26, 2026 22:20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[GH 28654] Address keidiai security issue#28678

[GH 28654] Address keidiai security issue#28678
yuslepukhin wants to merge 1 commit into
mainfrom
yuslepukhin/scratch_pad_reuse

yuslepukhin commented May 26, 2026

Uh oh!

hariharans29 commented May 26, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

hariharans29 commented May 26, 2026

Uh oh!

yuslepukhin commented May 26, 2026

Uh oh!

JonathanC-ARM commented May 26, 2026

Uh oh!

yuslepukhin commented May 26, 2026

Uh oh!

hariharans29 commented May 26, 2026 •

edited

Loading

Uh oh!

JonathanC-ARM commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

yuslepukhin commented May 26, 2026

Uh oh!

hariharans29 commented May 26, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

hariharans29 commented May 26, 2026

Review — PR #28678 [GH 28654] Address keidiai security issue

The "fix" is a no-op

This is not where the bug lives

Test change

Recommendations

Process note

Uh oh!

yuslepukhin commented May 26, 2026

Uh oh!

JonathanC-ARM commented May 26, 2026

Uh oh!

yuslepukhin commented May 26, 2026

Uh oh!

hariharans29 commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JonathanC-ARM commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Review — PR #28678 `[GH 28654] Address keidiai security issue`

hariharans29 commented May 26, 2026 •

edited

Loading