Skip to content

[GH 28654] Address keidiai security issue#28678

Closed
yuslepukhin wants to merge 1 commit into
mainfrom
yuslepukhin/scratch_pad_reuse
Closed

[GH 28654] Address keidiai security issue#28678
yuslepukhin wants to merge 1 commit into
mainfrom
yuslepukhin/scratch_pad_reuse

Conversation

@yuslepukhin
Copy link
Copy Markdown
Member

This pull request improves the reliability and determinism of convolution kernel packing and expands the convolution unit test coverage to ensure buffer management is robust against allocation and reuse issues.

Reliability and determinism improvements:

  • In convolve_kleidiai.cpp, the packed buffer for the LHS (left-hand side) image data is now explicitly zero-initialized before use, ensuring that any stale heap data does not affect computations, especially in cases where some packing paths may not overwrite every byte.

Testing and validation enhancements:

  • In test_conv2d.h, the convolution buffer growth test is expanded to run both increasing and decreasing input channel (CI) orders, validating that the results remain stable regardless of invocation sequence and increasing the chances of catching buffer reuse or reallocation bugs.

@hariharans29
Copy link
Copy Markdown
Member

CC: @JonathanC-ARM - can you please take a look at this PR ?

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR targets a reported KleidiAI convolution robustness/security concern by ensuring deterministic packed-buffer contents and by strengthening regression coverage around internal buffer growth/reuse behavior.

Changes:

  • Zero-initialize the KleidiAI Conv2D LHS packed buffer before running the LHS packing kernel.
  • Expand the existing KleidiAI Conv2D regression sequence to exercise both increasing and decreasing CI invocation orders.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp Adds explicit zero-initialization of the allocated packed LHS buffer to avoid stale bytes impacting computation/determinism.
onnxruntime/test/mlas/unittest/test_conv2d.h Extends the KleidiAI Conv2D regression loop to cover both CI growth and shrink orders for better buffer reuse/reallocation coverage.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 449 to +452
auto lhs = std::make_unique<std::byte[]>(lhs_size);
// Some ukernel packing paths may not overwrite every byte in the packed buffer for partial tiles.
// Start from a deterministic zero state so stale heap contents cannot influence later math.
std::fill_n(lhs.get(), lhs_size, std::byte{0});
@hariharans29
Copy link
Copy Markdown
Member

Review — PR #28678 [GH 28654] Address keidiai security issue

Verdict: does not fix the reported bug. I'd push back on merging this as-is.

The "fix" is a no-op

auto lhs = std::make_unique<std::byte[]>(lhs_size);
std::fill_n(lhs.get(), lhs_size, std::byte{0});

std::make_unique<T[]>(n) value-initializes the array (new T[n]()), which for std::byte (a trivially-default-constructible type) is zero-initialization per [expr.new]/22 and [dcl.init]. The std::fill_n is redundant — it zeroes a buffer that is already zero. The Copilot bot already flagged this and it's correct.

So the code change has zero observable behavior change in any conforming toolchain. The commit message and PR description claim "stale heap data" was affecting computations, but the allocation path being modified cannot produce stale bytes — it's a fresh make_unique per call, not a reused scratchpad.

This is not where the bug lives

Issue #28654 reports that running A.onnx then B.onnx (or vice versa) yields the wrong result for the second model — classic cross-invocation contamination. That requires state that persists across Compute() calls. The buffer being modified here is a stack-scoped std::unique_ptr allocated and freed inside LhsPackImageDataSme for a single call. It cannot carry state between invocations.

The actual scratchpad-reuse offenders in kleidiai/ are the thread_local buffers — g_kai_tls_qgemm.lhs_packed / lhs_base_table in qgemm_kleidiai.cpp, the g_kai_half_tls introduced by PR #28487, and the lhs_ptrs_cache_by_pad / pad_ptr caching in convolve_kleidiai.cpp itself (just above the modified line). The qgemm BatchSizeBatchN rename in #28487 and the lhs_ptrs_cache_by_pad keying fix in that same PR look much more like the real culprits than this change.

A short experiment that would have caught this: revert the fill_n, re-run the user's repro from #28654. I'd bet money the bug still reproduces.

Test change

The test_conv2d.h addition exercises both growth orders (small→large→small, then large→small→small) and is a legitimate improvement to the regression coverage. Keep it. But note: this test runs in MLAS unit tests, which means it only exercises MlasConv directly. The user's repro in #28654 goes through the full inference session with disable_prepacking, enable_cpu_mem_arena=0, etc. — the MLAS-level test cannot reach the higher-layer scratchpad that's likely contaminated.

Recommendations

  1. Drop the std::fill_n or replace make_unique with make_unique_for_overwrite if the intent is to document non-init followed by explicit zeroing. Otherwise the comment is misleading.
  2. Keep the test change but on its own commit.
  3. Investigate the actual contamination source. Suggested starting points:
    • g_kai_tls_qgemm.lhs_packed resize-without-clear semantics across two sessions with different shapes.
    • The lhs_ptrs_cache_by_pad[cur_pad_ptr] cache in this same file — cur_pad_ptr is a heap pointer whose address may be reused across calls if the pad buffer is reallocated to the same address.
    • Any static/thread_local state in the f16/f32 KleidiAI paths added recently.
  4. Reproduce [Bug] Memory Contamination due to Scratchpad Reuse #28654 against the patched build before claiming the issue is closed. The PR currently links to the issue with "may be closed by this pull request" but there's no evidence the symptom goes away.

Process note

The PR title says "security issue" but #28654 is filed as a correctness bug ("Memory Contamination due to Scratchpad Reuse"). If there's a separate security analysis backing the "security" framing (information leak across sessions, etc.), please link it. Otherwise the title overstates the severity and the change does not match it.

Bottom line: the test improvement is fine, but the code change is a no-op and the linked issue will almost certainly still reproduce after this lands. Please do not close #28654 on the basis of this PR.

@yuslepukhin
Copy link
Copy Markdown
Member Author

I would defer this to Kleidai

@JonathanC-ARM
Copy link
Copy Markdown
Contributor

Hi @yuslepukhin, I had a look at the issue related to this pr. So I found that the issue is no longer reproducible in the latest version of convolve_kleidiai.cpp. My debugging pointed me towards an issue where we were invalidly reusing hashed cached values for rhs segments. The code that is a culprit for this has already be removed via #26834. So at least in the case of the latest code its not an issue any longer. For 1.26 however the key would be to disable he key would be to disable/remove that RHS cache path. A conservative release-branch fix would be to always repack RHS in RhsPackWeightsBiasSme instead of using the thread_local cache keyed by the partial weight hash.

@yuslepukhin yuslepukhin deleted the yuslepukhin/scratch_pad_reuse branch May 26, 2026 22:20
@yuslepukhin
Copy link
Copy Markdown
Member Author

Hi @yuslepukhin, I had a look at the issue related to this pr. So I found that the issue is no longer

The author of the issue claims it is a repro in 1.26 which is more recent than the PR

@hariharans29
Copy link
Copy Markdown
Member

hariharans29 commented May 26, 2026

Hi @yuslepukhin, I had a look at the issue related to this pr. So I found that the issue is no longer

The author of the issue claims it is a repro in 1.26 which is more recent than the PR

No it's not. the PR was merged after 1.26.0 - maybe 1.26.1 was produced after this PR made into main (not too sure about that), but this PR was never a cherry-pick candidate.

Also @JonathanC-ARM was able to repro OP's issue with 1.26.0 and the issue went away with the main branch

@JonathanC-ARM
Copy link
Copy Markdown
Contributor

Hi @yuslepukhin, I had a look at the issue related to this pr. So I found that the issue is no longer

The author of the issue claims it is a repro in 1.26 which is more recent than the PR

Hi @yuslepukhin The particular pr was quite long running, it was opened last year but only merged 2 weeks ago. At first glance the date would lead you to believe it's very old.

I checked the latest code from 1.26 https://github.com/microsoft/onnxruntime/blob/v1.26.0/onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp which definitely doesn't contain the change from the referenced pr.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Memory Contamination due to Scratchpad Reuse

4 participants