[GH 28654] Address keidiai security issue#28678
Conversation
|
CC: @JonathanC-ARM - can you please take a look at this PR ? |
There was a problem hiding this comment.
Pull request overview
This PR targets a reported KleidiAI convolution robustness/security concern by ensuring deterministic packed-buffer contents and by strengthening regression coverage around internal buffer growth/reuse behavior.
Changes:
- Zero-initialize the KleidiAI Conv2D LHS packed buffer before running the LHS packing kernel.
- Expand the existing KleidiAI Conv2D regression sequence to exercise both increasing and decreasing CI invocation orders.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp |
Adds explicit zero-initialization of the allocated packed LHS buffer to avoid stale bytes impacting computation/determinism. |
onnxruntime/test/mlas/unittest/test_conv2d.h |
Extends the KleidiAI Conv2D regression loop to cover both CI growth and shrink orders for better buffer reuse/reallocation coverage. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| auto lhs = std::make_unique<std::byte[]>(lhs_size); | ||
| // Some ukernel packing paths may not overwrite every byte in the packed buffer for partial tiles. | ||
| // Start from a deterministic zero state so stale heap contents cannot influence later math. | ||
| std::fill_n(lhs.get(), lhs_size, std::byte{0}); |
Review — PR #28678
|
|
I would defer this to Kleidai |
|
Hi @yuslepukhin, I had a look at the issue related to this pr. So I found that the issue is no longer reproducible in the latest version of convolve_kleidiai.cpp. My debugging pointed me towards an issue where we were invalidly reusing hashed cached values for rhs segments. The code that is a culprit for this has already be removed via #26834. So at least in the case of the latest code its not an issue any longer. For 1.26 however the key would be to disable he key would be to disable/remove that RHS cache path. A conservative release-branch fix would be to always repack RHS in RhsPackWeightsBiasSme instead of using the thread_local cache keyed by the partial weight hash. |
The author of the issue claims it is a repro in 1.26 which is more recent than the PR |
No it's not. the PR was merged after 1.26.0 - maybe 1.26.1 was produced after this PR made into main (not too sure about that), but this PR was never a cherry-pick candidate. Also @JonathanC-ARM was able to repro OP's issue with 1.26.0 and the issue went away with the main branch |
Hi @yuslepukhin The particular pr was quite long running, it was opened last year but only merged 2 weeks ago. At first glance the date would lead you to believe it's very old. I checked the latest code from 1.26 https://github.com/microsoft/onnxruntime/blob/v1.26.0/onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp which definitely doesn't contain the change from the referenced pr. |
This pull request improves the reliability and determinism of convolution kernel packing and expands the convolution unit test coverage to ensure buffer management is robust against allocation and reuse issues.
Reliability and determinism improvements:
convolve_kleidiai.cpp, the packed buffer for the LHS (left-hand side) image data is now explicitly zero-initialized before use, ensuring that any stale heap data does not affect computations, especially in cases where some packing paths may not overwrite every byte.Testing and validation enhancements:
test_conv2d.h, the convolution buffer growth test is expanded to run both increasing and decreasing input channel (CI) orders, validating that the results remain stable regardless of invocation sequence and increasing the chances of catching buffer reuse or reallocation bugs.