Fix OOB reads in SoftmaxCrossEntropyLoss via label bounds validation#28004
Conversation
Pull request overview
This PR hardens runtime validation to prevent out-of-bounds reads by ensuring indices/sizes are validated before being used for tensor indexing in training loss ops (and also adds additional validation for MatMulBnb4 inputs).
Changes:
- Add label-value bounds validation for `SoftmaxCrossEntropyLoss` and `SoftmaxCrossEntropyLossGrad` (skipping `ignore_index`) and validate `weight_shape[0] == C` on CPU.
- Add regression tests that expect failures for out-of-range labels for `SoftmaxCrossEntropyLoss` (forward + grad).
- Add `K`/`N`/`block_size` positivity checks and minimum-size validation for the MatMulBnb4 `B` and `absmax` inputs (CPU + CUDA), plus new negative tests.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| orttraining/orttraining/training_ops/cpu/loss/softmax_cross_entropy_loss.cc | Adds CPU-side bounds validation for labels and weight shape in SoftmaxCrossEntropyLoss forward/grad paths. |
| orttraining/orttraining/test/training_ops/cuda/cross_entropy_test.cc | Adds new failure-mode tests for out-of-range labels (currently duplicated with new CPU test file). |
| orttraining/orttraining/test/training_ops/cpu/loss/cross_entropy_test.cc | Adds new CPU regression tests for out-of-range labels (currently duplicates CUDA test names). |
| onnxruntime/contrib_ops/cpu/quantization/matmul_bnb4.cc | Adds attribute positivity checks and input-size validation for MatMulBnb4 on CPU. |
| onnxruntime/contrib_ops/cuda/quantization/matmul_bnb4.cc | Adds attribute positivity checks and input-size validation for MatMulBnb4 on CUDA. |
| onnxruntime/test/contrib_ops/matmul_bnb4_test.cc | Adds new negative tests for undersized MatMulBnb4 inputs. |
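The MatMulBnb4 validation listed above amounts to checking that the packed weight tensor `B` and the scale tensor `absmax` are at least as large as `K`, `N`, and `block_size` imply. A minimal sketch of that idea, with a hypothetical helper name and an assumed packing layout (two 4-bit values per byte, one scale per block) that may not match ORT's exact layout:

```cpp
#include <cstdint>

// Illustrative sketch only: checks attribute positivity plus minimum
// sizes for B (packed 4-bit weights) and absmax (per-block scales),
// assuming 2 quantized values per byte and ceil(K*N / block_size) blocks.
bool MatMulBnb4SizesValid(int64_t K, int64_t N, int64_t block_size,
                          int64_t b_size_bytes, int64_t absmax_elems) {
  if (K <= 0 || N <= 0 || block_size <= 0) return false;  // positivity checks
  const int64_t total = K * N;
  const int64_t min_b_bytes = (total + 1) / 2;                       // 2 values per byte
  const int64_t num_blocks = (total + block_size - 1) / block_size;  // ceil-div
  return b_size_bytes >= min_b_bytes && absmax_elems >= num_blocks;
}
```

Rejecting undersized inputs up front means the dequantization loop never reads past the end of either buffer.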
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
Add bounds checking for label tensor values in SoftmaxCrossEntropyLoss and SoftmaxCrossEntropyLossGrad to prevent out-of-bounds memory reads when processing untrusted ONNX models. The operators used `label_data` values as array indices into the `log_prob_data` and `weight_data` buffers without validating that they fall within `[0, C)`, where `C` is the number of classes. A malicious model could embed arbitrary label values, causing heap OOB reads.

Changes:
- Add label value validation in both forward and backward Compute methods
- Add weight tensor size validation (`weight_shape[0] == C`)
- Move the `weight_data[label]` access after the `ignore_index` check in the grad path to prevent OOB when `ignore_index` is outside `[0, C)`
- Add regression tests for both forward and backward paths

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
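The core check can be expressed as a standalone predicate (illustrative names; the real fix uses `ORT_RETURN_IF` inside the kernel rather than a helper like this): every label must lie in `[0, C)` unless it equals `ignore_index`, which is allowed to be any value.

```cpp
#include <cstdint>
#include <vector>

// Sketch of the label validation described above. Returns false if any
// non-ignored label would index out of bounds into a C-element buffer.
bool LabelsInBounds(const std::vector<int64_t>& labels,
                    int64_t C, int64_t ignore_index) {
  for (int64_t label : labels) {
    if (label == ignore_index) continue;        // ignore_index may be outside [0, C)
    if (label < 0 || label >= C) return false;  // would be an OOB read
  }
  return true;
}
```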
Force-pushed from f2c15e3 to f20400c.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
```cpp
// Validate label values are within [0, C) to prevent out-of-bounds reads.
for (int64_t i = 0; i < N_D; i++) {
  if (ignore_index != label_data[i]) {
    ORT_RETURN_IF(label_data[i] < 0 || label_data[i] >= C,
```
Should we move that test in one of the loops below where the label is compared a second time to ignore_index?
> Should we move that test in one of the loops below where the label is compared a second time to ignore_index?
I think the separate validation pass is the better approach here, for a few reasons:
- Fail-fast: validates all labels before any computation begins, so we never produce partial/corrupt output on bad input.
- Parallel safety: the Grad path uses `TryParallelFor`, and propagating errors out of parallel lambdas is tricky. A separate upfront check avoids that complexity.
- Negligible cost: the label scan is O(N_D) vs O(N_D × C) for the main computation, so the extra pass should not be measurable in practice.
That said, I'm open to folding it in if you feel strongly about it. Just wanted to flag the tradeoffs.
Description

Fix out-of-bounds reads in `SoftmaxCrossEntropyLoss` and `SoftmaxCrossEntropyLossGrad` when label values are outside `[0, C)`. Same class of bug fixed in #27568 for `SparseSoftmaxCrossEntropy`. CUDA kernels already have this check.

Changes

- Validate label values are within `[0, C)` before indexing (forward + backward), skipping `ignore_index`
- Validate `weight_shape[0] == C`
- Move the `weight_data[label]` access after the `ignore_index` check in the grad weighted paths

Tests
4 regression tests: label too large, negative label, OOB with weights, grad OOB.