Address CUDA ReduceL2 negative results and add test coverage#28681
Address CUDA ReduceL2 negative results and add test coverage#28681yuslepukhin wants to merge 3 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR fixes CUDA ReduceL1/ReduceL2 correctness for “no-op” reductions where the input and output element counts are the same (e.g., reducing a singleton axis, or scalar-like cases), ensuring norm semantics by producing non-negative results. It also adds regression tests targeting negative inputs for singleton-axis reductions across several data types.
Changes:
- CUDA: for
ReduceL1/ReduceL2no-op reductions (input_count == output_count), compute elementwise absolute value instead of memcpy/copy-through behavior. - CUDA: apply the same no-op norm handling across multiple reduction code paths (shared helper, core compute, and specialized int path).
- Tests: add new
ReduceL1/ReduceL2cases for negative inputs with singleton axes (keepdims and non-keepdims; multiple dtypes).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
onnxruntime/core/providers/cuda/reduction/reduction_ops.cc |
Applies Abs for CUDA no-op ReduceL1/ReduceL2 to enforce non-negative norm results instead of copying negative values. |
onnxruntime/test/providers/cpu/reduction/reduction_ops_test.cc |
Adds regression tests for singleton-axis reductions with negative inputs across multiple types (including FP16 when CUDA is enabled). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // For int64 inputs, values > 2^53 lose precision, but the final truncation to int64 | ||
| // makes the error at most ±1. |
| test.AddInput<int32_t>("data", {1}, {std::numeric_limits<int32_t>::min()}); | ||
| // abs(INT_MIN) overflows; we expect saturation to INT_MAX. | ||
| test.AddOutput<int32_t>("reduced", {}, {std::numeric_limits<int32_t>::max()}); | ||
| test.Run(OpTester::ExpectResult::kExpectSuccess, "", {kTensorrtExecutionProvider}); |
| test.AddOutput<int32_t>("reduced", {}, {std::numeric_limits<int32_t>::max()}); | ||
| // Only CPU handles saturation; CUDA casts to float (loses precision but doesn't overflow). | ||
| test.Run(OpTester::ExpectResult::kExpectSuccess, "", | ||
| {kTensorrtExecutionProvider, kCudaExecutionProvider, kRocmExecutionProvider}); |
tianleiwu
left a comment
There was a problem hiding this comment.
Review Summary
The approach is sound — using double accumulators for CPU integer reductions and saturating abs for CUDA no-op paths correctly eliminates UB and overflow. Tests are comprehensive.
One blocking issue: missing CUDA template instantiations will cause linker failures (see inline comment).
| template void Impl_SaturatingAbs<double>(cudaStream_t, const double*, double*, size_t); | ||
| template void Impl_SaturatingAbs<half>(cudaStream_t, const half*, half*, size_t); | ||
| template void Impl_SaturatingAbs<BFloat16>(cudaStream_t, const BFloat16*, BFloat16*, size_t); | ||
| template void Impl_SaturatingAbs<int32_t>(cudaStream_t, const int32_t*, int32_t*, size_t); |
There was a problem hiding this comment.
int64_t and int8_t — will cause a linker error.
The SPECIALIZED_REDUCEKERNEL_COMPUTEIMPL macro (expanded for int32_t, int64_t, int8_t at lines 834-836 of reduction_ops.cc) now calls Impl_SaturatingAbs<CudaT> guarded by a runtime cudnn_reduce_op check. Since the check isn't compile-time, the compiler emits the call for all three expansions, and the linker needs the symbols.
Even though ReduceL1/L2 aren't registered for int64_t/int8_t on CUDA (dead code at runtime), the linker still requires the instantiations.
template void Impl_SaturatingAbs<int64_t>(cudaStream_t, const int64_t*, int64_t*, size_t);
template void Impl_SaturatingAbs<int8_t>(cudaStream_t, const int8_t*, int8_t*, size_t);
This pull request improves the correctness and robustness of L1 and L2 reduction operations (norms) for both CPU and CUDA backends, especially for integer types. It addresses issues with undefined behavior and overflow when computing absolute values and sums/squares, particularly for edge cases like the minimum representable integer value and large accumulations. The changes also introduce comprehensive tests to ensure correct handling of these cases.
The most important changes are:
CPU backend: safer and more robust norm reductions
saturating_abshelper for integer types to safely compute absolute values, returning the maximum representable value instead of overflowing for values likeINT_MIN. (onnxruntime/core/providers/cpu/reduction/reduction_ops.h, onnxruntime/core/providers/cpu/reduction/reduction_ops.hR722-R782)ReduceAggregatorL1andReduceAggregatorL2to accumulate indoublefor integer types, avoiding overflow during sum or sum-of-squares, and saturate the result to the type's maximum value if necessary. (onnxruntime/core/providers/cpu/reduction/reduction_ops.h, [1] [2]CUDA backend: saturating absolute value and correct no-op reduction handling
Impl_SaturatingAbsfor CUDA, which safely computes absolute values for norm reductions, including handling of the minimum signed integer. (onnxruntime/core/providers/cuda/reduction/reduction_functions.cu, [1];onnxruntime/core/providers/cuda/reduction/reduction_functions.h, [2]Impl_SaturatingAbsfor norm operations (L1/L2) when the reduction is a no-op (input and output shapes are the same), ensuring the result is always non-negative and avoiding undefined behavior. (onnxruntime/core/providers/cuda/reduction/reduction_ops.cc, [1] [2] [3]Testing: improved coverage for edge cases
INT_MINand summation overflow, to verify correct saturation and behavior across data types and backends. (onnxruntime/test/providers/cpu/reduction/reduction_ops_test.cc, onnxruntime/test/providers/cpu/reduction/reduction_ops_test.ccR350-R449)