Skip to content

Address CUDA ReduceL2 negative results and add test coverage#28681

Open
yuslepukhin wants to merge 3 commits into
mainfrom
yuslepukhin/reducel2_negative
Open

Address CUDA ReduceL2 negative results and add test coverage#28681
yuslepukhin wants to merge 3 commits into
mainfrom
yuslepukhin/reducel2_negative

Conversation

@yuslepukhin
Copy link
Copy Markdown
Member

@yuslepukhin yuslepukhin commented May 26, 2026

This pull request improves the correctness and robustness of L1 and L2 reduction operations (norms) for both CPU and CUDA backends, especially for integer types. It addresses issues with undefined behavior and overflow when computing absolute values and sums/squares, particularly for edge cases like the minimum representable integer value and large accumulations. The changes also introduce comprehensive tests to ensure correct handling of these cases.

The most important changes are:

CPU backend: safer and more robust norm reductions

  • Added a saturating_abs helper for integer types to safely compute absolute values, returning the maximum representable value instead of overflowing for values like INT_MIN. (onnxruntime/core/providers/cpu/reduction/reduction_ops.h, onnxruntime/core/providers/cpu/reduction/reduction_ops.hR722-R782)
  • Modified ReduceAggregatorL1 and ReduceAggregatorL2 to accumulate in double for integer types, avoiding overflow during sum or sum-of-squares, and saturate the result to the type's maximum value if necessary. (onnxruntime/core/providers/cpu/reduction/reduction_ops.h, [1] [2]

CUDA backend: saturating absolute value and correct no-op reduction handling

  • Introduced Impl_SaturatingAbs for CUDA, which safely computes absolute values for norm reductions, including handling of the minimum signed integer. (onnxruntime/core/providers/cuda/reduction/reduction_functions.cu, [1]; onnxruntime/core/providers/cuda/reduction/reduction_functions.h, [2]
  • Updated reduction kernels to use Impl_SaturatingAbs for norm operations (L1/L2) when the reduction is a no-op (input and output shapes are the same), ensuring the result is always non-negative and avoiding undefined behavior. (onnxruntime/core/providers/cuda/reduction/reduction_ops.cc, [1] [2] [3]

Testing: improved coverage for edge cases

  • Added new tests for L1 reductions with negative values, singleton axes, and integer overflow scenarios, including cases for INT_MIN and summation overflow, to verify correct saturation and behavior across data types and backends. (onnxruntime/test/providers/cpu/reduction/reduction_ops_test.cc, onnxruntime/test/providers/cpu/reduction/reduction_ops_test.ccR350-R449)

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes CUDA ReduceL1/ReduceL2 correctness for “no-op” reductions where the input and output element counts are the same (e.g., reducing a singleton axis, or scalar-like cases), ensuring norm semantics by producing non-negative results. It also adds regression tests targeting negative inputs for singleton-axis reductions across several data types.

Changes:

  • CUDA: for ReduceL1/ReduceL2 no-op reductions (input_count == output_count), compute elementwise absolute value instead of memcpy/copy-through behavior.
  • CUDA: apply the same no-op norm handling across multiple reduction code paths (shared helper, core compute, and specialized int path).
  • Tests: add new ReduceL1/ReduceL2 cases for negative inputs with singleton axes (keepdims and non-keepdims; multiple dtypes).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
onnxruntime/core/providers/cuda/reduction/reduction_ops.cc Applies Abs for CUDA no-op ReduceL1/ReduceL2 to enforce non-negative norm results instead of copying negative values.
onnxruntime/test/providers/cpu/reduction/reduction_ops_test.cc Adds regression tests for singleton-axis reductions with negative inputs across multiple types (including FP16 when CUDA is enabled).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread onnxruntime/core/providers/cuda/reduction/reduction_ops.cc Outdated
Comment thread onnxruntime/core/providers/cuda/reduction/reduction_ops.cc Outdated
Comment thread onnxruntime/test/providers/cpu/reduction/reduction_ops_test.cc
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Comment on lines +745 to +746
// For int64 inputs, values > 2^53 lose precision, but the final truncation to int64
// makes the error at most ±1.
test.AddInput<int32_t>("data", {1}, {std::numeric_limits<int32_t>::min()});
// abs(INT_MIN) overflows; we expect saturation to INT_MAX.
test.AddOutput<int32_t>("reduced", {}, {std::numeric_limits<int32_t>::max()});
test.Run(OpTester::ExpectResult::kExpectSuccess, "", {kTensorrtExecutionProvider});
test.AddOutput<int32_t>("reduced", {}, {std::numeric_limits<int32_t>::max()});
// Only CPU handles saturation; CUDA casts to float (loses precision but doesn't overflow).
test.Run(OpTester::ExpectResult::kExpectSuccess, "",
{kTensorrtExecutionProvider, kCudaExecutionProvider, kRocmExecutionProvider});
Copy link
Copy Markdown
Contributor

@tianleiwu tianleiwu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

The approach is sound — using double accumulators for CPU integer reductions and saturating abs for CUDA no-op paths correctly eliminates UB and overflow. Tests are comprehensive.

One blocking issue: missing CUDA template instantiations will cause linker failures (see inline comment).

template void Impl_SaturatingAbs<double>(cudaStream_t, const double*, double*, size_t);
template void Impl_SaturatingAbs<half>(cudaStream_t, const half*, half*, size_t);
template void Impl_SaturatingAbs<BFloat16>(cudaStream_t, const BFloat16*, BFloat16*, size_t);
template void Impl_SaturatingAbs<int32_t>(cudaStream_t, const int32_t*, int32_t*, size_t);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Missing explicit instantiations for int64_t and int8_t — will cause a linker error.

The SPECIALIZED_REDUCEKERNEL_COMPUTEIMPL macro (expanded for int32_t, int64_t, int8_t at lines 834-836 of reduction_ops.cc) now calls Impl_SaturatingAbs<CudaT> guarded by a runtime cudnn_reduce_op check. Since the check isn't compile-time, the compiler emits the call for all three expansions, and the linker needs the symbols.

Even though ReduceL1/L2 aren't registered for int64_t/int8_t on CUDA (dead code at runtime), the linker still requires the instantiations.

template void Impl_SaturatingAbs<int64_t>(cudaStream_t, const int64_t*, int64_t*, size_t);
template void Impl_SaturatingAbs<int8_t>(cudaStream_t, const int8_t*, int8_t*, size_t);

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants